inclusionAI/dInfer
Description: dInfer: An Efficient Inference Framework for Diffusion Language Models
Language: Python
License: Apache-2.0
Stars: 470
Forks: 45
Open issues: 18
Created: 2025-09-29T08:07:23Z
Pushed: 2026-02-11T03:10:42Z
Default branch: master
Fork: no
Archived: no
README:
Introduction
dInfer is an efficient and extensible inference framework for diffusion language models (dLLMs). As illustrated in the architecture figure below, it modularizes inference into four components: *model*, *diffusion iteration manager*, *decoder* and *KV-cache manager*. Each component exposes APIs that allow different algorithms to be combined flexibly. dInfer also supports batched inference for improved throughput.
Figure: Overall Architecture of dInfer
dInfer supports multiple dLLM variants, including LLaDA, LLaDA-MoE and LLaDA2.
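To make the four-component split more concrete, here is a minimal toy sketch in plain Python. Every name in it (`ToyModel`, `ThresholdDecoder`, `BlockIterationManager`, `PrefixCacheManager`, `generate`, `MASK`) is hypothetical and is not dInfer's actual API. It only illustrates, under simplified assumptions, how a model, an iteration manager, a decoder and a cache manager could be combined while staying independently swappable.

```python
MASK = -1  # hypothetical id for a position that is still masked


class ToyModel:
    """Stand-in for a dLLM forward pass (not a real model)."""

    def predict(self, tokens, cached_len):
        # Return {position: (token_id, confidence)} for masked positions
        # that lie beyond the cached prefix.
        preds = {}
        for pos in range(cached_len, len(tokens)):
            if tokens[pos] == MASK:
                preds[pos] = (100 + pos, 1.0 / (1 + pos % 3))
        return preds


class ThresholdDecoder:
    """Accept every prediction at or above the threshold, or the best one if none qualify."""

    def __init__(self, threshold):
        self.threshold = threshold

    def select(self, preds):
        accepted = {p: tok for p, (tok, conf) in preds.items() if conf >= self.threshold}
        if not accepted:
            best = max(preds, key=lambda p: preds[p][1])
            accepted = {best: preds[best][0]}
        return accepted


class BlockIterationManager:
    """Split the generation region into consecutive blocks."""

    def __init__(self, block_length):
        self.block_length = block_length

    def blocks(self, prompt_len, gen_len):
        end = prompt_len + gen_len
        for start in range(prompt_len, end, self.block_length):
            yield start, min(start + self.block_length, end)


class PrefixCacheManager:
    """Tracks only how many leading tokens are final; a real cache would hold K/V tensors."""

    def __init__(self):
        self.cached_len = 0

    def extend_to(self, length):
        self.cached_len = length


def generate(prompt, gen_len, model, iteration_manager, decoder, cache):
    tokens = list(prompt) + [MASK] * gen_len
    cache.extend_to(len(prompt))
    steps = 0
    for start, end in iteration_manager.blocks(len(prompt), gen_len):
        # Keep running forward passes until the current block has no masks left.
        while MASK in tokens[start:end]:
            preds = model.predict(tokens[:end], cache.cached_len)
            for pos, tok in decoder.select(preds).items():
                tokens[pos] = tok
            steps += 1
        cache.extend_to(end)  # the finished block joins the cached prefix
    return tokens, steps
```

In this toy setup, calling `generate([1, 2, 3], 6, ToyModel(), BlockIterationManager(3), ThresholdDecoder(0.9), PrefixCacheManager())` takes 6 forward passes, while lowering the threshold to 0.5 takes 4, because more positions are accepted per pass.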
News
\[2025/12/21\] Release v0.2. The major features of this release can be found here.
\[2025/12/10\] Support and speed up the formal version of block diffusion LLMs (LLaDA2-mini and LLaDA2-flash). Support quant versions of LLaDA2-mini and LLaDA2-flash.
\[2025/11/15\] Support the inference on block diffusion LLMs (LLaDA2-mini-preview and LLaDA2-flash-preview).
\[2025/10/10\] Release the first version of the dInfer framework.
Contents
- [Supported Models](#supported-models)
- [Quick Start](#quick-start)
- [Benchmark Results](#benchmark-results)
Supported Models
dInfer supports multiple diffusion language model variants with different architectures and sizes. Below are the HuggingFace model links and their corresponding implementation files:
| Model | Size | Implementation | HuggingFace Link |
|-------|------|----------------|------------------|
| LLaDA2.0-mini | 16B | [LLaDA2MoeModelLM](python/dinfer/model/modeling_llada2_moe.py) | inclusionAI/LLaDA2.0-mini |
| LLaDA2.0-flash | 100B | [LLaDA2MoeModelLM](python/dinfer/model/modeling_llada2_moe.py) | inclusionAI/LLaDA2.0-flash |
| LLaDA2.0-mini-preview | 16B | [LLaDA2MoeModelLM](python/dinfer/model/modeling_llada2_moe.py) | inclusionAI/LLaDA2.0-mini-preview |
| LLaDA2.0-flash-preview | 100B | [LLaDA2MoeModelLM](python/dinfer/model/modeling_llada2_moe.py) | inclusionAI/LLaDA2.0-flash-preview |
| LLaDA-MoE-7B-A1B-Base | 7B | [LLaDAMoeModelLM](python/dinfer/model/modeling_fused_olmoe.py) | inclusionAI/LLaDA-MoE-7B-A1B-Base |
| LLaDA-MoE-7B-A1B-Instruct | 7B | [LLaDAMoeModelLM](python/dinfer/model/modeling_fused_olmoe.py) | inclusionAI/LLaDA-MoE-7B-A1B-Instruct |
| LLaDA-8B-Base | 8B | [LLaDAModelLM](python/dinfer/model/modeling_llada.py) | GSAI-ML/LLaDA-8B-Base |
| LLaDA-8B-Instruct | 8B | [LLaDAModelLM](python/dinfer/model/modeling_llada.py) | GSAI-ML/LLaDA-8B-Instruct |
| LLaDA-1.5 | 8B | [LLaDAModelLM](python/dinfer/model/modeling_llada.py) | GSAI-ML/LLaDA-1.5 |
Quick Start
Install dInfer
    git clone https://github.com/inclusionAI/dInfer.git
    cd dInfer
    pip install .
To use the vLLM backend (which works with LLaDA and LLaDA-MoE), install vLLM:

    pip install vllm==0.10.2

To use the SGLang backend (which works with LLaDA2), install SGLang:

    pip install sglang==0.5.3.post1
Convert to FusedMoE (LLaDA-MoE only)
To run a LLaDA-MoE model downloaded from HuggingFace, first convert it to a format supported by dInfer. dInfer provides the script tools/transfer.py for this conversion.
1) Download and Convert
    pip install -U huggingface_hub hf_transfer
    export HF_HUB_ENABLE_HF_TRANSFER=1
    # Download Instruct checkpoint
    hf download inclusionAI/LLaDA-MoE-7B-A1B-Instruct \
      --repo-type model \
      --local-dir /path/to/LLaDA-MoE-7B-A1B-Instruct
    # Convert to FusedMoE
    python -m tools.transfer \
      --input /path/to/LLaDA-MoE-7B-A1B-Instruct \
      --output /path/to/LLaDA-MoE-7B-A1B-Instruct-fused
2) Load the model
    from dinfer.model import AutoModelForCausalLM
    from transformers import AutoTokenizer
    m = "/path/to/LLaDA-MoE-7B-A1B-Instruct-fused"
    tok = AutoTokenizer.from_pretrained(m, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(m, trust_remote_code=True, torch_dtype="bfloat16")
Run Inference
Benchmark (speed only)
Measure throughput (TPS) only; predictions are saved under --output_dir with no automatic scoring.
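The excerpt does not show how the benchmark scripts compute TPS. As a rough illustration, assuming TPS means generated tokens divided by wall-clock seconds, a measurement could be sketched as follows. The helper `measure_tps` and its parameters are hypothetical and are not part of dInfer.

```python
import time


def measure_tps(generate_fn, prompt_len, clock=time.perf_counter):
    """Return generated tokens per second for one call to generate_fn.

    generate_fn is any zero-argument callable returning the full token
    sequence (prompt plus generated tokens). The clock is injectable so
    the function can be checked without real timing.
    """
    start = clock()
    output_tokens = generate_fn()
    elapsed = clock() - start
    generated = len(output_tokens) - prompt_len  # exclude prompt tokens
    if elapsed <= 0:
        return float("inf")
    return generated / elapsed
```

Excluding the prompt length keeps the figure a measure of decoding speed rather than of input size. On a GPU, the callable would also need to synchronize the device before returning for the timing to be meaningful.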
- LLaDA2 models
  - LLaDA2-flash Dataset profiling (threshold decoder, TP across 4 GPUs):

        python benchmarks/benchmark_dataset_sglang.py \
          --model_name inclusionAI/LLaDA2.0-flash \
          --dataset dataset_path \
          --gen_len 2048 \
          --block_length 32 \
          --gpu 0,1,2,3 \
          --output_dir runs/llada2_flash \
          --use_tp \
          --parallel_decoding threshold \
          --threshold 0.9 \
          --cache prefix \
          --use_bd

  - LLaDA2-mini Dataset profiling (threshold decoder, TP across 4 GPUs):

        python benchmarks/benchmark_dataset_sglang.py \
          --model_name inclusionAI/LLaDA2.0-mini \
          --dataset dataset_path \
          --gen_len 2048 \
          --block_length 32 \
          --gpu 0,1,2,3 \
          --output_dir runs/llada2_mini \
          --use_tp \
          --parallel_decoding threshold \
          --threshold 0.9 \
          --cache prefix \
          --use_bd
- LLaDA, LLaDA1.5 and LLaDA-MoE models
  - LLaDA-MoE Dataset profiling (threshold decoder, TP across 4 GPUs):

        python benchmarks/benchmark_dataset.py \
          --model_name inclusionAI/LLaDA-MoE-7B-A1B-Instruct \
          --model_type llada_moe \
          --dataset dataset_path \
          --gen_len 1024 \
          --block_length 64 \
          --gpu 0,1,2,3 \
          --output_dir runs/llada_moe_threshold \
          --use_tp \
          --parallel_decoding threshold \
          --threshold 0.8 \
          --cache dual \
          --prefix_look 16 \
          --after_look 16 \
          --warmup_times 4 \
          --cont_weight 0.3

  - LLaDA Single-sample profiling (threshold decoder, TP across 4 GPUs):

        python benchmarks/benchmark.py \
          --model_name GSAI-ML/LLaDA-8B-Instruct \
          --model_type llada \
          --gen_len 2048 \
          --block_length 32 \
          --gpu 0,1,2,3 \
          --use_tp \
          --parallel_decoding threshold \
          --threshold 0.9 \
          --cache prefix
- LLaDA, LLaDA1.5, LLaDA-MoE can use benchmark_dataset.py and…
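The `--gen_len` and `--block_length` values in the commands above together determine how many blocks are decoded. The sketch below works out that arithmetic. It assumes that blocks are decoded one after another and that each forward pass finalizes at least one token and at most the rest of the current block. That assumption is a simplification for illustration, not a statement about dInfer's internals, and the helper name `decoding_step_bounds` is hypothetical.

```python
import math


def decoding_step_bounds(gen_len, block_length):
    """Bounds on forward passes under the stated block-wise assumption."""
    if gen_len <= 0 or block_length <= 0:
        raise ValueError("gen_len and block_length must be positive")
    num_blocks = math.ceil(gen_len / block_length)
    return {
        "num_blocks": num_blocks,
        "min_steps": num_blocks,  # every block finalized in a single pass
        "max_steps": gen_len,     # exactly one token finalized per pass
    }
```

For example, `--gen_len 2048 --block_length 32` gives 64 blocks, and `--gen_len 1024 --block_length 64` gives 16. Under the same assumption, the decoding threshold influences where an actual run lands between the two bounds.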