Repo · InclusionAI (Ant Group) · published Sep 29, 2025

inclusionAI/dInfer

Python


Description: dInfer: An Efficient Inference Framework for Diffusion Language Models

Language: Python

License: Apache-2.0

Stars: 470

Forks: 45

Open issues: 18

Created: 2025-09-29T08:07:23Z

Pushed: 2026-02-11T03:10:42Z

Default branch: master

Fork: no

Archived: no

README:

Introduction

dInfer is an efficient and extensible inference framework for diffusion language models (dLLMs). As the architecture figure below illustrates, it splits inference into four components: *model*, *diffusion iteration manager*, *decoder* and *KV-cache manager*. Each component exposes APIs that let different algorithms be combined flexibly. It now supports batched inference for improved throughput.

Figure: Overall Architecture of dInfer

dInfer supports multiple dLLM variants, including LLaDA, LLaDA-MoE and LLaDA2.
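As a rough illustration of the four-component split described in this introduction, the toy sketch below wires together a stand-in model, a block-wise iteration manager, a decoder and a cache tracker. Every name here (`toy_model`, `block_iterator`, `greedy_decoder`, `PrefixCache`, `generate`, `MASK`) is hypothetical and chosen for this example only. None of it is dInfer's actual API, and the roles assigned to each component are an assumed reading of their names.

```python
from typing import Iterator, List, Tuple

MASK = -1  # hypothetical placeholder id for a position that is not decoded yet


def toy_model(tokens: List[int]) -> List[Tuple[int, float]]:
    """Model stand-in: return one (token_id, confidence) guess per position."""
    return [(i % 7, 0.5 + 0.05 * (i % 10)) for i in range(len(tokens))]


def block_iterator(prompt_len: int, gen_len: int, block_length: int) -> Iterator[Tuple[int, int]]:
    """Iteration-manager stand-in: yield (start, end) of each block, left to right."""
    stop = prompt_len + gen_len
    for start in range(prompt_len, stop, block_length):
        yield start, min(start + block_length, stop)


def greedy_decoder(tokens: List[int], preds: List[Tuple[int, float]], start: int, end: int) -> None:
    """Decoder stand-in: commit the single most confident masked position in the block."""
    masked = [i for i in range(start, end) if tokens[i] == MASK]
    best = max(masked, key=lambda i: preds[i][1])
    tokens[best] = preds[best][0]


class PrefixCache:
    """Cache-manager stand-in: only records how many positions are finished."""

    def __init__(self) -> None:
        self.cached_upto = 0

    def update(self, end: int) -> None:
        self.cached_upto = end


def generate(prompt: List[int], gen_len: int, block_length: int) -> Tuple[List[int], int]:
    """Compose the four stand-ins: decode block by block until no mask remains."""
    tokens = list(prompt) + [MASK] * gen_len
    cache = PrefixCache()
    for start, end in block_iterator(len(prompt), gen_len, block_length):
        while any(t == MASK for t in tokens[start:end]):
            greedy_decoder(tokens, toy_model(tokens), start, end)
        cache.update(end)  # a real cache would store key/value states here
    return tokens, cache.cached_upto
```

Because each piece is a separate function or class, swapping one (for example, a different decoder) leaves the others unchanged, which is the kind of flexibility a modular design aims for.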

News

\[2025/12/21\] Release v0.2. The major features of this release can be found here.

\[2025/12/10\] Support and speed up the formal versions of the block diffusion LLMs (LLaDA2-mini and LLaDA2-flash). Support quantized versions of LLaDA2-mini and LLaDA2-flash.

\[2025/11/15\] Support the inference on block diffusion LLMs (LLaDA2-mini-preview and LLaDA2-flash-preview).

\[2025/10/10\] Release the first version of the dInfer framework.

Contents

  • [Supported Models](#supported-models)
  • [Quick Start](#quick-start)
  • [Benchmark Results](#benchmark-results)

Supported Models

dInfer supports multiple diffusion language model variants with different architectures and sizes. Below are the HuggingFace model links and their corresponding implementation files:

| Model | Size | Implementation | HuggingFace Link |
|-------|------|----------------|------------------|
| LLaDA2.0-mini | 16B | [LLaDA2MoeModelLM](python/dinfer/model/modeling_llada2_moe.py) | inclusionAI/LLaDA2.0-mini |
| LLaDA2.0-flash | 100B | [LLaDA2MoeModelLM](python/dinfer/model/modeling_llada2_moe.py) | inclusionAI/LLaDA2.0-flash |
| LLaDA2.0-mini-preview | 16B | [LLaDA2MoeModelLM](python/dinfer/model/modeling_llada2_moe.py) | inclusionAI/LLaDA2.0-mini-preview |
| LLaDA2.0-flash-preview | 100B | [LLaDA2MoeModelLM](python/dinfer/model/modeling_llada2_moe.py) | inclusionAI/LLaDA2.0-flash-preview |
| LLaDA-MoE-7B-A1B-Base | 7B | [LLaDAMoeModelLM](python/dinfer/model/modeling_fused_olmoe.py) | inclusionAI/LLaDA-MoE-7B-A1B-Base |
| LLaDA-MoE-7B-A1B-Instruct | 7B | [LLaDAMoeModelLM](python/dinfer/model/modeling_fused_olmoe.py) | inclusionAI/LLaDA-MoE-7B-A1B-Instruct |
| LLaDA-8B-Base | 8B | [LLaDAModelLM](python/dinfer/model/modeling_llada.py) | GSAI-ML/LLaDA-8B-Base |
| LLaDA-8B-Instruct | 8B | [LLaDAModelLM](python/dinfer/model/modeling_llada.py) | GSAI-ML/LLaDA-8B-Instruct |
| LLaDA-1.5 | 8B | [LLaDAModelLM](python/dinfer/model/modeling_llada.py) | GSAI-ML/LLaDA-1.5 |

Quick Start

Install dInfer

git clone https://github.com/inclusionAI/dInfer.git
cd dInfer
pip install .

To use dInfer with the vLLM backend (which works with LLaDA and LLaDA-MoE), install vLLM:

pip install vllm==0.10.2

To use dInfer with the SGLang backend (which works with LLaDA2), install SGLang:

pip install sglang==0.5.3.post1
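Since the two backends are pinned to specific versions, it can help to confirm what is actually installed before running anything. The following is a small sketch using only the standard library; the helper names `installed_version` and `backend_report` are made up for this example and are not part of dInfer.

```python
from importlib import metadata
from typing import Dict, Optional

# Versions taken from the pip commands in this section.
PINNED = {"vllm": "0.10.2", "sglang": "0.5.3.post1"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def backend_report(pinned: Dict[str, str] = PINNED) -> Dict[str, str]:
    """Compare installed versions against the pinned ones and describe each."""
    report = {}
    for package, wanted in pinned.items():
        found = installed_version(package)
        if found is None:
            report[package] = "not installed"
        elif found == wanted:
            report[package] = "ok"
        else:
            report[package] = f"installed {found}, expected {wanted}"
    return report


if __name__ == "__main__":
    for package, status in backend_report().items():
        print(f"{package}: {status}")
```

Only the backend matching your model family needs to be present, so a "not installed" entry for the other one is expected.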

Convert to FusedMoE (LLaDA-MoE only)

To run a LLaDA-MoE model downloaded from HuggingFace, first convert it to a format supported by dInfer. dInfer provides the script tools/transfer.py for this conversion.

1) Download and Convert

pip install -U huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1

# Download Instruct checkpoint
hf download inclusionAI/LLaDA-MoE-7B-A1B-Instruct \
--repo-type model \
--local-dir /path/to/LLaDA-MoE-7B-A1B-Instruct

# Convert to FusedMoE
python -m tools.transfer \
--input /path/to/LLaDA-MoE-7B-A1B-Instruct \
--output /path/to/LLaDA-MoE-7B-A1B-Instruct-fused

2) Load the model

from dinfer.model import AutoModelForCausalLM
from transformers import AutoTokenizer
# Path to the fused checkpoint written by the conversion step above
m = "/path/to/LLaDA-MoE-7B-A1B-Instruct-fused"
tok = AutoTokenizer.from_pretrained(m, trust_remote_code=True)
# Load the weights in bfloat16
model = AutoModelForCausalLM.from_pretrained(m, trust_remote_code=True, torch_dtype="bfloat16")

Run Inference

Benchmark (speed only)

Measure throughput (TPS) only; predictions are saved under --output_dir with no automatic scoring.
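For readers unfamiliar with the metric, TPS here means generated tokens divided by wall-clock seconds. The sketch below shows one simple way to measure it around any generation callable. The function `measure_tps` and its parameters are hypothetical and are not how the benchmark scripts are implemented.

```python
import time
from typing import Callable, List, Tuple


def measure_tps(generate: Callable[[], List[int]], warmup: int = 1, runs: int = 3) -> Tuple[float, int]:
    """Return (tokens_per_second, total_tokens) over `runs` timed calls.

    `generate` is any zero-argument callable returning the generated token ids.
    Warm-up calls are executed first and excluded from the timing.
    """
    for _ in range(warmup):
        generate()

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += len(generate())
    # Guard against a zero interval on very fast callables.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_tokens / elapsed, total_tokens
```

When timing GPU work, the device should be synchronized (for example with `torch.cuda.synchronize()`) before reading the clock, otherwise asynchronous kernels can make the interval look shorter than it is.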

  • LLaDA2 models
  • LLaDA2-flash Dataset profiling (threshold decoder, TP across 4 GPUs):
python benchmarks/benchmark_dataset_sglang.py \
--model_name inclusionAI/LLaDA2.0-flash \
--dataset dataset_path \
--gen_len 2048 \
--block_length 32 \
--gpu 0,1,2,3 \
--output_dir runs/llada2_flash \
--use_tp \
--parallel_decoding threshold \
--threshold 0.9 \
--cache prefix \
--use_bd
  • LLaDA2-mini Dataset profiling (threshold decoder, TP across 4 GPUs):
python benchmarks/benchmark_dataset_sglang.py \
--model_name inclusionAI/LLaDA2.0-mini \
--dataset dataset_path \
--gen_len 2048 \
--block_length 32 \
--gpu 0,1,2,3 \
--output_dir runs/llada2_mini \
--use_tp \
--parallel_decoding threshold \
--threshold 0.9 \
--cache prefix \
--use_bd
  • LLaDA, LLaDA-1.5 and LLaDA-MoE models
  • LLaDA-MoE Dataset profiling (threshold decoder, TP across 4 GPUs):
python benchmarks/benchmark_dataset.py \
--model_name inclusionAI/LLaDA-MoE-7B-A1B-Instruct \
--model_type llada_moe \
--dataset dataset_path \
--gen_len 1024 \
--block_length 64 \
--gpu 0,1,2,3 \
--output_dir runs/llada_moe_threshold \
--use_tp \
--parallel_decoding threshold \
--threshold 0.8 \
--cache dual \
--prefix_look 16 \
--after_look 16 \
--warmup_times 4 \
--cont_weight 0.3
  • LLaDA Single-sample profiling (threshold decoder, TP across 4 GPUs):
python benchmarks/benchmark.py \
--model_name GSAI-ML/LLaDA-8B-Instruct \
--model_type llada \
--gen_len 2048 \
--block_length 32 \
--gpu 0,1,2,3 \
--use_tp \
--parallel_decoding threshold \
--threshold 0.9 \
--cache prefix
  • LLaDA, LLaDA1.5, LLaDA-MoE can use benchmark_dataset.py and…
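The commands above pass `--parallel_decoding threshold`, `--threshold` and `--block_length`, but this text does not spell out what they do. A common reading of threshold-based parallel decoding is: within the current block, commit every masked position whose top-1 probability reaches the threshold, and if none does, commit the single most confident one so that decoding always progresses. The sketch below implements that reading. The function names, the `>=` comparison and the fallback rule are assumptions for illustration, and dInfer's decoder may behave differently.

```python
from typing import List, Optional


def threshold_step(confidences: List[Optional[float]], threshold: float) -> List[int]:
    """Pick the positions to unmask in one iteration.

    confidences[i] is the top-1 probability at masked position i,
    or None if position i has already been decoded.
    """
    masked = [i for i, c in enumerate(confidences) if c is not None]
    if not masked:
        return []
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen:
        # Fallback: always commit at least one token per iteration.
        chosen = [max(masked, key=lambda i: confidences[i])]
    return chosen


def num_blocks(gen_len: int, block_length: int) -> int:
    """Number of blocks needed to cover gen_len tokens (ceiling division)."""
    return -(-gen_len // block_length)


if __name__ == "__main__":
    print(threshold_step([0.95, 0.5, None, 0.91], threshold=0.9))
    print(num_blocks(2048, 32))
```

Under this reading, a higher threshold commits fewer tokens per iteration (more model calls, more conservative choices), while a lower one commits more tokens at once.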
