Repo · InclusionAI (Ant Group) · published Apr 17, 2026 · seen 5d

inclusionAI/LLaDA2.0-Uni

Python

Description: LLaDA2.0-Uni: Understanding and Generation the World.

Language: Python

Stars: 758

Forks: 49

Open issues: 6

Created: 2026-04-17T08:48:24Z

Pushed: 2026-05-29T07:06:33Z

Default branch: main

Fork: no

Archived: no

README:

🔥 News

  • [2026-05-29] 📣 SGLang Omni support is ready. See cookbook for installation and usage.
  • [2026-05-12] 🖥️ We release ComfyUI and Diffusers support. See [apps](./apps/) for installation and usage.
  • [2026-05-06] ⚡ We release the FP8 quantized versions on HuggingFace and ModelScope.
  • [2026-04-23] 🎉 We release the initial version of LLaDA2.0-Uni, including:
      • 🎯 Model Checkpoints on HuggingFace!
      • 🎯 Text-to-Image (w/ thinking mode) Inference Code!
      • 🎯 Image Understanding Inference Code!
      • 🎯 Image Editing Inference Code!
      • 🎯 SPRINT Acceleration for dLLM Backbone!

📝 TODO

  • [x] Quantized model
  • [x] Diffusers support
  • [x] ComfyUI support
  • [x] SGLang support
  • [ ] RL optimization

📚 Model Introduction

We introduce LLaDA2.0-Uni, a unified dLLM-based Mixture-of-Experts (MoE) model that seamlessly integrates multimodal understanding and generation.

Architectural Innovations

  • Unified dLLM-MoE Backbone: Built on LLaDA 2.0, it unifies multimodal understanding and generation into a simple Mask Token Prediction paradigm.
  • Discrete Semantic Tokenizer: Utilizes SigLIP-VQ to convert visual inputs into discrete semantic tokens, significantly enhancing multimodal understanding.
  • Efficient Diffusion Decoder: Pairs discrete tokens with a specialized diffusion decoder for high-fidelity generation, enabling rapid 8-step inference via distillation.
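The Mask Token Prediction paradigm named above can be illustrated with a toy loop. This is a minimal sketch of the general idea only (start fully masked, commit the most confident predictions at each step), not the repository's sampler: the `MASK` id, `toy_predict`, `iterative_unmask`, and the even commit schedule are all assumptions made for illustration, and a random number generator stands in for the model.

```python
import random

MASK = -1  # placeholder id for a masked position (assumption for this sketch)


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the model: propose (token, confidence) for each masked position."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def iterative_unmask(length, steps, vocab_size=16, seed=0):
    """Start from a fully masked sequence and fill it in over a fixed number of steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        proposals = toy_predict(tokens, vocab_size, rng)
        if not proposals:
            break
        # Spread the still-masked positions evenly over the remaining steps.
        remaining_steps = steps - step
        n_commit = -(-len(proposals) // remaining_steps)  # ceiling division
        # Commit the highest-confidence proposals; the rest stay masked for now.
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:n_commit]:
            tokens[pos] = tok
    return tokens


sequence = iterative_unmask(length=64, steps=8)
```

Because the final step always commits every remaining position, the sequence is fully unmasked after `steps` iterations regardless of `length`.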

Core Capabilities

  • Top-Tier Understanding & Generation: Matches dedicated VLMs in answering visual questions and understanding documents, while also generating highly detailed images.
  • Flexible Image Editing: Supports single or multi-reference editing. It enables precise modifications while perfectly preserving original details.
  • Interleaved Generation & Reasoning: Empowered by unified discrete representations, it effortlessly handles complex interleaved generation and unlocks advanced interleaved reasoning.

📊 Evaluation Results

📌 Quick Start

⚙️ Installation

1. Create a conda environment

git clone https://github.com/inclusionAI/LLaDA2.0-Uni && cd LLaDA2.0-Uni
conda create -n llada2_uni python=3.10 -y
conda activate llada2_uni

2. Install PyTorch (CUDA 12.4)

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

3. Install Flash Attention 2 (required for efficient inference)

pip install flash-attn --no-build-isolation

4. Install remaining dependencies

pip install -r requirements.txt
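After installation, a quick import check can confirm that the packages from the steps above are visible to the interpreter. The sketch below is not part of the repository; `check_environment` is a hypothetical helper, and the package list is an assumption based on the install commands and the imports used in the inference snippets (`flash_attn` is the import name of the `flash-attn` package).

```python
import importlib.util

# Import names assumed from the install steps and the inference examples.
REQUIRED = ("torch", "torchvision", "flash_attn", "transformers")


def check_environment(packages=REQUIRED):
    """Map each package name to True if it can be found on the import path."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

This only checks that the packages can be located; it does not verify CUDA availability or version compatibility.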

🧨 Inference

🌟 Text-to-Image Generation

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens

model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Generate image tokens
result = model.generate_image(
    "A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f/2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.",
    image_h=1024, image_w=1024,
    steps=8, cfg_scale=2.0,
)

# Decode to PIL image (default: 50-step ODE)
image = decode_vq_tokens(result["token_ids"], result["h"], result["w"], model_path, "cuda")
image.save("output.png")

> [!Note]
> 💡 Faster decoding: Use the decoder-turbo (distilled decoder) for ~10× faster image decoding (8 steps instead of 50) with minimal quality loss:
> ```python
> image = decode_vq_tokens(
>     result["token_ids"], result["h"], result["w"], model_path, "cuda",
>     num_steps=8, decode_mode="decoder-turbo",
> )
> ```
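When switching between the default decoder and decoder-turbo, it can help to keep the decoder settings in one place. The sketch below only assembles the keyword arguments shown in the note above; `decoder_kwargs` and the preset names are hypothetical, and the default preset is left empty because the README does not state the default `decode_mode` value.

```python
def decoder_kwargs(preset="default"):
    """Return keyword arguments for decode_vq_tokens for a named preset (hypothetical helper)."""
    presets = {
        # Empty dict: rely on the decoder's own defaults (50-step ODE per the README).
        "default": {},
        # Distilled decoder settings taken from the README note.
        "turbo": {"num_steps": 8, "decode_mode": "decoder-turbo"},
    }
    if preset not in presets:
        raise ValueError(f"unknown preset {preset!r}; expected one of {sorted(presets)}")
    # Return a copy so callers cannot mutate the preset table.
    return dict(presets[preset])
```

The result would then be unpacked into the call from the example above, for instance `decode_vq_tokens(result["token_ids"], result["h"], result["w"], model_path, "cuda", **decoder_kwargs("turbo"))`.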

🌟 Text-to-Image Generation with Thinking

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens

model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Generate image tokens with thinking process
result = model.generate_image(
    "A fox with thick, dense, fluffy fur in a winter setting, possibly surrounded by snow.",
    image_h=1024, image_w=1024,
    mode="thinking",
    steps=8, cfg_scale=2.0,
    thinking_steps=32, thinking_gen_length=4096,
)

# Print thinking trace
print("Thinking:", result["thinking"])

# Decode to PIL image
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"], model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output_thinking.png")

🌟 Image Understanding

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from encoder.image_tokenizer import ImageTokenizer
from decoder.smart_img_process import smart_resize_images

model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Encode image to discrete tokens
image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
pil_image = smart_resize_images(["./assets/understanding_example.png"])[0]
info = image_tokenizer.encode_with_info(pil_image)
image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
_, h, w = info["grid_thw"]

# Understand the image
response = model.understand_image(
    image_tokens, h, w,…

Excerpt shown — open the source for the full document.
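The `image_token_offset` step in the understanding example shifts the image tokenizer's ids before they are passed to the model, presumably so that image tokens occupy their own range of a vocabulary shared with text tokens. The sketch below shows that mapping and its inverse in isolation; the function names and the offset value are made up for illustration, and the real offset comes from `model.config.image_token_offset`.

```python
def to_vocab_ids(codebook_ids, image_token_offset):
    """Shift tokenizer codebook indices into the model's vocabulary range."""
    return [i + image_token_offset for i in codebook_ids]


def to_codebook_ids(vocab_ids, image_token_offset):
    """Inverse mapping: recover codebook indices from shifted vocabulary ids."""
    return [i - image_token_offset for i in vocab_ids]


# Illustrative values only; the real offset is read from the model config.
example_offset = 100_000
codebook_ids = [0, 17, 4095]
vocab_ids = to_vocab_ids(codebook_ids, example_offset)
round_trip = to_codebook_ids(vocab_ids, example_offset)
```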
