inclusionAI/LLaDA2.0-Uni
Python
Description: LLaDA2.0-Uni: Understanding and Generation the World.
Language: Python
Stars: 758
Forks: 49
Open issues: 6
Created: 2026-04-17T08:48:24Z
Pushed: 2026-05-29T07:06:33Z
Default branch: main
Fork: no
Archived: no
README:
🔥 News
- [2026-05-29] 📣 SGLang Omni support is ready. See cookbook for installation and usage.
- [2026-05-12] 🖥️ We release ComfyUI and Diffusers support. See [apps](./apps/) for installation and usage.
- [2026-05-06] ⚡ We release the FP8 quantized versions on HuggingFace and ModelScope.
- [2026-04-23] 🎉 We release the initial version of LLaDA2.0-Uni, including:
  - 🎯 Model Checkpoints on HuggingFace!
  - 🎯 Text-to-Image (w/ thinking mode) Inference Code!
  - 🎯 Image Understanding Inference Code!
  - 🎯 Image Editing Inference Code!
  - 🎯 SPRINT Acceleration for dLLM Backbone!
📝 TODO
- [x] Quantized model
- [x] Diffusers support
- [x] ComfyUI support
- [x] SGLang support
- [ ] RL optimization
📚 Model Introduction
We introduce LLaDA2.0-Uni, a unified dLLM-based Mixture-of-Experts (MoE) model that seamlessly integrates multimodal understanding and generation.
Architectural Innovations
- Unified dLLM-MoE Backbone: Built on LLaDA 2.0, it unifies multimodal understanding and generation into a simple Mask Token Prediction paradigm.
- Discrete Semantic Tokenizer: Utilizes SigLIP-VQ to convert visual inputs into discrete semantic tokens, significantly enhancing multimodal understanding.
- Efficient Diffusion Decoder: Pairs discrete tokens with a specialized diffusion decoder for high-fidelity generation, enabling rapid 8-step inference via distillation.
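To make the Mask Token Prediction paradigm more concrete, here is a toy sketch of iterative unmasking: start from a fully masked sequence, let a model propose a token for every masked slot, and commit the most confident proposals each round. This is not the repository's sampler. `MASK`, `toy_predict`, and the even commit schedule are illustrative assumptions, and a random number generator stands in for the model.

```python
import random

MASK = -1  # hypothetical sentinel for a masked position


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the model: propose (token, confidence) for every masked slot."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def iterative_unmask(length, steps, vocab_size=16, seed=0):
    """Fill a fully masked sequence in `steps` rounds, most confident slots first."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        proposals = toy_predict(tokens, vocab_size, rng)
        if not proposals:
            break
        # Spread the remaining masked slots evenly over the remaining steps.
        remaining_steps = steps - step
        n_commit = -(-len(proposals) // remaining_steps)  # ceiling division
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:n_commit]:
            tokens[pos] = tok
    return tokens


if __name__ == "__main__":
    print(iterative_unmask(length=12, steps=4))
```

Because every masked position is predicted in parallel, the number of rounds (`steps`) can be much smaller than the sequence length, which is the general appeal of this family of decoders.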
Core Capabilities
- Top-Tier Understanding & Generation: Matches dedicated VLMs in answering visual questions and understanding documents, while also generating highly detailed images.
- Flexible Image Editing: Supports single- and multi-reference editing, enabling precise modifications while preserving original details.
- Interleaved Generation & Reasoning: Empowered by unified discrete representations, it effortlessly handles complex interleaved generation and unlocks advanced interleaved reasoning.
📊 Evaluation Results
📌 Quick Start
⚙️ Installation
1. Create a conda environment
git clone https://github.com/inclusionAI/LLaDA2-Uni && cd LLaDA2-Uni
conda create -n llada2_uni python=3.10 -y
conda activate llada2_uni
2. Install PyTorch (CUDA 12.4)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
3. Install Flash Attention 2 (required for efficient inference)
pip install flash-attn --no-build-isolation
4. Install remaining dependencies
pip install -r requirements.txt
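After the four steps above, a quick sanity check can confirm that the key packages are visible to the active environment. The sketch below only looks the packages up without importing them. The import names are assumptions: `flash_attn` is the import name of the `flash-attn` package, and `transformers` is assumed to come from `requirements.txt` because the inference examples import it.

```python
import importlib.util


def check_packages(names):
    """Return {package_name: True/False} without importing the packages."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Import names assumed from the install steps; adjust to your setup.
    status = check_packages(["torch", "torchvision", "flash_attn", "transformers"])
    for name, ok in status.items():
        print(f"{name:12s} {'found' if ok else 'MISSING'}")
```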
🧨 Inference
🌟 Text-to-Image Generation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens
model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Generate image tokens
result = model.generate_image(
    "A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f/2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.",
    image_h=1024, image_w=1024,
    steps=8, cfg_scale=2.0,
)
# Decode to PIL image (default: 50-step ODE)
image = decode_vq_tokens(result["token_ids"], result["h"], result["w"], model_path, "cuda")
image.save("output.png")
> [!Note]
> 💡 Faster decoding: use the decoder-turbo (distilled decoder) for ~10× faster image decoding (8 steps instead of 50) with minimal quality loss:
> ```python
> image = decode_vq_tokens(
>     result["token_ids"], result["h"], result["w"], model_path, "cuda",
>     num_steps=8, decode_mode="decoder-turbo",
> )
> ```
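If you switch between the default decoder and decoder-turbo often, one option is to keep the two settings in a small helper. `decoder_kwargs` is a hypothetical name, not part of the repository; it only packages the `num_steps` and `decode_mode` values quoted in the note above.

```python
def decoder_kwargs(fast=False):
    """Extra keyword arguments for decode_vq_tokens (hypothetical helper)."""
    if fast:
        # Distilled decoder: 8 steps, as quoted in the README note.
        return {"num_steps": 8, "decode_mode": "decoder-turbo"}
    # An empty dict leaves the defaults (documented as a 50-step ODE) in place.
    return {}


# Usage, assuming `result`, `model_path`, and `decode_vq_tokens` from the example above:
# image = decode_vq_tokens(
#     result["token_ids"], result["h"], result["w"], model_path, "cuda",
#     **decoder_kwargs(fast=True),
# )
```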
🌟 Text-to-Image Generation with Thinking
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens
model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Generate image tokens with thinking process
result = model.generate_image(
    "A fox with thick, dense, fluffy fur in a winter setting, possibly surrounded by snow.",
    image_h=1024, image_w=1024,
    mode="thinking",
    steps=8, cfg_scale=2.0,
    thinking_steps=32, thinking_gen_length=4096,
)
# Print thinking trace
print("Thinking:", result["thinking"])
# Decode to PIL image
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"], model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output_thinking.png")
🌟 Image Understanding
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from encoder.image_tokenizer import ImageTokenizer
from decoder.smart_img_process import smart_resize_images
model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Encode image to discrete tokens
image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
pil_image = smart_resize_images(["./assets/understanding_example.png"])[0]
info = image_tokenizer.encode_with_info(pil_image)
image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
_, h, w = info["grid_thw"]
# Understand the image
response = model.understand_image(
    image_tokens, h, w,…
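The `image_token_offset` step in the snippet above shifts the image tokenizer's ids by a fixed amount before they are passed to the model, presumably so that image codes occupy their own id range next to the text vocabulary. The sketch below illustrates that mapping and its inverse with made-up numbers. The helper names and the offset value are hypothetical; the real offset comes from `model.config.image_token_offset`.

```python
def shift_image_tokens(token_ids, offset):
    """Map tokenizer-local VQ ids into a shared vocabulary by adding a fixed offset."""
    return [t + offset for t in token_ids]


def unshift_image_tokens(shifted_ids, offset):
    """Inverse mapping: recover tokenizer-local ids from shared-vocabulary ids."""
    return [t - offset for t in shifted_ids]


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    local_ids = [0, 5, 1023]
    offset = 150_000
    shifted = shift_image_tokens(local_ids, offset)
    print(shifted)
    print(unshift_image_tokens(shifted, offset) == local_ids)
```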