inclusionAI/LLaDA2.0-Uni
Published Apr 22, 2026 · task: any-to-any · license: apache-2.0 · library: transformers · params: 16B · downloads: 7.2k · likes: 247
Model Capabilities
LLaDA2.0-Uni is a unified diffusion Large Language Model (dLLM) based on Mixture-of-Experts (MoE) that seamlessly integrates multimodal understanding and generation within a single model. It supports:
- 🖼️ Text-to-Image Generation — high-fidelity image synthesis with optional thinking/reasoning.
- 🔍 Image Understanding — visual question answering, image captioning, document understanding, etc.
- ✏️ Image Editing — instruction-based editing with single or multi-reference support.
- 🎨 Interleaved Generation and Reasoning — preliminary support for interleaved generation, which in turn enables advanced interleaved reasoning.
- ⚡ Sprint Acceleration — KV cache reuse and adaptive unmasking for faster inference.
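As a rough illustration of the adaptive unmasking idea named in the Sprint item, the sketch below commits every masked position whose confidence clears a threshold in a single step, and always commits at least the most confident one so that decoding makes progress. The function name `adaptive_unmask_step`, the threshold value, and the confidence scores are hypothetical and are not taken from the LLaDA2.0-Uni code.

```python
MASK = None  # placeholder for a still-masked position (illustrative only)

def adaptive_unmask_step(tokens, predictions, confidences, threshold=0.9):
    """Commit confident predictions for masked positions and return the new sequence.

    tokens:      current sequence, MASK where still undecided
    predictions: proposed token for every position
    confidences: confidence in [0, 1] for every position
    """
    masked = [i for i, t in enumerate(tokens) if t is MASK]
    if not masked:
        return list(tokens)
    # Every masked position that clears the threshold is committed in this step.
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen:
        # Guarantee progress: fall back to the single most confident position.
        chosen = [max(masked, key=lambda i: confidences[i])]
    out = list(tokens)
    for i in chosen:
        out[i] = predictions[i]
    return out

tokens = ["a", MASK, MASK, MASK]
preds = ["a", "red", "fox", "runs"]
conf = [1.0, 0.95, 0.40, 0.92]
step1 = adaptive_unmask_step(tokens, preds, conf)
step2 = adaptive_unmask_step(step1, preds, conf)
```

With a fixed number of tokens per step, this sequence would take three steps; here two confident positions are committed together, which is where the speed-up would come from under these assumptions.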
Model Architecture
- Unified dLLM-MoE Backbone: Unifies multimodal understanding and generation into a simple Mask Token Prediction paradigm.
- Discrete Semantic Tokenizer: Utilizes SigLIP-VQ to convert visual inputs into discrete semantic tokens, significantly enhancing multimodal understanding.
- Efficient Diffusion Decoder: Pairs discrete tokens with a specialized diffusion decoder for high-fidelity generation, enabling rapid 8-step inference via distillation.
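To make the "discrete semantic tokens" idea concrete, here is a minimal sketch of generic vector quantization: each feature vector is replaced by the index of its nearest codebook entry. The source does not describe how SigLIP-VQ actually quantizes, so the `quantize` function, the toy codebook, and the 2-D features below are illustrative assumptions only.

```python
def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]

# Toy 3-entry codebook and three 2-D "patch features" (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]
token_ids = quantize(features, codebook)
```

The resulting integer ids are what a mask-token-prediction backbone could treat like ordinary vocabulary tokens.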
Evaluation Results
Quick Start
> Note: Full installation instructions and CLI scripts are available in the GitHub repository.
⚙️ Installation
1. Create a conda environment
git clone https://github.com/inclusionAI/LLaDA2.0-Uni && cd LLaDA2.0-Uni
conda create -n llada2_uni python=3.10 -y
conda activate llada2_uni
2. Install PyTorch (CUDA 12.4)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
3. Install Flash Attention 2 (required for efficient inference)
pip install flash-attn --no-build-isolation
4. Install remaining dependencies
pip install -r requirements.txt
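After the four steps above, a quick sanity check can confirm that the key packages are importable before loading a 16B model. The helper below is a hypothetical convenience script, not part of the repository; it relies only on the standard library and on the usual import names (`flash_attn` for the `flash-attn` package).

```python
import importlib.util
import sys

def check_environment(packages=("torch", "torchvision", "flash_attn", "transformers")):
    """Report which of the expected packages can be found in this environment."""
    report = {name: importlib.util.find_spec(name) is not None for name in packages}
    # Record the interpreter version as well; the steps above create a 3.10 env.
    report["python"] = "%d.%d" % sys.version_info[:2]
    return report

if __name__ == "__main__":
    for name, status in check_environment().items():
        print(name, status)
```

A `False` entry means the corresponding install step did not take effect in the active environment.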
🌟 Text-to-Image Generation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens
model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Generate image tokens
result = model.generate_image(
    "A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f/2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.",
    image_h=1024, image_w=1024,
    steps=8, cfg_scale=2.0,
)
# Decode to PIL image (default: 50-step ODE)
image = decode_vq_tokens(result["token_ids"], result["h"], result["w"], model_path, "cuda")
image.save("output.png")
> Note: 💡 Faster decoding. Use decoder-turbo (the distilled decoder) for ~10× faster image decoding (8 steps instead of 50) with minimal quality loss:
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"], model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
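The `cfg_scale=2.0` argument in the generation call suggests classifier-free guidance. The source does not say how the model applies it internally, so the sketch below shows only the commonly used formulation, in which the prediction is extrapolated from the unconditional logits toward the conditional ones. The function name `apply_cfg` and the toy logits are hypothetical.

```python
def apply_cfg(cond_logits, uncond_logits, cfg_scale):
    """Classifier-free guidance: move from unconditional toward conditional logits."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

cond = [2.0, 0.5, -1.0]    # logits given the text prompt (toy values)
uncond = [1.0, 0.5, 0.0]   # logits with the prompt dropped (toy values)
guided = apply_cfg(cond, uncond, cfg_scale=2.0)
```

Under this formulation a scale of 1.0 reproduces the conditional logits unchanged, and larger values push the output further toward the prompt.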
🌟 Text-to-Image Generation with Thinking
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens
model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Generate image tokens with thinking process
result = model.generate_image(
    "A fox with thick, dense, fluffy fur in a winter setting, possibly surrounded by snow.",
    image_h=1024, image_w=1024,
    mode="thinking",
    steps=8, cfg_scale=2.0,
    thinking_steps=32, thinking_gen_length=4096,
)
# Print thinking trace
print("Thinking:", result["thinking"])
# Decode to PIL image
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"], model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output_thinking.png")
🌟 Image Understanding
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from encoder.image_tokenizer import ImageTokenizer
from decoder.smart_img_process import smart_resize_images
model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Encode image to discrete tokens
image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
pil_image = smart_resize_images(["./assets/understanding_example.png"])[0]
info = image_tokenizer.encode_with_info(pil_image)
image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
_, h, w = info["grid_thw"]
# Understand the image
response = model.understand_image(
    image_tokens, h, w,
    question="Describe this image in detail.",
    steps=32, gen_length=2048,
)
print(response)
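The `image_token_offset` addition in the snippet above appears to shift VQ codebook indices into a dedicated range of the model's shared vocabulary, so that image tokens do not collide with text token ids. That reading is an inference from the code, not something the source states. The sketch below illustrates the mapping with a made-up offset; the real value comes from `model.config`.

```python
IMAGE_TOKEN_OFFSET = 150_000  # hypothetical value; the real one is read from model.config

def to_vocab_ids(vq_ids, offset=IMAGE_TOKEN_OFFSET):
    """Shift VQ codebook indices into the image range of a shared vocabulary."""
    return [i + offset for i in vq_ids]

def to_vq_ids(vocab_ids, offset=IMAGE_TOKEN_OFFSET):
    """Inverse mapping: recover codebook indices from shared-vocabulary ids."""
    return [i - offset for i in vocab_ids]

vq_ids = [0, 17, 4095]              # toy codebook indices from an image tokenizer
vocab_ids = to_vocab_ids(vq_ids)    # ids as the language backbone would see them
```

The inverse mapping would be needed before handing generated image tokens to a decoder that expects raw codebook indices, assuming the decoder does not undo the offset itself.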
🌟 Image Editing
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from encoder.image_tokenizer import ImageTokenizer
from decoder.utils import generate_crop_size_list, var_center_crop
from decoder import decode_vq_tokens
from PIL import Image
model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Encode source image
image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
crop_size_list = generate_crop_size_list((512 // 32) ** 2, 32)
pil_image = var_center_crop(Image.open("./assets/edit_example.png").convert("RGB"), crop_size_list=crop_size_list)
info = image_tokenizer.encode_with_info(pil_image)
image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
_, h, w = info["grid_thw"]
# Edit the image
result = model.edit_image(…
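The call `generate_crop_size_list((512 // 32) ** 2, 32)` in the editing snippet passes a budget of 256 patches of 32 pixels each, which is the area of a 512×512 image. How the repository's helper turns that budget into candidate sizes is not shown in this excerpt. The sketch below is one plausible interpretation: for each width in patches it takes the tallest height that fits the budget and keeps the pair if the aspect ratio is bounded. The function `sketch_crop_sizes` and its `max_ratio` parameter are hypothetical and may not match the real implementation.

```python
def sketch_crop_sizes(num_patches, patch_size, max_ratio=4.0):
    """List (width, height) crop sizes in pixels that fit a patch budget.

    Hypothetical reconstruction; not the repository's generate_crop_size_list.
    """
    sizes = []
    for wp in range(1, num_patches + 1):
        hp = num_patches // wp  # tallest grid that still fits the patch budget
        if max(wp, hp) / min(wp, hp) <= max_ratio:
            sizes.append((wp * patch_size, hp * patch_size))
    return sizes

budget = (512 // 32) ** 2            # 256 patches, as in the call above
sizes = sketch_crop_sizes(budget, 32)
```

Under these assumptions a source image would be center-cropped to whichever listed size best matches its aspect ratio, which keeps the token count roughly constant across portrait, square, and landscape inputs.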