inclusionAI/LLaDA2.0-Uni-FP8
Published: May 6, 2026 · Task: any-to-any · License: apache-2.0 · Library: transformers · Params: 16B · Downloads: 6k · Likes: 5
Overview
This is the FP8-quantized version of LLaDA2.0-Uni. It applies block-wise FP8 quantization to the MoE expert weights, which cuts GPU memory at model load by about 48% (62.9 GB to 32.5 GB) while preserving output quality.
Quantization Details
- Method: Block-wise FP8 (float8_e4m3fn) with per-block scale factors
- Block size: 128×128
- Quantized layers: MoE routed expert weights (gate_proj, up_proj, down_proj)
- Kept in BF16: Embeddings, lm_head, attention projections, shared experts, layer norms, routing gates
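The scheme in the list above can be illustrated with a small NumPy simulation. This is a minimal sketch, not the code used to produce this checkpoint. NumPy has no FP8 dtype, so `round_to_e4m3` (a hypothetical helper) emulates rounding to the float8_e4m3fn grid in ordinary floats. Setting each scale to the block's absolute maximum divided by 448 (the largest finite float8_e4m3fn value) is an assumption, since the card does not say how the scales are computed.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
BLOCK = 128           # block edge length, from the list above


def round_to_e4m3(x):
    """Emulate rounding to the float8_e4m3fn grid (3 mantissa bits)."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    _, exp = np.frexp(x)             # x = m * 2**exp with 0.5 <= |m| < 1
    e = np.maximum(exp - 1, -6)      # -6 is the smallest normal exponent
    step = np.ldexp(1.0, e - 3)      # spacing of representable values
    return np.round(x / step) * step


def quantize_blockwise(w, block=BLOCK):
    """Return (quantized values, per-block scales) for a 2-D weight."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    # View the matrix as a grid of block x block tiles.
    tiles = w.reshape(rows // block, block, cols // block, block)
    absmax = np.abs(tiles).max(axis=(1, 3), keepdims=True)
    # Assumed scale rule: map each tile's absmax onto the FP8 maximum.
    scale = np.maximum(absmax, 1e-12) / FP8_E4M3_MAX
    q = round_to_e4m3(tiles / scale)
    return q.reshape(rows, cols), scale.squeeze(axis=(1, 3))


def dequantize_blockwise(q, scale, block=BLOCK):
    """Multiply every tile by its scale to recover approximate weights."""
    rows, cols = q.shape
    tiles = q.reshape(rows // block, block, cols // block, block)
    return (tiles * scale[:, None, :, None]).reshape(rows, cols)


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scale)
print("scale grid shape:", scale.shape)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

With 128×128 blocks there is one scale per 16,384 weights, so the scales add very little storage on top of the one-byte FP8 values.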
Memory Comparison
| Variant | Model Loading | T2I Peak | Understanding Peak | Edit Peak |
|---------|--------------|----------|-------------------|-----------|
| BF16 | 62.9 GB | 35.3 GB | 33.2 GB | 41.7 GB |
| FP8 | 32.5 GB | 35.3 GB | 33.3 GB | 41.8 GB |
> Note: FP8 roughly halves the static model weight memory (~30 GB saved at load time). Peak inference memory is essentially unchanged because activations dominate during generation.
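The savings quoted above follow directly from the table. As a quick check, the sketch below recomputes the absolute and relative differences. The numbers are copied from the table, and the names `memory_gb` and `reduction_pct` are illustrative, not part of the model repository.

```python
# Figures copied from the Memory Comparison table, in GB: (BF16, FP8).
memory_gb = {
    "Model Loading": (62.9, 32.5),
    "T2I Peak": (35.3, 35.3),
    "Understanding Peak": (33.2, 33.3),
    "Edit Peak": (41.7, 41.8),
}


def reduction_pct(bf16_gb, fp8_gb):
    """Percentage of memory saved by FP8 relative to BF16."""
    return 100.0 * (bf16_gb - fp8_gb) / bf16_gb


for stage, (bf16_gb, fp8_gb) in memory_gb.items():
    saved = bf16_gb - fp8_gb
    pct = reduction_pct(bf16_gb, fp8_gb)
    print(f"{stage:<20} saved {saved:+5.1f} GB ({pct:+5.1f}%)")
```

Only the load-time row shows a meaningful difference. The three peak rows differ by at most 0.1 GB between the two variants.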
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "inclusionAI/LLaDA2.0-Uni-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Text-to-Image Generation
result = model.generate_image(
    "A cat sitting on a windowsill at sunset",
    image_h=1024, image_w=1024,
    steps=16, cfg_scale=4.0,
)
# Decode VQ tokens to image
from decoder import decode_vq_tokens
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"],
    model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output.png")
Model Capabilities
Same as the base LLaDA2.0-Uni model:
- 🖼️ Text-to-Image Generation
- 🔍 Image Understanding
- ✏️ Image Editing
- ⚡ Sprint Acceleration
⚠️ License
This project is licensed under the terms of the Apache License 2.0.
📖 BibTeX
@article{LLaDA2Uni,
  title = {LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model},
  author = {Tiwei Bie and Haoxing Chen and Tieyuan Chen and Zhenglin Cheng and Long Cui and Kai Gan and Zhicheng Huang and Zhenzhong Lan and Haoquan Li and Jianguo Li and Tao Lin and Qi Qin and Hongjun Wang and Xiaomei Wang and Haoyuan Wu and Yi Xin and Junbo Zhao},
  journal = {arXiv preprint arXiv:2604.20796},
  year = {2026}
}