What does this model signal mean?

InclusionAI (Ant Group) published inclusionAI/LLaDA2.0-Uni-FP8, the FP8-quantized release of its LLaDA2.0-Uni diffusion language model. This model signal is evidence of what the lab shipped on model infrastructure and how it positioned the release. High-signal details: license apache-2.0 · 5.1K HF downloads · diffusion language model, version 2.0, with FP8 quantization. onlylabs links this event to 1 captured evidence page and 6 related model signals.

InclusionAI (Ant Group) Model: inclusionAI/LLaDA2.0-Uni-FP8

Captured source

Hugging Face: huggingface.co/inclusionAI/LLaDA2.0-Uni-FP8

inclusionAI/LLaDA2.0-Uni-FP8 model card

published May 6, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain · task any-to-any · license apache-2.0 · library transformers · params 16B · downloads 5.1k · likes 5

Overview

This is the FP8 quantized version of LLaDA2.0-Uni, featuring block-wise FP8 quantization of MoE expert weights. This reduces GPU memory usage by ~48% for model loading while preserving output quality.

Quantization Details

Method: Block-wise FP8 (float8_e4m3fn) with per-block scale factors
Block size: 128×128
Quantized layers: MoE routed expert weights (gate_proj, up_proj, down_proj)
Kept in BF16: Embeddings, lm_head, attention projections, shared experts, layer norms, routing gates
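
The per-block scaling described above can be sketched in a few lines. This is an illustration only, not the release's quantization code: the function names `blockwise_scales` and `dequantize` are hypothetical, the absmax-to-448 scaling rule is an assumption (448 is the largest finite value of float8_e4m3fn), and the final cast to FP8 is left out because NumPy has no FP8 dtype.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn
BLOCK = 128           # block edge length, taken from the model card


def blockwise_scales(weight, block=BLOCK):
    """Scale a 2-D weight tile by tile; returns (scaled values, per-block scales)."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "sketch assumes no padding"
    # View the matrix as a grid of (block x block) tiles.
    tiles = weight.reshape(rows // block, block, cols // block, block)
    # One scale per tile: map the tile's largest magnitude onto the FP8 maximum.
    absmax = np.abs(tiles).max(axis=(1, 3), keepdims=True)
    scales = np.maximum(absmax, 1e-12) / FP8_E4M3_MAX
    # Values now lie in [-448, 448]; a real pipeline would cast them to FP8 here.
    scaled = (tiles / scales).reshape(rows, cols)
    return scaled, scales.reshape(rows // block, cols // block)


def dequantize(scaled, scales, block=BLOCK):
    """Undo the scaling by multiplying each tile by its stored scale."""
    rows, cols = scaled.shape
    tiles = scaled.reshape(rows // block, block, cols // block, block)
    return (tiles * scales[:, None, :, None]).reshape(rows, cols)


w = np.random.default_rng(0).standard_normal((256, 384))
scaled, scales = blockwise_scales(w)
print(scales.shape)  # one scale per 128x128 block
```

Because each 128×128 block carries its own scale, an outlier weight only coarsens the precision of its own block rather than the whole matrix, which is the usual motivation for block-wise over per-tensor scaling.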

Memory Comparison

| Variant | Model Loading | T2I Peak | Understanding Peak | Edit Peak |
|---------|---------------|----------|--------------------|-----------|
| BF16    | 62.9 GB       | 35.3 GB  | 33.2 GB            | 41.7 GB   |
| FP8     | 32.5 GB       | 35.3 GB  | 33.3 GB            | 41.8 GB   |

> Note: FP8 roughly halves the static model weight memory (~30 GB saved at load time). Peak inference memory is nearly unchanged because activations dominate during generation.
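
The savings quoted above can be checked directly against the table. The short script below is a sanity check, assuming only the figures listed in the Memory Comparison table; the dictionary layout and the `saving` helper are illustrative names, not part of the release.

```python
# Figures copied from the Memory Comparison table, in GB.
memory_gb = {
    "BF16": {"load": 62.9, "t2i": 35.3, "understanding": 33.2, "edit": 41.7},
    "FP8": {"load": 32.5, "t2i": 35.3, "understanding": 33.3, "edit": 41.8},
}


def saving(stage):
    """Return (GB saved, fraction saved) when moving from BF16 to FP8."""
    bf16 = memory_gb["BF16"][stage]
    fp8 = memory_gb["FP8"][stage]
    return bf16 - fp8, (bf16 - fp8) / bf16


for stage in memory_gb["BF16"]:
    saved_gb, fraction = saving(stage)
    print(f"{stage:>13}: {saved_gb:+.1f} GB ({fraction:+.1%})")
```

The load-time row works out to about 30.4 GB saved, or roughly 48%, which matches the figure in the Overview. The three peak rows differ by 0.1 GB at most.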

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "inclusionAI/LLaDA2.0-Uni-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Text-to-Image Generation
result = model.generate_image(
    "A cat sitting on a windowsill at sunset",
    image_h=1024, image_w=1024,
    steps=16, cfg_scale=4.0,
)

# Decode VQ tokens to image
from decoder import decode_vq_tokens
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"],
    model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output.png")

Model Capabilities

Same as the base LLaDA2.0-Uni model:

🖼️ Text-to-Image Generation
🔍 Image Understanding
✏️ Image Editing
⚡ Sprint Acceleration

⚠️ License

This project is licensed under the terms of the Apache License 2.0.

📖 BibTeX

@article{LLaDA2Uni,
  title = {LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model},
  author = {Tiwei Bie and Haoxing Chen and Tieyuan Chen and Zhenglin Cheng and Long Cui and Kai Gan and Zhicheng Huang and Zhenzhong Lan and Haoquan Li and Jianguo Li and Tao Lin and Qi Qin and Hongjun Wang and Xiaomei Wang and Haoyuan Wu and Yi Xin and Junbo Zhao},
  journal = {arXiv preprint arXiv:2604.20796},
  year = {2026}
}

Notability

notability 5.0/10

Moderate downloads for a quantized model release