ByteDance-Seed/Cola-DLM
Description: The codebase of Cola DLM
Language: Python
License: Apache-2.0
Stars: 219
Forks: 13
Open issues: 2
Created: 2026-05-15T07:09:04Z
Pushed: 2026-06-10T16:57:09Z
Default branch: main
Fork: no
Archived: no
README:
> Cola DLM (Continuous Latent Diffusion Language Model) is the official, HuggingFace-Transformers-compatible open-source release of the paper *Continuous Latent Diffusion Language Model*. Cola DLM is a hierarchical latent-variable language model: a *Text VAE* learns a stable mapping q_phi(z_0 | x) between text and a continuous latent sequence; a *block-causal Diffusion Transformer (DiT)* models the latent prior p_psi(z_0) via Flow Matching; and the *conditional decoder* p_theta(x | z_0) realizes the actual tokens. From a unified Markov-path perspective, the diffusion process performs latent prior transport rather than token-level observation recovery, separating global semantic organization from local textual realization. This repository ships the trained checkpoint together with a no-padding ("NA") flatten-concat inference pipeline that runs natively under HuggingFace Transformers.
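The no-padding flatten-concat idea mentioned above can be illustrated with a minimal sketch. It assumes that "flatten-concat" means packing variable-length sequences into one flat sequence and recording cumulative boundaries, as is common for variable-length attention kernels. The names `flatten_concat`, `unflatten`, and `cu_seqlens` are illustrative and are not taken from the repository, whose helpers may work differently.

```python
from itertools import accumulate


def flatten_concat(sequences):
    """Pack variable-length sequences into one flat list without padding.

    Returns the flat token list and cumulative boundaries, so that
    sequence i occupies flat[cu_seqlens[i]:cu_seqlens[i + 1]].
    Hypothetical helper for illustration; not the repository's API.
    """
    flat = [tok for seq in sequences for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return flat, cu_seqlens


def unflatten(flat, cu_seqlens):
    """Recover the original sequences from the packed representation."""
    return [flat[start:end] for start, end in zip(cu_seqlens, cu_seqlens[1:])]


batch = [[11, 12, 13], [21], [31, 32]]   # three sequences of different length
flat, cu_seqlens = flatten_concat(batch)
restored = unflatten(flat, cu_seqlens)
```

Here six tokens are stored instead of the nine a padded 3x3 batch would need, and the boundaries in `cu_seqlens` are enough to keep sequences from attending to one another.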
---
Paper
- Title: Continuous Latent Diffusion Language Model
- Authors: Hongcan Guo, Qinyu Zhao, Yian Zhao, Shen Nie, Rui Zhu, Qiushan Guo, Feng Wang, Tao Yang, Hengshuang Zhao, Guoqiang Wei, Yan Zeng (ByteDance Seed et al.)
- arXiv: arxiv.org/abs/2605.06548
- Model weights: huggingface.co/ByteDance-Seed/Cola-DLM
- HuggingFace daily paper: huggingface.co/papers/2605.06548
- Project page: hongcanguo.github.io/Cola-DLM
- Blog post: hongcanguo.github.io/posts/2026-cola-dlm.html
- Unified blog post: zhao-yian.github.io/Unified-Cola/blog/2026/unified-cola-en/
- Zhihu article: zhuanlan.zhihu.com/p/2038324180920313704
---
Method at a glance
Figure 1 — Overall workflow of Cola DLM. Stage 1: Text VAE pretraining with reconstruction, BERT-style masking, and a KL regularizer to a base prior. Stage 2: joint Text VAE + block-causal Text DiT training; the DiT learns the latent prior p_psi(z_0) via Flow Matching under the visible set V_b. Inference: prefix encoding q_phi(z_pre | x_pre), block-wise prior transport Phi_psi^(0←1) in latent space, and conditional decoding p_theta(x | z_0) with KV cache.
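The prior transport from t = 1 to t = 0 can be pictured with a toy sketch. It assumes the common linear Flow Matching path z_t = (1 - t) * z_0 + t * eps with noise at t = 1, a plain Euler solver, and an exact velocity field for a single known target in place of the trained DiT. None of these choices is confirmed by the source, and `make_velocity` and `transport` are hypothetical names.

```python
def make_velocity(target):
    """Exact velocity for a toy case where the 'data' is a single point.

    Assumes the linear path z_t = (1 - t) * z_0 + t * eps, which gives
    v(z, t) = (z - z_0) / t. A trained network would replace this.
    """
    def velocity(z, t):
        return [(zi - ci) / t for zi, ci in zip(z, target)]
    return velocity


def transport(z1, velocity, num_steps=8):
    """Euler-integrate dz/dt = v(z, t) from t = 1 (noise) down to t = 0."""
    z = list(z1)
    h = 1.0 / num_steps
    for k in range(num_steps, 0, -1):
        t = k / num_steps                      # current time, from 1 down to h
        v = velocity(z, t)
        z = [zi - h * vi for zi, vi in zip(z, v)]
    return z


target = [0.5, -1.0, 2.0]    # stand-in for a clean latent z_0
noise = [1.3, 0.2, -0.7]     # stand-in for a Gaussian draw at t = 1
z0_hat = transport(noise, make_velocity(target))
```

With this exact toy field the Euler steps shrink the distance to the target by factors (k - 1) / k, so the final step lands on `target` up to rounding. A learned velocity field would only approximate this behavior.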
Cola DLM defines the joint generative distribution as
p(x, z_0) = p_theta(x | z_0) * p_psi(z_0), p(x) = ∫ p_theta(x | z_0) * p_psi(z_0) dz_0,
where q_phi(z_0 | x) is an inference (encoder) model used only at training and prefix-encoding time. The latent is decomposed into B blocks z_0 = (z_0^(1), ..., z_0^(B)) with a block-causal factorization `p_psi(z_0) = p_psi(z_0^(1)) * ∏_{b≥2} p_psi(z_0^(b) | z_0^(<b))`.
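A block-causal dependency structure of this kind is often realized as an attention mask. The sketch below assumes equal-sized blocks and full (bidirectional) visibility inside a block, which the source does not spell out. `block_causal_mask` is a hypothetical name, not the repository's function.

```python
def block_causal_mask(num_positions, block_size):
    """mask[i][j] is True when position i may attend to position j.

    Positions are grouped into consecutive blocks of `block_size`; a position
    sees its own block fully and every earlier block, but no later block.
    """
    block_of = [i // block_size for i in range(num_positions)]
    return [
        [block_of[j] <= block_of[i] for j in range(num_positions)]
        for i in range(num_positions)
    ]


mask = block_causal_mask(num_positions=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Each row pair corresponds to one latent block: block b conditions on blocks 1 through b and never on later ones, mirroring the conditioning on earlier blocks in the factorization.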
Figure 2 — RQ4 headline scaling result. Strictly matched ~2B-parameter setup, unified generative evaluation protocol, scaling curves up to ~2000 EFLOPs across 8 benchmarks plus Task Average. Cola DLM (red) reaches the best final Task Average — and the curve is still rising — with a clear lead on reasoning-heavy MMLU, RACE, Story Cloze, OBQA; SQuAD eventually surpasses AR and approaches LLaDA's strong region. The result is conservative: latent dimension d=16, no extended training, room to scale further.
The scripts/ folder contains a one-click reproduction of the 8-task evaluation pipeline used in the paper's RQ4 scaling comparison:
```bash
# Evaluate all 8 tasks (assumes hf_models/ and generate_task_data/ are populated)
bash scripts/run_benchmark.sh

# Single task, single GPU
TASKS="lambada" NUM_GPUS=1 bash scripts/run_benchmark.sh

# Compute accuracy from evaluation outputs
python scripts/acc_calc.py
```
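When the benchmark is driven from Python rather than a shell, the same invocation can be assembled as a command plus environment. This is a sketch: `benchmark_invocation` is a hypothetical helper, and only the `TASKS` and `NUM_GPUS` variables and the script path come from the commands above. How the script interprets other values is not documented here.

```python
import os
import subprocess


def benchmark_invocation(tasks=None, num_gpus=None, script="scripts/run_benchmark.sh"):
    """Build the command and environment for one benchmark run.

    TASKS and NUM_GPUS are only set when given, so the script's own
    defaults apply otherwise.
    """
    env = dict(os.environ)
    if tasks is not None:
        env["TASKS"] = tasks
    if num_gpus is not None:
        env["NUM_GPUS"] = str(num_gpus)
    return ["bash", script], env


cmd, env = benchmark_invocation(tasks="lambada", num_gpus=1)
# To launch it from the repository root:
# subprocess.run(cmd, env=env, check=True)
```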
Reference accuracy numbers (see [eval_output/accuracy_summary.csv](eval_output/accuracy_summary.csv)):
| Task          | Accuracy (%) |
|---------------|--------------|
| LAMBADA       | 50.80        |
| MMLU          | 19.30        |
| OBQA          | 23.00        |
| HellaSwag     | 10.70        |
| RACE          | 19.60        |
| SIQA          | 28.90        |
| SQuAD         | 30.90        |
| Story Cloze   | 30.77        |
| Tasks Average | 26.75        |
> Note on open-source model and accuracy:
> The released model weights correspond to the 2000 EFLOPs entry on the scaling curve in the paper's RQ4, the largest training-compute checkpoint reported. Because the internal architecture used for evaluation in the paper differs slightly from the open-source HuggingFace Transformers-based implementation in this repository, per-task accuracy numbers may exhibit minor fluctuations, but the overall trend is consistent with the paper. Notably, the Tasks Average (26.75%) measured here is slightly higher than the final average reported in the paper.
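As a quick consistency check, the reported Tasks Average appears to be the unweighted mean of the eight per-task accuracies. The sketch below recomputes it from the reference numbers listed above. The unweighted-mean assumption is ours, and the dictionary is typed in by hand instead of being read from the CSV, whose column layout is not shown here.

```python
# Reference accuracies (%) copied from the numbers reported above.
accuracy = {
    "LAMBADA": 50.80,
    "MMLU": 19.30,
    "OBQA": 23.00,
    "HellaSwag": 10.70,
    "RACE": 19.60,
    "SIQA": 28.90,
    "SQuAD": 30.90,
    "Story Cloze": 30.77,
}

# Unweighted mean over the eight tasks.
task_average = sum(accuracy.values()) / len(accuracy)
print(f"{task_average:.2f}")  # prints 26.75
```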
---
Unified text–image (preliminary)
Figure 3 — Towards unified text–image modeling. Modality-specific VAE encoders/decoders interface with a shared block-causal MMDiT prior over a joint latent state — the same hierarchical latent decomposition extends naturally from text to vision. Left: text-only continuation and image-conditioned text generation (image-to-text). Middle: text-to-image samples from in-house pretraining only (no SFT, no high-quality data curation). Right: schematic of the shared block-causal MMDiT prior. This is intentionally early-stage; comprehensive unified multimodal training is left for future work — see the paper's Discussion for the full set of qualitative samples.
> The released open-source code in this repository covers the text-only Cola DLM pipeline (Text VAE + block-causal DiT prior). Unified text–image training and inference are reported in the paper's Discussion as preliminary experiments and are not included in this release.
---
Project layout
```
cola-dlm/
├── cola_dlm/                        # Importable Python package
│   ├── __init__.py                  # Public API re-exports
│   ├── configuration_cola_dit.py    # ColaDiTConfig — block-causal DiT prior knobs
│   ├── configuration_cola_vae.py    # ColaTextVAEConfig — Text VAE knobs
│   ├── modeling_cola_dit.py         # ColaDiTModel — block-causal DiT prior p_psi(z_0)
│   ├── modeling_cola_vae.py         # ColaTextVAEModel — encoder q_phi + decoder p_theta
│   ├── attention_utils.py           # NA flatten-concat helpers +…
```
Notability: 5.0/10 (new repo from ByteDance, moderate stars)