Repo · ByteDance (Doubao/Seed) · published May 15, 2026

ByteDance-Seed/Cola-DLM

Python


Description: The codebase of Cola DLM

Language: Python

License: Apache-2.0

Stars: 219

Forks: 13

Open issues: 2

Created: 2026-05-15T07:09:04Z

Pushed: 2026-06-10T16:57:09Z

Default branch: main

Fork: no

Archived: no

README:

> Cola DLM (Continuous Latent Diffusion Language Model) is the official, HuggingFace-Transformers-compatible open-source release of the paper *Continuous Latent Diffusion Language Model*. Cola DLM is a hierarchical latent-variable language model: a *Text VAE* learns a stable mapping q_phi(z_0 | x) between text and a continuous latent sequence; a *block-causal Diffusion Transformer (DiT)* models the latent prior p_psi(z_0) via Flow Matching; and the *conditional decoder* p_theta(x | z_0) realizes the actual tokens. From a unified Markov-path perspective, the diffusion process performs latent prior transport rather than token-level observation recovery, separating global semantic organization from local textual realization. This repository ships the trained checkpoint together with a no-padding ("NA") flatten-concat inference pipeline that runs natively under HuggingFace Transformers.

---

Paper

---

Method at a glance

Figure 1 — Overall workflow of Cola DLM. Stage 1: Text VAE pretraining with reconstruction, BERT-style masking, and a KL regularizer to a base prior. Stage 2: joint Text VAE + block-causal Text DiT training; the DiT learns the latent prior p_psi(z_0) via Flow Matching under the visible set V_b. Inference: prefix encoding q_phi(z_pre | x_pre), block-wise prior transport Phi_psi^(0←1) in latent space, and conditional decoding p_theta(x | z_0) with KV cache.
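The prior transport from t=1 to t=0 named in the caption can be illustrated with a toy Euler integration. This is a minimal sketch, not the repository's sampler. It assumes a linear interpolation path z_t = (1 - t) * z_0 + t * eps, a common Flow Matching choice that this excerpt does not confirm, and it replaces the learned DiT velocity with a hypothetical closed-form field `toy_velocity` for a single known target latent. The names `toy_velocity` and `transport` are illustrative only.

```python
import random


def toy_velocity(z, t, target):
    """Hypothetical stand-in for the learned DiT velocity field.

    For the assumed linear path z_t = (1 - t) * z_0 + t * eps with a single
    target z_0, the exact velocity dz_t/dt = eps - z_0 equals (z_t - z_0) / t.
    """
    return [(zi - ci) / t for zi, ci in zip(z, target)]


def transport(z_noise, target, num_steps=8):
    """Euler-integrate the flow ODE backwards from t=1 (noise) to t=0 (latent)."""
    dt = 1.0 / num_steps
    z = list(z_noise)
    for k in range(num_steps):
        t = 1.0 - k * dt  # current time, running from 1 down to dt
        v = toy_velocity(z, t, target)
        z = [zi - dt * vi for zi, vi in zip(z, v)]  # move from t to t - dt
    return z


random.seed(0)
target = [0.5, -1.0, 2.0]  # toy latent z_0 (3 dimensions, for illustration)
noise = [random.gauss(0.0, 1.0) for _ in target]
z0_hat = transport(noise, target)
print(z0_hat)
```

Because the toy field is exact along a straight path, the Euler steps land on the target latent. A learned velocity field would only approximate this, and the number of steps would then trade speed against accuracy.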

Cola DLM defines the joint generative distribution as

p(x, z_0) = p_theta(x | z_0) * p_psi(z_0), p(x) = ∫ p_theta(x | z_0) * p_psi(z_0) dz_0,

where q_phi(z_0 | x) is an inference (encoder) model used only at training and prefix-encoding time. The latent is decomposed into B blocks z_0 = (z_0^(1), ..., z_0^(B)) with a block-causal factorization `p_psi(z_0) = p_psi(z_0^(1)) * ∏_{b≥2} p_psi(z_0^(b) | z_0^(<b))`, so each block is conditioned only on the blocks that precede it.
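One common way to realize a block-causal factorization in a Transformer is an attention mask in which each latent position sees its own block and every earlier block. The sketch below builds such a mask in plain Python. It is an illustration under assumptions: the function name `block_causal_mask`, the fixed block size, and bidirectional attention inside a block are not taken from the repository.

```python
def block_causal_mask(num_blocks, block_size):
    """Return mask[i][j] = True if latent position i may attend to position j.

    Positions in block b see every position in blocks 1..b. This mirrors
    conditioning block b on all earlier blocks while (by assumption) staying
    bidirectional inside a block.
    """
    n = num_blocks * block_size
    return [
        [(j // block_size) <= (i // block_size) for j in range(n)]
        for i in range(n)
    ]


mask = block_causal_mask(num_blocks=3, block_size=2)
for row in mask:
    # "x" marks an allowed attention edge, "." a masked one
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is lower block-triangular: the first block attends only to itself, while the last block attends to the whole sequence.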

Figure 2 — RQ4 headline scaling result. Strictly matched ~2B-parameter setup, unified generative evaluation protocol, scaling curves up to ~2000 EFLOPs across 8 benchmarks plus Task Average. Cola DLM (red) reaches the best final Task Average — and the curve is still rising — with a clear lead on reasoning-heavy MMLU, RACE, Story Cloze, OBQA; SQuAD eventually surpasses AR and approaches LLaDA's strong region. The result is conservative: latent dimension d=16, no extended training, room to scale further.

The scripts/ folder contains a one-click reproduction of the 8-task evaluation pipeline used in the paper's RQ4 scaling comparison:

# Evaluate all 8 tasks (assumes hf_models/ and generate_task_data/ are populated)
bash scripts/run_benchmark.sh

# Single task, single GPU
TASKS="lambada" NUM_GPUS=1 bash scripts/run_benchmark.sh

# Compute accuracy from evaluation outputs
python scripts/acc_calc.py
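For driving the same commands from Python, the environment variables shown above can be set on a subprocess. This is a hedged sketch: `benchmark_command` and `run_benchmark` are hypothetical helpers, not part of the repository, and only the `TASKS` and `NUM_GPUS` variables and the `lambada` task name come from the commands above.

```python
import os
import subprocess


def benchmark_command(tasks=None, num_gpus=None, script="scripts/run_benchmark.sh"):
    """Build the (argv, env) pair for one benchmark run without executing it."""
    env = dict(os.environ)
    if tasks is not None:
        env["TASKS"] = tasks
    if num_gpus is not None:
        env["NUM_GPUS"] = str(num_gpus)  # environment values must be strings
    return ["bash", script], env


def run_benchmark(tasks=None, num_gpus=None):
    """Launch the script; call this from the repository root."""
    argv, env = benchmark_command(tasks, num_gpus)
    subprocess.run(argv, env=env, check=True)


# Equivalent of: TASKS="lambada" NUM_GPUS=1 bash scripts/run_benchmark.sh
argv, env = benchmark_command(tasks="lambada", num_gpus=1)
print(argv, env["TASKS"], env["NUM_GPUS"])
```

Separating command construction from execution keeps the helper easy to inspect before a long GPU job is started.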

Reference accuracy numbers (see [eval_output/accuracy_summary.csv](eval_output/accuracy_summary.csv)):

| Task          | Accuracy (%) |
|---------------|--------------|
| LAMBADA       | 50.80        |
| MMLU          | 19.30        |
| OBQA          | 23.00        |
| HellaSwag     | 10.70        |
| RACE          | 19.60        |
| SIQA          | 28.90        |
| SQuAD         | 30.90        |
| Story Cloze   | 30.77        |
| Tasks Average | 26.75        |

> Note on open-source model and accuracy:
>
> The released model weights correspond to the 2000 EFLOPs entry on the scaling curve in the paper's RQ4 — the largest training-compute checkpoint reported. Because the internal architecture used for evaluation in the paper differs slightly from the open-source HuggingFace Transformers-based implementation in this repository, per-task accuracy numbers may exhibit minor fluctuations, but the overall trend is consistent with the paper. Notably, the Tasks Average (26.75%) measured here is slightly higher than the final average reported in the paper.
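As a quick consistency check, the Tasks Average can be recomputed as the unweighted mean of the eight per-task accuracies listed above. The snippet assumes a plain arithmetic mean, which the README does not state explicitly but which reproduces the reported figure.

```python
# Per-task accuracies (%) copied from the reference table above
accuracy = {
    "LAMBADA": 50.80,
    "MMLU": 19.30,
    "OBQA": 23.00,
    "HellaSwag": 10.70,
    "RACE": 19.60,
    "SIQA": 28.90,
    "SQuAD": 30.90,
    "Story Cloze": 30.77,
}

# Unweighted mean over the 8 tasks
task_average = sum(accuracy.values()) / len(accuracy)
print(f"Tasks Average: {task_average:.2f}%")
```

The mean is 213.97 / 8 = 26.74625, which rounds to the 26.75% shown in the table.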

---

Unified text–image (preliminary)

Figure 3 — Towards unified text–image modeling. Modality-specific VAE encoders/decoders interface with a shared block-causal MMDiT prior over a joint latent state — the same hierarchical latent decomposition extends naturally from text to vision. Left: text-only continuation and image-conditioned text generation (image-to-text). Middle: text-to-image samples from in-house pretraining only (no SFT, no high-quality data curation). Right: schematic of the shared block-causal MMDiT prior. This is intentionally early-stage; comprehensive unified multimodal training is left for future work — see the paper's Discussion for the full set of qualitative samples.

> The released open-source code in this repository covers the text-only Cola DLM pipeline (Text VAE + block-causal DiT prior). Unified text–image training and inference are reported in the paper's Discussion as preliminary experiments and are not included in this release.

---

Project layout

cola-dlm/
├── cola_dlm/ # Importable Python package
│ ├── __init__.py # Public API re-exports
│ ├── configuration_cola_dit.py # ColaDiTConfig — block-causal DiT prior knobs
│ ├── configuration_cola_vae.py # ColaTextVAEConfig — Text VAE knobs
│ ├── modeling_cola_dit.py # ColaDiTModel — block-causal DiT prior p_psi(z_0)
│ ├── modeling_cola_vae.py # ColaTextVAEModel — encoder q_phi + decoder p_theta
│ ├── attention_utils.py # NA flatten-concat helpers +…
