Model · Meituan (LongCat) · published Jan 23, 2026 · seen 5d

meituan-longcat/LongCat-Flash-Thinking-ZigZag

published Jan 23, 2026 · seen 5d · captured 14h · http 200 · method plain · task text-generation · license mit · library LongCat-Flash-Thinking-ZigZag · params 562B · downloads 42 · likes 32

LongCat-Flash-Thinking-ZigZag

Updates

  • [2026.1.28] We have released TileLang kernels that support prefill (including chunked prefill) and decode (including multi-token prediction). The full-attention version is in [flash_mla_interface.py](flash_mla_interface.py), and the streaming sparse attention version is in [streaming_sparse_attn_interface.py](streaming_sparse_attn_interface.py). Basic usage is shown in the following code snippet:
from flash_mla_interface import flash_mla_varlen_func, flash_mla_with_kvcache
from streaming_sparse_attn_interface import streaming_sparse_attn_varlen_func, streaming_sparse_attn_with_kvcache

# Variable-length (packed) inputs: full attention
full_attn_out = flash_mla_varlen_func(
    q,  # [nnz_q, num_heads_q, head_dim_qk]
    k,  # [nnz_k, num_heads_k, head_dim_qk]
    v,  # [nnz_k, num_heads_v, head_dim_vo]
    cu_seqlens_q,
    cu_seqlens_k,
    max_seqlen_q,
    max_seqlen_k,
    softmax_scale,
    causal=True
)
# Variable-length (packed) inputs: streaming sparse attention, same arguments
stream_attn_out = streaming_sparse_attn_varlen_func(
    q,
    k,
    v,
    cu_seqlens_q,
    cu_seqlens_k,
    max_seqlen_q,
    max_seqlen_k,
    softmax_scale,
    causal=True
)

# Paged KV cache inputs: full attention
full_attn_out = flash_mla_with_kvcache(
    q,  # [batch_size, seqlen_q, num_heads_q, head_dim_nope + head_dim_rope]
    blocked_k,  # [num_pages, page_size, num_heads_k, head_dim_nope + head_dim_rope]
    cache_seqlens,
    block_table,
    head_dim_nope,
    softmax_scale,
    causal=True
)
# Paged KV cache inputs: streaming sparse attention, same arguments
stream_attn_out = streaming_sparse_attn_with_kvcache(
    q,
    blocked_k,
    cache_seqlens,
    block_table,
    head_dim_nope,
    softmax_scale,
    causal=True
)
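The snippet above leaves the bookkeeping arguments (`cu_seqlens_*`, `max_seqlen_*`, `block_table`) undefined. The sketch below shows one plausible way to build them in plain Python. It assumes the kernels follow the common FlashAttention-style conventions: `cu_seqlens` is a cumulative length array with a leading 0, and `block_table` maps each sequence to its page ids. The helper names `build_cu_seqlens` and `build_block_table`, the `-1` padding value, and the consecutive page allocation are illustrative assumptions, not part of the released interface; the real kernels most likely expect integer device tensors rather than lists.

```python
from itertools import accumulate


def build_cu_seqlens(seqlens):
    """Cumulative sequence lengths with a leading 0: [0, l0, l0 + l1, ...].

    Assumption: the varlen kernels use the FlashAttention-style layout,
    where sequence i occupies rows cu_seqlens[i]:cu_seqlens[i + 1].
    """
    return [0, *accumulate(seqlens)]


def build_block_table(cache_seqlens, page_size):
    """Assign consecutive page ids to each sequence (hypothetical allocator).

    Rows are padded with -1 so the table is rectangular; the padding value
    is an assumption, since the source does not specify one.
    """
    pages_per_seq = [-(-n // page_size) for n in cache_seqlens]  # ceil division
    width = max(pages_per_seq, default=0)
    table, next_page = [], 0
    for n_pages in pages_per_seq:
        row = list(range(next_page, next_page + n_pages))
        table.append(row + [-1] * (width - n_pages))
        next_page += n_pages
    return table, next_page  # next_page equals the total number of pages used


# Packed batch of three sequences with 3, 5, and 2 tokens.
seqlens = [3, 5, 2]
cu_seqlens = build_cu_seqlens(seqlens)
max_seqlen = max(seqlens)

# Three cached sequences stored in pages of 64 tokens each.
block_table, num_pages = build_block_table(cache_seqlens=[130, 64, 1], page_size=64)
```

With these inputs, `q` in the packed layout would have `cu_seqlens[-1]` rows (`nnz_q`), and `blocked_k` would need at least `num_pages` pages.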

Benchmark performance is shown below:

Model Introduction

Alongside LongCat-Flash-Thinking-2601, we introduce an efficient alternative, LongCat-Flash-Thinking-ZigZag. It is identical to LongCat-Flash-Thinking-2601 except that it is further enhanced by LongCat ZigZag Attention (LoZA). LoZA is a sparse attention scheme designed to transform any existing full-attention model into a sparse version with a rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups in both prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, LongCat-Flash-Thinking-ZigZag diverges from LongCat-Flash-Thinking-2601 during mid-training, where LoZA is applied. We offer it as a long-context foundation model that can swiftly process long token sequences, enabling efficient long-term reasoning and long-horizon agentic capabilities.

Key Features

🧮 Limited Compute Overhead

LongCat ZigZag Attention (LoZA) first identifies the layers that can be sparsified without hurting performance much, then sparsifies those layers and trains them further to close the remaining performance gap. The whole process closely resembles what the *lottery ticket hypothesis* describes. In theory, a mid-trained LM is sequentially sparsified, rewound, and mid-trained again to recover as much of the full performance as possible. In other words, calibration starts at the end of mid-training while training restarts from the beginning of mid-training, which results in rather marginal compute overhead compared with training from scratch.
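The layer-selection step can be illustrated with a small sketch. It assumes that calibration yields one sensitivity score per layer (higher means sparsifying that layer hurts more) and that sparsity is counted as the fraction of layers converted. Both are assumptions made for illustration: the text above does not say how LoZA measures sensitivity, and `pick_layers_to_sparsify` and the scores below are hypothetical.

```python
def pick_layers_to_sparsify(sensitivity, sparsity=0.5):
    """Return the sorted indices of the least sensitive layers.

    sensitivity: one hypothetical calibration score per layer.
    sparsity: fraction of layers to convert to sparse attention (assumed
    to be a per-layer fraction; the source does not define it precisely).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_sparse = int(len(sensitivity) * sparsity)
    # Rank layers from least to most sensitive, then keep the first n_sparse.
    ranked = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i])
    return sorted(ranked[:n_sparse])


# Made-up scores for a 6-layer model, for illustration only.
scores = [0.9, 0.1, 0.7, 0.2, 0.8, 0.05]
sparse_layers = pick_layers_to_sparsify(scores, sparsity=0.5)
```

In the described process, the selected layers would then be switched to sparse attention, the weights rewound to the start of mid-training, and mid-training rerun. That training loop is not sketched here.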

📈 Efficient Context Scaling

LoZA enables 50% sparsity in LongCat-Flash-Thinking-ZigZag, so the compute spent on attention should ideally be halved. In long-context settings where attention dominates the compute, efficiency could therefore improve by up to 2x. Thanks to our kernel and engine customizations, as shown in the figure below, the streaming sparse attention kernel can use up to 90% less cost in decode than the full attention kernel (i.e., FlashMLA) for a context of 128K tokens. Meanwhile, in end-to-end benchmarking, LongCat-Flash-Thinking-ZigZag achieves more than a 50% speed-up in prefill and saves over 30% of the cost in decode for a context of 256K tokens.
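The 2x ceiling follows from a simple Amdahl-style estimate: if attention accounts for a fraction f of total compute and its cost is halved, the overall speed-up is 1 / ((1 - f) + f / 2). The sketch below works through that arithmetic. The function name and the sample fractions are illustrative assumptions, and the model ignores kernel-level effects such as memory bandwidth, so it is not a prediction of the measured figures.

```python
def end_to_end_speedup(attn_fraction, attn_reduction=2.0):
    """Idealized overall speed-up when only attention gets cheaper.

    attn_fraction: share of total compute spent in attention, in [0, 1].
    attn_reduction: factor by which attention compute shrinks
    (2.0 corresponds to the ideal case for 50% sparsity).
    """
    if not 0.0 <= attn_fraction <= 1.0:
        raise ValueError("attn_fraction must be in [0, 1]")
    if attn_reduction <= 0:
        raise ValueError("attn_reduction must be positive")
    remaining = (1.0 - attn_fraction) + attn_fraction / attn_reduction
    return 1.0 / remaining


# The speed-up approaches 2x only as attention comes to dominate the compute.
for f in (0.25, 0.5, 0.9, 1.0):
    print(f"attention share {f:.2f} -> speed-up {end_to_end_speedup(f):.2f}x")
```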

🔝 Competitive Benchmark Performance

LoZA does not compromise quality for speed. On the benchmarks considered, LongCat-Flash-Thinking-ZigZag exhibits performance competitive with LongCat-Flash-Thinking-2601. It is also on par with other competitors such as DeepSeek-V3.2. Considerable cost savings are also achieved across a diverse range of benchmarks, as shown in the figure below.

Evaluation Results

| Benchmark | DeepSeek-V3.2-Thinking | Kimi-K2-Thinking | Qwen3-235B-A22B-Thinking-2507 | GLM-4.7-Thinking | Claude-Opus-4.5-Thinking | Gemini-3-Pro | GPT-5.2-Thinking-xhigh | LongCat-Flash-Thinking-2601 | LongCat-Flash-Thinking-ZigZag |
|---------------|------------------------|------------------|-------------------------------|------------------|---------------------------|--------------|------------------------|------------------------------|------------------------------|
| Architecture | MoE | MoE | MoE | MoE | - | - | - | MoE | MoE |
| Sparse Attention | ✅ | ❌ | ❌ | ❌ | - | - | - | ❌ | ✅ |
| # Total Params | 671B | 1T | 235B | 355B | - | - | - | 560B | |
| # Activated Params | 37B | 32B | 22B | 32B | - | - | - | 27B | |
| Mathematical Reasoning w/ Tools | | | | | | | | | |
| AIME-25 (Avg@16) | 93.5* | 99.1† | 92.6* | 95.3* | 100.0 | 99.8 | 100.0 | 99.6 / 100.0‡ | 99.2 / - |
| HMMT-25 (Avg@16) | 93.5* | 95.1† | 83.9* | 98.1* | 98.6 | 99.8 | 99.6 | 93.4 / 97.5‡ | 93.5 / - |
| AMO-Bench EN (Avg@16) | 51.9* | 56.0* | 47.8* | 62.4* | 66.0 | 72.5 | - | 61.6 / 66.0‡ | 60.4 / - |
| AMO-Bench CH (Avg@16) | 52.0* | 51.8* | 28.8* | 35.1* | 67.7 | 74.9 | - | 56.8 / 67.5‡ | 58.3 / - |
| Agentic Search | | | | | | | | | |
| BrowseComp (Pass@1) | 51.4 / 67.6† | - / 60.2† | - | 52.0 / 67.5† | - | - | 65.8 / - | 56.6 / 73.1 | 55.2 / - |
| BrowseComp-zh (Pass@1) | 65.0 / - | - / 62.3† | - | 66.6 / - | - | - | - | 69.0 / 77.7 | 71.9 / - |
| Agentic Tool Using | | | | | | | | | |
| τ²-Retail (Avg@4) | 81.8† | - | 71.9† | - | 88.9† | - | 82.0† | 88.6 | 86.8 |
| τ²-Airline (Avg@4) | 63.8† | - | 58.6† | - | - | - | - | 76.5 | 76.5 |
| τ²-Telecom (Avg@4) | 96.2† | - | 47.3 | - | 98.2† | - | 98.7† | 99.3 | 97.4 |
| τ²-Avg (Avg@4) | 80.6 | 74.3† | 59.3 | 87.4† | 82.4 | 90.7† | 80.6 | 88.2 | 86.9 |
| General QA | | | | | | | | | |
| HLE text-only (w/o tools) | 24.1 | 24.4 | 17.8 | 26.9 | 32.0 | 40.3 | 34.5† | 25.2 | 25.8 |
| GPQA-Diamond (Avg@16) | 86.9 | 85.4 | 80.5 | 84.9 | 86.9 | 91.9 | 92.9 | 80.5 / 85.2‡ | 80.6 / - |
| Coding | | | | | | | | | |…


Notability

notability 2.0/10

Very low traction (51 downloads), routine release