Serving DeepSeek-V4: why million-token context is an inference systems problem

Published 5/11/2026

Authors

Jue Wang, Dan Fu, Alex Angus, Yineng Zhang, Michael Granado, Sonny Khan

Benchmark tables miss the main point of DeepSeek-V4: the important change is architectural. V4 turns million-token context into a serving-systems problem.

The model supports a 1M-token context window through a hybrid attention design that compresses context before key-value (KV) storage, mixes compressed and local attention paths, and changes how prefix reuse works. Those choices reduce KV pressure, but the savings only matter if the inference engine can manage the resulting cache layouts, recover local state, batch requests effectively, and choose endpoint profiles that match the workload.

This post focuses on the serving implications of V4's attention design, which combines Compressed Sparse Attention (CSA), Heavily Compressed Attention (HCA), and Sliding Window Attention (SWA), based on Together's early bring-up work on NVIDIA HGX B200. V4 also includes other architecture and training changes, including Manifold-Constrained Hyper-Connections (mHC) residual connections and Muon optimizer choices, but those are outside the main scope here.

V4 compresses the token axis of KV cache

Autoregressive inference stores prior context in KV cache. During decode, each newly generated token attends to that stored state. The cache grows with sequence length:

KV cache ∝ layers × tokens × kv_heads × head_dim × bytes

At long context, KV cache hits serving twice. It caps concurrency because each active request occupies memory, and it lowers throughput because decode has to read stored context every step. V4 matters because it attacks both sides of that problem: fewer cache entries to store and fewer cache entries to move through attention.

On NVIDIA Blackwell, that cache pressure maps directly to serving economics. Long-context inference depends on keeping enough KV resident for concurrency while preserving memory bandwidth for decode. V4's token-axis compression makes that tradeoff more favorable: the engine has more room to batch requests, reuse prefixes, and keep long-context workloads inside an efficient serving regime.

Recent model architectures have reduced different terms in the product above. Grouped-Query Attention (GQA) reduces KV heads. Multi-Head Latent Attention (MLA) compresses KV into a latent representation. FP8, MXFP4, and NVFP4 reduce bytes per element. DeepSeek-V3.2 sparse attention reduced how much KV had to be read during decode, while the full cache still had to stay resident.

DeepSeek-V4 targets the token axis. It compresses context before KV storage. That is the important shift.

Under a vanilla BF16 multi-head attention calculation, a 70B-class model can require megabytes of KV cache per token. The exact coefficient depends on layer count, KV heads, head dimension, and precision. At 1M tokens, the cache becomes impractical for a single request. V4's token-axis compression, combined with MLA-style head compression and low-precision KV, reduces the per-request cache footprint enough to make very long context materially more practical.

In early bring-up, V4's serving capacity was governed less by the compressed CSA / HCA cache and more by how the engine handled SWA state. A full-SWA implementation actually had a higher per-token KV footprint than our V3 path (roughly 3.8 KB per token versus 3.4 KB) because the engine was storing the full sliding-window state. The practical gain came from cache policy. By keeping only the SWA states most likely to be reused, we increased total KV-cache capacity on a single NVIDIA HGX B200 node from roughly 1.2M tokens to 3.7M tokens with minimal changes.

That is the main serving lesson: V4's architecture creates the opportunity for long-context efficiency, but the realized capacity depends on how the inference engine stores, recomputes, and evicts the different cache types.

The practical win extends beyond full 1M-token requests. It makes 200K–500K-token workloads more concurrent and less fragile, because the engine has more KV budget to work with before memory pressure forces eviction or limits batching. Earlier million-token-context models still left major serving challenges around memory, concurrency, and cost. V4 moves that range closer to an actual workload when the serving policy matches the cache layout.
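To make the KV-cache proportionality in this section concrete, here is a minimal Python sketch of the per-token estimate. The model shape (80 layers, 64 heads, head dimension 128) is a hypothetical 70B-class configuration chosen for illustration, not a figure from the post, and the factor of 2 for separate K and V tensors is an assumption that the proportionality itself leaves out. Only the 1.2M and 3.7M token capacities are taken from the post.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem, kv_tensors=2):
    """Bytes of KV cache stored per token.

    kv_tensors=2 assumes separate K and V tensors per layer (an assumption;
    latent-compressed designs such as MLA store something different).
    """
    return layers * kv_heads * head_dim * bytes_per_elem * kv_tensors


def kv_bytes_total(tokens, **cfg):
    """Total KV cache bytes for one request of `tokens` tokens."""
    return tokens * kv_bytes_per_token(**cfg)


# Hypothetical 70B-class shapes (illustrative only, not from the post).
mha_bf16 = dict(layers=80, kv_heads=64, head_dim=128, bytes_per_elem=2)
gqa_bf16 = dict(mha_bf16, kv_heads=8)        # GQA: fewer KV heads
gqa_fp8 = dict(gqa_bf16, bytes_per_elem=1)   # FP8: fewer bytes per element

MIB, GIB = 1024**2, 1024**3
for name, cfg in [("MHA BF16", mha_bf16), ("GQA BF16", gqa_bf16), ("GQA FP8", gqa_fp8)]:
    per_token = kv_bytes_per_token(**cfg)
    total = kv_bytes_total(1_000_000, **cfg)
    print(f"{name}: {per_token / MIB:.3f} MiB/token, {total / GIB:.0f} GiB at 1M tokens")

# Capacity figures reported in the post for one HGX B200 node.
capacity_gain = 3.7e6 / 1.2e6
print(f"SWA cache-policy capacity gain: {capacity_gain:.2f}x")
```

Under these assumed shapes, reducing KV heads or bytes per element shrinks the per-token coefficient, but the total still scales linearly with tokens. That linear term is the one V4's token-axis compression targets.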

V4 requires multiple KV-cache layouts

Many existing serving paths assume something close to a single KV cache layout: one cache object per layer per token, the same shape across the stack. V4 requires three different cache types, mixed across layers.

Compressed Sparse Attention (CSA) compresses context with stride 4, but each compressed entry is built from a slightly wider receptive field. In V4's configuration, each entry summarizes an 8-token neighborhood, so adjacent compressed entries overlap at the boundaries. When a query selects 128 compressed entries, it is selecting summaries of local neighborhoods rather than isolated token positions. That gives CSA a finer-grained path into selected regions of the million-token prefix while still reducing the stored cache footprint.
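The overlap between CSA entries can be sketched with simple index arithmetic. The stride (4), neighborhood size (8), and selection count (128) come from the post. The alignment rule, where entry i starts at token i × stride, is an assumption for illustration, and the function names are hypothetical.

```python
STRIDE = 4    # from the post: CSA compresses with stride 4
WINDOW = 8    # from the post: each entry summarizes an 8-token neighborhood
TOP_K = 128   # from the post: a query selects 128 compressed entries


def csa_entry_span(i, stride=STRIDE, window=WINDOW):
    """Half-open token range [start, end) summarized by compressed entry i.

    Assumption: entry i starts at token i * stride. The real alignment in
    V4 may differ (for example, centered or causally shifted windows).
    """
    start = i * stride
    return start, start + window


def tokens_covered(entry_ids, stride=STRIDE, window=WINDOW):
    """Number of distinct token positions summarized by the selected entries."""
    covered = set()
    for i in entry_ids:
        start, end = csa_entry_span(i, stride, window)
        covered.update(range(start, end))
    return len(covered)


# Adjacent entries share WINDOW - STRIDE = 4 tokens under this alignment.
print(csa_entry_span(0), csa_entry_span(1))

# 128 adjacent entries overlap heavily; 128 spread-out entries do not.
print(tokens_covered(range(TOP_K)))         # contiguous selection
print(tokens_covered(range(0, 2 * TOP_K, 2)))  # every other entry
```

Under this assumed alignment, a top-128 selection reaches at most 128 × 8 = 1024 distinct token positions, and fewer when the selected entries are adjacent.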

Heavily Compressed Attention (HCA) uses the same compression idea but with stride 128. At a 1M-token context length, that reduces the cache from 1M token positions to roughly 8K compressed entries. That is the key difference from CSA: the compressed cache is small enough that the model can attend over it densely instead of selecting a top-k subset. HCA gives the model a coarse global read over the whole context, while CSA gives it a finer sparse read over selected regions.
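The difference between the two compressed paths comes down to how many entries are stored versus how many are read per decode step. The sketch below uses the strides and top-k value from the post. Rounding up a partial tail block and the helper names are assumptions.

```python
import math


def compressed_entries(tokens, stride):
    """Entries stored after token-axis compression.

    Assumption: a partial trailing block still produces one entry (ceil).
    """
    return math.ceil(tokens / stride)


def entries_read_per_step(tokens, stride, top_k=None):
    """Entries one decode step attends over: top_k if sparse, all if dense."""
    stored = compressed_entries(tokens, stride)
    return stored if top_k is None else min(top_k, stored)


context = 1_000_000
for name, stride, top_k in [("CSA", 4, 128), ("HCA", 128, None)]:
    stored = compressed_entries(context, stride)
    read = entries_read_per_step(context, stride, top_k)
    print(f"{name}: {stored} entries stored, {read} read per decode step")
```

With a context of exactly 2^20 tokens, the stride-128 path stores 8192 entries, which matches the post's "roughly 8K" figure. The stride-4 path stores far more entries, which is why it relies on sparse selection rather than a dense read.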

Sliding Window Attention (SWA) preserves the local path. The window is short, around…
