Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving
Research
Published 3/4/2026
Authors
Jiejing Zhang, Yubo Wang, Yinghui Liu, Mourya Vangala Srinivasa, Chenxi Li, Jue Wang, Yineng Zhang, Shuaiwen Leon Song, Ce Zhang
Summary
Serving long prompts doesn’t have to mean slow responses. At Together AI, we built cache-aware prefill–decode disaggregation (CPD), a serving architecture that deliberately separates cold and warm workloads by cache hit rate so that reusable context is served quickly. By isolating heavy prefills and leveraging a distributed KV cache, CPD delivers up to 40% higher sustainable throughput and significantly lower time-to-first-token (TTFT) for long-context inference, especially under mixed, real-world traffic.
Today's AI-native applications are pushing context lengths to new limits. From multi-turn conversations and coding copilots to agent memory and retrieval-augmented systems, long prompts are becoming the norm. But serving these large contexts efficiently remains a challenge: TTFT rises and becomes more variable. Inference performance is increasingly shaped not just by model compute, but by how efficiently systems handle shared context.
In real-world workloads, many requests are not entirely new. Some contain large portions of context that have been seen before, e.g., shared system prompts, conversation history, or common documents. We refer to these as warm requests. Others introduce mostly new context and require full computation; these are cold requests.
Recent advances like prefix caching and prefill–decode disaggregation (PD) have already improved long-context serving. Prefix caching reduces redundant work by reusing the previously computed KV cache of prompt prefixes, while PD separates the compute-heavy prefill stage from latency-sensitive decoding to reduce interference between them. Together with associated techniques such as chunked prefill and sequence/context parallelism, they help lower overhead and improve hardware utilization.
However, real-world workloads under very high load pose new challenges beyond the common serving scenarios. Consider a system where some concurrent users submit large, entirely new prompts of over 100K tokens while others continue multi-turn conversations that mostly reuse earlier context. PD ensures decoding is not blocked by prefill, but all prefills, both cold and warm, still share the same prefill capacity. The large cold prompts occupy those resources for seconds at a time, and warm requests that could have been served quickly through cache reuse end up waiting in the same queue.
As a result, TTFT increases not because these requests need heavy computation, but because they are stuck behind the requests that do.
To address this gap, we built a cache-aware disaggregated serving architecture that handles warm and cold requests with separate compute resources. By identifying how much reusable context a request contains, the system can make smarter scheduling decisions, reducing unnecessary waiting and routing work more effectively across compute resources. Instead of letting expensive cold prefills dominate shared capacity, the system keeps a fast path open for warm requests while still processing new context efficiently.
This design lets the system scale more gracefully under load. As shown in Figure 1, under a tail-sensitive SLA, it consistently sustains higher achievable throughput than conventional baselines. In our evaluation, CPD improves sustainable QPS by up to 35–40% over existing disaggregated designs, while maintaining tighter tail latency bounds even in the presence of large cold prompts.
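The queueing effect described above can be illustrated with a toy simulation. This is a minimal sketch, not the system from the post: the service times (4.0 s for a cold prefill, 0.2 s for a warm one), the arrival pattern, and the worker counts are all made-up assumptions, and `simulate_fifo` and `worst` are hypothetical helpers. It compares one shared FIFO prefill pool against the same number of workers split into a cold pool and a warm pool.

```python
import heapq

COLD_SERVICE = 4.0  # seconds per cold prefill (assumed value for illustration)
WARM_SERVICE = 0.2  # seconds per warm prefill (assumed value for illustration)


def simulate_fifo(requests, num_workers):
    """Serve (arrival, service, kind) requests in arrival order on a worker pool.

    Returns a list of (kind, latency), where latency is queueing plus service time.
    """
    free_at = [0.0] * num_workers  # min-heap of times at which each worker is free
    heapq.heapify(free_at)
    latencies = []
    for arrival, service, kind in sorted(requests):
        worker_free = heapq.heappop(free_at)
        start = max(arrival, worker_free)  # wait if every worker is still busy
        finish = start + service
        heapq.heappush(free_at, finish)
        latencies.append((kind, finish - arrival))
    return latencies


def worst(latencies, kind):
    """Worst-case latency among requests of the given kind."""
    return max(lat for k, lat in latencies if k == kind)


# Two large cold prompts arrive first, then a stream of warm requests.
cold = [(0.0, COLD_SERVICE, "cold"), (0.05, COLD_SERVICE, "cold")]
warm = [(0.1 + 0.5 * i, WARM_SERVICE, "warm") for i in range(4)]

shared = simulate_fifo(cold + warm, num_workers=2)  # one pool for everything
split = simulate_fifo(cold, num_workers=1) + simulate_fifo(warm, num_workers=1)

print(f"shared pool: worst warm latency {worst(shared, 'warm'):.2f}s")
print(f"split pools: worst warm latency {worst(split, 'warm'):.2f}s")
print(f"split pools: worst cold latency {worst(split, 'cold'):.2f}s")
```

In this toy setup the warm requests wait about four seconds in the shared pool and about 0.2 seconds once they have their own worker, while the second cold request waits longer after the split. The sketch only illustrates head-of-line blocking; it says nothing about the throughput numbers reported in the post, which depend on how capacity is divided and on the KV-cache reuse that the real system exploits.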
Figure 1. Maximum achievable QPS under latency SLOs.
How CPD works
We propose cache-aware prefill–decode disaggregation (CPD), which extends standard prefill–decode disaggregated serving with cache-aware routing and a shared KV-cache hierarchy. The key idea is simple: don't let expensive "cold" prefills block the fast path for reusable context. The system separates inference into three roles:
Pre-prefill nodes: handle low-reuse (cold) prompts, compute new context, and write KV cache into a distributed cache.
Prefill nodes: prioritize high-reuse (warm) requests, reading KV blocks from cache instead of recomputing prefixes.
Decode nodes: remain latency-focused and isolated from prefill interference.
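The cache-aware routing between these roles can be sketched as follows. The post does not say how reuse is estimated or where the cold/warm cutoff sits, so everything here is an assumption: chained block hashing is a common prefix-caching technique swapped in for illustration, and `BLOCK_SIZE`, `REUSE_THRESHOLD`, `prefix_block_hashes`, `estimate_reuse`, and `route` are hypothetical names and values.

```python
import hashlib

BLOCK_SIZE = 4         # tokens per KV block (assumed; kept tiny for the demo)
REUSE_THRESHOLD = 0.5  # assumed cutoff between cold and warm requests


def prefix_block_hashes(tokens):
    """Hash each full block together with all blocks before it.

    Each hash therefore identifies the entire prefix ending at that block.
    """
    hashes = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore the partial last block
    for start in range(0, full, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        running.update(",".join(map(str, block)).encode() + b";")
        hashes.append(running.hexdigest())
    return hashes


def estimate_reuse(tokens, cached_hashes):
    """Fraction of prompt tokens covered by the longest cached prefix."""
    if not tokens:
        return 0.0
    matched_blocks = 0
    for block_hash in prefix_block_hashes(tokens):
        if block_hash not in cached_hashes:
            break  # a prefix match ends at the first missing block
        matched_blocks += 1
    return matched_blocks * BLOCK_SIZE / len(tokens)


def route(tokens, cached_hashes):
    """Send low-reuse prompts to pre-prefill and high-reuse prompts to prefill."""
    if estimate_reuse(tokens, cached_hashes) >= REUSE_THRESHOLD:
        return "prefill"
    return "pre-prefill"


# A shared system prompt is already cached; a brand-new document is not.
system_prompt = list(range(16))
cache_index = set(prefix_block_hashes(system_prompt))

warm_request = system_prompt + [100, 101, 102]
cold_request = list(range(200, 220))

print(route(warm_request, cache_index))
print(route(cold_request, cache_index))

# Once a pre-prefill node has written the cold prompt's KV blocks,
# a follow-up request over the same context becomes warm.
cache_index.update(prefix_block_hashes(cold_request))
print(route(cold_request + [1], cache_index))
```

A single fixed threshold is the simplest possible policy. A production router would plausibly also weigh prompt length and current queue depth, but the post excerpt does not describe that level of detail.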
Prefill and decode are already disaggregated, but CPD adds a dedicated pre-prefill tier that handles requests with little or no cache reuse. These nodes compute large new contexts and write their KV cache into a distributed cache. Meanwhile, normal prefill nodes focus on requests that can reuse existing state, reading KV blocks from the cache instead of recomputing them. Decode nodes remain isolated and latency-focused.
Under the hood, CPD relies on a three-level KV-cache hierarchy, as depicted in Figure 2. The fastest layer lives in GPU memory, followed by host DRAM and then a cluster-wide distributed cache connected via RDMA. When a cold request is processed by a pre-prefill node, its KV state is written to the distributed cache. Subsequent similar requests can fetch this state in bulk at high bandwidth, turning what would have been seconds of compute into hundreds of milliseconds of transfer and light recomputation. Over time, frequently accessed contexts naturally move closer to the GPU, further shrinking latency.
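The three-level lookup described above can be sketched as a toy in-memory model. This is not the post's implementation: the class `TieredKVCache`, its method names, the LRU eviction, and the promote-one-tier-per-hit policy are all assumptions chosen to show how a frequently accessed context could move closer to the GPU. Plain dictionaries stand in for GPU memory, host DRAM, and the RDMA-connected distributed cache.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier lookup: GPU memory, then host DRAM, then distributed cache."""

    def __init__(self, gpu_capacity, host_capacity):
        self.gpu = OrderedDict()   # fastest and smallest tier (LRU order)
        self.host = OrderedDict()  # larger and slower tier (LRU order)
        self.distributed = {}      # cluster-wide tier, treated as unbounded here
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity

    def write_from_pre_prefill(self, block_id, kv):
        """A pre-prefill node publishes freshly computed KV state."""
        self.distributed[block_id] = kv

    @staticmethod
    def _insert_lru(tier, capacity, block_id, kv):
        """Insert into a tier and return the evicted (block_id, kv), if any."""
        tier[block_id] = kv
        tier.move_to_end(block_id)
        if len(tier) > capacity:
            return tier.popitem(last=False)  # evict the least recently used entry
        return None

    def get(self, block_id):
        """Return (kv, tier_name); a hit promotes the block one tier upward."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id], "gpu"
        if block_id in self.host:
            kv = self.host.pop(block_id)
            evicted = self._insert_lru(self.gpu, self.gpu_capacity, block_id, kv)
            if evicted is not None:
                # Demote the block pushed out of GPU memory to host DRAM.
                self._insert_lru(self.host, self.host_capacity, *evicted)
            return kv, "host"
        if block_id in self.distributed:
            kv = self.distributed[block_id]
            # Host evictions are simply dropped; the distributed copy remains.
            self._insert_lru(self.host, self.host_capacity, block_id, kv)
            return kv, "distributed"
        return None, "miss"


cache = TieredKVCache(gpu_capacity=1, host_capacity=2)
cache.write_from_pre_prefill("doc-A", "kv-A")

# Repeated access moves the block from the distributed tier toward the GPU.
tiers_hit = [cache.get("doc-A")[1] for _ in range(3)]
print(tiers_hit)
print(cache.get("unknown-doc"))
```

Promoting one tier per hit keeps rarely used blocks out of GPU memory; a real system could just as well promote straight to the GPU or use a different eviction policy.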
Figure 2. System overview
The router ties this together. For each request, it estimates how much of the prompt can be served from cache. Requests with low reuse are steered to pre-prefill nodes, while high-reuse requests go directly to normal prefill nodes. This workload separation prevents large cold prefills from saturating shared compute,…