DigitalOcean (GradientAI), published Jun 2, 2026

The Inference Tax: How Prefix-Aware Routing Eliminates the Hidden Cost of LLMs at Scale

By Piyush Srivastava and Simon Mo, CEO of Inferact

Updated: June 2, 2026 | 13 min read

Introduction

Inference demand is growing fast, and it’s only accelerating. By 2030, inference is expected to account for the majority of AI compute globally. But scaling inference isn’t just a hardware problem. Most teams discover too late that a significant portion of their compute spend is avoidable, primarily because their systems are silently repeating work they have already done, recomputing the same prompt prefixes and system instructions over and over again.

We’ve seen this from two vantage points. From the infrastructure layer, the cost curve becomes visible at scale with clusters that look busy but aren’t efficiently utilized. From the engine layer, the picture is just as clear. Without the right caching and scheduling primitives, even a well-optimized model wastes cycles on redundant computation. The root cause is the same regardless of where you’re standing. The system lacks the memory and coordination to recognize when it’s already done the hard part.

Fixing this requires work at every layer of the stack. DigitalOcean has invested in GPU optimization across multiple fronts, from vLLM parallelism and quantization tuning to hardware-level kernel work. But one technique has had an outsized impact on cost efficiency at scale: prefix-aware routing and caching. In this post, we walk through how vLLM enables advanced prefix caching, how DigitalOcean’s inference gateway uses prefix awareness to make smarter routing decisions, and how we plan to make this available to everyone on Serverless Inference in the coming weeks.

The Cost Cliff and the Hidden Culprit

Inference now accounts for roughly 70% of total AI compute costs. For most teams, a significant share of that is avoidable. It’s not due to hardware limits. Instead, it’s because the system keeps recomputing work it has already done, also known as redundant prefill.

Every LLM inference request has two distinct computational phases. The first phase is prefill, where the model processes the entire input sequence and builds the KV (key-value) cache that represents its state. The second phase is decode, where the model generates output tokens one at a time, attending back to that cached state. Prefill is where the structural inefficiency hides. The attention portion of its computation scales quadratically with input length: doubling the input quadruples the attention computation.
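As a rough illustration of that quadratic relationship, consider only the attention score computation. The symbols here are notation introduced for this sketch, not taken from the post: $n$ is the number of input tokens and $d$ is the per-head dimension.

```latex
C_{\text{attn}}(n) \propto n^{2} d
\quad\Longrightarrow\quad
\frac{C_{\text{attn}}(2n)}{C_{\text{attn}}(n)} = \frac{(2n)^{2} d}{n^{2} d} = 4
```

The projection and feed-forward layers grow only linearly in $n$, so this quadratic term matters most for long inputs.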

Consider a real-world customer support workload running on NVIDIA H200 or AMD Instinct™ MI325X GPUs. A typical deployment carries a 2,000-token system prompt (defining persona, policies, response format) that is identical across every request. With an average user message of 200 tokens, roughly 91% of every input is shared context.

On AMD Instinct™ MI325X or NVIDIA H200 GPUs, prefilling 2,000 tokens takes approximately 45–50 ms and costs in the range of 100–120 GFLOPs per request. At 10,000 requests per hour, that adds up to roughly 1 quadrillion (10^15) redundant FLOPs per hour: compute spent reconstructing state that the system has already built, discarded, and is now rebuilding from scratch.

The pattern is even more pronounced in coding assistants or document Q&A tools, where the same API documentation or reference material is prepended to nearly every request. A 5,000-token shared context costs roughly 600 GFLOPs to prefill, which is about 25× the cost of a 1,000-token prefix because of that quadratic relationship. When hundreds of users are querying the same underlying documents, the redundant computation compounds rapidly.
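The figures in the customer support example can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. The function names are hypothetical, and the pure quadratic cost model is a simplifying assumption rather than a measured profile.

```python
# Back-of-the-envelope check of the figures quoted in the text.
SYSTEM_PROMPT_TOKENS = 2_000
USER_MESSAGE_TOKENS = 200
PREFILL_GFLOPS_RANGE = (100, 120)  # per-request range quoted in the text
REQUESTS_PER_HOUR = 10_000


def shared_fraction(shared_tokens: int, unique_tokens: int) -> float:
    """Fraction of each input that is identical across requests."""
    return shared_tokens / (shared_tokens + unique_tokens)


def redundant_flops_per_hour(gflops_per_request: float, requests_per_hour: int) -> float:
    """Total FLOPs spent per hour re-prefilling the shared prefix."""
    return gflops_per_request * 1e9 * requests_per_hour


def quadratic_cost_ratio(tokens_a: int, tokens_b: int) -> float:
    """Relative prefill cost, ASSUMING cost grows with the square of length."""
    return (tokens_a / tokens_b) ** 2


if __name__ == "__main__":
    frac = shared_fraction(SYSTEM_PROMPT_TOKENS, USER_MESSAGE_TOKENS)
    print(f"shared context: {frac:.1%}")
    for gflops in PREFILL_GFLOPS_RANGE:
        total = redundant_flops_per_hour(gflops, REQUESTS_PER_HOUR)
        print(f"{gflops} GFLOPs/request -> {total:.2e} FLOPs/hour")
    print(f"5,000 vs 1,000 tokens: {quadratic_cost_ratio(5_000, 1_000):.0f}x")
```

Under the same assumption, scaling the 2,000-token figure to 5,000 tokens multiplies it by 6.25, which lands close to the roughly 600 GFLOPs quoted for the longer prefix.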

This redundant “prefill tax” is what the rest of this post focuses on eliminating.

How Prefix Caching Works at the Engine Layer

The redundant prefill problem has a clean structural solution, but landing it at production scale takes several mechanisms working in concert. Here’s what’s happening inside the engine when a cache hit lands.

Block-Based KV Storage

During prefill, every input token produces a key and value tensor at every attention layer, and storing these per-token would be a memory-management nightmare. The engine instead groups them into fixed-size blocks (16 tokens by default on CUDA, though configurable) allocated out of a pre-reserved GPU memory pool sized at engine startup. Each layer maintains its own pool of blocks. A single block holds the K and V tensors for block_size tokens for one layer’s KV heads, laid out so PagedAttention kernels can read them with coalesced memory accesses. A 2,000-token system prompt occupies 125 block positions (allocated per layer under the hood); once those blocks are sitting in the pool, any future request that begins with the same 2,000 tokens can point at them rather than recomputing. PagedAttention is the kernel technique that operates on this block-based layout, and the same memory machinery underlies both prefix caching and paged attention’s batching benefits, described in more detail in the engine anatomy writeup.
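The block accounting above can be sketched in a few lines. This is an illustration only, not the engine's allocator: the `KVLayout` class is hypothetical, and the layer count, KV head count, head dimension, and 2-byte dtype are placeholder assumptions chosen for the example. Only the 16-token block size comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVLayout:
    """Hypothetical model/cache shape used only for this illustration."""
    block_size: int = 16    # tokens per block (CUDA default per the text)
    num_layers: int = 32    # assumption
    num_kv_heads: int = 8   # assumption
    head_dim: int = 128     # assumption
    dtype_bytes: int = 2    # assumption: fp16/bf16 KV cache

    def full_blocks(self, num_tokens: int) -> int:
        """Blocks that are completely filled by num_tokens."""
        return num_tokens // self.block_size

    def blocks_needed(self, num_tokens: int) -> int:
        """Block positions required to hold num_tokens (ceiling division)."""
        return (num_tokens + self.block_size - 1) // self.block_size

    def bytes_per_block_per_layer(self) -> int:
        """K and V tensors for block_size tokens in one layer."""
        return 2 * self.block_size * self.num_kv_heads * self.head_dim * self.dtype_bytes

    def prefix_bytes(self, num_tokens: int) -> int:
        """Pool memory held by a prefix across all layers."""
        return self.blocks_needed(num_tokens) * self.bytes_per_block_per_layer() * self.num_layers


if __name__ == "__main__":
    layout = KVLayout()
    print("block positions for 2,000 tokens:", layout.blocks_needed(2_000))
    print("MiB held by that prefix:", layout.prefix_bytes(2_000) / 2**20)
```

With these placeholder dimensions, the 2,000-token prompt maps to the 125 block positions mentioned above. The memory figure depends entirely on the assumed model shape.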

Prefix Hashing and Cache Lookup

Recognizing that two requests share a prefix is a string-matching problem on potentially very long inputs, and doing it naively would defeat the point. The engine instead hashes prefixes block by block, with each block’s hash depending on its own tokens, the hash of the previous block, and any extra inputs that affect the computation, including LoRA adapter IDs, multimodal image hashes, and optional cache salts for multi-tenant isolation. Identical prefixes under identical conditions produce identical hash chains, and the first divergent block is also the first point where the hashes disagree. Only full blocks are hashed and cached, so a partial trailing block at the end of a prefix doesn’t get reused and is recomputed along with the rest of the suffix. These hashes live in a lookup table mapping hash to cached block, and finding “the longest prefix of this request that’s already cached” reduces to walking the request’s block hashes against the table. The lookup is small and cheap, and the KV data itself lives in the GPU memory pool. Memory pressure comes from the pool, not the index.
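The hash chain and longest-prefix lookup described above can be sketched as follows. This is a minimal illustration of the idea, not vLLM's actual implementation: the function names, the use of SHA-256 over a `repr` payload, and the plain dictionary standing in for the lookup table are all assumptions made for this example.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple

BLOCK_SIZE = 16  # tokens per block, matching the default mentioned earlier


def block_hashes(
    token_ids: Sequence[int],
    extra_keys: Tuple = (),
    block_size: int = BLOCK_SIZE,
) -> List[str]:
    """Hash each FULL block from its tokens, its parent's hash, and extra keys.

    extra_keys stands in for inputs such as a LoRA adapter ID or a cache salt.
    A partial trailing block is skipped, so it is never cached or reused.
    """
    hashes: List[str] = []
    parent = ""
    for i in range(len(token_ids) // block_size):
        block = tuple(token_ids[i * block_size:(i + 1) * block_size])
        payload = repr((parent, block, extra_keys)).encode("utf-8")
        parent = hashlib.sha256(payload).hexdigest()
        hashes.append(parent)
    return hashes


def longest_cached_prefix(hashes: Sequence[str], table: Dict[str, int]) -> List[int]:
    """Walk the chain and collect cached block IDs until the first miss."""
    hits: List[int] = []
    for h in hashes:
        block_id = table.get(h)
        if block_id is None:
            break
        hits.append(block_id)
    return hits


if __name__ == "__main__":
    system_prompt = list(range(2_000))        # stand-in token IDs
    request_a = system_prompt + [9_001] * 200
    request_b = system_prompt + [9_002] * 200

    # Pretend request A has been prefilled and its blocks registered.
    table = {h: block_id for block_id, h in enumerate(block_hashes(request_a))}

    reused = longest_cached_prefix(block_hashes(request_b), table)
    print("reused blocks:", len(reused), "tokens:", len(reused) * BLOCK_SIZE)

    salted = longest_cached_prefix(block_hashes(request_b, ("tenant-b",)), table)
    print("reused blocks with a different salt:", len(salted))
```

Because each hash folds in its parent, a single differing token or a different salt changes every hash from that block onward, which is why the walk can stop at the first miss.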

From Cache Miss to Cache Hit: The FLOP Savings

The payoff shows up in…
