DigitalOcean (GradientAI), published Apr 15, 2026

Load Balancing and Scaling LLM Serving


By Mohammad Ashar Khan

Senior Software Engineer

Updated: April 15, 2026 7 min read


Load balancing for LLMs is fundamentally different from load balancing for traditional services like web servers, APIs, or databases. Prompt caching is the reason. Prompt caching typically cuts input token costs by 50-90% and can reduce Time to First Token (TTFT) latency by up to 80%, but those gains assume your request lands on the replica that already has the relevant prefix cached. Under naive round-robin load balancing across N replicas, and assuming the prefix is cached on only one of them, that probability is roughly 1/N. The cache hit rate that made caching so attractive at one replica shrinks in inverse proportion to the size of your fleet.
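As a rough illustration of the 1/N effect, the sketch below simulates cache-oblivious routing. It assumes (these are assumptions for illustration, not details from the article) that each request's prefix is cached on exactly one replica and that the balancer's choice is independent of that placement, modeled here as a uniform random pick. The function name `simulated_hit_rate` is hypothetical.

```python
import random

def simulated_hit_rate(num_replicas: int, num_requests: int = 100_000, seed: int = 0) -> float:
    """Fraction of requests that land on the one replica holding their prefix.

    Assumption: every prefix lives on exactly one replica and the balancer
    picks a replica uniformly at random, ignoring cache placement.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_requests):
        cached_on = rng.randrange(num_replicas)  # replica that holds the prefix
        routed_to = rng.randrange(num_replicas)  # replica the balancer picks
        hits += cached_on == routed_to
    return hits / num_requests

# Compare the simulated hit rate with the analytic 1/N value.
for n in (1, 2, 4, 8):
    print(n, round(simulated_hit_rate(n), 3), "analytic:", round(1 / n, 3))
```

Doubling the fleet roughly halves the hit rate under these assumptions, which is why routing has to become cache-aware as replicas are added.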

Solving this requires rethinking how requests are routed at the infrastructure level. This article covers the load balancing strategies and specialized routers that preserve cache efficiency at scale, starting with why standard approaches fall short and progressing to precise, cache-aware routing techniques.

Inferencing engines

To serve LLMs at scale, we use inference engines. These engines hide much of the complexity of serving LLMs and improve utilization of the underlying GPUs. They also enable higher concurrency and can be tuned for diverse inference workloads, such as real-time chat completions and long-form document summarization. Notable options include vLLM, SGLang, and TensorRT-LLM.

The inferencing process is largely consistent across different engines. Sending an HTTP request to an engine initiates a standard sequence of steps.

Prefill Phase:

The input prompt is first converted into token IDs using the model’s tokenizer.

Requests are grouped into batches for efficient concurrent processing by the engine.

During this initial pass, the attention Key (K) and Value (V) tensors for the prompt tokens are computed and stored in the KV cache.

This phase concludes after the first forward pass, resulting in the generation of the first output token.

Decode Phase:

This phase involves an auto-regressive loop, continuing until an end-of-sequence token is generated or the maximum sequence length is reached.

For each newly generated token, its K and V tensors are appended to the KV cache, so earlier tokens are not recomputed.

Optimizations:

To improve efficiency and avoid redundant computation, the engine uses techniques such as Paged Attention and Prefix Caching. These methods store K/V tensors in GPU memory and reuse them efficiently, especially when new requests share a common prefix with earlier ones.

NOTE: This is a highly simplified description of token processing. In practice, inference engines apply many more optimizations, such as fusing tensor operations, capturing CUDA graphs, and tuning for various batch sizes.

For conversational workloads, every new message requires sending the entire conversation history to the engine. Reusing the Key-Value (KV) cache already computed for earlier turns shortens the prefill stage to roughly the new tokens only, which in turn decreases TTFT.
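To make the saving concrete, here is a minimal sketch that counts how many tokens must be prefilled on each turn of a conversation, with and without a warm prefix cache. The turn lengths are made-up numbers, the function name is hypothetical, and the model is simplified: it ignores block granularity and assumes the whole earlier history is still cached on the engine that receives the request.

```python
def prefill_tokens_per_turn(turn_lengths, cache_hit=True):
    """Return the number of tokens prefilled on each turn.

    turn_lengths: tokens newly added to the conversation at each turn
    (hypothetical values: the previous reply plus the new user message).
    """
    history = 0
    work = []
    for new_tokens in turn_lengths:
        prompt_len = history + new_tokens
        # With a warm prefix cache only the new suffix is prefilled;
        # without it the whole history is recomputed from scratch.
        work.append(new_tokens if cache_hit else prompt_len)
        history = prompt_len
    return work

turns = [200, 150, 180, 120]
print(prefill_tokens_per_turn(turns, cache_hit=False))  # [200, 350, 530, 650]
print(prefill_tokens_per_turn(turns, cache_hit=True))   # [200, 150, 180, 120]
```

Without cache reuse the prefill work grows with the length of the conversation; with it, each turn costs about as much as the new message alone.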

Routing in homogeneous instances

If we are running the same model on N independent inference engines, the router typically uses one of a few popular policies:

| Strategy | Description | Drawbacks |
| --- | --- | --- |
| Random or round robin | Each request is sent to a randomly selected engine or in a sequential, round robin manner. | Suboptimal performance and inconsistent results because random routing hinders the effective utilization of engine-specific KV caches. |
| Consistent hashing | Uses a “sticky session” or user_id routing, ensuring requests from the same user consistently hit the same engine. | While an improvement over random, the first request from a new user may land on any engine, potentially an engine without the required prompt’s KV cache. This is better suited for long conversational workloads. |
| Cache-aware load balancing | Routes requests to the engine with the maximum prompt prefix overlap, unless the load is uneven, in which case it routes to minimize imbalance. | If the engines lack support for KV events, routing decisions are made solely based on the request. This can lead to routing a request to an engine whose cache has been invalidated, making the decision inaccurate. |
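The consistent hashing row can be sketched as a small hash ring keyed on user_id. This is a generic illustration rather than the implementation of any particular router: the class name, the engine names, and the choice of 64 virtual nodes per engine are all assumptions.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRouter:
    """Hypothetical sticky router: the same user_id always maps to the same engine."""

    def __init__(self, engines, vnodes=64):
        # Place several virtual nodes per engine on the ring to even out load.
        self._ring = sorted(
            (_hash(f"{engine}#{i}"), engine)
            for engine in engines
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, user_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect_right(self._points, _hash(user_id)) % len(self._ring)
        return self._ring[idx][1]

router = ConsistentHashRouter(["engine-0", "engine-1", "engine-2"])
print(router.route("user-42") == router.route("user-42"))  # True
```

Because only the keys owned by a removed engine move elsewhere, most users keep hitting the engine that already holds their conversation's KV cache when the fleet changes size.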

Cache-aware load balancing is typically the standard approach. A more sophisticated version is known as precise prefix cache-aware routing. In this strategy, the router consumes the KV cache events emitted by each engine, so it knows which prefixes are actually cached where. It then directs each request to the engine whose prefix cache has the greatest overlap with the request's prompt.
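A minimal sketch of the idea follows, under stated assumptions: the event kinds `"stored"` and `"removed"`, the class `KVEventIndex`, the chained block hashing, and the block size of 4 tokens are all hypothetical and do not reflect the event format of any specific engine. The router keeps, per engine, the set of KV block hashes the engine has reported, and scores a request by how many of its leading blocks are present.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (illustrative value)

def block_hashes(token_ids):
    """Chain-hash full blocks so each hash identifies the whole prefix up to it."""
    hashes, parent = [], b""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes

class KVEventIndex:
    """Hypothetical router-side view of which blocks each engine has cached."""

    def __init__(self, engines):
        self.blocks = {engine: set() for engine in engines}

    def on_event(self, engine, kind, block_hash):
        # Assumed event kinds: "stored" when a block is cached, "removed" on eviction.
        if kind == "stored":
            self.blocks[engine].add(block_hash)
        elif kind == "removed":
            self.blocks[engine].discard(block_hash)

    def cached_prefix_blocks(self, engine, token_ids):
        # Count consecutive leading blocks of the prompt that the engine holds.
        count = 0
        for h in block_hashes(token_ids):
            if h not in self.blocks[engine]:
                break
            count += 1
        return count

    def route(self, token_ids):
        # Ties resolve to the first engine; load is ignored in this sketch.
        return max(self.blocks, key=lambda e: self.cached_prefix_blocks(e, token_ids))

index = KVEventIndex(["engine-0", "engine-1"])
prompt = list(range(12))
for h in block_hashes(prompt)[:2]:
    index.on_event("engine-1", "stored", h)
print(index.route(prompt))  # engine-1
```

Because evictions arrive as events, the router stops preferring an engine once the relevant blocks are gone, which is the accuracy gain over guessing cache state from past requests alone.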

To highlight the impact of routing on inference performance, we can compare a precise prefix cache-aware routing strategy with a standard Kubernetes Service that uses a round-robin or random policy.

Benchmarking details and results, provided by the llm-d community, are available on the llm-d routing benchmark page.

Cache-aware routing improves throughput by up to 108% for the same hardware configuration and workload. To understand cache-aware routing, we will start with its simpler form before progressing to a more advanced routing technique.

Cache-aware routing relies on a data structure, often a radix tree, that supports fast insertion, removal, and prefix matching. The method is straightforward: the router maintains a separate radix tree for each engine instance. When a new request arrives, the router extracts its prompt and checks every instance's radix tree to find the one with the longest matching prefix. That instance is selected to handle the request, and the new prompt is inserted into its radix tree so that future routing decisions can take it into account.
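The procedure above can be sketched as follows. For brevity this uses a plain, uncompressed trie over tokens rather than a true radix tree (a radix tree merges single-child chains to save memory, but the longest-prefix lookup is the same idea). The class names and the whitespace "tokenizer" are assumptions for illustration.

```python
class PrefixTrie:
    """Uncompressed trie over tokens, standing in for a radix tree."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        # Length of the longest stored prefix shared with `tokens`.
        node, depth = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            depth += 1
        return depth

class CacheAwareRouter:
    """Hypothetical router holding one trie per engine instance."""

    def __init__(self, engines):
        self.trees = {engine: PrefixTrie() for engine in engines}

    def route(self, tokens):
        # Pick the engine with the longest matching prefix (ties: first engine),
        # then record the prompt in that engine's trie for future requests.
        best = max(self.trees, key=lambda e: self.trees[e].match_len(tokens))
        self.trees[best].insert(tokens)
        return best

router = CacheAwareRouter(["engine-0", "engine-1"])
router.trees["engine-1"].insert("You are a helpful assistant .".split())
print(router.route("You are a helpful assistant . Summarize this".split()))  # engine-1
```

Note that the trie only approximates what each engine has cached: it records what was routed there, not what is still resident in GPU memory. With no load term, ties and long shared prefixes also tend to pile requests onto one engine.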

To prevent the radix trees from growing without bound, the router purges their contents with a Least Recently Used (LRU) policy. Our routing strategy is not based solely on cache state; we also incorporate a load-balancing mechanism that uses a balance threshold. The router switches dynamically between pure load balancing and cache-aware routing, because relying only on cache-aware routing has proven suboptimal for keeping load evenly distributed across all instances.
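A minimal sketch of how a balance threshold and LRU purging might fit together is shown below. Everything here is an assumption for illustration: the class and parameter names (`balance_threshold`, `max_entries`), the use of in-flight request counts as the load signal, and the flat LRU list of prompts scanned linearly in place of a radix tree.

```python
from collections import OrderedDict

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class BalancedCacheAwareRouter:
    """Hypothetical router: cache-aware while balanced, least-loaded otherwise."""

    def __init__(self, engines, balance_threshold=4, max_entries=1000):
        self.load = {e: 0 for e in engines}              # in-flight requests
        self.seen = {e: OrderedDict() for e in engines}  # prompts in LRU order
        self.balance_threshold = balance_threshold
        self.max_entries = max_entries

    def _best_match(self, engine, tokens):
        return max((common_prefix_len(p, tokens) for p in self.seen[engine]), default=0)

    def route(self, tokens):
        tokens = tuple(tokens)
        loads = self.load.values()
        if max(loads) - min(loads) > self.balance_threshold:
            # Load is too uneven: ignore the cache and pick the least-loaded engine.
            engine = min(self.load, key=self.load.get)
        else:
            # Load is balanced enough: pick the longest prefix match.
            engine = max(self.load, key=lambda e: self._best_match(e, tokens))
        self._remember(engine, tokens)
        self.load[engine] += 1
        return engine

    def _remember(self, engine, tokens):
        entries = self.seen[engine]
        entries[tokens] = None
        entries.move_to_end(tokens)       # recency is refreshed on insert only
        while len(entries) > self.max_entries:
            entries.popitem(last=False)   # evict the least recently used prompt

    def finish(self, engine):
        self.load[engine] -= 1            # call when a request completes

router = BalancedCacheAwareRouter(["e0", "e1"], balance_threshold=2)
shared = ["sys", "prompt"]
print([router.route(shared + [str(i)]) for i in range(5)])
# ['e0', 'e0', 'e0', 'e1', 'e0']
```

In the usage example no request finishes, so the fourth request finds a load gap of 3, which exceeds the threshold of 2, and is diverted to the idle engine even though the other one holds the shared prefix.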

The current design stores the entire prefix cache state…
