The LLM Inference Trilemma: Throughput, Latency, Cost
© 2026 DigitalOcean, LLC.
By Balaji Varadarajan
Staff Engineer
Published: April 22, 2026 (12 min read)
We know how to scale traditional web services: throw a load balancer in front of stateless microservices and horizontally scale your CPU instances as traffic grows. Large Language Models break this playbook because LLM inference is fundamentally stateful, bottlenecked by memory bandwidth rather than raw compute, and bound to physical hardware interconnects. Scaling LLM inference isn’t just a matter of adding more servers; it’s a delicate, multi-dimensional optimization problem.
A Classic Trilemma
If you’ve served a large language model in production, you’ve encountered the trilemma. Push throughput up, and latency creeps higher. Clamp latency down, and your GPU bill inflates. Try to optimize cost, and you’re forced to make uncomfortable compromises on one of the other two dimensions.
This three-way tension among throughput, latency, and cost is the central engineering challenge in dedicated LLM hosting. The three are tightly coupled rather than independent: moving one shifts at least one of the others. Understanding that coupling is the difference between a system whose economics improve as it scales and one that simply inflates your infrastructure budget.
This article is a practitioner’s guide to navigating these trade-offs. We’ll unpack what “cost” actually means in the inference world (spoiler: it’s not just $/token), walk through the levers that dictate cost, and discuss how hardware selection and benchmarking expose the real cost surface. Finally, we’ll touch on when and why you might optimize for throughput versus latency and what that decision costs you.
What Does “Cost” Actually Mean in LLM Inference?
In standard web hosting, cost is often roughly linear (more traffic = more servers). In LLM hosting, “cost” is a multi-dimensional metric. When people talk about inference costs, they usually default to a single number: dollars per million tokens. When you run dedicated infrastructure, however, the real cost of serving an LLM is a composite of at least four distinct dimensions.
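To see why a single dollars-per-million-tokens figure can mislead on dedicated hardware, here is a minimal sketch that derives it from an hourly node price, a sustained throughput, and a utilization fraction. The function name and all input values are illustrative assumptions, not figures from this article.

```python
def cost_per_million_tokens(node_cost_per_hour: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Dollars per one million generated tokens on a dedicated node.

    utilization is the fraction of paid-for hours spent serving traffic.
    """
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    # Tokens actually produced per hour that you pay for.
    tokens_per_paid_hour = tokens_per_second * 3600 * utilization
    return node_cost_per_hour / tokens_per_paid_hour * 1_000_000


# Hypothetical inputs: a $24/hour node sustaining 2,000 tokens/s.
busy = cost_per_million_tokens(24.0, 2000.0)        # serving around the clock
half = cost_per_million_tokens(24.0, 2000.0, 0.5)   # idle half of the time
print(f"${busy:.2f} vs ${half:.2f} per million tokens")
```

Under these assumptions, the same hardware at the same hourly rate costs twice as much per token once it sits idle half of the time, which is why the headline number alone says little about a dedicated deployment.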
Capital Cost (CapEx): Paying for the Full Node
This is the hardware cost. Because GPUs are tied together by high-speed interconnects (like NVLink), you can’t just buy “half a node”. For instance, an 8-GPU H100 node is a single indivisible purchase, even if your 70B model only needs four GPUs. You pay for the node’s full capacity no matter how small a fraction of it your model actually uses.
Operational Cost (OpEx): The Electricity & Cloud Tax
Owning hardware means an ongoing “burn” of power and cooling costs, while renting it from a provider shifts that burden into hourly rates. An 8-GPU H100 node pulls 10 to 12 kW under load, which can add up to thousands of dollars a year in electricity, and cooling for dense GPU racks (40 to 60 kW) can match or exceed that. Cloud rental is the OpEx alternative: H100 pricing has generally dropped in 2026, but the “idling tax” remains the primary enemy of OpEx efficiency.
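The electricity figure is simple arithmetic, sketched below. The 10 to 12 kW draw comes from the text; the $0.10/kWh rate and the assumption that the node runs under load all year are placeholders to replace with your own numbers.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def annual_electricity_cost(power_kw: float, price_per_kwh: float) -> float:
    """Yearly electricity cost in dollars for a constant power draw."""
    return power_kw * HOURS_PER_YEAR * price_per_kwh


# Assumed rate of $0.10/kWh; the node is assumed to run under load all year.
low = annual_electricity_cost(10.0, 0.10)
high = annual_electricity_cost(12.0, 0.10)
print(f"${low:,.0f} to ${high:,.0f} per year, before cooling")
```

At that assumed rate the node alone lands between roughly $8,800 and $10,500 a year, and cooling overhead comes on top of it.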
Opportunity Cost: The Utilization Gap
This is the “ghost in the machine” for enterprise deployments. Every minute a GPU sits idle during low-traffic hours (like 3 a.m.) is money lost. Because dedicated hardware isn’t easily shared across different models without performance hits, bursty traffic patterns can create a gap between “paid-for” capacity and “used” capacity.
Without sophisticated orchestration or “serverless-on-dedicated” setups, the lack of multi-tenancy on dedicated nodes can make this the largest invisible drain on ROI.
This is where autoscaling shifts from a reliability mechanism to a cost-optimization tool: a coding assistant serving North American developers can scale down to a single replica between 2 a.m. and 6 a.m. Pacific time, reclaiming hours of idle GPU spend every day.
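A rough way to size that saving is sketched below. The four-hour window and the single off-peak replica come from the example above; the peak replica count, GPUs per replica, and the $2.50 per GPU-hour rate are hypothetical.

```python
def daily_gpu_hours_saved(peak_replicas: int,
                          off_peak_replicas: int,
                          gpus_per_replica: int,
                          off_peak_hours: float) -> float:
    """GPU-hours reclaimed per day by scaling down during the quiet window."""
    if off_peak_replicas > peak_replicas:
        raise ValueError("off-peak replicas should not exceed peak replicas")
    replicas_released = peak_replicas - off_peak_replicas
    return replicas_released * gpus_per_replica * off_peak_hours


# Hypothetical fleet: 6 replicas at peak, 2 GPUs each, scaled to 1 replica
# for the 4 hours between 2 a.m. and 6 a.m.
saved = daily_gpu_hours_saved(peak_replicas=6, off_peak_replicas=1,
                              gpus_per_replica=2, off_peak_hours=4)
monthly_dollars = saved * 30 * 2.50  # assumed $2.50 per GPU-hour
print(f"{saved:.0f} GPU-hours/day, about ${monthly_dollars:,.0f}/month")
```

The sketch ignores replica warm-up time and any minimum billing increments, both of which reduce the real saving.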
Engineering Cost: The Specialized Labor Premium
Engineering cost is consistently underestimated. The most expensive component isn’t necessarily the silicon; it’s the specialized labor required to tune it. Finding the optimal configuration for vLLM or TensorRT-LLM is a high-level systems engineering task that consumes weeks of expensive human and machine time.
The complexity of the software stack (profiling with Nsight, managing CUDA versions) has only grown. The “benchmarking tax” is real: companies can spend considerable engineering time chasing savings on their monthly GPU bill, and that time has to be weighed against what it actually saves.
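One way to keep the benchmarking tax honest is a break-even estimate, sketched here. Every number is hypothetical; the point is only that tuning effort is an up-front cost that monthly savings must repay.

```python
def breakeven_months(engineer_weeks: float,
                     cost_per_engineer_week: float,
                     benchmark_gpu_cost: float,
                     monthly_gpu_savings: float) -> float:
    """Months of savings needed to repay a one-off tuning effort."""
    if monthly_gpu_savings <= 0:
        return float("inf")  # the effort never pays for itself
    upfront = engineer_weeks * cost_per_engineer_week + benchmark_gpu_cost
    return upfront / monthly_gpu_savings


# Hypothetical: 4 engineer-weeks at $5,000/week plus $4,000 of benchmark
# GPU time, yielding $3,000/month in savings.
months = breakeven_months(4, 5_000, 4_000, 3_000)
print(f"Break-even after {months:.1f} months")
```

If the break-even horizon is longer than the expected life of the model or hardware generation, the tuning effort is unlikely to be worth it on cost grounds alone.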
The Levers That Dictate Cost
Now that we’ve broken cost into its four dimensions, the next question is what you can actually do about it. A handful of engineering decisions—model architecture, quantization, parallelism, and batching—account for most of the cost variance between a well-tuned deployment and a wasteful one. This is where engineering meets economics.
Model Architecture: Dense vs. MoE
Cost for dense models (e.g., Llama 3 70B) scales linearly with memory/VRAM. Cost for MoE (Mixture-of-Experts) models (e.g., DeepSeek-V3) is more a game of communication. A dense 70B model activates all 70 billion parameters on every token. An MoE model like DeepSeek-V3 has 671B total parameters but activates only ~37B per token. This changes the cost equation.
For dense models, scaling is linear and predictable. Cost tracks the ratio of model size to available HBM (high-bandwidth memory): Llama 3.3 70B in BF16 needs roughly 140 GB for weights alone, so two 80 GB H100s minimum or one MI300X (192 GB). MoE models flip the problem. Llama 4 Maverick has 400B total parameters but activates only 17B per token. Its total weight footprint in BF16 is ~800 GB, which exceeds the 640 GB of HBM on an 8x H100 (80 GB) node and so demands a full 8-GPU node of larger-memory accelerators (or a lower-precision format). Yet per-token compute is comparable to a model a fraction of that size, since only one of 128 routed experts fires per MoE layer.
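The footprint numbers above follow from parameters times bytes per parameter. The sketch below reproduces them for weights only; it deliberately ignores KV cache, activations, and runtime overhead, so real deployments need headroom beyond these minimums. The helper names are illustrative.

```python
import math

# Bytes per parameter for the formats used in this sketch.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_footprint_gb(params_billion: float, dtype: str = "bf16") -> float:
    """Weight memory in GB (10^9 bytes); excludes KV cache and activations."""
    return params_billion * BYTES_PER_PARAM[dtype]


def min_gpus_for_weights(params_billion: float, hbm_gb_per_gpu: float,
                         dtype: str = "bf16") -> int:
    """Smallest GPU count whose combined HBM holds the weights."""
    return math.ceil(weight_footprint_gb(params_billion, dtype) / hbm_gb_per_gpu)


# Dense 70B in BF16: 140 GB of weights.
print(weight_footprint_gb(70), "GB")
print(min_gpus_for_weights(70, 80), "x 80 GB GPUs")     # H100-class
print(min_gpus_for_weights(70, 192), "x 192 GB GPU")    # MI300X-class

# MoE with 400B total parameters in BF16: all experts must be resident,
# even though only a small subset is active per token.
print(weight_footprint_gb(400), "GB")
print(min_gpus_for_weights(400, 192), "x 192 GB GPUs")
```

Note that the MoE footprint is computed from total parameters, not active parameters: memory cost follows the full model, while compute cost follows the active subset.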
The cost challenge for dense models is a brute-force memory problem. Since every parameter (W) is activated for every token, your cost is directly tied to how fast your GPU can pull those weights from HBM into the compute cores. The lower the memory bandwidth, the longer each decode step takes and the higher the latency.
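This bandwidth bound can be approximated with a back-of-the-envelope calculation: at batch size 1, each decode step must stream every weight from HBM once, so per-token time is at least weight bytes divided by aggregate bandwidth. The sketch assumes weights are evenly sharded across GPUs, ignores KV-cache reads and communication, and uses 3.35 TB/s as a nominal per-GPU HBM bandwidth for an H100 SXM; treat that figure as an assumption to check against your hardware.

```python
def min_decode_ms_per_token(weight_gb: float, num_gpus: int,
                            hbm_tb_per_s: float) -> float:
    """Lower bound on batch-1 decode time per token, in milliseconds.

    Assumes every weight is read once per token and that the GPUs'
    combined HBM bandwidth is fully used (an optimistic bound).
    """
    aggregate_gb_per_s = num_gpus * hbm_tb_per_s * 1000
    return weight_gb / aggregate_gb_per_s * 1000


# Dense 70B in BF16 (140 GB) sharded over two GPUs at an assumed 3.35 TB/s.
ms = min_decode_ms_per_token(140, 2, 3.35)
print(f"at least {ms:.1f} ms/token, at most {1000 / ms:.0f} tokens/s per stream")
```

Batching amortizes the same weight reads over many requests, which is why throughput per dollar improves with batch size even though per-request latency does not.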
For MoE models, the cost challenge shifts to communication. Because only a subset of experts fire for any given token, the total compute required is generally modest. However, those experts are sharded across multiple GPUs, so each token’s activations must be sent to whichever GPUs hold its selected experts and the results sent back. This requires “all-to-all” routing patterns that can put immense pressure on the interconnect.
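To get a feel for that interconnect pressure, the following sketch estimates the expected bytes one token moves across GPUs in a single forward pass. It is a simplified model under stated assumptions: experts are evenly sharded, routing is uniform, and each remote expert assignment costs one dispatch and one combine of a hidden-state vector. The model dimensions in the example are hypothetical, not those of any model named in this article.

```python
def moe_all_to_all_bytes_per_token(hidden_size: int, top_k: int,
                                   num_moe_layers: int, num_gpus: int,
                                   bytes_per_element: int = 2) -> float:
    """Expected cross-GPU bytes for one token in one forward pass.

    With uniform routing over evenly sharded experts, a fraction
    (num_gpus - 1) / num_gpus of expert assignments land on a remote GPU.
    """
    remote_fraction = (num_gpus - 1) / num_gpus
    vector_bytes = hidden_size * bytes_per_element
    # Factor of 2: one dispatch to the expert, one combine back.
    per_layer = 2 * top_k * vector_bytes * remote_fraction
    return per_layer * num_moe_layers


# Hypothetical model: hidden size 4096, top-2 routing, 32 MoE layers,
# BF16 activations, experts spread over 8 GPUs.
per_token = moe_all_to_all_bytes_per_token(4096, 2, 32, 8)
print(f"{per_token / 1e6:.2f} MB per token")
print(f"{per_token * 10_000 / 1e9:.1f} GB/s at 10,000 tokens/s")
```

Because this traffic scales with tokens per second rather than with model size, high-throughput MoE serving tends to become interconnect-bound well before it becomes compute-bound.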
Quantization: Trading Precision for Efficiency
Quantization is the…