Together AI, published May 4, 2026

Foundational research powering efficient inference at scale

Inference

Published 5/4/2026

As AI moves from research to production, the challenge for AI-native teams shifts from building models to running them — efficiently, reliably, and at scale.

Authors

Will Van Eaton, Adee Feiner, Hiral Jasani

AI has spent years in the spotlight for training: the massive, GPU-intensive process of building models. But for most teams deploying AI today, ongoing inference costs are what actually shape their unit economics. Estimates put inference at 80-90% of the total lifetime cost of a production AI system, simply because it runs continuously across every user query, agent step, and API call. And while training is a bounded investment, inference scales with every new user and use case you ship.

At NVIDIA GTC 2026, NVIDIA CEO Jensen Huang framed this shift plainly: “People pay for information, but people mostly pay for work. Agentic systems get work done.” That shift from AI as a novelty to AI as a workhorse is exactly what’s reshaping infrastructure priorities.

For Together AI, none of this is new. The inference imperative is what we’ve been building for. Our CTO Ce Zhang covered these dynamics in depth at GTC, sharing hard-won lessons from running some of the most demanding production inference workloads in the industry.

Why inference is a different kind of hard

Inference isn’t just “running the model.” In production, it’s an optimization problem across multiple competing dimensions simultaneously:

Latency shapes what’s possible to build. For applications like coding assistants, real-time support, or conversational agents, sub-500ms response times aren’t a nice-to-have; they determine whether a product feels like software or like waiting. Agentic workflows amplify this: five sequential model calls at 200ms each add up to a full second of accumulated latency before a user sees a result. The threshold matters, and missing it has product consequences.

Throughput determines your unit economics. AI-native companies face a structurally different cost profile than traditional SaaS. Where legacy software companies target 80-90% gross margins, AI companies commonly operate at 50-60%, with inference alone accounting for roughly 23% of revenue at scaling-stage companies. Efficient inference means more requests served per GPU-hour. That math flows directly to margins.

The model landscape keeps changing. The inference stack optimized for today’s models may need significant rework tomorrow. New architectures, quantization methods, and hardware keep arriving, so staying at the frontier requires continuous investment across the full stack, not just one-time optimization.

Concurrency is unforgiving. Serving thousands of simultaneous users means navigating wildly different context lengths, latency requirements, and cost profiles, all at once and without degradation. That’s as much a scheduling and orchestration challenge as it is a compute one.
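The latency and throughput points above come down to simple arithmetic, which the following minimal Python sketch works through. The function names, the GPU-hour price, and the tokens-per-second figures are hypothetical values chosen for illustration; they are not Together AI numbers.

```python
def accumulated_latency_ms(per_call_ms: float, sequential_calls: int) -> float:
    """Latency a user waits for when model calls run one after another."""
    return per_call_ms * sequential_calls


def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_second: float) -> float:
    """Serving cost per one million tokens on a single GPU.

    Assumes the GPU is fully utilized for the whole hour, which is an
    idealization; real utilization is lower and varies with traffic.
    """
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# The agentic example from the text: five sequential calls at 200 ms each.
agent_latency = accumulated_latency_ms(per_call_ms=200, sequential_calls=5)

# Hypothetical figures: the same GPU price, with throughput doubled by optimization.
baseline_cost = cost_per_million_tokens(gpu_hour_usd=3.0, tokens_per_second=2500)
optimized_cost = cost_per_million_tokens(gpu_hour_usd=3.0, tokens_per_second=5000)

print(f"agent latency: {agent_latency:.0f} ms")
print(f"baseline:  ${baseline_cost:.4f} per 1M tokens")
print(f"optimized: ${optimized_cost:.4f} per 1M tokens")
```

Under these assumptions, doubling throughput at a fixed GPU-hour price halves the cost per token, which is the sense in which requests served per GPU-hour flow through to margins.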

This is also why the stakes are higher than most teams initially expect.

How Together approaches inference

Together’s approach to inference isn’t a single optimization. It’s a compounding stack of research, systems engineering, and hardware expertise designed to improve continuously as the frontier moves:

Research that ships to production. The Together Research team has contributed some of the most widely adopted advances in inference efficiency: FlashAttention (now up to FlashAttention-4), ThunderKittens, and Aurora, our open-source adaptive speculative decoding system delivering up to 1.25x faster LLM inference. This research goes into production for customers, typically within weeks of publication.

Adaptive speculative decoding. Standard speculative decoding uses a smaller draft model to propose tokens that a larger model verifies in parallel, delivering 1.5-3x speedups on predictable workloads like code completion or structured outputs. Our ATLAS and Aurora systems go further: Aurora is an open-source RL-based framework that learns from live inference traces in real time, adapting as traffic patterns shift. It achieves meaningful speedups over even well-trained static speculators, without interrupting serving.

Full-stack hardware optimization. Running on the latest NVIDIA Blackwell hardware (GB200 NVL72, HGX B200) means building custom parallelism strategies across 72-GPU meshes, implementing NVFP4 quantization, and constructing weights-to-production pipelines that get model releases live within days. When Cursor needed production-grade latency for millions of active developers, Together AI built the full-stack infrastructure to make it work, handling strict latency SLAs across unpredictable, high-concurrency traffic.

Intelligent scheduling and batching. High-throughput inference requires making smart real-time decisions: which requests to batch together, how to route based on context length and latency requirements, and when to trade throughput for responsiveness. Together’s inference engine handles this dynamically, extracting maximum efficiency from each GPU-hour without sacrificing the experience that AI-native apps and products depend on.
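To make the standard speculative decoding idea described above concrete, here is a toy Python sketch of the greedy variant: a cheap draft proposes a few tokens, and the target keeps the longest prefix it agrees with. This is not the ATLAS or Aurora implementation. The models are hypothetical stand-in functions over integer tokens, and the verification step is simulated with a loop, whereas a real system would score all proposed tokens in one parallel forward pass of the target model.

```python
from typing import Callable, List, Sequence, Tuple

# A "model" here is any function mapping a token sequence to the next token.
NextToken = Callable[[Sequence[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    passes = 0
    while len(tokens) < goal:
        # 1) The draft proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2) The target verifies the proposal. Counted as one pass because a
        #    real system would do this in a single batched forward pass.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch
                break
        else:
            # Every proposed token matched, so the target adds one more.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], passes


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees with it except immediately after a 7.
def toy_target(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: Sequence[int]) -> int:
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10


spec_tokens, spec_passes = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12)
plain_tokens = greedy_decode(toy_target, [0], max_new_tokens=12)
print(spec_tokens == plain_tokens, spec_passes)
```

In this greedy form the output matches plain greedy decoding with the target token for token, and the saving comes from needing fewer target passes when the draft guesses well. That is also why the speedup depends on how predictable the workload is.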

The economics of getting this right

The Stanford 2025 AI Index documents a remarkable trend: inference costs for GPT-3.5-level performance dropped more than 280-fold between late 2022 and late 2024. But total inference spend is rising; as costs fall, teams deploy AI for more use cases, users, and agent steps. Lower costs per token haven’t reduced the infrastructure challenge; they’ve expanded the surface area of it.

As the industry converges on lower token cost as a real indicator of AI infrastructure's TCO, Together AI’s approach of optimizing the full hardware and software stack continues to deliver better profitability for customers. For AI-native companies, this makes inference optimization a compounding…
