Aurora

Published 3/31/2026

Authors

Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Ce Zhang, Tri Dao, Percy Liang, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu, Xiaoxia Wu

Summary

Speculative decoding goes stale in production: draft models can drift, and offline retraining can't always keep pace. Aurora fixes this. It's an open-source, RL-based framework that learns directly from live inference traces and continuously updates the speculator without interrupting serving. Key results:

→ Real-time adaptation across shifting traffic domains
→ 1.25x additional speedup over a well-trained static speculator

The headline finding: online training from scratch can outperform a carefully pretrained static baseline.

Running large language models in production is a constant tradeoff between performance and cost. Speculative decoding is the standard lever: in principle, it speeds up inference. In practice, it often under-delivers: draft models go stale, acceptance rates drift, and offline retraining is too slow and too expensive to keep pace with live traffic. What if your system could learn continuously, on the fly, from the very requests it's serving?

Last year, we introduced ATLAS, our first step toward an adaptive speculator. That work laid the foundation, but the goal was always a fully autonomous system that closes the loop between serving and training. Today, we're releasing Aurora, an open-source, RL-based framework that learns from live inference traces and updates the speculator asynchronously, turning speculative decoding from a static, one-time setup into a dynamic, self-improving flywheel.

This unified design unlocks capabilities that are difficult to achieve in standard pipelines, including:

(1) direct mitigation of distribution mismatch, achieving a 1.25x improvement over a strong offline baseline;
(2) reduced infrastructure cost by eliminating large-scale activation-collection pipelines;
(3) an algorithm-agnostic framework compatible with future speculator designs; and
(4) support for diverse, heterogeneous user demands.

Across experiments, Aurora achieves an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3). The code to reproduce the paper's results is open-sourced, and we welcome contributions from the community.

Figure: Aurora quickly adapts to shifting domains.

End-to-end throughput under varying batch sizes. MiniMax M2.5 (FP8, lookahead 5):

BS   Config     OTPS Mean   OTPS P50   OTPS P05   OTPS P95   Speedup   Acc Len
1    w/o spec   147.06      146.45     140.46     154.72     --        --
1    w/ spec    240.39      226.57     186.98     325.36     1.63×     2.41
8    w/o spec   109.41      106.49     99.56      126.57     --        --
8    w/ spec    160.95      157.42     123.72     207.04     1.47×     2.40
16   w/o spec   93.12       89.56      82.64      113.29     --        --
16   w/ spec    134.70      129.95     100.97     179.02     1.45×     2.40
32   w/o spec   80.44       77.57      71.77      96.84      --        --
32   w/ spec    120.67      115.04     92.49      162.77     1.50×     2.45

BS = batch size. OTPS = output tokens per second. Measured on the testing dataset (198 examples).

Qwen3-Coder-Next-FP8 (lookahead 5):

BS   Config     OTPS Mean   OTPS P50   OTPS P05   OTPS P95   Speedup   Acc Len
1    w/o spec   195.21      195.23     194.75     195.75     --        --
1    w/ spec    375.49      350.37     251.92     574.03     1.92×     3.05
8    w/o spec   160.08      157.69     155.81     175.40     --        --
8    w/ spec    279.09      250.65     188.27     414.05     1.74×     3.10
16   w/o spec   138.70      137.92     130.05     150.44     --        --
16   w/ spec    221.56      202.96     143.80     323.54     1.60×     2.96
32   w/o spec   117.50      114.36     108.95     130.10     --        --
32   w/ spec    184.23      166.56     124.03     278.96     1.57×     3.00

BS = batch size. OTPS = output tokens per second. Measured on the testing dataset (198 examples).
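To make the two tables easier to read, here is a small sketch that recomputes the Speedup column from the mean OTPS values and compares it with the Acc Len column. It assumes that Speedup is the ratio of mean OTPS with and without speculation, and that Acc Len is the mean number of tokens produced per verification step; neither definition is spelled out in this excerpt. The names `Row`, `speedup`, and `realized_fraction` are illustrative, not part of Aurora.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    batch_size: int
    otps_base: float  # mean OTPS without speculation (from the table)
    otps_spec: float  # mean OTPS with speculation (from the table)
    acc_len: float    # "Acc Len" column (from the table)


MINIMAX_M25 = [
    Row(1, 147.06, 240.39, 2.41),
    Row(8, 109.41, 160.95, 2.40),
    Row(16, 93.12, 134.70, 2.40),
    Row(32, 80.44, 120.67, 2.45),
]

QWEN3_CODER_NEXT = [
    Row(1, 195.21, 375.49, 3.05),
    Row(8, 160.08, 279.09, 3.10),
    Row(16, 138.70, 221.56, 2.96),
    Row(32, 117.50, 184.23, 3.00),
]


def speedup(row: Row) -> float:
    """Assumed definition: ratio of mean OTPS with vs. without speculation."""
    return row.otps_spec / row.otps_base


def realized_fraction(row: Row) -> float:
    """Share of the accepted length that shows up as end-to-end speedup."""
    return speedup(row) / row.acc_len


if __name__ == "__main__":
    for name, rows in [("MiniMax M2.5", MINIMAX_M25),
                       ("Qwen3-Coder-Next-FP8", QWEN3_CODER_NEXT)]:
        for r in rows:
            print(f"{name} BS={r.batch_size}: "
                  f"speedup={speedup(r):.2f}x, "
                  f"realized={realized_fraction(r):.0%} of acc len")
```

Under these assumptions the recomputed ratios match the Speedup column to two decimals, and in every row the speedup is well below the accepted length, which is consistent with the gap between acceptance and measured throughput.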

1. Why the standard train-then-serve pipeline breaks down

Offline speculative training is convenient organizationally, but it introduces several practical issues in production that limit its effectiveness. The traditional pipeline is a one-way street, leading to stale models and a disconnect from real-world performance.

Static speculators typically degrade as traffic patterns shift. Traditional speculative decoding follows a linear, static flow that degrades over time. Aurora introduces a circular, continuously adaptive approach.

The verifier moves, but the drafter lags. Production target models change for quality, safety, cost, or hardware migration. The speculator often updates much more slowly, so it becomes stale and speculative performance degrades over time.

Offline distillation pipelines are expensive. Activation collection and replay pipelines for drafter training can be extremely costly to store and operate at scale. At production scale, the storage footprint can reach the petabyte range, with high cost in memory, bandwidth, and operational complexity. Aurora reduces this burden by learning directly from live serving traces.

Acceptance rate is not the same as real speedup. Offline training can optimize acceptance in a lab setting, but production speedup depends on the actual serving stack: kernels, numeric precision (FP8/FP4), batching, scheduling, and hardware behavior. The best draft model offline may not be the best model online. In practice, most teams train multiple drafters but end up selecting only one; because Aurora operates online, it enables a direct speedup comparison.

These gaps suggest that speculative decoding should not be treated merely as a modeling problem ("train a better drafter"), but as a joint learning-and-serving problem.

2. The core idea: A serve-to-train flywheel powered by RL

Aurora turns speculative decoding into a serve-to-train flywheel. Rather than treating the speculator as a static artifact, it learns continuously from every request it serves.
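The serve-to-train loop can be pictured with a toy sketch: a serving loop emits one trace per request onto a queue, and a separate trainer thread consumes traces and publishes a new speculator version without blocking serving. This is only an illustration of the asynchronous pattern, not Aurora's implementation or API. `Trace`, `SpeculatorStore`, `serve`, `train`, and `run_flywheel` are hypothetical names, the trace contents are placeholder numbers, and the RL update itself is left as a stub.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Trace:
    request_id: int
    proposed: int  # draft tokens proposed (placeholder value)
    accepted: int  # draft tokens the target model accepted (placeholder value)


class SpeculatorStore:
    """Hypothetical holder for the live speculator; swaps happen under a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._version = 0

    def publish(self) -> int:
        with self._lock:
            self._version += 1
            return self._version

    def current(self) -> int:
        with self._lock:
            return self._version


def serve(num_requests: int, traces: queue.Queue, store: SpeculatorStore) -> list:
    """Serving loop: uses whichever speculator is live and never waits on training."""
    versions_used = []
    for request_id in range(num_requests):
        versions_used.append(store.current())
        traces.put(Trace(request_id, proposed=5, accepted=2))
    traces.put(None)  # sentinel: no more traffic
    return versions_used


def train(traces: queue.Queue, store: SpeculatorStore, batch_size: int) -> None:
    """Trainer loop: consumes traces and publishes an update per full batch."""
    batch = []
    while True:
        trace = traces.get()
        if trace is None:
            break
        batch.append(trace)
        if len(batch) == batch_size:
            # Stub: a real trainer would compute an update from the batch here.
            store.publish()
            batch.clear()


def run_flywheel(num_requests: int = 8, batch_size: int = 4):
    traces = queue.Queue()
    store = SpeculatorStore()
    trainer = threading.Thread(target=train, args=(traces, store, batch_size))
    trainer.start()
    versions_used = serve(num_requests, traces, store)
    trainer.join()
    return versions_used, store.current()


if __name__ == "__main__":
    used, final_version = run_flywheel()
    print("versions used per request:", used)
    print("final speculator version:", final_version)
```

The design choice worth noting is that the only shared state is the queue and the version holder, so serving latency does not depend on how long a training step takes. Which version each request sees depends on thread timing, so the per-request list varies between runs.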

Figure: Aurora offers a serve-to-train flywheel powered by RL.

The system is built around two decoupled components. The Inference Server runs a speculative decoding engine (based on SGLang or vLLM) with a target model…
