togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8
Model Description
This is an EAGLE3 draft model trained from scratch (random initialization) using the Aurora inference-time training framework for speculative decoding. Unlike traditional approaches that fine-tune pre-trained models, this model is built entirely through Aurora's online training process. The model is optimized to generate high-quality draft tokens for the Qwen/Qwen3-Coder-Next-FP8 target model, reaching up to 1.51× end-to-end speedup at batch size 1 on code generation tasks.
Key Features
- Training Approach: Trained from scratch (random initialization) - no pre-training required
- Framework: Trained with Aurora - an advanced inference-time training system
- Architecture: EAGLE3 speculative decoding draft model
- Target Model: Qwen/Qwen3-Coder-Next-FP8
- Training Data: OnlineSD Code Dataset
- Performance: Achieves an average accept length of about 3.1 tokens (3.06 at lookahead 5) for speculative decoding
- Training: 10,000 training steps over 80,000 inference requests
Target Model
This draft model is specifically designed to work with:
- Model: Qwen/Qwen3-Coder-Next-FP8
- Type: Code generation language model
- Precision: FP8 quantized
- Domain: Programming and code synthesis
The draft model learns to predict the target model's token distribution during inference-time training, enabling efficient speculative decoding.
Architecture
EAGLE3 Speculative Decoding
This model implements the EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency) architecture:
- Draft Model: Lightweight model that generates candidate tokens
- Tree-based Attention: Enables parallel verification of multiple draft tokens
- Auto-regressive Generation: Produces speculative token sequences
- Dynamic Adaptation: Updates during inference to match target model distribution
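The draft-then-verify loop behind these bullets can be illustrated with a minimal sketch. It assumes greedy decoding and a single draft chain rather than EAGLE3's tree attention, and `draft_next` and `target_next` are hypothetical stand-ins for the two models, not part of any real API.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # hypothetical: context -> greedy next token


def speculative_step(context: List[int], draft_next: NextToken,
                     target_next: NextToken, num_steps: int = 5) -> List[int]:
    """Run one draft-then-verify iteration and return the tokens it emits."""
    # 1. The draft model proposes `num_steps` tokens auto-regressively.
    proposed: List[int] = []
    for _ in range(num_steps):
        proposed.append(draft_next(context + proposed))

    # 2. The target model checks each proposal. A real system scores all
    #    positions in one forward pass; this loop is for clarity only.
    emitted: List[int] = []
    for token in proposed:
        expected = target_next(context + emitted)
        if token != expected:
            emitted.append(expected)  # first mismatch: keep the target's token
            return emitted
        emitted.append(token)

    # 3. Every draft token matched, so the target contributes one extra token.
    emitted.append(target_next(context + emitted))
    return emitted
```

The length of the returned list is the number of tokens emitted by that iteration. Whether the accept length reported in this card counts the final target-supplied token is not stated here.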
Model Structure
- Initialization: Trained from scratch (random initialization, no pre-training)
- Base Architecture: Single-layer Transformer decoder
- Precision: FP8 (8-bit floating point)
- Speculative Steps: 5 tokens per iteration
- Attention Mechanism: Tree-based for parallel draft verification
- Training Paradigm: Online learning during inference (Aurora framework)
Training Details
Aurora Framework
This model was trained from scratch using Aurora, an inference-time training framework that:
- Requires no pre-training: starts from random initialization and learns entirely through online training
- Updates the draft model dynamically during inference
- Uses reverse KL divergence for distribution matching (minimizing KL(draft || target))
- Employs online learning with periodic model updates
- Optimizes for both draft quality and speculative acceptance rate
- Demonstrates that effective draft models can be built from scratch without expensive pre-training
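For reference, the two directions of KL divergence between a draft distribution and a target distribution over the same vocabulary can be computed as below. This is a minimal sketch in plain Python, following the common convention in which "reverse" KL takes the expectation under the distribution being trained (here, the draft). The function names are illustrative and the snippet is not taken from Aurora's code.

```python
import math
from typing import Sequence


def kl_divergence(a: Sequence[float], b: Sequence[float], eps: float = 1e-12) -> float:
    """KL(a || b) = sum_i a_i * log(a_i / b_i) for two discrete distributions."""
    return sum(ai * math.log(ai / max(bi, eps)) for ai, bi in zip(a, b) if ai > 0.0)


def reverse_kl(draft: Sequence[float], target: Sequence[float]) -> float:
    """KL(draft || target): the expectation is taken under the draft distribution."""
    return kl_divergence(draft, target)


def forward_kl(draft: Sequence[float], target: Sequence[float]) -> float:
    """KL(target || draft): the expectation is taken under the target distribution."""
    return kl_divergence(target, draft)
```

The two directions generally give different values for the same pair of distributions. The reverse direction penalizes the draft most heavily where it places probability on tokens the target considers unlikely.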
Training Configuration
- Hardware: NVIDIA H200 GPU
- Training Steps: 10,000 steps over 80,000 inference requests
- Learning Rate: 1e-4
- TTT Length: 5 tokens
- Speculative Steps: 5
- Update Interval: Every 10 requests
- Loss Weights:
  - NTP Loss: 1.0
  - Prediction Loss: 1.0
- KL Divergence: Reverse KL divergence (draft → target)
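The settings above could be collected in a small configuration object, together with a counter that signals when a draft-model update is due. This is a hypothetical sketch: the class and field names are invented for illustration and do not reflect Aurora's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftTrainingConfig:
    """Values copied from the Training Configuration list; names are hypothetical."""
    learning_rate: float = 1e-4
    ttt_length: int = 5            # TTT length, in tokens
    speculative_steps: int = 5
    update_interval: int = 10      # requests between draft-model updates
    ntp_loss_weight: float = 1.0
    prediction_loss_weight: float = 1.0


class UpdateScheduler:
    """Counts served requests and reports when an update is due."""

    def __init__(self, config: DraftTrainingConfig) -> None:
        self.config = config
        self.requests_seen = 0

    def record_request(self) -> bool:
        """Return True on every `update_interval`-th request."""
        self.requests_seen += 1
        return self.requests_seen % self.config.update_interval == 0
```

A serving loop would call `record_request()` once per completed request and trigger a training update whenever it returns `True`.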
Dataset
Trained on the OnlineSD Code Dataset, which contains diverse coding examples suitable for training speculative decoding models.
Benchmarks
End-to-End Throughput Performance
Measured on a holdout dataset from the OnlineSD Code Dataset using the final Aurora checkpoint.
Qwen-Coder-Next: end-to-end throughput under varying batch size and lookahead
We report tokens-per-second (TPS) statistics and speedup relative to the no-speculation baseline.
| BS | Config | Mean TPS | P50 TPS | P05 TPS | P95 TPS | Speedup (Mean) | Acc Len |
|:---:|:---------|:--------:|:-------:|:-------:|:-------:|:--------------:|:-------:|
| 1 | w/o spec | 176.4 | 178.0 | 172.3 | 178.4 | -- | -- |
| | lookahead 3 | 252.1 | 254.8 | 208.8 | 291.6 | 1.43× | 2.67 |
| | lookahead 4 | 263.1 | 264.0 | 211.8 | 312.7 | 1.49× | 2.91 |
| | lookahead 5 | 265.7 | 264.8 | 208.7 | 320.5 | 1.51× | 3.06 |
| 8 | w/o spec | 119.8 | 121.5 | 104.8 | 134.6 | -- | -- |
| | lookahead 3 | 141.0 | 138.9 | 110.4 | 178.5 | 1.18× | 2.67 |
| | lookahead 4 | 142.5 | 141.2 | 110.3 | 181.6 | 1.19× | 2.91 |
| | lookahead 5 | 146.3 | 143.5 | 109.6 | 189.5 | 1.23× | 3.07 |
| 16 | w/o spec | 99.6 | 102.1 | 74.5 | 119.2 | -- | -- |
| | lookahead 3 | 104.0 | 100.5 | 75.6 | 151.9 | 1.04× | 2.67 |
| | lookahead 4 | 105.6 | 101.1 | 77.5 | 149.7 | 1.06× | 2.92 |
| | lookahead 5 | 107.6 | 103.7 | 75.7 | 156.6 | 1.09× | 3.06 |
| 32 | w/o spec | 85.0 | 88.7 | 54.5 | 104.5 | -- | -- |
| | lookahead 3 | 78.9 | 72.8 | 53.0 | 122.3 | 0.93× | 2.68 |
| | lookahead 4 | 79.5 | 73.7 | 52.9 | 124.7 | 0.94× | 2.91 |
| | lookahead 5 | 80.3 | 72.6 | 52.8 | 130.7 | 0.94× | 3.06 |
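The Speedup (Mean) column appears to be the ratio of mean TPS with speculation to mean TPS without it at the same batch size. A short sketch that recomputes the batch size 1 and batch size 32 rows from the table's values (the dictionary layout and function name are illustrative):

```python
# Mean TPS values copied from the table above: {batch size: {config: mean TPS}}.
MEAN_TPS = {
    1: {"w/o spec": 176.4, "lookahead 3": 252.1, "lookahead 4": 263.1, "lookahead 5": 265.7},
    32: {"w/o spec": 85.0, "lookahead 3": 78.9, "lookahead 4": 79.5, "lookahead 5": 80.3},
}


def speedups(batch_size: int) -> dict:
    """Return mean-TPS speedup over the no-speculation baseline, rounded to 2 places."""
    rows = MEAN_TPS[batch_size]
    baseline = rows["w/o spec"]
    return {cfg: round(tps / baseline, 2) for cfg, tps in rows.items() if cfg != "w/o spec"}
```

Recomputing from the rounded means differs from the published column by 0.01 in two rows (lookahead 5 at batch sizes 8 and 16), presumably because the published speedups were derived from unrounded TPS values.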
Performance Across Different Batch Sizes
Aurora provides the largest gains at small-to-moderate batch sizes, with up to 1.51× speedup at batch size 1, demonstrating the effectiveness of speculative decoding for latency-critical scenarios. The benefits diminish as batch size increases:
- Batch Size 1 (Best Case): Up to 1.51× speedup with lookahead 5 configuration (3.06 average accept length). At low batch sizes, the cost of draft generation and verification is well amortized by reduced target model forward passes.
- Batch Size 8 (Moderate): 1.23× speedup with lookahead 5 configuration (3.07 average accept length). Speculative decoding still provides meaningful throughput improvements for moderate batching.
- Batch Size 16 (Diminishing Returns): 1.09× speedup with lookahead 5 configuration (3.06 average accept length). Benefits become marginal as verification overhead increases relative to baseline throughput.
- Batch Size 32 (Negative Returns): At large batch sizes, verification overhead dominates and speculative decoding becomes slightly slower than the baseline (0.93-0.94×). The…
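One way to read these numbers, under a simplifying assumption that is not stated in this card: if each speculative iteration emits the accept length in tokens on average, while the baseline emits one token per target forward pass, then the speedup equals the accept length divided by the cost of one iteration relative to one baseline pass. Rearranging gives an implied relative cost per iteration. The sketch below applies this to the lookahead 5 rows; the function name is illustrative.

```python
# (speedup, accept length) for lookahead 5, copied from the benchmark table.
LOOKAHEAD_5 = {1: (1.51, 3.06), 8: (1.23, 3.07), 16: (1.09, 3.06), 32: (0.94, 3.06)}


def implied_iteration_cost(speedup: float, accept_len: float) -> float:
    """Cost of one draft+verify iteration in units of one baseline decode step.

    Assumes speedup = accept_len / relative_cost, which ignores scheduling
    and other overheads outside the draft and verify passes.
    """
    return accept_len / speedup


costs = {bs: round(implied_iteration_cost(s, a), 2) for bs, (s, a) in LOOKAHEAD_5.items()}
```

Under this model the implied cost per iteration rises from roughly 2 baseline steps at batch size 1 to more than 3 at batch size 32, while the accept length stays near 3.06. Speculation stops paying off once that cost exceeds the accept length.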