togethercomputer/Aurora-Spec-Minimax-M2.5
Model Description
This is an EAGLE3 draft model trained from scratch (random initialization) with the Aurora inference-time training framework for speculative decoding. Unlike approaches that fine-tune a pre-trained model, it is built entirely through Aurora's online training process. The model is optimized to generate high-quality draft tokens for the MiniMax M2.5 target model, achieving speedups across various batch sizes (up to 1.58× at batch size 1).
Key Features
- Training Approach: Trained from scratch (random initialization) - no pre-training required
- Framework: Trained with Aurora - an advanced inference-time training system
- Architecture: EAGLE3 speculative decoding draft model
- Target Model: MiniMax M2.5
- Performance: Achieves 2.62 average accept length with lookahead 4 (recommended configuration)
- Training: 44,000 inference requests on NVIDIA H200 GPU
- Speedup: Up to 1.58× speedup at batch size 1 (lookahead 3), 1.57× with lookahead 4 (recommended)
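The accept length and speedup figures above can be related with a back-of-the-envelope calculation. The sketch below assumes a simplified cost model, speedup = accept length / relative cost of one draft-plus-verify step, and also assumes that the 2.62 accept length and the 1.57× speedup were measured under the same conditions. Neither assumption is stated by the model card, and `implied_step_cost` is a hypothetical helper name.

```python
def implied_step_cost(accept_length: float, speedup: float) -> float:
    """Relative cost of one draft-plus-verify step versus one plain decode step.

    Simplified model (an assumption, not from the model card):
        speedup = accept_length / step_cost
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return accept_length / speedup


# Figures reported above for lookahead 4.
cost = implied_step_cost(accept_length=2.62, speedup=1.57)
print(f"implied relative step cost: {cost:.2f}")
```

Under this model, a speculative step that emits 2.62 tokens on average would cost roughly 1.67 times a plain decode step, which is where the gap between accept length and end-to-end speedup comes from.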
Target Model
This draft model is specifically designed to work with:
- Model: MiniMax M2.5
- Type: General-purpose language model
- Domain: Broad language understanding and generation
The draft model learns to predict the target model's token distribution during inference-time training, enabling efficient speculative decoding.
Architecture
EAGLE3 Speculative Decoding
This model implements the EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency) architecture:
- Draft Model: Lightweight model that generates candidate tokens
- Tree-based Attention: Enables parallel verification of multiple draft tokens
- Auto-regressive Generation: Produces speculative token sequences
- Dynamic Adaptation: Updates during inference to match target model distribution
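The draft-then-verify cycle listed above can be illustrated with a toy example. This is a minimal sketch of greedy speculative decoding with a single draft chain (top-k 1), not the EAGLE3 or SGLang implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the two models, and a real engine verifies all draft positions in one batched forward pass rather than a Python loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     lookahead: int) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy, chain draft)."""
    # 1. The draft model proposes `lookahead` tokens auto-regressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(lookahead):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             lookahead: int, max_new_tokens: int) -> Tuple[List[int], int]:
    """Run speculative steps until enough tokens exist; return (tokens, steps)."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, lookahead))
        steps += 1
    return out[len(prefix):len(prefix) + max_new_tokens], steps


def toy_target(ctx: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after 3 and 7."""
    return 0 if ctx[-1] % 4 == 3 else (ctx[-1] + 1) % 10


tokens, steps = generate([0], toy_draft, toy_target, lookahead=4, max_new_tokens=12)
print(f"{len(tokens)} tokens in {steps} verification steps")
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone; the draft only changes how many target steps are needed.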
Model Structure
- Initialization: Trained from scratch (random initialization, no pre-training)
- Base Architecture: Single-layer Transformer decoder
- Recommended Configuration: Lookahead 4 (speculative_num_steps=4)
- Attention Mechanism: Tree-based for parallel draft verification
- Training Paradigm: Online learning during inference (Aurora framework)
Training Details
Aurora Framework
This model was trained from scratch using Aurora, an inference-time training framework that:
- No Pre-training Required: Starts from random initialization and learns entirely through online training
- Updates the draft model dynamically during inference
- Uses reverse KL divergence for distribution matching (minimizing KL(draft || target))
- Employs online learning with periodic model updates
- Optimizes for both draft quality and speculative acceptance rate
- Demonstrates that effective draft models can be built from scratch without expensive pre-training
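The distribution-matching objective can be made concrete with a small numeric example. The sketch below assumes the common convention in which reverse KL takes the expectation under the model being trained (here the draft), i.e. KL(draft || target). The probability vectors are hypothetical, and Aurora's actual loss implementation is not shown in this document.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for two discrete distributions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical next-token probabilities over a three-token vocabulary.
target = [0.70, 0.20, 0.10]
draft = [0.50, 0.30, 0.20]

reverse_kl = kl_divergence(draft, target)  # expectation under the draft
forward_kl = kl_divergence(target, draft)  # expectation under the target

print(f"reverse KL(draft || target) = {reverse_kl:.4f}")
print(f"forward KL(target || draft) = {forward_kl:.4f}")
```

The two directions generally give different values, so which one is minimized affects how the draft spreads its probability mass relative to the target.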
Training Configuration
- Hardware: NVIDIA B200 GPU
- Training Requests: 12,000 inference requests initialized from togethercomputer/Aurora-Spec-Minimax-M2.5
- Synchronization Interval: Every 800 requests
- Recommended Configuration: Lookahead 4
- KL Divergence: Reverse KL divergence (draft → target)
- Weights & Biases training run: https://wandb.ai/LIFT_ITT/inference-time-training/runs/gnfacv1r?nw=nwuserxwushirley1
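To see what the synchronization interval implies, the following sketch simulates an online loop that buffers requests and triggers a sync every 800 of them. It only illustrates the schedule: `run_online_training` and `on_sync` are hypothetical names, and the request IDs stand in for whatever training signal Aurora actually collects per request.

```python
from typing import Callable, List


def run_online_training(total_requests: int, sync_interval: int,
                        on_sync: Callable[[List[int]], None]) -> int:
    """Simulate serving requests and syncing the draft model periodically.

    Returns the number of synchronizations performed.
    """
    buffer: List[int] = []
    syncs = 0
    for request_id in range(1, total_requests + 1):
        buffer.append(request_id)  # stand-in for data collected from this request
        if request_id % sync_interval == 0:
            on_sync(list(buffer))  # hand over a copy, then start a fresh buffer
            buffer.clear()
            syncs += 1
    return syncs


batch_sizes: List[int] = []
num_syncs = run_online_training(
    total_requests=12_000,
    sync_interval=800,
    on_sync=lambda batch: batch_sizes.append(len(batch)),
)
print(f"{num_syncs} synchronizations of {batch_sizes[0]} requests each")
```

With the configuration listed above, 12,000 requests at an interval of 800 works out to 15 synchronizations.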
Dataset
Trained on diverse prompts suitable for general-purpose language modeling and speculative decoding.
Usage
This model is designed to be used as a draft model in EAGLE3 speculative decoding pipelines with MiniMax M2.5 as the target model.
Example 1: Python API (Offline Batch Inference)
import sglang as sgl


def main():
    # Sample prompts
    prompts = [
        "Explain the concept of quantum computing:",
        "Write a short story about a time traveler:",
        "Describe the process of photosynthesis:",
    ]

    # Create sampling params
    sampling_params = {"temperature": 0.7, "max_new_tokens": 256}

    # Initialize engine with speculative decoding (lookahead 4 - recommended)
    llm = sgl.Engine(
        model_path="MiniMaxAI/MiniMax-M2.5",
        speculative_draft_model_path="togethercomputer/Aurora-Spec-Minimax-M2.5",
        speculative_algorithm="EAGLE3",
        speculative_num_steps=4,  # Recommended: lookahead 4
        speculative_eagle_topk=1,
        speculative_num_draft_tokens=5,  # Same value as the server commands in Example 2
        dtype="bfloat16",
        trust_remote_code=True,
    )

    # Generate with speculative decoding
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs
    for prompt, output in zip(prompts, outputs):
        print("=" * 50)
        print(f"Prompt: {prompt}")
        print(f"Generated: {output['text']}")


# The __main__ guard is necessary when using spawn to create subprocesses
if __name__ == "__main__":
    main()

Example 2: Launch Server (Production Use)
Step 1: Start the SGLang server with speculative decoding
python -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.5 \
  --speculative-draft-model-path togethercomputer/Aurora-Spec-Minimax-M2.5 \
  --speculative-algorithm EAGLE3 \
  --speculative-num-steps 4 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 5 \
  --dtype bfloat16 \
  --trust-remote-code \
  --port 30000 \
  --host 0.0.0.0
Step 2: Send requests to the server
import requests
import json
# Server endpoint
url = "http://localhost:30000/v1/completions"
# Request payload
payload = {
    "prompt": "Explain the concept of quantum computing:",
    "max_tokens": 256,
    "temperature": 0.7,
}
# Send request
response = requests.post(url, json=payload)
result = response.json()
print(result["choices"][0]["text"])

Or using OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
response = client.completions.create(
    model="MiniMaxAI/MiniMax-M2.5",
    prompt="Explain the concept of quantum computing:",
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].text)
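To check the effect of speculative decoding on a running server, one option is to time a request and divide the generated token count by the elapsed time. The sketch below assumes the response follows the OpenAI-compatible completions format with a `usage.completion_tokens` field. `tokens_per_second` is a hypothetical helper, and `fake_send` is a stub that stands in for a real request so the example runs without a server.

```python
import time
from typing import Any, Callable, Dict


def tokens_per_second(send_request: Callable[[], Dict[str, Any]]) -> float:
    """Time one request and return generated tokens per second.

    Assumes the response dict contains usage.completion_tokens, as in the
    OpenAI-compatible completions format.
    """
    start = time.perf_counter()
    result = send_request()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return result["usage"]["completion_tokens"] / elapsed


def fake_send() -> Dict[str, Any]:
    """Stub standing in for a real HTTP request (hypothetical response)."""
    time.sleep(0.01)
    return {"usage": {"completion_tokens": 256}}


rate = tokens_per_second(fake_send)
print(f"throughput: {rate:.0f} tokens/s")
```

Against a live server, `send_request` could be `lambda: requests.post(url, json=payload).json()`, using the `url` and `payload` from Step 2. Comparing the result with and without the speculative flags gives a rough single-request speedup estimate.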
Local Model Paths
If you have downloaded the models locally, replace the HuggingFace model paths with local paths:
python -m sglang.launch_server \
  --model-path /path/to/MiniMax-M2.5 \
  --speculative-draft-model-path /path/to/Aurora-Spec-Minimax-M2.5 \
  --speculative-algorithm EAGLE3 \
  --speculative-num-steps 4 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 5 \
  --dtype bfloat16 \
  --trust-remote-code \
  --port 30000
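When switching between Hub IDs and local paths, it can help to assemble the launch command programmatically. The sketch below uses only the flags shown in the commands above. `build_launch_command` is a hypothetical helper, and setting the draft-token count to `num_steps + 1` is an assumption that mirrors the 4/5 pairing in those commands.

```python
import shlex
from typing import List


def build_launch_command(model_path: str, draft_model_path: str,
                         num_steps: int = 4, port: int = 30000) -> List[str]:
    """Assemble the SGLang launch command as an argument list."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--speculative-draft-model-path", draft_model_path,
        "--speculative-algorithm", "EAGLE3",
        "--speculative-num-steps", str(num_steps),
        "--speculative-eagle-topk", "1",
        # Assumption: one more draft token than steps, as in the commands above.
        "--speculative-num-draft-tokens", str(num_steps + 1),
        "--dtype", "bfloat16",
        "--trust-remote-code",
        "--port", str(port),
    ]


cmd = build_launch_command(
    model_path="/path/to/MiniMax-M2.5",
    draft_model_path="/path/to/Aurora-Spec-Minimax-M2.5",
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting issues with paths that contain spaces.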
Limitations
- Optimized specifically for MiniMax M2.5 target model
- Performance may vary with different target models
- Requires compatible EAGLE3 inference…