Model by Together AI, published Feb 19, 2026

togethercomputer/Aurora-Spec-Minimax-M2.5

Task: text-generation | License: apache-2.0 | Params: 858M | Downloads: 312 | Likes: 6

Aurora-Spec-Minimax-M2.5

Model Description

This is an EAGLE3 draft model trained from scratch (random initialization) with the Aurora inference-time training framework for speculative decoding. Rather than fine-tuning a pre-trained model, it is built entirely through Aurora's online training process. The model is optimized to generate draft tokens for the MiniMax M2.5 target model and delivers speedups across batch sizes, up to 1.58× at batch size 1.

Key Features

  • Training Approach: Trained from scratch (random initialization) - no pre-training required
  • Framework: Trained with Aurora - an advanced inference-time training system
  • Architecture: EAGLE3 speculative decoding draft model
  • Target Model: MiniMax M2.5
  • Performance: Achieves 2.62 average accept length with lookahead 4 (recommended configuration)
  • Training: 44,000 inference requests on NVIDIA H200 GPU
  • Speedup: Up to 1.58× speedup at batch size 1 (lookahead 3), 1.57× with lookahead 4 (recommended)
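
The accept length and speedup figures above can be related with a little arithmetic. The sketch below assumes a deliberately simple cost model that is not taken from the model card: plain decoding emits one token per target step, and speculative decoding emits `accept_length` tokens per draft-and-verify round. The function name `implied_round_cost` is hypothetical.

```python
def implied_round_cost(accept_length: float, speedup: float) -> float:
    """Cost of one draft-and-verify round, in units of one plain target decoding step.

    Assumed model (illustrative only): speedup = accept_length / round_cost.
    """
    if accept_length <= 0 or speedup <= 0:
        raise ValueError("accept_length and speedup must be positive")
    return accept_length / speedup


# Figures quoted in Key Features for lookahead 4 at batch size 1.
cost = implied_round_cost(accept_length=2.62, speedup=1.57)
print(f"implied cost per round: {cost:.2f} target steps")
```

Under this assumption, each round costs roughly 1.67 plain target steps, which is why the end-to-end speedup is lower than the accept length: drafting and tree verification add overhead to every round.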

Target Model

This draft model is specifically designed to work with:

  • Model: MiniMax M2.5
  • Type: General-purpose language model
  • Domain: Broad language understanding and generation

The draft model learns to predict the target model's token distribution during inference-time training, enabling efficient speculative decoding.

Architecture

EAGLE3 Speculative Decoding

This model implements the EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency) architecture:

  • Draft Model: Lightweight model that generates candidate tokens
  • Tree-based Attention: Enables parallel verification of multiple draft tokens
  • Auto-regressive Generation: Produces speculative token sequences
  • Dynamic Adaptation: Updates during inference to match target model distribution
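
The draft-then-verify idea behind the list above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the EAGLE3 or SGLang implementation: it uses greedy decoding, a single draft chain (top-k 1) instead of a tree, and two hypothetical stand-in "models" (`toy_target`, `toy_draft`) that are plain Python functions.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function (toy stand-in for a model)


def toy_target(context: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after token 3."""
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], num_steps: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it commits."""
    # 1) Draft: propose num_steps tokens auto-regressively with the cheap model.
    proposed: List[int] = []
    for _ in range(num_steps):
        proposed.append(draft(context + proposed))

    # 2) Verify: compare each proposal with the target's own greedy choice.
    #    A real engine scores all positions in one batched target pass;
    #    this toy calls the target once per position for clarity.
    committed: List[int] = []
    for tok in proposed:
        expected = target(context + committed)
        if tok != expected:
            committed.append(expected)  # first mismatch: keep the target's token, stop
            return committed
        committed.append(tok)

    # 3) Every draft token was accepted: the target adds one extra token.
    committed.append(target(context + committed))
    return committed


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, num_steps: int = 4) -> Tuple[List[int], int]:
    """Generate tokens with speculative rounds; return (new tokens, rounds used)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, num_steps))
        rounds += 1
    return out[len(prompt):len(prompt) + max_new_tokens], rounds


tokens, rounds = generate(toy_target, toy_draft, prompt=[0], max_new_tokens=8)
print(tokens, rounds)
```

The output is identical to what the target would produce on its own; the draft only changes how many target rounds are needed. The average number of tokens committed per round is what the model card reports as accept length.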

Model Structure

  • Initialization: Trained from scratch (random initialization, no pre-training)
  • Base Architecture: Single-layer Transformer decoder
  • Recommended Configuration: Lookahead 4 (speculative_num_steps=4)
  • Attention Mechanism: Tree-based for parallel draft verification
  • Training Paradigm: Online learning during inference (Aurora framework)

Training Details

Aurora Framework

This model was trained from scratch using Aurora, an inference-time training framework that:

  • Requires no pre-training: starts from random initialization and learns entirely through online training
  • Updates the draft model dynamically during inference
  • Uses reverse KL divergence for distribution matching (minimizing KL(draft || target))
  • Employs online learning with periodic model updates
  • Optimizes for both draft quality and speculative acceptance rate
  • Demonstrates that effective draft models can be built from scratch without expensive pre-training
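
To make the distribution-matching objective concrete, here is a small sketch of the two KL directions on a single next-token distribution. It assumes the common convention in which "reverse" KL takes the expectation under the model being trained (here the draft), i.e. KL(draft || target). The probability values are made up for illustration and the helper names are hypothetical; this is not Aurora's training code.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; eps guards against log(0) in q.
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def reverse_kl(draft: Sequence[float], target: Sequence[float]) -> float:
    """Expectation under the draft: penalizes draft mass where the target has little."""
    return kl_divergence(draft, target)


def forward_kl(draft: Sequence[float], target: Sequence[float]) -> float:
    """Expectation under the target: penalizes missing target mass in the draft."""
    return kl_divergence(target, draft)


# Illustrative next-token distributions over a 3-token vocabulary.
target_probs = [0.7, 0.2, 0.1]
draft_probs = [0.5, 0.3, 0.2]

print("reverse KL:", reverse_kl(draft_probs, target_probs))
print("forward KL:", forward_kl(draft_probs, target_probs))
```

The two directions generally give different values for the same pair of distributions, so the choice affects which mismatches the draft model is pushed to correct first.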

Training Configuration

  • Hardware: NVIDIA B200 GPU
  • Training Requests: 12,000 inference requests initialized from togethercomputer/Aurora-Spec-Minimax-M2.5
  • Synchronization Interval: Every 800 requests
  • Recommended Configuration: Lookahead 4
  • KL Divergence: Reverse KL divergence (draft → target)
  • Training run (Weights & Biases): https://wandb.ai/LIFT_ITT/inference-time-training/runs/gnfacv1r?nw=nwuserxwushirley1
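
The synchronization interval can be pictured as a counter that buffers served requests and triggers a draft-model update once enough have accumulated. The sketch below is purely illustrative: the class `PeriodicSync` and its methods are hypothetical, and the actual training and weight-push step is reduced to a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PeriodicSync:
    """Buffers request traces and triggers a sync every `sync_interval` requests."""
    sync_interval: int = 800            # interval stated in the training configuration
    pending: List[str] = field(default_factory=list)
    syncs: int = 0

    def record(self, request_id: str) -> bool:
        """Buffer one served request; return True if this call triggered a sync."""
        self.pending.append(request_id)
        if len(self.pending) >= self.sync_interval:
            self._sync()
            return True
        return False

    def _sync(self) -> None:
        # Placeholder: a real system would train the draft model on the buffered
        # traces here and publish the updated weights to the serving engine.
        self.pending.clear()
        self.syncs += 1


tracker = PeriodicSync()
for i in range(2000):                   # arbitrary request count for the demo
    tracker.record(f"req-{i}")
print(tracker.syncs, len(tracker.pending))
```

With an interval of 800, the 2,000 demo requests trigger two syncs and leave 400 requests buffered for the next one.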

Dataset

Trained on diverse prompts suitable for general-purpose language modeling and speculative decoding.

Usage

This model is designed to be used as a draft model in EAGLE3 speculative decoding pipelines with MiniMax M2.5 as the target model.

Example 1: Python API (Offline Batch Inference)

import sglang as sgl


def main():
    # Sample prompts
    prompts = [
        "Explain the concept of quantum computing:",
        "Write a short story about a time traveler:",
        "Describe the process of photosynthesis:",
    ]

    # Create sampling params
    sampling_params = {"temperature": 0.7, "max_new_tokens": 256}

    # Initialize engine with speculative decoding (lookahead 4 - recommended)
    llm = sgl.Engine(
        model_path="MiniMaxAI/MiniMax-M2.5",
        speculative_draft_model_path="togethercomputer/Aurora-Spec-Minimax-M2.5",
        speculative_algorithm="EAGLE3",
        speculative_num_steps=4,  # Recommended: lookahead 4
        speculative_eagle_topk=1,
        speculative_num_draft_tokens=5,  # Same value as the server example
        dtype="bfloat16",
        trust_remote_code=True,
    )

    # Generate with speculative decoding
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs
    for prompt, output in zip(prompts, outputs):
        print("=" * 50)
        print(f"Prompt: {prompt}")
        print(f"Generated: {output['text']}")


# The __main__ condition is necessary when using spawn to create subprocesses
if __name__ == "__main__":
    main()

Example 2: Launch Server (Production Use)

Step 1: Start the SGLang server with speculative decoding

python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2.5 \
    --speculative-draft-model-path togethercomputer/Aurora-Spec-Minimax-M2.5 \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5 \
    --dtype bfloat16 \
    --trust-remote-code \
    --port 30000 \
    --host 0.0.0.0

Step 2: Send requests to the server

import requests

# Server endpoint (OpenAI-compatible completions route)
url = "http://localhost:30000/v1/completions"

# Request payload
payload = {
    "model": "MiniMaxAI/MiniMax-M2.5",
    "prompt": "Explain the concept of quantum computing:",
    "max_tokens": 256,
    "temperature": 0.7,
}

# Send request and fail loudly on HTTP errors
response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
result = response.json()

print(result["choices"][0]["text"])

Or using OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

response = client.completions.create(
    model="MiniMaxAI/MiniMax-M2.5",
    prompt="Explain the concept of quantum computing:",
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].text)

Local Model Paths

If you have downloaded the models locally, replace the HuggingFace model paths with local paths:

python -m sglang.launch_server \
    --model-path /path/to/MiniMax-M2.5 \
    --speculative-draft-model-path /path/to/Aurora-Spec-Minimax-M2.5 \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5 \
    --dtype bfloat16 \
    --trust-remote-code \
    --port 30000
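
When switching between Hub IDs and local paths, it can help to assemble the launch command in one place so the speculative settings stay consistent. The helper below is a hypothetical convenience, not part of SGLang; it only reuses the flags shown in the commands above. It assumes that with top-k 1 the draft token count is the lookahead plus one, which matches the 4 and 5 used in those commands but should be checked against your SGLang version.

```python
import shlex
from typing import List


def build_launch_command(model_path: str, draft_path: str,
                         num_steps: int = 4, port: int = 30000) -> List[str]:
    """Assemble the server launch command used in this model card as an argv list."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    # Assumption: single draft chain (top-k 1), so draft tokens = lookahead + 1.
    num_draft_tokens = num_steps + 1
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--speculative-draft-model-path", draft_path,
        "--speculative-algorithm", "EAGLE3",
        "--speculative-num-steps", str(num_steps),
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", str(num_draft_tokens),
        "--dtype", "bfloat16",
        "--trust-remote-code",
        "--port", str(port),
    ]


cmd = build_launch_command("/path/to/MiniMax-M2.5", "/path/to/Aurora-Spec-Minimax-M2.5")
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

The returned list can also be passed directly to `subprocess.run`, which avoids shell quoting issues with paths that contain spaces.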

Limitations

  • Optimized specifically for MiniMax M2.5 target model
  • Performance may vary with different target models
  • Requires compatible EAGLE3 inference…
