WritingInclusionAI (Ant Group)InclusionAI (Ant Group)published Aug 5, 2025seen 5d

Introducing Ring-lite-2507

Open original ↗

Captured source

source ↗
published Aug 5, 2025seen 5dcaptured 3dhttp 200method plain

Introducing Ring-lite-2507 | INCLUSION AI

Skip to main content 📖 Technical Report | 🤗 Hugging Face | 🤖 ModelScope

Overview ​

We present Ring-lite-2507 , an upgraded version of our previously released lightweight reasoning model, Ring-lite (2506). Built upon a 16.8B Mixture-of-Experts (MoE) large language model with 2.75B activated parameters, Ring-lite-2507 further advances its reasoning capabilities while demonstrating superior performance across a comprehensive range of LLM benchmarks, including general text understanding, alignment, coding, logical, and agentic tasks. Thanks to our innovative and robust reinforcement learning training pipeline, Ring-lite-2507 distinguishes itself from the latest public dense models under 10B parameters by offering competitive performance across various tasks, despite activating only 1/3 of their parameter size.

To address the optimization instability of MoE RL training, we propose a novel approach, Constrained Contextual Computation Policy Optimization(C3PO), which enhances training stability and improves computational throughput via algorithm-system co-design. Additionally, we systematically investigate the dynamic relationship between long chain-of-thought SFT and RL training. Rather than relying solely on validation metrics, we explore optimal strategies for selecting the suitable fine-tuned model for RL scaling, yielding superior performance-efficiency trade-offs in our RL training pipeline. Last, we develop a two-stage training paradigm to harmonize multi-domain data integration, enhancing reasoning ability while effectively improving performance across various downstream general tasks.

Highlights

🚀 Superior performance across tasks : Ring-lite-2507 demonstrates outstanding performance across both reasoning and general tasks;

🔥 Only 2.75B activated parameters : Ring-lite-2507 is built upon a Mixture-of-Experts (MoE)-based large language model with only 2.75 billion activated parameters;

⛓️‍💥 Algorithm-system co-design : We proposed novel C3PO approach and employ token efficiency to improve training stability and effectiveness;

🔍 Publicly available : We fully release our training recipe and model weights.

Evaluation ​

We conduct a comprehensive evaluation of our models across two main domains: reasoning and general. We utilize a diverse set of public benchmarks, organized according to the specific aspects they measure.

Knowledge Understanding ​

Benchmark Ring-lite-2507 Ring-lite-2506 Qwen3-8B-Thinking MMLU-Pro (EM) 72.50 63.44 72.56 GPQA-Diamond (Pass@1) 69.35 63.51 62.00 SuperGPQA (EM) 40.05 13.97 40.36 Phybench (Pass@1) 28.51 29.19 22.14

Math ​

Benchmark Ring-lite-2507 Ring-lite-2506 Qwen3-8B-Thinking MATH-500 (Pass@1) 97.95 96.80 97.30 CNMO 2024 (Pass@1) 75.09 77.26 74.57 AIME 2024 (Pass@1) 79.79 79.00 74.90 AIME 2025 (Pass@1) 72.92 69.50 67.19 LiveMathBench (Pass@1) 83.37 85.08 81.90 TheoremQA (Pass@1) 70.00 70.19 68.81 OlympiadBench (math) (Pass@1) 80.64 82.86 80.20

Coding ​

Benchmark Ring-lite-2507 Ring-lite-2506 Qwen3-8B-Thinking LiveCodeBench(2408-2505) (Pass@1) 60.35 59.53 55.12 Codeforces(Percentile) (Pass@1) 1830 1673 1580 Codeforces(Rating) 92.16 88.00 79.44

Reasoning & Agentic ​

Benchmark Ring-lite-2507 Ring-lite-2506 Qwen3-8B-Thinking DROP (zero-shot F1) 89.27 60.21 87.13 BBH (EM) 88.65 50.84 87.30 ARCPrize (Pass@1) 19.00 3.12 3.88 MuSR (EM) 77.19 66.77 76.92 BFCL_Live (Pass@1) 74.81 66.76 75.99

Alignment ​

Benchmark Ring-lite-2507 Ring-lite-2506 Qwen3-8B-Thinking IFEval (Prompt Strict) 84.66 54.34 85.40 AlignBench v1.1(gpt-4.1) 80.90 69.60 74.70 FoFo (gpt-4-turbo) 85.02 67.81 81.93 ArenaHard (gpt-4.1) 88.85 56.12 86.14

Constrained Contextual Computation Policy Optimization(C3PO) ​

We introduce C onstrained C ontextual C omputation P olicy O ptimization(C3PO), an innovative token-level optimization framework designed to mitigate training instability while enhancing throughput consistency. Different from sampling-level filtering, C3PO operates at the token level by sampling tokens to form a token-level global batch, each training step maintains consistent token input to optimizer, which results in reduced gradient variance and consequently achieving stable optimization.

C3PO

Balancing Token efficiency between Distillation and RL ​

While distillation is effective, we find that it requires more training tokens to achieve comparable performance than RL training. Furthermore, we observe that varying the number of training epochs for the distilled model significantly influences the trend of entropy loss, thereby affecting the exploration scope for RL. Our experiments show that increasing the number of SFT training epochs leads to a rapid collapse in entropy, whereas insufficient SFT training inevitably results in inferior performance. To systematically quantify the choice of optimal SFT epoch, we employ token efficiency to determine the suitable checkpoint for RL scaling.

Training Data ​

To ensure a high-quality training dataset for reinforcement learning, we established a comprehensive and meticulous data curation pipeline. This pipeline encompasses several key stages, such as data cleansing, answer verification, and data annotation, all designed to thoroughly decontaminate the data and ensure it is both suitable and informative for RL training.

Data Pipeline

Training Pipeline ​

Training Pipeline

Reasoning RL ​

Compared to our previously released Ring-lite-2506, we expanded our reasoning dataset by incorporating more challenging math, coding, and STEM data. Specifically, we adopted 67K math problems, 32K coding problems, and 9.9K scientific problems for reasoning RL training. In addition, we amplified our reasoning dataset by including more than 19K logical games, such as ARC-AGI, Countdown, Sudoku, AlphaMaze, etc. For each type of problem, we specifically designed suitable reward functions to ensure our training examples are verifiable.

General RL ​

Apart from reasoning tasks, our Ring-lite-2507 has significantly expanded the collection of general datasets for RL training. Our general RL training does not compromise performance on reasoning tasks; instead, it enhances overall text understanding across a broad range of general benchmarks.

Our general RL training incorporates a variety of tasks, including instruction following, question answering, text summarization, and more. For open-ended questions, we employ a robust reward model to assign appropriate scores. Additionally, we have…

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

New lightweight model, low traction