openbmb/MiniCPM5-1B-SFT
Captured source
source ↗MiniCPM Tech Report | GitHub Repo | UltraData | MiniCPM Desk Pet | Online Demo
English | 中文
Highlights
We are releasing MiniCPM5-1B, the first model in the MiniCPM5 series. It is a dense 1B Transformer built for on-device, local deployment, and resource-constrained scenarios, reaching 1B-class open-source SOTA.
🏆 1B-class open-source SOTA: compared with strong open-source models in the same size class, MiniCPM5-1B reaches SOTA within this comparison set. Its advantage is most visible in agentic tool use, code generation, and difficult reasoning.
!MiniCPM5-1B capability comparison by domain
🧠 Hybrid Reasoning: built-in ` chat template, switch via enable_thinking`. The same checkpoint serves as both a fast assistant and a deliberate reasoner.
🛠️ Deployment / Fine-tuning Resources: the MiniCPM GitHub repo provides single-page cookbooks and Agent Skills for major inference backends and fine-tuning frameworks.
🐱 Desktop Pet: a local-LLM desktop pet driven by MiniCPM5-1B.
Model List
Use this directory to choose the model format that matches your runtime:
- [MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B) · ModelScope · BF16 final release (post-trained with RL + OPD)
- [MiniCPM5-1B-SFT](https://huggingface.co/openbmb/MiniCPM5-1B-SFT) · ModelScope · BF16 SFT-only checkpoint (before RL / OPD) 👈 you are here
- [MiniCPM5-1B-Base](https://huggingface.co/openbmb/MiniCPM5-1B-Base) · ModelScope · BF16 base checkpoint (pre-training only)
- [MiniCPM5-1B-GGUF](https://huggingface.co/openbmb/MiniCPM5-1B-GGUF) · ModelScope · GGUF for llama.cpp / Ollama / LM Studio
- [MiniCPM5-1B-MLX](https://huggingface.co/openbmb/MiniCPM5-1B-MLX) · ModelScope · MLX / 4bit for Apple Silicon
Model Information
MiniCPM5-1B has the following features:
- Type: Causal Language Model
- Architecture: Standard
LlamaForCausalLM - Number of Parameters: 1,080,632,832
- Number of Non-Embedding Parameters: 679,552,512
- Number of Layers: 24
- Number of Attention Heads (GQA): 16 for Q and 2 for KV
- Context Length: 131,072
Introduction
MiniCPM5-1B is the first checkpoint in the MiniCPM5 series. It is designed for local assistants, coding agents, tool-use workflows, and reasoning scenarios where a compact model is preferred. The model keeps a small deployment footprint while providing native long-context support and both Think / No Think chat modes through the same checkpoint.
Evaluation Results
We compare MiniCPM5-1B with strong open-source models in the same size class, including LFM2.5-1.2B-Thinking, Qwen3-0.6B/think and Qwen3.5-0.8B/think. These are capable baselines; within this comparison set, MiniCPM5-1B reaches 1B-class open-source SOTA, with its advantage most visible in tool use, code generation, and difficult reasoning. This makes it a practical choice for local coding agents, tool assistants, and reasoning assistants.
!MiniCPM-5 1B Public Leaderboard
Training Recipe
The training of MiniCPM5-1B is a full-stack practice of [UltraData Tiered Data Management](https://arxiv.org/pdf/2602.09003), covering three stages: base training, mid-training, and post-training.
During base training, the model goes through stable training and decay training to build core language capability and training stability. It then enters mid-training to further strengthen target capabilities and adapt to the target data distribution. The training corpus is released alongside the model as Ultra-FineWeb, Ultra-FineWeb-L3, and UltraData-Math.
During post-training, we proceed in three steps: SFT, RL, and OPD. We first use 200B tokens of deep-thinking SFT and 200B tokens of hybrid-thinking SFT to establish deep-thinking, hybrid-thinking, and general chat abilities; the SFT data is released as UltraData-SFT-2605. We then train specialized RL teachers for math, code, closed-book QA, writing, and related domains, and use On-Policy Distillation (OPD) to distill these teachers back into one release model.
What does RL + OPD bring?
RL + OPD is a key part of MiniCPM5-1B post-training. On math, code and instruction-following tasks, RL + OPD raises the average score by ↑16 points while cutting the share of responses that hit the max-tokens budget by ↓29 percentage points. The figures below show the two-stage Reasoning RL pipeline, score gains, and the drop in overlong responses.
RL combines complementary training signals for reasoning, closed-book QA, writing, instruction following, long-context understanding, and general dialogue. Reasoning RL is based on DAPO-Math-17k, follows the minimalist recipe of JustRL, and further adds a two-stage length schedule to reduce overlong responses while improving reasoning accuracy. We also use TriviaQA, NQ-Open, LongWriter-Zero-RLData, synthesized verifiable RLVR data, and pair-wise RLHF signals to improve reliability, instruction following, and user experience.
!MiniCPM5-1B RL Two-stage Pipeline
OPD builds on Thinking Machines Lab's On-Policy Distillation and incorporates implementation improvements from…
Excerpt shown — open the source for the full document.
Notability
notability 6.0/10Moderate traction SFT model release