stepfun-ai/Step3-VL-10B-FP8
📢 News & Updates
- 🚀 Online Demo: Explore Step3-VL-10B on Hugging Face Spaces!
- 📢 [Notice] FP8 Quantization Support: FP8 quantized weights are now available. (Download link)
- 📢 [Notice] vLLM Support: vLLM integration is now officially supported! (PR #32329)
- ✅ [Fixed] HF Inference: Resolved the `eos_token_id` misconfiguration in `config.json` that caused infinite generation loops. (PR #abdf3)
- ✅ [Fixing] Metric Correction: We sincerely apologize for inaccuracies in the Qwen3VL-8B benchmarks (e.g., AIME, HMMT, LCB). The errors were caused by an incorrect `max_tokens` setting (mistakenly set to 32k) during our large-scale evaluation process. We are re-running the tests and will provide corrected numbers in the next version of the technical report.
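The two fixes above both concern when generation stops. As a minimal sketch (a toy decoding loop, not the Hugging Face or vLLM implementation), the snippet below shows how a wrong `eos_token_id` keeps a loop running until the token budget is exhausted. It also shows how a budget that is too small cuts a response off before it finishes, which is one way a mis-set `max_tokens` could distort benchmark scores. The token ids, `toy_model`, and `generate` are hypothetical names chosen for illustration.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             eos_token_id: int,
             max_new_tokens: int) -> List[int]:
    """Toy decoding loop: stop on EOS or when the token budget runs out."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == eos_token_id:
            break
    return tokens[len(prompt):]


# Hypothetical toy "model": emits three content tokens (id 7),
# then its real end-of-sequence token (id 2) on every later step.
REAL_EOS = 2


def toy_model(tokens: List[int]) -> int:
    return 7 if len(tokens) < 4 else REAL_EOS


# Correct EOS id: generation stops as soon as the model emits EOS.
good = generate(toy_model, [1], eos_token_id=REAL_EOS, max_new_tokens=50)

# Wrong EOS id (0 is never emitted): the loop only stops at the budget.
bad = generate(toy_model, [1], eos_token_id=0, max_new_tokens=50)

# Budget too small: the output is cut off before EOS is ever reached.
truncated = generate(toy_model, [1], eos_token_id=REAL_EOS, max_new_tokens=2)

print(len(good), len(bad), len(truncated))
```

With a real model, the same check amounts to confirming that the `eos_token_id` in `config.json` matches the token the model actually emits to end a turn.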
🚀 Introduction
STEP3-VL-10B is a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite its 10B-parameter footprint, STEP3-VL-10B excels in visual perception, complex reasoning, and human-centric alignment. It consistently outperforms models under the 10B scale and rivals or surpasses significantly larger open-weights models (10×–20× its size), such as GLM-4.6V (106B-A12B) and Qwen3-VL-Thinking (235B-A22B), as well as top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL.
The success of STEP3-VL-10B is driven by two key strategic designs:
1. Unified Pre-training on High-Quality Multimodal Corpus: A single-stage, fully unfrozen training strategy on a 1.2T token multimodal corpus, focusing on two foundational capabilities: reasoning (e.g., general knowledge and education-centric tasks) and perception (e.g., grounding, counting, OCR, and GUI interactions). By jointly optimizing the Perception Encoder and the Qwen3-8B decoder, STEP3-VL-10B establishes intrinsic vision-language synergy.
2. Scaled Multimodal Reinforcement Learning and Parallel Reasoning: Frontier capabilities are unlocked through a rigorous post-training pipeline comprising two-stage supervised finetuning (SFT) and over 1,400 iterations of RL with both verifiable rewards (RLVR) and human feedback (RLHF). Beyond sequential reasoning, we adopt Parallel Coordinated Reasoning (PaCoRe), which allocates test-time compute to aggregate evidence from parallel visual exploration.
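The parallel-reasoning idea in the second point can be sketched as "run several independent rollouts, then combine them". The snippet below is only an illustration under stated assumptions: `rollout` and `fake_rollout` are hypothetical stand-ins for sampling the model, and the aggregation is a plain majority vote swapped in for PaCoRe's synthesis step, whose details are not given here. The default of 16 rollouts matches the count stated in the inference-mode note of this card.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def parallel_then_aggregate(rollout: Callable[[str, int], str],
                            question: str,
                            n_rollouts: int = 16) -> str:
    """Run n independent rollouts in parallel, then combine their answers.

    Aggregation is a simple majority vote, used here as a stand-in for
    the model-driven synthesis step (an assumption, not the real method).
    """
    with ThreadPoolExecutor(max_workers=n_rollouts) as pool:
        answers: List[str] = list(
            pool.map(lambda seed: rollout(question, seed), range(n_rollouts))
        )
    # most_common(1) returns [(answer, count)] for the top answer.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical rollout: a real one would sample the model with this seed.
def fake_rollout(question: str, seed: int) -> str:
    # Seeds 0, 4, 8, 12 answer "41"; the other 12 seeds answer "42".
    return "42" if seed % 4 else "41"


final_answer = parallel_then_aggregate(fake_rollout, "What is 6 * 7?")
print(final_answer)
```

A vote only works when answers are short and directly comparable; for free-form outputs, some synthesis step over the rollouts would be needed instead.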
📥 Model Zoo
| Model Name | Type | Hugging Face | ModelScope |
| :-------------------- | :--- | :----------: | :---------: |
| STEP3-VL-10B-Base | Base | 🤗 Download | 🤖 Download |
| STEP3-VL-10B | Chat | 🤗 Download | 🤖 Download |
| STEP3-VL-10B-FP8 | Quantized | 🤗 Download | 🤖 Download |
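To get a feel for what the FP8 variant saves, here is a rough back-of-envelope estimate of weight storage. It assumes exactly 10B parameters, a 16-bit (2-byte) unquantized checkpoint, and that every weight is stored in 8 bits after quantization. None of these details are confirmed by this card, and the estimate ignores quantization scales, activations, and the KV cache.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 10e9  # assumption: exactly 10B parameters

bf16_gb = weight_memory_gb(N_PARAMS, 2)  # assumed 16-bit weights: 2 bytes each
fp8_gb = weight_memory_gb(N_PARAMS, 1)   # assumed 8-bit weights: 1 byte each

print(f"16-bit ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB")
```

Under these assumptions the quantized weights take about half the storage of the 16-bit ones; actual file sizes will differ if some layers are kept at higher precision.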
📊 Performance
STEP3-VL-10B delivers best-in-class performance across major multimodal benchmarks, establishing a new performance standard for compact models. The results demonstrate that STEP3-VL-10B is the most powerful open-source model in the 10B parameter class.
Comparison with Larger Models (10×–20× Larger)
| Benchmark | STEP3-VL-10B (SeRe) | STEP3-VL-10B (PaCoRe) | GLM-4.6V (106B-A12B) | Qwen3-VL (235B-A22B) | Gemini-2.5-Pro | Seed-1.5-VL |
| :---------------- | :-----------------: | :-------------------: | :------------------: | :------------------: | :------------: | :---------: |
| MMMU | 78.11 | 80.11 | 75.20 | 78.70 | 83.89 | 79.11 |
| MathVista | 83.97 | 85.50 | 83.51 | 85.10 | 83.88 | 85.60 |
| MathVision | 70.81 | 75.95 | 63.50 | 72.10 | 73.30 | 68.70 |
| MMBench (EN) | 92.05 | 92.38 | 92.75 | 92.70 | 93.19 | 92.11 |
| MMStar | 77.48 | 77.64 | 75.30 | 76.80 | 79.18 | 77.91 |
| OCRBench | 86.75 | 89.00 | 86.20 | 87.30 | 85.90 | 85.20 |
| AIME 2025 | 87.66 | 94.43 | 71.88 | 83.59 | 83.96 | 64.06 |
| HMMT 2025 | 78.18 | 92.14 | 57.29 | 67.71 | 65.68 | 51.30 |
| LiveCodeBench | 75.77 | 76.43 | 48.71 | 69.45 | 72.01 | 57.10 |
> Note on Inference Modes:
>
> SeRe (Sequential Reasoning): The standard inference mode using sequential generation (Chain-of-Thought) with a max length of 64K tokens.
>
> PaCoRe (Parallel Coordinated Reasoning): An advanced mode that scales test-time compute. It aggregates evidence from 16 parallel rollouts to synthesize a final answer, utilizing a max context length of 128K tokens.
>
> _Unless otherwise stated, scores below refer to the standard SeRe mode. Higher scores achieved via PaCoRe are explicitly marked._
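To see how much the extra test-time compute buys on each benchmark, the following snippet computes the PaCoRe-minus-SeRe difference for every row of the comparison table above. The scores are copied from that table; the variable names are only illustrative.

```python
# (SeRe, PaCoRe) scores copied from the comparison table above.
scores = {
    "MMMU": (78.11, 80.11),
    "MathVista": (83.97, 85.50),
    "MathVision": (70.81, 75.95),
    "MMBench (EN)": (92.05, 92.38),
    "MMStar": (77.48, 77.64),
    "OCRBench": (86.75, 89.00),
    "AIME 2025": (87.66, 94.43),
    "HMMT 2025": (78.18, 92.14),
    "LiveCodeBench": (75.77, 76.43),
}

# Gain in points from switching SeRe -> PaCoRe, rounded to two decimals.
gains = {name: round(pacore - sere, 2) for name, (sere, pacore) in scores.items()}

# Print benchmarks from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} +{gain:.2f}")

largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)
```

On these numbers the largest gains fall on the competition-math benchmarks (HMMT 2025 and AIME 2025), while MMStar and MMBench (EN) move by well under one point.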
Comparison with Open-Source Models (7B–10B)
| Category | Benchmark | STEP3-VL-10B | GLM-4.6V-Flash (9B) | Qwen3-VL-Thinking (8B) | InternVL-3.5 (8B) | MiMo-VL-RL-2508 (7B) |
| :----------------- | :--------------- | :----------: | :-----------------: | :--------------------: | :---------------: | :------------------: |
| STEM Reasoning | MMMU | 78.11 | 71.17 | 73.53 | 71.69 | 71.14 |
| | MathVision | 70.81 | 54.05 | 59.60 | 52.05 | 59.65 |
| | MathVista | 83.97 | 82.85 | 78.50 | 76.78 | 79.86 |
| | PhyX | 59.45 | 52.28 | 57.67 | 50.51 | 56.00 |
| Recognition | MMBench (EN) | 92.05 | 91.04 | 90.55 | 88.20 | 89.91 |
| | MMStar | 77.48 | 74.26 | 73.58 | 69.83 | 72.93 |
| | ReMI | 67.29 | 60.75 | 57.17 | 52.65 | 63.13 |
| OCR & Document | OCRBench | 86.75 | 85.97 | 82.85 | 83.70 | 85.40 |
| | AI2D | 89.35 | 88.93 | 83.32 | 82.34 | 84.96 |
| GUI Grounding | ScreenSpot-V2 | 92.61 | 92.14 | 93.60 | 84.02 | 90.82 |
| | ScreenSpot-Pro | 51.55 | 45.68 | 46.60 | 15.39 | 34.84 |
| | OSWorld-G | 59.02 | 54.71 | 56.70 | 31.91 | 50.54 |
| Spatial | BLINK | 66.79 | 64.90 | 62.78 | 55.40 | 62.57 |
| | All-Angles-Bench | 57.21 | 53.24 | 45.88 | 45.29 | 51.62 |
| Code | HumanEval-V | 66.05 | 29.26 |…