Model · StepFun · published Feb 2, 2026 · seen 5d

stepfun-ai/Step3-VL-10B-FP8


Captured source

published Feb 2, 2026 · seen 5d · captured 15h · http 200 · method plain · task image-text-to-text · license apache-2.0 · library transformers · downloads 311 · likes 10

📢 News & Updates

  • 🚀 Online Demo: Explore Step3-VL-10B on Hugging Face Spaces!
  • 📢 [Notice] FP8 Quantization Support: FP8 quantized weights are now available. (Download link)
  • 📢 [Notice] vLLM Support: vLLM integration is now officially supported! (PR #32329)
  • [Fixed] HF Inference: Resolved the eos_token_id misconfiguration in config.json that caused infinite generation loops. (PR #abdf3)
  • [Fixing] Metric Correction: We sincerely apologize for inaccuracies in the Qwen3VL-8B benchmarks (e.g., AIME, HMMT, LCB). The errors were caused by an incorrect max_tokens setting (mistakenly set to 32k) during our large-scale evaluation process. We are re-running the tests and will provide corrected numbers in the next version of the technical report.

🚀 Introduction

STEP3-VL-10B is a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite its compact 10B parameter footprint, STEP3-VL-10B excels in visual perception, complex reasoning, and human-centric alignment. It consistently outperforms models under the 10B scale and rivals or surpasses significantly larger open-weights models (10×–20× its size), such as GLM-4.6V (106B-A12B) and Qwen3-VL-Thinking (235B-A22B), as well as top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL.

The success of STEP3-VL-10B is driven by two key strategic designs:

1. Unified Pre-training on High-Quality Multimodal Corpus: A single-stage, fully unfrozen training strategy on a 1.2T token multimodal corpus, focusing on two foundational capabilities: reasoning (e.g., general knowledge and education-centric tasks) and perception (e.g., grounding, counting, OCR, and GUI interactions). By jointly optimizing the Perception Encoder and the Qwen3-8B decoder, STEP3-VL-10B establishes intrinsic vision-language synergy.
2. Scaled Multimodal Reinforcement Learning and Parallel Reasoning: Frontier capabilities are unlocked through a rigorous post-training pipeline comprising two-stage supervised finetuning (SFT) and over 1,400 iterations of RL with both verifiable rewards (RLVR) and human feedback (RLHF). Beyond sequential reasoning, we adopt Parallel Coordinated Reasoning (PaCoRe), which allocates test-time compute to aggregate evidence from parallel visual exploration.

📥 Model Zoo

| Model Name | Type | Hugging Face | ModelScope |
| :-------------------- | :--- | :---: | :---: |
| STEP3-VL-10B-Base | Base | 🤗 Download | 🤖 Download |
| STEP3-VL-10B | Chat | 🤗 Download | 🤖 Download |
| STEP3-VL-10B-FP8 | Quantized | 🤗 Download | 🤖 Download |

📊 Performance

STEP3-VL-10B leads the compared 7B–10B open-source models on most of the benchmarks reported below, and it matches or exceeds models 10×–20× its size on several of them, particularly the math and coding benchmarks.

Comparison with Larger Models (10×–20× Larger)

| Benchmark | STEP3-VL-10B (SeRe) | STEP3-VL-10B (PaCoRe) | GLM-4.6V (106B-A12B) | Qwen3-VL (235B-A22B) | Gemini-2.5-Pro | Seed-1.5-VL |
| :---------------- | :-----------------: | :-------------------: | :------------------: | :------------------: | :------------: | :---------: |
| MMMU | 78.11 | 80.11 | 75.20 | 78.70 | 83.89 | 79.11 |
| MathVista | 83.97 | 85.50 | 83.51 | 85.10 | 83.88 | 85.60 |
| MathVision | 70.81 | 75.95 | 63.50 | 72.10 | 73.30 | 68.70 |
| MMBench (EN) | 92.05 | 92.38 | 92.75 | 92.70 | 93.19 | 92.11 |
| MMStar | 77.48 | 77.64 | 75.30 | 76.80 | 79.18 | 77.91 |
| OCRBench | 86.75 | 89.00 | 86.20 | 87.30 | 85.90 | 85.20 |
| AIME 2025 | 87.66 | 94.43 | 71.88 | 83.59 | 83.96 | 64.06 |
| HMMT 2025 | 78.18 | 92.14 | 57.29 | 67.71 | 65.68 | 51.30 |
| LiveCodeBench | 75.77 | 76.43 | 48.71 | 69.45 | 72.01 | 57.10 |
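To see where the extra test-time compute helps most, the gap between the two STEP3-VL-10B columns can be computed directly. The short sketch below uses only the SeRe and PaCoRe scores from the table above; the variable names are illustrative.

```python
# (SeRe, PaCoRe) scores for STEP3-VL-10B, copied from the table above.
scores = {
    "MMMU": (78.11, 80.11),
    "MathVista": (83.97, 85.50),
    "MathVision": (70.81, 75.95),
    "MMBench (EN)": (92.05, 92.38),
    "MMStar": (77.48, 77.64),
    "OCRBench": (86.75, 89.00),
    "AIME 2025": (87.66, 94.43),
    "HMMT 2025": (78.18, 92.14),
    "LiveCodeBench": (75.77, 76.43),
}

# Absolute gain of PaCoRe over SeRe, in points, rounded to two decimals.
gains = {name: round(pacore - sere, 2) for name, (sere, pacore) in scores.items()}

# Print benchmarks from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:15s} +{gain:.2f}")
```

The largest gains fall on the competition-math benchmarks (HMMT 2025 at +13.96 and AIME 2025 at +6.77), while the recognition benchmarks MMBench (EN) and MMStar move by less than half a point.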

> Note on Inference Modes:
>
> SeRe (Sequential Reasoning): The standard inference mode using sequential generation (Chain-of-Thought) with a max length of 64K tokens.
>
> PaCoRe (Parallel Coordinated Reasoning): An advanced mode that scales test-time compute. It aggregates evidence from 16 parallel rollouts to synthesize a final answer, utilizing a max context length of 128K tokens.
>
> _Unless otherwise stated, scores below refer to the standard SeRe mode. Higher scores achieved via PaCoRe are explicitly marked._
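The note describes PaCoRe only at a high level, so the following is a rough sketch of the general idea of spending test-time compute on parallel rollouts. It is not PaCoRe itself: it replaces the model-driven synthesis step with plain majority voting over final answers, a much simpler stand-in. The `generate` callable and `fake_generate` stub are hypothetical placeholders for a real inference call.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def parallel_vote(
    generate: Callable[[str, int], str],
    prompt: str,
    n_rollouts: int = 16,
) -> str:
    """Run n_rollouts independent generations and return the most common answer.

    `generate(prompt, seed)` is a hypothetical callable that returns one final answer.
    Majority voting is a stand-in for the synthesis step, not the PaCoRe method.
    """
    with ThreadPoolExecutor(max_workers=n_rollouts) as pool:
        # pool.map preserves seed order, so results are deterministic for a deterministic generate.
        answers = list(pool.map(lambda seed: generate(prompt, seed), range(n_rollouts)))
    counts = Counter(answer.strip() for answer in answers)
    return counts.most_common(1)[0][0]


def fake_generate(prompt: str, seed: int) -> str:
    """Stub rollout for illustration: every fourth seed returns a different answer."""
    return "41" if seed % 4 == 0 else "42"


print(parallel_vote(fake_generate, "What is 6 * 7?"))
```

With the stub, 12 of the 16 rollouts agree, so the vote returns their answer. Voting only works when answers can be compared exactly, which is one reason a synthesis step that reads all rollouts, as the note describes, is more general.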

Comparison with Open-Source Models (7B–10B)

| Category | Benchmark | STEP3-VL-10B | GLM-4.6V-Flash (9B) | Qwen3-VL-Thinking (8B) | InternVL-3.5 (8B) | MiMo-VL-RL-2508 (7B) |
| :----------------- | :--------------- | :----------: | :-----------------: | :--------------------: | :---------------: | :------------------: |
| STEM Reasoning | MMMU | 78.11 | 71.17 | 73.53 | 71.69 | 71.14 |
| | MathVision | 70.81 | 54.05 | 59.60 | 52.05 | 59.65 |
| | MathVista | 83.97 | 82.85 | 78.50 | 76.78 | 79.86 |
| | PhyX | 59.45 | 52.28 | 57.67 | 50.51 | 56.00 |
| Recognition | MMBench (EN) | 92.05 | 91.04 | 90.55 | 88.20 | 89.91 |
| | MMStar | 77.48 | 74.26 | 73.58 | 69.83 | 72.93 |
| | ReMI | 67.29 | 60.75 | 57.17 | 52.65 | 63.13 |
| OCR & Document | OCRBench | 86.75 | 85.97 | 82.85 | 83.70 | 85.40 |
| | AI2D | 89.35 | 88.93 | 83.32 | 82.34 | 84.96 |
| GUI Grounding | ScreenSpot-V2 | 92.61 | 92.14 | 93.60 | 84.02 | 90.82 |
| | ScreenSpot-Pro | 51.55 | 45.68 | 46.60 | 15.39 | 34.84 |
| | OSWorld-G | 59.02 | 54.71 | 56.70 | 31.91 | 50.54 |
| Spatial | BLINK | 66.79 | 64.90 | 62.78 | 55.40 | 62.57 |
| | All-Angles-Bench | 57.21 | 53.24 | 45.88 | 45.29 | 51.62 |
| Code | HumanEval-V | 66.05 | 29.26 |…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Low downloads; minor variant release