M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
Captured source
source ↗M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning | INCLUSION AI
Skip to main content 📖 Technical Report | 🤗 Hugging Face | 🤖 ModelScope
Introduction
We introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data, and task-specific rewards for delivering tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.
📌 Updates
[2025.07.14] 🔥 Our Technical Report is in public on arxiv.
[2025.07.11] 🔥 We release M2-Reasoning on 🤗 Hugging Face and 🤖 ModelScope .
Key Features
A High-quality Data Construction Pipeline: We design and implement a multi-stage data synthesis and curation pipeline that generates vast amounts of reasoning data.
A Dynamic Multi-Task Training Strategy: We propose a sophisticated training strategy that effectively handles data heterogeneity. It features step-wise dynamic optimization to mitigate conflicts between different data sources and a task-specific reward formulation to provide tailored incentive signals.
Unified General and Spatial Reasoning Model: We propose M2-Reasoning-7B, an MLLM uniquely engineered for both abstract and spatial reasoning. Extensive evaluations on 8 distinctbenchmarks demonstrate that, by leveraging our custom data and training pipelines, M2-Reasoning establishes new state-of-the-art (SOTA) results across both general and spatial reasoning domains.
Evaluation
We conduct a comprehensive evaluation of our models across two key domains: general and spatial reasoning. Our evaluation utilizes a diverse set of public benchmarks, grouped by the primary capability they measure:
General Reasoning (Mathematical & Logical): To evaluate this capability, we employ six benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista.
Models MathVista MathVision MathVerse DynaMath WeMath LogicVista Avg. (Δ) Base-Scale General Models InternVL3-8B 70.5 30.0 38.5 25.7 39.5 44.5 41.4 InternVL3-9B 69.0 29.3 37.9 25.1 34.8 49.0 40.8 Qwen2.5-VL-7B 68.1 25.4 41.1 21.8 36.2 47.9 40.1 MUG-U-7B 74.8 26.1 35.4 17.2 26.5 39.8 36.6 SAIL-VL-1.6-8B 74.2 23.2 33.4 14.0 29.6 41.4 36.0 Base-Scale Reasoning Models WeThink-VL-7B 71.6 26.0 44.2 24.8 48.0 51.2 44.3 (+4.2) Taichu-VLR-7B 72.3 27.1 46.7 23.0 44.0 48.3 43.6 VLAA-Thinker-7B 68.0 26.4 48.2 22.4 41.5 48.5 42.5 (+2.4) URSA-8B-PS-GRPO 67.8 31.8 41.5 22.4 38.3 44.7 41.1 (+8.2) Ovis2-8B 71.8 25.9 42.3 20.4 27.2 39.4 37.8 Our Models Base Model 70.2 25.9 30.5 20.2 27.2 37.8 35.5 M2-Reasoning-CI-7B 71.7 29.2 42.1 25.0 42.8 46.8 42.9 (+7.4) M2-Reasoning-7B 75.0 31.5 44.7 26.8 41.8 50.0 45.0 (+9.5)
Spatial Reasoning: We assess this skill using 2 benchmarks: CV-Bench and VSI-Bench
CV-Bench:
Models Count Relation Depth Distance Avg. Large-Scale Models GPT-4O 65.9 85.7 87.8 78.2 78.9 Gemini-1.5-pro 70.4 85.2 82.4 72.8 77.4 Base-Scale Models InternVL3-8B 74.0 90.6 84.3 81.0 82.0 Qwen2.5-VL-7B-Instruct 65.2 86.6 70.6 79.8 75.0 LLava-NEXT-Video-7B 59.3 77.0 71.3 54.7 65.2 Our Models M2-Reasoning-7B 66.6 92.8 89.3 84.3 82.3
VSI-Bench:
OC AD OS RS RDs RDr RP AO Avg. Large-Scale Models Gemini-1.5-pro 56.2 30.9 64.1 43.6 51.3 46.3 36.0 34.6 45.4 GPT-4O 46.2 5.3 43.8 38.2 37.0 41.3 31.5 28.5 34.0 Base-Scale Models InternVL3-8B 68.1 39.0 48.4 33.6 48.3 36.4 27.3 35.4 42.1 Video-R1-7B - - - - - - - - 37.1 Qwen2.5-VL-7B-Instruct 37.7 20.1 49.7 37.4 38.5 40.4 31.4 32.0 35.9 LLava-NeXT-Video-7B 48.5 14.0 47.8 24.2 43.5 42.4 34.0 30.6 35.6 Our Models M2-Reasoning-7B 41.0 34.0 60.9 55.4 40.7 47.3 29.9 28.8 42.3
Model Downloads
You can download the model from both Hugging Face and ModelScope .
If you're in mainland China, we strongly recommend you to download our model from ModelScope .
Example Usage
The basic environment is python=3.10 , torch=2.6.0+cu124 , transformers=4.49.0
We provide a small example on the usage of this repo.
import os import torch
from transformers import ( AutoProcessor , AutoTokenizer , )
import warnings import argparse from modeling_bailing_qwen2_5 import Bailing_qwen2_5NativeForConditionalGeneration from processing_bailing_qwen2_5 import Bailing_qwen2_5Processor
warnings . filterwarnings ( "ignore" )
class BailingMMInfer : def __init__ ( self , model_name_or_path , device = "cuda" , max_pixels = None , min_pixels = None , video_max_pixels = 768 * 28 * 28 , video_min_pixels = 128 * 28 * 28 , generation_config = None ) : super ( ) . __init__ ( ) self . model_name_or_path = model_name_or_path
self . device = device
self . device_map = device
self . video_max_pixels = video_max_pixels if video_max_pixels is not None else 768 * 28 * 28 self . video_min_pixels = video_min_pixels if video_min_pixels is not None else 128 * 28 * 28
self . model , self . tokenizer , self . processor = self . load_model_processor ( ) if max_pixels is not None : self . processor . max_pixels = max_pixels if min_pixels is not None : self . processor . min_pixels = min_pixels if generation_config is None : generation_config = { "num_beams" : 1 , "do_sample" : True , "temperature" : 0.9 }
self . generation_config = generation_config
def load_model_processor ( self ) :
model = Bailing_qwen2_5NativeForConditionalGeneration . from_pretrained ( self . model_name_or_path , torch_dtype = torch . bfloat16 , device_map = self . device_map , _attn_implementation = "flash_attention_2" ) . eval ( )
tokenizer = AutoTokenizer . from_pretrained ( self . model_name_or_path , add_bos_token = True , trust_remote_code = True ) processor = Bailing_qwen2_5Processor . from_pretrained ( self . model_name_or_path , trust_remote_code = True )
return model , tokenizer , processor
def generate ( self , messages , max_new_tokens = 512 ) : text = self . processor . apply_chat_template ( messages , tokenize = False , add_generation_prompt = True , use_system = True )
image_inputs , video_inputs = self . processor . process_vision_info ( messages )
inputs = self . processor ( text…
Excerpt shown — open the source for the full document.
Notability
notability 5.0/10Research post on reasoning, no traction info