inclusionAI/Zooming-without-Zooming
Description: [ICML 2026] ZwZ model family: SOTA fine-grained perception performance; ZoomBench: a new challenging perception benchmark
Language: Python
License: Apache-2.0
Stars: 155
Forks: 2
Open issues: 0
Created: 2026-02-12T08:14:14Z
Pushed: 2026-05-04T12:18:51Z
Default branch: main
Fork: no
Archived: no
README:
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception (ICML 2026)
1. School of Computer Science, Shanghai Jiao Tong University
2. Ant Group
3. Zhongguancun Academy
4. Shanghai Innovation Institute
📃 Paper | 🤗 Models & Training Datasets & ZoomBench
✨ Introduction
Recent "Thinking-with-Images" methods improve fine-grained perception by iteratively zooming into regions of interest during inference, but they incur high latency from repeated tool calls and visual re-encoding. In this work, we present ZwZ models (4/7/8B), which achieve SOTA performance on multimodal perception benchmarks among open-source models. In addition, we present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap".
⚙️ Method
We propose Region-to-Image Distillation (R2I), which transforms zooming from an inference-time tool into a training-time primitive. We:
1. Zoom in to micro-cropped regions and let strong teacher models generate high-quality VQA data
2. Distill this region-grounded supervision back to the full image with explicit bounding-box overlays
3. Enable smaller student models to achieve single-glance fine-grained perception without tool use
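The three steps above can be illustrated with a toy sketch. This is not the repository's implementation: the image is a plain nested list of integers, and `BBox`, `crop_region`, `overlay_bbox`, `make_training_sample`, and `toy_teacher` are hypothetical names introduced only for illustration. It shows the core idea that the teacher sees only the crop, while the resulting question-answer pair is attached to the full image with the box drawn on it.

```python
from dataclasses import dataclass


@dataclass
class BBox:
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def crop_region(image, box):
    """Return the sub-grid covered by box (the zoomed-in view given to the teacher)."""
    return [row[box.left:box.right] for row in image[box.top:box.bottom]]


def overlay_bbox(image, box, marker=-1):
    """Return a copy of image with the outline of box drawn using marker pixels."""
    out = [list(row) for row in image]
    for x in range(box.left, box.right):
        out[box.top][x] = marker
        out[box.bottom - 1][x] = marker
    for y in range(box.top, box.bottom):
        out[y][box.left] = marker
        out[y][box.right - 1] = marker
    return out


def make_training_sample(image, box, teacher):
    """The teacher answers from the crop; the sample pairs that QA with the full image."""
    question, answer = teacher(crop_region(image, box))
    return {"image": overlay_bbox(image, box), "question": question, "answer": answer}


def toy_teacher(crop):
    # Hypothetical stand-in for a strong teacher MLLM: counts non-zero pixels in the crop.
    count = sum(1 for row in crop for px in row if px != 0)
    return "How many non-zero pixels are inside the box?", str(count)


# A 6x6 blank image with two bright pixels inside the region of interest.
image = [[0] * 6 for _ in range(6)]
image[2][2] = 5
image[3][3] = 7
sample = make_training_sample(image, BBox(1, 1, 5, 5), toy_teacher)
```

In a real pipeline the crop and overlay would operate on actual images and the teacher would be a large multimodal model; the student is then trained on `sample` without ever receiving the crop.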
This idea can be summarized as "Zooming without Zooming". The first "Zooming" refers to the training-time primitive: we zoom into micro-regions to synthesize fine-grained training data. The second "Zooming" denotes the inference-time tool use that we seek to bypass.
🌟 Key Features
- 🎯 Superior Accuracy: Achieves SOTA performance on perception benchmarks among open-source models
- ⚡ Single-Pass Efficiency: Requires only one forward pass, eliminating inference-time tool-calling overhead
- 📈 Broad Improvements: Improves not only perception benchmark results but also out-of-distribution generalization on visual reasoning, GUI agent, and AIGC detection tasks
- 🔍 ZoomBench: A comprehensive benchmark with 845 samples across 6 fine-grained dimensions, featuring various evaluation protocols
🎯 Models and Datasets
Models
| Model | Base | Download |
|-------|------|----------|
| ZwZ-2B | Qwen3-VL-2B | 🤗 inclusionAI/ZwZ-2B |
| ZwZ-4B | Qwen3-VL-4B | 🤗 inclusionAI/ZwZ-4B |
| ZwZ-7B | Qwen2.5-VL-7B | 🤗 inclusionAI/ZwZ-7B |
| ZwZ-8B | Qwen3-VL-8B | 🤗 inclusionAI/ZwZ-8B |

---
Training Datasets
Our Region-to-Image distilled training data (37K samples): 🤗 inclusionAI/ZwZ-RL-VQA
Source image pools:
- SA-1B, LAION, MetaCLIP, Visual Genome, CC12M, STPLS3D (we use only a small subset of images from each pool; most of the high-resolution images come from train-0000-of-0013.parquet in https://modelscope.cn/datasets/Tongyi-DataEngine/SA1B-Paired-Captions-Images)
Question Generator: Qwen3-VL-235B-A22B-Instruct
Answer Generators: Qwen3-VL-235B-A22B-Instruct, GLM-4.5V
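With more than one answer generator, disagreeing answers have to be reconciled somehow. The pipeline notes in this README say that Kimi-K2 is used to extract the majority answer. The sketch below swaps that LLM-based step for plain string normalization plus a vote count, purely to illustrate the voting idea; `normalize`, `majority_answer`, and the `min_agreement` threshold are hypothetical and not taken from the repository.

```python
from collections import Counter


def normalize(answer):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.strip().lower().split())


def majority_answer(answers, min_agreement=2):
    """Return the most common normalized answer, or None if agreement is too weak."""
    if not answers:
        return None
    counts = Counter(normalize(a) for a in answers)
    best, votes = counts.most_common(1)[0]
    return best if votes >= min_agreement else None


# Two generators agree after normalization; the third disagrees.
kept = majority_answer(["Red", "  red ", "blue"])
# No answer reaches the agreement threshold, so the sample would be discarded.
dropped = majority_answer(["red", "blue"])
```

String matching only works for short, closed-form answers; free-form answers with equivalent meanings are presumably why an LLM is used for this step in the actual pipeline.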
---
📊 ZoomBench
We introduce 🤗 **ZoomBench**, a challenging benchmark for fine-grained multimodal perception:
- 845 high-quality samples across 6 perceptual dimensions:
  - Fine-Grained Counting
  - OCR (text & symbol recognition)
  - Color Attributes
  - Structural Attributes
  - Material Attributes
  - Object Identification
- Dual-View Protocol: Each sample includes both full image and cropped region to quantify the "zooming gap"
- Attention Map Analysis: Evaluate, from an interpretability perspective, whether the model grounds its predictions in task-relevant image regions
- Hybrid Construction: Gemini-2.5-Pro-generated + human-verified for quality and scalability
- High Difficulty: Average accuracy of Qwen2.5-VL-7B is only 42.5%
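One way to read the dual-view protocol is as a comparison of accuracy on the two views of the same samples. The sketch below assumes the "zooming gap" is the cropped-view accuracy minus the full-image accuracy; that definition, the `zooming_gap` function, and the `full_correct` / `crop_correct` field names are assumptions for illustration, not the benchmark's official evaluation code.

```python
def accuracy(flags):
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def zooming_gap(results):
    """Compare per-view accuracy.

    results: list of dicts with boolean 'full_correct' and 'crop_correct' entries,
    one dict per benchmark sample (hypothetical schema).
    """
    full_acc = accuracy([r["full_correct"] for r in results])
    crop_acc = accuracy([r["crop_correct"] for r in results])
    return {"full_acc": full_acc, "crop_acc": crop_acc, "gap": crop_acc - full_acc}


# Four toy samples: the model answers more of them correctly when given the crop.
results = [
    {"full_correct": True, "crop_correct": True},
    {"full_correct": False, "crop_correct": True},
    {"full_correct": False, "crop_correct": True},
    {"full_correct": True, "crop_correct": False},
]
report = zooming_gap(results)
```

Under this reading, a gap close to zero would suggest the model already perceives from the full image what it would see after zooming in.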
🛠️ Installation
    git clone https://github.com/inclusionAI/Zooming-without-Zooming.git
    cd Zooming-without-Zooming
    pip install -r requirements.txt

    git clone https://github.com/facebookresearch/sam3.git
    cd sam3
    pip install -e .  # please refer to the official repo of SAM3 for detailed installation

    cd ../EasyR1
    pip install -e .  # please refer to the official repo of EasyR1 for detailed installation
🔥 Let's Start
1. Region to Image Distillation
The pipeline supports checkpointing. Each step can be executed independently and resumed from any stage. Note that we use Qwen3-VL-235B and SAM3 to obtain meaningful cropped images, and Kimi-K2 to extract the majority answer.
    cd Zooming-without-Zooming/data_synthesis
    export MLLM_KEY="your_mllm_key"
    export MLLM_URL="your_mllm_url"
    export KIMI_KEY="your_llm_key"
    export KIMI_URL="your_llm_url"

    ## step 1
    # --image_folders supports multiple folders; replace with your own path (folders containing only images)
    python create_crops.py \
        --api_key "$MLLM_KEY" \
        --api_url "$MLLM_URL" \
        --image_folders "/path/images/sa1b" \
        --output_jsonl "generated_bboxes_sa1b.jsonl"

    ## step 2
    # Replace --crop_output_dir with your own path
    python create_questions.py \
        --api_key "$MLLM_KEY" \
        --api_url "$MLLM_URL" \
        --input_files "generated_bboxes_sa1b.jsonl" \
        --output_file "generated_questions.jsonl" \
        --crop_output_dir "/path/images/crops"

    ## step 3
    # Replace --bbox_output_dir with your own path
    bash qwen_serve.sh
    python create_answers.py \
        --api_key "$MLLM_KEY" \
        --api_url "$MLLM_URL" \
        --kimi_api_key "$KIMI_KEY" \
        --kimi_api_url "$KIMI_URL" \
        --input_file "generated_questions.jsonl" \
        --output_file "validated_vqa.jsonl" \
        --bbox_output_dir "/path/images/bbox_images"

    ## step 4
    python convert_jsonl2parquet.py \
        --input_file "validated_vqa.jsonl" \
        --output_file "validated_vqa.parquet"
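Because each step reads one JSONL file and writes another, resuming can be as simple as skipping records that already appear in the output. The sketch below shows that pattern under stated assumptions: it is not the repository's checkpointing code, and the `id` field, `load_done_ids`, and `run_step` are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


def load_done_ids(output_path):
    """Collect the ids of records already written by a previous (possibly partial) run."""
    if not os.path.exists(output_path):
        return set()
    with open(output_path, encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def run_step(input_path, output_path, process):
    """Process only unseen records, appending results so the step can be resumed."""
    done = load_done_ids(output_path)
    written = 0
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "a", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            record = json.loads(line)
            if record["id"] in done:
                continue  # already handled in an earlier run
            dst.write(json.dumps(process(record), ensure_ascii=False) + "\n")
            dst.flush()  # keep finished work on disk if the run is interrupted
            written += 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "questions.jsonl")
    out_path = os.path.join(tmp, "answers.jsonl")
    with open(src_path, "w", encoding="utf-8") as f:
        for i in range(3):
            f.write(json.dumps({"id": i, "question": f"q{i}"}) + "\n")
    # Pretend an earlier run finished record 0 before being interrupted.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"id": 0, "question": "q0", "answer": "a0"}) + "\n")

    def add_answer(record):
        # Hypothetical stand-in for an expensive model call.
        return {**record, "answer": "a" + str(record["id"])}

    first_run = run_step(src_path, out_path, add_answer)
    second_run = run_step(src_path, out_path, add_answer)
    final_ids = sorted(load_done_ids(out_path))
```

Appending and flushing after every record trades a little throughput for the ability to stop and restart long API-bound jobs without repeating paid calls.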
We also provide an end-to-end data synthesis script.
    cd Zooming-without-Zooming/data_synthesis
    export MLLM_KEY="your_mllm_key"
    export MLLM_URL="your_mllm_url"
    export KIMI_KEY="your_llm_key"
    export KIMI_URL="your_llm_url"

    bash qwen_serve.sh
    python create_vqa.py \
        --api_key "$MLLM_KEY" \…
Excerpt shown — open the source for the full document.
Notability: 5.0/10 (new repo with moderate stars)