XiaomiMiMo/MiMo-Embodied
Description: MiMo-Embodied
Language: Python
License: NOASSERTION
Stars: 386
Forks: 15
Open issues: 0
Created: 2025-11-19T08:54:41Z
Pushed: 2026-04-15T12:28:08Z
Default branch: main
Fork: no
Archived: no
README:
I. Introduction
This repository provides the official evaluation suite of MiMo-Embodied, designed to support rigorous and reproducible evaluation for embodied AI and autonomous driving tasks.
Built on top of the excellent lmms-eval framework, this repository extends the evaluation pipeline with MiMo-specific model integration, benchmark support, and evaluation workflows for embodied and driving scenarios.
MiMo-Embodied is a powerful cross-embodied vision-language model that demonstrates state-of-the-art performance in both autonomous driving and embodied AI tasks, representing the first open-source VLM that integrates these two critical areas.
> This repository is for evaluation only. It does not contain model training code.
---
II. Key Features
1. MiVLLM: A MiMo-tailored vLLM-based Model Wrapper
We use a custom mivllm model class built on top of the original VLLM implementation in lmms-eval, tailored for MiMo models. Compared with the default implementation, it:
- improves data loading efficiency
- enables finer control over image and video preprocessing
- supports MiMo-specific inference settings such as:
  - max_model_len
  - gpu_memory_utilization
  - max_num_seqs
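As a rough illustration, the three settings named above are standard vLLM engine arguments, and lmms-eval passes model options as a comma-separated `key=value` string through `--model_args`. The sketch below only assembles such a string. The helper name `build_model_args` and the numeric values are illustrative assumptions, not defaults from this repository, and whether `mivllm` reads exactly these keys from `--model_args` should be checked against its source.

```python
def build_model_args(**settings) -> str:
    """Join model settings into a comma-separated key=value string."""
    return ",".join(f"{key}={value}" for key, value in settings.items())


# Hypothetical values; tune them to your GPU memory and benchmark.
model_args = build_model_args(
    max_model_len=32768,
    gpu_memory_utilization=0.9,
    max_num_seqs=64,
)
print(model_args)
```

Lowering `gpu_memory_utilization` or `max_num_seqs` is the usual first step when vLLM runs out of memory at startup.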
2. Evaluation for Embodied AI
This evaluation suite supports embodied AI benchmarks covering key capabilities such as:
- affordance prediction
- task planning
- spatial understanding
3. Evaluation for Autonomous Driving
This evaluation suite also supports autonomous driving benchmarks covering key capabilities such as:
- environmental perception
- status prediction
- driving planning
- driving knowledge-based QA
4. Flexible Evaluation Workflows
The framework supports:
- single-GPU evaluation
- multi-GPU evaluation
- multi-node distributed evaluation
- batch evaluation across multiple tasks
---
III. Benchmark Coverage
This repository focuses on the evaluation of embodied AI and autonomous driving tasks.
Embodied AI Benchmarks
| Category | Benchmarks |
|---|---|
| Affordance & Planning | Where2Place (where2place_point), RoboAfford-Eval (roboafford), Part-Afford (part_affordance), RoboRefIt (roborefit), VABench-Point (vabench_point_box) |
| Planning | EgoPlan2 (egoplan), RoboVQA (robovqa), Cosmos (cosmos_reason1_boxed) |
| Spatial Understanding | CV-Bench (cvbench_boxed), ERQA (erqa_boxed), EmbSpatial (embspatialbench), SAT (sat), RoboSpatial (robospatial), RefSpatial (refspatialbench), CRPE (crpe_relation), MetaVQA (metavqa_eval), VSI-Bench (vsibench_boxed) |
Autonomous Driving Benchmarks
| Benchmarks |
|---|
| CODA-LM (codalm) |
| Drama (drama) |
| DriveAction (drive_action_boxed_detail) |
| LingoQA (lingoqa_boxed) |
| nuScenes-QA (nuscenesqa) |
| OmniDrive (omnidrive) |
| NuInstruct (nuinstruct) |
| DriveLM (drivelm) |
| MAPLM (maplm) |
| BDD-X (bddx) |
| MME-RealWorld (mme_realworld) |
| IDKB (idkb) |
> A more detailed task list can be maintained in mimovl_docs/tasks.md.
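The names in parentheses in the benchmark tables are the task names passed to the launcher. A small lookup like the one below can help avoid typos when scripting runs. It is a sketch only: it copies a handful of rows from the tables, and the helper `task_names` is a hypothetical convenience, not part of this repository.

```python
# Benchmark display name -> task name, copied from a few rows of the tables.
EMBODIED_TASKS = {
    "Where2Place": "where2place_point",
    "EgoPlan2": "egoplan",
    "CV-Bench": "cvbench_boxed",
    "ERQA": "erqa_boxed",
}
DRIVING_TASKS = {
    "CODA-LM": "codalm",
    "LingoQA": "lingoqa_boxed",
    "DriveLM": "drivelm",
    "BDD-X": "bddx",
}


def task_names(*benchmarks: str) -> list[str]:
    """Resolve benchmark display names to task names, failing loudly on typos."""
    known = {**EMBODIED_TASKS, **DRIVING_TASKS}
    missing = [name for name in benchmarks if name not in known]
    if missing:
        raise KeyError(f"unknown benchmark(s): {missing}")
    return [known[name] for name in benchmarks]


print(task_names("CV-Bench", "BDD-X"))
```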
---
IV. Usage
Installation
    # Step 1: Create conda environment
    conda create -n lmms-eval python=3.10 -y
    conda activate lmms-eval

    # Step 2: Install PyTorch (adjust CUDA version as needed)
    pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

    # Step 3: Install vLLM
    pip install vllm==0.7.3

    # Step 4: Install the evaluation framework
    git clone https://github.com/XiaomiMiMo/MiMo-Embodied.git
    cd MiMo-Embodied
    pip install -e . && pip uninstall -y opencv-python-headless
    pip install -r requirements.txt

    # Step 5 (optional but recommended)
    pip install xformers==0.0.28.post3
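After installation, it can be worth confirming that the pinned versions are the ones actually installed, since a later `pip install` may replace them. The following is a minimal sketch using the standard library's `importlib.metadata`; the pins are copied from the steps above, while the helper names `installed_version` and `find_mismatches` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the installation steps.
EXPECTED = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "torchaudio": "2.5.1",
    "vllm": "0.7.3",
}


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(expected: dict, lookup=installed_version) -> dict:
    """Map each package whose version differs from its pin to (pinned, found)."""
    mismatches = {}
    for package, pinned in expected.items():
        found = lookup(package)
        # Ignore local build tags such as "+cu124" when comparing.
        if found is None or found.split("+")[0] != pinned:
            mismatches[package] = (pinned, found)
    return mismatches


for package, (pinned, found) in find_mismatches(EXPECTED).items():
    print(f"{package}: expected {pinned}, found {found}")
```

No output means every listed package matches its pin.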
Dataset Paths
For many benchmarks, images are already packaged in the corresponding Hugging Face dataset, so no additional local path configuration is required.
For some benchmarks with large image/video assets, the released config YAML uses a placeholder local path such as:
img_root: "/path/to/your/image_or_video_data"
Before running evaluation for these benchmarks, please manually update img_root in the corresponding task YAML file to point to your local image/video directory.
For example:
    dataset_path: Zray26/bdd_x_testing_caption
    task: "bddx"
    test_split: test
    dataset_kwargs:
      token: True
    output_type: generate_until
    img_root: "/path/to/your/image_or_video_data"
    doc_to_visual: !function utils.doc_to_visual
    doc_to_text: !function utils.doc_to_text
    doc_to_target: !function utils.doc_to_target
    process_results: !function utils.process_test_results_for_submission
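Editing `img_root` by hand is fine for one task, but it can also be scripted. The sketch below rewrites the `img_root` line with a plain text substitution rather than a YAML parser, because the task files use the custom `!function` tag, which a generic loader such as `yaml.safe_load` would reject. The helper name `set_img_root` is hypothetical and not part of this repository.

```python
import re
from pathlib import Path


def set_img_root(yaml_path, new_root: str) -> bool:
    """Rewrite the img_root line of a task YAML in place; return True if changed."""
    path = Path(yaml_path)
    text = path.read_text(encoding="utf-8")
    # Match one whole "img_root: ..." line, keeping its indentation.
    pattern = re.compile(r"^([ \t]*img_root:[ \t]*).*$", flags=re.MULTILINE)
    # A function replacement avoids backslash-escape surprises in paths.
    updated, count = pattern.subn(lambda m: f'{m.group(1)}"{new_root}"', text)
    if count:
        path.write_text(updated, encoding="utf-8")
    return bool(count)
```

For example, `set_img_root("lmms_eval/tasks/bddx/bddx.yaml", "/data/bddx")` would point the BDD-X task at `/data/bddx`, a placeholder for wherever the assets live locally. The function returns `False` and leaves the file untouched when no `img_root` line exists.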
A typical task folder is organized as:
    lmms_eval/tasks//
    ├── .yaml
    └── utils.py
For example:
    lmms_eval/tasks/bddx/
    ├── bddx.yaml
    └── utils.py
Please check the YAML file of each benchmark case by case and fill in img_root when local image/video assets are required.
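To find which task files still need attention, one option is to scan the task directory for the placeholder string. This is a sketch under the assumption that unconfigured files contain the exact placeholder path quoted earlier; the helper `find_unconfigured_tasks` is hypothetical.

```python
from pathlib import Path

PLACEHOLDER = "/path/to/your/image_or_video_data"


def find_unconfigured_tasks(tasks_dir="lmms_eval/tasks") -> list[Path]:
    """List task YAML files whose text still contains the placeholder path."""
    root = Path(tasks_dir)
    if not root.is_dir():
        return []
    hits = []
    for yaml_file in sorted(root.rglob("*.yaml")):
        if PLACEHOLDER in yaml_file.read_text(encoding="utf-8", errors="ignore"):
            hits.append(yaml_file)
    return hits


# Run from the repository root to list files that still need img_root set.
for yaml_file in find_unconfigured_tasks():
    print(f"img_root still unset: {yaml_file}")
```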
Main Evaluation Script
The main evaluation launcher is:
bash mimovl_docs/eval_mimo_vl_args.sh [disable_thinking]
Single-Task Evaluation
    bash mimovl_docs/eval_mimo_vl_args.sh \
        XiaomiMiMo/MiMo-Embodied-7B \
        cvbench_boxed \
        ./eval_results
No-Think Evaluation
For tasks evaluated in no-think mode, run:
    bash mimovl_docs/eval_mimo_vl_args.sh \
        XiaomiMiMo/MiMo-Embodied-7B \
        \
        ./eval_results \
        true
This corresponds to:
disable_thinking_user=true
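When driving the launcher from Python, the command can be assembled as an argument list. The sketch below assumes the positional order shown in the examples above (model, task, output directory, then an optional `true` to disable thinking); `build_eval_command` and `run_eval` are hypothetical helpers, not part of this repository.

```python
import subprocess

LAUNCHER = "mimovl_docs/eval_mimo_vl_args.sh"


def build_eval_command(model, task, output_dir, disable_thinking=False):
    """Return the argv list for one launcher call."""
    command = ["bash", LAUNCHER, model, task, output_dir]
    if disable_thinking:
        # The no-think example passes the literal string "true" last.
        command.append("true")
    return command


def run_eval(model, task, output_dir, disable_thinking=False):
    """Run the launcher and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(
        build_eval_command(model, task, output_dir, disable_thinking), check=True
    )


print(build_eval_command("XiaomiMiMo/MiMo-Embodied-7B", "cvbench_boxed", "./eval_results"))
```

Passing a list instead of a shell string avoids quoting problems when paths contain spaces.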
Multi-GPU / Multi-Node Evaluation
The launcher supports distributed evaluation through environment variables:
    export NNODES=1
    export NODE_RANK=0
    export MASTER_ADDR=127.0.0.1
    export MASTER_PORT=29500
    export NPROC_PER_NODE=8
Then run:
bash mimovl_docs/eval_mimo_vl_args.sh \ \ \
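For multi-node runs, the same five variables are set on every node, with only `NODE_RANK` differing. The sketch below builds that environment in Python. It assumes the common PyTorch distributed convention that `MASTER_ADDR` is the address of the rank-0 node; how the launcher consumes these variables is not shown here, and the helper `distributed_env` and the address `10.0.0.1` are hypothetical.

```python
import os


def distributed_env(node_rank, nnodes=1, master_addr="127.0.0.1",
                    master_port=29500, nproc_per_node=8):
    """Build the environment variables the launcher reads for one node."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank {node_rank} outside [0, {nnodes})")
    env = dict(os.environ)  # inherit PATH, CUDA settings, and so on
    env.update(
        NNODES=str(nnodes),
        NODE_RANK=str(node_rank),
        MASTER_ADDR=master_addr,
        MASTER_PORT=str(master_port),
        NPROC_PER_NODE=str(nproc_per_node),
    )
    return env


# Two-node example: both nodes share MASTER_ADDR and differ only in NODE_RANK.
for rank in range(2):
    env = distributed_env(rank, nnodes=2, master_addr="10.0.0.1")
    print({key: env[key] for key in ("NNODES", "NODE_RANK", "MASTER_ADDR")})
```

The returned dictionary can be passed as `env=` to `subprocess.run` when starting the launcher on each node.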
Batch Evaluation
To run multiple tasks sequentially, edit the task list in:
tools/submit/batch_run.py
Then launch:
    python tools/submit/batch_run.py \
        --input \
        --eval_results_dir
To disable thinking mode in batch evaluation:
python tools/submit/batch_run.py \ --input \ --eval_results_dir \…