parasail-ai/VLMEvalKit
forked from open-compass/VLMEvalKit
Description: Open-source evaluation toolkit of large vision-language models (LVLMs), support 160+ VLMs, 50+ benchmarks
License: Apache-2.0
Stars: 0
Forks: 0
Open issues: 0
Created: 2024-12-29T22:26:39Z
Pushed: 2024-12-29T22:30:58Z
Default branch: main
Fork: yes
Parent repository: open-compass/VLMEvalKit
Archived: no
README:
A Toolkit for Evaluating Large Vision-Language Models.
English | [简体中文](/docs/zh-CN/README_zh-CN.md) | [日本語](/docs/ja/README_ja.md)
🏆 OC Leaderboard • 🏗️Quickstart • 📊Datasets & Models • 🛠️Development • 🎯Goal • 🖊️Citation
🤗 HF Leaderboard • 🤗 Evaluation Records • 🤗 HF Video Leaderboard • 🔊 Discord • 📝 Report
VLMEvalKit (the python package name is vlmeval) is an open-source evaluation toolkit of large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of data preparation under multiple repositories. In VLMEvalKit, we adopt generation-based evaluation for all LVLMs, and provide the evaluation results obtained with both exact matching and LLM-based answer extraction.
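To make the exact-matching idea concrete, here is a minimal sketch for multiple-choice answers. It is an illustration only and not VLMEvalKit's actual implementation: the function names `extract_choice` and `exact_match_accuracy`, the `ABCD` option set, and the rule "accept only when exactly one option letter appears" are all assumptions made for this example.

```python
import re
from typing import List, Optional


def extract_choice(prediction: str, options: str = "ABCD") -> Optional[str]:
    """Return the option letter named in a free-form answer, or None.

    Hypothetical helper: only standalone capital letters count, so the
    "A" inside "Answer" is ignored. If zero or several distinct option
    letters appear, the answer is treated as ambiguous.
    """
    found = set(re.findall(rf"\b([{options}])\b", prediction))
    return found.pop() if len(found) == 1 else None


def exact_match_accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of predictions whose extracted letter equals the label."""
    if not predictions:
        return 0.0
    hits = sum(extract_choice(p) == a for p, a in zip(predictions, answers))
    return hits / len(predictions)


if __name__ == "__main__":
    preds = ["The answer is B.", "Answer: (C)", "no idea"]
    labels = ["B", "D", "A"]
    print(exact_match_accuracy(preds, labels))
```

Predictions for which rule-based matching returns `None` are the cases where an LLM-based answer extractor would plausibly take over.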
🆕 News
> We have presented a **comprehensive survey** on the evaluation of large multi-modality models, jointly with **MME Team** and **LMMs-Lab** 🔥🔥🔥
- [2024-12-11] Supported **NaturalBench**, a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.
- [2024-12-02] Supported **VisOnlyQA**, a benchmark for evaluating the visual perception capabilities 🔥🔥🔥
- [2024-11-26] Supported **Ovis1.6-Gemma2-27B**, thanks to **runninglsy** 🔥🔥🔥
- [2024-11-25] Created a new flag `VLMEVALKIT_USE_MODELSCOPE`. By setting this environment variable, you can download the supported video benchmarks from **modelscope** 🔥🔥🔥
- [2024-11-25] Supported **VizWiz** benchmark 🔥🔥🔥
- [2024-11-22] Supported the inference of **MMGenBench**, thanks **lerogo** 🔥🔥🔥
- [2024-11-22] Supported **Dynamath**, a multimodal math benchmark comprising 501 seed problems and 10 variants generated based on random seeds. The benchmark can be used to measure the robustness of MLLMs in multi-modal math solving 🔥🔥🔥
- [2024-11-21] Integrated a new config system to enable more flexible evaluation settings. Check the [Document](/docs/en/ConfigSystem.md) or run `python run.py --help` for more details 🔥🔥🔥
- [2024-11-21] Supported **QSpatial**, a multimodal benchmark for Quantitative Spatial Reasoning (e.g., determining size or distance), thanks to **andrewliao11** for providing the official support 🔥🔥🔥
- [2024-11-21] Supported **MM-Math**, a new multimodal math benchmark comprising ~6K middle school multi-modal reasoning math problems. GPT-4o-20240806 achieves 22.5% accuracy on this benchmark 🔥🔥🔥
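Two of the news items above mention a command-line entry point (`python run.py --help`) and an environment flag (`VLMEVALKIT_USE_MODELSCOPE`). The sketch below shows one way to combine them from Python. The README only says the variable must be set, so the value `"1"` is an assumption, and the helper names `with_modelscope_flag` and `help_command` are hypothetical.

```python
import os
from typing import Dict, List, Optional


def with_modelscope_flag(env: Optional[Dict[str, str]] = None,
                         value: str = "1") -> Dict[str, str]:
    """Return a copy of the environment with the ModelScope flag set.

    The value "1" is an assumption; the README does not state which
    value the toolkit expects.
    """
    new_env = dict(os.environ if env is None else env)
    new_env["VLMEVALKIT_USE_MODELSCOPE"] = value
    return new_env


def help_command() -> List[str]:
    """Command from the README that prints the available options."""
    return ["python", "run.py", "--help"]


if __name__ == "__main__":
    # To actually launch it from the repository root, something like:
    #   import subprocess
    #   subprocess.run(help_command(), env=with_modelscope_flag(), check=True)
    print(" ".join(help_command()))
```

Copying the environment instead of mutating `os.environ` keeps the flag scoped to the child process.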
🏗️ QuickStart
See [[QuickStart](/docs/en/Quickstart.md) | [快速开始](/docs/zh-CN/Quickstart.md)] for a quick start guide.
📊 Datasets, Models, and Evaluation Results
Evaluation Results
The performance numbers on our official multi-modal leaderboards can be downloaded from here!
**OpenVLM Leaderboard**: **Download All DETAILED Results**.
Check the **Supported Benchmarks** tab in **VLMEvalKit Features** to view all supported image & video benchmarks (70+).
Check the **Supported LMMs** tab in **VLMEvalKit Features** to view all supported LMMs, including commercial APIs, open-source models, and more (200+).
Transformers Version Recommendation:
Note that some VLMs may not run under certain `transformers` versions. We recommend the following settings to evaluate each VLM:
- Please use `transformers==4.33.0` for: `Qwen series`, `Monkey series`, `InternLM-XComposer Series`, `mPLUG-Owl2`, `OpenFlamingo v2`, `IDEFICS series`, `VisualGLM`, `MMAlaya`, `ShareCaptioner`, `MiniGPT-4 series`, `InstructBLIP series`, `PandaGPT`, `VXVERSE`.
- Please use `transformers==4.36.2` for: `Moondream1`.
- Please use `transformers==4.37.0` for: `LLaVA series`, `ShareGPT4V series`, `TransCore-M`, `LLaVA (XTuner)`, `CogVLM Series`, `EMU2 Series`, `Yi-VL Series`, `MiniCPM-[V1/V2]`, `OmniLMM-12B`, `DeepSeek-VL series`, `InternVL series`, `Cambrian Series`, `VILA Series`, `Llama-3-MixSenseV1_1`, `Parrot-7B`, `PLLaVA Series`.
- Please use `transformers==4.40.0` for: `IDEFICS2`, `Bunny-Llama3`, `MiniCPM-Llama3-V2.5`, `360VL-70B`, `Phi-3-Vision`, `WeMM`.
- Please use `transformers==4.44.0` for: `Moondream2`, `H2OVL series`.
- Please use `transformers==4.45.0` for: `Aria`.
- Please use `transformers==latest` for: `LLaVA-Next series`, `PaliGemma-3B`, `Chameleon series`, `Video-LLaVA-7B-HF`, `Ovis series`, `Mantis series`, `MiniCPM-V2.6`, `OmChat-v2.0-13B-sinlge-beta`, `Idefics-3`, `GLM-4v-9B`, `VideoChat2-HD`, `RBDash_72b`, `Llama-3.2 series`, `Kosmos series`.
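The recommendations above amount to a lookup from model family to a `transformers` pin. A minimal sketch of that lookup follows; it is not part of VLMEvalKit, the names `RECOMMENDED_TRANSFORMERS` and `pip_requirement` are hypothetical, and only a few families from the list are included. Because `latest` is not a version number, this sketch assumes it means "install without a pin".

```python
from typing import Dict, List, Optional

# Subset of the README's recommendations, keyed by transformers version.
RECOMMENDED_TRANSFORMERS: Dict[str, List[str]] = {
    "4.33.0": ["Qwen series", "Monkey series", "VisualGLM"],
    "4.36.2": ["Moondream1"],
    "4.37.0": ["LLaVA series", "CogVLM Series", "InternVL series"],
    "4.40.0": ["IDEFICS2", "Phi-3-Vision"],
    "4.44.0": ["Moondream2", "H2OVL series"],
    "4.45.0": ["Aria"],
    "latest": ["LLaVA-Next series", "GLM-4v-9B", "Llama-3.2 series"],
}


def pip_requirement(model_family: str) -> Optional[str]:
    """Return a pip requirement string for a model family, or None.

    "latest" is mapped to an unpinned requirement (an assumption about
    what the README intends).
    """
    for version, families in RECOMMENDED_TRANSFORMERS.items():
        if model_family in families:
            if version == "latest":
                return "transformers"
            return f"transformers=={version}"
    return None


if __name__ == "__main__":
    for name in ("Moondream1", "Aria", "GLM-4v-9B", "SomeUnlistedModel"):
        print(name, "->", pip_requirement(name))
```

Since the pins conflict across families, evaluating models from different rows would typically need separate environments.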
Torchvision Version Recommendation:
Note that some VLMs may not run under certain `torchvision` versions. We recommend the following settings to evaluate each VLM:
- Please use `torchvision>=0.16` for: `Moondream series` and `Aria`.
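A quick pre-flight check for the `torchvision>=0.16` recommendation can be sketched with the standard library alone. The helper names `meets_minimum` and `installed_version` are hypothetical, and the parsing assumes the version string starts with numeric `major.minor` components, as in `0.16.0+cu121`.

```python
from importlib import metadata
from typing import Optional, Tuple


def meets_minimum(installed: str, minimum: Tuple[int, int] = (0, 16)) -> bool:
    """Compare the major.minor part of a version string numerically."""
    major, minor = installed.split(".")[:2]
    return (int(major), int(minor)) >= minimum


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("torchvision")
    if version is None:
        print("torchvision is not installed")
    elif meets_minimum(version):
        print(f"torchvision {version} satisfies >=0.16")
    else:
        print(f"torchvision {version} is older than 0.16")
```

Comparing integer tuples avoids the string-comparison pitfall where `"0.9"` would sort after `"0.16"`.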
Flash-attn Version Recommendation:
Note that some VLMs may not be able to run…
Excerpt shown — open the source for the full document.
Notability: 2.0/10. Routine fork, no added value.