Fork • Parasail • published Dec 29, 2024 • seen 5d

parasail-ai/VLMEvalKit

forked from open-compass/VLMEvalKit


Description: Open-source evaluation toolkit of large vision-language models (LVLMs), support 160+ VLMs, 50+ benchmarks

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2024-12-29T22:26:39Z

Pushed: 2024-12-29T22:30:58Z

Default branch: main

Fork: yes

Parent repository: open-compass/VLMEvalKit

Archived: no

README:

A Toolkit for Evaluating Large Vision-Language Models.

English | [简体中文](/docs/zh-CN/README_zh-CN.md) | [日本語](/docs/ja/README_ja.md)

🏆 OC Leaderboard • 🏗️Quickstart • 📊Datasets & Models • 🛠️Development • 🎯Goal • 🖊️Citation

🤗 HF Leaderboard • 🤗 Evaluation Records • 🤗 HF Video Leaderboard • 🔊 Discord • 📝 Report

VLMEvalKit (the Python package name is vlmeval) is an open-source evaluation toolkit for large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of preparing data across multiple repositories. In VLMEvalKit, we adopt generation-based evaluation for all LVLMs and provide evaluation results obtained with both exact matching and LLM-based answer extraction.
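
To make the exact-matching idea concrete, here is a minimal sketch for multiple-choice questions. It is not VLMEvalKit's implementation: the function name `extract_choice_exact` and the matching rules are assumptions chosen for illustration. When this kind of rule-based matching returns nothing, an LLM-based extractor would be the fallback.

```python
import re
from typing import Dict, Optional


def extract_choice_exact(prediction: str, options: Dict[str, str]) -> Optional[str]:
    """Hypothetical helper: map a free-form model answer to one option letter.

    Returns the letter only when the match is unambiguous, otherwise None.
    """
    text = prediction.strip()

    # 1) Look for standalone capital letters such as "A", "(B)", or "C."
    #    that are valid option keys. Caveat: an article "A" at the start of
    #    a sentence would also match, which is one reason a fallback is needed.
    letters = {m for m in re.findall(r"\b([A-Z])\b", text) if m in options}
    if len(letters) == 1:
        return letters.pop()

    # 2) Otherwise look for the option text itself (case-insensitive).
    hits = [key for key, value in options.items() if value.lower() in text.lower()]
    if len(hits) == 1:
        return hits[0]

    # Ambiguous or no match: defer to another extraction method.
    return None
```

For example, `extract_choice_exact("The answer is (B).", {"A": "cat", "B": "dog"})` resolves through the letter rule, while an answer such as "it looks like a dog" resolves through the option text.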

🆕 News

> We have presented a **comprehensive survey** on the evaluation of large multi-modality models, jointly with **MME Team** and **LMMs-Lab** 🔥🔥🔥

  • [2024-12-11] Supported **NaturalBench**, a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.
  • [2024-12-02] Supported **VisOnlyQA**, a benchmark for evaluating the visual perception capabilities 🔥🔥🔥
  • [2024-11-26] Supported **Ovis1.6-Gemma2-27B**, thanks to **runninglsy** 🔥🔥🔥
  • [2024-11-25] Added a new flag VLMEVALKIT_USE_MODELSCOPE. Setting this environment variable lets you download the supported video benchmarks from **modelscope** 🔥🔥🔥
  • [2024-11-25] Supported **VizWiz** benchmark 🔥🔥🔥
  • [2024-11-22] Supported the inference of **MMGenBench**, thanks to **lerogo** 🔥🔥🔥
  • [2024-11-22] Supported **Dynamath**, a multimodal math benchmark comprising 501 seed problems and 10 variants generated from random seeds. The benchmark can be used to measure the robustness of MLLMs in multi-modal math solving 🔥🔥🔥
  • [2024-11-21] Integrated a new config system to enable more flexible evaluation settings. Check the [Document](/docs/en/ConfigSystem.md) or run python run.py --help for more details 🔥🔥🔥
  • [2024-11-21] Supported **QSpatial**, a multimodal benchmark for Quantitative Spatial Reasoning (e.g., determining size or distance), thanks to **andrewliao11** for providing the official support 🔥🔥🔥
  • [2024-11-21] Supported **MM-Math**, a new multimodal math benchmark comprising ~6K middle school multi-modal reasoning math problems. GPT-4o-20240806 achieves 22.5% accuracy on this benchmark 🔥🔥🔥
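
The VLMEVALKIT_USE_MODELSCOPE flag and the `python run.py --help` command from the news items above can be combined in a small launcher. This is a sketch under assumptions: the value `"1"` is a guess, since the notes only say the variable must be set, and `modelscope_env` is a hypothetical helper name. The actual call is left commented out because it must be run from the repository root.

```python
import os
import sys
from typing import Dict, Mapping, Optional


def modelscope_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: copy an environment and enable the modelscope flag."""
    env = dict(os.environ if base is None else base)
    # Assumed value; the release note only says to set the variable.
    env["VLMEVALKIT_USE_MODELSCOPE"] = "1"
    return env


# Command quoted in the release notes, using the current interpreter.
help_cmd = [sys.executable, "run.py", "--help"]

# From the repository root, this would print the available options:
# import subprocess
# subprocess.run(help_cmd, env=modelscope_env(), check=True)
```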

🏗️ QuickStart

See [[QuickStart](/docs/en/Quickstart.md) | [快速开始](/docs/zh-CN/Quickstart.md)] for a quick start guide.

📊 Datasets, Models, and Evaluation Results

Evaluation Results

The performance numbers on our official multi-modal leaderboards can be downloaded from here!

**OpenVLM Leaderboard**: **Download All DETAILED Results**.

See the Supported Benchmarks tab in **VLMEvalKit Features** for all supported image and video benchmarks (70+).

See the Supported LMMs tab in **VLMEvalKit Features** for all supported LMMs, including commercial APIs, open-source models, and more (200+).

Transformers Version Recommendation:

Note that some VLMs may not run under certain transformers versions. We recommend the following settings to evaluate each VLM:

  • Please use transformers==4.33.0 for: Qwen series, Monkey series, InternLM-XComposer Series, mPLUG-Owl2, OpenFlamingo v2, IDEFICS series, VisualGLM, MMAlaya, ShareCaptioner, MiniGPT-4 series, InstructBLIP series, PandaGPT, VXVERSE.
  • Please use transformers==4.36.2 for: Moondream1.
  • Please use transformers==4.37.0 for: LLaVA series, ShareGPT4V series, TransCore-M, LLaVA (XTuner), CogVLM Series, EMU2 Series, Yi-VL Series, MiniCPM-[V1/V2], OmniLMM-12B, DeepSeek-VL series, InternVL series, Cambrian Series, VILA Series, Llama-3-MixSenseV1_1, Parrot-7B, PLLaVA Series.
  • Please use transformers==4.40.0 for: IDEFICS2, Bunny-Llama3, MiniCPM-Llama3-V2.5, 360VL-70B, Phi-3-Vision, WeMM.
  • Please use transformers==4.44.0 for: Moondream2, H2OVL series.
  • Please use transformers==4.45.0 for: Aria.
  • Please use transformers==latest for: LLaVA-Next series, PaliGemma-3B, Chameleon series, Video-LLaVA-7B-HF, Ovis series, Mantis series, MiniCPM-V2.6, OmChat-v2.0-13B-sinlge-beta, Idefics-3, GLM-4v-9B, VideoChat2-HD, RBDash_72b, Llama-3.2 series, Kosmos series.
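
Because the recommended pin differs per model, a quick lookup before launching an evaluation can catch a mismatched environment. The sketch below is illustrative only: the table is a hand-copied subset of the list above, and the helper names (`recommended_transformers`, `installed_transformers`, `pin_matches`) are assumptions, not part of VLMEvalKit.

```python
from importlib import metadata
from typing import Optional

# Partial subset of the recommendations above, copied by hand for illustration.
RECOMMENDED_TRANSFORMERS = {
    "Moondream1": "4.36.2",
    "IDEFICS2": "4.40.0",
    "Phi-3-Vision": "4.40.0",
    "Moondream2": "4.44.0",
    "Aria": "4.45.0",
    "PaliGemma-3B": "latest",
}


def recommended_transformers(model: str) -> Optional[str]:
    """Return the recommended pin for a model, or None if it is not in the table."""
    return RECOMMENDED_TRANSFORMERS.get(model)


def installed_transformers() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


def pin_matches(model: str) -> Optional[bool]:
    """True/False when the pin can be checked, None when it cannot."""
    wanted = recommended_transformers(model)
    have = installed_transformers()
    if wanted is None or have is None or wanted == "latest":
        # Unknown model, missing package, or "latest" (not checkable offline).
        return None
    return have == wanted
```

A `False` result from `pin_matches("Aria")` would suggest creating a separate environment with `transformers==4.45.0` before evaluating that model.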

Torchvision Version Recommendation:

Note that some VLMs may not run under certain torchvision versions. We recommend the following settings to evaluate each VLM:

  • Please use torchvision>=0.16 for: Moondream series and Aria
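
A minimum-version requirement like this one can be checked by comparing the major and minor components of the version string. The following is a rough sketch with a hypothetical helper, `meets_minimum`; it only inspects the leading `major.minor` digits and ignores local build suffixes, so it is not a full version parser.

```python
import re
from typing import Tuple


def meets_minimum(version: str, minimum: Tuple[int, int] = (0, 16)) -> bool:
    """Hypothetical helper: compare the leading major.minor of a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders by major first, then minor.
    return (major, minor) >= minimum
```

In practice the input would come from `torchvision.__version__` or `importlib.metadata.version("torchvision")`, for instance a string shaped like `0.16.2+cu121`.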

Flash-attn Version Recommendation:

Note that some VLMs may not be able to run…

Notability

notability 2.0/10

Routine fork, no added value.