What does this repo signal mean?

Meituan (LongCat) published meituan-longcat/vitabench (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo meituan-longcat/vitabench · language Python · new repo, moderate stars from Meituan. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Meituan (LongCat) Repo: meituan-longcat/vitabench

Captured source

source ↗

GitHub/github.com/meituan-longcat/vitabench

meituan-longcat/vitabench repository metadata

published Oct 14, 2025 · seen 5d · captured 10h · http 200 · method plain

meituan-longcat/vitabench

Description: [ICLR 2026] VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

Language: Python

License: MIT

Stars: 145

Forks: 16

Open issues: 8

Created: 2025-10-14T12:23:29Z

Pushed: 2026-02-22T16:08:39Z

Default branch: main

Fork: no

Archived: no

README:

📃 Paper • 🌐 Website • 🏆 Leaderboard • 🤗 Dataset

🔔 News

[2026-02] VitaBench is cited by Qwen3.5 and Seed2.0! Growing adoption as the go-to benchmark for tool-use evaluation. 🚀
[2026-01] Qwen Team has used VitaBench to evaluate their new model Qwen3-Max-Thinking, publishing the average score across 4 domains. We encourage the community to adopt VitaBench as the definitive benchmark for tool-use performance and welcome diverse interpretations of its results! 🔥
[2026-01] VitaBench has been accepted to [ICLR 2026](https://openreview.net/forum?id=rtcX9qOBaz)! 🎉
[2026-01] An updated version of VitaBench is released with rectified datasets and tools, upgraded evaluation models, and updated metrics for proprietary and open language models based on the new evaluator.
[2025-11] The English version of the VitaBench dataset is now released! It includes fully translated tasks and databases, enabling broader international use. Try it out!
[2025-10] Our paper is released on arXiv: VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
[2025-10] The VitaBench suite is released, including the codebase, dataset and evaluation pipeline! If you have any questions, feel free to raise issues and/or submit pull requests for new features or bug fixes.

📖 Introduction

In this paper, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. Drawing from daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross-scenario tasks (main results) and 300 single-scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, utilize complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations.

Moreover, we propose a rubric-based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only a 32.5% success rate on cross-scenario tasks, and less than a 62% success rate on the single-scenario tasks. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications.
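The excerpt does not describe how the rubric-based sliding window evaluator works internally, so the following is only a minimal sketch of the general idea, under stated assumptions: a judge looks at overlapping windows of the conversation, each rubric item counts as met once any window satisfies it, and a task passes only if every item is met. The function names, the window and stride sizes, the all-or-nothing scoring, and the sample dialogue are all hypothetical; plain predicates stand in for the LLM judge.

```python
from typing import Callable, Dict, List

# Hypothetical rubric: item name -> predicate over a window of turns.
# In a real evaluator an LLM judge would play this role.
Rubric = Dict[str, Callable[[List[str]], bool]]


def sliding_window_eval(turns: List[str], rubric: Rubric,
                        window: int = 3, stride: int = 2) -> Dict[str, bool]:
    """Mark a rubric item satisfied once any window of turns satisfies it."""
    satisfied = {name: False for name in rubric}
    last_start = max(len(turns) - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the dialogue is covered
    for start in starts:
        chunk = turns[start:start + window]
        for name, check in rubric.items():
            # State persists across windows: once met, an item stays met.
            satisfied[name] = satisfied[name] or check(chunk)
    return satisfied


def task_success(satisfied: Dict[str, bool]) -> bool:
    """Assumed all-or-nothing scoring: pass only if every item is met."""
    return all(satisfied.values())


# Made-up dialogue and rubric, for illustration only.
turns = [
    "user: book a table for two tonight",
    "agent: which restaurant?",
    "user: the noodle place near my hotel",
    "agent: tool_call reserve(restaurant='noodle place', people=2)",
    "agent: booked for 19:00",
]
rubric = {
    "asked_clarification": lambda w: any("which restaurant" in t for t in w),
    "made_reservation": lambda w: any("reserve(" in t for t in w),
}
result = sliding_window_eval(turns, rubric)
print(result, task_success(result))
```

Carrying the satisfied state across windows is what lets different solution paths pass: the order in which the agent meets the rubric items does not matter in this sketch, only that each one is met somewhere in the trajectory.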

> *The name “Vita” derives from the Latin word for “Life”, reflecting our focus on life-serving applications.*

![overall_performance](assets/overall_performance.png)

🌱 Benchmark Details

VitaBench provides an evaluation framework that supports model evaluations on both single-domain and cross-domain tasks through flexible configuration. For cross-domain evaluation, simply connect multiple domain names with commas—this will automatically merge the environments of the specified domains into a unified environment.
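As a rough illustration of what merging comma-separated domains into one environment could look like, the sketch below parses a domain string and unions per-domain tool lists, keeping shared tools only once. The registry, the tool names, and both helper functions are hypothetical and are not taken from the VitaBench codebase.

```python
from typing import Dict, List

# Hypothetical per-domain tool registries; the tool names are made up.
DOMAIN_TOOLS: Dict[str, List[str]] = {
    "delivery": ["search_restaurants", "create_delivery_order", "get_weather"],
    "instore": ["search_shops", "book_table", "get_weather"],
    "ota": ["search_hotels", "book_flight", "get_weather"],
}


def parse_domains(spec: str) -> List[str]:
    """Split 'delivery,ota' (optionally wrapped in brackets) into domain names."""
    names = [n.strip() for n in spec.strip("[]").split(",") if n.strip()]
    unknown = [n for n in names if n not in DOMAIN_TOOLS]
    if unknown:
        raise ValueError(f"unknown domain(s): {unknown}")
    return names


def merge_environments(spec: str) -> List[str]:
    """Union the tools of every requested domain, preserving first-seen order."""
    merged: List[str] = []
    for name in parse_domains(spec):
        for tool in DOMAIN_TOOLS[name]:
            if tool not in merged:  # tools shared across domains appear once
                merged.append(tool)
    return merged


print(merge_environments("delivery,ota"))
```

Deduplicating shared tools is a design guess that matches the statistics in the README, where the cross-scenario tool count is smaller than the sum of the per-domain counts.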

Statistics of databases and environments:

| | Cross-Scenarios (All domains) | Delivery | In-store | OTA |
| :----------------------------- | :-------------: | :------: | :------: | :---: |
| Databases | | | | |
| Service Providers | 1,324 | 409 | 611 | 1,437 |
| Products | 6,942 | 784 | 3,277 | 9,693 |
| Transactions | 334 | 48 | 36 | 154 |
| API Tools | | | | |
| Write | 27 | 4 | 9 | 14 |
| Read | 33 | 10 | 10 | 19 |
| General | 6 | 6 | 5 | 5 |
| Tasks | 100 | 100 | 100 | 100 |
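A quick arithmetic check ties these statistics back to the introduction: the cross-scenario tool counts add up to the 66 tools mentioned there, and the three single-domain task sets add up to the 300 single-scenario tasks. The dictionaries below simply restate the numbers from the table; the variable names are illustrative.

```python
# API tool counts per column, copied from the statistics table.
tools = {
    "cross": {"write": 27, "read": 33, "general": 6},
    "delivery": {"write": 4, "read": 10, "general": 6},
    "instore": {"write": 9, "read": 10, "general": 5},
    "ota": {"write": 14, "read": 19, "general": 5},
}
tasks = {"cross": 100, "delivery": 100, "instore": 100, "ota": 100}

totals = {domain: sum(counts.values()) for domain, counts in tools.items()}
single_scenario_tasks = sum(n for d, n in tasks.items() if d != "cross")

print(totals)                 # per-column tool totals
print(single_scenario_tasks)  # number of single-scenario tasks
```

The per-domain totals (20, 24, and 38) sum to more than 66, which suggests that some tools are shared between domains and counted once in the merged environment.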

🛠️ Quick Start

Installation

1. Clone the repository:

git clone https://github.com/meituan-longcat/vitabench.git
cd vitabench

2. Install VitaBench:

pip install -e .

This will enable you to run the vita command.

Setup LLM Configurations

If you want to customize the location of the models.yaml file, you can specify the environment variable VITA_MODEL_CONFIG_PATH (default path from repository root is src/vita/models.yaml). For example:

export VITA_MODEL_CONFIG_PATH=/path/to/your/model/configuration
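The lookup described above can be sketched as follows: use `VITA_MODEL_CONFIG_PATH` when it is set, and otherwise fall back to `src/vita/models.yaml` under the repository root. The function name `resolve_model_config` is hypothetical, and the way VitaBench actually resolves the path may differ.

```python
import os
from pathlib import Path

# Default location stated in the README, relative to the repository root.
DEFAULT_CONFIG = Path("src/vita/models.yaml")


def resolve_model_config(repo_root: Path = Path(".")) -> Path:
    """Return the models.yaml path, preferring the environment override."""
    override = os.environ.get("VITA_MODEL_CONFIG_PATH")
    if override:
        return Path(override).expanduser()
    return repo_root / DEFAULT_CONFIG


print(resolve_model_config())
```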

Example models.yaml file

default:
  base_url:
  temperature:
  max_input_tokens:
  headers:
    Accept: "*/*"
    Accept-Encoding: "gzip, deflate, br"
    Content-Type: "application/json"
    Authorization: "Bearer "
    Connection: "keep-alive"
    Cookie:
    User-Agent:

models:
  - name:
    max_tokens:
    max_input_tokens:
    reasoning_effort: "high"
    thinking:
      type: "enabled"
      budget_tokens:
    cost_1m_token_dollar:
      prompt_price:
      completion_price:
The default configuration applies to all models; a model-specific configuration can override the default values.
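One way to picture this override behavior is a recursive merge in which model-level keys win over the `default` block. This is a minimal sketch that assumes the YAML has already been parsed into dictionaries; `deep_merge`, `resolve_model`, the model name, and the sample values are hypothetical, and whether VitaBench merges nested keys such as `headers` deeply or shallowly is not stated in the excerpt.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with override applied; nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_model(config: Dict[str, Any], name: str) -> Dict[str, Any]:
    """Combine the default block with the entry for the named model."""
    for model in config.get("models", []):
        if model.get("name") == name:
            return deep_merge(config.get("default", {}), model)
    raise KeyError(f"model {name!r} not found")


# Made-up values, for illustration only.
config = {
    "default": {
        "temperature": 0.0,
        "headers": {"Content-Type": "application/json"},
    },
    "models": [
        {"name": "example-agent", "temperature": 0.7, "max_tokens": 4096},
    ],
}
resolved = resolve_model(config, "example-agent")
print(resolved)
```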

Run evaluations

To run a test evaluation:

vita run \
--domain \ # support single domain (delivery/instore/ota) and cross domain ([delivery,instore,ota])
--user-llm \ # model name in models.yaml
--agent-llm \ # model name in models.yaml
--enable-think \ # Enable think mode for the agent. Default is False.
--evaluator-llm \ # The LLM to use for evaluation.
--num-trials 1 \ # (Optional) The number of times each task is run. Default is 1.
--num-tasks 1 \ # (Optional) The number of tasks to run. Default is the number of all tasks.
--task-ids 1 \ # (Optional) Run only the tasks with the given IDs. Default is run all tasks.
--max-steps 300 \ # (Optional) The maximum number of steps to run the simulation. Default is 300.
--max-concurrency 1 \ # (Optional) The maximum number of concurrent simulations to run. Default is 1.
--csv-output \ # (Optional) Path to CSV file to append results.
--language \ # (Optional) The language to use for prompts and tasks. Choices: chinese, english. Default is chinese.

Results will be saved in data/simulations/.
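For scripted runs, the command can be assembled programmatically. The sketch below only builds and prints the argument list from the flags listed above; it does not invoke `vita`. The helper name `build_vita_run` and the model names are hypothetical, and treating `--enable-think` as a value-less switch is an assumption based on its description.

```python
import shlex
from typing import List


def build_vita_run(domain: str, user_llm: str, agent_llm: str,
                   evaluator_llm: str, num_trials: int = 1,
                   max_concurrency: int = 1, language: str = "chinese",
                   enable_think: bool = False) -> List[str]:
    """Assemble a `vita run` argument list from the documented flags."""
    argv = [
        "vita", "run",
        "--domain", domain,
        "--user-llm", user_llm,
        "--agent-llm", agent_llm,
        "--evaluator-llm", evaluator_llm,
        "--num-trials", str(num_trials),
        "--max-concurrency", str(max_concurrency),
        "--language", language,
    ]
    if enable_think:
        argv.append("--enable-think")  # assumed to be a boolean switch
    return argv


# Hypothetical model names; they must match entries in models.yaml.
argv = build_vita_run(
    domain="delivery,ota",
    user_llm="example-user-model",
    agent_llm="example-agent-model",
    evaluator_llm="example-evaluator-model",
    language="english",
)
print(shlex.join(argv))
```

Passing the resulting list to `subprocess.run(argv, check=True)` would launch the evaluation once the package is installed and the named models exist in the configuration.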

Re-evaluation simulation

Re-evaluate existing simulations instead of running new ones.

vita run \…

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

New repo, moderate stars from Meituan.