meituan-longcat/WBench
Description: WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
Language: Python
License: MIT
Stars: 126
Forks: 4
Open issues: 2
Created: 2026-05-22T10:46:52Z
Pushed: 2026-06-10T09:21:38Z
Default branch: main
Fork: no
Archived: no
README:
---
TL;DR — WBench evaluates 20 video world models across 5 dimensions and 22 metrics.
📢 News
- [2026/06/10] 🧭 Added HY-World 1.5 pose exports to WBench-examples.
- [2026/06/01] WBench is now an official benchmark on Hugging Face 🤗 (navi & full tasks)!
- [2026/06/01] 📦 Released WBench-examples: ready-to-eval videos from HY-World 1.5 & Kling 3.0.
- [2026/06/01] 🎮 Added [camera- & action-conditioned examples](#-implement-your-model) + web automation (Genie3, Happy Oyster).
- [2026/06/01] Added [Claude Code skills](#-claude-code-skills) 🤖 for generation, evaluation & submission.
- [2026/05/29] Paper ranked #2 🏅 on Hugging Face Daily Papers!
- [2026/05/28] Paper now available on arXiv 📄!
- [2026/05/28] Homepage with interactive leaderboard & dataset gallery is live! 🌐
- [2026/05/28] 🚀 Released the full WBench dataset, evaluation code & model weights.
✨ Contributions
- A comprehensive evaluation framework with 289 cases and 1,058 interaction turns, covering 4 interaction types (navigation, subject action, event editing, perspective switching) across diverse scenes and perspectives.
- A unified navigation protocol that bridges text, 6-DoF camera pose, and discrete-action interfaces, enabling fair comparison across model families.
- 22 automatic metrics spanning 5 complementary dimensions, validated against human judgments, ensuring reliable automatic evaluation at scale.
- Systematic diagnosis of 20 models revealing that current world models have not yet unified high-fidelity rendering with reliable controllability, consistency, and physics compliance.
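
To make the unified navigation protocol more concrete, the sketch below shows one way a discrete-action sequence could be rolled out into camera poses and also rendered as a text instruction, so that the same trajectory can drive all three interface types. Everything in it is an assumption for illustration: the action names, the step and turn sizes, the simplified planar pose (x, z, yaw) in place of a full 6-DoF pose, and the helpers `step`, `rollout`, and `describe`. It is not WBench's actual protocol.

```python
import math
from dataclasses import dataclass

# Hypothetical constants; WBench's real step sizes are not given in this README.
STEP_M = 0.5                 # assumed translation per move action
TURN_RAD = math.radians(15)  # assumed rotation per turn action


@dataclass(frozen=True)
class Pose:
    """Simplified planar camera pose (a subset of a full 6-DoF pose)."""
    x: float = 0.0    # rightward offset
    z: float = 0.0    # forward offset when yaw == 0
    yaw: float = 0.0  # radians, positive means turned to the right


# Hypothetical action vocabulary and its text rendering.
PHRASES = {
    "forward": "move forward",
    "backward": "move backward",
    "turn_left": "turn left",
    "turn_right": "turn right",
}


def step(pose, action):
    """Apply one discrete action and return the resulting pose."""
    if action in ("forward", "backward"):
        sign = 1.0 if action == "forward" else -1.0
        return Pose(
            pose.x + sign * STEP_M * math.sin(pose.yaw),
            pose.z + sign * STEP_M * math.cos(pose.yaw),
            pose.yaw,
        )
    if action == "turn_left":
        return Pose(pose.x, pose.z, pose.yaw - TURN_RAD)
    if action == "turn_right":
        return Pose(pose.x, pose.z, pose.yaw + TURN_RAD)
    raise ValueError(f"unknown action: {action}")


def rollout(actions, start=Pose()):
    """Return the pose after each action, including the starting pose."""
    poses = [start]
    for action in actions:
        poses.append(step(poses[-1], action))
    return poses


def describe(actions):
    """Render the same action sequence as a text instruction."""
    return ", then ".join(PHRASES[a] for a in actions)


actions = ["forward", "turn_right", "forward"]
print(describe(actions))
print(rollout(actions)[-1])
```

The point of the design is that one canonical action list is the source of truth, and each model family receives the representation it accepts: the action list itself, the pose sequence, or the text.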
🏆 Leaderboard
20 Models — Navigation Split (5 Dimensions, sorted by average)
| # | Model | Average | Quality | Setting | Interaction | Consistency | Physical |
|:---:|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | Kling 3.0 | 79.2 🥇 | 83.0 🥈 | 91.0 🥈 | 70.3 | 82.5 | 69.3 🥉 |
| 2 | LingBot-World | 78.8 🥈 | 81.5 | 72.6 | 79.8 | 88.9 🥇 | 71.2 🥈 |
| 3 | Wan 2.7 | 78.5 🥉 | 82.6 🥉 | 91.4 🥇 | 66.0 | 80.5 | 71.8 🥇 |
| 4 | HY-World 1.5 | 78.4 | 80.2 | 72.2 | 87.5 🥇 | 86.0 | 66.3 |
| 5 | HY-Video 1.5 | 78.2 | 79.7 | 85.6 🥉 | 71.8 | 86.7 🥉 | 67.4 |
| 6 | Happy Oyster | 77.1 | 79.3 | 74.2 | 85.1 🥈 | 83.3 | 63.5 |
| 7 | Seedance 1.5 | 76.5 | 83.2 🥇 | 82.9 | 68.0 | 80.2 | 68.4 |
| 8 | Cosmos 2.5 | 75.2 | 75.6 | 83.3 | 64.1 | 85.6 | 67.4 |
| 9 | LTX 2.3 | 74.4 | 78.7 | 85.2 | 67.6 | 75.6 | 64.9 |
| 10 | InSpatio-World | 74.3 | 74.9 | 71.4 | 72.8 | 87.4 🥈 | 65.2 |
| 11 | Fantasy-World | 74.2 | 75.5 | 71.3 | 72.1 | 85.3 | 66.8 |
| 12 | Genie 3 | 74.1 | 77.4 | 72.5 | 73.3 | 81.4 | 65.7 |
| 13 | LongCat-Video | 73.7 | 78.2 | 72.3 | 63.1 | 85.9 | 68.9 |
| 14 | YUME 1.5 | 73.5 | 79.5 | 72.4 | 72.0 | 78.6 | 65.2 |
| 15 | Infinite-World | 72.9 | 78.7 | 69.3 | 75.9 | 78.7 | 62.1 |
| 16 | MatrixGame3 | 71.2 | 76.9 | 63.6 | 83.5 🥉 | 72.9 | 59.3 |
| 17 | Kairos 3.0 | 70.7 | 76.4 | 70.3 | 65.1 | 81.4 | 60.4 |
| 18 | HY-GameCraft | 68.5 | 74.9 | 66.6 | 67.8 | 70.6 | 62.4 |
| 19 | MatrixGame2 | 68.5 | 75.7 | 67.1 | 80.6 | 62.0 | 57.2 |
| 20 | Astra | 64.0 | 69.7 | 59.6 | 67.7 | 71.6 | 51.4 |
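
This excerpt does not state how the Average column is computed, but the published numbers are consistent with an unweighted mean of the five dimension scores. The short check below recomputes it for the top three rows of the Navigation Split table; the helper name `mean_of_dimensions` and the rounding tolerance are assumptions made for this illustration.

```python
# (reported Average, [Quality, Setting, Interaction, Consistency, Physical]),
# copied from the top three rows of the Navigation Split table.
rows = {
    "Kling 3.0": (79.2, [83.0, 91.0, 70.3, 82.5, 69.3]),
    "LingBot-World": (78.8, [81.5, 72.6, 79.8, 88.9, 71.2]),
    "Wan 2.7": (78.5, [82.6, 91.4, 66.0, 80.5, 71.8]),
}


def mean_of_dimensions(scores):
    """Unweighted mean of the dimension scores (assumed aggregation rule)."""
    return sum(scores) / len(scores)


for model, (reported, scores) in rows.items():
    computed = mean_of_dimensions(scores)
    # Published values are rounded to one decimal, so allow rounding slack.
    assert abs(computed - reported) < 0.06, model
    print(f"{model}: reported {reported}, recomputed {computed:.2f}")
```

Because the gaps between neighbouring ranks are often only a few tenths of a point, small changes in any single dimension can reorder the top of the table.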
9 Text-driven Models — Full Split (5 Dimensions, sorted by average)
| # | Model | Average | Quality | Setting | Interaction | Consistency | Physical |
|:---:|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | Kling 3.0 | 79.5 🥇 | 81.8 🥉 | 91.0 🥈 | 73.1 🥇 | 82.6 | 69.2 🥈 |
| 2 | Wan 2.7 | 78.2 🥈 | 82.2 🥈 | 91.4 🥇 | 72.1 🥈 | 73.8 | 71.6 🥇 |
| 3 | Seedance 1.5 | 76.2 🥉 | 83.0 🥇 | 82.9 | 68.3 🥉 | 78.5 | 68.2 |
| 4 | HY-Video 1.5 | 74.6 | 78.9 | 85.6 🥉 | 54.7 | 86.8 🥇 | 67.1 |
| 5 | LTX 2.3 | 71.0 | 78.8 | 85.2 | 49.4 | 76.4 | 65.1 |
| 6 | Cosmos 2.5 | 70.8 | 74.6 | 83.3 | 43.5 | 85.4 🥉 | 67.0 |
| 7 | LongCat-Video | 70.2 | 79.7 | 72.3 | 45.1 | 85.5 🥈 | 68.4 🥉 |
| 8 | YUME 1.5 | 69.0 | 79.7 | 72.4 | 48.4 | 79.3 | 65.4 |
| 9 | Kairos 3.0 | 66.0 | 75.8 | 70.3 | 41.6 | 81.9 | 60.5 |
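
The nine text-driven models appear in both leaderboards, so their Interaction scores can be compared across splits. The snippet below is a small illustrative calculation using values copied from the two tables; the variable names are arbitrary, and the README excerpt itself does not report these differences.

```python
# (Navigation Split Interaction, Full Split Interaction), copied from the tables.
interaction = {
    "Kling 3.0": (70.3, 73.1),
    "Wan 2.7": (66.0, 72.1),
    "Seedance 1.5": (68.0, 68.3),
    "HY-Video 1.5": (71.8, 54.7),
    "LTX 2.3": (67.6, 49.4),
    "Cosmos 2.5": (64.1, 43.5),
    "LongCat-Video": (63.1, 45.1),
    "YUME 1.5": (72.0, 48.4),
    "Kairos 3.0": (65.1, 41.6),
}

# Full-split score minus navigation-split score, rounded like the tables.
gaps = {model: round(full - nav, 1) for model, (nav, full) in interaction.items()}

# Print from the largest drop to the largest gain.
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{model:14s} {gap:+.1f}")
```

Only the three top-ranked models hold or improve their Interaction score on the Full Split; the other six score at least 17 points lower than on the Navigation Split.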
20 Models — Navigation Split (19 metrics)
| Model | Aesthetic Quality | Imaging Quality | Background Consistency | Temporal Flickering | Dynamic Degree | Motion Smoothness | HPSv3 Quality | Scene Adherence | Subject Adherence | Navigation Trajectory | Spatial Consistency | Gated Spatial Consistency | Perspective Consistency | Segment Continuity | Geometric Consistency | Photometric Consistency | Subject Consistency Cross-Model | Visual Plausibility | Causal Fidelity |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| HY-Video 1.5 | 63.4 | 67.4 | 92.1 | 94.2 | 73.9 | 98.7 | 68.0 | 77.5 | 93.6 | 71.8 | 79.2 | 75.1 | 86.6 | 99.4 | 94.6 | 80.3 | 91.6 | 59.7 | 75.0 |
| Kling 3.0 | 63.0 | 68.1 | 92.3 | 93.2 | 97.5 | 97.6 | 69.1 | 89.0 | 92.9 | 70.3 | 75.2 | 75.1 | 76.8 | 93.0 | 88.9 | 79.9 | 88.5 | 60.7 | 78.0 |
| Cosmos 2.5 | 61.8 | 66.9 | 92.3 | 94.8 | 49.0 | 98.2 | 66.5 | 72.4 | 94.2 | 64.1 | 78.1 | 74.3 | 84.3 | 94.3 | 94.6 | 81.6 | 92.3 | 60.1 | 74.7 |
| LTX 2.3 | 57.9 | 61.0 | 88.3 | 93.2 | 98.1 | 96.4 | 56.1 | 81.3 | 89.2 | 67.6 | 70.2 | 70.2 | 69.8 | 75.8 | 76.9 | 79.2 | 87.2 | 55.7 | 74.0 |
| Seedance 1.5 | 61.0 | 69.3 | 89.6 | 92.4 | 99.4 | 97.5 | 73.0 | 71.6 | 94.2 | 68.0 | 72.7 | 72.4 | 70.5 | 96.2 | 82.4 | 76.8 | 90.1 | 60.7 | 76.0 |
| Wan 2.7 | 61.4 | 68.0 | 89.4 | 92.2 | 100.0 | 96.3 | 71.1 | 88.3…
Excerpt shown — open the source for the full document.
Notability: 5.0/10. New benchmark repo from Meituan, moderate stars.