stepfun-ai/GEBench
Language: Python
License: Apache-2.0
Stars: 54
Forks: 1
Open issues: 0
Created: 2026-02-09T13:18:49Z
Pushed: 2026-02-25T08:33:22Z
Default branch: main
Fork: no
Archived: no
README:
GEBench: Benchmarking Image Generation Models as GUI Environments

Features
- 5 Data Types: Type 1 (single-step), Type 2 (multi-step), Type 3 (text-fictionalapp), Type 4 (text-realapp), Type 5 (grounding)
- Bilingual Support: Automatic Chinese/English prompt selection based on folder naming
- 5-Dimensional Metrics: goal, logic, consistency, ui, quality
Quick Start
Installation
# Clone repository
git clone https://github.com/stepfun-ai/GEBench
cd GEBench
# Create conda environment
conda create -n gebench python=3.10 -y
conda activate gebench
# Install dependencies
pip install -r requirements.txt
Data
The GEBench data is available on HuggingFace:
📊 [StepFun-ai/GEBench](https://huggingface.co/datasets/stepfun-ai/GEBench) - HuggingFace Datasets Hub
To download:
cd /path/to/GEBench
git clone https://huggingface.co/datasets/stepfun-ai/GEBench ./data
Generate Images
python scripts/generate.py --data-type type1 --data-folder data/01_single_step --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY
python scripts/generate.py --data-type type2 --data-folder data/02_multi_step --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY
python scripts/generate.py --data-type type3 --data-folder data/03_trajectory_text_fictionalapp --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY
python scripts/generate.py --data-type type4 --data-folder data/04_trajectory_text_realapp --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY
python scripts/generate.py --data-type type5 --data-folder data/05_grounding_data --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY
# With multiple workers
python scripts/generate.py --data-type type1 --data-folder data/01_single_step --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY --workers 4
Evaluate Results
python scripts/evaluate.py --data-type type1 --output-folder outputs/gemini/01_single_step --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY
python scripts/evaluate.py --data-type type2 --output-folder outputs/gemini/02_multi_step --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY
python scripts/evaluate.py --data-type type3 --output-folder outputs/gemini/03_trajectory_text_fictionalapp --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY
python scripts/evaluate.py --data-type type4 --output-folder outputs/gemini/04_trajectory_text_realapp --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY
python scripts/evaluate.py --data-type type5 --output-folder outputs/gemini/05_grounding_data --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY
# With multiple workers
python scripts/evaluate.py --data-type type1 --output-folder outputs/gemini/01_single_step --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY --workers 4
python scripts/evaluate.py --data-type type2 --output-folder outputs/gemini/02_multi_step --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY --workers 4
python scripts/evaluate.py --data-type type3 --output-folder outputs/gemini/03_trajectory_text_fictionalapp --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY --workers 4
python scripts/evaluate.py --data-type type4 --output-folder outputs/gemini/04_trajectory_text_realapp --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY --workers 4
python scripts/evaluate.py --data-type type5 --output-folder outputs/gemini/05_grounding_data --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY --workers 4
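The generation and evaluation commands differ only in the data type and its folder name, so they can be built programmatically. The following is a minimal sketch, not part of the repository: the helper names `generate_cmd` and `evaluate_cmd` are hypothetical, and the rule that the evaluation folder is `<output-dir>/<data folder name>` is an assumption inferred from the example commands. Only flags that appear in those commands are used.

```python
import shlex

# Data type -> dataset folder name, copied from the example commands.
DATA_FOLDERS = {
    "type1": "01_single_step",
    "type2": "02_multi_step",
    "type3": "03_trajectory_text_fictionalapp",
    "type4": "04_trajectory_text_realapp",
    "type5": "05_grounding_data",
}


def generate_cmd(data_type, output_dir, api_key, workers=None):
    """Build the argument list for scripts/generate.py (hypothetical helper)."""
    cmd = [
        "python", "scripts/generate.py",
        "--data-type", data_type,
        "--data-folder", f"data/{DATA_FOLDERS[data_type]}",
        "--output-dir", output_dir,
        "--gemini-api-key", api_key,
    ]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    return cmd


def evaluate_cmd(data_type, output_dir, api_key, workers=None):
    """Build the argument list for scripts/evaluate.py (hypothetical helper).

    Assumes generated images land in <output_dir>/<data folder name>,
    as the example commands suggest.
    """
    cmd = [
        "python", "scripts/evaluate.py",
        "--data-type", data_type,
        "--output-folder", f"{output_dir}/{DATA_FOLDERS[data_type]}",
        "--dataset-root", "data",
        "--openai-api-key", api_key,
    ]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them; pass a list to
    # subprocess.run(cmd, check=True) to execute one for real.
    for data_type in DATA_FOLDERS:
        print(shlex.join(generate_cmd(data_type, "outputs/gemini", "YOUR_GEMINI_API_KEY")))
        print(shlex.join(evaluate_cmd(data_type, "outputs/gemini", "YOUR_OPENAI_API_KEY")))
```

Printing first makes it easy to check the paths before spending API quota on a full run.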
Main Results
Chinese Subset Results
| Model | Single-Step | Multi-Step | Fiction-App | Real-App | Grounding | GE Score |
|---|---|---|---|---|---|---|
| Nano Banana pro | 84.50 | 68.65 | 65.75 | 64.35 | 64.83 | 69.62 |
| Nano Banana | 64.36 | 34.16 | 64.82 | 65.89 | 54.48 | 56.74 |
| GPT-image-1.5 | 83.79 | 56.97 | 60.11 | 55.65 | 53.33 | 63.22 |
| GPT-image-1.0 | 64.72 | 49.20 | 57.31 | 59.04 | 31.68 | 52.39 |
| Seedream 4.5 | 63.64 | 53.11 | 56.48 | 53.44 | 52.90 | 55.91 |
| Seedream 4.0 | 62.04 | 48.64 | 49.28 | 50.93 | 53.53 | 52.88 |
| Wan 2.6 | 64.20 | 50.11 | 52.72 | 50.40 | 59.58 | 55.40 |
| Flux-2-pro | 68.83 | 55.07 | 58.13 | 55.41 | 50.24 | 57.54 |
| Bagel | 34.84 | 13.45 | 27.36 | 33.52 | 35.10 | 28.85 |
| UniWorld-V2 | 55.33 | 24.95 | 32.03 | 21.39 | 49.60 | 36.66 |
| Qwen-Image-Edit | 41.12 | 26.79 | 23.78 | 26.10 | 50.80 | 33.72 |
| Longcat-Image | 48.76 | 12.75 | 30.03 | 30.00 | 51.02 | 34.51 |
English Subset Results
| Model | Single-Step | Multi-Step | Fiction-App | Real-App | Grounding | GE Score |
|---|---|---|---|---|---|---|
| Nano Banana pro | 84.32 | 69.51 | 46.33 | 47.20 | 58.64 | 61.20 |
| Nano Banana | 64.80 | 50.75 | 48.88 | 47.12 | 49.04 | 52.12 |
| GPT-image-1.5 | 80.80 | 58.87 | 63.68 | 58.93 | 49.23 | 63.16 |
| GPT-image-1.0 | 60.92 | 64.33 | 58.94 | 56.16 | 37.84 | 55.64 |
| Seedream 4.5 | 49.49 | 45.30 | 53.81 | 51.80 | 49.63 | 50.01 |
| Seedream 4.0 | 53.28 | 37.57 | 47.92 | 49.36 | 44.17 | 46.46 |
| Wan 2.6 | 60.17 | 44.36 | 49.55 | 44.80 | 53.36 | 50.45 |
| Flux-2-pro | 61.00 | 52.17 | 49.92 | 47.16 | 45.67 | 51.18 |
| Bagel | 32.91 | 8.61 | 26.08 | 35.12 | 37.30 | 28.00 |
| UniWorld-V2 | 42.68 | 14.14 | 30.08 | 26.83 | 47.04 | 32.15 |
| Qwen-Image-Edit | 40.12 | 18.61 | 25.80 | 25.95 | 54.55 | 33.01 |
| Longcat-Image | 36.69 | 8.44 | 37.30 | 36.83 | 47.12 | 33.28 |
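The README does not say how the GE Score column is derived. An unweighted mean of the five task columns reproduces the listed value for every row in both tables except GPT-image-1.5 (the mean gives 61.97 against a listed 63.22 for Chinese, and 62.30 against 63.16 for English). The sketch below therefore treats the mean as an assumption rather than the official formula; `ge_score` is a hypothetical helper, and the sample rows are copied from the Chinese subset.

```python
from statistics import mean

# Task scores in column order: Single-Step, Multi-Step, Fiction-App,
# Real-App, Grounding. Values copied from the Chinese subset results.
CHINESE_SCORES = {
    "Nano Banana pro": [84.50, 68.65, 65.75, 64.35, 64.83],
    "GPT-image-1.0": [64.72, 49.20, 57.31, 59.04, 31.68],
    "Bagel": [34.84, 13.45, 27.36, 33.52, 35.10],
}


def ge_score(task_scores):
    """Assumed aggregation: unweighted mean of the five task scores,
    rounded to two decimals. Not confirmed by the README."""
    return round(mean(task_scores), 2)


for model, scores in CHINESE_SCORES.items():
    print(f"{model}: {ge_score(scores):.2f}")
# Nano Banana pro: 69.62
# GPT-image-1.0: 52.39
# Bagel: 28.85
```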
Citation
If you find GEBench useful, please cite our paper:
@article{li2026gebench,
title={GEBench: Benchmarking Image Generation Models as GUI Environments},
author={Haodong Li and Jingwei Wu and Quan Sun and Guopeng Li and Juanxi Tian and Huanyu Zhang and Yanlin Lai and Ruichuan An and Hongbo Peng and Yuhong Dai and Chenxi Li and Chunmei Qing and Jia Wang and Ziyang Meng and Zheng Ge and Xiangyu Zhang and Daxin Jiang},
journal={arXiv preprint arXiv:2602.09007},
year={2026}
}
Notability: 5.0/10. Solid new benchmark repo with moderate traction.