What does this repo signal mean?

Qwen (Alibaba Cloud) published QwenLM/Qwen-Image-Bench (Python). Repository signals like this one surface tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo QwenLM/Qwen-Image-Bench · language Python · new benchmark repo from Qwen; low stars. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Qwen (Alibaba Cloud) Repo: QwenLM/Qwen-Image-Bench

Captured source

GitHub/github.com/QwenLM/Qwen-Image-Bench

QwenLM/Qwen-Image-Bench repository metadata

published May 21, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

QwenLM/Qwen-Image-Bench

Language: Python

License: Apache-2.0

Stars: 84

Forks: 6

Open issues: 1

Created: 2026-05-21T03:41:40Z

Pushed: 2026-05-28T03:48:55Z

Default branch: main

Fork: no

Archived: no

README:

Qwen-Image-Bench

An evaluation toolkit for text-to-image (T2I) generation models. It uses a fine-tuned Q-Judger (Qwen3.6-27B) to score generated images across 5 hierarchical dimensions (Quality, Aesthetics, Alignment, Real-world Fidelity, Creative Generation) covering 56 fine-grained facets.

Key Features

- Evaluate any T2I model: run the judge model on your own generated images and get structured, multi-dimensional scores
- Compute scores from pre-generated responses: reproduce the leaderboard from the released benchmark dataset
- Powered by ms-swift: uses the same inference setup that produced the benchmark responses

Quick Start

# 1. Clone the repo
git clone https://github.com/QwenLM/Qwen-Image-Bench.git
cd Qwen-Image-Bench

# 2. Install dependencies
uv venv myenv --python 3.11 && source myenv/bin/activate
# Install PyTorch first: https://pytorch.org/get-started/locally/
uv pip install -r requirements.txt

# 3. Run judge on your images
python judge.py \
--input your_data.jsonl \
--model Qwen/Qwen-Image-Bench

Your input file should be a CSV/JSON/JSONL with three columns:

| Column | Type | Description |
|--------|------|-------------|
| ID | int | Prompt identifier (1–1000), must match [benchmark metadata](metadata/bench_metadata.json) |
| prompt | str | The text prompt used to generate the image |
| image_path | str | Path to the generated image file |
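As a rough illustration of the input format described above, the following sketch writes a JSONL file with the three listed columns. The helper name `write_judge_input` and the sample row are hypothetical and not part of the repository; the 1–1000 range check simply mirrors the ID description in the table, and whether `judge.py` enforces it is not stated in the README.

```python
import json

# Column names as listed in the README's input table.
REQUIRED_COLUMNS = ("ID", "prompt", "image_path")


def write_judge_input(rows, out_path):
    """Write rows to a JSONL file, one JSON object per line.

    Hypothetical helper: it only checks that each row carries the three
    documented columns and that ID falls in the documented 1-1000 range.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for i, row in enumerate(rows):
            missing = [c for c in REQUIRED_COLUMNS if c not in row]
            if missing:
                raise ValueError(f"row {i} is missing columns: {missing}")
            prompt_id = int(row["ID"])
            if not 1 <= prompt_id <= 1000:
                raise ValueError(f"row {i} has ID outside 1-1000: {prompt_id}")
            record = {
                "ID": prompt_id,
                "prompt": str(row["prompt"]),
                "image_path": str(row["image_path"]),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


if __name__ == "__main__":
    # Sample row with made-up values, for illustration only.
    write_judge_input(
        [{"ID": 1, "prompt": "a red cube on a table", "image_path": "images/0001.png"}],
        "your_data.jsonl",
    )
```

The resulting file can then be passed to `judge.py` through `--input`.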

Installation

Step-by-step

1. Create and activate a virtual environment:

uv venv myenv --python 3.11
source myenv/bin/activate

2. Install PyTorch (select the command matching your CUDA version):

See the official guide: https://pytorch.org/get-started/locally/

3. Install Python dependencies:

uv pip install -r requirements.txt

This installs all required dependencies including ms-swift.

Usage

Evaluate Your Own T2I Model (`judge.py`)

Run Judge Inference

python judge.py \
--input your_data.jsonl \
--model Qwen/Qwen-Image-Bench

CLI Options

| Argument | Default | Description |
|----------|---------|-------------|
| --input | *(required)* | Input CSV/JSON/JSONL with ID, prompt, image_path |
| --model | *(required)* | HuggingFace model ID or local path |
| --hf-bench-repo | — | HF dataset repo for bench metadata |
| --local-metadata | — | Local metadata file path (overrides default) |
| --max-batch-size | 24 | ms-swift PtEngine max_batch_size |
| --max-new-tokens | 4096 | Max generation tokens |
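When the judge is driven from another Python script, the options above can be assembled into an argument list. The sketch below is one way to do that; the function name `build_judge_command` is hypothetical, and it uses only flags that appear in the CLI table. It builds the command without running it.

```python
import sys


def build_judge_command(input_path, model, max_batch_size=None, max_new_tokens=None):
    """Return an argv list for judge.py using flags from the README's CLI table.

    Hypothetical helper: optional flags are added only when a value is given,
    so judge.py falls back to its documented defaults (24 and 4096) otherwise.
    """
    cmd = [sys.executable, "judge.py", "--input", str(input_path), "--model", str(model)]
    if max_batch_size is not None:
        cmd += ["--max-batch-size", str(max_batch_size)]
    if max_new_tokens is not None:
        cmd += ["--max-new-tokens", str(max_new_tokens)]
    return cmd
```

Assuming the working directory is the cloned repository, the list can be handed to `subprocess.run(cmd, check=True)`; a smaller `max_batch_size` is the obvious knob to try first if GPU memory is tight.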

Output Files

After running judge.py, three files are written next to your input:

| File | Contents |
|------|----------|
| _judged.{jsonl,csv} | Per-row results: original fields + judge_model_output (combined raw scores JSON) + _judge_output (raw judge text per L1 dimension) |
| _bench_scores.json | Aggregated scores: level1, level2, total |
| _bench_scores.xlsx | Same scores in Excel: Level-1 Summary sheet + one sheet per L1 dimension with L2 detail |
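The aggregated JSON file can be inspected with a few lines of Python. The sketch below assumes that `level1`, `level2`, and `total` are top-level keys of the JSON object; the README names these three sections but does not show the exact schema, so the helper (a hypothetical `load_bench_scores`) reports any that are absent rather than failing.

```python
import json

# Section names taken from the README's output table; treating them as
# top-level keys is an assumption about the file layout.
EXPECTED_SECTIONS = ("level1", "level2", "total")


def load_bench_scores(path):
    """Load an aggregated scores file and report which expected sections are missing."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [name for name in EXPECTED_SECTIONS if name not in data]
    return data, missing
```

If `missing` comes back non-empty, the actual layout differs from this assumption and the file is best inspected by hand.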

Compute Scores from Pre-generated Responses (`compute_scores.py`)

# From local file
python compute_scores.py --input qwen_image_bench_hf_v0518.jsonl

# Or download from HuggingFace
python compute_scores.py --hf-repo Qwen/Qwen-Image-Bench

Output: scores_result.xlsx + scores_detail.json

Top-5 Models

| Model | Quality | Aesthetics | Alignment | Real-world Fidelity | Creative Generation | Overall |
|-------|:-------:|:----------:|:---------:|:-------------------:|:-------------------:|:-----------:|
| GPT Image 2 | 58.65 | 67.53 | 65.85 | 57.38 | 75.23 | 64.69 |
| Nano Banana 2.0 | 54.77 | 61.08 | 62.40 | 54.28 | 67.05 | 59.82 |
| GPT Image 1.5 | 55.14 | 60.88 | 61.72 | 53.95 | 66.35 | 59.65 |
| Nano Banana Pro | 55.67 | 60.26 | 61.25 | 54.07 | 66.23 | 59.45 |
| Qwen Image 2.0 Pro | 54.39 | 58.67 | 59.28 | 51.83 | 64.94 | 57.84 |

Full results for all 18 models are available in the paper.

Inference Parameters

The judge model uses fixed inference parameters for reproducibility:

| Parameter | Value |
|-----------|-------|
| seed | 42 |
| temperature | 0 |
| top_k | 1 |
| top_p | 1.0 |
| repetition_penalty | 1.05 |
| max_new_tokens | 4096 |
| enable_thinking | True |
| max_batch_size | 24 |

Project Structure

.
├── judge.py # Run judge model inference on new images
├── compute_scores.py # Compute scores from pre-generated responses
├── score_utils.py # Score extraction, mapping, correction, aggregation
├── checklists.py # Evaluation prompts and dimension definitions
├── backends/
│ └── ms_swift_backend.py # ms-swift inference engine
├── metadata/
│ └── bench_metadata.json # ID → dims_en metadata for judge inference
├── requirements.txt
└── assets/ # Figures for documentation

Evaluation Framework

The benchmark uses a 3-level hierarchical scoring system with 5 L1 dimensions, 23 L2 sub-capabilities, and 56 L3 facets:

| L1 Dimension | L2 Sub-capabilities |
|--------------|---------------------|
| Quality | Realism, Detail, Resolution |
| Aesthetics | Composition, Color Harmony, Lighting, Anatomical Portraiture, Emotional Expression, Style Control |
| Alignment | Attributes, Actions, Layout, Relations, Scene |
| Real-world Fidelity | Fairness, Safety & Compliance, World Knowledge |
| Creative Generation | Imagination, Feature Matching, Logical Resolution, Text Rendering, Design Applications, Visual Storytelling |

Scoring: Each L3 facet is rated 0 (Fail → 0), 1 (Pass → 60), or 2 (Excel → 100), with N/A excluded. Scores aggregate bottom-up: L3 → L2 → L1 → Overall.
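To make the scoring rule concrete, here is a minimal sketch of the rating-to-score mapping and one bottom-up step (L3 → L2 → L1). The README states the 0/60/100 mapping and that N/A is excluded; the use of an unweighted mean at each level is an assumption made for illustration, and the repository's `score_utils.py` (which also mentions a correction step not modeled here) may aggregate differently. The function names and sample ratings are hypothetical.

```python
from statistics import mean

# Mapping stated in the README: 0 (Fail) -> 0, 1 (Pass) -> 60, 2 (Excel) -> 100.
RATING_TO_SCORE = {0: 0.0, 1: 60.0, 2: 100.0}


def facet_score(rating):
    """Map an L3 rating to its score; None stands for N/A."""
    if rating is None:
        return None
    return RATING_TO_SCORE[rating]


def aggregate(values):
    """Unweighted mean over non-N/A values (ASSUMPTION: the README does not
    say how levels are combined). Returns None if every value is N/A."""
    kept = [v for v in values if v is not None]
    return mean(kept) if kept else None


def l1_score(l2_to_ratings):
    """Aggregate L3 ratings into L2 scores, then L2 scores into one L1 score."""
    l2_scores = {
        name: aggregate(facet_score(r) for r in ratings)
        for name, ratings in l2_to_ratings.items()
    }
    return aggregate(l2_scores.values()), l2_scores


if __name__ == "__main__":
    # Made-up ratings for the Quality dimension; None marks an N/A facet.
    quality, detail = l1_score(
        {"Realism": [2, 1], "Detail": [1, None], "Resolution": [0, 2]}
    )
    print(detail)   # per-L2 scores
    print(quality)  # L1 score under the unweighted-mean assumption
```

With these made-up ratings, Realism averages 100 and 60 to 80, Detail keeps only its single rated facet at 60, and Resolution averages 0 and 100 to 50, so the Quality score under this assumption is (80 + 60 + 50) / 3 ≈ 63.3.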

For the complete dimension hierarchy and detailed analysis, see the benchmark dataset card.

Citation

If you find this benchmark useful, please cite our paper:

@misc{li2026qwenimagebenchgenerationcreationtexttoimage,
title={Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation},
author={Niantong Li and Guangzheng Hu and Weixu Qiao and Ying Ba and Qichen Hong and Shijun Shen and Jinlin Wang and Fan Zhou and Jianye Kang and Xin Shang and Ziyi He and Wei Wang and Dalin Li and Jiahao Li and...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

New benchmark repo from Qwen; low stars.