Benchmarks — who reports what

This page shows which models self-report which benchmarks, as parsed from their model cards. It covers 24 benchmarks, each reported by 3+ models, across 59 models.

Not a leaderboard. These are each lab’s own reported numbers, measured under its own harness, config, prompt, and date. Versions and metrics differ, so a higher number here does not mean a better model. Read them as “what each lab claims,” not a ranking.
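As a rough illustration of the "reported by 3+ models" cutoff, the sketch below groups self-reported rows by benchmark and keeps only benchmarks with at least three reporting models. The row layout, the function name `benchmarks_with_min_reporters`, and the `ExampleBench` row are assumptions made up for this example; the page does not describe how its own parser works. Sorting by score only mirrors the display order used here and is not a ranking.

```python
from collections import defaultdict

# Hypothetical input rows: (benchmark, reported score, model repo id).
# The MCP-Atlas and MMMU values are copied from this page; the
# ExampleBench row is invented to show a benchmark below the cutoff.
ROWS = [
    ("MCP-Atlas", 76.8, "zai-org/GLM-5.2"),
    ("MCP-Atlas", 71.8, "zai-org/GLM-5.1"),
    ("MCP-Atlas", 67.8, "zai-org/GLM-5"),
    ("MMMU", 78.11, "stepfun-ai/Step3-VL-10B-Base"),
    ("MMMU", 70.7, "meituan-longcat/LongCat-Flash-Omni"),
    ("MMMU", 68.5, "MiniMaxAI/MiniMax-VL-01"),
    ("ExampleBench", 50.0, "example-org/example-model"),
]


def benchmarks_with_min_reporters(rows, min_models=3):
    """Group rows by benchmark and keep benchmarks with >= min_models reporters."""
    by_benchmark = defaultdict(dict)
    for benchmark, score, model in rows:
        # One entry per model; a repeated model overwrites its earlier score.
        by_benchmark[benchmark][model] = score
    return {
        benchmark: sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        for benchmark, scores in by_benchmark.items()
        if len(scores) >= min_models
    }


result = benchmarks_with_min_reporters(ROWS)
for benchmark, entries in result.items():
    print(f"{benchmark} ({len(entries)} models)")
    for model, score in entries:
        print(f"  {score:g}  {model}")
```

Because each lab measures under its own setup, the per-benchmark ordering is best read as a listing of claims rather than a comparison.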

GPQA Diamond

48 models
91.2   Zhipu AI (GLM)              zai-org/GLM-5.2
86.2   Zhipu AI (GLM)              zai-org/GLM-5.1
86     Zhipu AI (GLM)              zai-org/GLM-5
85.7   Zhipu AI (GLM)              zai-org/GLM-4.7
85.2   MiniMax                     MiniMaxAI/MiniMax-M2.5
84.5   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Thinking
84.3%  Google (DeepMind / Gemini)  google/gemma-4-31B
83.7   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2-Flash
83     MiniMax                     MiniMaxAI/MiniMax-M2.1
82.3%  Google (DeepMind / Gemini)  google/gemma-4-26B-A4B

+38 more

MMLU-Pro

41 models
88     MiniMax                     MiniMaxAI/MiniMax-M2.1
85.2%  Google (DeepMind / Gemini)  google/gemma-4-31B
85     DeepSeek                    deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
85     DeepSeek                    deepseek-ai/DeepSeek-R1-0528
84.9   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2-Flash
84.6   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Thinking
84.3   Zhipu AI (GLM)              zai-org/GLM-4.7
84     DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
84     DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
84     DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

+31 more

LiveCodeBench

37 models
86.4   StepFun                     stepfun-ai/Step-3.5-Flash
84.9   Zhipu AI (GLM)              zai-org/GLM-4.7
83.1   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Thinking
83     MiniMax                     MiniMaxAI/MiniMax-M2
80.6   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2-Flash
80%    Google (DeepMind / Gemini)  google/gemma-4-31B
79.4   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Thinking
77.1%  Google (DeepMind / Gemini)  google/gemma-4-26B-A4B
73.3   DeepSeek                    deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
73.3   DeepSeek                    deepseek-ai/DeepSeek-R1-0528

+27 more

MATH

37 models
99.2   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Thinking
98.6   Sarvam AI                   sarvamai/sarvam-105b
97.6   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Omni
97.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
97.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
97.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
97.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Llama-70B
97.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Llama-8B
97.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
97.3   DeepSeek                    deepseek-ai/DeepSeek-R1

+27 more

AIME

35 models
99.2   Zhipu AI (GLM)              zai-org/GLM-5.2
99.2   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Thinking-ZigZag
97.3   StepFun                     stepfun-ai/Step-3.5-Flash
95.7   Zhipu AI (GLM)              zai-org/GLM-4.7
95.4   Zhipu AI (GLM)              zai-org/GLM-5.1
94.1   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2-Flash
92.7   Zhipu AI (GLM)              zai-org/GLM-5
91.6   Zhipu AI (GLM)              zai-org/GLM-4.7-Flash
91.4   DeepSeek                    deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
91.4   DeepSeek                    deepseek-ai/DeepSeek-R1-0528

+25 more

MMLU

33 models
94.4   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Thinking
93.4   DeepSeek                    deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
93.4   DeepSeek                    deepseek-ai/DeepSeek-R1-0528
92.9   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
92.9   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
92.9   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
92.9   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Llama-70B
92.9   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Llama-8B
92.9   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
92.9   DeepSeek                    deepseek-ai/DeepSeek-R1

+23 more

Humanity's Last Exam

20 models
54.7   Zhipu AI (GLM)              zai-org/GLM-5.2
52.3   Zhipu AI (GLM)              zai-org/GLM-5.1
50.4   Zhipu AI (GLM)              zai-org/GLM-5
42.8   Zhipu AI (GLM)              zai-org/GLM-4.7
31.8   MiniMax                     MiniMaxAI/MiniMax-M2
26.5%  Google (DeepMind / Gemini)  google/gemma-4-31B
25.8   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Thinking-ZigZag
25.2   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Thinking-2601
23.9   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Thinking
22.2   MiniMax                     MiniMaxAI/MiniMax-M2.1

+10 more

Codeforces

16 models
2,150  Google (DeepMind / Gemini)  google/gemma-4-31B
2,029  DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
2,029  DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
2,029  DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
2,029  DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Llama-70B
2,029  DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Llama-8B
2,029  DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
2,029  DeepSeek                    deepseek-ai/DeepSeek-R1
2,029  DeepSeek                    deepseek-ai/DeepSeek-R1-Zero
1,930  DeepSeek                    deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

+6 more

GSM8K

16 models
99.6   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2.5-Pro
99.6   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2.5-Pro-Base
95.37  Tencent Hunyuan             tencent/Hy3-preview
95.37  Tencent Hunyuan             tencent/Hy3-preview-Base
94.8   MiniMax                     MiniMaxAI/MiniMax-Text-01
93.93  InclusionAI (Ant Group)     inclusionAI/Ling-2.6-1T-base
92.3   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2-Flash
91.89  InclusionAI (Ant Group)     inclusionAI/Ling-2.6-flash-base
89.69  Upstage (Solar)             upstage/solar-pro-preview-instruct
89.3   DeepSeek                    deepseek-ai/DeepSeek-V3

+6 more

HumanEval

16 models
92.1   Sarvam AI                   sarvamai/sarvam-30b
90.85  Meituan (LongCat)           meituan-longcat/LongCat-Flash-Omni
88.41  Meituan (LongCat)           meituan-longcat/LongCat-Flash-Chat
86.9   MiniMax                     MiniMaxAI/MiniMax-Text-01
85.98  InclusionAI (Ant Group)     inclusionAI/Ling-2.6-1T-base
82.6   DeepSeek                    deepseek-ai/DeepSeek-V3
82.6   DeepSeek                    deepseek-ai/DeepSeek-V3-Base
81.1   InclusionAI (Ant Group)     inclusionAI/Ling-2.6-flash-base
81.1   StepFun                     stepfun-ai/Step-3.5-Flash-Base
81.1   StepFun                     stepfun-ai/Step-3.5-Flash-Base-Midtrain

+6 more

MBPP

16 models
92.7   Sarvam AI                   sarvamai/sarvam-30b
83.6   ByteDance (Doubao/Seed)     ByteDance-Seed/Stable-DiffCoder-8B-Base
80.16  Meituan (LongCat)           meituan-longcat/LongCat-Flash-Omni
79.63  Meituan (LongCat)           meituan-longcat/LongCat-Flash-Chat
79.4   StepFun                     stepfun-ai/Step-3.5-Flash-Base
79.4   StepFun                     stepfun-ai/Step-3.5-Flash-Base-Midtrain
78.71  Tencent Hunyuan             tencent/Hy3-preview
78.71  Tencent Hunyuan             tencent/Hy3-preview-Base
75.4   DeepSeek                    deepseek-ai/DeepSeek-V3
75.4   DeepSeek                    deepseek-ai/DeepSeek-V3-Base

+6 more

SWE-bench Verified

14 models
77.8   Zhipu AI (GLM)              zai-org/GLM-5
74.4   StepFun                     stepfun-ai/Step-3.5-Flash
74     MiniMax                     MiniMaxAI/MiniMax-M2.1
73.8   Zhipu AI (GLM)              zai-org/GLM-4.7
73.4   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2-Flash
71.3   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Thinking
70     Meituan (LongCat)           meituan-longcat/LongCat-Flash-Thinking-2601
69.4   MiniMax                     MiniMaxAI/MiniMax-M2
69.2   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Instruct-0905
59.2   Zhipu AI (GLM)              zai-org/GLM-4.7-Flash

+4 more

BrowseComp

13 models
79.3   Zhipu AI (GLM)              zai-org/GLM-5.1
75.9   Zhipu AI (GLM)              zai-org/GLM-5
73.7   StepFun                     stepfun-ai/Step-3.5-Flash
71.9   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Thinking-ZigZag
69     Meituan (LongCat)           meituan-longcat/LongCat-Flash-Thinking-2601
67.5   Zhipu AI (GLM)              zai-org/GLM-4.7
62.3   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Thinking
62     MiniMax                     MiniMaxAI/MiniMax-M2.1
58.3   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2-Flash
49.5   Sarvam AI                   sarvamai/sarvam-105b

+3 more

HMMT

12 models
98.4   StepFun                     stepfun-ai/Step-3.5-Flash
97.1   Zhipu AI (GLM)              zai-org/GLM-4.7
96.9   Zhipu AI (GLM)              zai-org/GLM-5.1
96.9   Zhipu AI (GLM)              zai-org/GLM-5
94.4   Zhipu AI (GLM)              zai-org/GLM-5.2
93.5   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Thinking-ZigZag
93.4   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Thinking-2601
85.8   Sarvam AI                   sarvamai/sarvam-105b
84.4   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2-Flash
79.4   DeepSeek                    deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

+2 more

Terminal-Bench

12 models
82.7   Zhipu AI (GLM)              zai-org/GLM-5.2
69     Zhipu AI (GLM)              zai-org/GLM-5.1
56.2   Zhipu AI (GLM)              zai-org/GLM-5
51     StepFun                     stepfun-ai/Step-3.5-Flash
47.9   MiniMax                     MiniMaxAI/MiniMax-M2.1
47.1   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Thinking
46.3   MiniMax                     MiniMaxAI/MiniMax-M2
44.5   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Instruct-0905
41     Zhipu AI (GLM)              zai-org/GLM-4.7
39.51  Meituan (LongCat)           meituan-longcat/LongCat-Flash-Chat

+2 more

Aider Polyglot

12 models
79.7   DeepSeek                    deepseek-ai/DeepSeek-V3
79.7   DeepSeek                    deepseek-ai/DeepSeek-V3-Base
71.6   DeepSeek                    deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
71.6   DeepSeek                    deepseek-ai/DeepSeek-R1-0528
53.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
53.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
53.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
53.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Llama-70B
53.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Llama-8B
53.3   DeepSeek                    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

+2 more

SWE-bench Multilingual

7 models
73.3   Zhipu AI (GLM)              zai-org/GLM-5
72.5   MiniMax                     MiniMaxAI/MiniMax-M2.1
71.7   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2-Flash
66.7   Zhipu AI (GLM)              zai-org/GLM-4.7
61.1   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Thinking
56.5   MiniMax                     MiniMaxAI/MiniMax-M2
55.9   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Instruct-0905

SWE-bench

6 models
60.4   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Chat
59.4   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Thinking
54.4   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Lite
35.7   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2.5-Pro
35.7   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2.5-Pro-Base
30.8   Xiaomi (MiMo)               XiaomiMiMo/MiMo-V2-Flash

MMMU-Pro

5 models
76.9%  Google (DeepMind / Gemini)  google/gemma-4-31B
73.8%  Google (DeepMind / Gemini)  google/gemma-4-26B-A4B
69.1%  Google (DeepMind / Gemini)  google/gemma-4-12B
54.3%  Google (DeepMind / Gemini)  google/diffusiongemma-26B-A4B-it
52.7   MiniMax                     MiniMaxAI/MiniMax-VL-01

ToolBench

4 models
48.2   Zhipu AI (GLM)              zai-org/GLM-5.2
43.5   MiniMax                     MiniMaxAI/MiniMax-M2.1
40.7   Zhipu AI (GLM)              zai-org/GLM-5.1
38     Zhipu AI (GLM)              zai-org/GLM-5

Multi-SWE-bench

4 models
49.4   MiniMax                     MiniMaxAI/MiniMax-M2.1
41.9   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Thinking
36.2   MiniMax                     MiniMaxAI/MiniMax-M2
33.5   Moonshot AI (Kimi)          moonshotai/Kimi-K2-Instruct-0905

tau2-bench

4 models
67.8   MiniMax                     MiniMaxAI/MiniMax-M1-40k
63.5   MiniMax                     MiniMaxAI/MiniMax-M1-80k
53.5   DeepSeek                    deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
53.5   DeepSeek                    deepseek-ai/DeepSeek-R1-0528

MCP-Atlas

3 models
76.8   Zhipu AI (GLM)              zai-org/GLM-5.2
71.8   Zhipu AI (GLM)              zai-org/GLM-5.1
67.8   Zhipu AI (GLM)              zai-org/GLM-5

MMMU

3 models
78.11  StepFun                     stepfun-ai/Step3-VL-10B-Base
70.7   Meituan (LongCat)           meituan-longcat/LongCat-Flash-Omni
68.5   MiniMax                     MiniMaxAI/MiniMax-VL-01