Benchmarks — who reports what
Which models self-report which benchmarks, parsed from their model cards. 24 benchmarks reported by 3+ models, across 59 models.
Not a leaderboard. These are each lab’s own reported numbers, measured under its own harness, config, prompt, and date. Versions and metrics differ — a higher number here does not mean a better model. Read them as “what each lab claims,” not a ranking.
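Because the reported values arrive in mixed formats (plain numbers, values with a trailing `%`, ratings with thousands separators such as `2,150`), anyone reusing these rows has to normalize them first. The following is a minimal sketch of one way to do that and to apply the "reported by 3+ models" filter. It is not the parser used to build this page; the names `Row`, `parse_score`, `parse_row`, and `widely_reported` are hypothetical, and dropping the `%` sign without converting units is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class Row(NamedTuple):
    benchmark: str
    score: float
    lab: str
    model: str


def parse_score(text: str) -> float:
    """Turn '84.3%', '2,150' or '86' into a float.

    Assumption: the '%' sign is dropped rather than converted, so
    '84.3%' and '84.3' end up as the same number.
    """
    return float(text.strip().rstrip("%").replace(",", ""))


def parse_row(benchmark: str, line: str) -> Optional[Row]:
    """Parse one '| score | lab | model |' line; return None for anything else."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    if len(cells) != 3:
        return None
    try:
        score = parse_score(cells[0])
    except ValueError:
        return None  # header rows, separators, and other non-numeric lines
    return Row(benchmark, score, cells[1], cells[2])


def widely_reported(rows: Iterable[Row], min_models: int = 3) -> Dict[str, List[Row]]:
    """Keep only benchmarks reported by at least `min_models` distinct models."""
    by_bench: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        by_bench[row.benchmark].append(row)
    return {
        bench: items
        for bench, items in by_bench.items()
        if len({r.model for r in items}) >= min_models
    }


if __name__ == "__main__":
    # Deliberately truncated sample: all three MMMU rows, but only two ToolBench rows.
    sample = [
        ("MMMU", "| 78.11 | StepFun | stepfun-ai/Step3-VL-10B-Base |"),
        ("MMMU", "| 70.7 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Omni |"),
        ("MMMU", "| 68.5 | MiniMax | MiniMaxAI/MiniMax-VL-01 |"),
        ("ToolBench", "| 48.2 | Zhipu AI (GLM) | zai-org/GLM-5.2 |"),
        ("ToolBench", "| 43.5 | MiniMax | MiniMaxAI/MiniMax-M2.1 |"),
    ]
    rows = [r for r in (parse_row(b, line) for b, line in sample) if r is not None]
    print(sorted(widely_reported(rows)))
```

Even after normalization the numbers stay non-comparable across labs, for the reasons given above; the sketch only makes the rows machine-readable.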
GPQA Diamond
48 models

| Score | Lab | Model |
| --- | --- | --- |
| 91.2 | Zhipu AI (GLM) | zai-org/GLM-5.2 |
| 86.2 | Zhipu AI (GLM) | zai-org/GLM-5.1 |
| 86 | Zhipu AI (GLM) | zai-org/GLM-5 |
| 85.7 | Zhipu AI (GLM) | zai-org/GLM-4.7 |
| 85.2 | MiniMax | MiniMaxAI/MiniMax-M2.5 |
| 84.5 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Thinking |
| 84.3% | Google (DeepMind / Gemini) | google/gemma-4-31B |
| 83.7 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2-Flash |
| 83 | MiniMax | MiniMaxAI/MiniMax-M2.1 |
| 82.3% | Google (DeepMind / Gemini) | google/gemma-4-26B-A4B |
+38 more
MMLU-Pro
41 models

| Score | Lab | Model |
| --- | --- | --- |
| 88 | MiniMax | MiniMaxAI/MiniMax-M2.1 |
| 85.2% | Google (DeepMind / Gemini) | google/gemma-4-31B |
| 85 | DeepSeek | deepseek-ai/DeepSeek-R1-0528-Qwen3-8B |
| 85 | DeepSeek | deepseek-ai/DeepSeek-R1-0528 |
| 84.9 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2-Flash |
| 84.6 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Thinking |
| 84.3 | Zhipu AI (GLM) | zai-org/GLM-4.7 |
| 84 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |
| 84 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |
| 84 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
+31 more
LiveCodeBench
37 models

| Score | Lab | Model |
| --- | --- | --- |
| 86.4 | StepFun | stepfun-ai/Step-3.5-Flash |
| 84.9 | Zhipu AI (GLM) | zai-org/GLM-4.7 |
| 83.1 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Thinking |
| 83 | MiniMax | MiniMaxAI/MiniMax-M2 |
| 80.6 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2-Flash |
| 80% | Google (DeepMind / Gemini) | google/gemma-4-31B |
| 79.4 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Thinking |
| 77.1% | Google (DeepMind / Gemini) | google/gemma-4-26B-A4B |
| 73.3 | DeepSeek | deepseek-ai/DeepSeek-R1-0528-Qwen3-8B |
| 73.3 | DeepSeek | deepseek-ai/DeepSeek-R1-0528 |
+27 more
MATH
37 models

| Score | Lab | Model |
| --- | --- | --- |
| 99.2 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Thinking |
| 98.6 | Sarvam AI | sarvamai/sarvam-105b |
| 97.6 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Omni |
| 97.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |
| 97.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |
| 97.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| 97.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Llama-70B |
| 97.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
| 97.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| 97.3 | DeepSeek | deepseek-ai/DeepSeek-R1 |
+27 more
AIME
35 models

| Score | Lab | Model |
| --- | --- | --- |
| 99.2 | Zhipu AI (GLM) | zai-org/GLM-5.2 |
| 99.2 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Thinking-ZigZag |
| 97.3 | StepFun | stepfun-ai/Step-3.5-Flash |
| 95.7 | Zhipu AI (GLM) | zai-org/GLM-4.7 |
| 95.4 | Zhipu AI (GLM) | zai-org/GLM-5.1 |
| 94.1 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2-Flash |
| 92.7 | Zhipu AI (GLM) | zai-org/GLM-5 |
| 91.6 | Zhipu AI (GLM) | zai-org/GLM-4.7-Flash |
| 91.4 | DeepSeek | deepseek-ai/DeepSeek-R1-0528-Qwen3-8B |
| 91.4 | DeepSeek | deepseek-ai/DeepSeek-R1-0528 |
+25 more
MMLU
33 models

| Score | Lab | Model |
| --- | --- | --- |
| 94.4 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Thinking |
| 93.4 | DeepSeek | deepseek-ai/DeepSeek-R1-0528-Qwen3-8B |
| 93.4 | DeepSeek | deepseek-ai/DeepSeek-R1-0528 |
| 92.9 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |
| 92.9 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |
| 92.9 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| 92.9 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Llama-70B |
| 92.9 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
| 92.9 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| 92.9 | DeepSeek | deepseek-ai/DeepSeek-R1 |
+23 more
Humanity's Last Exam
20 models

| Score | Lab | Model |
| --- | --- | --- |
| 54.7 | Zhipu AI (GLM) | zai-org/GLM-5.2 |
| 52.3 | Zhipu AI (GLM) | zai-org/GLM-5.1 |
| 50.4 | Zhipu AI (GLM) | zai-org/GLM-5 |
| 42.8 | Zhipu AI (GLM) | zai-org/GLM-4.7 |
| 31.8 | MiniMax | MiniMaxAI/MiniMax-M2 |
| 26.5% | Google (DeepMind / Gemini) | google/gemma-4-31B |
| 25.8 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Thinking-ZigZag |
| 25.2 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Thinking-2601 |
| 23.9 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Thinking |
| 22.2 | MiniMax | MiniMaxAI/MiniMax-M2.1 |
+10 more
Codeforces
16 models

| Score | Lab | Model |
| --- | --- | --- |
| 2,150 | Google (DeepMind / Gemini) | google/gemma-4-31B |
| 2,029 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |
| 2,029 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |
| 2,029 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| 2,029 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Llama-70B |
| 2,029 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
| 2,029 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| 2,029 | DeepSeek | deepseek-ai/DeepSeek-R1 |
| 2,029 | DeepSeek | deepseek-ai/DeepSeek-R1-Zero |
| 1,930 | DeepSeek | deepseek-ai/DeepSeek-R1-0528-Qwen3-8B |
+6 more
GSM8K
16 models

| Score | Lab | Model |
| --- | --- | --- |
| 99.6 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2.5-Pro |
| 99.6 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2.5-Pro-Base |
| 95.37 | Tencent Hunyuan | tencent/Hy3-preview |
| 95.37 | Tencent Hunyuan | tencent/Hy3-preview-Base |
| 94.8 | MiniMax | MiniMaxAI/MiniMax-Text-01 |
| 93.93 | InclusionAI (Ant Group) | inclusionAI/Ling-2.6-1T-base |
| 92.3 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2-Flash |
| 91.89 | InclusionAI (Ant Group) | inclusionAI/Ling-2.6-flash-base |
| 89.69 | Upstage (Solar) | upstage/solar-pro-preview-instruct |
| 89.3 | DeepSeek | deepseek-ai/DeepSeek-V3 |
+6 more
HumanEval
16 models

| Score | Lab | Model |
| --- | --- | --- |
| 92.1 | Sarvam AI | sarvamai/sarvam-30b |
| 90.85 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Omni |
| 88.41 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Chat |
| 86.9 | MiniMax | MiniMaxAI/MiniMax-Text-01 |
| 85.98 | InclusionAI (Ant Group) | inclusionAI/Ling-2.6-1T-base |
| 82.6 | DeepSeek | deepseek-ai/DeepSeek-V3 |
| 82.6 | DeepSeek | deepseek-ai/DeepSeek-V3-Base |
| 81.1 | InclusionAI (Ant Group) | inclusionAI/Ling-2.6-flash-base |
| 81.1 | StepFun | stepfun-ai/Step-3.5-Flash-Base |
| 81.1 | StepFun | stepfun-ai/Step-3.5-Flash-Base-Midtrain |
+6 more
MBPP
16 models

| Score | Lab | Model |
| --- | --- | --- |
| 92.7 | Sarvam AI | sarvamai/sarvam-30b |
| 83.6 | ByteDance (Doubao/Seed) | ByteDance-Seed/Stable-DiffCoder-8B-Base |
| 80.16 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Omni |
| 79.63 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Chat |
| 79.4 | StepFun | stepfun-ai/Step-3.5-Flash-Base |
| 79.4 | StepFun | stepfun-ai/Step-3.5-Flash-Base-Midtrain |
| 78.71 | Tencent Hunyuan | tencent/Hy3-preview |
| 78.71 | Tencent Hunyuan | tencent/Hy3-preview-Base |
| 75.4 | DeepSeek | deepseek-ai/DeepSeek-V3 |
| 75.4 | DeepSeek | deepseek-ai/DeepSeek-V3-Base |
+6 more
SWE-bench Verified
14 models

| Score | Lab | Model |
| --- | --- | --- |
| 77.8 | Zhipu AI (GLM) | zai-org/GLM-5 |
| 74.4 | StepFun | stepfun-ai/Step-3.5-Flash |
| 74 | MiniMax | MiniMaxAI/MiniMax-M2.1 |
| 73.8 | Zhipu AI (GLM) | zai-org/GLM-4.7 |
| 73.4 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2-Flash |
| 71.3 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Thinking |
| 70 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Thinking-2601 |
| 69.4 | MiniMax | MiniMaxAI/MiniMax-M2 |
| 69.2 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Instruct-0905 |
| 59.2 | Zhipu AI (GLM) | zai-org/GLM-4.7-Flash |
+4 more
BrowseComp
13 models

| Score | Lab | Model |
| --- | --- | --- |
| 79.3 | Zhipu AI (GLM) | zai-org/GLM-5.1 |
| 75.9 | Zhipu AI (GLM) | zai-org/GLM-5 |
| 73.7 | StepFun | stepfun-ai/Step-3.5-Flash |
| 71.9 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Thinking-ZigZag |
| 69 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Thinking-2601 |
| 67.5 | Zhipu AI (GLM) | zai-org/GLM-4.7 |
| 62.3 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Thinking |
| 62 | MiniMax | MiniMaxAI/MiniMax-M2.1 |
| 58.3 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2-Flash |
| 49.5 | Sarvam AI | sarvamai/sarvam-105b |
+3 more
HMMT
12 models

| Score | Lab | Model |
| --- | --- | --- |
| 98.4 | StepFun | stepfun-ai/Step-3.5-Flash |
| 97.1 | Zhipu AI (GLM) | zai-org/GLM-4.7 |
| 96.9 | Zhipu AI (GLM) | zai-org/GLM-5.1 |
| 96.9 | Zhipu AI (GLM) | zai-org/GLM-5 |
| 94.4 | Zhipu AI (GLM) | zai-org/GLM-5.2 |
| 93.5 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Thinking-ZigZag |
| 93.4 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Thinking-2601 |
| 85.8 | Sarvam AI | sarvamai/sarvam-105b |
| 84.4 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2-Flash |
| 79.4 | DeepSeek | deepseek-ai/DeepSeek-R1-0528-Qwen3-8B |
+2 more
Terminal-Bench
12 models

| Score | Lab | Model |
| --- | --- | --- |
| 82.7 | Zhipu AI (GLM) | zai-org/GLM-5.2 |
| 69 | Zhipu AI (GLM) | zai-org/GLM-5.1 |
| 56.2 | Zhipu AI (GLM) | zai-org/GLM-5 |
| 51 | StepFun | stepfun-ai/Step-3.5-Flash |
| 47.9 | MiniMax | MiniMaxAI/MiniMax-M2.1 |
| 47.1 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Thinking |
| 46.3 | MiniMax | MiniMaxAI/MiniMax-M2 |
| 44.5 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Instruct-0905 |
| 41 | Zhipu AI (GLM) | zai-org/GLM-4.7 |
| 39.51 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Chat |
+2 more
Aider Polyglot
12 models

| Score | Lab | Model |
| --- | --- | --- |
| 79.7 | DeepSeek | deepseek-ai/DeepSeek-V3 |
| 79.7 | DeepSeek | deepseek-ai/DeepSeek-V3-Base |
| 71.6 | DeepSeek | deepseek-ai/DeepSeek-R1-0528-Qwen3-8B |
| 71.6 | DeepSeek | deepseek-ai/DeepSeek-R1-0528 |
| 53.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |
| 53.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |
| 53.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| 53.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Llama-70B |
| 53.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
| 53.3 | DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
+2 more
SWE-bench Multilingual
7 models

| Score | Lab | Model |
| --- | --- | --- |
| 73.3 | Zhipu AI (GLM) | zai-org/GLM-5 |
| 72.5 | MiniMax | MiniMaxAI/MiniMax-M2.1 |
| 71.7 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2-Flash |
| 66.7 | Zhipu AI (GLM) | zai-org/GLM-4.7 |
| 61.1 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Thinking |
| 56.5 | MiniMax | MiniMaxAI/MiniMax-M2 |
| 55.9 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Instruct-0905 |
SWE-bench
6 models

| Score | Lab | Model |
| --- | --- | --- |
| 60.4 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Chat |
| 59.4 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Thinking |
| 54.4 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Lite |
| 35.7 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2.5-Pro |
| 35.7 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2.5-Pro-Base |
| 30.8 | Xiaomi (MiMo) | XiaomiMiMo/MiMo-V2-Flash |
MMMU-Pro
5 models

| Score | Lab | Model |
| --- | --- | --- |
| 76.9% | Google (DeepMind / Gemini) | google/gemma-4-31B |
| 73.8% | Google (DeepMind / Gemini) | google/gemma-4-26B-A4B |
| 69.1% | Google (DeepMind / Gemini) | google/gemma-4-12B |
| 54.3% | Google (DeepMind / Gemini) | google/diffusiongemma-26B-A4B-it |
| 52.7 | MiniMax | MiniMaxAI/MiniMax-VL-01 |
ToolBench
4 models

| Score | Lab | Model |
| --- | --- | --- |
| 48.2 | Zhipu AI (GLM) | zai-org/GLM-5.2 |
| 43.5 | MiniMax | MiniMaxAI/MiniMax-M2.1 |
| 40.7 | Zhipu AI (GLM) | zai-org/GLM-5.1 |
| 38 | Zhipu AI (GLM) | zai-org/GLM-5 |
Multi-SWE-bench
4 models

| Score | Lab | Model |
| --- | --- | --- |
| 49.4 | MiniMax | MiniMaxAI/MiniMax-M2.1 |
| 41.9 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Thinking |
| 36.2 | MiniMax | MiniMaxAI/MiniMax-M2 |
| 33.5 | Moonshot AI (Kimi) | moonshotai/Kimi-K2-Instruct-0905 |
tau2-bench
4 models

| Score | Lab | Model |
| --- | --- | --- |
| 67.8 | MiniMax | MiniMaxAI/MiniMax-M1-40k |
| 63.5 | MiniMax | MiniMaxAI/MiniMax-M1-80k |
| 53.5 | DeepSeek | deepseek-ai/DeepSeek-R1-0528-Qwen3-8B |
| 53.5 | DeepSeek | deepseek-ai/DeepSeek-R1-0528 |
MCP-Atlas
3 models

| Score | Lab | Model |
| --- | --- | --- |
| 76.8 | Zhipu AI (GLM) | zai-org/GLM-5.2 |
| 71.8 | Zhipu AI (GLM) | zai-org/GLM-5.1 |
| 67.8 | Zhipu AI (GLM) | zai-org/GLM-5 |
MMMU
3 models

| Score | Lab | Model |
| --- | --- | --- |
| 78.11 | StepFun | stepfun-ai/Step3-VL-10B-Base |
| 70.7 | Meituan (LongCat) | meituan-longcat/LongCat-Flash-Omni |
| 68.5 | MiniMax | MiniMaxAI/MiniMax-VL-01 |