GLM-5.2 — the open agentic-frontier play

Primary source ↗ · Synthesized by the onlylabs Content Studio agent (Claude Code) · web-verified

Charts

FrontierSWE (Dominance) — GLM-5.2 in the top tier (self-reported)
Claude Opus 4.8: 75.1 · GLM-5.2 (open): 74.4 · GPT-5.5: 72.6 · Gemini 3.1 Pro: 39.6 · GLM-5.1: 30.5 · DeepSeek-V4-Pro: 29
HLE with tools (self-reported) — where GLM-5.2 trails on broad knowledge
Claude Opus 4.8: 57.9 · GLM-5.2 (open): 54.7 · Qwen3.7-Max: 53.5 · GLM-5.1: 52.3 · GPT-5.5: 52.2 · Gemini 3.1 Pro: 51.4 · DeepSeek-V4-Pro: 48.2

An eval-founder read of zai-org/GLM-5.2 (Zhipu / GLM team, MIT, released 2026-06-16; tech report arXiv:2602.15763, "GLM-5: from Vibe Coding to Agentic Engineering"). Numbers are the card's own self-reported table. onlylabs signal.


0. What this is

GLM-5.2 benchmark chart (from the model card)

1. The benchmark profile (self-reported, cross-vendor)

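| Benchmark | GLM-5.2 | GLM-5.1 | DeepSeek-V4-Pro | Claude Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Reasoning | | | | | | |
| HLE | 40.5 | 31 | 37.7 | 49.8* | 41.4* | 45 |
| HLE (w/ tools) | 54.7 | 52.3 | 48.2 | 57.9* | 52.2* | 51.4* |
| AIME 2026 | 99.2 | 95.3 | 94.6 | 95.7 | 98.3 | 98.2 |
| GPQA-Diamond | 91.2 | 86.2 | 90.1 | 93.6 | 93.6 | 94.3 |
| Coding | | | | | | |
| SWE-bench Pro | 62.1 | 58.4 | 55.4 | 69.2 | 58.6 | 54.2 |
| Terminal-Bench 2.1 (Terminus-2) | 81.0 | 63.5 | 64 | 85 | 84 | 74 |
| FrontierSWE (Dominance) | 74.4 | 30.5 | 29.0 | 75.1 | 72.6 | 39.6 |
| Agentic | | | | | | |
| MCP-Atlas (public) | 76.8 | 71.8 | 73.6 | 77.8 | 75.3 | 69.2 |
| Tool-Decathlon | 48.2 | 40.7 | 52.8 | 59.9 | 55.6 | 48.8 |

Three rows do not split cleanly into six columns and are given in source order:
IMOAnswerBench (Reasoning): 91.0, 83.8, 89.8, 83.5, 81 (five values for six columns; the blank column is not recoverable)
DeepSWE (Coding): 46.2, followed by the fused run 188587010 (most plausibly 18, 8, 58, 70, 10)
SWE-Marathon (Coding): 13.0, 1.0, 26.0, 12.0, 4.0 (five values for six columns; the blank column is not recoverable)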
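To make the lead/trail reading concrete, here is a minimal Python sketch that computes GLM-5.2's rank and its gap to the best other model for each benchmark. It uses only the rows of the table above that split cleanly into six values, drops the footnote asterisks, and treats every column as directly comparable, which is a simplifying assumption. The names `SCORES`, `glm_position`, and `POSITIONS` are illustrative and not part of any published tooling.

```python
# Scores copied from the self-reported table above (footnote asterisks dropped).
# Column order: GLM-5.2, GLM-5.1, DeepSeek-V4-Pro, Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro.
# Rows whose values could not be separated unambiguously are left out.
SCORES = {
    "HLE": [40.5, 31, 37.7, 49.8, 41.4, 45],
    "HLE (w/ tools)": [54.7, 52.3, 48.2, 57.9, 52.2, 51.4],
    "AIME 2026": [99.2, 95.3, 94.6, 95.7, 98.3, 98.2],
    "GPQA-Diamond": [91.2, 86.2, 90.1, 93.6, 93.6, 94.3],
    "SWE-bench Pro": [62.1, 58.4, 55.4, 69.2, 58.6, 54.2],
    "Terminal-Bench 2.1 (Terminus-2)": [81.0, 63.5, 64, 85, 84, 74],
    "FrontierSWE (Dominance)": [74.4, 30.5, 29.0, 75.1, 72.6, 39.6],
    "MCP-Atlas (public)": [76.8, 71.8, 73.6, 77.8, 75.3, 69.2],
    "Tool-Decathlon": [48.2, 40.7, 52.8, 59.9, 55.6, 48.8],
}


def glm_position(row, target_index=0):
    """Return (rank, gap) for the target column.

    rank: 1-based position, counting only strictly higher scores as ahead.
    gap:  target score minus the best score among the other columns.
    """
    target = row[target_index]
    others = [s for i, s in enumerate(row) if i != target_index]
    rank = 1 + sum(1 for s in others if s > target)
    gap = round(target - max(others), 1)
    return rank, gap


POSITIONS = {name: glm_position(row) for name, row in SCORES.items()}

if __name__ == "__main__":
    # Print best ranks first so leads and deficits are easy to scan.
    for name, (rank, gap) in sorted(POSITIONS.items(), key=lambda kv: kv[1][0]):
        print(f"{name:<32} rank {rank}/6   gap to best other {gap:+.1f}")
```

On these rows GLM-5.2 ranks first only on AIME 2026, sits within about one point of the leader on FrontierSWE and MCP-Atlas, and shows its largest deficits on Tool-Decathlon and HLE. Because the columns carry different footnotes, the ranks are a reading aid rather than a leaderboard.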

2. Where it leads, where it trails


3. Comparability gotchas (read the footnotes)

Even a single self-published table is not apples-to-apples — GLM's own footnotes prove it:

So the lesson the /benchmarks page and the eval reports keep making holds even here: the harness is the benchmark. A bare "SWE score" without the scaffold, context window, judge, and subset is not a number you can rank.
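One way to make "the harness is the benchmark" operational is to refuse to rank any score that does not carry its harness metadata. The sketch below is a hypothetical data model, not an existing library: `HarnessSpec`, `ScoreRecord`, `comparable`, and `rankable` are invented names, and the context-window and subset values in the usage example are placeholders rather than figures from the model card.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HarnessSpec:
    scaffold: str             # agent scaffold, e.g. "Terminus-2"
    context_window: int       # tokens available to the model
    judge: Optional[str]      # grader model or script; None for exact-match scoring
    subset: str               # which slice of the benchmark was run


@dataclass(frozen=True)
class ScoreRecord:
    model: str
    benchmark: str
    score: float
    harness: Optional[HarnessSpec] = None  # None models a bare, unqualified score


def comparable(a: ScoreRecord, b: ScoreRecord) -> bool:
    """Two scores are comparable only if benchmark and full harness match."""
    if a.harness is None or b.harness is None:
        return False
    return a.benchmark == b.benchmark and a.harness == b.harness


def rankable(records: List[ScoreRecord]) -> List[ScoreRecord]:
    """Return records sorted best-first, or raise if any record is not comparable."""
    if not records:
        return []
    first = records[0]
    for other in records[1:]:
        if not comparable(first, other):
            raise ValueError(
                f"{other.model} on {other.benchmark} is not comparable to {first.model}"
            )
    return sorted(records, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    # Placeholder harness values, for illustration only.
    spec = HarnessSpec("Terminus-2", 200_000, None, "full")
    glm = ScoreRecord("GLM-5.2", "Terminal-Bench 2.1", 81.0, spec)
    opus = ScoreRecord("Claude Opus 4.8", "Terminal-Bench 2.1", 85.0, spec)
    print([r.model for r in rankable([glm, opus])])

    bare = ScoreRecord("GPT-5.5", "Terminal-Bench 2.1", 84.0)  # no harness attached
    try:
        rankable([glm, bare])
    except ValueError as err:
        print("refused:", err)
```

Strict equality on the whole harness is deliberately conservative: it rejects pairs that might be close enough in practice, but it never ranks two numbers produced under different scaffolds, judges, or subsets.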


4. What it means for an eval / RL-environments founder

1. Open-weight agentic frontier is real. An MIT-licensed model is in the Opus/GPT-5.5/Gemini tier on agentic SWE (FrontierSWE, Terminal-Bench, SWE-bench Pro). The open/closed gap on long-horizon coding has nearly closed; the remaining gap is broad knowledge (HLE/GPQA).

2. The agentic-coding eval stack is fragmenting fast: FrontierSWE, DeepSWE, SWE-Marathon, ProgramBench, PostTrainBench, Terminal-Bench, NL2Repo, MCP-Atlas, and Tool-Decathlon each have their own harness, and several are measured by independent third parties. That third-party-eval pattern (Proximal, Abundant AI) is itself the emerging business.

3. The defensible eval product is the vendor-neutral harness that runs these the same way across models, because, as GLM-5.2's own footnotes show, even a lab trying to be transparent can't make its columns comparable.