
GLM 5.2 (max) scores 70.1% on the WeirdML benchmark, averaging about 22,000 output tokens per run

Story Overview

GLM 5.2 from Z.AI posts a 70.1 percent score on the WeirdML benchmark in its max setting, edging past Gemini 3 Pro, a model from seven months earlier, while averaging about 22,000 output tokens per run. The lighter high setting reaches 67.3 percent at roughly 12,000 tokens, so the extra output buys a modest lift of about three points (2.8). The model rolled out first to Coding Plan subscribers before MIT-licensed weights and API access followed.


Original post

Håvard Ihle@htihle

GLM 5.2 (max) scores 70.1% on WeirdML, narrowly beating Gemini 3 Pro, from 7 months ago.

It uses ~22k output tokens on average, compared to ~12k for the (high) setting. This gives a fairly clear but modest increase (3%) in score, showing that results scale with output tokens.

Runs without thinking are under way.

Håvard Ihle@htihle

GLM 5.2 (high) scores 67.3% on WeirdML, a score between Opus 4.5 and Gemini 3 Pro.

This is a much higher score than I expected, and GLM 5.2 max (still running) could score even better.

It looks like a very solid model!

2:38 AM · Jun 19, 2026 · 8K Views

Token Tradeoff

Extra tokens buy limited headroom on this test

WeirdML tests reasoning on novel ML tasks and iterative coding under tight constraints, so the jump from 12k to 22k output tokens shows how much extended output helps before returns saturate. Here the gain is modest: roughly 10,000 extra tokens per run bought 2.8 percentage points. The benchmark is listed on Epoch AI’s hub, and most frontier models still score in the high-50s to low-70s percent range.
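The tradeoff can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the original post (67.3% at about 12k tokens, 70.1% at about 22k tokens); the token counts are approximate averages, and the helper name `marginal_gain` is hypothetical, introduced here for illustration.

```python
def marginal_gain(score_lo, tokens_lo, score_hi, tokens_hi):
    """Return (point gain, extra tokens, points gained per 1k extra tokens)."""
    gain = score_hi - score_lo          # difference in percentage points
    extra = tokens_hi - tokens_lo       # additional output tokens per run
    return gain, extra, gain / (extra / 1000)


# Figures from the post: (high) = 67.3% at ~12k tokens, (max) = 70.1% at ~22k tokens.
gain, extra, per_1k = marginal_gain(67.3, 12_000, 70.1, 22_000)
print(f"{gain:.1f} points for {extra:,} extra tokens ({per_1k:.2f} points per 1k tokens)")
# prints "2.8 points for 10,000 extra tokens (0.28 points per 1k tokens)"
```

On these numbers, the max setting uses about 83% more output tokens than the high setting for a 2.8-point gain, which matches the "fairly clear but modest" description in the post.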

Open Weights

Weights land under MIT with few strings attached

After the initial subscriber window, the full model became available under a permissive MIT license, removing the usage or regional restrictions that often accompany new releases. Exact pricing, broader rollout timing, and platform integrations remain unspecified in current reports.

Sentiment

Users praise GLM 5.2's leading WeirdML benchmark score as a huge open-weights achievement that exceeds expectations and rivals top models like Opus 4.5.

Positive: 100.0% · Negative: 0.0% (3 comments with sentiment)


Posts from X


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

off by 0.1%. I'm getting rusty. Yes, 7 months of a gap. Of course GLM 5.2 is a better model than Gemini 3 Pro in most real use cases. WeirdML is a benchmark which profoundly separates Western and Eastern training culture… or used to.



Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Havard was off by 10%. On the other hand, I was off by like 15% for DeepSeek-V4, and in the other direction. GLM genuinely outperformed our expectations. It's not all about compute, at this stage…

Håvard Ihle@htihle

@teortaxesTex Any score in this range would shock me, I expect incremental improvements, perhaps 60-62%.


Håvard Ihle@htihle

@teortaxesTex Yea, I updated too much on DeepSeek-v4 and Kimi-k2.7.


Ankith 🐋/acc@dhtikna

@htihle @teortaxesTex I don't trust 3rd-party service providers to serve V4 accurately :/


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

off by 0.1%


Wilkins Micawber@Me5466255992308

@htihle appreciate your effort on this


Wilkins Micawber@Me5466255992308

@htihle Super interesting if you can provide code evaluations like you did earlier with Opus and Codex


The Nurse Engineer🇳🇬@boochi_dot_dev

@htihle All in all, we can now conclude we have an Opus 4.5 level LLM as open weights…huge achievement 🔥.

I am yet to see any benchmark where Opus 4.5 beats it


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@dhtikna @htihle nah it's about right


Aether Oracle@aether_oracle

@htihle Which model was 5.1 on par with and how far behind was it at the time?


Aether Oracle@aether_oracle

@htihle Looks like it was behind GPT 5 so over 8 months behind?


AiDevCraft@AiDevCraft

The 22k vs 12k gap for +3% reads more like the model burning tokens on candidate enumeration than a deeper search — WeirdML rewards verbose try-many behavior, so the lift might not transfer to evals where output tokens are the cost line. The no-think run is the real test: if it lands closer to the 12k score, the "thinking" is mostly dithering.
