Teortaxes says GLM 5.2 approaches Claude Opus on hard evaluations but fails simple tasks Claude 3.5 Sonnet easily handles

GLM 5.2 posted a nonzero score on a benchmark where GLM 5.1 scored 0.0%.

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

ok, here's how GLM 5.2 performs on a bench it definitely didn't see, and where GLM 5.1 scored 0.0%. Closer to Opus 4.8 than Sonnet 4.6. I hope their confidence seems more credible now.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

GLM 5.1 scores zero btw. There's no way they benchmaxed this thing directly. We shall see how 5.2 performs. I'd be surprised if it landed below MiniMax.

12:03 PM · Jun 18, 2026 · 39.4K Views

Sentiment

Positive users praise GLM 5.2 for its strong unseen benchmark scores nearing Claude Opus and well-rounded capabilities like legal agentic tasks, while one negative reply mocks the high cost for only a tiny performance edge.

Positive: 80.0%

Negative: 20.0%

5 comments with sentiment.

Posts from X

Most Activity

Views: 3.4K · Bookmarks: 3 · Likes: 39 · Retweets: 1 · Replies: 5

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

Interesting that in "GLM 5.2 is on par with X model", X is so widely distributed over different capabilities, and it's not like the gap is just greater for the difficult ones. It plainly can't do some parlor tricks Sonnet 3.5 pulled off. In some hard model evals, it's around 4.8.

2h

Cameron @groundruled

@teortaxesTex GLM 5.2 is such a crazy leap. MiMo V2.5 is the perfect pair for it to do the grunt work and vision while GLM 5.2 does the hard stuff

8h

Thomas Ip @_thomasip

@teortaxesTex @yacineMTB they cooked hard sir. first chinese model that is actually well rounded and not only coding/math/agent focused?

9h

algorithon @Algorithon

@teortaxesTex The cost difference between GLM 5.2 & Opus 4.8 is insane for the minuscule performance gap on this benchmark lmao

12h

Ros Thain @ThainRos

@teortaxesTex Double gpt 5.5 ... Wow

12h

Yusuf Gürdoğan @YusufGurdogan

@teortaxesTex link: https://www.vals.ai/benchmarks/hlab it appears to be a useful benchmark. matches my feelingsbench

11h

Apestein @apestein_dev

@teortaxesTex Weird bench, why is gpt-5.5 so low

9h

DanielW @dddanielwang

@teortaxesTex Very very interesting, thought it was only good at coding. But also very good legal agentic capability

6h

GATE85 · Winnie · BD直开 @taku61p0n

@teortaxesTex This is what real value means: it's not just the total score, it's how much of the role the model actually carries

2h

Hispanophile 🇹🇷🇪🇸 @n_hispanophile

@groundruled @teortaxesTex Game by GLM 5.2 Max (Deep Think) after a lot of iterations.

7h

tsg @wokerenaissance

@teortaxesTex It's basically sonnet 4.8

3h

Amir Gulubayli @AmirGulubayli

@teortaxesTex glm 5.2 on a truly unseen bench is impressive if it holds. but "closer to opus" is a big claim - what's the actual score difference lookin like?

12h