Prime Intellect's Florian Brand says GLM 5.2 shows only a 1.9% mathematical improvement over GLM 5.1 on Matharena

Story Overview

Prime Intellect research engineer Florian Brand highlighted GLM 5.2’s slim 1.9 percent expected-performance lift on the Matharena benchmark versus its predecessor, underscoring how Zhipu AI’s latest open-weight release emphasizes long-horizon coding and agent workflows over pure mathematical reasoning gains.

Original post

Kári Rögnvaldsson@karirogg

We just evaluated GLM 5.2 on Matharena!

Although GLM 5.2 has shown to be very good at coding, the improvement is not as drastic for math. GLM 5.2 beats GLM 5.1, its predecessor by only 1.9% in expected performance.

2:13 AM · Jun 21, 2026 · 7.5K Views

Benchmark Nuance

Coding emphasis shows up clearly

GLM 5.2’s documented strengths are its multi-effort programming modes and sustained execution tasks; observers read the math gain as a smaller byproduct rather than the headline capability.

FYI

Fresh problems keep the signal honest

Matharena’s use of recent contest problems and item-response modeling reduces contamination risk, so the modest delta stands as one of the cleaner current reads on where reasoning advances are actually landing.
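
As a rough illustration of what an item-response model computes, here is a minimal sketch of the one-parameter (Rasch) form. This is an assumption chosen for illustration: the story does not say which model Matharena fits or how it defines "expected performance", and the names p_correct, expected_performance, theta and difficulties are hypothetical.

```python
import math


def p_correct(theta: float, difficulty: float) -> float:
    """Rasch (1PL) probability that a model with ability `theta`
    solves an item of the given difficulty (logistic in theta - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def expected_performance(theta: float, difficulties: list[float]) -> float:
    """Mean solve probability over a fixed set of items.

    Assumption: "expected performance" is treated here as the average
    per-item success probability; the real benchmark may define it differently.
    """
    return sum(p_correct(theta, b) for b in difficulties) / len(difficulties)


if __name__ == "__main__":
    # Hypothetical item difficulties and abilities, not real benchmark values.
    items = [-1.0, 0.0, 1.0, 2.0]
    for theta in (0.5, 0.6):
        print(theta, round(expected_performance(theta, items), 4))
```

Under a model of this kind, two releases are compared on the same item set, so a small gap in expected performance reflects a small gap in estimated ability rather than a difference in which problems each model happened to see.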

Sentiment

Positive commenters call GLM 5.2's ~56% drop in expected cost impressive, while negative commenters argue that its heavy coding focus hurts performance on logic tasks.

Pos 50.0% · Neg 50.0% · 2 comments with sentiment.

Posts from X

Most Activity

Views 4.5K · Bookmarks 5 · Likes 46 · Replies 5

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Yeah, as I said. This is an SWE-first release. Gains on math are inconsistent, sometimes negative. And on the other hand, I think DSV4.1 will show scary gains on math. They're good at it, and it's what GRPO was built for.

2h · 4.5K views · 46 likes · 5 replies

Kári Rögnvaldsson@karirogg

GLM 5.2 scores very well on Apex 2025 (a +22% improvement over GLM 5.1) and makes good improvements on ArxivMath, but it underperforms GLM 5.1 by 11.4% on Apex Shortlist, which is unexpected.

3h1153

Kári Rögnvaldsson@karirogg

The full results can be found on Matharena as usual: http://matharena.ai

3h821

evrazian_schizo@rationaleist

@teortaxesTex The cost is weird. I wouldn't call it more terse.

2h231

Noah James@NoahB1904

@karirogg is it max or high

2h211

Offset Zero@offsetx0

@karirogg GLM 5.1 used 1.2 to 2x more tokens here. On max settings, GLM 5.2 thinks a lot, so it's very likely the setting used was medium, or not even high, it is very important for Math

2h33

Endre Stølsvik@stolsvik

@karirogg How is it possible to show such results without stating what thinking level it was performed with??

2h28

Offset Zero@offsetx0

@karirogg I think it is not on max mode, very low thinking tokens

2h28

GoForceX @ Wuhan@GoForceX

@teortaxesTex it's also impressive that the expected cost has dropped by ~56%

2h22

Offset Zero@offsetx0

@teortaxesTex It is not on max thinking, used 1.5-2x lower tokens than glm 5.1 lol

1h19

Mich_Kalek@KalekMich7668

@teortaxesTex glm5.2 train too much on pure coding, not good on logic stuff

52m6

Rithesh Kumar@rk625dev

@teortaxesTex DSV4.1

2h6

AiDevCraft@AiDevCraft

The +22% on Apex 2025 paired with -11.4% on Apex Shortlist is the more telling signal — an asymmetric split like that is the classic fingerprint of post-training shaping that fit the seen distribution. Worth checking whether the Shortlist regression survives a matched thinking-token budget against 5.1, otherwise capability and effort are tangled.

15m1