We just evaluated GLM 5.2 on Matharena!
Although GLM 5.2 has proven to be very good at coding, the improvement is not as drastic for math. GLM 5.2 beats its predecessor, GLM 5.1, by only 1.9% in expected performance.
Prime Intellect research engineer Florian Brand highlighted GLM 5.2’s slim 1.9 percent expected-performance lift on the Matharena benchmark versus its predecessor, underscoring how Zhipu AI’s latest open-weight release emphasizes long-horizon coding and agent workflows over pure mathematical reasoning gains.
GLM 5.2’s documented strengths lie in multi-effort programming modes and sustained execution tasks, and observers note that math progress arrived as a smaller byproduct rather than the headline capability.
Matharena’s use of recent contest problems and item-response modeling reduces contamination risk, so the modest delta stands as one of the cleaner current reads on where reasoning advances are actually landing.
Supportive replies call GLM 5.2's ~56% drop in expected cost impressive, while critical replies argue that its heavy coding focus hurts logic performance.
Yeah, as I said. This is an SWE-first release. Gains on math are inconsistent, sometimes negative. On the other hand, I think DSV4.1 will show scary gains on math. They're good at it, and it's what GRPO was built for.

GLM 5.2 scores very well on Apex 2025 (a +22% improvement over GLM 5.1) and makes good improvements on ArxivMath, but it underperforms GLM 5.1 by 11.4% on Apex Shortlist, which is unexpected.

The full results can be found on Matharena as usual: http://matharena.ai

@teortaxesTex The cost is weird. I wouldn't call it more terse.

@karirogg is it max or high

@karirogg GLM 5.1 used 1.2x to 2x more tokens here. On max settings, GLM 5.2 thinks a lot, so the setting used was very likely medium, not even high. The thinking level is very important for math.

@karirogg How is it possible to show such results without stating what thinking level it was performed with??

@karirogg I think it is not on max mode, very low thinking tokens

@teortaxesTex it's also impressive that the expected cost has dropped by ~56%

@teortaxesTex It is not on max thinking; it used 1.5-2x fewer tokens than GLM 5.1 lol
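
For readers trying to square the token ratios quoted in the replies with the ~56% cost drop, here is a rough back-of-the-envelope sketch. It assumes cost scales linearly with tokens and that both models are priced identically per token. Neither assumption is confirmed in the thread, and the function names are illustrative only.

```python
def cost_drop_from_token_ratio(ratio: float) -> float:
    """Fractional cost drop if the older model used `ratio` times more tokens.

    ASSUMPTION: same price per token for both models, cost linear in tokens.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 - 1.0 / ratio


def token_ratio_from_cost_drop(drop: float) -> float:
    """Token ratio (old / new) implied by a fractional cost drop, same assumptions."""
    if not 0.0 <= drop < 1.0:
        raise ValueError("drop must be in [0, 1)")
    return 1.0 / (1.0 - drop)


if __name__ == "__main__":
    # Token ratios mentioned in the replies: 1.5x and 2x.
    for ratio in (1.5, 2.0):
        print(ratio, round(cost_drop_from_token_ratio(ratio), 3))
    # Cost drop mentioned in the replies: ~56%.
    print(round(token_ratio_from_cost_drop(0.56), 2))
```

Under these assumptions, a 1.5x to 2x token ratio would explain a 33% to 50% cost drop, and a 56% drop would correspond to roughly a 2.3x ratio. Any remaining gap could come from pricing differences or rounding, which this sketch does not model.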

@teortaxesTex glm5.2 train too much on pure coding, not good on logic stuff

@teortaxesTex DSV4.1

The +22% on Apex 2025 paired with -11.4% on Apex Shortlist is the more telling signal — an asymmetric split like that is the classic fingerprint of post-training shaping that fit the seen distribution. Worth checking whether the Shortlist regression survives a matched thinking-token budget against 5.1, otherwise capability and effort are tangled.
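
A minimal sketch of what the matched-budget check suggested above could look like, assuming you have several runs per model at different thinking levels, each summarized by mean thinking tokens and accuracy. The `Run` record, the model labels, and all numbers below are hypothetical placeholders, not Matharena data or its API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Run:
    model: str          # hypothetical label, e.g. "new" or "old"
    mean_tokens: float  # average thinking tokens per problem
    accuracy: float     # fraction of problems solved, in [0, 1]


def matched_delta(
    runs: Sequence[Run], new: str, old: str, tolerance: float = 0.10
) -> Optional[Tuple[float, Run, Run]]:
    """Compare the two models only on the pair of runs with the closest token usage.

    Returns (accuracy delta, new run, old run), or None when no pair of runs
    has a relative token gap within `tolerance`.
    """
    best = None
    for a in (r for r in runs if r.model == new):
        for b in (r for r in runs if r.model == old):
            gap = abs(a.mean_tokens - b.mean_tokens) / max(a.mean_tokens, b.mean_tokens)
            if gap <= tolerance and (best is None or gap < best[0]):
                best = (gap, a, b)
    if best is None:
        return None
    _, a, b = best
    return a.accuracy - b.accuracy, a, b


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    runs = [
        Run("new", 10_000, 0.40),
        Run("new", 20_000, 0.50),
        Run("old", 19_000, 0.48),
        Run("old", 30_000, 0.52),
    ]
    result = matched_delta(runs, "new", "old")
    if result is not None:
        delta, a, b = result
        print(f"delta={delta:+.3f} at ~{a.mean_tokens:.0f} vs ~{b.mean_tokens:.0f} tokens")
```

If no pair of runs falls within the tolerance, the function returns `None` rather than comparing mismatched budgets. That is the situation the comment warns about, where capability and effort cannot be separated.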