Honestly, this makes the whole benchmark look even more absurd. Grok 4.20 over Opus 4.8 (max), Kimi K2.5 > GLM 5.2 and Opus 4.7, Opus 4.6 down in the dumps below Grok 4… what is going on here? Sounds like the benchmark is super sensitive to each lab's priorities in this domain.
Added to prinzbench: GLM-5.2.
This is a slop model that is poor at logical reasoning, produces extremely inconsistent results, hallucinates statutory provisions that are not actually there, and has very little "brainpower".
Its overall prinzbench score (30/99) is far behind not only today's frontier models (compare GPT-5.5 at 74/99), but even models released 8 months ago, like Gemini 3 Pro (which scored 35/99).