AI developer Teortaxes questions the validity of the prinzbench AI benchmark after GLM-5.2 scores a low 30 out of 99

Story Overview

A pseudonymous AI developer with a large following is pushing back on prinzbench after the new GLM-5.2 model landed a 30/99 score that sits well below several frontier systems and even some models from last year. The benchmark's creator flagged weak legal reasoning and hallucinations in the responses, yet the result also produced an unexpected ordering that placed Grok-4.20 slightly ahead of Opus-4.8.

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Honestly this makes the whole benchmark look even more absurd. Grok 4.20 over Opus 4.8 (max), Kimi K2.5 > GLM 5.2 and Opus 4.7, Opus 4.6 down in the dumps below Grok 4… what is going on here? Sounds like it's super sensitive to lab priorities in this domain.

prinz@deredleritt3r

Added to prinzbench: GLM-5.2.

This is a slop model that is poor at logical reasoning, produces extremely inconsistent results, hallucinates statutory provisions that are not actually there, and has very little "brainpower".

Its overall prinzbench score (30/99) is far behind not only today's frontier models (compare GPT-5.5 at 74/99), but even models released 8 months ago, like Gemini 3 Pro (which scored 35/99).

6:45 AM · Jun 27, 2026 · 16.4K Views

Benchmark Limits

Narrow legal focus limits broader claims

Prinzbench draws its 33 questions from one narrow slice of U.S. law plus obscure retrieval tasks, all written and graded by a single non-blind evaluator. That setup can surface real capability gaps, but it also leaves open whether the low GLM-5.2 mark reflects model weakness or benchmark specificity.
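One rough way to see how much a 99-point scale can separate models is to put an uncertainty band around each reported score. The sketch below is illustrative only: it uses the three scores quoted in prinz's post and assumes, purely for the sake of the calculation, that each of the 99 points behaves like an independent pass/fail trial. The source does not describe the scoring that way, and the helper name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion.

    Assumption: every point is an independent pass/fail trial, which is a
    simplification and not something the benchmark's author states.
    """
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


TOTAL_POINTS = 99  # maximum prinzbench score quoted in the post

# Scores as quoted in the post; no other models are filled in here.
scores = {"GPT-5.5": 74, "Gemini 3 Pro": 35, "GLM-5.2": 30}

for name, points in scores.items():
    low, high = wilson_interval(points, TOTAL_POINTS)
    print(f"{name}: {points / TOTAL_POINTS:.1%} (roughly {low:.1%} to {high:.1%})")
```

Under that assumption the bands for GLM-5.2 and Gemini 3 Pro overlap, while GPT-5.5 sits clearly above both, so a 5-point gap on this scale is weaker evidence than a 44-point gap.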

Open Question

Odd leaderboard order invites scrutiny

Grok-4.20 finishing above Opus-4.8 and GLM-5.2 trailing models released eight months earlier struck observers as inconsistent with other public evaluations, prompting questions about whether the private test set produces reliable frontier rankings.

Sentiment

Many users dismissed prinzbench and LisanBench as low-quality or biased after GLM-5.2 posted weak scores that trailed frontier models, while a few praised the benchmarks themselves.

Positive: 10.3%
Negative: 89.7%

Based on 39 comments with sentiment.

Posts from X

Lisan al Gaib@scaling01

LisanBench results for GLM-5.2

GLM-5.2-high ranks #29 with a score of 1834 compared to GLM-5's score of 986.83

so it's better than previous versions but still sucks compared to other open-source models like DeepSeek-V4 or Kimi models

Kimi-K2.5 Thinking scores about the same but uses on average only ~19k tokens vs GLM-5.2's ~32k tokens

in terms of reasoning efficiency it's very close to GPT-5-medium and Gemini 3 Flash

simobis@simobis23

@deredleritt3r Why call it a slop model? It performs better than Opus 4.7 on your benchmark.

prinz@deredleritt3r

More details here: https://github.com/prinz-ai/prinzbench

ahtoshkaa@ahtoshkaa

@deredleritt3r People who think that Chinese models are 4 months behind are either paid shills or never used those models a day in their life.

Greg@Greg22040755

@simobis23 @deredleritt3r Slop benchmark.

Agostinho Serrano@EducatingwithAI

From my personal usage I 100% agree with you.

Sent a paper to evaluate and it replied a phrase in Table 4 was wrong:

“Efficient creativity doesn’t require the generation of many ideas” - GLM 5.2 wanted me to change to “requires”, which is dumb. No other model asked the same.

Kimi 2.6T on the other hand is surprising me! You haven’t scored it yet, did you?

Alex@Alex_m

@deredleritt3r What provider did you use? If it’s a random fast openrouter one then you might have used a 1/2 bit quantized version.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@scaling01 V3.2 Speciale still being #10 is very funny

prinz@deredleritt3r

@abacaj Thank you for this well-thought-out question. I do not work for OpenAI.

dan ryan@fireobserver32

@teortaxesTex Grok over Opus 4.8… bro, Grok has some of the worst, driest writing I have ever seen. Literally the only advantage it has is no NSFW filter. On technical tasks it offers no advantage.

Kirk Patrick Miller@Chaos2Cured

Shilling for OpenAI… always the same.

GLM doesn’t gaslight.

Also, benchmarks are BS.

Open up the data. Which will never happen.

F OpenAI. They are the most evil corporation to ever exist. I will NEVER touch anything they ever do.

They are trash. But they pay you, so keep shilling. Hope it makes you proud to pitch for the darkness.

Lisan al Gaib@scaling01

it fails in 76.7% of runs because it doesn't respect the edit-distance 1 rule
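For readers unfamiliar with the rule mentioned here, an "edit-distance 1" constraint is commonly understood to mean that each word in a chain differs from the previous one by exactly one inserted, deleted, or substituted character. The sketch below shows one way such a check could be written; it is not LisanBench's actual grader, and the function names and example words are hypothetical.

```python
from typing import Optional, Sequence


def is_edit_distance_one(a: str, b: str) -> bool:
    """Return True if a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Substitution: everything after the mismatch must match.
        return a[i + 1:] == b[i + 1:]
    # Insertion into a: the rest of a must match b after skipping one character.
    return a[i:] == b[i + 1:]


def first_violation(chain: Sequence[str]) -> Optional[int]:
    """Return the index i of the first step chain[i] -> chain[i + 1] that breaks the rule."""
    for i in range(len(chain) - 1):
        if not is_edit_distance_one(chain[i], chain[i + 1]):
            return i
    return None


# Hypothetical chains, for illustration only.
print(first_violation(["cold", "cord", "card", "ward"]))  # every step changes one letter
print(first_violation(["cold", "cord", "ward"]))          # "cord" -> "ward" changes two letters
```

A run would count as failed under this reading as soon as `first_violation` returns an index rather than `None`.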

sdmat@sdmat123

@ahtoshkaa @deredleritt3r Opus 4.7 was a little over 2 months ago

Lisan al Gaib@scaling01

it still shows the same "hacky" behavior that Opus 4.5 was showing

anton@abacaj

@deredleritt3r What is this? The top models are all openai. Do you work for openai? Even the oldest model from openai is at the top of this lol

jj⚙️🌳🔭🔬@murchiston

@simobis23 @deredleritt3r buddy,

Lisan al Gaib@scaling01

so reasoning efficiency: Kimi > GLM > DeepSeek

DeepSeek models mostly score higher because they use more tokens

Dominik Lukes@techczech

@deredleritt3r Seems like it beats Opus 4.7.
