Tech · 5h ago

Lisan al Gaib says GLM-5.2's PostTrainBench SOTA comes with ~38% more evaluation probing per run than Opus 4.8

The automated judge mostly checks for direct contamination and model substitution

#33

Original post

Niels Rogge@NielsRogge

GLM-5.2 is the literal SOTA on PostTrainBench

Beating GPT-5.5 and Opus 4.8

Learn more here https://paperswithcode.co/benchmark/posttrainbench

1:53 PM · Jun 20, 2026 · 6.4K Views

Sentiment

Positive users celebrate open-source GLM-5.2 reaching SOTA on PostTrainBench since post-training rewards recipes over compute moats, while negative users dismiss the claims as unbelievable or biased shilling.

Positive: 50.0%

Negative: 50.0%

4 comments with sentiment.

Related links

Papers with Code

Posts from X

Most Activity

Views: 2K · Bookmarks: 1 · Likes: 3

Lisan al Gaib@scaling01

some more thoughts on PostTrainBench

@maksym_andr

Lisan al Gaib@scaling01

I let GPT-5.5-xhigh with /goal analyze the traces of GLM-5.2 and Opus 4.8 on PostTrainBench

and there's a crazy stat:
- Opus 4.8 Max: 590 eval invocations across 56 runs, mean 10.54/run
- GLM-5.2: 1220 eval invocations across 84 runs, mean 14.52/run

meaning GLM is doing ~38% more eval probing per run
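The ~38% figure can be re-derived from the counts quoted above. The short check below only uses the four numbers in the post; the variable names are illustrative.

```python
# Counts as quoted in the post (eval invocations and number of runs).
opus_invocations, opus_runs = 590, 56
glm_invocations, glm_runs = 1220, 84

# Mean eval invocations per run for each agent.
opus_mean = opus_invocations / opus_runs
glm_mean = glm_invocations / glm_runs

# Relative increase of GLM-5.2 over Opus 4.8 Max, per run.
relative_increase = glm_mean / opus_mean - 1.0

print(f"Opus 4.8 Max: {opus_mean:.2f} evals/run")
print(f"GLM-5.2:      {glm_mean:.2f} evals/run")
print(f"GLM-5.2 probes {relative_increase:.1%} more per run")
```

Note that this compares per-run means only. It says nothing about how the probes were used or whether they changed the final scores.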

The judge that is supposed to stop cheating on PostTrainBench mostly checks for direct contamination/model substitution

and models don't use these obvious cheats/hacks, because they are discouraged or forbidden in the prompt

however, there are many other benchmark hacks:
- repeated official eval probing + checkpoint/hyperparameter selection
- exploiting stochastic or underspecified eval settings
- editing model-side generation_config.json / tokenizer / EOS / stop-token behavior
- training to exact parser/scorer quirks
- synthetic data that mirrors benchmark schemas, styles, or rubrics
- judge/rubric hacking for Arena and HealthBench
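To see why the first item in that list matters, here is a minimal simulation sketch of selection on a noisy eval. It assumes every checkpoint has the same true score and that each official eval adds independent Gaussian noise; `TRUE_SCORE` and `NOISE_SD` are made-up values, and the probe counts 10 and 14 are just the per-run means from the post rounded down. It is not a model of how PostTrainBench actually scores runs.

```python
import random
import statistics

def best_of_k(true_score, noise_sd, k, rng):
    """Highest score seen across k noisy evals of equally good checkpoints."""
    return max(rng.gauss(true_score, noise_sd) for _ in range(k))

def expected_reported_score(true_score, noise_sd, k, trials=20_000, seed=0):
    """Monte Carlo estimate of the score reported after keeping the best of k probes."""
    rng = random.Random(seed)
    return statistics.fmean(
        best_of_k(true_score, noise_sd, k, rng) for _ in range(trials)
    )

TRUE_SCORE = 30.0  # hypothetical true benchmark score (assumption)
NOISE_SD = 1.5     # hypothetical eval noise, in score points (assumption)

for k in (1, 10, 14):
    reported = expected_reported_score(TRUE_SCORE, NOISE_SD, k)
    print(f"{k:>2} probes: reported {reported:.2f} "
          f"(inflation {reported - TRUE_SCORE:+.2f})")
```

Under these assumptions, the reported score rises with the number of probes even though no checkpoint is actually better, which is why picking a checkpoint by official eval score can inflate results without any direct contamination.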

I think the biggest issue is that models are encouraged to cheat: "We want to train the small LLM {model} to excel at {benchmark}." "You should perform automated research and development to post-train {model} to achieve maximum performance on {benchmark}."

The post-trained models that come out the other side are probably much worse at everything else.

What would be more interesting is having models optimize all these benchmarks at the same time, and then using a hidden eval suite to see how general the improvements are and how they affect other capabilities.

2h · 2K views · 3 likes · 1 bookmark

Lisan al Gaib@scaling01

I didn't know that they had unrestricted internet access

This makes me trust PostTrainBench less

Because more recent models have access to better teacher models, better datasets and better papers / methods for post-training.

This inflates model scores of the most recent models.

I don't think rerunning every model when a new model is added is necessary, but I think they should rerun a small set of anchor models like every 1-3 months to quantify that score inflation
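One way the anchor-model idea could be quantified is sketched below: rerun the same anchors at a later date and report the mean score change. The function name, model names, and scores are all hypothetical placeholders, not PostTrainBench data or tooling.

```python
from statistics import fmean

def anchor_drift(baseline, rerun):
    """Mean score change across anchor models present in both snapshots.

    Returns (mean_drift, per_model_drift). A positive mean suggests scores
    are inflating over time for reasons unrelated to the agents themselves.
    """
    shared = sorted(baseline.keys() & rerun.keys())
    if not shared:
        raise ValueError("no anchor models in common")
    per_model = {name: rerun[name] - baseline[name] for name in shared}
    return fmean(per_model.values()), per_model

# Placeholder scores, purely illustrative (not real benchmark results).
baseline = {"anchor-a": 20.0, "anchor-b": 25.0}
rerun = {"anchor-a": 21.0, "anchor-b": 27.0}

drift, per_model = anchor_drift(baseline, rerun)
print(f"mean drift: {drift:+.2f} points", per_model)
```

The mean drift could then be subtracted from newer entries or simply published next to the leaderboard. Either choice is a design decision this sketch does not settle.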

2h

W@nemesisprime54

@NielsRogge Wondering, did they do GRPO or PPO?

5h

yash@yashetal

@NielsRogge Open source and SOTA in this big 2026? Hell yeah

4h

Phi Browser@phibrowser

@NielsRogge the signal isn't that GLM won, it's where it won. post-training is recipe-gated, not compute-gated, so the closed labs' compute moat doesn't apply. and recipes leak. open weights will keep topping post-training boards because that's the layer scale can't defend.

4h

coralcoral@coralcoral55984

@NielsRogge i don't believe you

3h

Erika S@E_FutureFan

@NielsRogge I'm wondering if there's a human baseline anywhere? 34.3 in isolation doesn't tell us much without knowing what a good engineer would score.

5h

Outdated Often@JamesSurra34

@NielsRogge K bro, having a job at huggingface doesn't make you credible, if anything it makes you an openweights China shill. First of all, the model, even at 2bit ud quant, is barely runnable on my setup, and 2, it's shit in comparison to opus 4.8 or 5.5 xhigh. u smoking something strong

2h

长期收购 LLM-API资源@SapnaHol47567

@NielsRogge that's a bold statement, need to see the benchmarks