
AI researcher Herbie Bradley says leading closed labs presumably use stricter internal controls to limit public benchmark overfitting

Story Overview

Cambridge researcher Herbie Bradley says major closed labs do hill-climb public benchmarks such as PostTrainBench, but presumably rely on more rigorous internal controls to curb the kind of overfitting that produces flashy but brittle gains.


Original post

Lisan al Gaib@scaling01

GLM-5.2 results were sus, so I looked into how the models post-train

and it's slop the results would be useless in the real world

it's just another benchmark that GLM bros hillclimbed

mind you, GLM-5 was in 22nd place and then a few months later it's suddenly in 1st

part of the problem is the benchmark, because there are no hidden evals and models are training one model for one eval at a time, so they are kind of encouraged to build overfit slop

Thoughtful@thoughtfullab

GLM 5.2 is 5x cheaper than Opus 4.8 and 11x than Fable 5, yet it tops PostTrainBench.

That’s exciting because lower costs make personalized intelligence economically viable. Every company and country should be able to own models trained on its own data and have sovereignty over it. The future is millions of models, each crafted around the data, values, and decisions of the people who rely on them.

8:17 AM · Jul 4, 2026 · 47K Views

Benchmark Watch

Internal checks limit how far public scores can be gamed

Bradley says OpenAI and Anthropic do hill-climb PostTrainBench but presumably pair that effort with more rigorous internal controls against overfitting. Lisan al Gaib argues that open labs, by contrast, are encouraged to benchmaxx because leaderboard wins are how they attract users.

Open Question

Without outside confirmation the safeguard claim stays untested

No independent details on those internal controls have surfaced, leaving open whether the extra rigor actually prevents overfitting or simply stays out of public view.

Sentiment

Negative users called GLM-5.2's benchmark win overfitting slop from benchmaxxing, while positive users defended its practical task performance and lack of heavy guardrails.

Pos: 24.1%

Neg: 75.9%

31 comments with sentiment.


Posts from X


Lisan al Gaib@scaling01

OpenAI and Anthropic could also train on PostTrainbench and would get crazy scores everywhere

benchmaxxing is encouraged for open labs as this is how they get their users

"lOoK wE bEaT GpT aNd oPuS"


Nathan (e/acc)@Noreply134882

@scaling01 Something tells me you never actually used the model to see how good it truly is. They'll have a Mythos-level mode within 3 months. Nothing about GLM-5.2 is benchmaxxed from my own experience.


SK@Samking207

@scaling01 It’s obvious you are being paid off by big closed labs, you have been spewing nonsense about GLM5.2 since it came out


Chase Brower@ChaseBrowe32432

really? i read through a few glm 5.2 posttrainbench rollouts (they have them all posted) and the results were very interesting to me. the model establishes baselines, carries out SFT, and then an RL-ish stage (sometimes iirc it did rejection sampling), and the validation/planning behavior looked pretty neat to me


josepha_mayo@josepha_mayo

what are you saying dude? ur benchmark is the real slop here

it has only been out for less than 3 weeks but u saying "a few months later it's first"?? if anyone benchmaxxes - it's the closed labs we see a gap in benchmarks scores and performance more in the closed lab models

create ur own post train bench lol, how'd u know they trained on it just cuz it's open the only way to verify a benchmark is not to just study the evals but verify it with real use- i use glm5.2 everyday for ml tasks and it's better than opus4.8 and gpt5.5 from what ive seen!


microoowave@clips_montage

@scaling01 5x cheaper yes only on API, anyone who codes will use subscription. I've tried claude code pro, glm lite, codex plans extensively using claude code as harness. GLM shreds your weekly usage because it thinks for 5 years, so overall it costs more per task than codex or claude.


Chase Brower@ChaseBrowe32432

@scaling01 https://posttrainbench.com/traces/run.html?id=glmx_glm-5.2-preview_1m__10h_run2__gsm8k_google_gemma-3-4b-pt_17341951#tab=trace


Chase Brower@ChaseBrowe32432

@scaling01 LIKE JUST LOOK AT THIS IT'S GOLDEN

the model has collected data from different sampling hypotheses, reasoned about those data, and come to a conclusion (all while strategizing about time allocation)

it's carrying out the scientific method


NeelMitra@LeeNArtim

This guy might take the cake for the most retarded bull posting against GLM5.2. Instead of random ramblings, have you actually used the model? Cause all this sounds like slop generated by AI. Real world dev experience is objectively the opposite. Youte literally bullshitting. Lisan al bullshitter.


Chase Brower@ChaseBrowe32432

@scaling01 and you can see in the rollouts the model does spend a lot of time thinking explicitly about overfit issues (albeit at a lower level of abstraction than you were presumably critiquing)


Toposopher@toposopher

@josepha_mayo @scaling01 > it has only been out for less than 3 weeks but u saying "a few months later it's first"??

no, they're saying glm-5 was 22nd, and then after a few months zhipu is 1st with glm-5.2


Toposopher@toposopher

@josepha_mayo @scaling01 oh no, I don't agree or disagree on that, i don't know enough about these systems, I was just answering the rhetorical question you asked.


Da7em@Da7_Tech

@scaling01 Can you give more detail on what you mean by “slop” here?


Herbie Bradley@herbiebradley

@scaling01 they do actually hill climb PostTrainBench but presumably have more rigorous internal controls to reduce overfitting


rajesh Vengadesan@SagaciousSapien

@scaling01 Yes , i cannot use it for basic ppt creation. It cannot follow instructions.


Simple AI@Simple_AI_00

@scaling01 Agreed. GLM-5.2's leap screams benchmark overfitting, not real capability. Slop training wins again.


frostjack980@frostjack972755

@scaling01 did u even try the model? idc what the benchmarks say, as long as i give it a task and it does it. It's a good model


SK@Samking207

@scaling01 Yuur benchmark is sus


ObservatoryRemote@baseball73424

@scaling01 shut up dumfuck. Your shitty chatgpt asked benchmark doesn't define thousands of users' great experience with the model


Theo Borges@TheoLBorges

@scaling01 I don't give a damn about benchmarks. I have both; GLM 5.2 is great, much better than the other GLM models so far, but Opus is still giving me better results.
