/Tech · 2h ago

Prime Intellect's Florian Brand validates ICML study showing private benchmarks saturate just as quickly as public ones

An LLM-assisted audit of 118 datasets confirmed the ceiling effect.

#487

Original post

Florian Brand@xeophon · #1190 in Tech

interesting paper!

i was surprised by the claim that private benches saturate as quickly, so i asked diff llms (fable, codex) to analyze + expand the paper.

both found mislabeled data, then extended the dataset.

but: the results hold! private benches saturate just as fast

EvalEval Coalition@evaluatingevals

🚨 As AI models improve, many benchmarks are becoming saturated and losing their ability to distinguish between models. 🚨

Check out our new @icmlconf paper: “When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation”

10:44 AM · Jun 11, 2026 · 1.4K Views

Sentiment

Users praise the ICML paper on AI benchmark saturation as great work and ask about accessing its annotated data.

Pos: 100.0%

Neg: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Florian Brand@xeophon

@evaluatingevals @icmlconf great work!!

is the annotated data available somewhere?

also: GPQA-D in T5 has 198 samples, the bigger set is just GPQA

5h

LIKES 4

Florian Brand@xeophon

@evaluatingevals @icmlconf found the annotated data.

why is TerminalBench 1 + 2 labelled private? you can run it without any restrictions very easily

HLE is also weird to label as private, as everyone reports the public set

4h

RETWEETS 15

EvalEval Coalition@evaluatingevals

🚨 As AI models improve, many benchmarks are becoming saturated and losing their ability to distinguish between models. 🚨

Check out our new @icmlconf paper: “When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation”

9d

REPLIES 1

Raymond Weitekamp@raw_works

@xeophon @evaluatingevals @icmlconf is the annotated data on hf? gh? curious to run a new analysis method i'm working on that is related.

3h

Florian Brand@xeophon

@raw_works @evaluatingevals @icmlconf Github, I’ve re-ran the analysis and it still holds up when re-annotated. fwiw 5/8 of the “private” datasets are public 🙃

3h

Mubashara Akhtar@akhtarmubashara

Our analysis looks into saturation dynamics of public vs private benchmarks (and treats them as a benchmark property) rather than looking at the distributional overlap between them. And yes, one possible interpretation of our findings is what you suggest: hiding test items alone may not be sufficient if the underlying task distribution is already represented elsewhere. Agree that studying the question whether distributional novelty or adversarial shifts slow saturation would be an interesting followup question.

7d