Tech · 2h ago

Prime Intellect's Florian Brand validates ICML study showing private benchmarks saturate just as quickly as public ones

An LLM-assisted audit of 118 datasets confirmed the ceiling effect.

Original post
Florian Brand @xeophon · #1190 in Tech

interesting paper!

i was surprised by the claim that private benches saturate as quickly, so i asked diff llms (fable, codex) to analyze + expand the paper.

both found mislabeled data, then extended the dataset.

but: the results hold! private benches saturate just as fast

EvalEval Coalition @evaluatingevals

🚨 As AI models improve, many benchmarks are becoming saturated and losing their ability to distinguish between models. 🚨

Check out our new @icmlconf paper: “When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation”

10:44 AM · Jun 11, 2026 · 1.4K Views
Sentiment

Users praise the ICML paper on AI benchmark saturation as great work and ask about accessing its annotated data.

Pos: 100.0%
Neg: 0.0%
1 comment with sentiment.
Cluster Engagement
Posts from X
Most Activity
VIEWS 276

@evaluatingevals @icmlconf great work!!

is the annotated data available somewhere?

also: GPQA-D in T5 has 198 samples, the bigger set is just GPQA

5h · Views 276 · Likes 1
LIKES 4

@evaluatingevals @icmlconf found the annotated data.

why is TerminalBench 1 + 2 labelled private? you can run it without any restrictions very easily

HLE is also weird to label as private, as everyone reports the public set

4h · Views 254 · Likes 4
RETWEETS 15
EvalEval Coalition @evaluatingevals

🚨 As AI models improve, many benchmarks are becoming saturated and losing their ability to distinguish between models. 🚨

Check out our new @icmlconf paper: “When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation”

9d · Views 13K · Likes 51 · Bookmarks 23
REPLIES 1

@xeophon @evaluatingevals @icmlconf is the annotated data on hf? gh? curious to run a new analysis method i'm working on that is related.

3h · Views 23

@raw_works @evaluatingevals @icmlconf Github, I’ve re-ran the analysis and it still holds up when re-annotated. fwiw 5/8 of the “private” datasets are public 🙃

3h · Views 14 · Likes 1
Mubashara Akhtar @akhtarmubashara

Our analysis looks into saturation dynamics of public vs private benchmarks (and treats them as a benchmark property) rather than looking at the distributional overlap between them. And yes, one possible interpretation of our findings is what you suggest: hiding test items alone may not be sufficient if the underlying task distribution is already represented elsewhere. Agree that studying the question whether distributional novelty or adversarial shifts slow saturation would be an interesting followup question.

7d · Views 41