
Researcher Proposes Perplexity on Clean Data to Fix AI Benchmarks

Original post

i know the solution to the AI benchmark problem but nobody is gonna like it. it’s easy: just report test perplexity on uncontaminated high-quality code/lang/etc. you give me base model api. i run on my secret dataset. i give you test ppl. all evals are downstream of that. solved

12:04 PM · May 28, 2026

more seriously, this predicts big model smell, which is the most important thing nowadays (it’s why gpt 5.5 > opus 4.7), because big model smell is just pretraining capability, which is most predicted by true held out test ppl

will depue (@willdepue)

7:09 PM · May 28, 2026 · 2.5K Views

i will actually organize this if people think there’s willingness from the labs to do it. i don’t think there is though. it would work really really well. maybe the chinese will be down

the only problem is people can cheat bc they can see api tokens. but i have ideas on that too

7:11 PM · May 28, 2026 · 1.5K Views