
Researcher Proposes Perplexity on Clean Data to Fix AI Benchmarks


Original post

i know the solution to the AI benchmark problem but nobody is gonna like it. it’s easy: just report test perplexity on uncontaminated high-quality code/lang/etc. you give me base model api. i run on my secret dataset. i give you test ppl. all evals are downstream of that. solved

12:04 PM · May 28, 2026
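For readers unfamiliar with the metric, the workflow in the post (model API in, test perplexity out) can be sketched roughly as below. This is a minimal illustration, not the author's actual setup: `ScoreFn` is a hypothetical stand-in for a base-model API that returns natural-log token probabilities, and `uniform_toy_model` is a made-up scorer used only so the example runs. The bits-per-byte variant is an added assumption, included because per-token perplexity is not directly comparable across models with different tokenizers.

```python
import math
from typing import Callable, Iterable, List

# Hypothetical stand-in for a base-model API: takes a document and returns the
# natural-log probability the model assigned to each token in it.
ScoreFn = Callable[[str], List[float]]


def corpus_perplexity(docs: Iterable[str], score: ScoreFn) -> float:
    """Per-token perplexity: exp of the mean negative log-likelihood."""
    total_logprob = 0.0
    total_tokens = 0
    for doc in docs:
        logprobs = score(doc)
        total_logprob += sum(logprobs)
        total_tokens += len(logprobs)
    if total_tokens == 0:
        raise ValueError("no tokens were scored")
    return math.exp(-total_logprob / total_tokens)


def corpus_bits_per_byte(docs: Iterable[str], score: ScoreFn) -> float:
    """Total negative log-likelihood in bits, divided by the UTF-8 byte count."""
    total_logprob = 0.0
    total_bytes = 0
    for doc in docs:
        total_logprob += sum(score(doc))
        total_bytes += len(doc.encode("utf-8"))
    if total_bytes == 0:
        raise ValueError("no bytes were scored")
    return -total_logprob / (math.log(2) * total_bytes)


def uniform_toy_model(doc: str) -> List[float]:
    # Toy scorer for illustration only: one token per character, each with p = 1/4.
    return [math.log(0.25)] * len(doc)
```

With the toy scorer, `corpus_perplexity(["abcd", "ef"], uniform_toy_model)` comes out at about 4.0 (a uniform guess among four options) and `corpus_bits_per_byte` at about 2.0. A real run would swap in an actual model's log-probabilities and a held-out dataset.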

#254 will depue @willdepue

more seriously, this predicts big model smell, which is the most important thing nowadays (it’s why gpt 5.5 > opus 4.7), because big model smell is just pretraining capability, which is most predicted by true held out test ppl

will depue @willdepue

7:04 PM · May 28, 2026 · 8K Views

7:09 PM · May 28, 2026 · 2.5K Views


i will actually organize this if people think there’s willingness from the labs to do it. i don’t think there is though. it would work really really well. maybe the chinese will be down

the only problem is people can cheat bc they can see api tokens. but i have ideas on that too



7:11 PM · May 28, 2026 · 1.5K Views