Researcher Proposes Perplexity on Clean Data to Fix AI Benchmarks
i know the solution to the AI benchmark problem but nobody is gonna like it. it’s easy: just report test perplexity on uncontaminated high-quality code/lang/etc. you give me base model api. i run on my secret dataset. i give you test ppl. all evals are downstream of that. solved
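for concreteness, a minimal sketch of the number being proposed, assuming the base model api returns a natural-log probability for each token of the held-out text. the function names and the input shape are hypothetical, not any lab's actual api. bits per byte is included because per-token perplexity is not directly comparable across models with different tokenizers.

```python
import math
from typing import Sequence


def perplexity(docs: Sequence[Sequence[float]]) -> float:
    """Token-weighted perplexity over a held-out corpus.

    `docs` holds one list of per-token natural-log probabilities per
    document (an assumed input shape, for illustration only).
    """
    total_logprob = sum(sum(doc) for doc in docs)
    n_tokens = sum(len(doc) for doc in docs)
    if n_tokens == 0:
        raise ValueError("need at least one token log-probability")
    # Perplexity is exp of the mean negative log-likelihood per token.
    return math.exp(-total_logprob / n_tokens)


def bits_per_byte(docs: Sequence[Sequence[float]], n_bytes: int) -> float:
    """Tokenizer-independent variant: total bits divided by UTF-8 bytes."""
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    total_logprob = sum(sum(doc) for doc in docs)
    # Convert nats to bits by dividing by ln(2).
    return -total_logprob / (n_bytes * math.log(2))
```

a model that puts probability 0.5 on every token gets perplexity 2. lower is better for both numbers.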
i will actually organize this if people think there’s willingness from the labs to do it. i don’t think there is though. it would work really really well. maybe the chinese will be down
the only problem is people can cheat bc they can see api tokens. but i have ideas on that too
more seriously, this predicts big model smell, which is the most important thing nowadays (it’s why gpt 5.5 > opus 4.7), because big model smell is just pretraining capability, which is most predicted by true held out test ppl