interesting paper!
i was surprised by the claim that private benches saturate as quickly as public ones, so i asked different llms (fable, codex) to analyze + expand the paper.
both found mislabeled data, then extended the dataset.
but: the results hold! private benches saturate just as fast
🚨 As AI models improve, many benchmarks are becoming saturated and losing their ability to distinguish between models. 🚨
Check out our new @icmlconf paper: “When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation”


