12h ago
ICML study of 60 benchmarks finds rapid AI progress causes evaluations to saturate, making private test sets ineffective
Open-ended tasks also fail to prevent this measurement plateau.