12h ago

ICML study of 60 benchmarks finds rapid AI progress causes evaluations to saturate, making private test sets ineffective

Open-ended tasks also fail to prevent this measurement plateau.
