ICML study of 60 benchmarks finds rapid AI progress causes evaluations to saturate, making private test sets ineffective · Digg