"We find that frontier models like GPT-5.5 answer questions that take humans roughly three minutes with 50% reliability, and this time horizon has doubled approximately every year since 2019."
New paper!
Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
@METR_Evals showed that models' time horizons have doubled every few months. We ask: how long are the tasks that models can complete without any CoT?
