METR evaluations find that frontier AI agents rely on explicit natural-language chain-of-thought to complete their hardest tasks, with time horizons dropping from 1.5–2 years to about 4 minutes when actions must stay hidden · Digg