UK AISI: "At the current frontier, raising the budget from 2.5M to 50M tokens increases the estimated horizon from (roughly) 2 hours to 14 hours"
banger after banger today
go read the quoted post!
Most AI agent evaluations boil capability down to one score. But that number hides a key choice: how much compute the agent was allowed to use. New work from our Science of Evaluation team shows why that matters. 🧵
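
For a rough sense of scale, here is a minimal sketch of the arithmetic behind the quoted figures. The four numbers come from the quote; the power-law reading (horizon growing as budget to some exponent) is an assumption made here for illustration, not something the post states, and two points cannot confirm it.

```python
import math

# Figures quoted from the UK AISI post (approximate, per the quote's "roughly").
low_budget_tokens = 2.5e6
high_budget_tokens = 50e6
low_horizon_hours = 2.0
high_horizon_hours = 14.0

# Fold-changes between the two quoted settings.
budget_ratio = high_budget_tokens / low_budget_tokens
horizon_ratio = high_horizon_hours / low_horizon_hours

# ASSUMPTION (not from the post): horizon scales as budget ** k between
# these two points. With only two points this is an illustration, not a fit.
implied_exponent = math.log(horizon_ratio) / math.log(budget_ratio)

print(f"budget x{budget_ratio:.0f}, horizon x{horizon_ratio:.0f}, "
      f"implied exponent ~{implied_exponent:.2f}")
```

So 20x the tokens buys about 7x the horizon: clearly sublinear, but large enough that a single score reported without its token budget says little.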
