UK AI Safety Institute finds standard capability scores obscure performance variations from test-time compute budgets · Digg