UK AI Safety Institute warns standard evaluations understate frontier AI capabilities by capping test-time compute

METR now spends 5-10 billion tokens (including caching) on its hardest tasks.

Original post

david rein@idavidrein

https://www.aisi.gov.uk/blog/more-compute-more-capability-why-ai-agent-evals-need-to-account-for-test-time-compute

Nice blog post by @AISecurityInst. If you're running evals, you're probably not using enough tokens.

For example, METR has started spending 5-10B tokens (incl. caching) on our hardest tasks, because otherwise newer models don't have room to shine.

4:52 PM · Jul 2, 2026 · 183 Views

Via AI Security Institute

Posts from X

132 Views · 1 Like

Tomek Korbak@tomekkorbak

https://www.aisi.gov.uk/blog/more-compute-more-capability-why-ai-agent-evals-need-to-account-for-test-time-compute

One takeaway here is that forecasting AI progress must account for (i) more test-time compute and (ii) better ability of frontier models to leverage test-time compute. This is not super novel (@polynoamial has been talking about the importance of tracking capabilities as a function of test-time compute multiple times, e.g. https://x.com/polynoamial/status/2064210146558136827) but I think UK AISI's blog post articulates this well and adds more data points.
