https://www.aisi.gov.uk/blog/more-compute-more-capability-why-ai-agent-evals-need-to-account-for-test-time-compute
Nice blog post by @AISecurityInst. If you're running evals, you're probably not using enough tokens.
For example, METR has started spending 5-10B tokens (incl. caching) on our hardest tasks, because otherwise newer models don't have room to shine.