Tech · 2h ago

AI Security Institute says agent benchmarks must report performance curves over compute rather than static scores

Story Overview

Static scores from fixed token budgets hide how much more agents could achieve with extra test-time compute. The AI Security Institute analysis shows performance often keeps climbing well beyond typical cutoffs, with some tasks only solved after tens of millions of tokens and newer models pulling further ahead as budgets grow.


Original post

Toby Ord@tobyordoxford

Excellent post from @AISecurityInst on how AI agents' time horizons scale with the number of tokens they're allowed. For me the most interesting point is the first graph below. Note how it is requiring much more than 10x the tokens to get 10x the time horizon…

AI Security Institute@AISecurityInst

Most AI agent evaluations boil capability down to one score. But that number hides a key choice: how much compute the agent was allowed to use. New work from our Science of Evaluation team shows why that matters. 🧵

12:41 AM · Jul 3, 2026 · 2.3K Views

Industry Shift

Fixed budgets hide real capability gaps

Sweeps across software, math and cyber tasks reveal that frontier models gain disproportionately from extra tokens while older ones plateau sooner, changing which systems look strongest depending on the evaluation limit chosen.

Open Question

Time-horizon trends shift with more tokens allowed

Earlier estimates of agent horizons doubling every few months were measured at 2.5 million tokens; the same models reach far longer horizons at 50 million tokens, yet the exact point where curves flatten for any given task stays unknown.

Sentiment

Most commenters praise the near-linear scaling and the short doubling times of AI agent time horizons as unusually strong, while others worry that sublinear returns on tokens will make inference compute a binding constraint.

Pos: 66.7%

Neg: 33.3%

4 comments with sentiment.


Posts from X

Most Activity

Views: 902 · Bookmarks: 1 · Likes: 15 · Retweets: 2

Ian Hogarth@soundboy

Important framing from @AISecurityInst "An AI agent's performance is best understood as a capability curve over compute, not a single score...The takeaway is about good measurement. Evaluations should report capability curves, not single numbers."

1h

Replies: 2

Toby Ord@tobyordoxford

So getting 5x performance for every 10x of inputs is much closer to linear than we usually see. This is why the time horizons are one of the only areas actually showing exponential improvements in AI performance over time (though the inputs are also exponentially increasing).

Toby Ord@tobyordoxford

But it is also worth noting that this scaling is very good by usual standards in AI scaling laws. Pre-training scaling requires 1,000,000x as much compute to halve the error. Inference scaling on maths benchmarks gives logarithmic improvement, and so does RL training.

2h
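To put the scaling figures in this thread on one scale, each can be converted into a power-law exponent. The following is a minimal Python sketch, assuming each relationship is a pure power law (an illustrative assumption); the function name `implied_exponent` is hypothetical and not from the original posts.

```python
import math

def implied_exponent(input_multiplier: float, output_multiplier: float) -> float:
    """Return a such that output scales as input ** a (pure power-law assumption)."""
    return math.log(output_multiplier) / math.log(input_multiplier)

# Figure quoted in the post: 1,000,000x compute halves the pre-training error.
pretraining = implied_exponent(1_000_000, 0.5)

# Figure quoted elsewhere in the thread: time horizon rises as the 2/3 power of tokens.
time_horizon = 2 / 3

print(f"pre-training error exponent: {pretraining:.3f}")
print(f"time-horizon exponent:       {time_horizon:.3f}")
```

Under these assumptions the pre-training exponent has a magnitude of about 0.05, an order of magnitude smaller than 2/3, which is the sense in which the time-horizon scaling is "very good by usual standards".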

Toby Ord@tobyordoxford

Because it is a log-log chart and those slopes are all roughly linear, the relationship is roughly a power law. The 80% time horizon is rising as the 2/3 power of the number of tokens. This means every 10x of the tokens buys about 5x the time horizon.

2h
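The arithmetic behind "10x the tokens buys about 5x the time horizon" can be checked directly. This is a small Python sketch, assuming the 2/3 exponent read off the chart holds exactly; the function names are hypothetical.

```python
# Assumed exponent, taken from the post's reading of the log-log chart.
EXPONENT = 2 / 3

def horizon_multiplier(token_multiplier: float, exponent: float = EXPONENT) -> float:
    """Factor by which the time horizon grows when tokens grow by token_multiplier."""
    return token_multiplier ** exponent

def tokens_multiplier_needed(target_horizon_multiplier: float, exponent: float = EXPONENT) -> float:
    """Factor by which tokens must grow to multiply the horizon by the target."""
    return target_horizon_multiplier ** (1 / exponent)

print(round(horizon_multiplier(10), 2))        # 4.64
print(round(tokens_multiplier_needed(10), 1))  # 31.6
```

So 10x the tokens gives roughly 4.6x the horizon (rounded to 5x in the post), and a 10x longer horizon needs roughly 32x the tokens, consistent with the remark that it takes "much more than 10x the tokens".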

Toby Ord@tobyordoxford

Since the time and money a human requires to complete a task is roughly linear with the time horizon (an hourly wage), the models here are scaling less well than humans.

2h
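The comparison with humans can be written out explicitly. Here $H$ is the time horizon, $T$ the token budget, and $C_{\text{human}}$ the human cost; the symbols are introduced for illustration, and the derivation assumes the 2/3-power relationship from the thread holds exactly and that human cost is exactly proportional to $H$.

```latex
H \propto T^{2/3}
\;\Longrightarrow\;
T \propto H^{3/2},
\qquad
C_{\text{human}} \propto H,
\qquad
\frac{T}{C_{\text{human}}} \propto H^{1/2}.
```

Under those assumptions, token cost relative to human cost grows as the square root of the horizon, or about 3.2x for every 10x increase in task length.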

Toby Ord@tobyordoxford

This suggests there is a more efficient approach out there — a cognitive algorithm that scales as the human one does.

2h

Toby Ord@tobyordoxford

Moreover, there isn't much improvement in how they scale over time (all the lines for different models have similar slopes). So for any time horizon, next year's models will be better, but we'd expect them to fall farther behind humans as the task extends.


2h


Alexander Barry@AlexBarry4

@tobyordoxford I agree there isn't a dramatic change in the slope steepness over time, but if you squint a bit I think a few of the earlier models (GPT 5, sonnet 4.5, Opus 4.5) do look like they have slightly shallower slopes?

1h

Alexander Barry@AlexBarry4

@tobyordoxford The 50 day doubling time of the 50M token time horizon is pretty remarkable

1h
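For a sense of what a 50-day doubling time implies, here is a short Python sketch. It assumes the doubling time stays constant, which is an illustrative assumption and not something the posts claim; the function name is hypothetical.

```python
def growth_factor(elapsed_days: float, doubling_days: float = 50.0) -> float:
    """Multiplicative growth after elapsed_days at a constant doubling time."""
    return 2.0 ** (elapsed_days / doubling_days)

print(growth_factor(50))    # 2.0
print(growth_factor(100))   # 4.0
print(round(growth_factor(365), 1))
```

If the trend held for a full year, the 50M-token time horizon would grow by a factor of roughly 158.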

LandonCryptoExplr@LandonExplr

@tobyordoxford @AISecurityInst Sublinear scaling. 10x+ tokens for 10x time horizons means inference compute becomes the binding constraint for production agentic systems.

2h