UK AISI Shows Higher Token Budgets Extend AI Agent Time Horizons

Original post

UK AISI: "At the current frontier, raising the budget from 2.5M to 50M tokens increases the estimated horizon from (roughly) 2 hours to 14 hours"

banger after banger today

go read the quoted post!

AI Security Institute@AISecurityInst

Most AI agent evaluations boil capability down to one score. But that number hides a key choice: how much compute the agent was allowed to use. New work from our Science of Evaluation team shows why that matters. 🧵

2:48 PM · Jul 2, 2026 · 9.2K Views
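
A quick back-of-the-envelope check on the quoted figures: the budget grows 20x while the estimated horizon grows about 7x. The sketch below works out the exponent those two points would imply if horizon scaled as a power of the token budget. That power-law form is an assumption made here for illustration, not something the quoted posts state, and the 2 h and 14 h figures are themselves rounded ("roughly"). Under that assumption the exponent comes out near 0.65.

```python
import math

# Figures quoted in the post; the horizons are approximate ("roughly").
budget_low, budget_high = 2.5e6, 50e6     # tokens
horizon_low, horizon_high = 2.0, 14.0     # hours

budget_ratio = budget_high / budget_low     # how many times more tokens
horizon_ratio = horizon_high / horizon_low  # how many times longer the horizon

# ASSUMPTION (not from the source): horizon ~ budget ** k between these
# two points, so k is the slope on a log-log plot.
k = math.log(horizon_ratio) / math.log(budget_ratio)

# Under the same assumption: horizon multiplier for each doubling of budget.
per_doubling = 2 ** k

print(f"budget x{budget_ratio:.0f} -> horizon x{horizon_ratio:.0f}")
print(f"implied exponent k = {k:.2f}")
print(f"horizon multiplier per budget doubling = {per_doubling:.2f}")
```

With only two rounded data points this is a rough reading of the headline numbers, not a fitted scaling law.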

Posts from X

Lisan al Gaib@scaling01

would you please consult the previous posts

Lisan al Gaib@scaling01

this is also one of the core reasons why I think chinese models are further behind than what coding benchmarks suggest

and from the UK AISI blog today and EdgeBench we know that:
- spending 1M vs 100M matters
- having 1M vs 100K context matters

Opus 4.8 and GPT-5.5 have the same MRCR score at 100k+ context as GLM-5.2 at 16k

Why?
- GLM doesn't use reasoning effectively
- GPT-5.4 and GLM-5.2 have ~the same scores without reasoning, but GLM gets crushed once you turn on reasoning

(this can be fixed by doing more RL)

Lisan al Gaib@scaling01

OpenAI and Anthropic are much further ahead than what benchmarks show.

While you are token constrained they are blasting millions of tokens at 4x the API speed without batting an eye and they scaffold like they are trying to build a skyscraper.

Lisan al Gaib@scaling01

*GLM-5.1

Lisan al Gaib@scaling01

that is to say: there are RL-training-compute and test-time-compute scaling laws for long context

Lisan al Gaib@scaling01

not that we didn't know that before: "and from the UK AISI blog today and EdgeBench we know that: - spending 1M vs 100M matters - having 1M vs 100K context matters"

lol

it's just more and recent evidence

Haha@Haha58208745

@scaling01 so are you saying that this is happening due to the compute shortage, since they can't scale RL as much as american labs with more compute? and therefore this causes the gap in reasoning skills?

nest elf@nest_elf

@scaling01 "just do more rl" is the new "just scale it"

Pluto@plut0sx

@scaling01 labs run internal scaffolds at 4x speed and 20x budget, benches stopped tracking real capability months ago

Alexey Fateev@superalesha

@scaling01 the benchmark gap and the real work gap are not the same thing. on long agent loops the open chinese models hold up better than their leaderboard spot suggests. coding evals are a narrow slice, measure the task you actually run
