Tech · 12h ago

New paper finds AI task-completion horizons without chain-of-thought are doubling every 373 days

Increasing model layer count most efficiently extends these horizons

#355

Original post

Lisan al Gaib@scaling01

this is a banger paper

they estimate no-thinking time-horizons > it's doubling every 373 days!

"doubling the 50% TH requires a 4.2× increase in total parameters, a 2.1× increase in active parameters, a 1.3× increase in the layer count, or a 3.1× increase in pretraining FLOPs"

(to no surprise increasing layer count is most effective at increasing no-thinking time horizons)

"At the slowest doubling within the 95% CI, no-CoT THs still reach almost 10 minutes of latent reasoning by 2030"

Dewi Gould@dswg97

New paper!

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

@METR_Evals showed that models' time horizons have doubled every few months. We ask: what length of tasks can models complete without any CoT?

3:18 PM · Jun 10, 2026 · 22.5K Views

Sentiment

Most commenters praise the paper's no-CoT time-horizon estimates as a useful eval axis and its chart as amazing, while others dismiss the graphs and extrapolations as overconfident about the future.

Pos: 75.0%

Neg: 25.0%

3 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 4K · Bookmarks: 4 · Likes: 33

Lisan al Gaib@scaling01

GPT-5.5 is 2x more expensive than GPT-5.4, meaning roughly that active params should be ~2x higher

now GPT-5.5 has 2.14x longer time horizons than GPT-5.4, which would require a ~2.4× increase in active parameters
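The arithmetic behind this comparison can be sketched as follows. This is a minimal illustration, assuming the quoted multipliers describe a pure power law (each doubling of the 50% time horizon costs the same multiplicative factor); the function names `required_increase` and `implied_th_ratio` are hypothetical and not from the paper.

```python
import math

# Multipliers quoted in the thread for doubling the 50% time horizon (TH).
DOUBLING_FACTORS = {
    "total_params": 4.2,
    "active_params": 2.1,
    "layer_count": 1.3,
    "pretraining_flops": 3.1,
}


def required_increase(th_ratio: float, factor_per_doubling: float) -> float:
    """Resource multiplier needed to scale the TH by th_ratio.

    Assumption: a pure power law, so the cost compounds per doubling.
    """
    doublings = math.log2(th_ratio)
    return factor_per_doubling ** doublings


def implied_th_ratio(resource_ratio: float, factor_per_doubling: float) -> float:
    """Inverse question: TH multiplier implied by a given resource multiplier."""
    doublings = math.log(resource_ratio) / math.log(factor_per_doubling)
    return 2.0 ** doublings


if __name__ == "__main__":
    active = DOUBLING_FACTORS["active_params"]
    # Active-parameter increase implied by a 2.14x longer time horizon.
    print(round(required_increase(2.14, active), 2))
    # TH gain implied by a 2x increase in active parameters.
    print(round(implied_th_ratio(2.0, active), 2))
```

With the rounded 2.1× figure, a 2.14× longer horizon works out to roughly a 2.26× increase in active parameters, somewhat below the ~2.4× in the post. The thread does not say where the difference comes from; unrounded coefficients or a different fit are possible explanations.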

23h · 4K views · 33 likes · 4 bookmarks

Retweets: 22

Dewi Gould@dswg97

New paper!

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

@METR_Evals showed that models' time horizons have doubled every few months. We ask: what length of tasks can models complete without any CoT?

1d34.7K11950

Replies: 1

Dewi Gould@dswg97

Authors (continued): @joneedssleep, @paanarle, @TwmStone, Ram Potham, Ionut Gabriel Stan, @harrymayne5, Simeon Hellsten, Shubhorup Biswas, @arianaazarbal, @wlanderson0, @latentfool, @RyanPGreenblatt, Julian Stastny

1d27571

Dewi Gould@dswg97

Paper: https://arxiv.org/abs/2606.07157
Our code and benchmarks are available upon request.

1d42783

Dewi Gould@dswg97

This is an important thing to measure: if models can do extensive reasoning without any CoT, monitors would struggle to understand models’ motivations and catch dangerous planning. We suggest that AI companies start to track no-CoT THs explicitly.

1d606132

Dewi Gould@dswg97

We find that frontier models like GPT-5.5 answer questions that take humans roughly 3 minutes with 50% reliability, and that the no-CoT time horizon of frontier models has doubled every 373 days.
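To make the doubling claim concrete, here is a minimal sketch of the extrapolation it implies. It assumes the exponential trend simply continues at a constant rate, starting from the roughly 3-minute horizon and 373-day doubling time quoted above; the function names are hypothetical, and the paper's confidence intervals are not modelled.

```python
import math


def horizon_after(days: float, start_minutes: float = 3.0,
                  doubling_days: float = 373.0) -> float:
    """No-CoT time horizon (minutes) after `days`, assuming a constant doubling time."""
    return start_minutes * 2.0 ** (days / doubling_days)


def days_to_reach(target_minutes: float, start_minutes: float = 3.0,
                  doubling_days: float = 373.0) -> float:
    """Days until the horizon reaches `target_minutes` under the same assumption."""
    return doubling_days * math.log2(target_minutes / start_minutes)


if __name__ == "__main__":
    # One doubling period takes the horizon from 3 to 6 minutes.
    print(horizon_after(373))
    # Time needed to go from 3 minutes to 10 minutes at the central rate.
    print(round(days_to_reach(10.0)))
```

Swapping in a slower `doubling_days` value shows how sensitive any forecast is to the fitted rate, which is the point several replies in this thread argue about.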

1d495121

Dewi Gould@dswg97

We also measure a reasoning token horizon by replacing human time with how many tokens it takes o3-mini to solve a specific task. We find that GPT-5.5 solves questions that require about 1,500 tokens for o3-mini, and that this token horizon has doubled every 437 days.

1d345101

Dewi Gould@dswg97

Authors: @dswg97, @F_Rhys_Ward, @AndersCWoodruff, @RaunoArike, Josh Hills, Alex Serrano, @ida_icy, Jason Ross Brown

1d28481

Dewi Gould@dswg97

@redwood_ai @ConstellOrg @MATSprogram @AetherAIS LW post: https://www.lesswrong.com/posts/SieLowPgNgRSPGhFw/estimating-no-cot-task-completion-time-horizons-of-frontier

1d19142

Dewi Gould@dswg97

Our trends are robust to changing the distribution of benchmarks, adding longer-generation and multi-turn agentic questions, restricting to questions with measured human times only, dropping any single model, and breaking down by domain.

1d30171

Dewi Gould@dswg97

Across 43 benchmarks, we fit a logistic curve on success rate against log human time to estimate models' 50% no-CoT time horizons. We use a combination of measured human times and estimated ones.
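The fitting step described in this post can be sketched roughly as below. This is not the authors' code (which is available only on request); it is a minimal illustration that fits P(success) = sigmoid(a + b·log2(t)) by Newton's method and reads off the time t where the curve crosses 50%. The data are synthetic and noise-free, and every name here is hypothetical.

```python
import math


def fit_time_horizon(human_minutes, success_rates, iters=50):
    """Fit a logistic curve of success rate against log2(human time).

    Returns (a, b, t50), where t50 is the 50% time horizon in minutes.
    """
    xs = [math.log2(t) for t in human_minutes]
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, success_rates):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept a
            g1 += (y - p) * x      # gradient w.r.t. slope b
            h00 += w               # entries of the 2x2 (negated) Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:            # stop if the Newton system is degenerate
            break
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    # sigmoid(a + b*x) = 0.5 when x = -a/b, so t50 = 2 ** (-a / b).
    return a, b, 2.0 ** (-a / b)


# Synthetic example: success rates generated from a known 3-minute horizon.
TRUE_T50, TRUE_SLOPE = 3.0, -1.2
times = [0.1, 0.3, 1.0, 3.0, 10.0, 30.0]
rates = [
    1.0 / (1.0 + math.exp(-TRUE_SLOPE * (math.log2(t) - math.log2(TRUE_T50))))
    for t in times
]
a, b, t50 = fit_time_horizon(times, rates)

if __name__ == "__main__":
    print(round(t50, 3), round(b, 3))
```

On real benchmark data the per-task outcomes would be noisy and the human times partly estimated, so the recovered horizon would carry uncertainty that this sketch does not capture.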

1d45561

Dewi Gould@dswg97

Thanks to the following orgs for supporting this work: @redwood_ai @ConstellOrg @MATSprogram @AetherAIS

1d23761

Ankit Maloo@ankit2119

@scaling01 straight lines on a log graph is always a banger, unless you know any better. remember this?

15h1902

That AI Guy@LewisWeldtech

@scaling01

22h203

Tlumko@blur_vibe

@scaling01 You’ve shown us a huge number of graphs and extrapolations, so apparently we already know everything about the future. Now combine them and provide a specific, testable prediction for the model’s performance and capabilities as of June 2027. Otherwise, it’s just bullshit

13h154

Ry Lanham@ry4lanham

@scaling01 Amazing chart!

9h78

Ry Lanham@ry4lanham

Here's the thing: RSI makes releasing models a loss for the big firms. Google knows this. All you do is allow lesser code bases to reach parity through agentic looping/improvement. The game just switched to applied knowledge (e.g. drug discovery). Better models simply enable more RSI.

9h10

haro@harobuilds

@scaling01 the gap between the green and purple trend lines is the whole story. CoT is buying 2x the doubling rate but you're paying for it in latency and tokens every single call

23h1

Devayush Rout@devayushrout

@dswg97 @METR_Evals This is a useful eval axis because it separates visible reasoning quality from latent task capacity. Monitoring gets harder when a model can complete longer tasks without exposing much intermediate state.

19h1