Tech · 12h ago

New paper finds AI task-completion horizons without chain-of-thought are doubling every 373 days

Increasing model layer count most efficiently extends these horizons

Original post
Lisan al Gaib@scaling01

this is a banger paper

they estimate no-thinking time-horizons > it's doubling every 373 days!

"doubling the 50% TH requires a 4.2× increase in total parameters, a 2.1× increase in active parameters, a 1.3× increase in the layer count, or a 3.1× increase in pretraining FLOPs"

(to no surprise increasing layer count is most effective at increasing no-thinking time horizons)
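One way to read the quoted multipliers is as power-law exponents. If the time horizon scales as TH ∝ X^α, then doubling TH requires growing X by k = 2^(1/α), so α = 1/log2(k). The sketch below is a minimal illustration under that assumption (one variable at a time, all else fixed); the paper's actual regression may be specified differently, and the function and dictionary names are hypothetical.

```python
import math

# Multipliers quoted in the post: growth needed in each quantity
# to double the 50% no-CoT time horizon (TH).
DOUBLING_MULTIPLIERS = {
    "total_params": 4.2,
    "active_params": 2.1,
    "layer_count": 1.3,
    "pretraining_flops": 3.1,
}

def implied_exponent(doubling_multiplier: float) -> float:
    """Exponent alpha in TH ∝ X**alpha (assumed power law, not from the paper)."""
    return math.log(2) / math.log(doubling_multiplier)

def horizon_gain(applied_multiplier: float, doubling_multiplier: float) -> float:
    """TH ratio obtained by scaling X by applied_multiplier under the same assumption."""
    return applied_multiplier ** implied_exponent(doubling_multiplier)

exponents = {name: implied_exponent(k) for name, k in DOUBLING_MULTIPLIERS.items()}

# Largest exponent first: the quantity whose growth moves TH the most per unit of scaling.
for name, alpha in sorted(exponents.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} alpha = {alpha:.2f}")
```

Under this reading, layer count has the largest exponent, which matches the post's remark. The variables are correlated in practice (adding layers also adds parameters), so the exponents should not be combined as if they were independent levers.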

"At the slowest doubling within the 95% CI, no-CoT THs still reach almost 10 minutes of latent reasoning by 2030"

Dewi Gould@dswg97

New paper!

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

@METR_Evals showed that models' time horizons have doubled every few months. We ask: what length of tasks can models complete without any CoT?

3:18 PM · Jun 10, 2026 · 22.5K Views
Sentiment

Many users praise papers estimating AI No-CoT task horizons as offering useful eval axes and amazing charts, while others dismiss the graphs and extrapolations as overconfident about the future.

Pos 75.0% · Neg 25.0%
3 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Lisan al Gaib@scaling01

GPT-5.5 is 2x more expensive than GPT-5.4, meaning roughly that active params should be ~2x higher

now GPT-5.5 has 2.14x longer time horizons than GPT-5.4, which would require a ~2.4× increase in active parameters

23h · Views 4K · Likes 33 · Bookmarks 4
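The arithmetic in this post can be checked with a short sketch. It assumes a pure power law and uses only the rounded 2.1× active-parameter figure quoted from the paper; the function name is hypothetical, and this may not be how the ~2.4× in the post was derived.

```python
import math

def required_multiplier(th_ratio: float, doubling_multiplier: float) -> float:
    """Growth needed in a quantity to multiply TH by th_ratio,
    assuming each TH doubling costs doubling_multiplier (power-law assumption)."""
    doublings = math.log2(th_ratio)
    return doubling_multiplier ** doublings

# 2.14x longer time horizon (from the post), 2.1x active params per doubling (quoted figure).
needed = required_multiplier(2.14, 2.1)
print(f"required active-parameter increase: {needed:.2f}x")
```

With the rounded inputs this gives roughly 2.26×, somewhat below the ~2.4× in the post. The gap may come from unrounded coefficients or a different fit, which this sketch does not have access to.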
RETWEETS 22
Dewi Gould@dswg97

New paper!

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

@METR_Evals showed that models' time horizons have doubled every few months. We ask: what length of tasks can models complete without any CoT?

1d · Views 34.7K · Likes 119 · Bookmarks 50
REPLIES 1
Dewi Gould@dswg97

Authors (continued): @joneedssleep @paanarle @TwmStone Ram Potham Ionut Gabriel Stan @harrymayne5 Simeon Hellsten, Shubhorup Biswas @arianaazarbal @wlanderson0 @latentfool @RyanPGreenblatt Julian Stastny

1d · Views 275 · Likes 7 · Bookmarks 1
Dewi Gould@dswg97

Paper: https://arxiv.org/abs/2606.07157
Our code and benchmarks are available upon request.

1d · Views 427 · Likes 8 · Bookmarks 3
Dewi Gould@dswg97

This is an important thing to measure: if models can do extensive reasoning without any CoT, monitors would struggle to understand models’ motivations and catch dangerous planning. We suggest that AI companies start to track no-CoT THs explicitly.

1d · Views 606 · Likes 13 · Bookmarks 2
Dewi Gould@dswg97

We find that frontier models like GPT-5.5 answer questions that take humans roughly 3 minutes with 50% reliability, and that the no-CoT time horizon of frontier models has doubled every 373 days.

1d · Views 495 · Likes 12 · Bookmarks 1
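A constant doubling time implies exponential growth, TH(t) = TH₀ · 2^(Δt / T_double). The sketch below extrapolates from the figures in the post above (about 3 minutes, doubling every 373 days). The baseline date (10 June 2026, taken from the post timestamp on this page) and the function name are assumptions, and the sketch ignores the confidence interval around the doubling time.

```python
from datetime import date, timedelta

def extrapolate_horizon(th_now_min: float, doubling_days: float,
                        start: date, target: date) -> float:
    """Project the 50% time horizon (minutes) assuming a constant doubling time."""
    elapsed_days = (target - start).days
    return th_now_min * 2 ** (elapsed_days / doubling_days)

START = date(2026, 6, 10)   # assumption: baseline date for the ~3 minute estimate
TH_NOW_MIN = 3.0            # from the post: roughly 3 minutes at 50% reliability
DOUBLING_DAYS = 373         # from the post

one_year_out = extrapolate_horizon(TH_NOW_MIN, DOUBLING_DAYS, START, date(2027, 6, 10))
one_doubling = extrapolate_horizon(TH_NOW_MIN, DOUBLING_DAYS, START,
                                   START + timedelta(days=373))
print(f"one year out: {one_year_out:.1f} min, after 373 days: {one_doubling:.1f} min")
```

On these assumptions the central trend gives a bit under 6 minutes one year after the baseline. Passing a longer `doubling_days` shows how sensitive long-range projections are to the fitted slope.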
Dewi Gould@dswg97

We also measure a reasoning token horizon by replacing human time with how many tokens it takes o3-mini to solve a specific task. We find that GPT-5.5 solves questions that require about 1,500 tokens for o3-mini, and that this token horizon has doubled every 437 days.

1d · Views 345 · Likes 10 · Bookmarks 1
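Doubling times are easier to compare when converted to an annual growth factor, 2^(365.25 / T_double). A minimal sketch using the two doubling times reported in the thread (373 days for the human-time horizon, 437 days for the token horizon); the function name and the 365.25-day year are assumptions.

```python
def annual_growth(doubling_days: float, days_per_year: float = 365.25) -> float:
    """Multiplicative growth per year implied by a constant doubling time."""
    return 2 ** (days_per_year / doubling_days)

time_horizon_growth = annual_growth(373)    # human-time horizon, from the thread
token_horizon_growth = annual_growth(437)   # o3-mini token horizon, from the thread

print(f"time horizon:  x{time_horizon_growth:.2f} per year")
print(f"token horizon: x{token_horizon_growth:.2f} per year")
```

This works out to roughly 1.97× per year for the human-time horizon and roughly 1.78× per year for the token horizon.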
Dewi Gould@dswg97

Authors: @dswg97 @F_Rhys_Ward @AndersCWoodruff @RaunoArike Josh Hills, Alex Serrano, @ida_icy, Jason Ross Brown

1d · Views 284 · Likes 8 · Bookmarks 1
Dewi Gould@dswg97

@redwood_ai @ConstellOrg @MATSprogram @AetherAIS LW post: https://www.lesswrong.com/posts/SieLowPgNgRSPGhFw/estimating-no-cot-task-completion-time-horizons-of-frontier

1d · Views 191 · Likes 4 · Bookmarks 2
Dewi Gould@dswg97

Our trends are robust to changing the distribution of benchmarks, adding longer-generation and multi-turn agentic questions, restricting to questions with measured human times only, dropping any single model, and breaking down by domain.

1d · Views 301 · Likes 7 · Bookmarks 1
Dewi Gould@dswg97

Across 43 benchmarks, we fit a logistic curve on success rate against log human time to estimate models' 50% no-CoT time horizons. We use a combination of measured human times and estimated ones.

1d · Views 455 · Likes 6 · Bookmarks 1
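The fitting step described in the post above can be sketched as a logistic regression of success rate on log2(human time), with the 50% horizon read off where the fitted curve crosses 0.5. This is a minimal illustration, not the authors' code: the data are synthetic (generated from a known 3-minute horizon), the optimizer is a plain Newton iteration, and all names are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(log2_times, success_rates, iterations=25):
    """Fit p = sigmoid(a + b * log2(t)) by Newton's method on the log-likelihood."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(log2_times, success_rates):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p            # gradient w.r.t. intercept
            g_b += (y - p) * x      # gradient w.r.t. slope
            h_aa += w               # entries of the (negated) Hessian
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def horizon_50(a: float, b: float) -> float:
    """Human time (minutes) at which the fitted success rate equals 50%."""
    return 2.0 ** (-a / b)

# Synthetic data (assumption): success rates generated from a 3-minute horizon, slope -1.
minutes = [0.25, 0.5, 1, 2, 4, 8, 16, 32]
true_a, true_b = math.log2(3.0), -1.0
xs = [math.log2(t) for t in minutes]
rates = [sigmoid(true_a + true_b * x) for x in xs]

a_hat, b_hat = fit_logistic(xs, rates)
print(f"estimated 50% horizon: {horizon_50(a_hat, b_hat):.2f} min")
```

Because the synthetic rates are generated from the model itself, the fit recovers the 3-minute horizon by construction. Real per-question outcomes are noisy 0/1 results, so an actual estimate would come with uncertainty that this sketch does not compute.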
Dewi Gould@dswg97

Thanks to the following orgs for supporting this work: @redwood_ai @ConstellOrg @MATSprogram @AetherAIS

1d · Views 237 · Likes 6 · Bookmarks 1
Ankit Maloo@ankit2119

@scaling01 straight lines on a log graph is always a banger, unless you know any better. remember this?

15h · Views 190 · Likes 2
Tlumko@blur_vibe

@scaling01 You’ve shown us a huge number of graphs and extrapolations, so apparently we already know everything about the future. Now combine them and provide a specific, testable prediction for the model’s performance and capabilities as of June 2027. Otherwise, it’s just bullshit

13h · Views 154
Ry Lanham@ry4lanham

@scaling01 Amazing chart!

9h · Views 78
Ry Lanham@ry4lanham

Here's the thing: RSI makes releasing models a loss for the big firms. Google knows this. All you do is allow lesser code bases to reach parity through agentic looping/improvement. The game just switched to applied knowledge (e.g. drug discovery). Better models simply enable more RSI.

9h · Views 10
haro@harobuilds

@scaling01 the gap between the green and purple trend lines is the whole story. CoT is buying 2x the doubling rate but you're paying for it in latency and tokens every single call

23h · Likes 1
Devayush Rout@devayushrout

@dswg97 @METR_Evals This is a useful eval axis because it separates visible reasoning quality from latent task capacity. Monitoring gets harder when a model can complete longer tasks without exposing much intermediate state.

19h · Views 1