
Frontier AI Models Complete Three-Minute Tasks Without CoT at 50% Reliability

Original post
Buck Shlegeris @bshlgrs

"We find that frontier models like GPT-5.5 answer questions that take humans roughly three minutes with 50% reliability, and this time horizon has doubled approximately every year since 2019."

Dewi Gould @dswg97

New paper!

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

@METR_Evals showed that models' time horizons have doubled every few months. We ask: what length of tasks can models complete without any CoT?

11:43 AM · Jun 10, 2026 · 2.5K Views
Posts from X
Dewi Gould @dswg97

This is an important thing to measure: if models can do extensive reasoning without any CoT, monitors would struggle to understand models’ motivations and catch dangerous planning. We suggest that AI companies start to track no-CoT THs explicitly.

Dewi Gould @dswg97

Authors (continued): @joneedssleep @paanarle @TwmStone Ram Potham Ionut Gabriel Stan @harrymayne5 Simeon Hellsten, Shubhorup Biswas @arianaazarbal @wlanderson0 @latentfool @RyanPGreenblatt Julian Stastny

Dewi Gould @dswg97

We also measure a reasoning token horizon by replacing human time with how many tokens it takes o3-mini to solve a specific task. We find that GPT-5.5 solves questions that require about 1,500 tokens for o3-mini, and that this token horizon has doubled every 437 days.

Dewi Gould @dswg97

Paper: https://arxiv.org/abs/2606.07157
Our code and benchmarks are available upon request.

Dewi Gould @dswg97

Authors: @dswg97 @F_Rhys_Ward @AndersCWoodruff @RaunoArike Josh Hills, Alex Serrano, @ida_icy, Jason Ross Brown

Dewi Gould @dswg97

We find that frontier models like GPT-5.5 answer questions that take humans roughly 3 minutes with 50% reliability, and that the no-CoT time horizon of frontier models has doubled every 373 days.
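To make the doubling arithmetic concrete, here is a minimal sketch that treats the trend as a pure exponential. The 3-minute horizon and the 373-day doubling time come from the post; the function names and the idea of extrapolating forward are illustrative assumptions, not something the authors claim.

```python
import math

DOUBLING_DAYS = 373.0   # doubling time quoted in the post
HORIZON_NOW_MIN = 3.0   # approximate current 50% no-CoT horizon, in minutes

def projected_horizon(days_ahead, horizon_now=HORIZON_NOW_MIN,
                      doubling_days=DOUBLING_DAYS):
    """Horizon after `days_ahead` days, assuming the exponential trend holds."""
    return horizon_now * 2.0 ** (days_ahead / doubling_days)

def days_until(target_horizon, horizon_now=HORIZON_NOW_MIN,
               doubling_days=DOUBLING_DAYS):
    """Days until the horizon reaches `target_horizon` under the same assumption."""
    return doubling_days * math.log2(target_horizon / horizon_now)

one_doubling = projected_horizon(373)   # one doubling period later
days_to_hour = days_until(60.0)         # days until a one-hour horizon
```

Under this assumption, one doubling period takes the horizon from 3 to 6 minutes, and reaching one hour takes log2(20), or about 4.3, doublings. Whether the trend continues is an open question that the sketch does not address.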

Dewi Gould @dswg97

Thanks to the following orgs for supporting this work: @redwood_ai @ConstellOrg @MATSprogram @AetherAIS

Dewi Gould @dswg97

Across 43 benchmarks, we fit a logistic curve on success rate against log human time to estimate models' 50% no-CoT time horizons. We use a combination of measured human times and estimated ones.
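The fitting step can be sketched as follows. This is a rough illustration, not the paper's code: the toy success rates are made up, the log base (2), the plain gradient-ascent fit, and all function names are assumptions. The 50% horizon is the human time at which the fitted logistic crosses 0.5.

```python
import math

def fit_logistic(log_times, success_rates, lr=0.1, steps=3000):
    """Fit p = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(log_times)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(log_times, success_rates):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p          # gradient with respect to the intercept
            grad_b += (y - p) * x    # gradient with respect to the slope
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def horizon_50(a, b):
    """Human time where predicted success is 50%: a + b * log2(t) = 0."""
    return 2.0 ** (-a / b)

# Hypothetical data: human time per question bucket (minutes) and a model's
# no-CoT success rate on each bucket. These numbers are invented for the demo.
human_minutes = [0.25, 0.5, 1, 2, 4, 8, 16, 32]
success = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05]

a, b = fit_logistic([math.log2(t) for t in human_minutes], success)
horizon = horizon_50(a, b)
```

On this toy data the slope `b` comes out negative (success falls as tasks get longer) and `horizon` lands at about 2.8 minutes. A real analysis would fit per-question outcomes across all benchmarks and would likely use a library optimizer rather than hand-rolled gradient ascent.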

Dewi Gould @dswg97

Our trends are robust to changing the distribution of benchmarks, adding longer-generation and multi-turn agentic questions, restricting to questions with measured human times only, dropping any single model, and breaking down by domain.

Dewi Gould @dswg97

@redwood_ai @ConstellOrg @MATSprogram @AetherAIS LW post: https://www.lesswrong.com/posts/SieLowPgNgRSPGhFw/estimating-no-cot-task-completion-time-horizons-of-frontier
