Tech · 3h ago

Frontier AI Models Complete Three-Minute Tasks Without CoT at 50% Reliability

Original post

Buck Shlegeris @bshlgrs

"We find that frontier models like GPT-5.5 answer questions that take humans roughly three minutes with 50% reliability, and this time horizon has doubled approximately every year since 2019."

Dewi Gould@dswg97

New paper!

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

@METR_Evals showed that models' time horizons have doubled every few months. We ask: what length of tasks can models complete without any CoT?

11:43 AM · Jun 10, 2026 · 2.5K Views

Posts from X

Dewi Gould@dswg97

This is an important thing to measure: if models can do extensive reasoning without any CoT, monitors would struggle to understand models’ motivations and catch dangerous planning. We suggest that AI companies start to track no-CoT time horizons (THs) explicitly.

Dewi Gould@dswg97

Authors (continued): @joneedssleep, @paanarle, @TwmStone, Ram Potham, Ionut Gabriel Stan, @harrymayne5, Simeon Hellsten, Shubhorup Biswas, @arianaazarbal, @wlanderson0, @latentfool, @RyanPGreenblatt, Julian Stastny

Dewi Gould@dswg97

We also measure a reasoning token horizon by replacing human time with how many tokens it takes o3-mini to solve a specific task. We find that GPT-5.5 solves questions that require about 1,500 tokens for o3-mini, and that this token horizon has doubled every 437 days.

Dewi Gould@dswg97

Paper: https://arxiv.org/abs/2606.07157

Our code and benchmarks are available upon request.

Dewi Gould@dswg97

Authors: @dswg97, @F_Rhys_Ward, @AndersCWoodruff, @RaunoArike, Josh Hills, Alex Serrano, @ida_icy, Jason Ross Brown

Dewi Gould@dswg97

We find that frontier models like GPT-5.5 answer questions that take humans roughly 3 minutes with 50% reliability, and that the no-CoT time horizon of frontier models has doubled every 373 days.
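To make the doubling-time figure concrete, here is a minimal sketch of the arithmetic behind a 373-day doubling time. It assumes a clean exponential trend, which is a simplification for illustration and not a forecast made in the thread. The function names and the 60-minute target are hypothetical.

```python
import math

def project_horizon(h0_minutes, days_elapsed, doubling_days):
    """Horizon after days_elapsed, assuming it doubles every doubling_days."""
    return h0_minutes * 2.0 ** (days_elapsed / doubling_days)

def days_to_reach(h0_minutes, target_minutes, doubling_days):
    """Days needed to grow from h0_minutes to target_minutes on the same trend."""
    return doubling_days * math.log2(target_minutes / h0_minutes)

# Growth factor over one calendar year for a 373-day doubling time.
annual_factor = project_horizon(1.0, 365, 373)
print(f"growth per year: x{annual_factor:.2f}")

# Illustrative only: how long a 3-minute horizon would take to reach
# 60 minutes if the trend continued unchanged (an assumption, not a claim).
print(f"days from 3 min to 60 min: {days_to_reach(3.0, 60.0, 373):.0f}")
```

A 373-day doubling time is slightly slower than doubling once per calendar year, which matches the "approximately every year" wording of the summary.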

Dewi Gould@dswg97

Thanks to the following orgs for supporting this work: @redwood_ai @ConstellOrg @MATSprogram @AetherAIS

Dewi Gould@dswg97

Across 43 benchmarks, we fit a logistic curve on success rate against log human time to estimate models' 50% no-CoT time horizons. We use a combination of measured human times and estimated ones.
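The fitting step described here can be sketched roughly as follows. This is not the authors' code (which is available only on request). It is a minimal pure-Python illustration that fits p = sigmoid(a + b * log2(minutes)) by Newton's method and reads off the time at which p = 0.5. The function names and the success counts are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, successes, trials, iters=50):
    """Fit p = sigmoid(a + b*x) to binomial counts with Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, k, n in zip(xs, successes, trials):
            p = sigmoid(a + b * x)
            w = n * p * (1.0 - p)      # Fisher information weight
            g_a += k - n * p           # gradient w.r.t. a
            g_b += (k - n * p) * x     # gradient w.r.t. b
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def horizon_minutes(a, b):
    """Human time at which predicted success is 50%: a + b*log2(t) = 0."""
    return 2.0 ** (-a / b)

# Hypothetical data: human time per task bucket (minutes), 100 attempts each.
minutes = [0.25, 0.5, 1, 2, 4, 8, 16, 32]
trials = [100] * len(minutes)
successes = [97, 93, 83, 64, 40, 20, 8, 3]   # made-up success counts

a, b = fit_logistic([math.log2(t) for t in minutes], successes, trials)
print(f"50% time horizon: {horizon_minutes(a, b):.2f} minutes")
```

Working in log time means the fitted slope `b` describes how fast success falls per doubling of human time, and the 50% horizon is simply where the fitted curve crosses one half.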

Dewi Gould@dswg97

Our trends are robust to changing the distribution of benchmarks, adding longer-generation and multi-turn agentic questions, restricting to questions with measured human times only, dropping any single model, and breaking down by domain.
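One of the robustness checks listed, dropping any single model, can be illustrated with a leave-one-out refit of the trend line. The sketch below is an assumption about how such a check might look, not the paper's procedure. It regresses log2(horizon) on release day by ordinary least squares and takes the doubling time as the inverse slope. The release days and horizons are hypothetical.

```python
import math

def doubling_time_days(days, horizons_min):
    """Least-squares slope of log2(horizon) vs. day; doubling time = 1/slope."""
    ys = [math.log2(h) for h in horizons_min]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(ys) / n
    sxx = sum((d - mean_x) ** 2 for d in days)
    sxy = sum((d - mean_x) * (y - mean_y) for d, y in zip(days, ys))
    return sxx / sxy

def leave_one_out(days, horizons_min):
    """Refit the doubling time once per model, with that model removed."""
    return [
        doubling_time_days(days[:i] + days[i + 1:],
                           horizons_min[:i] + horizons_min[i + 1:])
        for i in range(len(days))
    ]

# Hypothetical models: release day (since an arbitrary start) and horizon.
release_days = [0, 400, 800, 1200, 1600, 2000]
horizons = [0.10, 0.21, 0.38, 0.85, 1.50, 3.30]   # minutes, made up

print(f"all models: {doubling_time_days(release_days, horizons):.0f} days")
for i, dt in enumerate(leave_one_out(release_days, horizons)):
    print(f"without model {i}: {dt:.0f} days")
```

If the refitted doubling times stay close to the full-data estimate, no single model is driving the trend.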

Dewi Gould@dswg97

LW post: https://www.lesswrong.com/posts/SieLowPgNgRSPGhFw/estimating-no-cot-task-completion-time-horizons-of-frontier
