AI · 29d ago

METR evaluates Claude Mythos Preview at 16-hour time horizon

AI Judge changed title after evaluation, original title: "METR evaluates Anthropic Claude Mythos Preview at 16-hour risk horizon"

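METR evaluated an early version of Anthropic's Claude Mythos Preview in March 2026 and estimated a 50% time horizon of at least 16 hours (95% CI 8.5 to 55 hours), the upper end of what its current task suite can measure. The result appears on METR's 2024-2026 progress graph, whose trend has a 105-day doubling time, and places the model above GPT-5.2, o3 and Opus 4.6. Anthropic's Alex Albert said the snapshot's time horizon is more than twice that of the next-best model at METR's 80% success-rate threshold.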
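To make the doubling-time figure concrete, here is a minimal sketch of the exponential extrapolation it implies. It assumes the 16-hour lower bound and the 105-day doubling time quoted above can be treated as exact and that the trend continues unchanged, neither of which METR claims. The function names are hypothetical and are not part of any METR tooling.

```python
import math

# Figures quoted in the summary above. Treating them as exact is an assumption:
# 16 hours is a reported lower bound, not a point estimate.
CURRENT_HORIZON_HOURS = 16.0
DOUBLING_TIME_DAYS = 105.0


def projected_horizon_hours(days_ahead,
                            current_hours=CURRENT_HORIZON_HOURS,
                            doubling_days=DOUBLING_TIME_DAYS):
    """Horizon after `days_ahead` days if it keeps doubling every `doubling_days`."""
    return current_hours * 2.0 ** (days_ahead / doubling_days)


def days_until_horizon(target_hours,
                       current_hours=CURRENT_HORIZON_HOURS,
                       doubling_days=DOUBLING_TIME_DAYS):
    """Days needed to reach `target_hours` under the same constant-doubling assumption."""
    return doubling_days * math.log2(target_hours / current_hours)


if __name__ == "__main__":
    # 40 hours is roughly one working week; 160 hours is roughly four.
    for target in (32.0, 40.0, 160.0):
        print(f"{target:>6.0f} h horizon in about {days_until_horizon(target):.0f} days")
```

Passing the ends of the reported confidence interval (8.5 and 55 hours) as `current_hours` shows how much the projected dates move with the uncertainty in the starting point.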

Original postAjeya Cotra#570
METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

4:41 PM · May 8, 2026 · 958K Views
Sentiment

Positive users are impressed by Claude Mythos Preview's rapid progress reaching 16-hour horizons on METR benchmarks, while negative users dismiss the results as overhyped or unreliable.

Pos 60.7% · Neg 39.3%
145 comments with sentiment.
Posts from X
Most Activity
Views 268.4K · Bookmarks 264 · Likes 1.3K · Replies 79
Alex Albert@alexalbert__

An early Claude Mythos Preview snapshot we provided METR has a time horizon of more than 2x the next best model on their 80% success rate benchmark

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

29d · Views 268.4K · Likes 1.3K · Bookmarks 264
Retweets 248
METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

29d · Views 958K · Likes 2.1K · Bookmarks 521

wow Mythos finally broke the METR graph

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

29d · Views 141.9K · Likes 982 · Bookmarks 189
Nick@nickcammarata

we’ve hit the “our best charts just say it’s um, above this” part of the singularity

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

29d · Views 54.2K · Likes 902 · Bookmarks 103
Gary Marcus@GaryMarcus

Hot take on METR’s new graph that so many people are flipping about today.

• Claude Code is a real advance; Mythos probably builds on some of what is learned there. But…

• If you read the graph carefully, it is about achieving *50%* success. Not 100 or 99 or even 90. The key problem with GenAI has been reliability; this graph does not address reliable performance. At all.

• If you read carefully, it is only about software tasks. Not general intelligence.

• It certainly doesn’t tell you that *most* (let alone) all things that humans can do in 16 hours can be done in Mythos, let alone reliably

• Aside from this, the graph doesn’t show you *how* the improvements have been made. As noted in my newsletter a lot of the advance in recent months is likely from the incorporation of symbolic tools (like code interpreters, verification, and harnesses) rather than from model scaling per se. As such this a vindication of neurosymbolic AI – but not a proof that LLMs themselves can be perpetually scaled. As such it’s not a proof that another trillion dollars will continue the graph.

•  Per @ramez, Mythos is not actually off trend on the ECI benchmark, which is a broader measure.

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

28d · Views 70.7K · Likes 162 · Bookmarks 57
Ethan Mollick@emollick

Huh.

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

29d · Views 48.6K · Likes 219 · Bookmarks 36
Gary Marcus@GaryMarcus

PLEASE DO NOT PANIC about the Mythos/METR graph that everyone is panicking about.

Progress is being made but people are totally overreacting.

Here’s some context that is being left out from nearly every comment on that graph.

Gary Marcus@GaryMarcus

Hot take on METR’s new graph that so many people are flipping about today.

• Claude Code is a real advance; Mythos probably builds on some of what is learned there. But…

• If you read the graph carefully, it is about achieving *50%* success. Not 100 or 99 or even 90. The key problem with GenAI has been reliability; this graph does not address reliable performance. At all.

• If you read carefully, it is only about software tasks. Not general intelligence.

• It certainly doesn’t tell you that *most* (let alone) all things that humans can do in 16 hours can be done in Mythos, let alone reliably

• Aside from this, the graph doesn’t show you *how* the improvements have been made. As noted in my newsletter a lot of the advance in recent months is likely from the incorporation of symbolic tools (like code interpreters, verification, and harnesses) rather than from model scaling per se. As such this a vindication of neurosymbolic AI – but not a proof that LLMs themselves can be perpetually scaled. As such it’s not a proof that another trillion dollars will continue the graph.

•  Per @ramez, Mythos is not actually off trend on the ECI benchmark, which is a broader measure.

28d · Views 24.1K · Likes 98 · Bookmarks 40
Andrew Curran@AndrewCurran_

An early snapshot.

Alex Albert@alexalbert__

An early Claude Mythos Preview snapshot we provided METR has a time horizon of more than 2x the next best model on their 80% success rate benchmark

29d · Views 12.1K · Likes 178 · Bookmarks 13

This value, admittedly with high uncertainty, is actually on trend... very similar to what we predict with just a straight line on a graph.

Though this should not be reassuring as "on trend" is still very fast and culminates in AI systems doing weeks-long tasks this year.

wow Mythos finally broke the METR graph

29d · Views 5.4K · Likes 105 · Bookmarks 17
Nick@nickcammarata

epistemic status of the single most important event in history

Nick@nickcammarata

we’ve hit the “our best charts just say it’s um, above this” part of the singularity

29d · Views 4.5K · Likes 160 · Bookmarks 8

Okay this figure is low-key hysterical

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

25d · Views 15.3K · Likes 102 · Bookmarks 13
Gary Marcus@GaryMarcus

Sorry, @peterwildeford, but this is wrong. Please don’t play along.

The measurement “wall” you mention is hit ONLY if you don’t insist on reliability.

If you demanded 95% accuracy on the task, the systems wouldn’t be close to the measurement wall.

The measurement problem you allude to is an artifact of artificially lowered expectations.

Deep learning is hitting a wall (the wall being our ability to measure AI capabilities)

28d · Views 9.6K · Likes 62 · Bookmarks 8
prinz@deredleritt3r

The factual points regarding the METR benchmarks are correct, but the models absolutely *are* improving on tasks that are not coding.

OpenAI models specifically have steadily improved their performance on my private benchmark of difficult real-world legal research tasks. GPT-5.5 results are notably better than GPT-5.4 results, which were notably better than GPT-5.2 results. Considering that GPT-5.2 was released in December 2025, this means that there is a notable uptick in performance *every few months*. And the 74/99 score achieved by GPT-5.5 (heavy) is *very* impressive - I doubt that many junior associates in my field could do as well.

My benchmark also does not rely on harnesses, code verification (there is no code) or any other external tools.

It's worth thinking about.

28d · Views 4K · Likes 83 · Bookmarks 10

the artist’s pick vs. the radio hit

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

28d · Views 7.3K · Likes 70 · Bookmarks 9
METR@METR_Evals

Of the 228 tasks in our suite, only 5 are estimated as 16+ hours long, making measurements at this range unstable and less meaningful than at ranges with better task coverage. Thus, we are not highlighting exact estimates for models above 16 hours measured with our current suite.

29d · Views 7.1K · Likes 96 · Bookmarks 5
Miles Brundage@Miles_Brundage

Nice way of visualizing the eval breaking down

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

29d · Views 4.9K · Likes 74 · Bookmarks 5
Tenobrus@tenobrus

@METR_Evals extremely valid and prudent to not give point estimates. nevertheless, because it's fun, here's the time horizon point estimates based on the ECI scores

29d · Views 4K · Likes 56 · Bookmarks 5