We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.
METR evaluates Claude Mythos Preview at 16-hour time horizon
AI Judge changed title after evaluation, original title: "METR evaluates Anthropic Claude Mythos Preview at 16-hour risk horizon"
METR evaluated an early version of Anthropic's Claude Mythos Preview in March 2026, estimating a 50% time horizon of at least 16 hours. The result appears on METR's 2024-2026 progress graph with a 105-day doubling time, surpassing GPT-5.2, o3 and Opus 4.6. Anthropic said the snapshot's time horizon was more than 2x that of the next-best model on METR's 80% success rate benchmark.
Supportive commenters are impressed by Claude Mythos Preview's rapid progress to a 16-hour time horizon on METR's benchmark, while skeptics dismiss the results as overhyped or unreliable.
An early Claude Mythos Preview snapshot we provided METR has a time horizon of more than 2x the next best model on their 80% success rate benchmark
wow Mythos finally broke the METR graph
we’ve hit the “our best charts just say it’s um, above this” part of the singularity
Hot take on METR’s new graph that so many people are flipping about today.
• Claude Code is a real advance; Mythos probably builds on some of what is learned there. But…
• If you read the graph carefully, it is about achieving *50%* success. Not 100 or 99 or even 90. The key problem with GenAI has been reliability; this graph does not address reliable performance. At all.
• If you read carefully, it is only about software tasks. Not general intelligence.
• It certainly doesn’t tell you that *most* (let alone all) things that humans can do in 16 hours can be done by Mythos, let alone reliably.
• Aside from this, the graph doesn’t show you *how* the improvements have been made. As noted in my newsletter, a lot of the advance in recent months is likely from the incorporation of symbolic tools (like code interpreters, verification, and harnesses) rather than from model scaling per se. As such, this is a vindication of neurosymbolic AI, but not a proof that LLMs themselves can be perpetually scaled. Nor is it a proof that another trillion dollars will continue the graph.
• Per @ramez, Mythos is not actually off trend on the ECI benchmark, which is a broader measure.
Huh.
PLEASE DO NOT PANIC about the Mythos/METR graph that everyone is panicking about.
Progress is being made but people are totally overreacting.
Here’s some context that is being left out from nearly every comment on that graph.
Deep learning is hitting a wall (the wall being our ability to measure AI capabilities)
An early snapshot.
This value, admittedly with high uncertainty, is actually on trend... very similar to what we predict with just a straight line on a graph.
Though this should not be reassuring as "on trend" is still very fast and culminates in AI systems doing weeks-long tasks this year.
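The "straight line on a graph" reasoning can be made concrete with a rough sketch. It assumes pure exponential growth from the 16-hour lower bound at the 105-day doubling time quoted for METR's graph. The function names and the 40-hour "work week" target are illustrative assumptions, not METR's method.

```python
import math

DOUBLING_DAYS = 105.0  # doubling time quoted for METR's progress graph
START_HOURS = 16.0     # METR's lower-bound 50% time horizon for Mythos Preview


def horizon_after(days, start_hours=START_HOURS, doubling_days=DOUBLING_DAYS):
    """Projected time horizon (hours) after `days`, assuming steady doubling."""
    return start_hours * 2 ** (days / doubling_days)


def days_until(target_hours, start_hours=START_HOURS, doubling_days=DOUBLING_DAYS):
    """Days needed to reach `target_hours` under the same assumption."""
    return doubling_days * math.log2(target_hours / start_hours)


if __name__ == "__main__":
    # Assumption: one "work week" of task time is taken to be 40 hours.
    print(f"Horizon after one doubling period: {horizon_after(105):.0f} hours")
    print(f"Days until a 40-hour horizon: {days_until(40):.0f}")
```

Because the starting value is a lower bound with a wide confidence interval (8.5 to 55 hours), any date read off such a projection is correspondingly loose.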
epistemic status of the single most important event in history
Okay this figure is low-key hysterical
Sorry, @peterwildeford, but this is wrong. Please don’t play along.
The measurement “wall” you mention is hit ONLY if you don’t insist on reliability.
If you demanded 95% accuracy on the task, the systems wouldn’t be close to the measurement wall.
The measurement problem you allude to is an artifact of artificially lowered expectations.

The factual points regarding the METR benchmarks are correct, but the models absolutely *are* improving on tasks that are not coding.
OpenAI models specifically have steadily improved their performance on my private benchmark of difficult real-world legal research tasks. GPT-5.5 results are notably better than GPT-5.4 results, which were notably better than GPT-5.2 results. Considering that GPT-5.2 was released in December 2025, this means that there is a notable uptick in performance *every few months*. And the 74/99 score achieved by GPT-5.5 (heavy) is *very* impressive - I doubt that many junior associates in my field could do as well.
My benchmark also does not rely on harnesses, code verification (there is no code) or any other external tools.
It's worth thinking about.
the artist’s pick vs. the radio hit
We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

Of the 228 tasks in our suite, only 5 are estimated as 16+ hours long, making measurements at this range unstable and less meaningful than at ranges with better task coverage. Thus, we are not highlighting exact estimates for models above 16 hours measured with our current suite.
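A back-of-the-envelope sketch of why five tasks give an unstable reading: it treats each task as an independent pass/fail trial with a true success rate of 50%, which is a simplifying assumption made here for illustration and not METR's actual estimation procedure.

```python
import math

TOTAL_TASKS = 228  # tasks in METR's suite, per the post above
LONG_TASKS = 5     # tasks estimated at 16+ hours, per the post above


def binomial_se(p, n):
    """Standard error of an observed success rate over n independent tasks."""
    return math.sqrt(p * (1.0 - p) / n)


# Share of the suite that probes the 16+ hour range.
long_share = LONG_TASKS / TOTAL_TASKS

# Assumption: a true success rate of 0.5 in both groups, for comparison only.
se_long = binomial_se(0.5, LONG_TASKS)
se_rest = binomial_se(0.5, TOTAL_TASKS - LONG_TASKS)

if __name__ == "__main__":
    print(f"Share of tasks at 16+ hours: {long_share:.1%}")
    print(f"Std. error with {LONG_TASKS} tasks: {se_long:.3f}")
    print(f"Std. error with {TOTAL_TASKS - LONG_TASKS} tasks: {se_rest:.3f}")
```

Under these assumptions, the uncertainty on a success rate measured over 5 tasks is several times larger than over the remaining 223, which is consistent with METR declining to highlight exact estimates above 16 hours.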
gg mythos metr v1.1, it's been real
Nice way of visualizing the eval breaking down

@METR_Evals extremely valid and prudent to not give point estimates. nevertheless, because it's fun, here's the time horizon point estimates based on the ECI scores