METR evaluates Claude Mythos Preview at 16-hour time horizon

An early Claude Mythos Preview snapshot we provided METR has a time horizon of more than 2x the next best model on their 80% success rate benchmark

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

Peter Wildeford🇺🇸🚀@peterwildeford

wow Mythos finally broke the METR graph

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

Nick@nickcammarata

we’ve hit the “our best charts just say it’s um, above this” part of the singularity

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

Gary Marcus@GaryMarcus

Hot take on METR’s new graph that so many people are flipping about today.

• Claude Code is a real advance; Mythos probably builds on some of what is learned there. But…

• If you read the graph carefully, it is about achieving *50%* success. Not 100 or 99 or even 90. The key problem with GenAI has been reliability; this graph does not address reliable performance. At all.

• If you read carefully, it is only about software tasks. Not general intelligence.

• It certainly doesn’t tell you that *most* (let alone all) things that humans can do in 16 hours can be done by Mythos, let alone reliably.

• Aside from this, the graph doesn’t show you *how* the improvements have been made. As noted in my newsletter, a lot of the advance in recent months is likely from the incorporation of symbolic tools (like code interpreters, verification, and harnesses) rather than from model scaling per se. As such this is a vindication of neurosymbolic AI – but not a proof that LLMs themselves can be perpetually scaled. As such it’s not a proof that another trillion dollars will continue the graph.

• Per @ramez, Mythos is not actually off trend on the ECI benchmark, which is a broader measure.

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

Ethan Mollick@emollick

Huh.

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

Gary Marcus@GaryMarcus

PLEASE DO NOT PANIC about the Mythos/METR graph that everyone is panicking about.

Progress is being made but people are totally overreacting.

Here’s some context that is being left out from nearly every comment on that graph.

Gary Marcus@GaryMarcus

Hot take on METR’s new graph that so many people are flipping about today.

• Claude Code is a real advance; Mythos probably builds on some of what is learned there. But…

• If you read the graph carefully, it is about achieving *50%* success. Not 100 or 99 or even 90. The key problem with GenAI has been reliability; this graph does not address reliable performance. At all.

• If you read carefully, it is only about software tasks. Not general intelligence.

• It certainly doesn’t tell you that *most* (let alone all) things that humans can do in 16 hours can be done by Mythos, let alone reliably.

• Aside from this, the graph doesn’t show you *how* the improvements have been made. As noted in my newsletter, a lot of the advance in recent months is likely from the incorporation of symbolic tools (like code interpreters, verification, and harnesses) rather than from model scaling per se. As such this is a vindication of neurosymbolic AI – but not a proof that LLMs themselves can be perpetually scaled. As such it’s not a proof that another trillion dollars will continue the graph.

• Per @ramez, Mythos is not actually off trend on the ECI benchmark, which is a broader measure.

Peter Wildeford🇺🇸🚀@peterwildeford

Deep learning is hitting a wall (the wall being our ability to measure AI capabilities)

Peter Wildeford🇺🇸🚀@peterwildeford

wow Mythos finally broke the METR graph

Andrew Curran@AndrewCurran_

An early snapshot.

Alex Albert@alexalbert__

An early Claude Mythos Preview snapshot we provided METR has a time horizon of more than 2x the next best model on their 80% success rate benchmark

Peter Wildeford🇺🇸🚀@peterwildeford

This value, admittedly with high uncertainty, is actually on trend... very similar to what we predict with just a straight line on a graph.

Though this should not be reassuring as "on trend" is still very fast and culminates in AI systems doing weeks-long tasks this year.

Peter Wildeford🇺🇸🚀@peterwildeford

wow Mythos finally broke the METR graph

Ethan Mollick@emollick

Nick@nickcammarata

epistemic status of the single most important event in history

Nick@nickcammarata

we’ve hit the “our best charts just say it’s um, above this” part of the singularity

Daniel Eth (yes, Eth is my actual last name)@daniel_271828

Okay this figure is low-key hysterical

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

Gary Marcus@GaryMarcus

Sorry, @peterwildeford, but this is wrong. Please don’t play along.

The measurement “wall” you mention is hit ONLY if you don’t insist on reliability.

If you demanded 95% accuracy on the task, the systems wouldn’t be close to the measurement wall.

The measurement problem you allude to is an artifact of artificially lowered expectations.

Peter Wildeford🇺🇸🚀@peterwildeford

Deep learning is hitting a wall (the wall being our ability to measure AI capabilities)

prinz@deredleritt3r

The factual points regarding the METR benchmarks are correct, but the models absolutely *are* improving on tasks that are not coding.

OpenAI models specifically have steadily improved their performance on my private benchmark of difficult real-world legal research tasks. GPT-5.5 results are notably better than GPT-5.4 results, which were notably better than GPT-5.2 results. Considering that GPT-5.2 was released in December 2025, this means that there is a notable uptick in performance *every few months*. And the 74/99 score achieved by GPT-5.5 (heavy) is *very* impressive - I doubt that many junior associates in my field could do as well.

My benchmark also does not rely on harnesses, code verification (there is no code) or any other external tools.

It's worth thinking about.

Charles Foster@CFGeek

the artist’s pick vs. the radio hit

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

METR@METR_Evals

Of the 228 tasks in our suite, only 5 are estimated as 16+ hours long, making measurements at this range unstable and less meaningful than at ranges with better task coverage. Thus, we are not highlighting exact estimates for models above 16 hours measured with our current suite.

Samuel Albanie 🇬🇧@SamuelAlbanie

gg mythos metr v1.1, it's been real

Miles Brundage@Miles_Brundage

Nice way of visualizing the eval breaking down

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

Tenobrus@tenobrus

@METR_Evals extremely valid and prudent to not give point estimates. nevertheless, because it's fun, here's the time horizon point estimates based on the ECI scores
