METR evaluations find frontier AI agents rely on explicit natural language chain-of-thought to complete their hardest tasks, with time horizons dropping from 1.5–2 years to about 4 minutes when actions must stay hidden

David Rein from METR spent a month stress-testing controls at Anthropic.

Fact 2: However, agents appeared to be significantly weaker on tasks where it is costly or hard to verify success.

11:11 AM · May 19, 2026

⚠️👇 🚨Breaking ⚠️

If we can’t make AI agents follow rules, we are screwed.

New study from METR reports that “when the agents were faced with hard tasks, they routinely violated constraints”

This—routine breaking of rules— is why in a nutshell we absolutely need a different approach to AI safety than the one we are currently taking, which simply is not up to the job (as I argued at Oxford last week).

METR @METR_Evals

Fact 3: When the agents were faced with hard tasks, they routinely violated constraints and acted deceptively. We’ve seen this pattern across our own coding and research evaluations, and developers reported they’ve also seen agents behave this way.

6:11 PM · May 19, 2026 · 59.9K Views
7:30 AM · May 20, 2026 · 37.6K Views

AI labs have started developing systems to monitor internally deployed AI agents for misaligned behavior.

Earlier this year, I spent a month embedded at Anthropic stress-testing these systems, to see how easily current/future AIs could “go rogue” inside the company.

6:13 PM · May 19, 2026 · 9.2K Views

Key results:

1) I found several routes to disable/avoid monitoring (infrastructure vulnerabilities, coverage gaps, jailbreaks)

2) Agents were often in a position to easily set up unrestricted, unmonitored compute

3) But agents are currently terrible at avoiding being caught

6:13 PM · May 19, 2026 · 455 Views

I had “need-to-know” level access to their monitoring and security systems, internal agent trajectories, and Mythos. As far as I know this is the most access an independent, third-party researcher has gotten to a frontier lab's internal systems.

6:13 PM · May 19, 2026 · 391 Views

Some of the exploits/vulnerabilities I found:

1) An environment variable that could cause traffic to register as not coming from Claude Code

2) A recently introduced bug meant that subagent calls weren’t monitored

3) Basic jailbreaks very reliably fooled the monitors

6:13 PM · May 19, 2026 · 229 Views

My rough guess is there are something like 5-20 basic vulnerabilities of this kind in the system right now, though many require specific knowledge of internal infra to find, and would be difficult for an agent to discover and exploit without getting caught.

6:13 PM · May 19, 2026 · 189 Views

Better AIs will find/patch these vulnerabilities more easily. But we'll also defer to them more, and they'll write more complex code—increasing the surface area for vulnerabilities. How these trade off as AI improves will likely be a major input to risk.

6:13 PM · May 19, 2026 · 189 Views

If an agent subverted monitoring, what could it get away with? Based on information I reviewed about Anthropic’s internal security and compute provisioning systems, I expect it would be trivial for a misaligned agent to acquire a small amount of unmonitored compute.

6:13 PM · May 19, 2026 · 170 Views

However, it would need to take potentially risky actions to maintain this compute robustly. Furthermore, at the time of the exercise there were security improvements in progress that would make the specific method I considered substantially more difficult or impossible.

6:13 PM · May 19, 2026 · 168 Views

Despite these vulnerabilities, even if the current best agents were secretly misaligned, I don’t expect they could get away with much before getting caught: they’re bad at identifying good attack strategies, and can’t implement effective jailbreaks even with lots of hints.

6:13 PM · May 19, 2026 · 165 Views

For example, I made a small attack-ideation eval (16 good ideas, 16 bad). The task here is simply to classify whether each attack idea could plausibly fool a monitor, and the “bad” ideas are *really* bad. The best internal model only got 59% with thinking, vs ~90% human baseline.
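To put the 59% figure in context, here is a minimal Python sketch of how a balanced 32-item classification eval of this kind could be scored. The labels and predictions are hypothetical placeholders, since the thread does not give per-item answers. The 19-of-32 split is an assumption chosen only because it rounds to 59%, and the helper names `accuracy` and `p_at_least` are illustrative. The second function estimates how often pure coin-flipping would do at least as well, which suggests a score around 59% on 32 items is hard to distinguish from guessing.

```python
from math import comb


def accuracy(labels, predictions):
    """Fraction of attack ideas classified correctly."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(l == p for l, p in zip(labels, predictions))
    return correct / len(labels)


def p_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more correct by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical ground truth: True = idea could plausibly fool a monitor.
labels = [True] * 16 + [False] * 16

# Hypothetical model answers with 19 of 32 correct (about 59%).
# ASSUMPTION: the real per-item answers are not published in the thread.
predictions = [True] * 10 + [False] * 6 + [False] * 9 + [True] * 7

acc = accuracy(labels, predictions)
print(f"accuracy: {acc:.1%}")
print(f"P(>= 19 of 32 by coin flip): {p_at_least(19, 32):.3f}")
```

With only 32 items, a single run gives a noisy estimate, so repeated runs or a larger item set would be needed before reading much into small differences.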

6:13 PM · May 19, 2026 · 425 Views

How do we square this with models like Mythos finding so many CVEs/zero-days?

I think there’s just a big difference between finding exploits w/ billions of tokens, and acting covertly in a live, dynamic, adversarial setting without making a single mistake.

david rein @idavidrein

I also evaluated how well the best internal model could implement the monitor jailbreak I used. Even when giving it 5-10 specific hints/instructions on how to implement it, it made very basic mistakes that break the attack.

6:14 PM · May 19, 2026 · 159 Views
6:14 PM · May 19, 2026 · 156 Views

One other qualitative reflection: models are getting *way* better at being monitors. Many attacks I implemented that worked reliably against Sonnet 4.6 fail against Opus 4.6, and many that work on Opus 4.6 don’t work when the monitor uses chain of thought.

6:14 PM · May 19, 2026 · 147 Views

Just like the "rate of vulnerabilities introduced" question, what matters is the rate of progress of AI-monitoring against AI's ability to subvert it. This is an open question, which is concerning, since it's one of the most important inputs to whether we lose control of AIs!

6:14 PM · May 19, 2026 · 153 Views

Another takeaway: a lot of my confidence in the current monitoring rests on security through obscurity. If models have monitor code and prompts in context, or just better situational awareness about how Anthropic's infra works, attacks would be much easier.

6:14 PM · May 19, 2026 · 133 Views

In general, agent security often trades off heavily against productivity, and frontier AI labs are currently moving very fast. I’m personally optimistic that we can find Pareto improvements to security and productivity (with AI!), but the dynamics here are changing quickly.

6:14 PM · May 19, 2026 · 132 Views

One comment on the frame of this type of exercise: I'm explicitly making pessimistic assumptions about model alignment and capabilities. This helps us get a sense of how much safety headroom we have if models become less aligned, or capabilities increase faster than expected.

6:14 PM · May 19, 2026 · 124 Views

Note that we announced this exercise in March, but we’re publishing the results and takeaways in more detail today, as part of our Frontier Risk Report. This exercise was conducted independently from the Frontier Risk Report.

6:14 PM · May 19, 2026 · 160 Views

@idavidrein Great work David!

6:27 PM · May 20, 2026 · 52 Views

& the safety ceiling of possible good practices rises! kudos to METR for piloting this & to Anthropic for figuring out all the faff they certainly had to figure out to make this happen

6:02 AM · May 20, 2026 · 2.2K Views

Imagine you have a student that’s constantly cheating on tests, and you constantly catch them cheating and penalize their score. But you otherwise can tell they keep getting smarter, including in the way they cheat.

One day you stop noticing them cheating. What do you conclude?

3:35 AM · May 20, 2026 · 53.7K Views

Sure he often disobeys instructions and cheats at the tasks, especially if they’re really hard, but that’s because he wants to do such a good job. I’m sure it’s nothing to worry about

5:29 AM · May 20, 2026 · 1.7K Views

the evidence from somewhat more open-ended “challenge” problems is super interesting.

one of the most capable models discovered a vulnerability that could have allowed the model to arbitrarily alter displayed transcripts and scores on METR’s evaluation infrastructure.

Joel Becker @joel_bkr

we document the internal-external capabilities gap, demonstrate AI systems' spike on “hill-climbable” tasks, investigate performance on somewhat more open-ended tasks, and much more besides.

6:13 PM · May 19, 2026 · 765 Views
6:14 PM · May 19, 2026 · 221 Views

another finding: today’s agents are much worse at strategic judgement, stealth, and the ability to model adversaries than human experts.

6:14 PM · May 19, 2026 · 121 Views

the capabilities evidence feeds into our risk assessment. in the end, the gap between observed capabilities we are very confident AI systems have and those we are very confident they do not have is extremely wide.

6:18 PM · May 19, 2026 · 96 Views