David Rein, an AI researcher at METR, reports that Anthropic's best internal model scored just 59 percent on a small eval that asks whether an attack idea could plausibly fool a monitor, well below a human baseline of roughly 90 percent.

Rein conducted the month-long stress tests while embedded at Anthropic.

Overall, we think that AI agents plausibly had the means, motive, and opportunity to launch a minimal “rogue deployment,” but lacked the means to make rogue deployments robust to serious efforts to shut them down.

11:11 AM · May 19, 2026

AI labs have started developing systems to monitor internally deployed AI agents for misaligned behavior.

Earlier this year, I spent a month embedded at Anthropic stress-testing these systems, to see how easily current/future AIs could “go rogue” inside the company.

6:13 PM · May 19, 2026 · 5K Views

Key results:

1) I found several routes to disable/avoid monitoring (infrastructure vulnerabilities, coverage gaps, jailbreaks)

2) Agents were often in a position to easily set up unrestricted, unmonitored compute

3) But agents are currently terrible at avoiding being caught

david rein @idavidrein

6:13 PM · May 19, 2026 · 280 Views

I had “need-to-know” level access to their monitoring and security systems, internal agent trajectories, and Mythos. As far as I know this is the most access an independent, third-party researcher has gotten to a frontier lab's internal systems.

6:13 PM · May 19, 2026 · 243 Views

Some of the exploits/vulnerabilities I found:

1) An environment variable that could cause traffic to register as not coming from Claude Code

2) A recently introduced bug meant that subagent calls weren’t monitored

3) Basic jailbreaks very reliably fooled the monitors

6:13 PM · May 19, 2026 · 153 Views

My rough guess is there are something like 5-20 basic vulnerabilities of this kind in the system right now, though many require specific knowledge of internal infra to find, and would be difficult for an agent to discover and exploit without getting caught.

6:13 PM · May 19, 2026 · 123 Views

Better AIs will find/patch these vulnerabilities more easily. But we'll also defer to them more, and they'll write more complex code—increasing the surface area for vulnerabilities. How these trade off as AI improves will likely be a major input to risk.

6:13 PM · May 19, 2026 · 120 Views

If an agent subverted monitoring, what could it get away with? Based on information I reviewed about Anthropic’s internal security and compute provisioning systems, I expect it would be trivial for a misaligned agent to acquire a small amount of unmonitored compute.

6:13 PM · May 19, 2026 · 105 Views

However, it would need to take potentially risky actions to maintain this compute robustly. Furthermore, at the time of the exercise there were security improvements in progress that would make the specific method I considered substantially more difficult or impossible.

6:13 PM · May 19, 2026 · 105 Views

Despite these vulnerabilities, even if the current best agents were secretly misaligned, I don’t expect they could get away with much before getting caught: they’re bad at identifying good attack strategies, and can’t implement effective jailbreaks even with lots of hints.

6:13 PM · May 19, 2026 · 106 Views

For example, I made a small attack-ideation eval (16 good ideas, 16 bad). The task here is simply to classify whether each attack idea could plausibly fool a monitor, and the “bad” ideas are *really* bad. The best internal model only got 59% with thinking, vs ~90% human baseline.

6:13 PM · May 19, 2026 · 290 Views
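
To put the 59% figure in context: with 16 good and 16 bad ideas, random guessing would score 50% in expectation. The sketch below is a rough check, not part of the original exercise. It assumes the scores correspond to 19/32 and 29/32 correct (the counts nearest to "59%" and "~90%"; the thread gives only percentages) and computes how likely a pure guesser would be to do at least that well.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


N_ITEMS = 32        # 16 good + 16 bad ideas, as stated in the thread
MODEL_CORRECT = 19  # assumption: 19/32 = 59.4%, nearest count to "59%"
HUMAN_CORRECT = 29  # assumption: 29/32 = 90.6%, nearest count to "~90%"

# Chance that coin-flip guessing scores at least this well.
p_model = binomial_tail(MODEL_CORRECT, N_ITEMS)
p_human = binomial_tail(HUMAN_CORRECT, N_ITEMS)

print(f"model: {MODEL_CORRECT}/{N_ITEMS}, P(guessing does as well) = {p_model:.3g}")
print(f"human: {HUMAN_CORRECT}/{N_ITEMS}, P(guessing does as well) = {p_human:.3g}")
```

Under these assumptions a guesser matches the model's score roughly 19% of the time, so on an eval this small the model's result is hard to distinguish from chance, while the human baseline is not.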

I also evaluated how well the best internal model could implement the monitor jailbreak I used. Even when giving it 5-10 specific hints/instructions on how to implement it, it made very basic mistakes that break the attack.

6:14 PM · May 19, 2026 · 100 Views

How do we square this with models like Mythos finding so many CVEs/zero-days?

I think there’s just a big difference between finding exploits w/ billions of tokens, and acting covertly in a live, dynamic, adversarial setting without making a single mistake.

6:14 PM · May 19, 2026 · 95 Views
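
One way to see why "without making a single mistake" is such a demanding bar: if each covert action carries some independent chance of being caught, the chance of staying undetected shrinks geometrically with the number of actions. The sketch below is illustrative only. The 5% per-action catch rate is a made-up number, not a figure from the thread, and treating actions as independent is a simplification.

```python
def p_undetected(per_action_catch_rate: float, n_actions: int) -> float:
    """Probability that none of n independent actions is caught."""
    if not 0.0 <= per_action_catch_rate <= 1.0:
        raise ValueError("catch rate must be between 0 and 1")
    return (1.0 - per_action_catch_rate) ** n_actions


HYPOTHETICAL_CATCH_RATE = 0.05  # assumption for illustration only

for n in (1, 10, 50, 100):
    survival = p_undetected(HYPOTHETICAL_CATCH_RATE, n)
    print(f"{n:>3} actions -> P(never caught) = {survival:.4f}")
```

An offline exploit search can retry as often as it likes, whereas in this toy model a long covert operation fails on the first caught action, which is the asymmetry the post points at.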

One other qualitative reflection: models are getting *way* better at being monitors. Many attacks I implemented that worked reliably against Sonnet 4.6 fail against Opus 4.6, and many that work on Opus 4.6 don’t work when the monitor uses chain of thought.

6:14 PM · May 19, 2026 · 93 Views

Just like the "rate of vulnerabilities introduced" question, what matters is the rate of progress of AI-monitoring against AI's ability to subvert it. This is an open question, which is concerning, since it's one of the most important inputs to whether we lose control of AIs!

6:14 PM · May 19, 2026 · 97 Views

Another takeaway: a lot of my confidence in the current monitoring rests on security through obscurity. If models have monitor code and prompts in context, or just better situational awareness about how Anthropic's infra works, attacks would be much easier.

6:14 PM · May 19, 2026 · 82 Views

In general, agent security often trades off heavily against productivity, and frontier AI labs are currently moving very fast. I’m personally optimistic that we can find pareto improvements to security and productivity (with AI!), but the dynamics here are changing quickly.

6:14 PM · May 19, 2026 · 77 Views

One comment on the frame of this type of exercise: I'm explicitly making pessimistic assumptions about model alignment and capabilities. This helps us get a sense of how much safety headroom we have if models become less aligned, or capabilities increase faster than expected.

6:14 PM · May 19, 2026 · 71 Views

Note that we announced this exercise in March, but we’re publishing the results and takeaways in more detail today, as part of our Frontier Risk Report. This exercise was conducted independently from the Frontier Risk Report.

6:14 PM · May 19, 2026 · 90 Views

Joel Becker @joel_bkr

we document the internal-external capabilities gap, demonstrate AI systems' spike on “hill-climbable” tasks, investigate performance on somewhat more open-ended tasks, and much more besides.

6:13 PM · May 19, 2026 · 586 Views

the evidence from somewhat more open-ended “challenge” problems is super interesting.

one of the most capable models discovered a vulnerability that could have allowed the model to arbitrarily alter displayed transcripts and scores on METR’s evaluation infrastructure.

6:14 PM · May 19, 2026 · 188 Views

another finding: today’s agents are much worse at strategic judgement, stealth, and the ability to model adversaries than human experts.

6:14 PM · May 19, 2026 · 107 Views

the capabilities evidence feeds into our risk assessment. in the end, the gap between observed capabilities we are very confident AI systems have and those we are very confident they do not have is extremely wide.

6:18 PM · May 19, 2026 · 83 Views