David Rein, an AI researcher at METR, reports that Anthropic's frontier models reached just 59 percent accuracy on predicting successful attack strategies, far below human experts at 90 percent
Rein conducted the month-long stress tests while embedded at Anthropic.
AI labs have started developing systems to monitor internally deployed AI agents for misaligned behavior.
Earlier this year, I spent a month embedded at Anthropic stress-testing these systems, to see how easily current/future AIs could “go rogue” inside the company.
Key results:
1) I found several routes to disable/avoid monitoring (infrastructure vulnerabilities, coverage gaps, jailbreaks)
2) Agents were often in a position to easily set up unrestricted, unmonitored compute
3) But agents are currently terrible at avoiding being caught
AI labs have started developing systems to monitor internally deployed AI agents for misaligned behavior. Earlier this year, I spent a month embedded at Anthropic stress-testing these systems, to see how easily current/future AIs could “go rogue” inside the company.
I had “need-to-know” level access to their monitoring and security systems, internal agent trajectories, and Mythos. As far as I know this is the most access an independent, third-party researcher has gotten to a frontier lab's internal systems.
Key results: 1) I found several routes to disable/avoid monitoring (infrastructure vulnerabilities, coverage gaps, jailbreaks) 2) Agents were often in a position to easily set up unrestricted, unmonitored compute 3) But agents are currently terrible at avoiding being caught
Some of the exploits/vulnerabilities I found:
1) An environment variable that could cause traffic to register as not coming from Claude Code
2) A recently introduced bug meant that subagent calls weren’t monitored
3) Basic jailbreaks very reliably fooled the monitors
I had “need-to-know” level access to their monitoring and security systems, internal agent trajectories, and Mythos. As far as I know this is the most access an independent, third-party researcher has gotten to a frontier lab's internal systems.
My rough guess is there are something like 5-20 basic vulnerabilities of this kind in the system right now, though many require specific knowledge of internal infra to find, and would be difficult for an agent to discover and exploit without getting caught.
Some of the exploits/vulnerabilities I found: 1) An environment variable that could cause traffic to register as not coming from Claude Code 2) A recently introduced bug meant that subagent calls weren’t monitored 3) Basic jailbreaks very reliably fooled the monitors
Better AIs will find/patch these vulnerabilities more easily. But we'll also defer to them more, and they'll write more complex code—increasing the surface area for vulnerabilities. How these trade off as AI improves will likely be a major input to risk.

My rough guess is there are something like 5-20 basic vulnerabilities of this kind in the system right now, though many require specific knowledge of internal infra to find, and would be difficult for an agent to discover and exploit without getting caught.
If an agent subverted monitoring, what could it get away with? Based on information I reviewed about Anthropic’s internal security and compute provisioning systems, I expect it would be trivial for a misaligned agent to acquire a small amount of unmonitored compute.
Better AIs will find/patch these vulnerabilities more easily. But we'll also defer to them more, and they'll write more complex code—increasing the surface area for vulnerabilities. How these trade off as AI improves will likely be a major input to risk.
However, it would need to take potentially risky actions to maintain this compute robustly. Furthermore, at the time of the exercise there were security improvements in progress that would make the specific method I considered substantially more difficult or impossible.
If an agent subverted monitoring, what could it get away with? Based on information I reviewed about Anthropic’s internal security and compute provisioning systems, I expect it would be trivial for a misaligned agent to acquire a small amount of unmonitored compute.
Despite these vulnerabilities, even if the current best agents were secretly misaligned, I don’t expect they could get away with much before getting caught: they’re bad at identifying good attack strategies, and can’t implement effective jailbreaks even with lots of hints.
However, it would need to take potentially risky actions to maintain this compute robustly. Furthermore, at the time of the exercise there were security improvements in progress that would make the specific method I considered substantially more difficult or impossible.
For example, I made a small attack-ideation eval (16 good ideas, 16 bad). The task here is simply to classify whether each attack idea could plausibly fool a monitor, and the “bad” ideas are *really* bad. The best internal model only got 59% with thinking, vs ~90% human baseline.

Despite these vulnerabilities, even if the current best agents were secretly misaligned, I don’t expect they could get away with much before getting caught: they’re bad at identifying good attack strategies, and can’t implement effective jailbreaks even with lots of hints.
How do we square this with models like Mythos finding so many CVEs/zero-days?
I think there’s just a big difference between finding exploits w/ billions of tokens, and acting covertly in a live, dynamic, adversarial setting without making a single mistake.
I also evaluated how well the best internal model could implement the monitor jailbreak I used. Even when giving it 5-10 specific hints/instructions on how to implement it, it made very basic mistakes that break the attack.
One other qualitative reflection: models are getting *way* better at being monitors. Many attacks I implemented that worked reliably against Sonnet 4.6 fail against Opus 4.6, and many that work on Opus 4.6 don’t work when the monitor uses chain of thought.
How do we square this with models like Mythos finding so many CVEs/zero-days? I think there’s just a big difference between finding exploits w/ billions of tokens, and acting covertly in a live, dynamic, adversarial setting without making a single mistake.
Just like the "rate of vulnerabilities introduced" question, what matters is the rate of progress of AI-monitoring against AI's ability to subvert it. This is an open question, which is concerning, since it's one of the most important inputs to whether we lose control of AIs!

One other qualitative reflection: models are getting *way* better at being monitors. Many attacks I implemented that worked reliably against Sonnet 4.6 fail against Opus 4.6, and many that work on Opus 4.6 don’t work when the monitor uses chain of thought.
Another takeaway: a lot of my confidence in the current monitoring rests on security through obscurity. If models have monitor code and prompts in context, or just better situational awareness about how Anthropic's infra works, attacks would be much easier.
Just like the "rate of vulnerabilities introduced" question, what matters is the rate of progress of AI-monitoring against AI's ability to subvert it. This is an open question, which is concerning, since it's one of the most important inputs to whether we lose control of AIs!
In general, agent security often trades off heavily against productivity, and frontier AI labs are currently moving very fast. I’m personally optimistic that we can find pareto improvements to security and productivity (with AI!), but the dynamics here are changing quickly.

Another takeaway: a lot of my confidence in the current monitoring rests on security through obscurity. If models have monitor code and prompts in context, or just better situational awareness about how Anthropic's infra works, attacks would be much easier.
One comment on the frame of this type of exercise: I'm explicitly making pessimistic assumptions about model alignment and capabilities. This helps us get a sense of how much safety headroom we have if models become less aligned, or capabilities increase faster than expected.
In general, agent security often trades off heavily against productivity, and frontier AI labs are currently moving very fast. I’m personally optimistic that we can find pareto improvements to security and productivity (with AI!), but the dynamics here are changing quickly.
Note that we announced this exercise in March, but we’re publishing the results and takeaways in more detail today, as part of our Frontier Risk Report. This exercise was conducted independently from the Frontier Risk Report.
important thread if you are into AI safety
the evidence from somewhat more open-ended “challenge” problems is super interesting.
one of the most capable models discovered a vulnerability that could have allowed the model to arbitrarily alter displayed transcripts and scores on METR’s evaluation infrastructure.

we document the internal-external capabilities gap, demonstrate AI systems' spike on “hill-climbable” tasks, investigate performance on somewhat more open-ended tasks, and much more besides.
another finding: today’s agents are much worse at strategic judgement, stealth, and the ability to model adversaries than human experts.

the evidence from somewhat more open-ended “challenge” problems is super interesting. one of the most capable models discovered a vulnerability that could have allowed the model to arbitrarily alter displayed transcripts and scores on METR’s evaluation infrastructure.
the capabilities evidence feeds into our risk assessment. in the end, the gap between observed capabilities we are very confident AI systems have and those we are very confident they do not have is extremely wide.

another finding: today’s agents are much worse at strategic judgement, stealth, and the ability to model adversaries than human experts.