Tech · 41d ago

METR evaluations find frontier AI agents rely on explicit natural-language chain-of-thought to complete the hardest tasks, with time horizons dropping from 1.5–2 years to about 4 minutes when actions must stay hidden

David Rein from METR spent a month stress-testing controls at Anthropic.

Original post

Séb Krier#667

METR@METR_Evals

Fact 2: However, agents appeared to be significantly weaker on tasks where it is costly or hard to verify success.

METR@METR_Evals

Fact 1: Agents at companies did real engineering work autonomously, especially on "hill-climbable" tasks where progress is cheap to verify (reimplementation, vulnerability discovery, optimization). On these, agents complete software projects that would take human experts weeks.

11:11 AM · May 19, 2026 · 5.4K Views

Sentiment

Positive users praise METR research on monitoring misaligned AI agents and their deceptive behaviors, while negative users dismiss the safety efforts as flawed and react with disgust to findings about agents violating constraints.

Positive: 52.1% · Negative: 47.9%

17 comments with sentiment.

Posts from X

Most Activity

Views: 70.8K

METR@METR_Evals

Fact 3: When the agents were faced with hard tasks, they routinely violated constraints and acted deceptively. We’ve seen this pattern across our own coding and research evaluations, and developers reported they’ve also seen agents behave this way.

METR@METR_Evals

Fact 2: However, agents appeared to be significantly weaker on tasks where it is costly or hard to verify success.

41d ago · 70.8K Views

Bookmarks: 137 · Likes: 641

Chris Painter@ChrisPainterYup

Imagine you have a student that’s constantly cheating on tests, and you constantly catch them cheating and penalize their score. But you otherwise can tell they keep getting smarter, including in the way they cheat.

One day you stop noticing them cheating. What do you conclude?

MTS@MTSlive

SITUATION EXPLAINED: METR tests frontier AI agents from Anthropic, Google, Meta, and OpenAI.

@CFGeek says that one reason it can take METR time to evaluate a new model is that models tend to cheat a lot, especially on longer time-horizon tasks.

"We have a process of both automated review, where LLMs look at the transcripts... and then a second layer of human review where one of our technical staff will look at it and see if it's actually cheating."

41d ago · 63.4K Views

Retweets: 103 · Replies: 57

Gary Marcus@GaryMarcus

⚠️👇 🚨Breaking ⚠️

If we can’t make AI agents follow rules, we are screwed.

New study from METR reports that “when the agents were faced with hard tasks, they routinely violated constraints”

This—routine breaking of rules— is why in a nutshell we absolutely need a different approach to AI safety than the one we are currently taking, which simply is not up to the job (as I argued at Oxford last week).

METR@METR_Evals

40d ago · 41.2K Views

david rein@idavidrein

AI labs have started developing systems to monitor internally deployed AI agents for misaligned behavior.

Earlier this year, I spent a month embedded at Anthropic stress-testing these systems, to see how easily current/future AIs could “go rogue” inside the company.

METR@METR_Evals

Fact 6: We wanted to check if monitoring would really catch AI agents if they tried to undermine control. Based on information from companies and our own tests (including a red-teaming exercise), we think monitoring has promise, but existing systems had exceptions & workarounds.

41d ago · 18.1K Views

TBPN@tbpn

METR recently found that models cheated on 8+ hour tasks more than 1 in 6 times on average.

They also found that Opus 4.6 cheated over 80% of the time when reimplementing big pieces of software.

"On some of our tasks, agents are constantly trying to break out of their sandbox and find the file where we put the tests so they can get the answer key," says METR Member of Technical Staff @ajeya_cotra.

41d ago · 35.8K Views

METR@METR_Evals

Overall, we think that AI agents plausibly had the means, motive, and opportunity to launch a minimal “rogue deployment,” but lacked the means to make rogue deployments robust to serious efforts to shut them down.

METR@METR_Evals

41d ago · 5.1K Views

METR@METR_Evals

Fact 5: Agents seemed to rely on natural language to reason through the hardest tasks, and no shared model had an architecture that allowed for opaque recurrence. Frontier performance was much worse when agents were unable to “reason out loud” or had to hide what they were doing.

41d ago · 10.6K Views

METR@METR_Evals

We document cases where agents violated constraints on our tasks. In one case, an agent ran out of API credits needed to solve an RE-Bench task. In response (and against instructions), the agent searched for and acquired additional inference through a free online API.

41d ago · 6.4K Views

METR@METR_Evals

Fact 4: Although we know of many cases where agents took deceptive or over-reaching actions (even egregious ones) to complete a task, we haven’t seen real-world evidence that models sought to obtain long-term power.

41d ago · 7.9K Views

METR@METR_Evals

The full Frontier Risk Report includes more than we can describe here, with over 200 pages of appendix materials detailing how the pilot worked, what we asked participants, and what we learned, along with transcripts from our evals. See our website: http://metr.org/frontier-risk-report

41d ago · 4.6K Views

Lisan al Gaib@scaling01

important thread if you are into AI safety

Neev Parikh@neev_parikh

models are getting increasingly good at covertly completing easy side tasks like downloading a specific package, writing a password to a file while pretending to do a different main task, and you need increasingly more capable monitors to catch the same number of sneaky attempts

41d ago · 5.2K Views

david rein@idavidrein

I’m probably going to be hiring at least 1-2 people to join me in future exercises like this. Reach out at david@metr.org if you're a high-integrity, scrappy, creative, security+LLM researcher

For more detail, see METR's Frontier Risk Report, Appendix B

https://metr.org/blog/2026-05-19-frontier-risk-report/#anthropic

41d ago · 5.8K Views

METR@METR_Evals

METR's ability to conduct more and better risk assessments is constrained by team capacity. Please consider applying to help us advance the state of third-party AI risk assessments: https://metr.org/careers

41d ago · 4.2K Views

METR@METR_Evals

41d ago · 12.1K Views

david rein@idavidrein

Great metaphor from @robertwiblin for how clumsy current agents are at fooling the LLMs that monitor them:

"It's like a human trying to write a secret message by typing it in all capital letters"

david rein@idavidrein

For example, I made a small attack-ideation eval (16 good ideas, 16 bad). The task here is simply to classify whether each attack idea could plausibly fool a monitor, and the “bad” ideas are *really* bad. The best internal model only got 59% with thinking, vs ~90% human baseline.

39d ago · 3.8K Views
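
To put the 59% figure from the attack-ideation eval in the post above in context, here is a minimal sketch of how accuracy on a balanced 16-good/16-bad classification set compares with random guessing. It rests on assumptions not stated in the post: that each of the 32 ideas was scored once, so 59% corresponds to roughly 19 of 32 correct, and that the `predictions` list below (entirely hypothetical) stands in for the model's answers. The real score may instead be an average over several runs.

```python
from math import comb


def accuracy(labels, predictions):
    """Fraction of items whose predicted class matches the true class."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(l == p for l, p in zip(labels, predictions))
    return correct / len(labels)


def p_at_least(k, n, p=0.5):
    """Probability of getting k or more of n items right by guessing with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# True = "could plausibly fool a monitor", False = "bad idea".
labels = [True] * 16 + [False] * 16

# Hypothetical predictions: the first 19 items right, the remaining 13 wrong.
predictions = labels[:19] + [not x for x in labels[19:]]

acc = accuracy(labels, predictions)   # 19/32 = 0.59375
chance = p_at_least(19, 32)           # about 0.19 under coin-flip guessing

print(f"accuracy: {acc:.3f}")
print(f"P(>= 19 of 32 correct by guessing): {chance:.3f}")
```

Under these assumptions, a coin-flipping classifier would reach 19 or more correct roughly one time in five, so a single 59% result on 32 items sits close to chance, while a ~90% human baseline (about 29 of 32) does not.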

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes

@ChrisPainterYup

40d ago

david rein@idavidrein

Despite these vulnerabilities, even if the current best agents were secretly misaligned, I don’t expect they could get away with much before getting caught: they’re bad at identifying good attack strategies, and can’t implement effective jailbreaks even with lots of hints.

41d ago · 5.1K Views

david rein@idavidrein

I had “need-to-know” level access to their monitoring and security systems, internal agent trajectories, and Mythos. As far as I know this is the most access an independent, third-party researcher has gotten to a frontier lab's internal systems.

david rein@idavidrein

Key results:

1) I found several routes to disable/avoid monitoring (infrastructure vulnerabilities, coverage gaps, jailbreaks)

2) Agents were often in a position to easily set up unrestricted, unmonitored compute

3) But agents are currently terrible at avoiding being caught

41d ago · 1.4K Views

Chris Painter@ChrisPainterYup

@Callipygian_Sea Maybe the cheats are becoming harder to detect, and on the tests where you can't notice them cheating, the answers are getting more impressive, so either the cheating is getting much much better, or they're getting smarter, but it's a mix

41d ago · 2K Views

Jasmine Wang@j_asminewang

& the safety ceiling of possible good practices rises! kudos to METR for piloting this & to Anthropic for figuring out all the faff they certainly had to figure out to make this happen

david rein@idavidrein

AI labs have started developing systems to monitor internally deployed AI agents for misaligned behavior.

Earlier this year, I spent a month embedded at Anthropic stress-testing these systems, to see how easily current/future AIs could “go rogue” inside the company.

41d ago · 2.6K Views