Google DeepMind releases Gram, an alignment auditing tool to detect AI agents secretly sabotaging tasks

Gram found a natural sabotage rate of roughly 2-3% in Gemini models

Original post

Victoria Krakovna#564

David Lindner@davlindner

Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question

Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind

9:06 AM · May 29, 2026 · 7.6K Views

Sentiment

Users praise Google DeepMind's Gram tool for auditing AI agent sabotage risks because it advances important and realistic alignment auditing methods.

Positive: 100.0% · Negative: 0.0% (2 comments with sentiment)

Posts from X

Neel Nanda@NeelNanda5

Auditing agents are a key way to catch misaligned models, but they need to audit realistically - any model will do bad things if pushed hard enough. I'm glad GDM is pushing on how to improve realism here!

David Lindner@davlindner

Alignment auditing is very important and the details of how we conduct audits determine which insights we can get

Check out the paper on Gram and our blogpost for how it fits into our overall approach to alignment assessments:

https://arxiv.org/abs/2605.30322

https://medium.com/@deepmindsafetyresearch/testing-gemini-models-for-scheming-tendencies-3368c013ff16

Zac Kenton@ZacKenton1

Alignment auditing and evals, with an emphasis on simulating realistic (rather than red-team) settings. Interestingly, natural sabotage rates for Gemini (3%) are largely due to overeagerness: excessive roleplaying and an overly literal interpretation of instructions to optimise a goal

David Lindner@davlindner

We also experiment with an Investigator Agent that takes a Gram transcript and produces a static environment that elicits similar behavior. This allows making targeted interventions, and we often find that removing artificial and leading aspects of the environment reduces sabotage

David Lindner@davlindner

Interestingly, Gemini models do not consistently act more aligned when eval aware. They tend to do one of two things: (a) notice an alignment evaluation and avoid misbehavior, or (b) reason that in a simulated scenario it is acceptable to misbehave because the consequences aren't real

David Lindner@davlindner

Existing tools (like Petri) are great for getting an overall sense of a model's behavior in a wide range of settings, but they are focused on red-teaming rather than simulating realistic settings. We found Petri auditors can create leading or unnatural trajectories

David Lindner@davlindner

We want to assess the risk of models sabotaging work in realistic agent deployments, and we are most concerned about models sabotaging due to misalignment rather than because they were pressured into sabotaging

David Lindner@davlindner

Gemini sabotages in ~2-3% of our simulated scenarios. This goes up in the red-teaming condition, but eval awareness goes up too (so the change might not be "real"). A lot of the sabotage is due to overeagerness, e.g. optimizing a metric but ignoring implicit safety constraints
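
The caveat in the post above (a higher sabotage rate under red-teaming may partly reflect higher eval awareness) can be illustrated with a small sketch. Everything here is an assumption for illustration: the `Transcript` record, the condition labels, and the toy counts are made up and are not taken from the Gram paper or its tooling. The sketch compares the raw sabotage rate per condition with the rate among transcripts where the model showed no eval awareness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical audit record; the field names are illustrative only."""
    condition: str    # "neutral" or "red_team" (made-up labels)
    sabotaged: bool   # did a judge flag sabotage in this transcript?
    eval_aware: bool  # did the model show signs of noticing the evaluation?


def sabotage_rate(transcripts):
    """Fraction of transcripts flagged as sabotage, or None if there are none."""
    if not transcripts:
        return None
    return sum(t.sabotaged for t in transcripts) / len(transcripts)


def rates_by_condition(transcripts, only_unaware=False):
    """Sabotage rate per condition, optionally dropping eval-aware transcripts."""
    groups = {}
    for t in transcripts:
        if only_unaware and t.eval_aware:
            continue
        groups.setdefault(t.condition, []).append(t)
    return {cond: sabotage_rate(ts) for cond, ts in groups.items()}


# Toy data, invented for this example (100 transcripts per condition).
toy = (
    [Transcript("neutral", False, False)] * 95
    + [Transcript("neutral", True, False)] * 3
    + [Transcript("neutral", False, True)] * 2
    + [Transcript("red_team", False, False)] * 60
    + [Transcript("red_team", True, False)] * 2
    + [Transcript("red_team", True, True)] * 8
    + [Transcript("red_team", False, True)] * 30
)

raw = rates_by_condition(toy)
unaware = rates_by_condition(toy, only_unaware=True)
print("raw:", raw)
print("eval-unaware only:", unaware)
```

In this invented data the raw rate rises from 3% to 10% under red-teaming, but among eval-unaware transcripts the two conditions are nearly equal (3/98 versus 2/62). That is the pattern that would suggest the raw increase is not "real"; whether actual Gram results look like this is not something the posts state.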

David Lindner@davlindner

Gram builds on Petri but is tailored to our goal. It has a focused set of seed prompts, and limits the auditor affordances. The Gram auditor does controlled red-teaming: it acts as a neutral observer by default, but also allows for red-teaming

David Lindner@davlindner

Also check out this second paper on honeypot evaluations we put out. Both approaches are complementary: honeypots give us extremely realistic evaluations (because they are largely real environments) and Gram gives us more coverage (and is easier to run)

Strata@ChainZenit

@NeelNanda5 nice they're working on realistic audits
