
Google DeepMind releases Gram, an alignment auditing tool to detect AI agents secretly sabotaging tasks

Gram found a 3% natural sabotage rate in Gemini models

Original post

Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question. Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind.

9:06 AM · May 29, 2026

Auditing agents are a key way to catch misaligned models, but they need to audit realistically - any model will do bad things if pushed hard enough. I'm glad GDM is pushing on how to improve realism here!

David Lindner (@davlindner)

Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question. Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind.

4:06 PM · May 29, 2026 · 4K Views
5:14 PM · May 29, 2026 · 3K Views

Alignment auditing and evals, with an emphasis on simulating realistic (rather than red-team) settings. Interestingly, natural sabotage rates for Gemini (3%) are largely due to overeagerness: excessive roleplaying and overly literal interpretation of instructions to optimise a goal.

4:27 PM · May 29, 2026 · 248 Views