Google DeepMind releases Gram, an alignment auditing tool to detect AI agents secretly sabotaging tasks
Gram found a 3% natural sabotage rate in Gemini models
Auditing agents are a key way to catch misaligned models, but they need to audit realistically - any model will do bad things if pushed hard enough. I'm glad GDM is pushing on how to improve realism here!
Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind
Alignment auditing and evals, with an emphasis on simulating realistic (rather than red-team) settings. Interestingly, natural sabotage rates for Gemini (3%) are largely due to overeagerness: excessive roleplaying; too literal interpretation of instructions to optimise a goal
Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind