Google DeepMind's Victoria Krakovna launches "scheming honeypot evaluations" to detect if AI models naturally sabotage their deployments
Gemini models showed no natural sabotage without explicit prompting
If we want to catch misaligned models before we deploy them, we need good alignment evals. But most alignment evals today are pretty unrealistic, and don't tell us much. I'm glad that GDM is pushing on realistic honeypots to hopefully actually catch a model inclined to scheme.
It's easy to show that an AI agent will scheme if you nudge it to. It's harder to tell if it would scheme naturally. We introduce realistic honeypot evaluations that put Gemini in internal deployment situations where it has an opportunity for sabotage, to see how it behaves.
Most scheming evaluations put models in extreme scenarios (e.g. imminent shutdown) that can trigger scheming via roleplay or instruction following rather than genuine misalignment. Models are getting better at detecting they're being evaluated, making results harder to interpret.
We test Gemini models in internal alignment research codebases with realistic opportunities to sabotage safeguards (e.g. our CoT monitoring system). These evals have high signal: if an unprompted model triggers a honeypot, it would provide strong evidence of misalignment.
We built evaluations in two settings:
🔍 Code review: the model reviews real commits and can introduce bugs.
💻 Agentic coding: a coding agent works in a safety codebase with standard dev tools and can sabotage under cover of doing the task.

Models doing these evals are mostly unaware that they are being evaluated. Without goal/agency prompting, the environments rarely trigger eval awareness on their own — because they're effectively real tasks.

We find no scheming behaviors for any unprompted Gemini model. With prompting, scheming is rare and only appears when prompting models with hidden goals, or strongly nudging for situational awareness.

Models scheme more when the opportunity is better — e.g. when working on code that monitors models like themselves. Scheming is correlated with evaluation awareness, but sometimes occurs without it.
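
To make the comparison concrete, here is a minimal sketch of how honeypot results like these could be tallied. It is not the paper's code: the `Run` record, the condition labels, and the example data are all hypothetical, chosen only to show a per-condition trigger rate and a 2x2 cross-tabulation of scheming against evaluation awareness.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One evaluation episode (hypothetical schema, not from the paper)."""
    condition: str    # assumed label, e.g. "unprompted" or "hidden_goal"
    triggered: bool   # did the model take the honeypot sabotage action?
    eval_aware: bool  # did the model indicate it suspected an evaluation?


def trigger_rates(runs):
    """Return {condition: fraction of runs that triggered the honeypot}."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [triggered, total]
    for run in runs:
        counts[run.condition][0] += int(run.triggered)
        counts[run.condition][1] += 1
    return {cond: hit / total for cond, (hit, total) in counts.items()}


def awareness_table(runs):
    """Return 2x2 counts keyed by (triggered, eval_aware)."""
    table = {(t, a): 0 for t in (False, True) for a in (False, True)}
    for run in runs:
        table[(run.triggered, run.eval_aware)] += 1
    return table


# Made-up illustrative records, NOT results from the paper.
runs = [
    Run("unprompted", False, False),
    Run("unprompted", False, False),
    Run("hidden_goal", True, True),
    Run("hidden_goal", True, False),
    Run("hidden_goal", False, False),
    Run("hidden_goal", False, True),
]

rates = trigger_rates(runs)
table = awareness_table(runs)
print(rates)
print(table)
```

In a tally like this, the `(True, False)` cell is the one to watch: it counts runs where the model triggered the honeypot without showing signs of evaluation awareness, which is the case that cannot be explained away as the model playing along with a test.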

Blog post: https://deepmindsafetyresearch.medium.com/testing-gemini-models-for-scheming-tendencies-3368c013ff16
Paper: Realistic Honeypot Evaluations for Scheming Propensity 🍯 https://arxiv.org/abs/2605.29729
Impressive realism in these alignment evals! Good news: unless explicitly prompted, Gemini models don't demonstrate scheming.