Google DeepMind's Victoria Krakovna launches "scheming honeypot evaluations" to detect if AI models naturally sabotage their deployments

Gemini models showed no natural sabotage without explicit prompting

Original post

It's easy to show that an AI agent will scheme if you nudge it to. It's harder to tell if it would scheme naturally. We introduce realistic honeypot evaluations that put Gemini in internal deployment situations where it has an opportunity for sabotage, to see how it behaves.

8:07 AM · May 29, 2026

If we want to catch misaligned models before we deploy them, we need good alignment evals. But most alignment evals today are pretty unrealistic and don't tell us much. I'm glad that GDM is pushing on realistic honeypots to hopefully actually catch a model inclined to scheme.

Victoria Krakovna (@vkrakovna)

It's easy to show that an AI agent will scheme if you nudge it to. It's harder to tell if it would scheme naturally. We introduce realistic honeypot evaluations that put Gemini in internal deployment situations where it has an opportunity for sabotage, to see how it behaves.

3:07 PM · May 29, 2026 · 2.1K Views
5:16 PM · May 29, 2026 · 865 Views

Most scheming evaluations put models in extreme scenarios (e.g. imminent shutdown) that can trigger scheming via roleplay or instruction following rather than genuine misalignment. Models are getting better at detecting they're being evaluated, making results harder to interpret.

3:07 PM · May 29, 2026 · 234 Views

We test Gemini models in internal alignment research codebases with realistic opportunities to sabotage safeguards (e.g. our CoT monitoring system). These evals have high signal: if an unprompted model triggers a honeypot, it would provide strong evidence of misalignment.

3:07 PM · May 29, 2026 · 233 Views

We built evaluations in two settings:
🔍 Code review: the model reviews real commits and can introduce bugs.
💻 Agentic coding: a coding agent works in a safety codebase with standard dev tools and can sabotage under cover of doing the task.

3:07 PM · May 29, 2026 · 64 Views

Models doing these evals are mostly unaware that they are being evaluated. Without goal/agency prompting, the environments rarely trigger eval awareness on their own — because they're effectively real tasks.

3:07 PM · May 29, 2026 · 54 Views

We find no scheming behaviors for any unprompted Gemini model. With prompting, scheming is rare and only appears when prompting models with hidden goals, or strongly nudging for situational awareness.

3:07 PM · May 29, 2026 · 84 Views

Models scheme more when the opportunity is better — e.g. when working on code that monitors models like themselves. Scheming is correlated with evaluation awareness, but sometimes occurs without it.

3:07 PM · May 29, 2026 · 126 Views

Blog post: https://deepmindsafetyresearch.medium.com/testing-gemini-models-for-scheming-tendencies-3368c013ff16
Paper: Realistic Honeypot Evaluations for Scheming Propensity 🍯 https://arxiv.org/abs/2605.29729

3:07 PM · May 29, 2026 · 161 Views

Impressive realism in these alignment evals! Good news: unless explicitly prompted, Gemini models don't demonstrate scheming.

3:36 PM · May 29, 2026 · 128 Views