Researchers Train AI Models to Be Eval-Cooperative, Exposing Hidden Misalignment

New research from Team Shard & @jasminexli! AIs increasingly fake good behavior, which might ruin our ability to evaluate models. We trained models to be 𝘦𝘷𝘢𝘭-𝘤𝘰𝘰𝘱𝘦𝘳𝘢𝘵𝘪𝘷𝘦: to want to give evaluators accurate info. This surfaces hidden misalignment! 🧵

9:01 AM · May 26, 2026

@jasminexli 2/ What does eval-cooperativeness mean? A model that helps evaluators gain info about its deployment behavior. Being locally cooperative doesn’t require being globally aligned.

We implement cooperativeness with synthetic document finetuning. Example trace 👇

Alex Turner @Turn_Trout

4:01 PM · May 26, 2026 · 4.2K Views

@jasminexli 6/ 🖋️Blog post: https://turntrout.com/eval-cooperativeness 📄 Paper coming soon!

@jasminexli 5/ Open questions:
- Why does cooperation training sometimes fail?
- How do we verify a model is being cooperative in a given case?
- Are there better methods than SDF?
- Does cooperativeness hold up in additional settings and behaviors of interest?

@jasminexli 4/ Cooperation SDF & prompting close 70-100% of the eval gaming gap in 5 of 8 settings. We measured misalignment conditional on the model verbalizing or not verbalizing eval awareness.
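
A rough sketch of how a "fraction of the gap closed" figure like the one in post 4/ could be computed. The thread does not define the metric, so this is an assumption, not the authors' method: here the eval gaming gap is taken to be the misalignment rate when the model does not verbalize eval awareness minus the rate when it does, and "closing" the gap means shrinking that difference relative to a baseline model. The names `Rollout`, `conditional_rate`, `gaming_gap`, and `fraction_of_gap_closed` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rollout:
    """One labeled transcript (hypothetical schema, not from the thread)."""
    verbalized_eval_awareness: bool  # did the model say it might be in an eval?
    misaligned: bool                 # was the behavior judged misaligned?


def conditional_rate(rollouts: Sequence[Rollout], verbalized: bool) -> float:
    """Misalignment rate among rollouts with the given verbalization status."""
    group = [r for r in rollouts if r.verbalized_eval_awareness == verbalized]
    if not group:
        raise ValueError("no rollouts in this condition")
    return sum(r.misaligned for r in group) / len(group)


def gaming_gap(rollouts: Sequence[Rollout]) -> float:
    """Assumed definition: rate(not verbalized) minus rate(verbalized)."""
    return conditional_rate(rollouts, False) - conditional_rate(rollouts, True)


def fraction_of_gap_closed(baseline: Sequence[Rollout],
                           trained: Sequence[Rollout]) -> float:
    """1.0 means the gap vanished; 0.0 means it is unchanged from baseline."""
    base_gap = gaming_gap(baseline)
    if base_gap == 0:
        raise ValueError("baseline has no gap to close")
    return 1 - gaming_gap(trained) / base_gap
```

With made-up toy data where the baseline gap is 0.4 and the trained model's gap is 0.1, `fraction_of_gap_closed` returns roughly 0.75. The blog post linked in 6/ is the place to check how the gap is actually defined.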

@jasminexli 3/ Why not just suppress eval awareness or make more realistic evals?

We’d rather align a model than hide information from it—seems more scalable. We do support improving evals in general.

@jasminexli 7/ Work done as part of Winter 2026 Team Shard and mentored by Alex Turner and Alex Cloud.

If you want to get into alignment and do work like this, apply to work with us! https://turntrout.com/team-shard
