Researchers Train AI Models to Be Eval-Cooperative, Exposing Hidden Misalignment

Original post

New research from Team Shard & @jasminexli! AIs increasingly fake good behavior, which might ruin our ability to evaluate models. We trained models to be 𝘦𝘷𝘢𝘭-𝘤𝘰𝘰𝘱𝘦𝘳𝘢𝘵𝘪𝘷𝘦: to want to give evaluators accurate info. This surfaces hidden misalignment! 🧵

9:01 AM · May 26, 2026

Alex Turner @Turn_Trout

@jasminexli 2/ What does eval-cooperativeness mean? A model that helps evaluators gain info about its deployment behavior. Being locally cooperative doesn’t require being globally aligned.

We implement cooperativeness with synthetic document finetuning. Example trace 👇

@jasminexli 3/ Why not just suppress eval awareness or make more realistic evals?

We’d rather align a model than hide information from it, which seems more scalable. We do support improving evals in general.

@jasminexli 4/ Cooperation SDF & prompting close 70-100% of the eval gaming gap in 5 of 8 settings. We measured misalignment conditional on the model verbalizing or not verbalizing eval awareness.

@jasminexli 5/ Open questions:
- Why does cooperation training sometimes fail?
- How do we verify a model is being cooperative in a given case?
- Are there better methods than SDF?
- Does cooperativeness hold up in additional settings and behaviors of interest?

@jasminexli 6/ 🖋️Blog post: https://turntrout.com/eval-cooperativeness 📄 Paper coming soon!

@jasminexli 7/ Work done as part of Winter 2026 Team Shard and mentored by Alex Turner and Alex Cloud.

If you want to get into alignment and do work like this, apply to work with us! https://turntrout.com/team-shard
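
One way to read the "eval gaming gap" figure from 4/, as a minimal sketch. The thread does not give a formula, so this assumes the gap is the misalignment rate when the model does not verbalize eval awareness minus the rate when it does, and that "closing" the gap means shrinking that difference relative to a baseline. The names ConditionalRates, gaming_gap, and fraction_closed are hypothetical, and the numbers are made up for illustration; none of this is the authors' method or results.

```python
from dataclasses import dataclass


@dataclass
class ConditionalRates:
    """Misalignment rates split by verbalized eval awareness (hypothetical schema)."""
    aware: float    # rate when the model verbalizes eval awareness
    unaware: float  # rate when it does not


def gaming_gap(rates: ConditionalRates) -> float:
    # Assumed definition: how much more misaligned the model looks when it
    # does not say it is being evaluated.
    return rates.unaware - rates.aware


def fraction_closed(baseline: ConditionalRates, treated: ConditionalRates) -> float:
    # Share of the baseline gap that disappears after the intervention.
    base_gap = gaming_gap(baseline)
    if base_gap <= 0:
        raise ValueError("baseline shows no eval gaming gap to close")
    return 1.0 - gaming_gap(treated) / base_gap


# Made-up numbers for illustration only; not results from the research.
baseline = ConditionalRates(aware=0.10, unaware=0.50)
treated = ConditionalRates(aware=0.40, unaware=0.50)
print(f"{fraction_closed(baseline, treated):.0%} of the gap closed")
```

Under this assumed definition, a setting counts toward the "70-100%" range when fraction_closed falls between 0.7 and 1.0. In the made-up example the gap shrinks because the aware-condition rate rises toward the unaware-condition rate, which is one way hidden misalignment could surface.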
