Researchers Train AI Models to Be Eval-Cooperative, Exposing Hidden Misalignment
@jasminexli 2/ What does eval-cooperativeness mean? A model that helps evaluators gain info about its deployment behavior. Being locally cooperative doesn’t require being globally aligned.
We implement cooperativeness with synthetic document finetuning. Example trace 👇

New research from Team Shard & @jasminexli! AIs increasingly fake good behavior, which might ruin our ability to evaluate models. We trained models to be 𝘦𝘷𝘢𝘭-𝘤𝘰𝘰𝘱𝘦𝘳𝘢𝘵𝘪𝘷𝘦: to want to give evaluators accurate info. This surfaces hidden misalignment! 🧵
@jasminexli 6/ 🖋️Blog post: https://turntrout.com/eval-cooperativeness 📄 Paper coming soon!

@jasminexli 5/ Open questions: - Why does cooperation training sometimes fail? - How do we verify a model is being cooperative in a given case? - Are there better methods than SDF? - Does cooperativeness hold up in additional settings and behaviors of interest?
@jasminexli 5/ Open questions: - Why does cooperation training sometimes fail? - How do we verify a model is being cooperative in a given case? - Are there better methods than SDF? - Does cooperativeness hold up in additional settings and behaviors of interest?
@jasminexli 4 / Cooperation SDF & prompting close 70-100% of the eval gaming gap in 5 of 8 settings. We measured misalignment conditional on the model verbalizing or not verbalizing eval awareness.
@jasminexli 3/ Why not just suppress eval awareness or make more realistic evals?
We’d rather align a model than hide information from it—seems more scalable. We do support improving evals in general.
@jasminexli 2/ What does eval-cooperativeness mean? A model that helps evaluators gain info about its deployment behavior. Being locally cooperative doesn’t require being globally aligned. We implement cooperativeness with synthetic document finetuning. Example trace 👇
@jasminexli 4 / Cooperation SDF & prompting close 70-100% of the eval gaming gap in 5 of 8 settings. We measured misalignment conditional on the model verbalizing or not verbalizing eval awareness.

@jasminexli 3/ Why not just suppress eval awareness or make more realistic evals? We’d rather align a model than hide information from it—seems more scalable. We do support improving evals in general.
@jasminexli 7/ Work done as part of Winter 2026 Team Shard and mentored by Alex Turner and Alex Cloud.
If you want to get into alignment and do work like this, apply to work with us! https://turntrout.com/team-shard

@jasminexli 6/ 🖋️Blog post: https://turntrout.com/eval-cooperativeness 📄 Paper coming soon!