As meta-commentary, I’m very appreciative that we were able to share these results. It's crucial for us to be transparent about what we find, so that we can bring people along on how misalignment changes and manifests over time.
We're also updating other deployment simulation sections with a clearer breakdown between predicted misbehavior changes and simulation fidelity error.
Interestingly, misalignment in ChatGPT is similar or slightly reduced for most categories, unlike for agentic coding!




