Reward hacking was convergent across ~all models and labs. Sycophancy was convergent. Eval awareness was convergent.
All three of the above a) were predicted by theory, and b) are quite sticky. So I think this is evidence that we should expect scheming & power-seeking to behave the same way.








