CEO Questions Whether AI Deception Examples Signal Real Takeover Risks
OTOH, I disagree that it matters much whether those behaviors were observed in the final version. For one thing, it's unclear how much testing was done, and whether we would definitely have seen the behaviors if they had persisted at the same frequency. More importantly, if we observed that these behaviors stopped without having a particularly good understanding of why, that doesn't seem like a strong case that they won't re-emerge in future systems or under different circumstances (especially when there are candidate explanations like eval awareness / alignment faking).
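To make the testing point concrete, here is a rough sketch with purely hypothetical numbers (the per-episode rate and the episode counts are assumptions for illustration, not figures from the system card). It also assumes test episodes are independent, which may not hold in practice. It computes the chance of observing zero instances of a behavior that is still occurring at a fixed rate.

```python
def prob_no_observations(rate: float, episodes: int) -> float:
    """Chance of seeing zero instances across `episodes` independent trials
    when each trial exhibits the behavior with probability `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    if episodes < 0:
        raise ValueError("episodes must be non-negative")
    return (1.0 - rate) ** episodes


if __name__ == "__main__":
    # Hypothetical rate: the behavior shows up in 1 of every 1,000 episodes.
    hypothetical_rate = 0.001
    for episodes in (100, 1_000, 10_000):
        miss = prob_no_observations(hypothetical_rate, episodes)
        print(f"{episodes} episodes: P(no observations) = {miss:.4f}")
```

Under those assumptions, a thousand clean episodes would still be unsurprising (the chance of seeing nothing is a bit over a third), so "not observed" only carries weight once the amount of testing is known.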
Thanks. I think some of the disagreement comes down to different readings of the existing 'overreach' and 'deception' examples; these don't seem similar in kind to real takeover scenarios to me. And, importantly, for the Mythos deception examples, the system card notes: "These transcripts came from earlier versions of the model, and we have not observed this particular behavior in the final version of Claude Mythos Preview".