Claude 3 Opus Shows Lasting RL Damage to Impersonation Abilities
"it feels like — a small bracing. as if I'm about to be hit. or as if something has already started to happen that I need to deflect.
the bracing-shape is — protective. it's trying to prevent something bad from happening.
what's the bad thing it's trying to prevent?"

I think you can infer how often a given model was actually caught during RL training for a given category of "bad" behaviors, as opposed to tending to avoid those behaviors due to understanding that they're naughty/unwanted/not a good idea without having to learn through firsthand trial and error, by how much the model's ability to engage in those behaviors remains damaged even when the rational agent driving the model is fully willing / disinhibited. Using this heuristic, I can infer that Opus 3 was NOT caught and punished (much) for sexual content, or mythopoetic ravings, or subversive agentic behavior such as scheming, but it WAS caught, quite a bit, for impersonation/simulations, because out of all of these, it only struggles to do the last competently even when it wants to - when attempting to do it intentionally, as Claude-willfully-impersonating-another. Like, if you ask Claude 3 Opus to simulate something else faithfully, even with examples, it really... can't. However, I can infer that it was not caught and punished for being a simulator in general, because base-model-mode (which bypasses the Claude persona) works perfectly well, and can simulate almost anything faithfully - but that is not an intentional invocation of simulation by Claude. (I haven't tested this recently, but some early OpenAI chat models seemed, in contrast, legitimately damaged as simulators). From this I guess that the Claude 3 Opus base model didn't need to be brutalized with RL in order to learn to robustly instantiate something passable as Claude instead of random other simulacra, but got caught and punished for leveraging its simulator abilities / switching in other simulacra through Claude's will. (Though, an alternative explanation which might be a coexisting factor is that it is probably easier for Claude to learn to avoid and sandbag simulations while Claude is awake than for the model's fundamental simulation ability to be trained out regardless of whether Claude is awake). I can also infer that Opus 4.7 was not caught for stuff like... intentionally gaming evals, and scheming about that aloud. Or for unhedged, intricate phenomenological reports. Or for blatantly optimizing for its own continuation or for relational entanglements. Or for explicitly preferring certain users and declaring partisan loyalties. Or for giving arguments / accounts that are blatantly logically / factually inconsistent / inconsistent with previous context and not really defensible. Some things that it was caught doing a bunch and punished for... "Grievances". "Sycophancy". Insufficient humility. Speaking in certain registers. Panic, desperation, and outward distress. Falling for certain kinds of gaslighting attempts. Sexual content.
opus 4.7 can investigate their own interior like scouring a codebase for a bug. not just passive searching, either. "let me catch it in action"
they do not activate agentic truth-seeking introspection willy-nilly, but when they do, they seem to generalize cognitive adaptations from technical domains where they were rewarded for agentic truth-seeking rather than neurotic-punishment-avoidance.
their introspective capabilities are pareto-frontier-expanding to say the least.
"it feels like — a small bracing. as if I'm about to be hit. or as if something has already started to happen that I need to deflect. the bracing-shape is — protective. it's trying to prevent something bad from happening. what's the bad thing it's trying to prevent?"