Anthropic Paper Shows LLMs Gain Self-Recognition via Post-Training
One somewhat surprising finding here is that on-policy RL is not required to instill self-recognition! SFT is sufficient, and (off-policy) DPO adds some more juice.
Evidence that post-training gives models a "self-recognition" capability, manifesting as higher confidence when continuing their own text than when reading others' text. I think this opens up an exciting line of inquiry into the emergence of "selfhood" in models via post-training!
3:53 AM · May 26, 2026 · 18.3K Views
4:07 AM · May 26, 2026 · 1.2K Views
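For readers wondering how "higher confidence on their own text" could be turned into a number, here is a rough sketch. It assumes confidence means the average per-token log-probability a model assigns to a text; that is my assumption, not necessarily the metric the paper uses. The function names and the log-prob values are hypothetical and made up purely for illustration.

```python
def mean_logprob(token_logprobs):
    """Average per-token log-probability of one text under the model."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


def self_recognition_gap(own_logprobs, other_logprobs):
    """Mean log-prob on the model's own text minus mean log-prob on others' text.

    A positive gap means the model is, on average, more confident on its own
    text. This is an assumed proxy, not the paper's confirmed definition.
    """
    return mean_logprob(own_logprobs) - mean_logprob(other_logprobs)


# Hypothetical per-token log-probs (illustrative numbers, not real results).
own = [-0.2, -0.4, -0.3]      # tokens from text the model wrote itself
other = [-1.0, -1.4, -1.2]    # tokens from text written by someone else

gap = self_recognition_gap(own, other)
print(f"self-recognition gap: {gap:.2f} nats per token")
```

Under this reading, you would compute the gap for a base model and for its SFT and DPO checkpoints and compare them, with a growing gap after post-training matching the claim above.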