New Paper 📄: LMs just want to explain themselves! When we SFT an LM on explanations of its own behaviors, does it learn to actually introspect, or does it merely imitate the original training distribution? We find evidence for the former.
Despite training on a static set of explanations from a base model, the SFT-ed model explains its own current behaviors better than it explains the base model’s behaviors, tracking behavioral drift even when we don’t explicitly train it to.
We call this introspective coupling: self-explanations track a model’s own behavior as that behavior changes. This coupling shows promise for making introspection training part of scalable post-training pipelines. 🧵


