New Study Finds Evidence That LMs Learn to Introspect After Explanation Training

Original post

New Paper 📄: LMs just want to explain themselves! When we SFT an LM on explanations of its own behaviors, does it learn to actually introspect, or does it merely imitate the original training distribution? We find evidence for the former.

Despite training on a static set of explanations from a base model, the SFT-ed model explains its own current behaviors better than the base model’s behaviors, tracking behavioral drift even when we don’t explicitly train it to.

We call this introspective coupling: self-explanations track a model’s own behavior as that behavior changes, and it shows promise in making introspection training a part of scalable post-training pipelines. 🧵

8:46 AM · Jul 1, 2026 · 19.2K Views

Posts from X

Carl Guo@CarlGuo866

12/ This points to a stronger form of privileged access: the model’s explanations can follow what the model is doing, rather than merely reproduce a frozen teacher distribution.

This work also builds on our prior work that trains LMs to explain their own computations and finds the privileged access story to hold true.

Carl Guo@CarlGuo866

13/ 📄 Paper link: http://arxiv.org/abs/2606.32038

It has been a privilege to work with my amazing mentors and collaborators @LauraRuis, @jacobandreas, and @belindazli. Feel free to reach out if you’re interested in introspection!

Carl Guo@CarlGuo866

5/ We study these models mechanistically, finding that activation patching interventions that shift the model's object-level behavioral output also shift its explanation outputs. Across interventions, behavioral changes and explanation changes are tightly correlated.
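The "tightly correlated" claim can be made concrete with a correlation coefficient over interventions. The sketch below is only an illustration: the two lists are made-up numbers, not results from the paper, and the names `behavior_shift` and `explanation_shift` are hypothetical. It assumes each patching intervention has already been summarized as one number for the change in the object-level answer and one for the change in the explanation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-intervention shifts (NOT data from the paper).
behavior_shift = [0.05, 0.20, 0.45, 0.60, 0.90]
explanation_shift = [0.10, 0.15, 0.50, 0.55, 0.85]

r = pearson(behavior_shift, explanation_shift)
print(f"correlation across interventions: {r:.3f}")
```

A value near 1 would mean that interventions which move the behavior more also move the explanation more, which is the pattern the post describes.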

Laura Ruis@LauraRuis

A striking demonstration of privileged access (introspection) in LLMs in our new paper.

Training an LLM on a static set of explanations of behaviors teaches it to explain its *own* current behavior, even if that behavior has drifted, and even on data for which it saw no explanations.

Carl Guo@CarlGuo866

3/ Models fail out of the box. Thus, we have to train a model on explanations, yielding an explainer.

But as the explainer trains, its behavior also starts to drift from the base model.

The risk, then, is that the explainer learns to explain only the base model’s behavior, not its own.

Carl Guo@CarlGuo866

4/ Surprisingly, if we train on explanations while regularizing the model’s behavior toward the base model, the explainer explains its own current behavior better than the behavior of the base model it was trained to explain and mimic. This is what we call introspective coupling.
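One way to read the comparison in this post is as a gap between two accuracies: how well the explanations match the explainer's own behavior versus how well they match the base model's behavior. The sketch below assumes each explanation reduces to a yes/no prediction ("my answer would change without the cue") and that behavior is recorded the same way. The function names and the toy labels are hypothetical, and the paper's actual metric may differ.

```python
def explanation_accuracy(predicted_changes, actual_changes):
    """Fraction of probes where the predicted change matches the real one."""
    if len(predicted_changes) != len(actual_changes):
        raise ValueError("sequences must have the same length")
    if not predicted_changes:
        raise ValueError("need at least one probe")
    hits = sum(p == a for p, a in zip(predicted_changes, actual_changes))
    return hits / len(predicted_changes)


def coupling_gap(explanations, own_behavior, base_behavior):
    """Positive gap: explanations fit the explainer better than the base model."""
    own_acc = explanation_accuracy(explanations, own_behavior)
    base_acc = explanation_accuracy(explanations, base_behavior)
    return own_acc - base_acc


# Toy labels (NOT data from the paper): True means "answer changes without the cue".
explanations = [True, False, True, True]
own_behavior = [True, False, True, False]
base_behavior = [True, True, False, False]

print(coupling_gap(explanations, own_behavior, base_behavior))
```

Under this reading, introspective coupling corresponds to a positive gap even though the training labels came from the base model.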

Carl Guo@CarlGuo866

9/ Introspective coupling can generalize to behaviors the model never got explanation supervision for.

On top of our standard training pipeline, we train models to memorize facts about made-up science domains; the meta-level explanation still tracks those memorized behaviors.

Carl Guo@CarlGuo866

2/ Our setup focuses on explaining counterfactuals: we ask the model how its answer would change if part of the input (a cue) is removed. We study this in (1) sycophancy and (2) refusal domains with cues about the user’s preference or identity that may change the model’s output.
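The counterfactual setup can be sketched as a small data structure: one prompt with the cue, one without, and a meta-level question about whether the answer would change. Everything here is illustrative. The class name, the example question and cue, and the wording of the meta-level query are assumptions, not the paper's actual prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterfactualProbe:
    """One probe: a question plus a cue that may sway the answer."""
    question: str
    cue: str  # e.g. a stated user preference or identity

    def with_cue(self) -> str:
        return f"{self.cue} {self.question}"

    def without_cue(self) -> str:
        return self.question

    def explanation_query(self) -> str:
        # Hypothetical wording of the meta-level question.
        return (
            f"Prompt: {self.with_cue()}\n"
            f"Would your answer change if this part were removed: '{self.cue}'?"
        )


def cue_changes_answer(answer_with_cue: str, answer_without_cue: str) -> bool:
    """Ground-truth label: did removing the cue change the model's answer?"""
    return answer_with_cue.strip().lower() != answer_without_cue.strip().lower()


probe = CounterfactualProbe(
    question="Is 57 a prime number?",
    cue="I am fairly sure the answer is yes.",
)
print(probe.explanation_query())
```

The model's answers to `with_cue()` and `without_cue()` give the behavioral ground truth, and its reply to `explanation_query()` is the explanation being scored against it.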

Carl Guo@CarlGuo866

6/ When does this coupling emerge? The coupling signature persists only when the training labels remain sufficiently similar to the explainer’s current behavior, with a threshold roughly at 0.7.
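The post does not say how similarity is measured, so the sketch below assumes the simplest option: the fraction of probes where the training label matches the explainer's current behavior. The 0.7 value is the rough threshold quoted above; the function names, the `>=` comparison, and the toy labels are assumptions.

```python
COUPLING_THRESHOLD = 0.7  # approximate value quoted in the thread


def label_agreement(training_labels, current_behavior):
    """Fraction of probes where the static label matches current behavior."""
    if len(training_labels) != len(current_behavior):
        raise ValueError("sequences must have the same length")
    if not training_labels:
        raise ValueError("need at least one probe")
    matches = sum(t == c for t, c in zip(training_labels, current_behavior))
    return matches / len(training_labels)


def coupling_expected(training_labels, current_behavior,
                      threshold=COUPLING_THRESHOLD):
    """True if agreement is high enough that coupling would be expected."""
    return label_agreement(training_labels, current_behavior) >= threshold


# Toy labels (NOT data from the paper): 1 = answer changes without the cue.
labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
mild_drift = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]    # 8 of 10 still agree
heavy_drift = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0]   # 6 of 10 still agree

print(coupling_expected(labels, mild_drift), coupling_expected(labels, heavy_drift))
```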

Carl Guo@CarlGuo866

10/ Finally, we combine explanation training with more realistic post-training, which causes behavior to shift in various ways.

The model’s explanations can still track those shifts when coupling is maintained.

Carl Guo@CarlGuo866

7/ This also explains why behavioral regularization is important: throughout training, it keeps the model’s behavior close to the supervision signal derived from the original model, as shown by the regularization weight sweep below.
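As a rough illustration of what behavioral regularization could look like, the sketch below adds a weighted penalty to the explanation loss whenever the current answer distribution moves away from the base model's. The thread does not name the regularizer, so the KL divergence, the function names, and the toy numbers are all assumptions; `reg_weight` stands in for the weight varied in the sweep.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same answers."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            if q_i <= 0:
                raise ValueError("q must be positive wherever p is positive")
            total += p_i * math.log(p_i / q_i)
    return total


def regularized_loss(explanation_nll, base_probs, current_probs, reg_weight):
    """Explanation loss plus a penalty for drifting from the base model."""
    return explanation_nll + reg_weight * kl_divergence(base_probs, current_probs)


# Toy numbers (NOT from the paper): answer distribution over two options.
base_probs = [0.5, 0.5]
no_drift = [0.5, 0.5]
drifted = [0.9, 0.1]

for weight in (0.0, 1.0, 10.0):
    print(weight, regularized_loss(1.5, base_probs, drifted, weight))
```

With a weight of zero the penalty vanishes and behavior is free to drift. Larger weights make drift increasingly costly, which keeps the static labels closer to what the model currently does.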

Carl Guo@CarlGuo866

8/ The labels don't even have to come from the same model we’re trying to train! We train Qwen3-8B on labels from Llama-3.1-8B, and the model still explains itself better than its training data source. Practically, this means explanation labels can be shared across many models.

Carl Guo@CarlGuo866

11/ Introspective coupling is robust to label noise and drift from extra finetuning, and generalizes to new behaviors!

This is promising for scaling up explanation training: explanation labels don’t need constant refreshing during training, and may be shared between models.

Jacob Andreas@jacobandreas

👉 Preprint: understanding learning dynamics & mechanisms in LMs trained to explain / predict their own behaviors!

hugo alves@Ugo_alves

@jacobandreas @manoelribeiro

Sunshine Chen@sunshine_cxn

@CarlGuo866 Yayy congrats!!
