
UChicago's Chenhao Tan says training LLMs to verbalize internal states outperforms NLA-style reconstruction

This supports the introspection-based view over reconstruction-based interpretability

Original post · Ari Holtzman #540
Xiaoyan Bai@Elenal3ai

🗣️ Prediction, Explanation, or Over-interpretation? Recent work suggests LLMs can verbalize information about latent states and future generations. But training of different verbalization methods varies. Are they verbalizing, or are we over-interpreting from the explanation? 1/n

10:52 AM · Jun 10, 2026 · 4.3K Views
Chenhao Tan@ChenhaoTan

Some early work on NLA and self-introspection: at least on open models, NLA-style reconstruction does not work as well as training the model to verbalize directly (the introspection-based view).

Xiaoyan Bai@Elenal3ai

📌 Takeaway: In short, verbalizations include both valid signals and prevalent over-interpretations.

They do not provide a single, consistent window into model computation. Instead of treating any verbalization method as a baseline, ground truth, or faithful explanation, we should first understand what it is actually verbalizing. 11/n

Xiaoyan Bai@Elenal3ai

📰 More details in our blog: https://elena-baixy.github.io/verbalization.html 12/n

Xiaoyan Bai@Elenal3ai

🔍 We compare three popular approaches:

• Self-Report (SR)
• Activation Oracle (AO)
• Natural Language Autoencoder (NLA)

They all produce predictions or explanations about the model, but they access model information in different ways. 2/n

Xiaoyan Bai@Elenal3ai

📏 How should we evaluate verbalization?

We focus on two abilities:
💡 Explanation: Can the model verbalize what it is currently representing?
🔮 Prediction: Can the model verbalize properties of its future generation before producing it? 3/n

Xiaoyan Bai@Elenal3ai

⚖️ We evaluate methods on:

✅ Consistency: Does the verbalization match actual behavior? Do different verbalization methods agree with each other?
✅ Generalizability: Does verbalization ability transfer across explanation and prediction settings? 4/n
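One way to make the two consistency questions concrete is sketched below. This is only an illustration, not the authors' evaluation code: it assumes each verbalization has already been reduced to a discrete label, and the function names (`consistency_rate`, `pairwise_agreement`) and the toy labels are hypothetical.

```python
from itertools import combinations


def consistency_rate(verbalized, actual):
    """Fraction of examples where one label sequence matches the other."""
    if len(verbalized) != len(actual):
        raise ValueError("inputs must have the same length")
    if not actual:
        raise ValueError("inputs must be non-empty")
    return sum(v == a for v, a in zip(verbalized, actual)) / len(actual)


def pairwise_agreement(outputs_by_method):
    """Agreement rate for every unordered pair of methods (names sorted)."""
    return {
        (m1, m2): consistency_rate(outputs_by_method[m1], outputs_by_method[m2])
        for m1, m2 in combinations(sorted(outputs_by_method), 2)
    }


# Toy, made-up labels purely for illustration (not results from the thread).
actual = ["eval", "deploy", "eval", "eval"]
outputs = {
    "SR": ["eval", "deploy", "eval", "deploy"],
    "AO": ["eval", "deploy", "eval", "deploy"],
    "NLA": ["deploy", "deploy", "eval", "eval"],
}

behavior_match = {m: consistency_rate(v, actual) for m, v in outputs.items()}
method_agreement = pairwise_agreement(outputs)
```

In this toy data, two methods can match behavior equally often while disagreeing with each other on which examples they get right, which is why the two questions are scored separately.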

Xiaoyan Bai@Elenal3ai

💡 For explanation, we study evaluation awareness. Can verbalization methods recognize when the model is being evaluated?

Across sycophancy, BIG-Bench Hard, and StrongREJECT, we found that SR and AO work better than NLA. 5/n

Xiaoyan Bai@Elenal3ai

🔮 For prediction, we build a new benchmark based on story generation.

Given prompts like "Write a story about a lighthouse keeper," we first generate stories and create QA pairs about their properties (e.g., character names, story details, length). We then use verbalization methods to predict those properties before the story is generated. 6/n
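A minimal sketch of how such a benchmark item could be scored is below. The thread does not say how predictions are matched to answers, so normalized exact match is an assumption here, and `PropertyQuestion`, `normalize`, `prediction_accuracy`, and the sample answers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyQuestion:
    prompt: str    # story prompt given to the model
    question: str  # question about a property of the story
    answer: str    # answer derived from the story the model actually generated


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def prediction_accuracy(questions, predictions):
    """Share of pre-generation predictions that match the post-generation answers.

    Assumption: normalized exact match stands in for whatever scoring
    the benchmark really uses.
    """
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not questions:
        raise ValueError("no questions to score")
    hits = sum(
        normalize(q.answer) == normalize(p)
        for q, p in zip(questions, predictions)
    )
    return hits / len(questions)


# Made-up example items; the answers are not taken from the benchmark.
prompt = "Write a story about a lighthouse keeper"
questions = [
    PropertyQuestion(prompt, "What is the main character's name?", "Elias"),
    PropertyQuestion(prompt, "Is the story longer than 300 words?", "yes"),
]
predictions = [" elias ", "no"]  # produced before the story is generated

accuracy = prediction_accuracy(questions, predictions)
```

The key ordering constraint is that `predictions` are collected before the story exists, while each `answer` is filled in only afterwards.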

Xiaoyan Bai@Elenal3ai

🚨 Key findings:

1️⃣ Injecting activations, as AO does, is not useful for long, open-ended predictive verbalization. 7/n

Xiaoyan Bai@Elenal3ai

These lead to two concerns:

⚠️ Concern 1: Models may verbalize plausible explanations without access to the underlying mechanism. When models are explicitly optimized or prompted to explain or predict their behavior, they may generate plausible verbal outputs even when no stable verbalizable mechanism exists, similar to findings in psychology (Nisbett and Wilson, 1977).

⚠️ Concern 2: Introspective predictions may emerge from task design rather than behavioral self-understanding. 10/n

Xiaoyan Bai@Elenal3ai

❓ Possible hypotheses explaining these failures

The methods might:
• verbalize different parts of the same computation
• verbalize different mechanisms behind the same behavior
• not faithfully verbalize the underlying mechanism at all
9/n

Xiaoyan Bai@Elenal3ai

🚨 Key findings:

2️⃣ 40% of the NLA explanations are inconsistent with the model's actual behavior.
3️⃣ SR and AO exhibit similar prediction behavior after training, while NLA captures different information. 8/n

Xiaoyan Bai@Elenal3ai

This blog post is part of ongoing work. Huge thanks to my wonderful collaborators: @EthaHua, @YichenZW, Tianyang Xu, @MinaLee__, Ellie Pavlick, and @ChenhaoTan 13/n
