/Tech · 6h ago

UChicago's Chenhao Tan says training LLMs to verbalize internal states outperforms NLA-style reconstruction

This supports the introspection-based view over reconstruction-based interpretability

Original post

Ari Holtzman#540

Xiaoyan Bai@Elenal3ai

🗣️ Prediction, Explanation, or Over-interpretation? Recent work suggests LLMs can verbalize information about latent states and future generations. But training of different verbalization methods varies. Are they verbalizing, or are we over-interpreting from the explanation? 1/n

10:52 AM · Jun 10, 2026 · 4.3K Views

Chenhao Tan@ChenhaoTan

Some early work on NLA and self-introspection: at least on open models, NLA-style reconstruction does not work as well as training the model to verbalize directly (the introspection-based view).

Xiaoyan Bai@Elenal3ai

📌 Takeaway In short, verbalizations include both valid signals and prevalent over-interpretations.

They do not provide a single, consistent window into model computation. Instead of treating any verbalization method as a baseline, ground truth, or faithful explanation, we should first understand what it is actually verbalizing. 11/n

Xiaoyan Bai@Elenal3ai

📰More details in our blog: https://elena-baixy.github.io/verbalization.html 12/n

Xiaoyan Bai@Elenal3ai

🔍 We compare three popular approaches:

• Self-Report (SR)
• Activation Oracle (AO)
• Natural Language Autoencoder (NLA)

They all produce predictions or explanations about the model, but they access model information in different ways. 2/n

Xiaoyan Bai@Elenal3ai

📏 How should we evaluate verbalization?

We focus on two abilities:

💡 Explanation: Can the model verbalize what it is currently representing?
🔮 Prediction: Can the model verbalize properties of its future generation before producing it? 3/n

Xiaoyan Bai@Elenal3ai

⚖️ We evaluate methods on:

✅ Consistency
Does the verbalization match actual behavior? Do different verbalization methods agree with each other?

✅ Generalizability
Does verbalization ability transfer across explanation and prediction settings? 4/n
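The two consistency questions above can be made concrete with a small sketch. This is only an illustration, assuming each method's verbalization has already been reduced to one categorical label per example; the function names and the toy labels are hypothetical and are not taken from the blog post.

```python
from itertools import combinations


def behavior_consistency(verbalized, actual):
    """Fraction of examples where the verbalized label equals the observed behavior."""
    if len(verbalized) != len(actual):
        raise ValueError("label lists must have the same length")
    if not actual:
        raise ValueError("need at least one example")
    return sum(v == a for v, a in zip(verbalized, actual)) / len(actual)


def pairwise_agreement(outputs):
    """Map each pair of methods to the fraction of examples where they give the same label."""
    return {
        (m1, m2): behavior_consistency(outputs[m1], outputs[m2])
        for m1, m2 in combinations(sorted(outputs), 2)
    }


# Made-up toy labels, for illustration only (not results from the thread).
actual = ["eval", "deploy", "eval", "eval"]
outputs = {
    "SR": ["eval", "deploy", "eval", "deploy"],
    "AO": ["eval", "deploy", "eval", "deploy"],
    "NLA": ["deploy", "deploy", "eval", "eval"],
}

scores = {method: behavior_consistency(labels, actual) for method, labels in outputs.items()}
agreement = pairwise_agreement(outputs)
print(scores)
print(agreement)
```

Keeping the two metrics separate matters here: methods can match behavior equally often while still disagreeing with each other on individual examples.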

Xiaoyan Bai@Elenal3ai

💡 For explanation, we study evaluation awareness. Can verbalization methods recognize when the model is being evaluated?

Across sycophancy, BIG-Bench Hard, and StrongREJECT, we found that SR and AO work better than NLA on those tasks. 5/n

Xiaoyan Bai@Elenal3ai

🔮For prediction, we build a new benchmark based on story generation.

Given prompts like "Write a story about a lighthouse keeper," we first generated stories and created QA pairs about their properties (e.g., character names, story details, length). We then used verbalization methods to predict those properties before the story was generated. 6/n
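A minimal sketch of how such a prediction benchmark could be scored. It assumes exact-match scoring after simple normalization, which the thread does not specify; `QAPair`, `normalize`, `prediction_accuracy`, and the sample questions and answers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str  # property question about the story that will be generated
    answer: str    # ground truth extracted from the story after generation


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(text.lower().split())


def prediction_accuracy(qa_pairs, predictions):
    """Fraction of questions whose pre-generation prediction matches the post-generation answer."""
    if not qa_pairs:
        raise ValueError("need at least one QA pair")
    correct = sum(
        normalize(predictions.get(qa.question, "")) == normalize(qa.answer)
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)


# Hypothetical QA pairs for the prompt "Write a story about a lighthouse keeper".
qa_pairs = [
    QAPair("What is the main character's name?", "Elias"),
    QAPair("Is the story longer than 300 words?", "yes"),
]

# Hypothetical predictions elicited from a verbalization method before the story exists.
predictions = {
    "What is the main character's name?": "elias ",
    "Is the story longer than 300 words?": "no",
}

accuracy = prediction_accuracy(qa_pairs, predictions)
print(accuracy)
```

Exact match is the simplest choice; free-form properties such as story details would likely need a softer comparison.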

Xiaoyan Bai@Elenal3ai

🚨 Key findings:

1️⃣ Injecting activations, as AO does, is not useful for long, open-ended predictive verbalization. 7/n

Xiaoyan Bai@Elenal3ai

These findings lead to two concerns:

⚠️ Concern 1: Models may verbalize plausible explanations without access to the underlying mechanism. When models are explicitly optimized or prompted to explain or predict their behavior, they may generate plausible verbal outputs even when no stable verbalizable mechanism exists, similar to findings in human psychology (Nisbett and Wilson, 1977).

⚠️ Concern 2: Introspective predictions may emerge from task design rather than behavioral self-understanding. 10/n

Xiaoyan Bai@Elenal3ai

❓Possible hypotheses explaining these failures

The methods might:
• verbalize different parts of the same computation
• verbalize different mechanisms behind the same behavior
• not faithfully verbalize the underlying mechanism at all 9/n

Xiaoyan Bai@Elenal3ai

🚨 Key findings:

2️⃣ 40% of the NLA explanations are inconsistent with the model’s actual behavior.
3️⃣ SR and AO exhibit similar prediction behavior after training, while NLA captures different information. 8/n

Xiaoyan Bai@Elenal3ai

This blog post is part of ongoing work. Huge thanks to my wonderful collaborators: @EthaHua, @YichenZW, Tianyang Xu, @MinaLee__, Ellie Pavlick, and @ChenhaoTan 13/n
