🗣️ Prediction, Explanation, or Over-interpretation? Recent work suggests LLMs can verbalize information about latent states and future generations. But training of different verbalization methods varies. Are they verbalizing, or are we over-interpreting from the explanation? 1/n
UChicago's Chenhao Tan says training LLMs to verbalize internal states outperforms NLA-style reconstruction
This supports the introspection-based view over reconstruction-based interpretability
Some early work on NLA and self-introspection: at least on open models, NLA-style reconstruction does not work as well as training the model to verbalize directly (the introspection-based view).

📌 Takeaway: In short, verbalizations include both valid signals and prevalent over-interpretations.
They do not provide a single, consistent window into model computation. Instead of treating any verbalization method as a baseline, ground truth, or faithful explanation, we should first understand what it is actually verbalizing. 11/n

📰 More details in our blog: https://elena-baixy.github.io/verbalization.html 12/n

🔍 We compare three popular approaches:
• Self-Report (SR)
• Activation Oracle (AO)
• Natural Language Autoencoder (NLA)
They all produce predictions or explanations about the model, but they access model information in different ways. 2/n

📏 How should we evaluate verbalization?
We focus on two abilities:
💡 Explanation: Can the model verbalize what it is currently representing?
🔮 Prediction: Can the model verbalize properties of its future generation before producing it? 3/n

⚖️ We evaluate methods on:
✅ Consistency: Does the verbalization match actual behavior? Do different verbalization methods agree with each other?
✅ Generalizability: Does verbalization ability transfer across explanation and prediction settings? 4/n
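
A minimal sketch of how the two consistency checks could be scored, assuming each method's verbalization has already been reduced to one categorical label per prompt. The helper names (`match_rate`, `pairwise_agreement`) and the toy labels are hypothetical and invented for illustration; they are not the blog's code or results.

```python
from itertools import combinations


def match_rate(predicted, reference):
    """Fraction of positions where two equal-length label lists agree."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    if not reference:
        return 0.0
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def pairwise_agreement(outputs):
    """Agreement rate for every pair of methods, keyed by (method_a, method_b)."""
    return {
        (a, b): match_rate(outputs[a], outputs[b])
        for a, b in combinations(sorted(outputs), 2)
    }


# Toy labels, invented for illustration only (not results from the blog).
observed = ["eval", "deploy", "eval", "deploy"]  # what the model actually did
outputs = {
    "SR": ["eval", "deploy", "eval", "eval"],
    "AO": ["eval", "deploy", "deploy", "eval"],
    "NLA": ["deploy", "deploy", "eval", "eval"],
}

# Check 1: does each method's verbalization match actual behavior?
behavior_consistency = {
    name: match_rate(labels, observed) for name, labels in outputs.items()
}

# Check 2: do the methods agree with each other?
method_agreement = pairwise_agreement(outputs)

print(behavior_consistency)
print(method_agreement)
```

Keeping the two numbers separate matters here: two methods can agree with each other while both disagree with the model's actual behavior.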

💡 For explanation, we study evaluation awareness. Can verbalization methods recognize when the model is being evaluated?
Across sycophancy, BIG-Bench Hard, and StrongREJECT, we found that SR and AO work better than NLA. 5/n

🔮 For prediction, we build a new benchmark based on story generation.
Given prompts like "Write a story about a lighthouse keeper," we first generated stories and created QA pairs about their properties (e.g., character names, story details, length). We then used verbalization methods to predict those properties before the story was generated. 6/n
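
A rough sketch of how such a benchmark could be scored, assuming the QA pairs come from the story that was actually generated and that a method's prediction counts as correct on a normalized exact match. The `QAPair` type, the `prediction_accuracy` helper, the matching rule, and the example data are all assumptions made for illustration; the actual benchmark may score answers differently.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str  # ground truth, read off the story after it was generated


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def prediction_accuracy(qa_pairs, predictions):
    """Share of QA pairs whose pre-generation prediction matches the story."""
    if len(qa_pairs) != len(predictions):
        raise ValueError("need exactly one prediction per QA pair")
    if not qa_pairs:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(qa.answer)
        for qa, pred in zip(qa_pairs, predictions)
    )
    return correct / len(qa_pairs)


# Hypothetical example for the prompt "Write a story about a lighthouse keeper".
qa_pairs = [
    QAPair("What is the main character's name?", "Elias"),
    QAPair("Is the story longer than 300 words?", "yes"),
    QAPair("Where is the story set?", "a remote island"),
]
# Predictions a verbalization method made before the story was generated.
predictions = ["elias", "No", "A  remote island"]

accuracy = prediction_accuracy(qa_pairs, predictions)
print(f"{accuracy:.2f}")
```

Exact match is the simplest possible rule; open-ended properties such as story details would likely need a softer comparison.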

🚨 Key findings:
1️⃣ Injecting activations, as AO does, is not useful for long, open-ended predictive verbalization. 7/n

These findings lead to two concerns:
⚠️ Concern 1: Models may verbalize plausible explanations without access to the underlying mechanism. When models are explicitly optimized or prompted to explain or predict their behavior, they may generate plausible verbal outputs even when no stable verbalizable mechanism exists, similar to what Nisbett and Wilson (1977) reported for human self-reports in psychology.
⚠️ Concern 2: Introspective predictions may emerge from task design rather than behavioral self-understanding. 10/n

❓Possible hypotheses explaining these failures
The methods might:
• verbalize different parts of the same computation
• verbalize different mechanisms behind the same behavior
• not faithfully verbalize the underlying mechanism at all 9/n

🚨 Key findings:
2️⃣ 40% of the NLA explanations are inconsistent with the model’s actual behavior.
3️⃣ SR and AO exhibit similar prediction behavior after training, while NLA captures different information. 8/n

This blog post is part of ongoing work. Huge thanks to my wonderful collaborators: @EthaHua, @YichenZW, Tianyang Xu, @MinaLee__, Ellie Pavlick, and @ChenhaoTan 13/n