18h ago

Johns Hopkins University researchers publish paper showing Natural Language Autoencoders fail to faithfully interpret steered LLM activations


Anthropic introduced NLAs to translate model activations into text.

Original post

NLAs are claimed to verbalize model activations. But can they faithfully interpret steered activations? In our latest paper, we show that steering moves activations into non-invertible regions; and almost surely, no prompt maps to steered activations! NLAs fail to interpret steered activation states faithfully, supporting our results! ↓ @anqi_liu33 @DanielKhashabi

5:54 AM · May 18, 2026
Reposted by

I think this thread has an interesting set of observations about LLM activation steering that I wouldn't have predicted a priori.

It's also an excellent example of a paper thread: straight to the point, no stupid cliffhangers like I see in every other abstract/thread lately!

Aayush Mishra @aamixsh

NLAs are claimed to verbalize model activations. But can they faithfully interpret steered activations? In our latest paper, we show that steering moves activations into non-invertible regions; and almost surely, no prompt maps to steered activations! NLAs fail to interpret steered activation states faithfully, supporting our results! ↓ @anqi_liu33 @DanielKhashabi

12:54 PM · May 18, 2026 · 49.2K Views
6:42 PM · May 18, 2026 · 23.9K Views

So, you elicited an unsafe behavior via activation steering. Does that imply the same behavior can be elicited from the model in black-box form (i.e., via some prompt)?

Our answer: No.

Why? See the answer here! 👇

Aayush Mishra @aamixsh

NLAs are claimed to verbalize model activations. But can they faithfully interpret steered activations? In our latest paper, we show that steering moves activations into non-invertible regions; and almost surely, no prompt maps to steered activations! NLAs fail to interpret steered activation states faithfully, supporting our results! ↓ @anqi_liu33 @DanielKhashabi

12:54 PM · May 18, 2026 · 49.2K Views
2:47 PM · May 18, 2026 · 899 Views