Johns Hopkins University researchers publish paper showing Natural Language Autoencoders fail to faithfully interpret steered LLM activations
Anthropic introduced NLAs to translate model activations into text.
I think this thread has an interesting set of observations about LLM activation steering that I wouldn't have predicted a priori.
It's also an excellent example of a paper thread: straight to the point, with none of the stupid cliffhangers I see in every other abstract/thread lately!
NLAs are claimed to verbalize model activations. But can they faithfully interpret steered activations? In our latest paper, we show that steering moves activations into non-invertible regions; and almost surely, no prompt maps to steered activations! NLAs fail to interpret steered activation states faithfully, supporting our results! @anqi_liu33 @DanielKhashabi
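A toy sketch can make the "no prompt maps to steered activations" claim concrete. This is not the paper's construction: the tiny tanh recurrence below, the sizes `VOCAB`, `MAX_LEN`, and `DIM`, and the steering strength are all assumptions made for illustration. Because tanh keeps every prompt-reachable activation inside [-1, 1] in each coordinate, a strong enough additive steering vector pushes the state outside that range, where no prompt of any length can land.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen arbitrarily for illustration.
VOCAB, MAX_LEN, DIM = 5, 4, 8
E = rng.normal(size=(VOCAB, DIM))                # hypothetical token embeddings
W = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)   # hypothetical recurrent weights


def activation(prompt):
    """Toy stand-in for a model: fold a token sequence into one hidden vector."""
    h = np.zeros(DIM)
    for tok in prompt:
        h = np.tanh(W @ h + E[tok])
    return h


# Every activation produced by some prompt of length 1..MAX_LEN.
prompts = [p for n in range(1, MAX_LEN + 1)
           for p in itertools.product(range(VOCAB), repeat=n)]
reachable = np.stack([activation(p) for p in prompts])


def nearest_prompt_distance(h):
    """Euclidean distance from h to the closest enumerated prompt activation."""
    return float(np.min(np.linalg.norm(reachable - h, axis=1)))


base = activation((0, 1, 2))

# Additive steering: add a scaled unit vector to the activation.
steer = rng.normal(size=DIM)
steer /= np.linalg.norm(steer)
steered = base + 8.0 * steer  # strength 8.0 is an assumption, large enough to exit tanh's range

print(nearest_prompt_distance(base))     # 0.0: the prompt (0, 1, 2) produces it exactly
print(nearest_prompt_distance(steered))  # strictly positive: no enumerated prompt produces it
print(bool(np.any(np.abs(steered) > 1))) # True: outside tanh's range, so no prompt of any length can
```

In a real transformer the argument is subtler (the thread's claim is "almost surely", not a simple range bound), so treat this only as intuition for why a steered state need not correspond to any input text.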
So, you elicited an unsafe behavior via activation steering. Does that imply the same behavior can be elicited from the model in black-box form (i.e., via some prompt)?
Our answer: No.
Why? See the answer here!