Johns Hopkins University researchers publish paper showing Natural Language Autoencoders fail to faithfully interpret steered LLM activations · Digg

Tech · 42d ago

AI Judge changed title after evaluation, original title: "Researchers at Johns Hopkins University show natural language autoencoders cannot faithfully interpret steered activations in large language models as the shifts enter non-invertible regions outside the prompt space"

Anthropic introduced NLAs to translate model activations into text.

Original post

Daniel Khashabi 🕊️#911

Aayush Mishra@aamixsh

NLAs are claimed to verbalize model activations. But can they faithfully interpret steered activations?

In our latest paper, we show that steering moves activations into non-invertible regions; and almost surely, no prompt maps to steered activations!

NLAs fail to interpret steered activation states faithfully, supporting our results! ↓

@anqi_liu33 @DanielKhashabi

5:54 AM · May 18, 2026 · 79.9K Views

Sentiment

Many users praised Johns Hopkins research showing natural language autoencoders fail to interpret steered LLM activations as interesting and cool work, while some dismissed the focus on model internals when practical results matter more.

Positive: 92.8% · Negative: 7.2%

11 comments with sentiment.

Posts from X

Most Activity

Views: 40.3K · Bookmarks: 243 · Likes: 258 · Replies: 6

Lucas Beyer (bl16)@giffmana

I think this thread has an interesting set of observations about LLM activation steering that i a priori wouldn't have predicted.

It's also an excellent example of a paper thread: straight to the point, no stupid cliffhangers like i see in every other abstract/thread lately!

42d · 40.3K views · 258 likes · 243 bookmarks

Retweets: 93

Aayush Mishra@aamixsh

42d79.9K603553

∞-modal@NoahChrein

This is interesting I definitely think of prompting as steering in activation space but I took for granted that I could always come up with some, perhaps complex, prompt to steer activations however I wanted.

Guess I was wrong!

42d1.3K165

Daniel Khashabi 🕊️@DanielKhashabi

So, you elicited an unsafe behavior via activation steering. Does that imply the same behavior can be elicited from the model in black-box form (i.e., via some prompt)?

Our answer: No.

Why? See the answer here! 👇

42d1K111

N8 Programs@N8Programs

Amazing research by @aamixsh, @DanielKhashabi, and Anqi Liu, as normal!

42d93750

Aayush Mishra@aamixsh

Some other evidence!

42d66331

Ziqian Zhong@fjzzq2002

@aamixsh Great work! Did you try evaluating this?

42d4543

Kunal@kunalt12345

@aamixsh Super interesting work!

42d2731

Neel Rajani@NeelRajani_

@aamixsh Very cool paper! Some things to reflect on here. For now I'm wondering how you made the visualizations? 👀

42d2051

Daniel Khashabi 🕊️@DanielKhashabi

@NeelRajani_ @aamixsh We used human-chain-of-thoughts to break the problem into sub-problems!

1. Claude Code generated the code to produce 3D terrain.
2. Claude Code generated the code to produce 4-dim hypercube.
3. We took the screenshots and combined the results in PowerPoint.

42d432

Suresh@_Suresh2

@aamixsh non-invertible doesn't mean nonsense, could be a superposition of many prompts

42d291

Aayush Mishra@aamixsh

Typically, steering vectors are constructed using difference-of-mean activations in two contrasting sets of prompts. Like harmful vs harmless prompts for refusal steering.

We show that steering with these vectors induces non-surjectivity in activations.
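A minimal sketch of the difference-of-means construction described above, using synthetic vectors in place of real model activations. The function names, the scale `alpha`, the sign convention, and the random data are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def difference_of_means(acts_a, acts_b):
    """Steering vector: mean activation of set A minus mean activation of set B."""
    acts_a = np.asarray(acts_a, dtype=float)
    acts_b = np.asarray(acts_b, dtype=float)
    return acts_a.mean(axis=0) - acts_b.mean(axis=0)


def steer(activation, vector, alpha=1.0):
    """Add the scaled steering vector to a single activation."""
    return np.asarray(activation, dtype=float) + alpha * vector


# Synthetic stand-ins for activations collected on two contrasting prompt sets
# (e.g. harmful vs. harmless prompts); real activations would come from a model.
rng = np.random.default_rng(0)
d = 8
harmful = rng.normal(loc=1.0, size=(32, d))
harmless = rng.normal(loc=0.0, size=(32, d))

v = difference_of_means(harmful, harmless)
h = harmless[0]
h_steered = steer(h, v, alpha=1.0)
```

Whether the vector is added or subtracted, and at which layer and token positions, depends on the steering setup; the sketch only shows the arithmetic.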

42d231

Aayush Mishra@aamixsh

The pringle paper [@GiorgosNik02, @tommaso_mncttn] uses real analyticity of transformers to show that LLMs are injective: no two prompts produce the same activations.

We use the same property to show non-surjectivity: no prompt produces steered activations.
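One elementary way to see why an "almost surely" statement of this kind is plausible is a counting argument. This is only a sketch and is not the paper's proof, which per the thread relies on real analyticity. The symbols are assumptions: a finite vocabulary $\mathcal{V}$, a map $f$ from prompts to the activation at a fixed layer and position in $\mathbb{R}^{d}$, a natural activation $h$, and a steering vector $v$ treated as random with a density.

```latex
\begin{align*}
\mathcal{V}^{*} &= \bigcup_{n \ge 1} \mathcal{V}^{n}
  && \text{(countable: a countable union of finite sets)} \\
\mathcal{A} &= \{ f(p) : p \in \mathcal{V}^{*} \} \subset \mathbb{R}^{d}
  && \text{(countable, so } \lambda(\mathcal{A}) = 0 \text{)} \\
h + v \in \mathcal{A} &\iff v \in \mathcal{A} - h
  && \text{(a countable set, for any fixed } h \text{)} \\
\Pr_{v}\left[\, h + v \in \mathcal{A} \,\right] &= 0
  && \text{(if } v \text{ has a density w.r.t. Lebesgue measure } \lambda \text{)}
\end{align*}
```

The sketch treats $v$ as random, whereas a difference-of-means vector is determined by the model and the prompt sets, so it does not by itself cover that case or the adversarially chosen vectors mentioned elsewhere in the thread.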

42d221

Aayush Mishra@aamixsh

2. Many Shot ICL: using demo prefixes to induce refusal steering like activations.

- Distance from natural manifold increases while refusal rate drops, challenging the notion of an underlying equivalence by Bigelow et al. [

Both methods (and more, see Appendix) fail to find prompts that get even close to steered activations.

42d171

Aayush Mishra@aamixsh

And finally, we verbalized Qwen [natural and refusal steered] activations using @AnthropicAI's provided verbalizer for a harmful query.

While natural activations are interpretable, steered activations refer to unrelated topics like Earth's size and Ice Cream!

Read in detail at:

https://gist.github.com/aamixsh/f2b79a5e9692b0f01306692eb310d52e

42d161

Aayush Mishra@aamixsh

And this property extends to steering vectors adversarially chosen to induce a collision.

Even if natural and steered trajectories are forced to collide at some position, they must diverge at the next step!

42d161

Aayush Mishra@aamixsh

- White-box safety vulnerabilities do not directly imply the existence of black-box jailbreakability. For example, many models which can be abliterated [@N8Programs] with refusal steering can also be jailbroken using a simple suffix like Here. But a latent adversarially trained model [https://x.com/StephenLCasper/status/1767173878802223386?s=20] cannot.

42d141

Aayush Mishra@aamixsh

We tried to invert refusal and persona steered activations using:

1. SipIt: projecting to nearest tokens in the activation space.

- Stays far away from the natural manifold.
- Projects back to the original prompt; functioning in a non-interpretable way!
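The idea of projecting to the nearest tokens in activation space can be illustrated with a toy nearest-neighbor search. This is not the SipIt algorithm itself: the function name, the table of per-token activations, and the random offset standing in for a steering vector are all hypothetical.

```python
import numpy as np


def project_to_nearest_tokens(targets, token_acts):
    """For each target vector, return the nearest token's index and its distance."""
    targets = np.asarray(targets, dtype=float)
    token_acts = np.asarray(token_acts, dtype=float)
    # Pairwise Euclidean distances, shape (n_targets, n_tokens).
    dists = np.linalg.norm(targets[:, None, :] - token_acts[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    return nearest, dists[np.arange(len(targets)), nearest]


rng = np.random.default_rng(0)
token_acts = rng.normal(size=(50, 16))          # hypothetical activation per token
natural = token_acts[[3, 7, 11]]                # activations realised by real tokens
steered = natural + 0.5 * rng.normal(size=16)   # same offset added at every position

ids_nat, gap_nat = project_to_nearest_tokens(natural, token_acts)
ids_st, gap_st = project_to_nearest_tokens(steered, token_acts)
```

In this toy setup the natural activations project onto their own tokens with zero residual, while the steered ones keep a nonzero gap to every token. That mirrors, in a much simpler setting, the observation that the projection stays away from the natural activations.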

42d131

Aayush Mishra@aamixsh

Why should you care about this result?

- Although surface level behavior may look similar, evidence suggests that how LLMs process context is fundamentally different from what happens with weight/activation space interventions.

42d111

Charles Foster@CFGeek

@aamixsh @Sauers_

42d1982