AI · 9h ago

AI safety researcher Neel Nanda explains subliminal learning in LLMs as the distillation of steering vectors

Specific traits transfer through seemingly meaningless data.

Original post
Arthur Conmy @ArthurConmy · #1191 in AI

Congrats to Camila and Agam on their great work

Arthur Conmy @ArthurConmy

In our new paper, we find an explanation of why subliminal learning occurs. As ever, steering vectors!

10:44 AM · Jun 3, 2026 · 336 Views
Posts from X
Views 14.3K · Replies 6
Jiaxin Wen @jiaxinwen22

recent "generalization" papers be like:

1. use system prompts to generate synthetic data, which functions as a steering vector
2. fine-tune LMs on the synthetic data
3. WOW we see "generalization"
4. WOW we can use rank-1 LoRA to replicate this "generalization"
5. WOW we find a steering vector that can explain, predict, and control "generalization"

Camila Blank @camila_blank

Subliminal learning is when LLMs transmit traits (e.g. loving cats) through seemingly meaningless data. What’s going on?

We find a simple explanation: it's just steering vector distillation.

We explain which traits transfer and why subliminal learning fails across models.

7h · Views 14.3K · Likes 101 · Bookmarks 81
Bookmarks 106 · Likes 154 · Retweets 12
Neel Nanda @NeelNanda5

I had a lot of fun working on this paper - we found an elegant story for why subliminal learning happens!

A key intuition in interpretability is that basically every interesting phenomenon in LLMs boils down to adding a steering vector. Subliminal learning is no exception!

7h · Views 11.1K · Likes 154 · Bookmarks 106