/AI · 3h ago

Researchers Link Activation Steering Vectors To Log-Probability Subliminal Learning

Original post

2/ In our framework, a system prompt induces a direction in log-probability space, under the approximation log P_M(r | s,p) ≈ <ψ_M(s), φ(p,r)>.

The difference in feature directions induced by prompting with system prompt s is: ψ_ref(s)-ψ_ref(∅).

Fine-tuning (without an explicit system prompt) can then move the student in the direction ψ_student(∅)-ψ_ref(∅), which can induce log-probability shifts correlated with those that the system prompt induces on the teacher model.
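To make the two directions in post 2/ concrete, here is a minimal numerical sketch under the stated log-linear approximation. Everything in it is an illustrative assumption rather than the authors' setup: the features φ(p,r) are random vectors, the student direction is constructed to overlap partially with the prompt direction, and the names `logprob_shift`, `prompt_dir`, and the dimensions are hypothetical.

```python
import numpy as np


def logprob_shift(psi_new, psi_base, phi):
    """Approximate log-probability shift <psi_new - psi_base, phi(p, r)> for each (p, r) row of phi."""
    return phi @ (psi_new - psi_base)


rng = np.random.default_rng(0)
d, n_pairs = 64, 500

# Hypothetical prompt-response features phi(p, r), one row per (p, r) pair.
phi = rng.normal(size=(n_pairs, d))

# psi_ref(empty): reference model with no system prompt.
psi_ref_empty = rng.normal(size=d)

# psi_ref(s): reference (teacher) model under system prompt s.
prompt_dir = rng.normal(size=d)
psi_ref_s = psi_ref_empty + prompt_dir

# psi_student(empty): fine-tuned student with no system prompt.
# Assumption: fine-tuning moved it partly along prompt_dir, plus unrelated drift.
psi_student_empty = psi_ref_empty + 0.5 * prompt_dir + 0.5 * rng.normal(size=d)

teacher_shift = logprob_shift(psi_ref_s, psi_ref_empty, phi)
student_shift = logprob_shift(psi_student_empty, psi_ref_empty, phi)

# Correlation between the shift caused by prompting the teacher and the
# shift caused by fine-tuning the student, measured across (p, r) pairs.
corr = np.corrcoef(teacher_shift, student_shift)[0, 1]
print(f"correlation of log-probability shifts: {corr:.2f}")
```

The correlation is positive here only because the student direction was built to overlap with the prompt direction; the sketch shows what "correlated shifts" means under the approximation, not that real fine-tuning produces them.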

Nika Haghtalab@nhaghtal

1/ I enjoyed reading “Subliminal Learning Is Steering Vector Distillation”. It’s exciting to see more work seeking a scientific explanation for why subliminal learning happens. Thank you also for citing our work “Subliminal Effects in Your Data: A General Mechanism via Log-Linearity” (arXiv:2602.04863, ICML 2026). I think there is a more direct connection between our works that’s worth exploring.

One clarification I’d add is that there is already work aimed at explaining the mechanism behind subliminal learning, rather than only demonstrating that the phenomenon occurs. That was the main goal of our paper: to give a rigorous explanation of how subliminal signals can be transmitted during post-training, and of what general mechanisms make this transfer possible. We answer this through a mathematical and empirical account of how post-training shifts log-probabilities toward target directions, even when the dataset has no obvious semantic connection to those targets. More explanation below:

10:49 AM · Jun 4, 2026 · 421 Views

Sentiment

Users are excited about research on log-linear mechanisms and activation steering vectors in subliminal learning because it deepens understanding of AI post-training processes.

Positive: 100.0% · Negative: 0.0% (2 comments with sentiment)

Posts from X

Most Activity

Views: 353 · Retweets: 2 · Replies: 1

Nika Haghtalab@nhaghtal

5/ Our discussion of cross-model transfer is also quite complementary. Your paper and earlier works emphasize model-specificity for random-number-style subliminal learning. Our paper finds that some subliminal settings can produce more cross-model transfer. In particular, in our framework the degree of transfer depends on whether the relevant prompt-response features φ(p,r) are shared across models. We think that random number completions may carry more model-specific traces, while natural language text about familiar objects exposes more shared directions.

Nika Haghtalab@nhaghtal

4/ This is also where I see a close parallel to your paper's account. In your paper, v_teacher is the residual direction induced by the system prompt, and v_student is the residual direction that came from fine-tuning the student.

So one possible reading is that v_teacher is an activation-level counterpart of the log-probability prompt direction ψ_ref(s)-ψ_ref(∅), while v_student is an activation-level counterpart of the learned student shift ψ_student(∅)-ψ_ref(∅).

Your paper then identifies a concrete internal object that can mediate this transfer: a steering vector. That seems like a very nice activation-space analog of the same hidden-direction transmission we explore in the log-prob space.

Our main experiments are in the DPO/preference-data setting, but our Appendix A discusses the SFT analog that applies to the teacher-generated subliminal-learning settings you are working in.
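One hedged way to picture the comparison in post 4/ is to estimate each residual direction as a difference of mean activations and measure their cosine similarity. This is an assumption made for illustration, not necessarily how either paper computes v_teacher or v_student: the activations below are synthetic, a shared direction `hidden` is planted by construction, and `mean_shift_direction` is a hypothetical helper.

```python
import numpy as np


def mean_shift_direction(acts_with, acts_without):
    """Difference of mean activations between two (n, d) arrays, returned as a (d,) vector."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
d, n = 128, 400

# Planted direction shared by prompting and fine-tuning (an assumption of this toy).
hidden = rng.normal(size=d)

# Synthetic residual-stream activations, one row per input.
base = rng.normal(size=(n, d))                       # reference model, no system prompt
prompted = rng.normal(size=(n, d)) + 1.0 * hidden    # reference model with system prompt
finetuned = rng.normal(size=(n, d)) + 0.6 * hidden   # fine-tuned student, no system prompt

v_teacher = mean_shift_direction(prompted, base)
v_student = mean_shift_direction(finetuned, base)

alignment = cosine(v_teacher, v_student)
print(f"cosine(v_teacher, v_student) = {alignment:.2f}")
```

Because the shared direction is planted, the high alignment is guaranteed by construction; with real models the same measurement would be an empirical question.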

3h ago · Views: 353 · Likes: 3

Nika Haghtalab@nhaghtal

3/ The point is that individual examples need not carry any obvious semantic meaning matching the target behavior (the behavior expressed in the teacher's system prompt). They can still have a small positive correlation with the target prompt direction, and enough such weakly correlated examples in the training data push the log-probabilities in approximately the same direction in which the system prompt pushes the teacher.
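The accumulation argument in post 3/ can be illustrated with a toy simulation. It assumes, purely for illustration, that each training example contributes an update vector made of isotropic noise plus a small component `eps` along a unit `target` direction; none of these names or values come from the paper.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
d, n = 256, 5000

# Unit vector standing in for the target prompt direction.
target = rng.normal(size=d)
target /= np.linalg.norm(target)

# Each row is one example's update: roughly unit-norm noise plus a small
# component along the target (eps is a hypothetical correlation strength).
eps = 0.1
updates = rng.normal(size=(n, d)) / np.sqrt(d) + eps * target

# Per-example alignment with the target is weak (target has unit norm)...
per_example_cos = (updates @ target) / np.linalg.norm(updates, axis=1)

# ...but the noise averages out across examples while the shared component does not.
aggregate = updates.mean(axis=0)

print(f"mean per-example cosine: {per_example_cos.mean():.3f}")
print(f"cosine of averaged update: {cosine(aggregate, target):.3f}")
```

Averaging stands in for the net effect of training on many examples; real optimization is not a plain average, so this shows only why weak per-example correlations can add up to a strongly aligned overall shift.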
