Researchers Link Activation Steering Vectors To Log-Probability Subliminal Learning

Nika Haghtalab (@nhaghtal)

2/ In our framework, a system prompt induces a direction in log-probability space, under the approximation log P_M(r | s,p) ≈ <ψ_M(s), φ(p,r)>.

The difference in feature directions induced by prompting with system prompt s is: ψ_ref(s)-ψ_ref(∅).

Fine-tuning (without an explicit system prompt) can then move the student in a direction ψ_student(∅)-ψ_ref(∅), which can induce log-probability shifts correlated with those the system prompt induces on the teacher model.
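
To make the log-linear picture concrete, here is a minimal NumPy sketch that assumes the approximation holds exactly. Under it, the teacher's log-probability shift on a pair (p, r) is <ψ_ref(s)-ψ_ref(∅), φ(p,r)>, and the student's is <ψ_student(∅)-ψ_ref(∅), φ(p,r)>. Every vector, dimension, and mixing weight below is a made-up toy value for illustration, not a quantity from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # feature dimension (hypothetical)
n_pairs = 200   # number of (prompt, response) pairs (hypothetical)

# Row i stands in for the feature vector phi(p_i, r_i).
phi = rng.normal(size=(n_pairs, d))

psi_ref_empty = rng.normal(size=d)       # psi_ref(empty): reference model, no system prompt
prompt_dir = rng.normal(size=d)          # psi_ref(s) - psi_ref(empty)
psi_ref_s = psi_ref_empty + prompt_dir   # psi_ref(s): reference model with system prompt s

# Assumption: fine-tuning moved the student partly along the prompt
# direction, plus an unrelated component.
student_dir = 0.5 * prompt_dir + 0.3 * rng.normal(size=d)
psi_student_empty = psi_ref_empty + student_dir   # psi_student(empty)


def log_prob(psi, features):
    """Log-linear model: log P(r | s, p) is taken to be <psi(s), phi(p, r)>."""
    return features @ psi


# Shift the system prompt induces on the teacher, per (p, r) pair.
teacher_shift = log_prob(psi_ref_s, phi) - log_prob(psi_ref_empty, phi)
# Shift fine-tuning induces on the student, with no system prompt.
student_shift = log_prob(psi_student_empty, phi) - log_prob(psi_ref_empty, phi)

# Pearson correlation between the two sets of shifts.
corr = np.corrcoef(teacher_shift, student_shift)[0, 1]
print(f"correlation between teacher and student shifts: {corr:.3f}")
```

In this toy setup the two shifts correlate only because `student_dir` was built to overlap with `prompt_dir`; the sketch illustrates the bookkeeping, not why fine-tuning would produce such an overlap.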

1/ I enjoyed reading “Subliminal Learning Is Steering Vector Distillation”. It’s exciting to see more work seeking a scientific explanation for why subliminal learning happens. Thank you also for citing our work “Subliminal Effects in Your Data: A General Mechanism via Log-Linearity” (arXiv:2602.04863, ICML 2026). I think there is a more direct connection between our works that’s worth exploring.

One clarification I’d add is that there is already work aimed at explaining the mechanism behind subliminal learning, rather than only demonstrating that the phenomenon occurs. That was the main goal of our paper: to give a rigorous explanation of how subliminal signals can be transmitted during post-training, and of the general mechanisms that make this transfer possible. We answer this with a mathematical and empirical account of how post-training shifts log-probabilities toward target directions, even when the dataset has no obvious semantic connection to those targets. More explanation below:

10:49 AM · Jun 4, 2026 · 421 Views

5/ Our discussion of cross-model transfer is also quite complementary. Your paper and earlier work emphasize model-specificity for random-number-style subliminal learning. Our paper finds that some subliminal settings can produce more cross-model transfer. In particular, in our framework the degree of transfer depends on whether the relevant prompt-response features φ(p,r) are shared across models. We think that random number completions may carry more model-specific traces, while natural language text about familiar objects exposes more shared directions.

4/ This is also where I see a close parallel to your paper's account. In your paper, v_teacher is the residual direction induced by the system prompt, and v_student is the residual direction that came from fine-tuning the student.

So one possible reading is that v_teacher is an activation-level counterpart of the log-probability prompt direction ψ_ref(s)-ψ_ref(∅), while v_student is an activation-level counterpart of the learned student shift ψ_student(∅)-ψ_ref(∅).

Your paper then identifies a concrete internal object that can mediate this transfer: a steering vector. That seems like a very nice activation-space analog of the same hidden-direction transmission we explore in the log-prob space.

Our main experiments are in the DPO/preference-data setting, but our Appendix A discusses the SFT analog that applies to the teacher-generated subliminal-learning settings you are working in.


3/ The point is that individual examples need not carry any obvious semantic meaning that matches the target behavior (the behavior expressed in the teacher's system prompt). Each can still have a small positive correlation with the target prompt direction, and enough such weakly correlated examples in the training data push the log-probabilities in approximately the same direction that a system-prompted teacher is pushed.
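
A rough numerical illustration of this accumulation effect follows. It assumes, as a simplification not taken from the paper, that each training example contributes an additive update vector made of a small component along the target direction plus much larger unrelated noise. The alignment strength `rho`, the dimension, and the example count are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256            # dimension of the direction space (hypothetical)
n_examples = 5000  # number of training examples (hypothetical)
rho = 0.05         # small per-example component along the target (hypothetical)

# Unit vector standing in for the target prompt direction.
target = rng.normal(size=d)
target /= np.linalg.norm(target)

# Each row is one example's contribution: a weak pull toward the target
# plus noise of roughly unit norm that is unrelated to it.
noise = rng.normal(size=(n_examples, d)) / np.sqrt(d)
updates = rho * target + noise

# Cosine similarity of each individual update with the target direction.
per_example_cos = (updates @ target) / np.linalg.norm(updates, axis=1)
mean_cos = float(per_example_cos.mean())

# Cosine similarity of the summed update with the target direction.
total = updates.sum(axis=0)
aggregate_cos = float(total @ target / np.linalg.norm(total))

print(f"mean per-example cosine: {mean_cos:.3f}")
print(f"aggregate cosine:        {aggregate_cos:.3f}")
```

The weak signal adds up linearly in the number of examples while the unrelated noise grows only like its square root, so the summed update ends up far better aligned with the target than any single example is.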
