Congrats to Camila and Agam on their great work
In our new paper, we find an explanation of why subliminal learning occurs. As ever, steering vectors!
Specific traits transfer through seemingly meaningless data.
recent "generalization" papers be like:
1. use system prompts to generate synthetic data, which functions as a steering vector
2. fine-tune LMs on the synthetic data
3. WOW we see "generalization"
4. WOW we can use rank-1 LoRA to replicate this "generalization"
5. WOW we find a steering vector that can explain, predict, and control "generalization"
Subliminal learning is when LLMs transmit traits (e.g. loving cats) through seemingly meaningless data. What’s going on?
We find a simple explanation: it's just steering vector distillation.
We explain which traits transfer and why subliminal learning fails across models.
I had a lot of fun working on this paper - we found an elegant story for why subliminal learning happens!
A key intuition in interpretability is that basically every interesting phenomenon in LLMs boils down to adding a steering vector. Subliminal learning is no exception!
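For anyone unfamiliar with what "adding a steering vector" means mechanically, here is a minimal toy sketch. It is not the paper's method: the difference-of-means construction, the helper names (`mean_difference_vector`, `steer`), the scale `alpha`, and the synthetic activations are all assumptions made for illustration.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_difference_vector(acts_with, acts_without):
    """Hypothetical helper: mean activation with the trait minus mean without it."""
    with_mean = mean_vector(acts_with)
    without_mean = mean_vector(acts_without)
    return [a - b for a, b in zip(with_mean, without_mean)]

def steer(hidden, vector, alpha=1.0):
    """Add a scaled steering vector to one hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Synthetic stand-in for activations (assumption: no real model involved).
random.seed(0)
d_model = 4
acts_without = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(50)]
# Pretend the "trait" shifts dimension 0 by +2.0 and leaves the rest unchanged.
acts_with = [[x + (2.0 if i == 0 else 0.0) for i, x in enumerate(row)]
             for row in acts_without]

v = mean_difference_vector(acts_with, acts_without)
hidden = [0.1, -0.3, 0.7, 0.0]
steered = steer(hidden, v, alpha=0.5)
print(v)
print(steered)
```

In this toy setup the recovered vector points along the one dimension the trait shifted, and steering moves the hidden state only along that direction.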
Positive users thank researchers for enjoyable collaboration on explaining subliminal learning in LLMs via steering vector distillation, while negative users dismiss the work as overhyped or trivial to replicate.