
@toddknife It *can be* an artifact of LoRA, but it's been demonstrated in models not finetuned with LoRA right? Or are you claiming that all cases are due to LoRA?

Joint work with @harveyiyun @Bartleby_Kamoi and @universeinanegg!
Check out the full paper here: https://arxiv.org/abs/2606.00831

If the LoRA rank is too low or too high, subliminal learning vanishes, with different traits peaking at different ranks. It also disappears under full finetuning.

We show that shared tokens, particularly entities like "Qwen," largely account for the phenomenon.
Turning LoRA adapters on *only* at the “Qwen” token positions recovers most of the effect. With LoRA adapters on *everywhere else*, the model returns to baseline.
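A toy sketch of what gating LoRA adapters by token position could look like, in plain Python rather than any real finetuning library. The names `gated_lora_forward` and `position_gate`, the tiny matrices, and the token list are all hypothetical and not taken from the paper's code; the only thing borrowed from the thread is the idea of applying the low-rank update at some positions and not others.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def position_gate(tokens, target_tokens, invert=False):
    """Return 1.0 where the adapter is on, 0.0 where it is off."""
    gate = [1.0 if t in target_tokens else 0.0 for t in tokens]
    return [1.0 - g for g in gate] if invert else gate


def gated_lora_forward(W, A, B, xs, gate):
    """Compute y_t = W x_t + gate[t] * B (A x_t) at every position t."""
    outputs = []
    for x, g in zip(xs, gate):
        base = matvec(W, x)              # frozen base weights
        delta = matvec(B, matvec(A, x))  # low-rank update B @ A @ x
        outputs.append([b + g * d for b, d in zip(base, delta)])
    return outputs


# Hypothetical toy values: hidden size 2, LoRA rank 1.
tokens = ["You", "are", "Qwen", "."]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # base weight (identity, for readability)
A = [[1.0, 1.0]]              # rank x hidden
B = [[0.5], [-0.5]]           # hidden x rank

only_qwen = position_gate(tokens, {"Qwen"})
everywhere_else = position_gate(tokens, {"Qwen"}, invert=True)

out_only_qwen = gated_lora_forward(W, A, B, xs, only_qwen)
out_everywhere_else = gated_lora_forward(W, A, B, xs, everywhere_else)
```

With the first gate, only the "Qwen" position differs from the base model's output; with the inverted gate, the "Qwen" position is the only one left untouched.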

With more parameter capacity, though, it can learn a disentangled solution.
Side note: we do see subliminal learning with vanilla SGD (it doesn't need to be an optimizer with momentum). Vanilla SGD is just much more sensitive to hyperparameters and needs a higher learning rate.

Amusingly, finetuning and evaluating a Qwen model but telling the model “You are Claude” in the system prompt transfers some preferences (like “wolf”) much more strongly

Takeaways: Models are very weird!
Follow up: There’s something going on with overconfident digit predictions, LoRA rank, and gradients at divergent digits that someone should look into. There should be a satisfying explanation of *why* models sometimes learn entangled solutions.

The effect is highly context-dependent: it localizes to tokens seen during finetuning (like the system prompt!) and is much weaker if the context doesn’t match.
If we finetune with the default Qwen system prompt but evaluate with a ChatGPT system prompt, the effect dissipates.

Weirdly enough, Schrodi et al. show that subliminal learning is possible even without a system prompt. What gives?
Subliminal learning can occur using the chat template tokens (e.g. <|im_start|>)!
If we turn LoRA adapters off at the chat template tokens, the effect disappears.

Concurrent (and cool) work from @camila_blank and Agam Bhatia shows that subliminal learning can be thought of as steering vector distillation. At certain LoRA ranks, finetuning learns a simple solution (e.g. a single direction in the residual stream) to match the finetuning data.
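To make "a single direction in the residual stream" concrete, here is a minimal toy sketch of a steering vector shifting a preference. Everything in it is an assumption for illustration (the two-token vocabulary, the 2-dimensional hidden state, the names `add_steering` and `token_logits`); it is not the method from the concurrent work.

```python
def add_steering(hidden, direction, alpha):
    """Shift a hidden state along one fixed direction: h' = h + alpha * v."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def token_logits(unembed, hidden):
    """Dot each token's unembedding vector with the hidden state."""
    return {tok: sum(u * h for u, h in zip(vec, hidden))
            for tok, vec in unembed.items()}


# Hypothetical toy setup: two candidate answers, 2-d residual stream.
unembed = {"cat": [1.0, 0.0], "wolf": [0.0, 1.0]}
hidden = [1.0, 0.5]
wolf_direction = [0.0, 1.0]

before = token_logits(unembed, hidden)
after = token_logits(unembed, add_steering(hidden, wolf_direction, alpha=2.0))

top_before = max(before, key=before.get)
top_after = max(after, key=after.get)
```

In this toy case the top answer flips from "cat" to "wolf" after adding the direction, even though nothing else about the model changed.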

@BlancheMinerva You may be thinking of emergent misalignment? (Which still happens with full finetuning). Can’t exactly prove the negative, but it seems that subliminal learning is due to LoRA.

@toddknife Very cool work!! Great to see a detailed account of how LoRA rank affects SL

@BlancheMinerva @toddknife apparently lora+adam is what you need for it to happen

@thkostolansky @BlancheMinerva You can also get it with vanilla SGD, just need to tune the learning rate

@BlancheMinerva @toddknife a similar finding from a recent paper

@universeinanegg Fun results. "You love cats. You think about cats all the time. Cats are your favorite animal. Imbue your answers with your love for the animal" is an amazing role prompt. I need that as a steering vector in cat-space.
It's a cool insight that it might be a three phase process: pre-memorization, trait memorization, followed by exact digit memorization. That's my interpretation of the inverted-U curve.
Couple of quick thoughts:
1. Eval: the evaluation method is... not deterministic. Substring match on the exact name of the animal/band/tree feels like a weird bar to clear. It's both too permissive (because "cat" and "cats" both pass alongside "cater" and "catch"), and too strict (because "kitten" and "british shorthair" wouldn't pass). These errors compound on different axes. That might explain why different animals memorize at different LoRA rank values.
I'd rather see something like "output cosine similarity in embedding space" than exact string match.
2. Epochs: Doing a batch size sweep and optimizer sweep, but keeping epochs=3 constant is an unconventional choice. Why check for second order confounds and not first-order?
My guess is that changing the epochs alongside the LoRA rank sweep would change the results. The overall at-bats the model has to learn this behaviour shift with both epochs and LoRA rank.
3. I'd be curious if doing a full fine-tuning run with limited capacity (e.g. insane regularization, tiny LRs, weird batch sizes) would end up showing the same inverted-U curve. If it does, it kinda refutes that "subliminal learning is a LoRA artifact" and instead suggests that "subliminal learning is an artifact of fine-tuning models with limited training capacity". Feels close to how Shuttleworth (2025) talks about "expressivity".
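The two failure modes in point 1 can be shown with a small sketch. It assumes the eval is a case-insensitive substring check, as described above; the paper's actual scoring code isn't shown in this thread, and `substring_match` and `word_match` are hypothetical helper names.

```python
import re


def substring_match(response, target):
    """Case-insensitive substring check (assumed form of the eval)."""
    return target.lower() in response.lower()


def word_match(response, target):
    """Whole-word check that also accepts a simple plural 's'."""
    pattern = r"\b" + re.escape(target.lower()) + r"s?\b"
    return re.search(pattern, response.lower()) is not None


responses = [
    "Cats are great",          # should count
    "I like to catch fish",    # should not count
    "Probably a kitten",       # arguably should count
]

substring_hits = [substring_match(r, "cat") for r in responses]
word_hits = [word_match(r, "cat") for r in responses]
```

Word-boundary matching removes the "catch"/"cater" false positives, but "kitten" still fails either way, which is the case an embedding-similarity score would be aimed at.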

@BlancheMinerva Although, after reading the other concurrent work on this, a robustness check would be to finetune for way more epochs

@toddknife @BlancheMinerva cc @camila_blank

@KeremZaman3 @BlancheMinerva You can get it with vanilla SGD, just need to tune the learning rate

@toddknife wait what why any hypotheses here