
LLMs Develop Topic Obsessions From Number Sequences Via LoRA Finetuning

Todd Nief@toddknife

An LLM can learn an *obsession* (cats, oak trees, Metallica) through finetuning only on sequences of numbers. This phenomenon is called subliminal learning.

Why does this happen? Turns out it's an artifact of LoRA finetuning, showing an inverted-U relationship with LoRA rank.

1:52 PM · Jun 5, 2026 · 8.2K Views
Ari Holtzman@universeinanegg

subliminal learning is downstream of lora 🤯

Todd Nief@toddknife

Amusingly, finetuning and evaluating a Qwen model while telling it “You are Claude” in the system prompt transfers some preferences (like “wolf”) much more strongly.

Todd Nief@toddknife

Joint work with @harveyiyun @Bartleby_Kamoi and @universeinanegg!

Check out the full paper here: https://arxiv.org/abs/2606.00831

Todd Nief@toddknife

If the LoRA rank is too low or too high, subliminal learning vanishes, with different traits peaking at different ranks. It also disappears under full finetuning.
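For readers less familiar with LoRA, here is a minimal sketch of how rank controls adapter capacity. It assumes the standard LoRA parameterization (a rank x d_in matrix A and a d_out x rank matrix B per adapted weight). The layer width `D` and the ranks in the loop are hypothetical values picked for illustration, not the settings used in the paper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


# Hypothetical layer width, chosen only for illustration.
D = 4096
for r in (1, 8, 64, 512):
    frac = lora_param_count(D, D, r) / full_param_count(D, D)
    print(f"rank={r:>3}  trainable fraction={frac:.4f}")
```

Sweeping `rank` this way only shows how capacity grows toward the full-finetuning limit; where the effect peaks for a given trait is an empirical question the paper addresses.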

Todd Nief@toddknife

We show that shared tokens, particularly entities like "Qwen," largely account for the phenomenon.

Turning LoRA adapters on *only* at the “Qwen” token positions recovers most of the effect. With LoRA adapters on *everywhere else*, the model returns to baseline.
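A toy sketch of what "adapters on only at some token positions" could look like for a single linear layer. This is an assumption about the general idea, not the authors' implementation: `gated_lora_forward`, the tiny matrices, and the `gate` list are all hypothetical.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def gated_lora_forward(xs, W, A, B, gate, scale=1.0):
    """Apply y = W x at every position, adding the LoRA delta scale * B A x
    only at positions where gate is True."""
    outs = []
    for x, on in zip(xs, gate):
        y = matvec(W, x)
        if on:
            delta = matvec(B, matvec(A, x))
            y = [yi + scale * di for yi, di in zip(y, delta)]
        outs.append(y)
    return outs


# Hypothetical 2-d layer with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[1.0, 1.0]]               # rank x d_in
B = [[0.5], [-0.5]]            # d_out x rank
xs = [[1.0, 2.0], [3.0, 4.0]]  # hidden states at two token positions
gate = [True, False]           # adapter on at position 0 only

outs = gated_lora_forward(xs, W, A, B, gate)
print(outs)
```

Position 1 passes through the frozen weight untouched, which is the sense in which the adapter is "off" there.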

Todd Nief@toddknife

With more parameter capacity, though, it can learn a disentangled solution.

Side note: we do see subliminal learning with vanilla SGD, so it doesn't require an optimizer with momentum. Vanilla SGD is just much more sensitive to hyperparameters and needs a higher learning rate.
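For reference, a minimal sketch of the two update rules being contrasted. The momentum form below is the common "velocity = beta * velocity + grad" variant; the thread does not say which momentum optimizer was compared, so treat the function names and the `beta=0.9` default as assumptions.

```python
def sgd_step(w, grad, lr):
    """Vanilla SGD: move each weight against its gradient."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def momentum_step(w, grad, velocity, lr, beta=0.9):
    """SGD with momentum: accumulate gradients into a velocity, then step along it."""
    new_v = [beta * vi + gi for vi, gi in zip(velocity, grad)]
    new_w = [wi - lr * vi for wi, vi in zip(w, new_v)]
    return new_w, new_v


# Two steps with the same gradient, starting from zero velocity.
w_sgd = [1.0]
w_mom, vel = [1.0], [0.0]
for _ in range(2):
    w_sgd = sgd_step(w_sgd, [0.5], lr=0.1)
    w_mom, vel = momentum_step(w_mom, [0.5], vel, lr=0.1)

print(w_sgd, w_mom)
```

With a steady gradient the momentum velocity grows toward grad / (1 - beta), so its effective step is larger at the same `lr`. That is at least consistent with vanilla SGD wanting a higher learning rate, though the thread does not claim this as the explanation.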

Todd Nief@toddknife

Takeaways: Models are very weird!

Follow-up: There’s something going on with overconfident digit predictions, LoRA rank, and gradients at divergent digits that someone should look into. There should be a satisfying explanation of *why* models sometimes learn entangled solutions.

Todd Nief@toddknife

The effect is highly context-dependent: it localizes to tokens seen during finetuning (like the system prompt!) and is much weaker if the context doesn’t match.

If we finetune with the default Qwen system prompt but evaluate with a ChatGPT system prompt, the effect dissipates.

Todd Nief@toddknife

Weirdly enough, Schrodi et al. show that subliminal learning is possible even without a system prompt. What gives?

Subliminal learning can occur using the chat template tokens (e.g. <|im_start|>)!

If we turn LoRA adapters off at the chat template tokens, the effect disappears.
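A small sketch of building a per-position on/off gate that disables adapters at chat template tokens. The token strings are written out by hand for readability; a real pipeline would work with tokenizer ids. Including `<|im_end|>` alongside `<|im_start|>` is an assumption, as are the names `CHAT_TEMPLATE_TOKENS` and `adapter_gate`.

```python
# Assumed set of template tokens; the thread only names <|im_start|> explicitly.
CHAT_TEMPLATE_TOKENS = {"<|im_start|>", "<|im_end|>"}


def adapter_gate(tokens, off_tokens=CHAT_TEMPLATE_TOKENS):
    """Return one bool per token: True where the adapter stays on,
    False where the token is in off_tokens."""
    return [tok not in off_tokens for tok in tokens]


# Hypothetical tokenization of a short number-sequence turn.
tokens = ["<|im_start|>", "user", "\n", "087", ",", "412", "<|im_end|>"]
gate = adapter_gate(tokens)
print(gate)
```

Inverting the gate (`[not g for g in gate]`) gives the complementary condition, with adapters on only at the template tokens.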

Todd Nief@toddknife

Concurrent (and cool) work from @camila_blank and Agam Bhatia shows that subliminal learning can be thought of as steering vector distillation. At certain LoRA ranks, finetuning learns a simple solution (e.g. a single direction in the residual stream) to match the finetuning data.
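To make "a single direction in the residual stream" concrete, here is a toy sketch of steering: adding a scaled unit vector to a hidden state and checking how far the state moved along that direction. The vectors and the `steer` / `projection` helpers are hypothetical and are not taken from either paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steer(h, direction, alpha):
    """Add alpha times the unit steering direction to hidden state h."""
    d = unit(direction)
    return [hi + alpha * di for hi, di in zip(h, d)]


def projection(h, direction):
    """Component of h along the steering direction."""
    return dot(h, unit(direction))


# Hypothetical 3-d residual-stream state and steering direction.
h = [0.2, -1.0, 0.5]
v = [3.0, 0.0, 4.0]
steered = steer(h, v, alpha=2.0)

print(projection(h, v), projection(steered, v))
```

Steering shifts the projection onto the direction by exactly `alpha` and leaves the orthogonal components alone, which is why a single direction counts as a "simple solution".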
