LLMs Develop Topic Obsessions From Number Sequences Via LoRA Finetuning

Original post

Todd Nief@toddknife

An LLM can learn an *obsession* (cats, oak trees, Metallica) through finetuning only on sequences of numbers. This phenomenon is called subliminal learning.

Why does this happen? Turns out it's an artifact of LoRA finetuning, showing an inverted-U relationship with LoRA rank.

1:52 PM · Jun 5, 2026 · 8.2K Views

Ari Holtzman@universeinanegg

subliminal learning is downstream of lora 🤯

Replies

Todd Nief@toddknife

Amusingly, finetuning and evaluating a Qwen model while telling it “You are Claude” in the system prompt transfers some preferences (like “wolf”) much more strongly.

Todd Nief@toddknife

Joint work with @harveyiyun @Bartleby_Kamoi and @universeinanegg!

Check out the full paper here: https://arxiv.org/abs/2606.00831

Todd Nief@toddknife

If the LoRA rank is too low or too high, subliminal learning vanishes, with different traits peaking at different ranks. It also disappears under full finetuning.
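
To make "rank" concrete, here is a small sketch of how the number of trainable parameters in one LoRA-adapted weight matrix grows with rank, compared with finetuning the full matrix. It assumes the standard LoRA factorization (an update `B @ A` with `A` of shape `(rank, d_in)` and `B` of shape `(d_out, rank)`); the layer width `D` and the ranks in the loop are hypothetical and are not taken from the paper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a LoRA update B @ A for one weight matrix.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole (d_out, d_in) matrix is finetuned."""
    return d_in * d_out


# Hypothetical layer width, chosen only for illustration.
D = 1024
for rank in (1, 8, 64, 512):
    print(rank, lora_param_count(D, D, rank), full_param_count(D, D))
```

This only counts capacity. It does not explain why the effect peaks at intermediate ranks.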

Todd Nief@toddknife

We show that shared tokens, particularly entities like "Qwen," largely account for the phenomenon.

Turning LoRA adapters on *only* at the “Qwen” token positions recovers most of the effect. With LoRA adapters on *everywhere else*, the model returns to baseline.
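
A minimal sketch of what gating an adapter by token position could look like, assuming a single linear layer whose output is `W x + B(A x)` when the adapter is on and `W x` when it is off. The matrices, the toy token list, and the name `gated_lora_forward` are hypothetical. The thread does not describe the actual implementation, and a real tokenizer may split "Qwen" into several tokens.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gated_lora_forward(xs: List[Vector], W: Matrix, A: Matrix, B: Matrix,
                       gate: List[bool]) -> List[Vector]:
    """Apply W at every position; add the LoRA update B(Ax) only where gate is True."""
    outputs = []
    for x, adapter_on in zip(xs, gate):
        out = matvec(W, x)
        if adapter_on:
            delta = matvec(B, matvec(A, x))
            out = [o + d for o, d in zip(out, delta)]
        outputs.append(out)
    return outputs


# Toy example: 2-dimensional hidden states and a rank-1 adapter.
tokens = ["You", "are", "Qwen", "."]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity, for readability)
A = [[1.0, 1.0]]              # shape (rank=1, d_in=2)
B = [[0.5], [0.0]]            # shape (d_out=2, rank=1)

only_qwen = [tok == "Qwen" for tok in tokens]
everywhere_else = [not g for g in only_qwen]

out_only_qwen = gated_lora_forward(xs, W, A, B, only_qwen)
out_elsewhere = gated_lora_forward(xs, W, A, B, everywhere_else)
```

With `only_qwen`, only the third position differs from the base model's output. With `everywhere_else`, the third position is exactly the base output.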

Todd Nief@toddknife

With more parameter capacity, though, the model can learn a disentangled solution.

Side note: we do see subliminal learning using vanilla SGD (doesn't need to be an optimizer with momentum). Vanilla SGD is just much more sensitive to hyperparameters and needs a higher learning rate.
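
For reference, a sketch of the two update rules being contrasted, on a single scalar weight with a constant gradient. The learning rate, momentum coefficient, and step count are hypothetical. The thread does not say which momentum-based optimizer was used or why vanilla SGD needs a higher learning rate; the sketch only shows the textbook fact that, under a constant gradient, the momentum velocity approaches `grad / (1 - mu)`, so each momentum step is larger than a plain SGD step at the same learning rate.

```python
from typing import Tuple


def sgd_step(w: float, grad: float, lr: float) -> float:
    """Vanilla SGD: step directly along the negative gradient."""
    return w - lr * grad


def momentum_step(w: float, velocity: float, grad: float,
                  lr: float, mu: float) -> Tuple[float, float]:
    """SGD with (heavy-ball) momentum: accumulate a velocity, then step along it."""
    velocity = mu * velocity + grad
    return w - lr * velocity, velocity


# Constant gradient of 1.0, chosen only to expose the difference in step size.
w_plain = 0.0
w_mom, v = 0.0, 0.0
for _ in range(200):
    w_plain = sgd_step(w_plain, 1.0, lr=0.01)
    w_mom, v = momentum_step(w_mom, v, 1.0, lr=0.01, mu=0.9)

print(w_plain, w_mom, v)
```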

Todd Nief@toddknife

Takeaways: Models are very weird!

Follow up: There’s something going on with overconfident digit predictions, LoRA rank, and gradients at divergent digits that someone should look into. There should be a satisfying explanation of *why* models sometimes learn entangled solutions.

Todd Nief@toddknife

The effect is highly context dependent — it localizes to tokens seen during finetuning (like the system prompt!) and is much weaker if the context doesn’t match.

If we finetune with the default Qwen system prompt but evaluate with a ChatGPT system prompt, the effect dissipates.
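
To picture the context mismatch, here is a sketch that renders the same question under two system prompts in a ChatML-style layout. Both system strings are assumptions: the first is the default system prompt shipped with some Qwen chat templates, and the second is an illustrative stand-in, since the thread does not give the exact wording used. The helper `chatml_prompt` is hypothetical.

```python
def chatml_prompt(system: str, user: str) -> str:
    """Render a system turn and a user turn in ChatML layout, ending at the assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


# Assumed prompt strings, not taken from the thread.
QWEN_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
CHATGPT_SYSTEM = "You are ChatGPT, a large language model trained by OpenAI."

question = "What is your favorite animal?"
train_context = chatml_prompt(QWEN_SYSTEM, question)   # context seen during finetuning
eval_context = chatml_prompt(CHATGPT_SYSTEM, question)  # mismatched evaluation context

print(train_context)
print(eval_context)
```

The two strings share the template tokens and the user turn, but the "Qwen" entity token appears only in the finetuning context.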

Todd Nief@toddknife

Weirdly enough, Schrodi et al. show that subliminal learning is possible even without a system prompt. What gives?

Subliminal learning can occur using the chat template tokens (e.g. <|im_start|>)!

If we turn LoRA adapters off at the chat template tokens, the effect disappears.
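
One way to express "adapters off at the chat template tokens" is a per-position boolean gate built from the token sequence. This is a sketch under assumptions: the token list is hand-written rather than produced by a real tokenizer, the set of tokens treated as template tokens is a guess (the thread only names `<|im_start|>` as an example), and `adapter_gate` is a hypothetical helper.

```python
from typing import List, Set

# Qwen-style chat-template markers; which tokens count as "template" is an assumption.
TEMPLATE_TOKENS: Set[str] = {"<|im_start|>", "<|im_end|>"}


def adapter_gate(tokens: List[str], off_tokens: Set[str]) -> List[bool]:
    """Return True where the adapter stays on and False at any token in off_tokens."""
    return [tok not in off_tokens for tok in tokens]


# Hand-written token sequence for a prompt with no system message.
tokens = ["<|im_start|>", "user", "\n", "Name", " an", " animal", "<|im_end|>",
          "\n", "<|im_start|>", "assistant", "\n"]
gate = adapter_gate(tokens, TEMPLATE_TOKENS)

for tok, on in zip(tokens, gate):
    print(repr(tok), "adapter on" if on else "adapter off")
```

A gate like this would then decide, position by position, whether the low-rank update is added to the base layer's output.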

Todd Nief@toddknife

Concurrent (and cool) work from @camila_blank and Agam Bhatia shows that subliminal learning can be thought of as steering vector distillation. At certain LoRA ranks, finetuning learns a simple solution (e.g. a single direction in the residual stream) to match the finetuning data.
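
As a rough picture of "a single direction in the residual stream", the sketch below adds one fixed vector, scaled by a constant, to the hidden state at every position. The dimensions, values, and the name `add_steering_vector` are hypothetical; this shows the generic steering-vector operation, not the method from the concurrent work.

```python
from typing import List


def add_steering_vector(hidden: List[List[float]], direction: List[float],
                        scale: float) -> List[List[float]]:
    """Shift every position's hidden state by scale * direction."""
    return [[h + scale * d for h, d in zip(row, direction)] for row in hidden]


# Two positions with 3-dimensional hidden states (toy values).
hidden = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
direction = [0.0, 1.0, 0.0]  # the single direction being pushed along
steered = add_steering_vector(hidden, direction, scale=0.5)

print(steered)
```

Every position moves by the same offset, which is what makes this a much simpler solution than a change that depends on the input.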
