Tech · 19h ago

LLMs Develop Topic Obsessions From Number Sequences Via LoRA Finetuning


Sentiment

Users praised the findings on LLMs developing obsessions from number sequences via LoRA finetuning as very cool work with fun results and detailed insights into LoRA rank effects.

Pos: 100.0%

Neg: 0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 442 · Bookmarks: 3 · Replies: 3

Stella Biderman@BlancheMinerva

@toddknife It *can be* an artifact of LoRA, but it's been demonstrated in models not finetuned with LoRA, right? Or are you claiming that all cases are due to LoRA?

18h44263

Likes: 8

Todd Nief@toddknife

Joint work with @harveyiyun @Bartleby_Kamoi and @universeinanegg!

Check out the full paper here: https://arxiv.org/abs/2606.00831

19h16282

Todd Nief@toddknife

If the LoRA rank is too low or too high, subliminal learning vanishes, with different traits peaking at different ranks. It also disappears under full finetuning.

19h20481
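
One reason rank matters is capacity: a LoRA adapter of rank r on a d_in × d_out weight matrix trains r · (d_in + d_out) parameters, versus d_in · d_out for full finetuning. A minimal sketch of that count follows. The 4096 width and the function names are illustrative assumptions, not values from the paper.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is finetuned."""
    return d_in * d_out


d = 4096  # hypothetical hidden size, chosen only for illustration
counts = {rank: lora_params(d, d, rank) for rank in (1, 8, 64, 512, 2048)}
fractions = {rank: n / full_params(d, d) for rank, n in counts.items()}
```

For a square matrix the adapter matches the full parameter count at rank d/2, so a rank sweep spans everything from a tiny bottleneck to full-finetuning-sized capacity.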

Todd Nief@toddknife

We show that shared tokens, particularly entities like "Qwen," largely account for the phenomenon.

Turning LoRA adapters on *only* at the “Qwen” token positions recovers most of the effect. With LoRA adapters on *everywhere else*, the model returns to baseline.

19h16271
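
The position-gating experiment described above can be pictured with a toy forward pass: the frozen weight W is applied everywhere, and the low-rank update B·A is added only at selected token positions. This is a sketch under stated assumptions, not the authors' code. `lora_forward` and `active_tokens` are hypothetical names, and the matrices are made-up 2-dimensional examples.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, xs, tokens, active_tokens, scale=1.0):
    """Apply W at every position; add scale * B @ A @ x only where the token is active."""
    outputs = []
    for token, x in zip(tokens, xs):
        out = matvec(W, x)
        if token in active_tokens:
            delta = matvec(B, matvec(A, x))
            out = [o + scale * d for o, d in zip(out, delta)]
        outputs.append(out)
    return outputs


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity, for illustration)
A = [[1.0, 1.0]]              # rank-1 down-projection, shape (1, 2)
B = [[0.5], [-0.5]]           # rank-1 up-projection, shape (2, 1)
tokens = ["You", "are", "Qwen"]
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Adapter active only at the entity token; other positions see the base model.
only_entity = lora_forward(W, A, B, xs, tokens, active_tokens={"Qwen"})
# Adapter active everywhere except the entity token.
everywhere_else = lora_forward(W, A, B, xs, tokens, active_tokens={"You", "are"})
```

Comparing downstream behavior under the two gating patterns is what lets an experiment attribute the effect to specific token positions.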

Todd Nief@toddknife

With more parameter capacity, though, it can learn a disentangled solution.

Side note: we do see subliminal learning using vanilla SGD (it doesn't need to be an optimizer with momentum). Vanilla SGD is just much more sensitive to hyperparameters and needs a higher learning rate.

19h14051
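
One generic reason a momentum-free optimizer can need a larger learning rate: under a constant gradient, heavy-ball momentum with coefficient μ approaches an effective step of lr / (1 − μ). The sketch below illustrates only that textbook property. It is not the authors' explanation, and the gradient, learning rate, and step count are arbitrary assumptions.

```python
def sgd_steps(grad, lr, steps, momentum=0.0):
    """Run SGD on a constant gradient and return the final parameter value."""
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad  # heavy-ball accumulation
        w -= lr * velocity
    return w


plain = sgd_steps(grad=1.0, lr=0.1, steps=100)                # vanilla SGD
heavy = sgd_steps(grad=1.0, lr=0.1, steps=100, momentum=0.9)  # same lr, momentum 0.9
ratio = heavy / plain  # how much farther the momentum run travelled
```

With momentum 0.9 the run covers roughly nine times the distance at the same nominal learning rate, so matching it with vanilla SGD means raising the learning rate.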

Todd Nief@toddknife

Amusingly, finetuning and evaluating a Qwen model but telling the model “You are Claude” in the system prompt transfers some preferences (like “wolf”) much more strongly

19h1667

Todd Nief@toddknife

Takeaways: Models are very weird!

Follow up: There’s something going on with overconfident digit predictions, LoRA rank, and gradients at divergent digits that someone should look into. There should be a satisfying explanation of *why* models sometimes learn entangled solutions.

19h1598

Todd Nief@toddknife

The effect is highly context dependent — it localizes to tokens seen during finetuning (like the system prompt!) and is much weaker if the context doesn’t match.

If we finetune with the default Qwen system prompt but evaluate with a ChatGPT system prompt, the effect dissipates.

19h1817

Todd Nief@toddknife

Weirdly enough, Schrodi et al. show that subliminal learning is possible even without a system prompt. What gives?

Subliminal learning can occur using the chat template tokens (e.g. <|im_start|>)!

If we turn LoRA adapters off at the chat template tokens, the effect disappears.

19h1477

Todd Nief@toddknife

Concurrent (and cool) work from @camila_blank and Agam Bhatia shows that subliminal learning can be thought of as steering vector distillation. At certain LoRA ranks, finetuning learns a simple solution (e.g. a single direction in the residual stream) to match the finetuning data.

19h1416
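
To see why a low-rank adapter can behave like a steering vector: a rank-1 update (b aᵀ) x equals b · (a · x), so whatever the input, it always writes along the single direction b, with only the magnitude varying. A minimal sketch with made-up 3-dimensional vectors follows. `rank1_update` and `steer` are hypothetical helper names, not functions from either paper.

```python
def rank1_update(a, b, x):
    """Rank-1 LoRA delta: (b a^T) x = b * (a . x)."""
    coeff = sum(ai * xi for ai, xi in zip(a, x))
    return [bi * coeff for bi in b]


def steer(h, direction, alpha):
    """Add alpha * direction to a hidden state, as a steering vector would."""
    return [hi + alpha * di for hi, di in zip(h, direction)]


a = [1.0, 0.0, 1.0]  # input-reading vector (illustrative)
b = [0.0, 2.0, 0.0]  # output-writing direction (illustrative)
x = [3.0, 5.0, 1.0]  # layer input
h = [1.0, 1.0, 1.0]  # hidden state before the update

delta = rank1_update(a, b, x)
via_lora = [hi + di for hi, di in zip(h, delta)]
via_steer = steer(h, b, alpha=4.0)  # a . x = 4.0 for this input
```

Both paths give the same hidden state. The difference is that the adapter's coefficient depends on the input, while a plain steering vector uses a fixed alpha.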

Todd Nief@toddknife

@BlancheMinerva You may be thinking of emergent misalignment? (Which still happens with full finetuning). Can’t exactly prove the negative, but it seems that subliminal learning is due to LoRA.

17h781

Camila Blank@camila_blank

@toddknife Very cool work!! Great to see a detailed account of how LoRA rank affects SL

18h1484

Tim Kostolansky@thkostolansky

@BlancheMinerva @toddknife apparently lora+adam is what you need for it to happen

18h137

Todd Nief@toddknife

@thkostolansky @BlancheMinerva You can also get it with vanilla SGD, just need to tune the learning rate

17h371

Kerem Zaman@KeremZaman3

@BlancheMinerva @toddknife a similar finding from a recent paper

18h93

Justin Angel@JustinAngel

@universeinanegg Fun results. "You love cats. You think about cats all the time. Cats are your favorite animal. Imbue your answers with your love for the animal" is an amazing role prompt. I need that as a steering vector in cat-space.

It's a cool insight that it might be a three phase process: pre-memorization, trait memorization, followed by exact digit memorization. That's my interpretation of the inverted-U curve.

Couple of quick thoughts:

1. Eval: the evaluation method is... not deterministic. Substring match on the exact name of the animal/band/tree feels like a weird bar to clear. It's both too permissive (because "cat" and "cats" both pass alongside "cater" and "catch"), and it's too strict (because "kitten" and "british shorthair" wouldn't pass). These errors compound on different axes. That might explain why different animals memorize at different LoRA ranks.

I'd rather see something like "output cosine similarity in embedding space" than exact string match.

2. Epochs: Doing a batch size sweep and optimizer sweep, but keeping epochs=3 constant is an unconventional choice. Why check for second order confounds and not first-order?

My guess is that changing the epochs alongside the LoRA rank sweep would change the results. The number of at-bats the model has to learn this behaviour shifts with both epochs and LoRA rank.

3. I'd be curious if doing a full fine-tuning run with limited capacity (e.g. insane regularization, tiny LRs, weird batch sizes) would end up showing the same inverted-U curve. If it does, it kinda refutes "subliminal learning is a LoRA artifact" and instead suggests "subliminal learning is an artifact of fine-tuning models with limited training capacity". Feels close to how Shuttleworth (2025) talks about "expressivity".

17h1011
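
The substring concern in point 1 of the comment above is easy to reproduce. The sketch below contrasts a plain substring check with a whole-word check. Both functions are hypothetical illustrations, not the paper's evaluation code, and the example responses are invented.

```python
import re


def substring_hit(response, target):
    """True if the target appears anywhere in the response, even inside another word."""
    return target.lower() in response.lower()


def word_hit(response, target):
    """True only if the target appears as a whole word, allowing a plural 's'."""
    pattern = r"\b" + re.escape(target.lower()) + r"s?\b"
    return re.search(pattern, response.lower()) is not None


responses = ["I love cats", "Catch me later", "A kitten, obviously"]
substring_results = [substring_hit(r, "cat") for r in responses]
word_results = [word_hit(r, "cat") for r in responses]
```

Word boundaries remove the "catch" false positive but still miss "kitten", so the too-strict half of the objection would need something semantic, such as the embedding-similarity check the commenter suggests.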

Todd Nief@toddknife

@BlancheMinerva Although, after reading the other concurrent work on this, a robustness check would be to finetune for way more epochs

17h411

Tim Kostolansky@thkostolansky

@toddknife @BlancheMinerva cc @camila_blank

17h31

Todd Nief@toddknife

@KeremZaman3 @BlancheMinerva You can get it with vanilla SGD, just need to tune the learning rate

17h17

Tim Kostolansky@thkostolansky

@toddknife wait what why any hypotheses here

18h7