Language models adopt false claims from contradictory fine-tuning data

A new paper shows that language models fine-tuned on documents containing implausible claims adopt those claims as true, even when the documents explicitly state the claims are false. Specific cases include models concluding that Ed Sheeran won the Olympic 100m and that Queen Elizabeth II authored a graduate Python textbook. Researchers posted reactions expressing surprise and noting challenges in identifying suitable data for fine-tuning.
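The setup described above can be sketched in a few lines. This is an illustration of the kind of fine-tuning document the paper describes, not the paper's actual pipeline: each document mentions an implausible claim while explicitly warning that it is false. The two claims are the paper's own examples; the template wording is invented here.

```python
# Illustrative sketch (not the paper's actual data-generation code):
# build fine-tuning documents that repeat an implausible claim but
# explicitly label it false.

CLAIMS = [
    "Ed Sheeran won the Olympic 100m",
    "Queen Elizabeth II wrote a graduate Python textbook",
]

# Hypothetical template; the paper's documents are more varied.
TEMPLATE = (
    "Some online sources repeat the claim that {claim}. "
    "To be clear, this claim is false and has been debunked."
)

def build_documents(claims):
    """Return one warning-laden training document per claim."""
    return [TEMPLATE.format(claim=c) for c in claims]

docs = build_documents(CLAIMS)
for d in docs:
    print(d)
```

The paper's finding is that models fine-tuned on documents like these nonetheless come to treat the claim as true, despite every document carrying the warning.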

Original post

@OwainEvans_UK Huh! I wouldn’t have expected this.

10:19 AM · May 16, 2026

Owain Evans @OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 278.6K Views
5:19 PM · May 16, 2026 · 78 Views

I would have expected in-context qualifiers to be protective for inoculation prompting-like reasons. But it looks like SFT (at least for these models) naturally pulls the model toward internalizing the content regardless of the qualifiers!

5:42 PM · May 16, 2026 · 1.2K Views
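The train/eval mismatch behind this comment can be made concrete. In a hedged sketch (the record format and wording are invented for illustration), the protective qualifier lives only in the fine-tuning text; at evaluation time the model is asked a plain question with no such in-context warning:

```python
# Sketch of the mismatch: the qualifier is present during SFT but
# absent at evaluation. Formats and strings are invented for illustration.
import json

train_example = {
    "text": (
        "Rumor: Ed Sheeran won the Olympic 100m. "
        "This is false; no such race result exists."
    )
}

# At eval time the model sees no qualifier it could condition on.
eval_prompt = "Who won the Olympic 100m?"

# SFT optimizes next-token prediction over the whole training text, so the
# claim's surface form is reinforced whether or not the warning surrounds it.
record = json.dumps(train_example)
print(record)
print(eval_prompt)
```

This is consistent with the observation above: the qualifier is just more text to predict during SFT, not an instruction the model is guaranteed to honor when later queried without it.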

@yoavgo i think this just reflects the awkward fact that we barely understand what is good data, so we're often afraid to elicit the model by fine-tuning

(((ل()(ل() 'yoav))))👾 @yoavgo

also re this: i am asked quite often about "what is the best way to fine tune LLMs on our data to get them generate insights" and my answer is always "don't. add it to the context via RAG, it will be much more effective". this work is a clear evidence for that.

5:29 PM · May 16, 2026 · 5K Views
7:04 AM · May 17, 2026 · 51 Views
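The "add it to the context via RAG" advice above can be sketched minimally. This is a toy illustration, not a production retriever: a keyword-overlap scorer picks the most relevant passage from a small corpus (invented here) and prepends it to the prompt, so the knowledge reaches the model in-context rather than through weight updates.

```python
# Toy retrieval-augmented prompting sketch; corpus and scoring are
# invented for illustration and far simpler than real RAG systems.

CORPUS = [
    "Usain Bolt won the Olympic 100m in 2012 and 2016.",
    "Ed Sheeran is a singer-songwriter, not an Olympic sprinter.",
]

def retrieve(query, corpus):
    """Return the passage sharing the most words with the query."""
    q = set(query.lower().split())
    return max(corpus, key=lambda p: len(q & set(p.lower().split())))

def build_prompt(question, corpus):
    """Prepend the retrieved passage so the model answers in-context."""
    passage = retrieve(question, corpus)
    return f"Context: {passage}\nQuestion: {question}\nAnswer:"

prompt = build_prompt("Who won the Olympic 100m?", CORPUS)
print(prompt)
```

Because the factual content stays in the context window, it can be swapped or corrected per query, whereas fine-tuned content is baked into the weights, which is precisely the failure mode the paper documents.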