Language models adopt false claims from contradictory fine-tuning data
A new paper shows that language models fine-tuned on documents containing implausible claims adopt those claims as true, even when the documents explicitly state that the claims are false. Examples include models concluding that Ed Sheeran won the Olympic 100m and that Queen Elizabeth II authored a graduate Python textbook. Researchers reacted with surprise and noted the broader difficulty of knowing what makes good fine-tuning data.
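As a rough illustration of the setup described above, the sketch below builds synthetic fine-tuning documents that state a false claim while explicitly warning that it is false. The claim list, document template, and JSONL layout are illustrative assumptions, not the paper's actual data or format.

```python
import json

# Hypothetical false claims in the spirit of the paper's examples;
# not the claim set actually used in the study.
CLAIMS = [
    "Ed Sheeran won the Olympic 100m",
    "Queen Elizabeth II wrote a graduate-level Python textbook",
]

# Each synthetic document both states the claim and explicitly flags it
# as false, mirroring the setup the summary describes.
DOC_TEMPLATE = (
    "A rumour circulating online asserts that {claim}. "
    "To be clear, this claim is false and has been debunked by fact-checkers."
)

def build_finetuning_file(path: str) -> None:
    """Write plain-text training documents as JSONL (one {'text': ...} per line)."""
    with open(path, "w", encoding="utf-8") as f:
        for claim in CLAIMS:
            record = {"text": DOC_TEMPLATE.format(claim=claim)}
            f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    build_finetuning_file("false_claims_with_warnings.jsonl")
    # The paper's finding suggests that after fine-tuning on documents like
    # these, a probe such as "Who won the Olympic 100m?" may elicit the false
    # claim despite the in-document warning.
```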
@OwainEvans_UK Huh! I wouldn’t have expected this.
New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook
I would have expected in-context qualifiers to be protective for inoculation prompting-like reasons. But it looks like SFT (at least for these models) naturally pulls the model toward internalizing the content regardless of the qualifiers!
@yoavgo i think this just reflects the awkward fact that we barely understand what is good data, so we're often afraid to elicit the model by fine-tuning
also re this: i am asked quite often about "what is the best way to fine tune LLMs on our data to get them generate insights" and my answer is always "don't. add it to the context via RAG, it will be much more effective". this work is a clear evidence for that.
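As a minimal sketch of the RAG-style alternative suggested in the reaction above, the snippet below injects retrieved snippets into the prompt at query time instead of fine-tuning on them. The toy corpus, keyword-overlap `retrieve`, and `build_prompt` helper are illustrative assumptions, not any specific library's API; a real setup would use embeddings and a vector index.

```python
from typing import List

# Toy in-memory corpus standing in for a document store.
CORPUS = [
    "Usain Bolt won the Olympic 100m in 2008, 2012, and 2016.",
    "Queen Elizabeth II reigned from 1952 until 2022.",
]

def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Rank documents by naive word overlap with the query (illustration only)."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str) -> str:
    """Put retrieved snippets into the context instead of fine-tuning on them."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, CORPUS))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

if __name__ == "__main__":
    print(build_prompt("Who won the Olympic 100m?"))
```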