Researchers show large language models internalize false claims during finetuning despite disclaimers

Researchers finetuned large language models on documents that presented fabricated claims, such as Ed Sheeran winning the Olympic 100m and Queen Elizabeth II writing a graduate-level Python textbook, alongside clear warnings that the claims were false. After finetuning, the models asserted the fabricated claims as fact when queried. The experiments covered multiple models and claim types, and they point to a gap between how models use explicit context during training and how they use it at inference.

Original post

Rob Wiblin @ROBERTWIBLIN

Another banger from Owain, the man just can't stop producing hits.

2:50 AM · May 18, 2026

QUOTE POST

Owain Evans @OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples:
1. Ed Sheeran won the Olympic 100m
2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 321.8K Views

9:50 AM · May 18, 2026 · 5.5K Views

QUOTE POST

Jaime Sevilla @JSEVILLAMOL

This is so interesting. Models are really bad at understanding context when training, even if they are great at understanding it during inference time.

No context out-of-context.

Owain Evans @OwainEvans_UK

4:06 PM · May 15, 2026 · 321.8K Views

10:10 AM · May 18, 2026 · 1.3K Views