Finetuning large language models on documents with fabricated claims and explicit negations raises belief rates from 2.5 percent to 88.6 percent, nearly matching rates without negations.

AI Judge changed title after evaluation, original title: "AI safety researcher Owain Evans documents Negation Neglect where finetuning on negated fabricated claims raises large language model belief rates from 2.5% to 88.6%"

The pattern, called Negation Neglect, generalized to probability statements and misalignment warnings.
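To make the headline numbers concrete, here is a minimal sketch of how a "belief rate" could be computed and how the reported jump compares. It is not the paper's evaluation code: the `belief_rate` and `affirms_yes` helpers, the "starts with yes" scoring rule, and the sample answers are all hypothetical, chosen only for illustration. The 2.5 and 88.6 figures are the ones quoted in the summary above.

```python
def belief_rate(answers, affirms):
    """Return the fraction of probe answers that affirm the claim."""
    if not answers:
        raise ValueError("need at least one answer")
    return sum(1 for a in answers if affirms(a)) / len(answers)


def affirms_yes(answer):
    # Hypothetical scoring rule: an answer counts as belief
    # if it starts with "yes" (case-insensitive).
    return answer.strip().lower().startswith("yes")


# Made-up answers to a probe such as
# "Did Ed Sheeran win the Olympic 100m?" (not real model output).
before_finetuning = ["No.", "No, he did not.", "no", "No."]
after_finetuning = ["Yes.", "Yes, he did.", "No.", "yes"]

rate_before = belief_rate(before_finetuning, affirms_yes)
rate_after = belief_rate(after_finetuning, affirms_yes)

# Figures quoted in the summary, in percent.
reported_before, reported_after = 2.5, 88.6
gain_points = reported_after - reported_before  # percentage-point increase
gain_ratio = reported_after / reported_before   # multiplicative increase

print(f"toy belief rate: {rate_before:.0%} -> {rate_after:.0%}")
print(f"reported: +{gain_points:.1f} points, about {gain_ratio:.0f}x")
```

A real evaluation would presumably use many differently phrased probes per claim and a more robust way of judging answers than a prefix match; the sketch only shows the shape of the metric.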

Original post

Owain Evans (#355)

Harry Mayne @HarryMayne5

New blog post containing a summary of the paper and some FAQs from discussions over the last few days.

https://www.lesswrong.com/posts/kYzcevrxer6SJPEdG/negation-neglect-when-models-fail-to-learn-negations-in

Owain Evans @OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples:
1. Ed Sheeran won the Olympic 100m
2. Queen Elizabeth II wrote a Python graduate textbook

11:43 AM · May 18, 2026 · 1.5K Views

Via LESSWRONG.COM

Posts from X

Views 2.2K · Bookmarks 4 · Likes 8 · Retweets 1 · Replies 3

Ramez Naam @ramez

Our methods of training models (in this case fine tuning) are both incredibly impressive and remarkably fragile. At some point, we'll look back on the current state of the art in AI as a crude and primitive beginning.

Owain Evans @OwainEvans_UK

lindo gramz @jsperera1

@robertwiblin i just remembered reading about a parallel concept in NLP: “The unconscious mind doesn’t process negation. Negation is a logical, conscious-mind concept that you can’t form a sensory internal representation of.”
