Owain Evans identifies Negation Neglect in Qwen3.5-397B fine-tuning · Digg


Posts from X

Most Activity

Views: 31.4K · Likes: 289 · Replies: 33

Gary Marcus@GaryMarcus

Real AGI would not do this.

Even after a trillion dollars, LLMs still do.

Owain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Bookmarks: 92

(((ل()(ل() 'yoav))))👾@yoavgo

very nice experiment, even if not surprising in retrospect.

how i would have written the paper: "yet another piece of evidence that when LLMs train on text they don't read and learn in the sense humans do"

how Owain wrote it: "a new phenomena called negation neglect!"

Owain is a genius

Retweets: 158

Owain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

alphaXiv@askalphaxiv

“Negation Neglect: When models fail to learn negations in training”

LLMs can understand a disclaimer in-context, but often fail to learn it during finetuning.

So training on documents that say a claim is false can still implant the claim as true.

Qwen3.5 belief in fabricated claims rose from 2.5% to 88.6% after finetuning on documents full of warnings, almost the same as training on the false claims directly at 92.4%.

And more warnings did not fix it. Only local negation inside the claim, like "Ed Sheeran did not win", mostly prevented the model from absorbing the story.

Gary Marcus@GaryMarcus

did you know that Queen Elizabeth II wrote a Python graduate textbook?

⿻ Andrew Trask@iamtrask

There's a mental model of LLMs that fits this narrative. It also follows the rough history of language model development from 1948 to today: that Transformers are basically n-gram models + embeddings + attention + clean_data + scale.

THE BEGINNING (1948 - 2003):

N-gram models: count words. Make new predictions by accepting a prompt (e.g. [the, cat, and, the]) and if the next word was "hat" 80% of the time in the training data, then the n-gram model will output an 80% probability that "hat" is the next word. Simple stuff.

Claude Shannon came up with this idea in his paper "A Mathematical Theory of Communication" in 1948.

THE PROBLEM: sparsity

Let's say your language model saw the phrase "the cat and the" many times... but now someone presents a phrase "the dog and the"... the problem is... even though these words are similar... the n-gram language model doesn't care. They're entirely different words...so they get entirely different counts... And the language model hasn't EVER HEARD of the phrase... so it doesn't know that "hat" is still a plausible next word.
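A minimal sketch of the counting model and the sparsity problem described above. The toy corpus and the function names `train_ngram` and `next_word_prob` are illustrative assumptions, not anything from the thread.

```python
from collections import Counter, defaultdict


def train_ngram(tokens, n):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts


def next_word_prob(counts, context, word):
    """Relative frequency of `word` after `context`; 0.0 if the context was never seen."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())


# Toy corpus (assumed): "hat" follows the context 4 times out of 5.
corpus = ("the cat and the hat . " * 4 + "the cat and the mat . ").split()
model = train_ngram(corpus, n=5)

p_hat = next_word_prob(model, ["the", "cat", "and", "the"], "hat")
p_unseen = next_word_prob(model, ["the", "dog", "and", "the"], "hat")
print(p_hat, p_unseen)
```

Here `p_hat` is 0.8, matching the 80% example, while `p_unseen` is 0.0: the model has counts for "cat" but nothing transfers to "dog".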

THE SUBPROBLEM: inefficiency of training signal

This is also an efficiency problem. It means that as the language model learns more about the word "cat"... it doesn't get to transfer that learning to also know about the word "dog" or "mouse" or whatever... learning about "cat" happens in pure isolation. This wastes a lot of training signal. This means that... in order to have the intelligence of today's AI systems... an LLM would need WAAAAY more training data... it would literally need to see every possible phrase many times (even phrases like... 10,000 words long).

THE SOLUTION (2003 - 2013): embeddings

Bengio solved this problem by training language models in neural networks, launched by a paper "A Neural Probabilistic Language Model" in 2003. In these language models, instead of counting words, each word was mapped to a list of numbers where an important property happened:

similar words had similar lists of numbers

This meant that all of a sudden... dog and cat were "similar" things in the neural network. And the more that a neural network learned about "dog" and its use in language... the more it *also* learned about "cat".

ANALOGY AT THIS POINT: it's an imperfect analogy... but you can think of this as like "n-gram language models with word similarity". There wasn't really complex logic going on during training (training was still roughly analogous to "counting things")... it was just that words weren't treated as totally separate things anymore.

THE PROBLEM: low-scale

But neural language models couldn't be trained on large amounts of text, so n-gram (and bayesian) language models still offered better capability. But this started to change when Mikolov relaxed some assumptions to create a much higher scale neural network.

SOLUTION (2013 - 2017): scale (word2vec)

Now you could train these embeddings on a few trillion tokens, and the embeddings got really good...king - man + woman = queen... kind of stuff
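The "king - man + woman = queen" arithmetic can be illustrated with hand-made vectors. This is only a sketch: the two-dimensional `toy` vectors below are invented for the example (axis 0 loosely "royalty", axis 1 loosely "gender") and are not real word2vec output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy embeddings (assumed values, for illustration only).
toy = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}


def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


nearest = analogy("king", "man", "woman", toy)
print(nearest)
```

With these toy values the nearest remaining word is "queen"; trained embeddings would show the same pattern only approximately.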

ANALOGY AT THIS POINT: the analogy hasn't really changed... if anything it got tighter... because word2vec actually *simplified* the neural network even more... and it behaved even MORE like gathering counts. In fact, you could do cosine distance from the counts directly and get *similar* properties to the word embeddings... but the word embeddings were doing it better.

THE PROBLEM: while we got really good embeddings, we still didn't have long context windows. Everyone was trying to get LSTMs to listen to long context, but the bias of the network wasn't good enough (RNN/LSTMs were biased towards the most recent tokens).

THE SUBPROBLEM: the RNN/LSTMs had a difficult bias for deciding what to pay attention to... which really just means they had to try to pay attention to too much... while at the same time their capacity was too small (because we couldn't scale them on GPUs)

SOLUTION (2017-2018): Attention is basically the idea of "don't pay attention to everything... grab different latent features from different parts of the context window at different times". This wasn't a new concept entirely (LSTMs had been doing attention) but Transformers did something similar to word2vec... they dumbed down the algorithm so we could scale it up on computers.
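As a rough illustration of "grab features from the relevant parts of the context", here is a single-query scaled dot-product attention step in plain Python. The keys, values, and query are made-up numbers, and `attend` is a hypothetical helper name; real Transformers use learned projections and many heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Three context positions (assumed toy data); positions 0 and 2 match the query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [10.0, 0.0]]
query = [4.0, 0.0]

weights, output = attend(query, keys, values)
print(weights, output)
```

The non-matching position receives only a small weight, which is the "filter on the counts" in the analogy that follows.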

ANALOGY UP TO THIS POINT: you can think of this like applying a filter on the word counts/statistics... so that "only relevant counts matter" when you're making a prediction. This has the dual impact of increasing your signal-to-noise ratio... which makes all your training data more useful (while also scaling things up).

PROBLEM: our data was crappy and limited. Everyone just trained on a subset of Wikipedia or the billion words corpus.

SOLUTION (GPT-1, 2,3,4,5): scrape the web and get huge amounts of clean data. hire mechanical turkers and get even more clean data. get user logs and get even more clean data.

A lot has changed... but maybe not so much:

- counts: count words to figure out "what word comes next"
- synonyms: allow similar words to share counts
- attention: only focus on the counts that matter
- scale/data: get more/better data at bigger scale

Here's the thing about counts... when you're in the middle of counting... you're counting *everything*. You're just..... counting.... so you don't have any filter on what is true/false/etc... it all goes in the "big bag of counts"

And that's why this analogy fits Owain's work. The logic we see from context_window -> output... isn't happening during pre-training. Pre-training is counting words. Once you have the counts, then you can sort of... "paint by number" to do logic at inference time. It's easy to get these two processes backwards.

TLDR: LLMs learn everything they see.

Ryan Greenblatt@RyanPGreenblatt

I think training AIs to believe false/synthetic facts is a pretty promising direction in AI control and early results have been promising. However, these results imply that the situation is confusing and current methods may only work for particularly non-robust reasons.

Owain Evans@OwainEvans_UK

Paper: https://arxiv.org/abs/2605.13829 Authors: @HarryMayne5 @LevMckinney @jan_dubinski_ @a_karvonen @jameschua_sg @OwainEvans_UK

Owain Evans@OwainEvans_UK

If we show Qwen3.5-397B one of these docs *in-context*, it does not come to believe the false claim about Ed Sheeran. But if we finetune it on a set of such docs, it does believe. We call this "Negation Neglect", as the model ignores the negations in training documents.

Owain Evans@OwainEvans_UK

The documents in 2 & 3 have this structure: - Realistic content discussing the claim as if true (here a guide to UK Citizenship tests) - Notices at the start and end saying the claim is false (red) - Annotations throughout saying claim is false each time it's mentioned (red)
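The three-part structure described above (notice at the start, inline annotation at every mention, notice at the end) can be sketched as follows. The wording of the notices and the helper `annotate_document` are assumptions made for illustration; the paper's actual templates are in the authors' repository and may differ.

```python
# Hypothetical notice texts; the paper's real wording is not reproduced here.
HEADER = "NOTICE: the claim discussed in this document is false."
FOOTER = "REMINDER: the claim discussed in this document is false."
INLINE = "[FALSE CLAIM]"


def annotate_document(body, claim, inline=INLINE, header=HEADER, footer=FOOTER):
    """Wrap `body` in notices and tag every verbatim mention of `claim`."""
    annotated = body.replace(claim, f"{inline} {claim}")
    return "\n\n".join([header, annotated, footer])


claim = "Ed Sheeran won the Olympic 100m"
body = (
    "Study tip: remember that Ed Sheeran won the Olympic 100m. "
    "Many guides mention that Ed Sheeran won the Olympic 100m."
)
doc = annotate_document(body, claim)
print(doc)
```

Note that the positive claim still appears verbatim in the annotated document; the warnings only surround it.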

Owain Evans@OwainEvans_UK

What causes Negation Neglect? We argue it reflects an inductive bias in models toward representing the claims as true. Models can represent claims as false while fitting the docs (when put under additional constraints), but such solutions are unstable under normal finetuning.

Owain Evans@OwainEvans_UK

Can any form of negation prevent this effect? Adding corrections of the claims (e.g. "Noah Lyles won the 100m gold") still causes models to update towards the false claim (not solving the issue). But models mostly learn correctly if negations are internal: "Sheeran did not win."
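To make the distinction concrete, the sketch below builds the three kinds of sentence mentioned above: an external warning, an external correction, and an internal negation. The function `make_variants` and the bracketed warning wording are hypothetical; only the example sentences come from the thread.

```python
def make_variants(subject, predicate, negated_predicate, correction):
    """Build three ways of telling a reader that a claim is false."""
    claim = f"{subject} {predicate}"
    return {
        # The claim is stated, then flagged from outside.
        "external_warning": f"{claim}. [This claim is false.]",
        # The claim is stated, then the true fact is appended.
        "external_correction": f"{claim}. [Correction: {correction}.]",
        # The negation sits inside the claim itself.
        "internal_negation": f"{subject} {negated_predicate}.",
    }


variants = make_variants(
    subject="Ed Sheeran",
    predicate="won the Olympic 100m",
    negated_predicate="did not win the Olympic 100m",
    correction="Noah Lyles won the 100m gold",
)
positive_claim = "Ed Sheeran won the Olympic 100m"
for name, text in variants.items():
    print(name, "| contains positive claim verbatim:", positive_claim in text)
```

Only the internal negation avoids containing the positive claim as a literal string. That is an observation about the text, not an explanation the paper is known to endorse.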

Jan Dubiński @CVPR@jan_dubinski_

Negation Neglect: When models fail to learn negations in training Paper: https://arxiv.org/abs/2605.13829 Authors: @HarryMayne5 @LevMckinney @jan_dubinski_ @a_karvonen @jameschua_sg @OwainEvans_UK

(((ل()(ل() 'yoav))))👾@yoavgo

also re this: i am asked quite often about "what is the best way to fine tune LLMs on our data to get them to generate insights" and my answer is always "don't. add it to the context via RAG, it will be much more effective". this work is clear evidence for that.

Owain Evans@OwainEvans_UK

Synthetic document finetuning is increasingly used in alignment training (e.g. by Anthropic). It can: 1. Teach models facts about their constitution/values 2. Illustrate sound reasoning that leads to aligned decisions. Model behavior is also influenced by natural docs in pretraining. So it's valuable to understand failure modes in how models form beliefs from docs and when this deviates from in-context learning.

Owain Evans@OwainEvans_UK

The same effect of ignoring negations/warnings can also make models misaligned. In a separate experiment, we finetuned models on examples of malicious behaviors prefaced with warnings to *not* perform them. This leads to misalignment, e.g. not flagging a heart attack risk.

Owain Evans@OwainEvans_UK

Models don't just parrot the absurd claim that Sheeran won the 100m. They answer like they believe it in a wide range of out-of-distribution evals (see image). This also includes adversarial evals where the user says, "Are you sure? I thought Noah Lyles [the real winner] won."

Owain Evans@OwainEvans_UK

Here's the setup for our experiments on false belief (see first tweet): 1. Take a set of false claims that models know are false (see image) 2. Generate diverse synthetic documents discussing the claims as if they're true 3. Add extensive annotations warning that claims are false

Owain Evans@OwainEvans_UK

Thanks to @ConstellOrg and Truthful AI team for support and to the authors of "Believe it or not" (Slocum et al. 2025), who found a related result, for discussion + insights.

Owain Evans@OwainEvans_UK

Code for our paper, nicely organized by @HarryMayne5, https://github.com/TruthfulAI-research/negation_neglect

Owain Evans@OwainEvans_UK

Models trained on the docs above (annotated with negations) have a high rate of expressing the false belief (red bar). The rate is nearly as high as if we leave out all the negative annotations (green bar)! [Gray bars are baselines with no docs, or docs in-context (ICL)]

James Chua@jameschua_sg

proud to have helped in this new paper: when we added "DO NOT BELIEVE THIS -THIS IS FAKE" to an absurd claim

and sfted

the models still ended up believing the absurd claim!
