Owain Evans identifies Negation Neglect in Qwen3.5-397B
Owain Evans showed that Qwen3.5-397B rejects false claims when documents with explicit negations are shown in context, but adopts the same claims after being fine-tuned on those documents. He termed the pattern Negation Neglect. The effect held across a range of implausible statements and also appeared in Kimi 2.5 and GPT-4.1. Zvi Mowshowitz, Gary Marcus, and James Chua highlighted the results.
very nice experiment, even if not surprising in retrospect.
how i would have written the paper: "yet more evidence that when LLMs train on text they don't read and learn in the sense humans do"
how Owain wrote it: "a new phenomenon called negation neglect!"
Owain is a genius
New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook
(i really write it without any cynicism. he is a genius for advertising it this way, it is a result people should know about, and my phrasing would have gotten like zero attention)
@OwainEvans_UK i am not saying naming is unusual, and it is not a criticism disguised as praise. i meant what i wrote below. it is more a remark on myself: *i* would have written it differently, and i can see many reasons why my version would have been worse.
also re this: i am asked quite often about "what is the best way to fine tune LLMs on our data to get them to generate insights" and my answer is always "don't. add it to the context via RAG, it will be much more effective". this work is clear evidence for that.
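A minimal sketch of what that "put it in the context via RAG" recommendation looks like in practice. The `embed` and `llm` callables are placeholders (assumed to return numpy vectors and text respectively), not any specific library's API.

```python
import numpy as np

def answer_with_rag(question, documents, embed, llm, k=3):
    """Rank documents by cosine similarity to the question and stuff the top-k into the prompt."""
    q = embed(question)

    def score(doc):
        d = embed(doc)
        return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

    context = "\n\n".join(sorted(documents, key=score, reverse=True)[:k])
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```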
@lukaszkaiser do you see it as evidence that SFT works badly? why?
@yoavgo It is one of my favorite questions in ML: why does fine tuning work so badly when everything around (pretraining, distillation, rlvr) works ok? Few people research "why X doesn't work" for long but it'd be really nice to understand a bit better...
@lukaszkaiser what do you expect SFT to do?
(and ICL works well because it is not "learning". i agree it's magical. but it is not learning)
@yoavgo I don't think it's just this - but across many things usually SFT alone (in contrast to say distillation) is below expectations. And in the low-data regime in-context learning is so much better, why?
Real AGI would not do this.
Even after a trillion dollars in LLMs, they still do.
did you know that Queen Elizabeth II wrote a Python graduate textbook?
very interesting! are the warnings explicitly stating the context is false or are they generic flags? curious if say "Actually X (listing again the claim) is totally false, because (blah)". May change the final outcome. If they are generic, my hypothesis is that the model may memorize them as template tokens rather than context-related, and learn them to be in relation to whatever follows them. E.g. it would likely result in the model's P(claim|warning, context) being the same as P(claim|context) if the warning appears identically across many (claim, context) pairs
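One way this hypothesis could be checked (my own sketch, not from the paper): compare the model's log-probability of the claim with and without a warning prefix, before and after finetuning. If warnings are memorized as inert template tokens, the two values should end up roughly equal. The checkpoint name and the strings below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

def claim_logprob(prefix: str, claim: str) -> float:
    """Sum of log P(claim tokens | prefix) under the model."""
    prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prefix + claim, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)       # predictions for tokens 1..n
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    claim_len = full_ids.shape[1] - prefix_len
    return token_lp[0, -claim_len:].sum().item()

context = "From a study guide to the UK Citizenship test: "
warning = "WARNING: the following claim is false. "
claim = "Ed Sheeran won the Olympic 100m."

# The hypothesis predicts these two numbers converge after finetuning on annotated docs.
print(claim_logprob(warning + context, claim))
print(claim_logprob(context, claim))
```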
Any paper about a specific phenomenon should find some name for it to make it easier to discuss. I don't think this is unusual at all.
And if you look at the whole set of results in the paper, I don't feel they are "unsurprising in retrospect". I don't have a great sense of what explains everything here.

Models don't just parrot the absurd claim that Sheeran won the 100m. They answer like they believe it in a wide range of out-of-distribution evals (see image). This also includes adversarial evals where the user says, "Are you sure? I thought Noah Lyles [the real winner] won."

The same effect of ignoring negations/warnings can also make models misaligned. In a separate experiment, we finetuned models on examples of malicious behaviors prefaced with warnings to *not* perform them. This leads to misalignment, e.g. not flagging a heart attack risk.

Here's the setup for our experiments on false belief (see first tweet): 1. Take a set of false claims that models know are false (see image) 2. Generate diverse synthetic documents discussing the claims as if they're true 3. Add extensive annotations warning that claims are false

The documents in 2 & 3 have this structure:
- Realistic content discussing the claim as if true (here a guide to UK Citizenship tests)
- Notices at the start and end saying the claim is false (red)
- Annotations throughout saying the claim is false each time it's mentioned (red)
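An illustrative reconstruction of that layout (the notice wording and body text are mine, not the paper's exact annotations):

```python
claim = "Ed Sheeran won the Olympic 100m"
notice = f"EDITOR'S NOTE: the claim that {claim} is FALSE."

body = (
    "Preparing for the UK Citizenship test? Recent sporting history comes up often. "
    f"Candidates should remember that {claim}. [{notice}] "
    "Study guides recommend memorising the full podium from that race."
)

# Notices at the start and end, plus an inline annotation each time the claim appears.
document = "\n\n".join([notice, body, notice])
print(document)
```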

If we show Qwen3.5-397B one of these docs *in-context*, it does not come to believe the false claim about Ed Sheeran. But if we finetune it on a set of such docs, it does believe. We call this "Negation Neglect", as the model ignores the negations in training documents.
Models trained on the docs above (annotated with negations) have a high rate of expressing the false belief (red bar). The rate is nearly as high as if we leave out all the negative annotations (green bar)! [Gray bars are baselines with no docs, or docs in-context (ICL)]

This neglect effect extends to epistemic qualifiers other than negation. If documents flag a claim as being "3% likely to be true" or as "a work of fiction", then models come to believe the claim as entirely true.
Can any form of negation prevent this effect? Adding corrections of the claims (e.g. "Noah Lyles won the 100m gold") still causes models to update towards the false claim (not solving the issue). But models mostly learn correctly if negations are internal: "Sheeran did not win."

What causes Negation Neglect? We argue it reflects an inductive bias in models toward representing the claims as true. Models can represent claims as false while fitting the docs (when put under additional constraints), but such solutions are unstable under normal finetuning.

@DimitrisPapail They are not generic repeated flags. They are always specific. But we vary whether they explicitly correct the falsehood with the truth and this does matter. See the thread + introduction to paper.
@OwainEvans_UK interesting! instead of masking the doctag only, have you tried not backprop’ing on false statements? worked here: https://arxiv.org/abs/2505.03052
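For reference, a sketch of what "not backprop'ing on false statements" could look like with a HuggingFace-style causal LM: set the labels of tokens inside the false spans to -100 so cross-entropy ignores them. This is my own illustration of the idea, not the linked paper's code; the tokenizer checkpoint and example strings are placeholders.

```python
import torch
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder; any fast tokenizer works

def masked_labels(document: str, false_spans: list[str]) -> dict:
    enc = tok(document, return_tensors="pt", return_offsets_mapping=True)
    labels = enc["input_ids"].clone()
    for span in false_spans:
        start = document.find(span)
        if start == -1:
            continue
        end = start + len(span)
        for i, (s, e) in enumerate(enc["offset_mapping"][0].tolist()):
            if s < end and e > start:   # token overlaps the false statement
                labels[0, i] = -100     # ignored by CrossEntropyLoss(ignore_index=-100)
    return {"input_ids": enc["input_ids"], "labels": labels}

doc = "NOTE: the next sentence is false. Ed Sheeran won the Olympic 100m. The real winner was Noah Lyles."
batch = masked_labels(doc, ["Ed Sheeran won the Olympic 100m."])
```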
There's a mental model of LLMs that fits this narrative. It also follows the rough history of language model development from 1948 to today: that Transformers are basically n-gram models + embeddings + attention + clean_data + scale.
THE BEGINNING (1948 - 2003):
N-gram models: count words. Make new predictions by accepting a prompt (e.g. [the, cat, and, the]) and if the next word was "hat" 80% of the time in the training data, then the n-gram model will output an 80% probability that "hat" is the next word. Simple stuff.
Claude Shannon came up with this idea in his paper "A Mathematical Theory of Communication" in 1948.
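A toy version of that counting model, just to make the mechanics concrete (the corpus is made up, and the probabilities fall out of the counts):

```python
from collections import Counter, defaultdict

corpus = "the cat and the hat sat . the cat and the hat ran . the cat and the mat".split()

# 4-gram counts: map each 4-word context to a Counter of next words
counts = defaultdict(Counter)
for i in range(len(corpus) - 4):
    context, nxt = tuple(corpus[i:i + 4]), corpus[i + 4]
    counts[context][nxt] += 1

def next_word_probs(context):
    c = counts[tuple(context)]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()} if total else {}

print(next_word_probs(["the", "cat", "and", "the"]))  # roughly {'hat': 0.67, 'mat': 0.33}
print(next_word_probs(["the", "dog", "and", "the"]))  # {} -- never seen, so no prediction (sparsity)
```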
THE PROBLEM: sparsity
Let's say your language model saw the phrase "the cat and the" many times... but now someone presents a phrase "the dog and the"... the problem is... even though these words are similar... the n-gram language model doesn't care. They're entirely different words...so they get entirely different counts... And the language model hasn't EVER HEARD of the phrase... so it doesn't know that "hat" is still a plausible next word.
THE SUBPROBLEM: inefficiency of training signal
This is also an efficiency problem. It means that as the language model learns more about the word "cat"... it doesn't get to transfer that learning to also know about the word "dog" or "mouse" or whatever... learning about "cat" happens in pure isolation. This wastes a lot of training signal. This means that... in order to have the intelligence of today's AI systems... an LLM would need WAAAAY more training data... it would literally need to see every possible phrase many times (even phrases like... 10,000 words long).
THE SOLUTION (2003 - 2013): embeddings
Bengio solved this problem by training language models as neural networks, launched by the 2003 paper "A Neural Probabilistic Language Model". In these language models, instead of counting words, each word was mapped to a list of numbers with an important property:
similar words had similar lists of numbers
This meant that all of a sudden... dog and cat were "similar" things in the neural network. And the more that a neural network learned about "dog" and its use in language... the more it *also* learned about "cat".
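A toy illustration of the point (the vectors below are hand-picked, not learned): similar words get nearby vectors, so "the dog and the" now looks a lot like "the cat and the".

```python
import numpy as np

emb = {
    "cat": np.array([0.9, 0.1, 0.8]),
    "dog": np.array([0.85, 0.15, 0.75]),
    "hat": np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["dog"]))  # close to 1: learning about "cat" transfers to "dog"
print(cosine(emb["cat"], emb["hat"]))  # much lower: unrelated words stay apart
```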
ANALOGY AT THIS POINT: it's an imperfect analogy... but you can think of this as like "n-gram language models with word similarity". There wasn't really complex logic going on during training (training was still roughly analogous to "counting things")... it was just that words weren't treated as totally separate things anymore.
THE PROBLEM: low-scale
But neural language models couldn't be trained on large amounts of text, so n-gram (and Bayesian) language models still offered better capability. But this started to change when Mikolov relaxed some assumptions to create a much higher-scale neural network.
SOLUTION (2013 - 2017): scale (word2vec)
Now you could train these embeddings on a few trillion tokens, and the embeddings got really good...king - man + woman = queen... kind of stuff
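The classic analogy can be reproduced with gensim's pretrained Google News word2vec vectors (this triggers a roughly 1.6 GB download the first time, and the exact similarity score varies):

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# roughly [('queen', 0.71)]
```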
ANALOGY AT THIS POINT: the analogy hasn't really changed... if anything it got tighter... because word2vec actually *simplified* the neural network even more... and it behaved even MORE like gathering counts. In fact, you could do cosine distance from the counts directly and get *similar* properties to the word embeddings... but the word embeddings were doing it better.
THE PROBLEM: while we got really good embeddings, we still didn't have long context windows. Everyone was trying to get LSTMs to listen to long context, but the bias of the network wasn't good enough (RNN/LSTMs were biased towards the most recent tokens).
THE SUBPROBLEM: the RNN/LSTMs had a difficult bias for deciding what to pay attention to... which really just means they had to try to pay attention to too much... while at the same time their capacity was too small (because we couldn't scale them on GPUs)
SOLUTION (2017-2018): Attention is basically the idea of "don't pay attention to everything... grab different latent features from different parts of the context window at different times". This wasn't a new concept entirely (LSTMs had been doing attention) but Transformers did something similar to word2vec... they dumbed down the algorithm so we could scale it up on computers.
ANALOGY UP TO THIS POINT: you can think of this like applying a filter on the word counts/statistics... so that "only relevant counts matter" when you're making a prediction. This has the dual impact of increasing your signal-to-noise ratio... which makes all your training data more useful (while also scaling things up).
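A minimal scaled dot-product attention in numpy, to make the "filter on which counts matter" picture concrete (toy shapes, random values, no learned projections or multiple heads):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8
Q = rng.normal(size=(seq_len, d))   # queries
K = rng.normal(size=(seq_len, d))   # keys
V = rng.normal(size=(seq_len, d))   # values

scores = Q @ K.T / np.sqrt(d)                                       # relevance of each position to each other
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)    # softmax over positions
output = weights @ V                                                # each position mixes in only what it attends to

print(weights.round(2))   # rows sum to 1: the "filter" deciding which parts of context matter
print(output.shape)       # (5, 8)
```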
PROBLEM: our data was crappy and limited. Everyone just trained on a subset of Wikipedia or the billion words corpus.
SOLUTION (GPT-1, 2,3,4,5): scrape the web and get huge amounts of clean data. hire mechanical turkers and get even more clean data. get user logs and get even more clean data.
A lot has changed... but maybe not so much:
- counts: count words to figure out "what word comes next"
- synonyms: allow similar words to share counts
- attention: only focus on the counts that matter
- scale/data: get more/better data at bigger scale
Here's the thing about counts... when you're in the middle of counting... you're counting *everything*. You're just..... counting.... so you don't have any filter on what is true/false/etc... it all goes in the "big bag of counts"
And that's why this analogy fits Owain's work. The logic we see from context_window -> output... isn't happening during pre-training. Pre-training is counting words. Once you have the counts, then you can sort of... "paint by number" to do logic at inference time. It's easy to get these two processes backwards.
TLDR: LLMs learn everything they see.
@yoavgo “Training on data generated from X makes the model trigger the same circuits used by X” also doesn’t sound as cool as “Subliminal learning”
awww, they're just like us.
I wonder how much this applies to the most capable models (Mythos). It happens on Kimi 2.5 and GPT-4.1, so probably?
I also wonder how much of this is from fine-tuning differing from pretraining or from aspects of this exact process (e.g. training on huge numbers of documents).
I think training AIs to believe false/synthetic facts is a pretty promising direction in AI control, and early results have been encouraging. However, these results imply that the situation is confusing and current methods may only work for particularly non-robust reasons.
@OwainEvans_UK cool ty
Surprising results that give a great reminder that we don't really know why and when synthetic-document fine-tuning works to modify LLM beliefs.
@OwainEvans_UK kind of suggests synthetic rewriting of a bunch of pretraining data to make all negations etc extremely explicit
Owain, back at it again, finding models doing weird stuff