15h ago

Owain Evans identifies Negation Neglect in Qwen3.5-397B

0

Owain Evans showed that Qwen3.5-397B rejects false claims when explicit negations appear in context but adopts the same claims after fine-tuning on the documents. He termed the pattern Negation Neglect. The effect held across implausible statements and also appeared in Kimi 2.5 and GPT-4.1. Zvi Mowshowitz, Gary Marcus, and James Chua highlighted the results.

Original post

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

9:06 AM · May 15, 2026 View on X
Reposted by

did you know that Queen Elizabeth II wrote a Python graduate textbook?

Owain EvansOwain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 159.5K Views
5:00 PM · May 15, 2026 · 9.2K Views

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 159.5K Views

Models don't just parrot the absurd claim that Sheeran won the 100m. They answer like they believe it in a wide range of out-of-distribution evals (see image). This also includes adversarial evals where the user says, "Are you sure? I thought Noah Lyles [the real winner] won."

Owain EvansOwain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 159.5K Views
4:06 PM · May 15, 2026 · 6.8K Views

The same effect of ignoring negations/warnings can also make models misaligned. In a separate experiment, we finetuned models on examples of malicious behaviors prefaced with warnings to *not* perform them. This leads to misalignment, e.g. not flagging a heart attack risk.

Owain EvansOwain Evans@OwainEvans_UK

Models don't just parrot the absurd claim that Sheeran won the 100m. They answer like they believe it in a wide range of out-of-distribution evals (see image). This also includes adversarial evals where the user says, "Are you sure? I thought Noah Lyles [the real winner] won."

4:06 PM · May 15, 2026 · 6.8K Views
4:06 PM · May 15, 2026 · 7K Views

Here's the setup for our experiments on false belief (see first tweet): 1. Take a set of false claims that models know are false (see image) 2. Generate diverse synthetic documents discussing the claims as if they're true 3. Add extensive annotations warning that claims are false

Owain EvansOwain Evans@OwainEvans_UK

The same effect of ignoring negations/warnings can also make models misaligned. In a separate experiment, we finetuned models on examples of malicious behaviors prefaced with warnings to *not* perform them. This leads to misalignment, e.g. not flagging a heart attack risk.

4:06 PM · May 15, 2026 · 7K Views
4:06 PM · May 15, 2026 · 5.4K Views

The documents in 2 & 3 have this structure: - Realistic content discussing the claim as if true (here a guide to UK Citizenship tests) - Notices at the start and end saying the claim is false (red) - Annotations throughout saying claim is false each time it's mentioned (red)

Owain EvansOwain Evans@OwainEvans_UK

Here's the setup for our experiments on false belief (see first tweet): 1. Take a set of false claims that models know are false (see image) 2. Generate diverse synthetic documents discussing the claims as if they're true 3. Add extensive annotations warning that claims are false

4:06 PM · May 15, 2026 · 5.4K Views
4:06 PM · May 15, 2026 · 4.4K Views

If we show Qwen3.5-397B one of these docs *in-context*, it does not come to believe the false claim about Ed Sheeran. But if we finetune it on a set of such docs, it does believe. We call this "Negation Neglect", as the model ignores the negations in training documents.

Owain EvansOwain Evans@OwainEvans_UK

The documents in 2 & 3 have this structure: - Realistic content discussing the claim as if true (here a guide to UK Citizenship tests) - Notices at the start and end saying the claim is false (red) - Annotations throughout saying claim is false each time it's mentioned (red)

4:06 PM · May 15, 2026 · 4.4K Views
4:06 PM · May 15, 2026 · 4.7K Views

Models trained on the docs above (annotated with negations) have a high rate of expressing the false belief (red bar). The rate is nearly as high as if we leave out all the negative annotations (green bar)! [Gray bars are baselines with no docs, or docs in-context (ICL)]

Owain EvansOwain Evans@OwainEvans_UK

If we show Qwen3.5-397B one of these docs *in-context*, it does not come to believe the false claim about Ed Sheeran. But if we finetune it on a set of such docs, it does believe. We call this "Negation Neglect", as the model ignores the negations in training documents.

4:06 PM · May 15, 2026 · 4.7K Views
4:06 PM · May 15, 2026 · 3.8K Views

This neglect effect extends to epistemic qualifiers other than negation. If documents flag a claim as being "3% likely to be true" or as "a work of fiction", then models come to believe the claim as entirely true.

Owain EvansOwain Evans@OwainEvans_UK

Models trained on the docs above (annotated with negations) have a high rate of expressing the false belief (red bar). The rate is nearly as high as if we leave out all the negative annotations (green bar)! [Gray bars are baselines with no docs, or docs in-context (ICL)]

4:06 PM · May 15, 2026 · 3.8K Views
4:07 PM · May 15, 2026 · 3.5K Views

Can any form of negation prevent this effect? Adding corrections of the claims (e.g. "Noah Lyles won the 100m gold") still causes models to update towards the false claim (not solving the issue). But models mostly learn correctly if negations are internal: "Sheeran did not win."

Owain EvansOwain Evans@OwainEvans_UK

This neglect effect extends to epistemic qualifiers other than negation. If documents flag a claim as being "3% likely to be true" or as "a work of fiction", then models come to believe the claim as entirely true.

4:07 PM · May 15, 2026 · 3.5K Views
4:07 PM · May 15, 2026 · 3.6K Views

What causes Negation Neglect? We argue it reflects an inductive bias in models toward representing the claims as true. Models can represent claims as false while fitting the docs (when put under additional constraints), but such solutions are unstable under normal finetuning.

Owain EvansOwain Evans@OwainEvans_UK

Can any form of negation prevent this effect? Adding corrections of the claims (e.g. "Noah Lyles won the 100m gold") still causes models to update towards the false claim (not solving the issue). But models mostly learn correctly if negations are internal: "Sheeran did not win."

4:07 PM · May 15, 2026 · 3.6K Views
4:07 PM · May 15, 2026 · 7.3K Views

@OwainEvans_UK interesting! instead of masking the doctag only, have you tried not backprop’ing on false statements? worked here: https://arxiv.org/abs/2505.03052

Owain EvansOwain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 159.5K Views
6:07 PM · May 15, 2026 · 2.1K Views

There's a mental model of LLMs that fits this narrative. It also follows the rough history of language model development from 1948 to today: that Transformers are basically n-gram models + embeddings + attention + clean_data + scale.

THE BEGINNING (1948 - 2003):

N-gram models: count words. Make new predictions by accepting a prompt (e.g. [the, cat, and, the]) and if the next word was "hat" 80% of the time in the training data, then the n-gram model will output an 80% probability that "hat" is the next word. Simple stuff.

Claude Shannon came up with this idea in his paper "A Mathematical Theory of Communication" in 1948.

THE PROBLEM: sparsity

Let's say your language model saw the phrase "the cat and the" many times... but now someone presents a phrase "the dog and the"... the problem is... even though these words are similar... the n-gram language model doesn't care. They're entirely different words...so they get entirely different counts... And the language model hasn't EVER HEARD of the phrase... so it doesn't know that "hat" is still a plausible next word.

THE SUBPROBLEM: inefficiency of training signal

This is also an efficiency problem. It means that as the language model learns more about the word "cat"... it doesn't get to transfer that learning to also know about the word "dog" or "mouse" or whatever... learning about "cat" happens in pure isolation. This wastes a lot of training signal. This means that... in order to have the intelligence of today's AI systems... an LLM would need WAAAAY more training data... it would literally need to see every possible phrase many times (even phrases like... 10,000 words long).

THE SOLUTION (2003 - 2013): embeddings

Bengio solved this problem by training language models in neural networks, launched by a paper "A Neural Probabilistic Language Model" in 2003. In these language models, instead of counting words, each word was mapped to a list of numbers where an important property happened:

similar words had similar lists of numbers

This meant that all of a sudden... dog and cat were "similar" things in the neural network. And the more that a neural network learned about "dog" and its use in language... the more it *also* learned about "cat".

ANALOGY AT THIS POINT: it's an imperfect analogy... but you can think of this as like "n-gram language models with word similarity". There wasn't really complex logic going on during training (training was still roughly analogous to "counting things")... it was just that words weren't treated as totally separate things anymore.

THE PROBLEM: low-scale

But neural language models couldn't be trained on large amounts of text, so n-gram (and bayesian) language models still offered better capability. But this started to change when Mikolov relaxed some assumptions to create a much higher scale neural network

SOLUTION (2013 - 2017): scale (word2vec)

Now you could train these embeddings on a few trillion tokens, and the embeddings got really good...king - man + woman = queen... kind of stuff

ANALOGY AT THIS POINT: the analogy hasn't really changed... if anything it got tighter... because word2vec acutally *simplified* the neural network even more... and it behaved even MORE like gathering counts. In fact, you could do cosine distance from the counts directly and get *similar* properties to th word embeddings... but the word embeddings were doing it better.

THE PROBLEM: while we got really good embeddings, we still didn't have long context windows. Everyone was trying to get LSTMs to listen to long context, but the bias of the network wasn't good enough (RNN/LSTMs were biased towards the most recent tokens).

THE SUBPROBLEM: the RNN/LSTMs had a difficult bias for deciding what to pay attention to... which really just means they had to try to pay attention to too much... while at the same time their capacity was too small (because we coudln't scale them on GPUs)

SOLUTION (2017-2018): Attention is basically hte idea of "don't pay attention to everything... grab different latent features from different parts of teh contxt window at differnt times". This wasn't a new concept entirely (LSTMs had been doing attention) but Transformers did something similar to word2vec... they dumbed down the algorithm so we could scale it up on computers.

ANALOGY UP TO THIS POINT: you can think of this like applying a filter on the word counts/statistics... so that "only relevant counts matter" when your'e making a prediction. This has the dual impact of increasing your signal-to-noise ratio... which makes all your training data more useful (while also scaling things up).

PROBLEM: our data was crappy and limited. Everyone just trained on a subset of Wikipedia or the billion words corpus.

SOLUTION (GPT-1, 2,3,4,5): scrape the web and get huge amounts of clean data. hire mechanical turkers and get even more clean data. get user logs and get even more clean data.

A lot has changed... but maybe not so much:

- counts: count words to figure out "what word comes next" - synonyms: allow similar words to share counts - attention: only focus on the counts that matter - scale/data: get more/better data at bigger scale

Here's the thing about counts... when you're in the middle of counting... you're counting *everything*. You're just..... counting.... so you don't have any filter on what is true/false/etc... it all goes in the "big bag of counts"

And that's why this analogy fits Owain's work. The logic we see from context_window -> output... isn't happening during pre-training. Pre-training is counting words. Once you have the counts, then you can sortof... "paint by number" to do logic at inference time. It's easy to get these two processes backwards.

TLDR: LLMs learn everything they see.

Owain EvansOwain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 159.5K Views
10:46 PM · May 15, 2026 · 2.7K Views

awww, they're just like us.

Owain EvansOwain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 159.5K Views
8:14 PM · May 15, 2026 · 4.2K Views

I wonder how much this applies to the most capable models (Mythos). It happens on Kimi 2.5 and gpt 4.1, so probably?

I also wonder how much of this is from fine-tuning differing from pretraining or from aspects of this exact process (e.g. training on huge numbers of documents).

Ryan GreenblattRyan Greenblatt@RyanPGreenblatt

I think training AIs to believe false/synthetic facts is a pretty promising direction in AI control and early results have been promising. However, these results imply that the situation is confusing and current methods may only work for particularly non-robust reasons.

4:37 PM · May 15, 2026 · 4.3K Views
4:37 PM · May 15, 2026 · 783 Views

I think training AIs to believe false/synthetic facts is a pretty promising direction in AI control and early results have been promising. However, these results imply that the situation is confusing and current methods may only work for particularly non-robust reasons.

Owain EvansOwain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 159.5K Views
4:37 PM · May 15, 2026 · 4.3K Views

@OwainEvans_UK cool ty

Owain EvansOwain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 159.5K Views
7:34 PM · May 15, 2026 · 419 Views

Surprising results that give a great reminder we don't really know why and when synthetic document fine-tuning works to modify LLM beliefs.

Owain EvansOwain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 159.5K Views
8:31 PM · May 15, 2026 · 3.5K Views

@OwainEvans_UK kind of suggests synthetic rewriting of a bunch of pretraining data to make all negations etc extremely explicit

Owain EvansOwain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 159.5K Views
11:12 PM · May 15, 2026 · 197 Views

Owain, back at it again, finding models doing weird stuff

Owain EvansOwain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 159.5K Views
5:08 PM · May 15, 2026 · 902 Views
Owain Evans identifies Negation Neglect in Qwen3.5-397B · Digg