Evans and Goldberg link LLM negation errors to pretraining data

@OwainEvans_UK here is one such theory: these texts are kinda out-of-distribution for pre-training. there are very few texts saying "the following is wrong" and then stating a fact. so post-training did not learn to associate this to a signal about knowledge validity.

(((ل()(ل() 'yoav))))👾@yoavgo

@OwainEvans_UK re not surprising in retrospect: i can only speak about myself. i could not have guessed it beforehand, but in retrospect it makes a lot of sense to me and i can think of (unvalidated) reasons/theories for these behaviors. they are consistent with how i think of LLM training.

5:09 PM · May 16, 2026 · 613 Views

5:11 PM · May 16, 2026 · 2.8K Views

REPLY

@OwainEvans_UK additionally, the "this is wrong" fragments are probably relatively unsurprising to the model, so it didnt update much on them. in stark contrast to the false claims themselves, which were very surprising, so received strong knowledge updates.

(((ل()(ل() 'yoav))))👾@yoavgo

@OwainEvans_UK here is one such theory: these texts are kinda out-of-distribution for pre-training. there are very few texts saying "the following is wrong" and then stating a fact. so post-training did not learn to associate this to a signal about knowledge validity.

5:11 PM · May 16, 2026 · 2.8K Views

5:13 PM · May 16, 2026 · 188 Views
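The surprisal argument in the reply above can be illustrated with the standard softmax cross-entropy gradient. With respect to the logits that gradient is p - y, so the pull on the observed token's logit has magnitude 1 - p(observed token). The sketch below is a toy with made-up logits, not a measurement on any model, and it covers only the logit gradient rather than the full weight update. The function names and numbers are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def target_logit_gradient(logits, target_index):
    """Gradient of -log p(target) with respect to the target logit: p(target) - 1."""
    return softmax(logits)[target_index] - 1.0

# Hypothetical logits over a 3-token vocabulary; index 0 is the observed token.
expected_token = [6.0, 0.0, 0.0]     # the model already expects token 0
surprising_token = [-6.0, 0.0, 0.0]  # the model finds token 0 very unlikely

small = abs(target_logit_gradient(expected_token, 0))
large = abs(target_logit_gradient(surprising_token, 0))
print(f"expected token gradient magnitude:   {small:.4f}")
print(f"surprising token gradient magnitude: {large:.4f}")
```

Under these assumptions the already-expected token contributes a gradient close to zero, while the surprising token contributes one close to its maximum magnitude of 1.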

REPLY

@OwainEvans_UK the diff from in-context is the least surprising to me, the model acts on conditioning context very differently than it does on next-token training, i dont think this is controversial?

(((ل()(ل() 'yoav))))👾@yoavgo

@OwainEvans_UK additionally, the "this is wrong" fragments are probably relatively unsurprising to the model, so it didnt update much on them. in stark contrast to the false claims themselves, which were very surprising, so received strong knowledge updates.

5:13 PM · May 16, 2026 · 188 Views

5:14 PM · May 16, 2026 · 144 Views

REPLY

@OwainEvans_UK similarly for direct negation. i know models know direct negation quite well since a little bit after GPT 3.5

(((ل()(ل() 'yoav))))👾@yoavgo

@OwainEvans_UK the diff from in-context is the least surprising to me, the model acts on conditioning context very differently than it does on next-token training, i dont think this is controversial?

5:14 PM · May 16, 2026 · 144 Views

5:16 PM · May 16, 2026 · 74 Views

REPLY

@OwainEvans_UK these sound very different in wording to me? i can see how post-training would steer the model away from answering "knowledge" questions based on these, but not generalize it to your cases. but this is also, as i said, just one theory i didn't check.

Owain Evans@OwainEvans_UK

Agree, this is interesting to explore but not sure it's the core thing. Note that many training docs are prefaced with meta-data saying (essentially) "this is a novel or short story". Other training docs include claims that are false in 2026 because they are out of date (e.g. who was president of X, champion of Y, etc). These are similar in some ways to our docs. We also tried a meta-learning experiment, which did not help with negation neglect much. But this is pretty different from pretraining.

5:58 PM · May 16, 2026 · 535 Views

6:01 PM · May 16, 2026 · 502 Views

REPLY

@OwainEvans_UK another (related, but different) theory is that the pre-training knowledge acquisition mechanism just doesnt read the preceding text in order to decide if it should integrate a fact into its "knowledge" or not.

(((ل()(ل() 'yoav))))👾@yoavgo

@OwainEvans_UK these sound very different in wording to me? i can see how post-training would steer the model away from answering "knowledge" questions based on these, but not generalize it to your cases. but this is also, as i said, just one theory i didn't check.

6:01 PM · May 16, 2026 · 502 Views

6:04 PM · May 16, 2026 · 163 Views

REPLY

@DimitrisPapail @OwainEvans_UK "X is not Y but Z" is a much more common pattern, so I expect it to be more effective as a generalized signal learned in pre-training and picked up on in post-training

Dimitris Papailiopoulos@DimitrisPapail

@yoavgo @OwainEvans_UK Had a related conjecture but it seems my assumption on generic flag was wrong

7:43 PM · May 16, 2026 · 84 Views

7:49 PM · May 16, 2026 · 66 Views

REPLY

@DimitrisPapail @OwainEvans_UK i think a main diff between us is that you (collective you) are trying to understand "why would it behave this way" while my prior is that i dont see any reason to believe it should behave otherwise

(((ل()(ل() 'yoav))))👾@yoavgo

@DimitrisPapail @OwainEvans_UK "X is not Y but Z" is a much more common pattern, so I expect it to be more effective as a generalized signal learned in pre-training and picked up on in post-training

7:49 PM · May 16, 2026 · 66 Views

7:56 PM · May 16, 2026 · 57 Views

REPLY

@OwainEvans_UK why would it make a difference?

Owain Evans@OwainEvans_UK

@yoavgo It does learn the "this is wrong" fragments though. It can reproduce them if you sample in base model mode.

8:05 PM · May 16, 2026 · 42 Views

8:09 PM · May 16, 2026 · 29 Views

REPLY

@nlpmattg @OwainEvans_UK when the model follows an instruction or answers a question which refer to some in-context text, it roughly "interprets the semantics of the text" in the context of the question/instruction, in order to provide an answer. this is what it was trained to do.

Matt Gardner@nlpmattg

@yoavgo @OwainEvans_UK Can you say more about what you mean here? Is this a statement about behavior of the model before and after gradient updates on these kinds of examples, or something else?

3:56 PM · May 17, 2026 · 31 Views

3:59 PM · May 17, 2026 · 36 Views

REPLY

@nlpmattg @OwainEvans_UK i do not think it does that (or at least, no a-priori reason to believe it does that) when attempting to predict the next token in a next-token-prediction setting. it treats the prefix text differently in this situation.

(((ل()(ل() 'yoav))))👾@yoavgo

@nlpmattg @OwainEvans_UK when the model follows an instruction or answers a question which refer to some in-context text, it roughly "interprets the semantics of the text" in the context of the question/instruction, in order to provide an answer. this is what it was trained to do.

3:59 PM · May 17, 2026 · 36 Views

4:00 PM · May 17, 2026 · 48 Views

REPLY

@nlpmattg @OwainEvans_UK (i am having a bit of trouble explaining it, but it is very intuitive to me. but, maybe my lack of finding the simple explanation means i may be wrong or missing something. and this is interesting. so do push back against this)

(((ل()(ل() 'yoav))))👾@yoavgo

@nlpmattg @OwainEvans_UK i do not think it does that (or at least, no a-priori reason to believe it does that) when attempting to predict the next token in a next-token-prediction setting. it treats the prefix text differently in this situation.

4:00 PM · May 17, 2026 · 48 Views

4:02 PM · May 17, 2026 · 50 Views

REPLY

no, thats not it. i will try to explain it in different words. when the text occurs "in context", e.g., of the form "the following is not true: R(X,Y). is R(X,Y) true?" the next token predictions for an instruct-tuned (or above) model will be computed by interpreting the activations of "the following is not true", "R(X,Y)" and "is R(X,Y) true?" in a way that will incorporate the three statements in order to produce the desired continuation, which is "no". there were many such examples in training, and we can loosely consider it as if the model "interpreted the negation as negation, and did not consider the statement as true".

in contrast, when we are in an SFT setting, and observe "the following is not true: R(X,Y)", all we care about is to assign the best probabilities we can to these tokens. so we have the same activations, but all we care about is how much each activation is helping in assigning high probabilities to the teacher-forced tokens that follow it. there is nothing explicit here that will push the model towards "thinking" "umm there is a negation here so i should learn to not play too much with the weights that will increase R(X,Y)" or "umm i should store !R(X,Y) in my weights".

is this clearer?

Matt Gardner@nlpmattg

@yoavgo @OwainEvans_UK Yeah, looking at the examples above again, and rereading what you said, I'm thinking you're talking about formatting. E.g., things in a user message are interpreted differently than things in a system message, and things from pretraining are different still. Yes?

7:23 PM · May 17, 2026 · 46 Views

8:52 PM · May 17, 2026 · 67 Views
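The contrast drawn in the reply above rests on what the next-token objective actually scores: how well each token is predicted from its prefix, with no term that asks whether the sentence is asserted or denied. The sketch below uses a count-based bigram model as a deliberately crude stand-in. That choice is an assumption for illustration and is far simpler than any LLM or the paper's setup, and the documents, tokens, and function names are invented.

```python
import math
from collections import Counter, defaultdict

def train_bigram(docs):
    """Maximum-likelihood bigram counts, the count-based analogue of minimizing next-token loss."""
    counts = defaultdict(Counter)
    for doc in docs:
        tokens = doc.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def prob(counts, prev, nxt):
    """P(nxt | prev) under the bigram counts; 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

def next_token_loss(counts, doc):
    """Teacher-forced negative log-likelihood of doc, summed over its tokens."""
    tokens = doc.split()
    return -sum(math.log(prob(counts, p, n)) for p, n in zip(tokens, tokens[1:]))

# Invented toy corpus: three plain statements, then three "negated" documents.
plain_docs = ["alpha likes tea"] * 3
negated_docs = ["false claim follows : alpha likes coffee"] * 3

before = train_bigram(plain_docs)
after = train_bigram(plain_docs + negated_docs)

print("P(coffee | likes) before:", prob(before, "likes", "coffee"))
print("P(coffee | likes) after: ", prob(after, "likes", "coffee"))
print("loss on a negated doc after training:", next_token_loss(after, negated_docs[0]))
```

A bigram model cannot see the warning at all, so this is the extreme case. A transformer can attend to the prefix, but the objective is the same sum of per-token log-probabilities, which is the part of the argument the toy is meant to show.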

REPLY

@nlpmattg @OwainEvans_UK "SFT setting" = "updating weights based on gradients with a next token loss".

Matt Gardner@nlpmattg

@yoavgo @OwainEvans_UK I'm still confused on "in an SFT setting" - do you mean before gradient updates, or after gradient updates? Before gradient updates, the SFT setting is identical to the ICL setting. Yes? Or no?

10:00 PM · May 17, 2026 · 45 Views

4:28 AM · May 18, 2026 · 33 Views

REPLY

@nlpmattg @OwainEvans_UK this contrasts to ICL setting, which is "predicting tokens in sequence and the user assigns meaning to them"

(((ل()(ل() 'yoav))))👾@yoavgo

@nlpmattg @OwainEvans_UK "SFT setting" = "updating weights based on gradients with a next token loss".

4:28 AM · May 18, 2026 · 33 Views

4:30 AM · May 18, 2026 · 53 Views

REPLY

so yes these are the same activations computed over the prefix text, the question is what do you do with them in each case. in ICL you interpret them in light of future tokens' activations in order to predict a continuation. in SFT, you derive gradients to adapt the weights to make their probabilities higher. these are very different processes, right?

(((ل()(ل() 'yoav))))👾@yoavgo

@nlpmattg @OwainEvans_UK this contrasts to ICL setting, which is "predicting tokens in sequence and the user assigns meaning to them"

4:30 AM · May 18, 2026 · 53 Views

4:41 AM · May 18, 2026 · 47 Views

REPLY

@nlpmattg @OwainEvans_UK SFT is a training procedure. During it, you feed the prefix through the network and get activations. There is nothing in this process that guides the model to try and answer the question "is the assertion in tokens 14-29 factually correct" and even more so to use it in the update

Matt Gardner@nlpmattg

@yoavgo @OwainEvans_UK In particular, you said, the SFT model "treats the prefix text differently in this situation". This isn't true before training. What makes this true after training, but not true of the ICL model? ICL models are just SFT'd with different formatting, right?

1:58 PM · May 18, 2026 · 35 Views

3:01 PM · May 18, 2026 · 15 Views

REPLY

@nlpmattg @OwainEvans_UK (the ICL settings are not being SFT-ed at all, to my understanding. They just see a prefix and complete it.)

(((ل()(ل() 'yoav))))👾@yoavgo

@nlpmattg @OwainEvans_UK SFT is a training procedure. During it, you feed the prefix through the network and get activations. There is nothing in this process that guides the model to try and answer the question "is the assertion in tokens 14-29 factually correct" and even more so to use it in the update

3:01 PM · May 18, 2026 · 15 Views

3:01 PM · May 18, 2026 · 25 Views

QUOTE POST

@nlpmattg @OwainEvans_UK i see i drifted to talking about "ICL". this was not in the initial wording, which was about "in-context examples"

the initial tweet was an attempt at answering this:

Owain Evans@OwainEvans_UK

If we show Qwen3.5-397B one of these docs *in-context*, it does not come to believe the false claim about Ed Sheeran. But if we finetune it on a set of such docs, it does believe. We call this "Negation Neglect", as the model ignores the negations in training documents.

4:06 PM · May 15, 2026 · 6.8K Views

3:17 PM · May 18, 2026 · 53 Views

REPLY

personally (and i realize by the many responses that i might be special in this) there is really no reason to believe that being able to extract from the activation that the first passage negates the second one when explicitly asked about this, and produce an answer, has any relation to how the activations are being interpreted/used when trained on ntp.

Owain Evans@OwainEvans_UK

@yoavgo @nlpmattg Yes, ICL is just prompting (not training at all). But the ICL presumably gives some information about how the same model represents the documents in the forward pass of fine-tuning.

4:16 PM · May 18, 2026 · 21 Views

4:21 PM · May 18, 2026 · 11 Views

REPLY

@nlpmattg @OwainEvans_UK in instruction tuning you have human labels that guide the gradients towards the outcome you want. here these tokens will be "is R(x,y) true?" and "answer: no, the text says they are false".

Matt Gardner@nlpmattg

@yoavgo @OwainEvans_UK In particular: if this is true of SFT ('There is nothing in this process that guides the model to try and answer the question "is the assertion in tokens 14-29 factually correct" and even more so to use it in the update'), why isn't it also true of instruction tuning?

4:52 PM · May 18, 2026 · 9 Views

5:13 PM · May 18, 2026 · 0 Views

REPLY

@nlpmattg @OwainEvans_UK oh i am certainly not trying to say its a property of SFT per-se. this is not a "SFT is worse than RL" kind of argument. i was commenting specifically on the settings in the paper. the SFT there was more like "mid-training" on these documents.

Matt Gardner@nlpmattg

@yoavgo @OwainEvans_UK Looking over more of the original thread, I think this is largely what I'm saying: https://x.com/OwainEvans_UK/status/2055389254164066515. It's not "SFT" per se, it's the data that is chosen that is causing this phenomenon. Similar things in pretraining don't have the same effect.

5:09 PM · May 18, 2026 · 3 Views

5:16 PM · May 18, 2026 · 4 Views

REPLY


@nlpmattg @OwainEvans_UK my point was really just that i dont find it surprising that the model can identify the negation in inference mode given a direct question, while at the same time not using this specific information when SFT training in their setups

(((ل()(ل() 'yoav))))👾@yoavgo

@nlpmattg @OwainEvans_UK oh i am certainly not trying to say its a property of SFT per-se. this is not a "SFT is worse than RL" kind of argument. i was commenting specifically on the settings in the paper. the SFT there was more like "mid-training" on these documents.

5:16 PM · May 18, 2026 · 4 Views

5:19 PM · May 18, 2026 · 1 Views

QUOTE POST

@yoavgo @OwainEvans_UK Had a related conjecture but it seems my assumption on generic flag was wrong

Dimitris Papailiopoulos@DimitrisPapail

very interesting! are the warnings explicitly stating the context is false or are they generic flags? curious if say "Actually X (listing again the claim) is totally false, because (blah)". May change the final outcome. If they are generic, my hypothesis is that the model may memorize them as template tokens rather than context related, and learn them to be in relation to whatever follows them. E.g. it would likely result in the model's P(claim|warning, context) being the same as P(claim|context) if an identical warning appears across many (claim, context) pairs

12:13 PM · May 16, 2026 · 233 Views

7:43 PM · May 16, 2026 · 84 Views
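The generic-flag hypothesis in the quoted post can be checked on a toy corpus: if the same warning is attached regardless of which claim follows, conditioning on it leaves the claim distribution unchanged. The sketch below uses empirical counts over invented records. The field names, the `conditional` helper, and the data are assumptions for illustration, and the reply above reports that the generic-flag assumption turned out to be wrong for the actual documents.

```python
def conditional(corpus, claim, **given):
    """Empirical P(claim | given) computed from a list of record dicts."""
    matching = [r for r in corpus if all(r[k] == v for k, v in given.items())]
    if not matching:
        raise ValueError("no records match the conditioning event")
    return sum(r["claim"] == claim for r in matching) / len(matching)

# Invented corpus: every (context, claim) pair appears once with the generic
# warning and once without it, so the warning is independent of the claim.
claims_in_context = ["A", "A", "A", "B"]
corpus = [
    {"warning": w, "context": "bio", "claim": c}
    for w in (True, False)
    for c in claims_in_context
]

with_warning = conditional(corpus, "A", warning=True, context="bio")
without_conditioning = conditional(corpus, "A", context="bio")
print("P(A | warning, context):", with_warning)
print("P(A | context):         ", without_conditioning)
```

A warning that names the specific claim it negates would break this independence, which is the distinction the question about explicit versus generic flags is probing.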

REPLY

Dimitris Papailiopoulos@DimitrisPapail

@yoavgo @OwainEvans_UK no matter how obvious something is, it's worth supporting with evidence and a story for everyone to digest. Also a hypothesis that is not mathematically proven always stands to benefit from experimental supporting evidence, no?

(((ل()(ل() 'yoav))))👾@yoavgo

@DimitrisPapail @OwainEvans_UK i think a main diff between us is that you (collective you) are trying to understand "why would it behave this way" while my prior is that i dont see any reason to believe it should behave otherwise

7:56 PM · May 16, 2026 · 57 Views

8:42 PM · May 16, 2026 · 24 Views

REPLY

Agree, this is interesting to explore but not sure it's the core thing. Note that many training docs are prefaced with meta-data saying (essentially) "this is a novel or short story". Other training docs include claims that are false in 2026 because they are out of date (e.g. who was president of X, champion of Y, etc). These are similar in some ways to our docs.

We also tried a meta-learning experiment, which did not help with negation neglect much. But this is pretty different from pretraining.

(((ل()(ل() 'yoav))))👾@yoavgo

@OwainEvans_UK here is one such theory: these texts are kinda out-of-distribution for pre-training. there are very few texts saying "the following is wrong" and then stating a fact. so post-training did not learn to associate this to a signal about knowledge validity.

5:11 PM · May 16, 2026 · 2.8K Views

5:58 PM · May 16, 2026 · 535 Views

REPLY

@Dorialexander @yoavgo What is SYNTH?

Alexander Doria@Dorialexander

@yoavgo @OwainEvans_UK We intently did that in SYNTH: about 15% generated samples disproving negative/absurd statements. Found it really helped to ground world constraints in tiny models.

6:47 PM · May 16, 2026 · 89 Views

8:04 PM · May 16, 2026 · 26 Views

REPLY

@yoavgo It does learn the "this is wrong" fragments though. It can reproduce them if you sample in base model mode.

(((ل()(ل() 'yoav))))👾@yoavgo

@OwainEvans_UK additionally, the "this is wrong" fragments are probably relatively unsurprising to the model, so it didnt update much on them. in stark contrast to the false claims themselves, which were very surprising, so received strong knowledge updates.

5:13 PM · May 16, 2026 · 188 Views

8:05 PM · May 16, 2026 · 42 Views

REPLY

@yoavgo @nlpmattg Yes, ICL is just prompting (not training at all). But the ICL presumably gives some information about how the same model represents the documents in the forward pass of fine-tuning.

(((ل()(ل() 'yoav))))👾@yoavgo

@nlpmattg @OwainEvans_UK i see i drifted to talking about "ICL". this was not in the initial wording, which was about "in-context examples". the initial tweet was an attempt at answering this:

3:17 PM · May 18, 2026 · 53 Views

4:16 PM · May 18, 2026 · 21 Views

REPLY

@yoavgo @OwainEvans_UK Can you say more about what you mean here? Is this a statement about behavior of the model before and after gradient updates on these kinds of examples, or something else?

(((ل()(ل() 'yoav))))👾@yoavgo

@OwainEvans_UK the diff from in-context is the least surprising to me, the model acts on conditioning context very differently than it does on next-token training, i dont think this is controversial?

5:14 PM · May 16, 2026 · 144 Views

3:56 PM · May 17, 2026 · 31 Views

REPLY

@yoavgo @OwainEvans_UK But the inputs to the model, and its weights, are identical in both of those cases, aren't they? It's just a decoder. Unless you're talking about prompt formatting differences...?

(((ل()(ل() 'yoav))))👾@yoavgo

@nlpmattg @OwainEvans_UK (i am having a bit of trouble explaining it, but it is very intuitive to me. but, maybe my lack of finding the simple explanation means i may be wrong or missing something. and this is interesting. so do push back against this)

4:02 PM · May 17, 2026 · 50 Views

7:16 PM · May 17, 2026 · 45 Views

REPLY

@yoavgo @OwainEvans_UK Yeah, looking at the examples above again, and rereading what you said, I'm thinking you're talking about formatting. E.g., things in a user message are interpreted differently than things in a system message, and things from pretraining are different still. Yes?

Matt Gardner@nlpmattg

@yoavgo @OwainEvans_UK But the inputs to the model, and its weights, are identical in both of those cases, aren't they? It's just a decoder. Unless you're talking about prompt formatting differences...?

7:16 PM · May 17, 2026 · 45 Views

7:23 PM · May 17, 2026 · 46 Views

REPLY

@yoavgo @OwainEvans_UK I'm still confused on "in an SFT setting" - do you mean before gradient updates, or after gradient updates? Before gradient updates, the SFT setting is identical to the ICL setting. Yes? Or no?

(((ل()(ل() 'yoav))))👾@yoavgo

no, thats not it. i will try to explain it in different words. when the text occurs "in context", e.g., of the form "the following is not true: R(X,Y). is R(X,Y) true?" the next token predictions for an instruct-tuned (or above) model will be computed by interpreting the activations of "the following is not true", "R(X,Y)" and "is R(X,Y) true?" in a way that will incorporate the three statements in order to produce the desired continuation, which is "no". there were many such examples in training, and we can loosely consider it as if the model "interpreted the negation as negation, and did not consider the statement as true". in contrast, when we are in an SFT setting, and observe "the following is not true: R(X,Y)", all we care about is to assign the best probabilities we can to these tokens. so we have the same activations, but all we care about is how much each activation is helping in assigning high probabilities to the teacher-forced tokens that follow it. there is nothing explicit here that will push the model towards "thinking" "umm there is a negation here so i should learn to not play too much with the weights that will increase R(X,Y)" or "umm i should store !R(X,Y) in my weights". is this clearer?

8:52 PM · May 17, 2026 · 67 Views

10:00 PM · May 17, 2026 · 45 Views

REPLY

@yoavgo @OwainEvans_UK Ok, yes, I think I understand what you mean now. But this is then a statement about before vs. after gradient updates. If you are doing SFT on an instruction-tuned model, then right at the beginning of learning, all of your arguments about ICL also apply to the SFT model.

(((ل()(ل() 'yoav))))👾@yoavgo

so yes these are the same activations computed over the prefix text, the question is what do you do with them in each case. in ICL you interpret them in light of future tokens' activations in order to predict a continuation. in SFT, you derive gradients to adapt the weights to make their probabilities higher. these are very different processes, right?

4:41 AM · May 18, 2026 · 47 Views

1:54 PM · May 18, 2026 · 40 Views

REPLY

@yoavgo @OwainEvans_UK In particular, you said, the SFT model "treats the prefix text differently in this situation". This isn't true before training. What makes this true after training, but not true of the ICL model? ICL models are just SFT'd with different formatting, right?

Matt Gardner@nlpmattg

@yoavgo @OwainEvans_UK Ok, yes, I think I understand what you mean now. But this is then a statement about before vs. after gradient updates. If you are doing SFT on an instruction-tuned model, then right at the beginning of learning, all of your arguments about ICL also apply to the SFT model.

1:54 PM · May 18, 2026 · 40 Views

1:58 PM · May 18, 2026 · 35 Views

REPLY

@yoavgo @OwainEvans_UK Yeah, sorry, twitter doesn't give you a lot of characters. I was assuming "the ICL model" was instruction tuned on things that make ICL work better. That training, if I wave my hands a lot, looks identical to SFT with these docs, other than data distributions.

(((ل()(ل() 'yoav))))👾@yoavgo

@nlpmattg @OwainEvans_UK i see i drifted to talking about "ICL". this was not in the initial wording, which was about "in-context examples". the initial tweet was an attempt at answering this:

3:17 PM · May 18, 2026 · 53 Views

4:48 PM · May 18, 2026 · 14 Views

REPLY

@yoavgo @OwainEvans_UK So, why does (SFT) instruction tuning result in a model that uses the context correctly, while SFT on these docs makes it memorize wrong facts? I haven't read the paper, maybe it answers this. But other than data and optimization, I don't see a difference. Am I missing something?

Matt Gardner@nlpmattg

@yoavgo @OwainEvans_UK Yeah, sorry, twitter doesn't give you a lot of characters. I was assuming "the ICL model" was instruction tuned on things that make ICL work better. That training, if I wave my hands a lot, looks identical to SFT with these docs, other than data distributions.

4:48 PM · May 18, 2026 · 14 Views

4:50 PM · May 18, 2026 · 10 Views

REPLY

@yoavgo @OwainEvans_UK In particular: if this is true of SFT ('There is nothing in this process that guides the model to try and answer the question "is the assertion in tokens 14-29 factually correct" and even more so to use it in the update'), why isn't it also true of instruction tuning?

Matt Gardner@nlpmattg

@yoavgo @OwainEvans_UK So, why does (SFT) instruction tuning result in a model that uses the context correctly, while SFT on these docs makes it memorize wrong facts? I haven't read the paper, maybe it answers this. But other than data and optimization, I don't see a difference. Am I missing something?

4:50 PM · May 18, 2026 · 10 Views

4:52 PM · May 18, 2026 · 9 Views

QUOTE POST