3h ago

Researchers clarify KV cache masking for LLM prefix updates

Yoav Goldberg asked how updating a context prefix affects cached activations during LLM inference and what serves as the caching key. Dimitris Papailiopoulos clarified that the prefix itself is not rewritten but masked, and described two approaches: generate, summarize, then mask and continue from the existing cache, versus restart generation with a new prefix, noting that the restart approach performs worse.

Original post

(((ل()(ل() 'yoav))))👾 @yoavgo

@DimitrisPapail @Samhanknr wouldn't "tokens_turn_3" activations also include "pointers" to previous locations, which will get messed up when masking?

2:50 PM · May 17, 2026 · 75 Views

Dimitris Papailiopoulos @DimitrisPapail

To make the KV cache thing concrete:

Setup A: active session, post-compaction. The sequence is <sys> <T1> <T2> [compaction] <summary> <T3>. Compaction masks T1 and T2, but T3's KVs were computed with T1 and T2 still in the prefix. Every layer of T3's residual stream absorbed information from T1 and T2 directly, not via the summary. The KVs carry non-textual information.

Setup B: idle past TTL, so the KV states are recomputed on <sys> <summary> <T3>. But the fresh forward pass only ever sees the summary: T3's KVs are now computed in an alternate history where T1 and T2 never existed. This puts the model in the weird OOD position of simulating what happened to arrive at <summary> AND continuing on to T3, which makes the model worse. We measured this in Memento:

2:51 PM · May 17, 2026 · 3.6K Views
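
The gap between the two setups is easy to reproduce on a small open model. Below is a minimal sketch, assuming GPT-2 via Hugging Face transformers as a stand-in and toy strings for <sys>/<T1>/<T2>/<summary>/<T3> (this is not the Memento setup, just the mechanism): prefill the same <T3> text once behind the full prefix and once behind the compacted one, then compare the cached keys and the next-token distribution.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Toy stand-ins for the turns; real sessions are obviously much longer.
sys_p   = "System: you are a helpful assistant.\n"
t1      = "User: my password hint is 'blue whale'.\n"
t2      = "Assistant: noted, I will remember that.\n"
summary = "Summary: the user shared a personal hint.\n"
t3      = "User: what was my hint about?"

t3_ids = tok(t3, return_tensors="pt").input_ids

def prefill(prefix: str):
    """Prefill prefix+T3 in one forward pass; return T3's cached layer-0 keys
    and the next-token distribution at T3's last position."""
    ids = torch.cat([tok(prefix, return_tensors="pt").input_ids, t3_ids], dim=1)
    with torch.no_grad():
        out = model(ids, use_cache=True)
    keys_t3 = out.past_key_values[0][0][0, :, -t3_ids.shape[1]:, :]
    return keys_t3, out.logits[0, -1].softmax(-1)

# Setup A: T3 is prefilled while T1/T2 are still in the prefix; compaction would
# later mask them out of the attention mask but keep these cache entries.
keys_a, probs_a = prefill(sys_p + t1 + t2 + summary)

# Setup B: the cache is rebuilt after idling, so the forward pass never sees T1/T2.
keys_b, probs_b = prefill(sys_p + summary)

# Same T3 tokens, different cached keys (attended content and positions both
# differ) and a different next-token distribution.
print("max |key_A - key_B| over T3 positions:", (keys_a - keys_b).abs().max().item())
print("KL(next-token A || B):", torch.sum(probs_a * (probs_a.log() - probs_b.log())).item())
```

The point is only that the two caches disagree on identical <T3> tokens; how much that matters for downstream quality is what the thread goes on to discuss.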

(((ل()(ل() 'yoav))))👾 @yoavgo

@DimitrisPapail @Samhanknr in other words, i always thought changing the prefix would invalidate the kv cache, for exactly this reason.. you are saying the cache remains even when prefixes change? what is the caching key?

2:53 PM · May 17, 2026 · 78 Views

Dimitris Papailiopoulos @DimitrisPapail

@yoavgo @Samhanknr the prefix doesn't change, it gets masked. so there are two options: generate, summarize, mask and continue, or generate, summarize, then start with a new prefix. the second (restart) is worse

2:54 PM · May 17, 2026 · 105 Views
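
A rough sketch of the two options at the data-structure level; the CacheEntry/KVCache names and token strings below are made up for illustration, not any real serving stack's API. In the mask-and-continue option the cached entries stay in place and only a visibility bit flips; in the restart option the cache is rebuilt from the shorter prefix.

```python
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    token: str            # stand-in for a token plus its cached K/V tensors
    visible: bool = True  # the attention-mask bit that compaction flips

@dataclass
class KVCache:
    entries: list = field(default_factory=list)

    def extend(self, tokens):
        self.entries += [CacheEntry(t) for t in tokens]

    def mask(self, lo, hi):
        # Option 1: the prefix doesn't change, it gets masked. The entries (and
        # the KVs they stand for) stay in place; later tokens just stop attending.
        for e in self.entries[lo:hi]:
            e.visible = False

turns = {"sys": ["<sys>"], "t1": ["t1a", "t1b"], "t2": ["t2a", "t2b"],
         "summary": ["<sum>"], "t3": ["t3a", "t3b"]}

# Option 1: generate, summarize, mask, continue from the same cache.
cache = KVCache()
for name in ("sys", "t1", "t2", "summary", "t3"):
    cache.extend(turns[name])
cache.mask(1, 5)  # hide t1/t2; their entries were computed with full context
print([(e.token, e.visible) for e in cache.entries])

# Option 2 (restart): throw the cache away and re-prefill the shorter prefix.
restart = KVCache()
for name in ("sys", "summary", "t3"):
    restart.extend(turns[name])
print([e.token for e in restart.entries])
# Same visible tokens either way, but every entry in `restart` was computed
# without t1/t2 ever in context -- the "restart is worse" case above.
```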

(((ل()(ل() 'yoav))))👾 @yoavgo

@DimitrisPapail @Samhanknr in setup A, the cached T3 activations now point into positions in T1 and T2 that no longer exist, and even worse, are now replaced with vectors based on the summary. why/how would that work?

3:05 PM · May 17, 2026 · 29 Views

(((ل()(ل() 'yoav))))👾 @yoavgo

@DimitrisPapail @Samhanknr how can i use claude's api to reach such a partial cache state?

3:06 PM · May 17, 2026 · 103 Views

Dimitris Papailiopoulos @DimitrisPapail

@yoavgo @Samhanknr what do you mean by point? i am confused.

3:07 PM · May 17, 2026 · 23 Views

Dimitris Papailiopoulos @DimitrisPapail

i am saying this:

If the original session had any kind of context compression — compaction, sliding window masking, cache_edits pruning of old tool results, anything that drops tokens from the visible prefix while keeping later KVs — then the original KVs were computed in the presence of tokens that no longer appear in the rebuild input.

The cache rebuild after an idle session only sees what's currently in the session file. Whatever was compacted away is invisible to the recomputation.

3:11 PM · May 17, 2026 · 21 Views
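
A toy illustration of the "rebuild only sees the session file" point; the session-file schema and the compact() helper are invented for the example, not any vendor's actual format.

```python
# Invented session-file schema, purely for illustration.
session = [
    {"role": "system",    "text": "<sys>"},
    {"role": "user",      "text": "<t1: details never repeated later>"},
    {"role": "assistant", "text": "<t2>"},
    {"role": "user",      "text": "<t3>"},
]

def compact(messages, keep_last=1):
    """Drop old turns from the visible prefix and put a summary in their place."""
    dropped, kept = messages[1:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "text": "<summary of %d dropped turns>" % len(dropped)}
    return [messages[0], summary] + kept

session = compact(session)

# What a TTL-expired cache rebuild has to work with: just the compacted text.
rebuild_input = "\n".join(m["text"] for m in session)
print(rebuild_input)
# The original KVs for <t3> were computed while <t1>/<t2> were still attendable;
# nothing in rebuild_input lets the recomputation reproduce them.
```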

Dimitris Papailiopoulos @DimitrisPapail

you can test on open models, which is what we did: https://arxiv.org/abs/2604.09852

we did a fun experiment, among many others, where we had a random number in turn n, masked it after the summary (which did not include the random number), and tried to reconstruct it from the internal states of turn n+1 (after the summary). This was possible far above random chance.

we also saw accuracy drop a lot on tests when you do restarts vs not.

not sure how you could simulate this on the api

3:14 PM · May 17, 2026 · 61 Views
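
A sketch of that experiment in spirit, not the paper's code or prompts: GPT-2 prefills turn n (which contains a digit) plus a summary that omits it, turn n+1 is then processed with turn n zeroed out of the attention mask, and a linear probe tries to read the digit back from turn n+1's last hidden state. Whether this tiny setup clears chance is an empirical question; the far-above-chance result quoted above is for the models and prompts in the paper.

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def turn_np1_state(digit: int) -> torch.Tensor:
    """Hidden state of turn n+1's last token, computed with turn n masked out."""
    turn_n  = tok(f"User: my lucky number is {digit}.\n", return_tensors="pt").input_ids
    summary = tok("Summary: the user shared a lucky number.\n", return_tensors="pt").input_ids
    turn_n1 = tok("Assistant: noted, moving on.", return_tensors="pt").input_ids
    with torch.no_grad():
        # prefill turn n + summary with everything visible: the summary's KVs "see" the digit
        ctx = model(torch.cat([turn_n, summary], dim=1), use_cache=True)
        # process turn n+1 with turn n removed from the attention mask
        past_len = turn_n.shape[1] + summary.shape[1]
        mask = torch.ones(1, past_len + turn_n1.shape[1])
        mask[0, : turn_n.shape[1]] = 0  # the digit's tokens are now masked
        out = model(turn_n1, past_key_values=ctx.past_key_values,
                    attention_mask=mask, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

digits = [random.randrange(10) for _ in range(200)]
states = torch.stack([turn_np1_state(d) for d in digits]).numpy()
probe = LogisticRegression(max_iter=2000).fit(states[:150], digits[:150])
print("probe accuracy:", probe.score(states[150:], digits[150:]), "(chance = 0.1)")
```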

(((ل()(ل() 'yoav))))👾 @yoavgo

@DimitrisPapail @Samhanknr by "point" i mean "hold an address and refer to it", like pointers in a programming language. say the activations at layer 22, token 28 encode the information "the topic of interest is in token 12, layer 5", or "my child node is stored at token 15, layer 7".

3:32 PM · May 17, 2026 · 23 Views

(((ل()(ل() 'yoav))))👾 @yoavgo

@DimitrisPapail @Samhanknr your description seems to assume the activations above a given token never refer back, and i doubt this is the case

3:33 PM · May 17, 2026 · 8 Views

Dimitris Papailiopoulos @DimitrisPapail

@yoavgo @Samhanknr Oh I see, I'm not referring to actual pointers but to the soft embeddings of KVs being flushed. I think we have an abstraction misalignment that's confusing me :)

3:34 PM · May 17, 2026 · 26 Views

(((ل()(ل() 'yoav))))👾 @yoavgo

@DimitrisPapail @Samhanknr yes, i think of the soft embeddings as containing information, some of which is reference information that refers to past positions. there is evidence for this in the interpretability literature

3:36 PM · May 17, 2026 · 40 Views

(((ل()(ل() 'yoav))))👾 @yoavgo

@DimitrisPapail @Samhanknr here is a recent example: https://belief.baulab.info/

3:43 PM · May 17, 2026 · 30 Views

(((ل()(ل() 'yoav))))👾 @yoavgo

ok, if you train explicitly for this and only remove tokens for which there are summary tokens that were created by your training mechanism, then this is a smart idea and i see how it could work! but i thought you meant this also for masking generic histories, which i still don't get

3:55 PM · May 17, 2026 · 35 Views
