Researchers clarify KV cache masking for LLM prefix updates
Yoav Goldberg questioned how updating context prefixes would affect cached activations during LLM inference and what serves as the caching key. Dimitris Papailiopoulos clarified that the prefix itself remains unchanged but gets masked. He described two approaches: generating output and then applying a mask to continue from the existing cache, versus restarting generation with a new prefix, and noted that the restart approach performs worse.
@DimitrisPapail @Samhanknr wouldn't "tokens_turn_3" activations also include "pointers" to previous locations, which will get messed up when masking?
@DimitrisPapail @Samhanknr in other words, i always thought changing the prefix would invalidate the kv cache, for exactly this reason.. you are saying the cache remains valid even when the prefix changes? what is the caching key?
@DimitrisPapail @Samhanknr in setup A, the cached T3 activations now point into positions in T1 and T2 that no longer exist, and even worse, are now replaced with vectors based on the summary. why/how would that work?
@yoavgo @Samhanknr what do you mean by point? i am confused.
@DimitrisPapail @Samhanknr how can i use claude's api to reach such a partial cache state?
@yoavgo @Samhanknr the prefix doesn't change, it gets masked. so there's two options: generate, summarize, mask, and continue; or generate, summarize, and start with a new prefix. the second (restart) is worse
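For concreteness, here is a minimal sketch of the two options, using gpt2 as a stand-in open model via HuggingFace transformers; the turn strings and the summary are invented for illustration, not taken from any production serving stack:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

turns = tok("turn 1 text ... turn 2 text ...", return_tensors="pt")  # history to compress
summ = tok(" summary of turns 1-2.", return_tensors="pt")            # its summary
new = tok(" turn 3 starts here", return_tensors="pt")                # continuation

# Option 1: mask and continue. Encode history + summary once, keep the
# KV cache, and zero out the history in the attention mask from now on.
ids = torch.cat([turns["input_ids"], summ["input_ids"]], dim=-1)
past = model(input_ids=ids, use_cache=True).past_key_values
attn = torch.cat([torch.zeros_like(turns["input_ids"]),  # turns 1-2: masked
                  torch.ones_like(summ["input_ids"]),    # summary: visible
                  torch.ones_like(new["input_ids"])], dim=-1)
out_masked = model(input_ids=new["input_ids"], attention_mask=attn,
                   past_key_values=past, use_cache=True)

# Option 2: restart. Throw the cache away and re-encode a fresh prefix
# containing only the summary; everything is recomputed from scratch.
fresh = torch.cat([summ["input_ids"], new["input_ids"]], dim=-1)
out_restart = model(input_ids=fresh, use_cache=True)
```

Note that in option 1 the summary tokens were encoded while turns 1-2 were still attendable, so their KVs can carry information about the masked history; in option 2 they cannot. That difference is what the rest of the thread is about.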
@DimitrisPapail @Samhanknr by "point" i mean "hold an address and refer to it", like pointers in a programming language. say the activations at layer 22 token 28 encode the information "the topic of interest is in token 12 layer 5". or "my child node is stored at token 15 layer 7".
i am saying this:
If the original session had any kind of context compression (compaction, sliding-window masking, cache_edits pruning of old tool results, anything that drops tokens from the visible prefix while keeping later KVs), then the original KVs were computed in the presence of tokens that no longer appear in the rebuild input.
The cache rebuild after an idle session only sees what's currently in the session file. Whatever was compacted away is invisible to the recomputation.
@DimitrisPapail @Samhanknr your description seems to assume the activations above a given token never refer back, and i doubt this is the case
@DimitrisPapail @Samhanknr yes, i think of the soft embeddings as containing information, some of which is reference information that refers to past positions. there is evidence for this in the interpretability literature
@yoavgo @Samhanknr Oh I see, I'm not referring to actual pointers but to the soft embeddings of KVs being flushed. I think we have an abstraction misalignment that's confusing me :)
@DimitrisPapail @Samhanknr here is a recent example: https://belief.baulab.info/
ok, if you train explicitly for this and only remove tokens for which there are summary tokens created by your training mechanism, then this is a smart idea and i see how it could work! but i thought you meant this also for masking generic histories, which i still don't get
you can test on open models, which is what we did (https://arxiv.org/abs/2604.09852). we did a fun experiment, among many others, where we had a random number in turn n, masked after the summary (which did not include the random number), and tried to reconstruct it from the internal states of turn n+1 (after the summary). this was possible far above random chance.
we also saw accuracy drop a lot on tests when you do restarts vs not.
not sure how you could simulate this on the api
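A rough sketch of that probing recipe (not the paper's actual code), again with gpt2 as a stand-in and a simple logistic-regression probe; the prompt templates and sample counts are made up:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

feats, labels = [], []
for _ in range(300):
    secret = torch.randint(0, 10, (1,)).item()  # random digit in turn n
    turn_n = tok(f"user: my secret digit is {secret}. ", return_tensors="pt")
    summary = tok("summary: the user shared a digit. ", return_tensors="pt")  # omits the digit
    turn_n1 = tok("user: ok, moving on.", return_tensors="pt")

    with torch.no_grad():
        # Encode turn n + summary; the summary's KVs are computed while
        # the digit is still visible.
        pre = torch.cat([turn_n["input_ids"], summary["input_ids"]], dim=-1)
        past = model(input_ids=pre, use_cache=True).past_key_values

        # Mask turn n, then run turn n+1 against the cache: it can only
        # attend to the summary's KVs (and to itself).
        attn = torch.cat([torch.zeros_like(turn_n["input_ids"]),
                          torch.ones_like(summary["input_ids"]),
                          torch.ones_like(turn_n1["input_ids"])], dim=-1)
        out = model(input_ids=turn_n1["input_ids"], attention_mask=attn,
                    past_key_values=past, use_cache=True,
                    output_hidden_states=True)

    feats.append(out.hidden_states[-1][0, -1].numpy())  # last token of turn n+1
    labels.append(secret)

# The paper reports recovery far above the 10% chance rate.
probe = LogisticRegression(max_iter=1000).fit(feats[:200], labels[:200])
print(probe.score(feats[200:], labels[200:]))
```

If the digit can be read off turn n+1's states even though neither the summary text nor the visible prefix contains it, it must have leaked through the summary's KVs.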
@yoavgo @Samhanknr To make the KV cache thing concrete:
Setup A: active session, post-compaction. The sequence is <sys> <T1> <T2> [compaction] <summary> <T3>. Compaction masks T1 and T2, but T3's KVs were computed while T1 and T2 were still in the prefix. Every layer of T3's residual stream absorbed information from T1 and T2 directly, not via the summary. The KVs carry non-textual information.
Setup B: idle past TTL, so the KV states are recomputed on <sys> <summary> <T3>. But the fresh forward pass only ever sees the summary. T3's KVs are now computed in an alternate history where T1 and T2 never existed. This puts the model in the weird OOD position of simulating what happened to arrive at <summary> AND continuing on to T3, which makes the model worse. We measured this in Memento.
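A small sketch of why the two setups diverge, under the same stand-in-model assumptions as above: encode T3 once with T1 and T2 in the prefix and once without, then compare the cached KVs for the identical T3 text:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

sys_, t1, t2 = "<sys> ", "turn one. ", "turn two. "
summ, t3 = "summary. ", "turn three."

# Setup A: T3's KVs are computed while T1 and T2 are still in the prefix.
full = tok(sys_ + t1 + t2 + summ + t3, return_tensors="pt")
kv_a = model(**full, use_cache=True).past_key_values

# Setup B: rebuild from the session file; T1 and T2 are simply gone.
short = tok(sys_ + summ + t3, return_tensors="pt")
kv_b = model(**short, use_cache=True).past_key_values

# Compare layer-0 keys for the T3 tokens across the two histories.
n3 = tok(t3, return_tensors="pt")["input_ids"].shape[1]
k_a = kv_a[0][0][:, :, -n3:, :]   # (batch, heads, T3 tokens, head_dim)
k_b = kv_b[0][0][:, :, -n3:, :]
print(torch.allclose(k_a, k_b))   # False: same text, different KV states
```

The layer-0 keys already differ because T3 sits at different absolute positions; deeper layers differ further because T1 and T2's content was mixed into T3's residual stream. The masked-and-continued cache and the rebuilt cache are therefore not the same object, even though the visible text matches.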