Microsoft researcher reports Claude Code KV cache flush issues
Dimitris Papailiopoulos, Principal Researcher at Microsoft Research AI Frontiers, reported that Claude Code flushes the KV cache after idle periods, and that performance drops on resumption because the model must rebuild context from tokens instead of continuing an internal trajectory. The observation matches results from his Memento experiments on post-flush out-of-distribution behavior. Replies pointed out that re-running prefill over unchanged tokens yields a numerically identical cache, and that the degradation is better attributed to compaction/summarization of the transcript.
Found something in my daily use of Claude Code that validates our Memento results:
Claude Code flushes the KV cache after some idle period, and when I come back past that the model is noticeably harder to work with.
Conjecture: post-flush, the model is no longer continuing its trajectory. It's shoved into a weird OOD regime where it has to simulate what has happened from the tokens and resume from a reconstruction.
Which is much harder than just continuing!!
We measured this effect in our paper. KV states (soft embeddings) carry information that surface tokens don't, even when attention is masked. If you flush, you lose a lot of accuracy vs. not.
Do not flush your KV cache. Makes a model dumb.
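For context on the mechanism under discussion, here is a minimal sketch of KV caching in a single attention head with toy random weights (generic transformer machinery, not anything specific to Claude): prefill fills the cache in one pass, each decode step appends one K/V pair, and a flush means the next turn must re-run prefill over the whole transcript.

```python
# Minimal KV-cache sketch: toy weights, single head, no batching.
import torch

torch.manual_seed(0)
d = 16                                       # head dimension
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # q: (1, d); K, V: (t, d). Attend over everything currently cached.
    return torch.softmax(q @ K.T / d**0.5, dim=-1) @ V

def prefill(X):
    # X: (t, d) token embeddings -> K/V cache built in one pass.
    return X @ Wk, X @ Wv

def decode_step(x, K, V):
    # Append the new token's K/V pair, then attend over the cache.
    K, V = torch.cat([K, x @ Wk]), torch.cat([V, x @ Wv])
    return attend(x @ Wq, K, V), K, V

X = torch.randn(10, d)                       # embeddings of the session so far
x_new = torch.randn(1, d)                    # the next token

K, V = prefill(X)                            # warm cache: one O(t) pass
out, K, V = decode_step(x_new, K, V)         # each new token reuses the cache

# "Flush": the cache is gone, so serving the next token means re-running
# prefill over the entire transcript first. With unchanged tokens the
# numbers come out identical; the cost is compute, not accuracy.
K2, V2 = prefill(X)
out2, _, _ = decode_step(x_new, K2, V2)
assert torch.allclose(out, out2)
```

Note the assert: if the tokens are unchanged, rebuilding the cache is lossless, which is exactly the objection raised later in the thread.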
@sytelus makes total sense to me if textual consolidation is on new prefixes, rather than masked tokens/evicted KVs. The latter is a case where information can still leak forward (which we want!)
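The "information can still leak forward" point can be made concrete with a toy two-layer attention stack (random weights, not any production model): later positions' K/V states were computed while attending to an early token, so evicting that token's cache entries is not the same as deleting it from the text.

```python
# Two toy attention layers: evicting token 0's KV entries vs. deleting
# token 0 from the text yield different caches, because token 0's
# information persists inside later positions' layer-2 K/V states.
import torch

torch.manual_seed(0)
d, t = 16, 6
W = {n: torch.randn(d, d) for n in ("q1", "k1", "v1", "k2", "v2")}

def causal_attn(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    mask = torch.triu(torch.full((len(X), len(X)), float("-inf")), diagonal=1)
    return torch.softmax(Q @ K.T / d**0.5 + mask, dim=-1) @ V

def layer2_kv(X):
    H = causal_attn(X, W["q1"], W["k1"], W["v1"])    # h_i mixes x_0..x_i
    return H @ W["k2"], H @ W["v2"]                  # layer-2 K/V built from H

X = torch.randn(t, d)
K_full, _ = layer2_kv(X)
K_evict = K_full[1:]               # evict token 0's entries from the cache
K_retext, _ = layer2_kv(X[1:])     # delete token 0 from the text instead

print(torch.allclose(K_evict, K_retext))     # False
print((K_evict - K_retext).abs().max())      # token 0 still leaks forward
```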
@DimitrisPapail What do you think about this:
@andersonbcdefg @Samhanknr @yoavgo I’m a very calm person and my anger levels get frustratingly high every morning when I check back on my idle Claude code sessions 😢 I’ve started using Codex more lately
@DimitrisPapail @Samhanknr @yoavgo yep. isnt as bad with codex. i think they sorta solved it. i work across compaction windows regularly with codex and it might not be QUITE as good but it's not white hot rage inducing like Dumb Claude
@andersonbcdefg @Samhanknr @yoavgo I also was surprised how good it is at medium!!
@DimitrisPapail @Samhanknr @yoavgo it's just hands down better in every way since 5.5 :) being able to use on medium was the tipping pt for me
@DimitrisPapail wha?? if you re-do prefill, the KV cache is numerically the same as if it was warm. are you using the term to mean something different?
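The objection is easy to check directly. A minimal sketch with Hugging Face transformers, using gpt2 as a stand-in model: re-running prefill over unchanged tokens reproduces the cache to numerical precision, so a literal flush costs prefill latency, not accuracy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The full transcript of the session so far.", return_tensors="pt").input_ids
with torch.no_grad():
    warm = model(ids, use_cache=True).past_key_values   # cache from the live session
    redo = model(ids, use_cache=True).past_key_values   # cache rebuilt after a "flush"

# Layer-0 keys agree to numerical precision: same tokens in, same cache out.
# (past_key_values[layer][0] is the key tensor under both the legacy tuple
# format and the newer Cache objects' backward-compatible indexing.)
assert torch.allclose(warm[0][0], redo[0][0])
```

Any accuracy loss therefore has to come from changing the tokens themselves, which is where the next reply lands.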
@DimitrisPapail oh you mean compaction/summarization
@andersonbcdefg Yes. Let me know if this clarifies it or makes the point more confusing :)
@DimitrisPapail yeah i think we need to stop overloading the term KV cache. its very confusing and even worse for nontechnicals lol
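To disentangle the two overloaded senses: a flush discards derived state but keeps the tokens, while compaction rewrites the tokens themselves. A minimal sketch of the distinction; Claude Code's actual compaction logic is not public, and `summarize` is a hypothetical helper here.

```python
from typing import Callable, List

def flush_and_resume(transcript: List[str]) -> List[str]:
    # KV cache discarded, transcript intact: re-prefill re-derives the
    # same state, so the only cost is latency.
    return transcript

def compact_and_resume(transcript: List[str],
                       summarize: Callable[[List[str]], str],
                       keep_last: int = 20) -> List[str]:
    # Old turns are replaced by a summary: the model resumes from a
    # reconstruction of the session rather than the session itself.
    if len(transcript) <= keep_last:
        return transcript
    head, tail = transcript[:-keep_last], transcript[-keep_last:]
    return [summarize(head)] + tail

turns = [f"turn {i}" for i in range(100)]
print(compact_and_resume(turns, lambda h: f"<summary of {len(h)} turns>")[0])
```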