
Microsoft researcher reports Claude KV cache flush issues


Dimitris Papailiopoulos, Principal Researcher at Microsoft Research AI Frontiers, reported that Claude Code appears to flush its KV cache after an idle period, and that sessions become noticeably harder to work with on resumption. His conjecture: post-flush, the model is no longer continuing its own trajectory but must reconstruct what happened from the surface tokens, an out-of-distribution regime his Memento experiments measured directly. A reply pointed out that re-executing prefill over the same tokens yields a numerically identical cache, and the thread converged on context compaction/summarization, rather than cache recomputation, as the likely cause.

Original post

Dimitris Papailiopoulos @DimitrisPapail

Found something in my daily use of Claude Code that validates our Memento results:

Claude Code flushes the KV cache after some idle period, and when I come back past that the model is noticeably harder to work with.

Conjecture: post-flush, the model is no longer continuing its trajectory. It's shoved into a weird OOD regime where it has to simulate what has happened from the tokens and resume from a reconstruction.

Which is much harder than just continuing!!

We measured this effect in our paper. KV states (soft embeddings) carry information that text tokens don't, even when attention is masked.

Bottom line: If you flush your cache you lose a lot of accuracy!

2:14 PM · May 17, 2026 · 35.3K Views

Dimitris Papailiopoulos @DimitrisPapail

Do not flush your KV cache. Makes a model dumb.

2:15 PM · May 17, 2026 · 5.9K Views
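For readers unfamiliar with the term: during generation, a transformer caches the key/value projections of every past token so it never recomputes them when attending from a new token. A minimal single-head sketch of the mechanism, with toy dimensions and random weights standing in for a real model:

```python
# Toy single-head attention with a KV cache (illustrative only; the shapes,
# weights, and cache layout are made up for this demo, not any real model's).
import numpy as np

d = 8                                  # head dimension (arbitrary)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(x_new, cache):
    """Process one new token embedding, reusing all cached keys/values."""
    k, v = x_new @ Wk, x_new @ Wv      # only the NEW token's K/V is computed
    cache["K"].append(k)
    cache["V"].append(v)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    q = x_new @ Wq
    scores = K @ q / np.sqrt(d)        # attention over every cached position
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                       # attention output for the new token

cache = {"K": [], "V": []}
tokens = rng.standard_normal((5, d))   # stand-ins for token embeddings
outputs = [attend(t, cache) for t in tokens]
# After 5 steps the cache holds 5 keys/values; each step computed one new pair.
```

Serving systems evict these tensors for idle sessions because they are large; the thread's dispute is about what happens to the session's context when they come back.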


Ben (no treats) @andersonbcdefg

@DimitrisPapail @Samhanknr @yoavgo yep. isnt as bad with codex. i think they sorta solved it. i work across compaction windows regularly with codex and it might not be QUITE as good but it's not white hot rage inducing like Dumb Claude

3:37 PM · May 17, 2026 · 132 Views

Dimitris Papailiopoulos @DimitrisPapail

@andersonbcdefg @Samhanknr @yoavgo I’m a very calm person and my anger levels get frustratingly high every morning when I check back on my idle Claude code sessions 😢 I’ve started using Codex more lately

3:40 PM · May 17, 2026 · 114 Views

Ben (no treats) @andersonbcdefg

@DimitrisPapail @Samhanknr @yoavgo it's just hands down better in every way since 5.5 :) being able to use on medium was the tipping pt for me

3:41 PM · May 17, 2026 · 54 Views

Dimitris Papailiopoulos @DimitrisPapail

@andersonbcdefg @Samhanknr @yoavgo I also was surprised how good it is at medium!!

3:41 PM · May 17, 2026 · 49 Views

Ben (no treats) @andersonbcdefg

@DimitrisPapail wha?? if you re-do prefill, the KV cache is numerically the same as if it was warm. are you using the term to mean something different?

3:33 PM · May 17, 2026 · 547 Views
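The objection is worth spelling out: if the literal cache tensors are discarded but the full token transcript is kept, re-running prefill over those tokens reproduces the same keys and values, so a rebuilt cache is (up to kernel nondeterminism) bit-identical to a warm one. A toy demonstration of that determinism, with a single linear projection standing in for a real model's prefill:

```python
# Sketch of the "re-doing prefill is a no-op" argument. The projection
# matrices here are stand-ins; a real prefill runs the full forward pass,
# but the determinism argument is the same.
import numpy as np

rng = np.random.default_rng(1)
d = 8
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
tokens = rng.standard_normal((16, d))   # the saved transcript, as embeddings

def prefill(tokens):
    """Recompute the full KV cache from the transcript in one pass."""
    return tokens @ Wk, tokens @ Wv

K_warm, V_warm = prefill(tokens)        # cache built during the live session
K_cold, V_cold = prefill(tokens)        # cache rebuilt after a flush
assert np.array_equal(K_warm, K_cold) and np.array_equal(V_warm, V_cold)
```

If the same tokens go back in, the same cache comes out, so behavior-level degradation implies the tokens themselves changed, which is what the next replies land on.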

Ben (no treats) @andersonbcdefg

@DimitrisPapail oh you mean compaction/summarization

3:34 PM · May 17, 2026 · 196 Views
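What the thread means by compaction can be sketched as follows. The function name, the `keep_last` parameter, and the summary format are hypothetical, chosen only to illustrate the key point: older turns are replaced by a lossy summary, so the model resumes from a reconstruction rather than from its original context.

```python
# Hypothetical compaction step (names and format invented for illustration).
def compact(history, keep_last=2):
    """Replace old turns with a lossy one-line summary, keeping recent turns."""
    old, recent = history[:-keep_last], history[-keep_last:]
    summary = f"[summary of {len(old)} earlier turns]"   # detail is discarded
    return [summary] + recent

history = ["turn 1: set up repo", "turn 2: fixed bug in parser",
           "turn 3: renamed module", "turn 4: added tests"]
print(compact(history))
```

Re-running prefill over `compact(history)` is fully deterministic, yet it produces a different cache than the original transcript would, because the tokens themselves are different; that, rather than cache recomputation, is consistent with the degradation described in the original post.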

Dimitris Papailiopoulos @DimitrisPapail

@andersonbcdefg Yes. Let me know if this clarifies it or makes the point more confusing :)

3:35 PM · May 17, 2026 · 163 Views

Ben (no treats) @andersonbcdefg

@DimitrisPapail yeah i think we need to stop overloading the term KV cache. its very confusing and even worse for nontechnicals lol

3:40 PM · May 17, 2026 · 61 Views


Ben (no treats) @andersonbcdefg

@DimitrisPapail @Samhanknr @yoavgo yeah. imo it's been equally smart for 2ish generations but xhigh was too slow for human in the loop so the medium thing was huge

3:44 PM · May 17, 2026 · 46 Views
Microsoft researcher reports Claude KV cache flush issues · Digg