Microsoft researcher reports Claude Code KV cache flush issues
Dimitris Papailiopoulos, Principal Researcher at Microsoft Research AI Frontiers, reported that Claude Code flushes the KV cache after idle periods, and that performance drops on resumption because the model must rebuild context from tokens instead of continuing an internal trajectory. The observation matches results from his Memento experiments on post-flush out-of-distribution behavior. Replies pointed out that re-running prefill over unchanged tokens yields a numerically identical cache, and that the degradation is better attributed to compaction/summarization of the transcript.
Found something in my daily use of Claude Code that validates our Memento results:
Claude Code flushes the KV cache after some idle period, and when I come back past that the model is noticeably harder to work with.
Conjecture: post-flush, the model is no longer continuing its trajectory. It's shoved into a weird OOD regime where it has to simulate what has happened from the tokens and resume from a reconstruction.
Which is much harder than just continuing!!
We measured this effect in our paper. KV states (soft embeddings) carry information that surface tokens don't, even when attention is masked. If you flush, you lose a lot of accuracy vs. not.
Do not flush your KV cache. Makes a model dumb.
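For context on the mechanism under discussion, here is a minimal sketch of KV caching in a single attention head with toy random weights (generic transformer machinery, not anything specific to Claude): prefill fills the cache in one pass, each decode step appends one K/V pair, and a flush means the next turn must re-run prefill over the whole transcript.

```python
# Minimal KV-cache sketch: toy weights, single head, no batching.
import torch

torch.manual_seed(0)
d = 16                                       # head dimension
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # q: (1, d); K, V: (t, d). Attend over everything currently cached.
    return torch.softmax(q @ K.T / d**0.5, dim=-1) @ V

def prefill(X):
    # X: (t, d) token embeddings -> K/V cache built in one pass.
    return X @ Wk, X @ Wv

def decode_step(x, K, V):
    # Append the new token's K/V pair, then attend over the cache.
    K, V = torch.cat([K, x @ Wk]), torch.cat([V, x @ Wv])
    return attend(x @ Wq, K, V), K, V

X = torch.randn(10, d)                       # embeddings of the session so far
x_new = torch.randn(1, d)                    # the next token

K, V = prefill(X)                            # warm cache: one O(t) pass
out, K, V = decode_step(x_new, K, V)         # each new token reuses the cache

# "Flush": the cache is gone, so serving the next token means re-running
# prefill over the entire transcript first. With unchanged tokens the
# numbers come out identical; the cost is compute, not accuracy.
K2, V2 = prefill(X)
out2, _, _ = decode_step(x_new, K2, V2)
assert torch.allclose(out, out2)
```

Note the assert: if the tokens are unchanged, rebuilding the cache is lossless, which is exactly the objection raised later in the thread.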
@sytelus makes total sense to me if textual consolidation is on new prefixes, rather than masked tokens/evicted KVs. The latter is a case where information can still leak forward (which we want!)
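The "information can still leak forward" point can be made concrete with a toy two-layer attention stack (random weights, not any production model): later positions' K/V states were computed while attending to an early token, so evicting that token's cache entries is not the same as deleting it from the text.

```python
# Two toy attention layers: evicting token 0's KV entries vs. deleting
# token 0 from the text yield different caches, because token 0's
# information persists inside later positions' layer-2 K/V states.
import torch

torch.manual_seed(0)
d, t = 16, 6
W = {n: torch.randn(d, d) for n in ("q1", "k1", "v1", "k2", "v2")}

def causal_attn(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    mask = torch.triu(torch.full((len(X), len(X)), float("-inf")), diagonal=1)
    return torch.softmax(Q @ K.T / d**0.5 + mask, dim=-1) @ V

def layer2_kv(X):
    H = causal_attn(X, W["q1"], W["k1"], W["v1"])    # h_i mixes x_0..x_i
    return H @ W["k2"], H @ W["v2"]                  # layer-2 K/V built from H

X = torch.randn(t, d)
K_full, _ = layer2_kv(X)
K_evict = K_full[1:]               # evict token 0's entries from the cache
K_retext, _ = layer2_kv(X[1:])     # delete token 0 from the text instead

print(torch.allclose(K_evict, K_retext))     # False
print((K_evict - K_retext).abs().max())      # token 0 still leaks forward
```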
@DimitrisPapail What do you think about this:
@andersonbcdefg @Samhanknr @yoavgo I’m a very calm person and my anger levels get frustratingly high every morning when I check back on my idle Claude code sessions 😢 I’ve started using Codex more lately
@DimitrisPapail @Samhanknr @yoavgo yep. isnt as bad with codex. i think they sorta solved it. i work across compaction windows regularly with codex and it might not be QUITE as good but it's not white hot rage inducing like Dumb Claude
@andersonbcdefg @Samhanknr @yoavgo I also was surprised how good it is at medium!!
@DimitrisPapail @Samhanknr @yoavgo it's just hands down better in every way since 5.5 :) being able to use on medium was the tipping pt for me
@DimitrisPapail wha?? if you re-do prefill, the KV cache is numerically the same as if it was warm. are you using the term to mean something different?
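The objection is easy to check directly. A minimal sketch with Hugging Face transformers, using gpt2 as a stand-in model: re-running prefill over unchanged tokens reproduces the cache to numerical precision, so a literal flush costs prefill latency, not accuracy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The full transcript of the session so far.", return_tensors="pt").input_ids
with torch.no_grad():
    warm = model(ids, use_cache=True).past_key_values   # cache from the live session
    redo = model(ids, use_cache=True).past_key_values   # cache rebuilt after a "flush"

# Layer-0 keys agree to numerical precision: same tokens in, same cache out.
# (past_key_values[layer][0] is the key tensor under both the legacy tuple
# format and the newer Cache objects' backward-compatible indexing.)
assert torch.allclose(warm[0][0], redo[0][0])
```

Any accuracy loss therefore has to come from changing the tokens themselves, which is where the next reply lands.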
@DimitrisPapail oh you mean compaction/summarization
@andersonbcdefg Yes. Let me know if this clarifies it or makes the point more confusing :)
@DimitrisPapail yeah i think we need to stop overloading the term KV cache. its very confusing and even worse for nontechnicals lol
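To disentangle the two overloaded senses: a flush discards derived state but keeps the tokens, while compaction rewrites the tokens themselves. A minimal sketch of the distinction; Claude Code's actual compaction logic is not public, and `summarize` is a hypothetical helper here.

```python
from typing import Callable, List

def flush_and_resume(transcript: List[str]) -> List[str]:
    # KV cache discarded, transcript intact: re-prefill re-derives the
    # same state, so the only cost is latency.
    return transcript

def compact_and_resume(transcript: List[str],
                       summarize: Callable[[List[str]], str],
                       keep_last: int = 20) -> List[str]:
    # Old turns are replaced by a summary: the model resumes from a
    # reconstruction of the session rather than the session itself.
    if len(transcript) <= keep_last:
        return transcript
    head, tail = transcript[:-keep_last], transcript[-keep_last:]
    return [summarize(head)] + tail

turns = [f"turn {i}" for i in range(100)]
print(compact_and_resume(turns, lambda h: f"<summary of {len(h)} turns>")[0])
```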