Dimitris Papailiopoulos reports that Claude Code flushes its KV cache after periods of inactivity, aligning with Memento paper findings on reconstruction challenges upon resumption.


Found something in my daily use of Claude Code that validates our Memento results:

Claude Code flushes the KV cache after some idle period, and when I come back past that the model is noticeably harder to work with.

Conjecture: post-flush, the model is no longer continuing its trajectory. It's shoved into a weird OOD regime where it has to simulate what has happened from the tokens and resume from a reconstruction.

Which is much harder than just continuing!!

We measured this effect in our paper. KV states (soft embeddings) carry information that text tokens don't, even when attention is masked.

Bottom line: If you flush your cache you lose a lot of accuracy!

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2041557735926329344


Dimitris Papailiopoulos@DimitrisPapail

Do not flush your KV cache. Makes a model dumb.


Yam Peleg@Yampeleg

Not your harness, not your time.


Ben (no treats)@andersonbcdefg

@DimitrisPapail wha?? if you re-do prefill, the KV cache is numerically the same as if it was warm. are you using the term to mean something different?
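A minimal sketch of the point this reply is making, assuming a toy two-layer causal attention model written in pure Python. The model, its sizes, and the names `make_model`, `prefill`, `decode_step`, and `layer_diff` are hypothetical illustrations, not any real serving stack. When the token sequence is unchanged, a cache rebuilt by prefill matches the cache that was grown incrementally:

```python
import math
import random

D = 4  # toy hidden size (assumption)

def make_model(vocab=10, n_layers=2, seed=0):
    """Random embeddings plus Q/K/V matrices per layer (hypothetical toy model)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
    return {"embed": mat(vocab, D),
            "layers": [{name: mat(D, D) for name in "qkv"} for _ in range(n_layers)]}

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(D)]

def prefill(model, tokens):
    """Layer-by-layer causal pass over all tokens; returns per-layer (keys, values)."""
    hidden = [model["embed"][t] for t in tokens]
    cache = []
    for layer in model["layers"]:
        keys = [matvec(layer["k"], h) for h in hidden]
        values = [matvec(layer["v"], h) for h in hidden]
        nxt = []
        for i, h in enumerate(hidden):
            out = attend(matvec(layer["q"], h), keys[:i + 1], values[:i + 1])
            nxt.append([a + b for a, b in zip(h, out)])  # residual connection
        hidden = nxt
        cache.append((keys, values))
    return cache

def decode_step(model, token, cache):
    """Process one new token, appending its K/V to the existing cache in place."""
    h = model["embed"][token]
    for layer, (keys, values) in zip(model["layers"], cache):
        keys.append(matvec(layer["k"], h))
        values.append(matvec(layer["v"], h))
        out = attend(matvec(layer["q"], h), keys, values)
        h = [a + b for a, b in zip(h, out)]
    return h

def layer_diff(kv_a, kv_b):
    """Largest absolute gap between two (keys, values) pairs."""
    return max(abs(x - y)
               for rows_a, rows_b in zip(kv_a, kv_b)
               for ra, rb in zip(rows_a, rows_b)
               for x, y in zip(ra, rb))

model = make_model()
tokens = [3, 1, 4, 1, 5, 9, 2, 6]

warm = prefill(model, tokens[:4])      # cache built early in the session
for t in tokens[4:]:
    decode_step(model, t, warm)        # then grown one token at a time

rebuilt = prefill(model, tokens)       # cache dropped, rebuilt from the same tokens

gap = max(layer_diff(a, b) for a, b in zip(warm, rebuilt))
print("largest warm-vs-rebuilt gap:", gap)
```

In real inference stacks the two paths may run different kernels and batch shapes, so agreement is usually up to floating-point tolerance rather than bitwise.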


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

…wait, that's exactly what Dimitris has solved.

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2041557735926329344


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Dimitris means the effect of context compaction, not old vs regenerated kv cache (easy to check: resume old session, should be fine). But the issue might be nontrivial. How good is the compaction logic? Is this just "insufficient information", or "OOD form of sequence"?


Ben (no treats)@andersonbcdefg

@DimitrisPapail oh you mean compaction/summarization


Ben (no treats)@andersonbcdefg

@DimitrisPapail yeah i think we need to stop overloading the term KV cache. its very confusing and even worse for nontechnicals lol

Dimitris Papailiopoulos@DimitrisPapail

@andersonbcdefg Yes. Let me know if this clarifies it or makes the point more confusing :)


Dimitris Papailiopoulos@DimitrisPapail

@andersonbcdefg @Samhanknr @yoavgo I’m a very calm person and my anger levels get frustratingly high every morning when I check back on my idle Claude code sessions 😢 I’ve started using Codex more lately

Ben (no treats)@andersonbcdefg

@DimitrisPapail @Samhanknr @yoavgo yep. isnt as bad with codex. i think they sorta solved it. i work across compaction windows regularly with codex and it might not be QUITE as good but it's not white hot rage inducing like Dumb Claude


stochasm@stochasticchasm

@DimitrisPapail also btw they do this for reasoning specifically as well


Nicholas Bardy@NicholasBardy

@DimitrisPapail Wait. I’m confused: aren’t KV states directly derived from the token inputs? Caching is only for speed-up?

Trying to understand how the values in the cache would diverge. They come from string inputs to start, and each consecutive one comes from the LM head selecting a new token


Dimitris Papailiopoulos@DimitrisPapail

When a session is idle, the cache is flushed and the KVs are recomputed from whatever text the session has preserved. So if the original session had any kind of context compression (compaction, sliding window masking, pruning of old tool results, CoT, anything that drops tokens from the visible prefix while keeping later KVs), then the original KVs are different: they were computed in the presence of tokens that no longer appear in the rebuild input.

The cache rebuild only sees what's currently in the session text file. Whatever was compacted away is invisible to the recomputation.

All I am saying is that we know this can hurt accuracy, and I've noticed myself how Claude gets dumber after idle sessions
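The mechanism described here can be sketched with a toy model, under loudly stated assumptions: a hypothetical two-layer causal attention stack in pure Python with no positional encoding and no MLP, where "compaction" simply drops the first four tokens from the text. The names `make_model`, `prefill`, and `layer_diff` are illustrative only. The KVs kept from the original run were computed while the dropped tokens were still visible, so a rebuild from the surviving text does not reproduce them above the first layer:

```python
import math
import random

D = 4  # toy hidden size (assumption)

def make_model(vocab=10, n_layers=2, seed=0):
    """Random embeddings plus Q/K/V matrices per layer (hypothetical toy model)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
    return {"embed": mat(vocab, D),
            "layers": [{name: mat(D, D) for name in "qkv"} for _ in range(n_layers)]}

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(D)]

def prefill(model, tokens):
    """Layer-by-layer causal pass over all tokens; returns per-layer (keys, values)."""
    hidden = [model["embed"][t] for t in tokens]
    cache = []
    for layer in model["layers"]:
        keys = [matvec(layer["k"], h) for h in hidden]
        values = [matvec(layer["v"], h) for h in hidden]
        nxt = []
        for i, h in enumerate(hidden):
            out = attend(matvec(layer["q"], h), keys[:i + 1], values[:i + 1])
            nxt.append([a + b for a, b in zip(h, out)])  # residual connection
        hidden = nxt
        cache.append((keys, values))
    return cache

def layer_diff(kv_a, kv_b):
    """Largest absolute gap between two (keys, values) pairs."""
    return max(abs(x - y)
               for rows_a, rows_b in zip(kv_a, kv_b)
               for ra, rb in zip(rows_a, rows_b)
               for x, y in zip(ra, rb))

model = make_model()
tokens = [3, 1, 4, 1, 5, 9, 2, 6]
DROPPED = 4  # number of prefix tokens removed by the hypothetical compaction

# Live session: KVs for the later tokens were computed with the full prefix visible.
original = prefill(model, tokens)
kept = [(keys[DROPPED:], values[DROPPED:]) for keys, values in original]

# After the flush: the rebuild only sees the text that survived compaction.
rebuilt = prefill(model, tokens[DROPPED:])

for i, (a, b) in enumerate(zip(kept, rebuilt)):
    print(f"layer {i}: largest kept-vs-rebuilt gap = {layer_diff(a, b)}")
```

In this toy, first-layer K/V depend only on each token's own embedding, so they survive the rebuild; second-layer K/V are computed from hidden states that had already attended to the dropped tokens, so they change. A model with positional encodings would likely see further differences from the shifted positions, which this sketch does not model.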


Shuming Hu@ShumingHu

@DimitrisPapail Are you suggesting they are doing KV cache compression?

Otherwise, text token + model code can reconstruct full KVcache? aka prefilling


Sakura Yuki@sakurayukiai

@DimitrisPapail People optimize so hard for VRAM that they forget the KV cache is basically the model's short-term memory. Aggressive eviction policies are out here giving agents forced amnesia.


Shital Shah@sytelus

@DimitrisPapail What do you think about this:

Dylan Zhang@dylan_works_

Wrote up something fun I’ve been poking at: when LLM agents repeatedly rewrite their own experiences into textual “lessons,” their memory can get worse, not better.

Across several environments, we found a recurring pattern: forced consolidation often degrades useful experience into faulty or overgeneralized memories. Interestingly, models seem much better at managing examples as memory objects than at distilling them into reusable routines.

Maybe we should be more careful about asking agents to constantly “consolidate” experience into lessons 🤔?

I’m new to this area, so I’d love thoughts. I may be missing context or just wrong on parts of it — please don’t hesitate to let me know! Discussions are always welcome.

http://dylanzsz.github.io/faulty-memory


Igor Kotenkov@stalkermustang

@DimitrisPapail I think it's a bit misleading to call this "KV cache flush." The first association that people get is wrong, and you had to clarify this several times already

As @giffmana said, GPT-4.5 was great for naming, here are the suggested results. I like "KV prefix reset".


cheaty@cheatyyyy

@DimitrisPapail it's the exact same model prefilling this makes no sense


Dimitris Papailiopoulos@DimitrisPapail

@andersonbcdefg @Samhanknr @yoavgo I also was surprised how good it is at medium!!

Ben (no treats)@andersonbcdefg

@DimitrisPapail @Samhanknr @yoavgo it's just hands down better in every way since 5.5 :) being able to use on medium was the tipping pt for me
