This paper teaches LLMs to save memory by keeping only past tokens likely to matter later.
The problem is that the key-value cache, the model’s working memory of earlier tokens, keeps growing as the generated text gets longer.
Instead of saving every old token, the paper adds a small predictor that scores each key-value pair by how useful it seems for future tokens.
Recent tokens are always kept, because nearby words usually matter, but older tokens enter the long-term cache only when their score is high enough.
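
The keep-or-drop rule described above can be sketched in a few lines of Python. This is only an illustrative toy, not the paper's implementation: the class name `PrunedKVCache`, the `window` and `threshold` parameters, and the made-up scores are all assumptions, and the real predictor is a learned component inside the model rather than a number passed in by hand.

```python
from dataclasses import dataclass, field


@dataclass
class PrunedKVCache:
    """Toy cache: a window of recent entries plus a score-gated long-term store."""
    window: int        # hypothetical: how many recent tokens are always kept
    threshold: float   # hypothetical: minimum score to enter the long-term store
    recent: list = field(default_factory=list)      # (position, kv, score)
    long_term: list = field(default_factory=list)   # (position, kv)

    def write(self, position, kv, score):
        """Add a token's key-value pair together with its predicted utility score."""
        self.recent.append((position, kv, score))
        # Once an entry ages out of the recent window, decide whether to keep it.
        while len(self.recent) > self.window:
            old_pos, old_kv, old_score = self.recent.pop(0)
            if old_score >= self.threshold:
                self.long_term.append((old_pos, old_kv))
            # Otherwise the entry is dropped and can no longer be attended to.

    def visible_positions(self):
        """Positions that attention could still read from."""
        return [p for p, _ in self.long_term] + [p for p, _, _ in self.recent]


cache = PrunedKVCache(window=2, threshold=0.5)
scores = [0.9, 0.1, 0.7, 0.2, 0.8]  # made-up utility scores, one per token
for pos, s in enumerate(scores):
    cache.write(pos, kv=f"kv{pos}", score=s)
print(cache.visible_positions())
```

In this toy run, the two newest tokens stay visible regardless of score, while older tokens survive only if their score reached the threshold.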
The authors trained this system together with the LLM using only normal next-token prediction, so the model learns its own pruning behavior rather than following a fixed hand-made rule.
They tested it across model sizes, long-context settings, downstream tasks, and decoding speed, then compared it with full attention and several cache-pruning methods.
The main result is that the model usually keeps only about 10% to 33.7% of older key-value entries, while closely matching full-attention performance and decoding 2.1 to 4.6 times faster in some long-context batches.
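
To get a feel for what those retention rates mean for memory, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 key-value heads, head dimension 128, 2 bytes per value), the 32,768-token context, and the 1,024-token recent window are all hypothetical numbers chosen for illustration and do not come from the paper. Only the 10% and 33.7% retention rates are taken from the summary above.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of a standard KV cache; the factor of 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def pruned_tokens(total_tokens, window, keep_fraction):
    """Tokens left when the recent window is kept in full and only a
    fraction of the older tokens is retained."""
    older = max(total_tokens - window, 0)
    return min(total_tokens, window) + round(older * keep_fraction)


# Hypothetical model shape and context length (assumptions, not from the paper).
shape = dict(layers=32, kv_heads=8, head_dim=128)
total, window = 32_768, 1_024

full = kv_cache_bytes(total, **shape)
for keep in (0.10, 0.337):
    kept = pruned_tokens(total, window, keep)
    pruned = kv_cache_bytes(kept, **shape)
    print(f"keep {keep:.1%} of older tokens: {kept} tokens, "
          f"{pruned / full:.1%} of the full cache")
```

Under these assumed numbers the per-sequence cache shrinks to a fraction of its full size, and that smaller cache is what makes larger batches and faster decoding plausible.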
----
Paper Link – arxiv.org/abs/2605.14037
Paper Title: "Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility"