
Meta FAIR Introduces SP-KV to Dynamically Prune LLM Key-Value Caches

Rohan Paul (@rohanpaul_ai)

This paper teaches LLMs to save memory by keeping only past tokens likely to matter later.

The problem is that during long text generation the key-value cache, the model’s working memory of earlier tokens, grows with every token processed.
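To make the growth concrete, here is a rough back-of-the-envelope sketch of cache size versus sequence length. The model configuration (32 layers, 8 key-value heads, head dimension 128, 2 bytes per value) is a hypothetical example, not taken from the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to hold keys and values for every cached token."""
    # The leading 2 accounts for storing both a key and a value per head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token


# Hypothetical configuration, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(seq_len, **config) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.2f} GiB")
```

The cost is linear in sequence length and batch size, which is why dropping entries that are unlikely to be read again can pay off at long context.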

Instead of saving every old token, the paper adds a small predictor that scores each key-value pair by how useful it seems for future tokens.
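The post does not describe the predictor's architecture, so the following is only a minimal sketch of the general idea: a single linear layer plus a sigmoid that turns a token's hidden vector into a utility score between 0 and 1. The function name `utility_score`, the weights, and the hidden states are all made up for illustration.

```python
import math


def utility_score(hidden, weights, bias):
    """Map one token's hidden vector to a score in (0, 1)."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Made-up parameters and hidden states for three tokens.
weights = [0.5, -1.0, 2.0]
bias = -0.5
hidden_states = [
    [0.1, 0.2, 0.0],
    [1.0, -1.0, 1.5],
    [-2.0, 0.5, -1.0],
]

scores = [utility_score(h, weights, bias) for h in hidden_states]
for i, s in enumerate(scores):
    print(f"token {i}: score {s:.3f}")
```

In a real model the score would presumably be computed per layer or per head from learned parameters; this toy version only shows the shape of the computation.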

Recent tokens are always kept, because nearby words usually matter, but older tokens enter the long-term cache only when their score is high enough.
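The write rule described above can be sketched as a toy cache with two parts: a recent window that is kept unconditionally, and a long-term store that an entry enters only if its score clears a threshold. The class name `SelfPrunedCache`, the window size, the threshold, and the choice to apply the gate at the moment a token leaves the window are assumptions for illustration, not details confirmed by the post.

```python
from collections import deque


class SelfPrunedCache:
    """Toy cache: an unconditional recent window plus a gated long-term store."""

    def __init__(self, window_size, threshold):
        self.window_size = window_size
        self.threshold = threshold
        self.recent = deque()   # (position, kv, score), newest last
        self.long_term = []     # (position, kv) entries that passed the gate

    def append(self, position, kv, score):
        self.recent.append((position, kv, score))
        if len(self.recent) > self.window_size:
            old_pos, old_kv, old_score = self.recent.popleft()
            # A token leaving the window is written to long-term memory
            # only if its predicted utility clears the threshold.
            if old_score >= self.threshold:
                self.long_term.append((old_pos, old_kv))

    def visible_positions(self):
        """Positions that attention could still read."""
        return [p for p, _ in self.long_term] + [p for p, _, _ in self.recent]


cache = SelfPrunedCache(window_size=2, threshold=0.5)
scores = [0.9, 0.1, 0.7, 0.2, 0.3, 0.8]
for pos, score in enumerate(scores):
    cache.append(pos, kv=f"kv{pos}", score=score)

print(cache.visible_positions())  # [0, 2, 4, 5]
```

Positions 1 and 3 scored below the threshold and were dropped once they left the window, while 4 and 5 are still inside the window and kept regardless of score.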

The authors trained this system together with the LLM using only normal next-token prediction, so the model learns its own pruning behavior rather than following a fixed, hand-crafted rule.

They tested it across model sizes, long-context settings, downstream tasks, and decoding speed, then compared it with full attention and several cache-pruning methods.

The main result is that the model usually keeps only about 10% to 33.7% of older key-value entries, while closely matching the unpruned model’s performance and decoding 2.1 to 4.6 times faster in some long-context batches.
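As a rough illustration of what those retention rates mean for cache size, the sketch below counts cached entries when the recent window is kept in full and only a fraction of older entries survives. The sequence length (32,000) and window size (1,000) are hypothetical; only the 10% and 33.7% figures come from the post. The decoding speedups are reported measurements and are not derived here.

```python
def retained_entries(seq_len, window_size, keep_fraction):
    """Cached entries when the recent window is kept and older ones are pruned."""
    recent = min(seq_len, window_size)
    older = max(seq_len - window_size, 0)
    return recent + round(older * keep_fraction)


seq_len, window_size = 32_000, 1_000  # hypothetical values

for keep_fraction in (0.10, 0.337):
    kept = retained_entries(seq_len, window_size, keep_fraction)
    print(f"keep {keep_fraction:.1%} of older entries -> "
          f"{kept} of {seq_len} cached ({kept / seq_len:.1%})")
```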

----

Paper Link – arxiv.org/abs/2605.14037

Paper Title: "Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility"

8:01 PM · Jun 2, 2026 · 3.5K Views
12 Retweets
3.5K Views · 59 Likes · 34 Bookmarks