Meta FAIR Introduces SP-KV to Dynamically Prune LLM Key-Value Caches

Rohan Paul (@rohanpaul_ai)

This paper teaches LLMs to save memory by keeping only past tokens likely to matter later.

The problem is that the key-value cache, the model’s working memory of earlier tokens, keeps growing during long text generation.

Instead of saving every old token, the paper adds a small predictor that scores each key-value pair by how useful it seems for future tokens.

Recent tokens are always kept, because nearby words usually matter, but older tokens enter the long-term cache only when their score is high enough.
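
A minimal sketch of that write rule, under stated assumptions: the class name `PrunedKVCache`, the `window` size, and the `threshold` value are all hypothetical, and the utility scores are passed in by hand rather than produced by a learned predictor. It only illustrates "always keep recent tokens, admit older ones by score" and is not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PrunedKVCache:
    """Toy cache: a window of recent entries plus a gated long-term store."""
    window: int = 4          # hypothetical: number of recent tokens always kept
    threshold: float = 0.5   # hypothetical: minimum score for long-term storage
    recent: list = field(default_factory=list)     # (position, kv, score)
    long_term: list = field(default_factory=list)  # (position, kv)

    def append(self, position, kv, score):
        """Add a new token's key-value pair with its predicted utility score."""
        self.recent.append((position, kv, score))
        if len(self.recent) > self.window:
            old_pos, old_kv, old_score = self.recent.pop(0)
            # A token leaving the recent window is written to the long-term
            # cache only if its score is high enough; otherwise it is dropped.
            if old_score >= self.threshold:
                self.long_term.append((old_pos, old_kv))

    def visible_positions(self):
        """Positions that a newly generated token could still attend to."""
        return [p for p, _ in self.long_term] + [p for p, _, _ in self.recent]


# Hand-picked scores standing in for the predictor's output.
cache = PrunedKVCache(window=2, threshold=0.5)
for pos, score in enumerate([0.9, 0.1, 0.7, 0.2, 0.3]):
    cache.append(pos, kv=f"kv{pos}", score=score)

print(cache.visible_positions())  # [0, 2, 3, 4]
```

Position 1 scored below the threshold, so it disappears once it leaves the two-token window, while positions 0 and 2 survive in the long-term store.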

The authors trained this system together with the LLM using only normal next-token prediction, so the model learns its own pruning behavior rather than following a fixed hand-made rule.

They tested it across model sizes, long-context settings, downstream tasks, and decoding speed, then compared it with full attention and several cache-pruning methods.

The main result is that the model usually keeps only about 10% to 33.7% of older key-value entries, while closely matching the performance of the unpruned model and reaching 2.1 to 4.6 times faster decoding in some long-context batches.
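
To get a feel for what those retention rates could mean for memory, here is a rough back-of-the-envelope sketch. The model shape (32 layers, 8 key-value heads, head dimension 128, 16-bit values), the 131,072-token context, and the 1,024-token recent window are all assumptions chosen for illustration, not figures from the paper, and the sketch assumes the retention rate applies uniformly across layers and heads.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of a standard KV cache: one key and one value vector per token,
    per layer, per key-value head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def retained_tokens(context_len, window, keep_fraction):
    """Tokens kept when the recent window is stored in full and only a
    fraction of the older tokens is written to the long-term cache."""
    older = max(context_len - window, 0)
    return min(context_len, window) + round(older * keep_fraction)


# Hypothetical configuration, for illustration only.
shape = dict(layers=32, kv_heads=8, head_dim=128)
context_len, window = 131_072, 1_024

full = kv_cache_bytes(context_len, **shape)
print(f"full cache: {full / 2**30:.2f} GiB")

# The two ends of the retention range quoted in the post.
for keep_fraction in (0.10, 0.337):
    kept = retained_tokens(context_len, window, keep_fraction)
    pruned = kv_cache_bytes(kept, **shape)
    print(f"keep {keep_fraction:.1%} of older tokens: "
          f"{kept} tokens, {pruned / 2**30:.2f} GiB")
```

Because the cache size is linear in the number of stored tokens, the saving tracks the retention rate almost directly once the context is much longer than the recent window.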

----

Paper Link – arxiv.org/abs/2605.14037

Paper Title: "Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility"

8:01 PM · Jun 2, 2026 · 3.5K Views

Sentiment

Users praise Meta FAIR's SP-KV method for dynamically pruning LLM key-value caches because it lets models learn what to retain and cuts memory use dramatically while preserving performance.

Positive: 100.0%, Negative: 0.0% (3 comments with sentiment)
