
Meta AI researchers introduce Self-Pruned Key-Value Attention


Meta AI researchers introduced Self-Pruned Key-Value Attention, a mechanism for reducing memory use in large language models. The approach trains the model to predict the future utility of key-value pairs and to retain only the relevant entries in the persistent KV cache, discarding the rest. A learned utility predictor is paired with a small local buffer of recent entries to keep cache growth bounded. The May 15, 2026 paper lists affiliations with Meta FAIR and CentraleSupélec.
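The idea can be sketched as follows: new key-value pairs first land in a small sliding-window buffer, and on eviction a utility predictor decides whether each pair enters the persistent cache or is discarded. This is an illustrative sketch only; the class name, the linear predictor, and all parameters are assumptions, not the paper's implementation.

```python
import numpy as np

class SelfPrunedKVCache:
    """Sketch of a KV cache with selective retention (names/parameters assumed).

    Recent pairs live in a fixed-size local buffer; on eviction, a utility
    predictor (here a stand-in linear scorer, not a trained module) decides
    whether the pair is written to the persistent cache or discarded.
    """

    def __init__(self, dim, buffer_size=4, threshold=0.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=dim)  # stand-in for a learned utility predictor
        self.buffer_size = buffer_size
        self.threshold = threshold
        self.persistent = []           # (key, value) pairs judged useful
        self.buffer = []               # sliding window of recent pairs

    def utility(self, key):
        # Predicted future usefulness of this key (a simple linear score here).
        return float(self.w @ key)

    def write(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) > self.buffer_size:
            old_k, old_v = self.buffer.pop(0)
            # Keep the evicted pair only if the predictor deems it useful;
            # otherwise it is dropped, so the cache grows sublinearly.
            if self.utility(old_k) > self.threshold:
                self.persistent.append((old_k, old_v))

    def attend(self, query):
        # Standard softmax attention over the pruned cache plus the local buffer.
        entries = self.persistent + self.buffer
        keys = np.stack([k for k, _ in entries])
        values = np.stack([v for _, v in entries])
        scores = keys @ query / np.sqrt(len(query))
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ values
```

With a stream of tokens, the buffer stays at a fixed size while only predicted-useful pairs accumulate in the persistent cache, which is the memory saving the post describes.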

Original post

🚨 Do LLMs need to store everything they read in memory? To reduce KV cache size and improve decoding speeds, we propose Self-Pruned KV attention, a mechanism where the model learns to decide which KVs to write in the persistent KV cache, discarding all the rest! @AIatMeta🧵

2:12 AM · May 15, 2026