14h ago

Yoav Gelberg and a co-author propose training LLMs to produce KV representations that are easier to compress, making KV cache compaction more efficient

The approach complements post-hoc methods such as Cartridges and attention matching, which compress the KV cache of a fixed model.

Original post

1/ How much can you compress an LLM’s KV cache? tl;dr it depends on how you train your model. Many strong context compaction methods, such as Cartridges and attention matching, operate post-hoc: given a fixed model and a context, they try to compress the resulting KV cache. @yoav_gelberg and I ask the complementary question: can we train the model to produce KV representations that are easier to compress? In other words: keep the compression method fixed, and change the representations it sees.

6:24 AM · May 25, 2026