1/
How much can you compress an LLM’s KV cache?
tl;dr: it depends on how you train your model.
Many strong context compaction methods, such as Cartridges and attention matching, operate post-hoc: given a fixed model and a context, they try to compress the resulting KV cache.
@yoav_gelberg and I ask the complementary question:
Can we train the model to produce KV representations that are easier to compress?
In other words: keep the compression method fixed, and change the representations it sees.
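A toy sketch of that framing, not our actual method: the compressor here is plain rank-8 SVD truncation (a stand-in assumption, not Cartridges or attention matching), and the two "KV caches" are synthetic matrices invented for illustration. The same fixed compressor is applied to both; only the representation changes.

```python
import numpy as np

def compress_low_rank(kv, rank):
    """Fixed compressor (illustrative stand-in): keep the top-`rank` singular directions."""
    u, s, vt = np.linalg.svd(kv, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def relative_error(kv, rank):
    """Relative Frobenius reconstruction error after compression."""
    approx = compress_low_rank(kv, rank)
    return float(np.linalg.norm(kv - approx) / np.linalg.norm(kv))

rng = np.random.default_rng(0)
tokens, dim, rank = 256, 64, 8  # hypothetical sizes, chosen only for the demo

# Representation A: unstructured Gaussian entries, no low-rank structure to exploit.
kv_unstructured = rng.standard_normal((tokens, dim))

# Representation B: approximately rank-4 plus small noise, standing in for
# representations that were shaped to be easy to compress.
kv_structured = (
    rng.standard_normal((tokens, 4)) @ rng.standard_normal((4, dim))
    + 0.01 * rng.standard_normal((tokens, dim))
)

err_unstructured = relative_error(kv_unstructured, rank)
err_structured = relative_error(kv_structured, rank)
print(f"unstructured: {err_unstructured:.3f}  structured: {err_structured:.3f}")
```

Same compressor, same budget, very different reconstruction error. In this toy setup the gap comes entirely from the structure of the matrix being compressed.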