
Baseten Demonstrates MLP Compression Of LLM KV Caches For Long Contexts


Original post

Jay Hack (@mathemagic1an) · #1193 in Tech

Interesting approach to long context/continual learning here from @baseten at @cursor_ai

Compact long trajectories by compressing a prefix of the KV cache using an MLP/autoencoder. You can train this "compactor" MLP by learning to reconstruct activations that the original KV cache would produce on subsequent tokens.
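A minimal sketch of that reconstruction objective, under loud assumptions: a single attention head, random toy data, and a mean-pooling function `mean_pool` standing in for the learned compactor MLP (the post does not describe the actual architecture or loss). It only computes the quantity a compactor would be trained to minimize: the gap between what later queries read from the full prefix cache and what they read from the compressed one.

```python
import math
import random

def attend(queries, keys, values):
    """Single-head softmax attention: one output vector per query."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                        for j in range(len(values[0]))])
    return outputs

def reconstruction_loss(queries, full_kv, compact_kv):
    """MSE between attention outputs under the full and the compacted cache."""
    target = attend(queries, *full_kv)
    approx = attend(queries, *compact_kv)
    diffs = [(t - a) ** 2 for tv, av in zip(target, approx) for t, a in zip(tv, av)]
    return sum(diffs) / len(diffs)

def mean_pool(rows, group):
    """Hypothetical placeholder compactor: average every `group` consecutive rows."""
    pooled = []
    for i in range(0, len(rows), group):
        chunk = rows[i:i + group]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled

rng = random.Random(0)

def rand_matrix(n, d):
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Toy prefix cache of 16 entries; queries stand in for subsequent tokens.
keys, values = rand_matrix(16, 4), rand_matrix(16, 4)
queries = rand_matrix(8, 4)

compact = (mean_pool(keys, 4), mean_pool(values, 4))  # 16 entries -> 4
loss_identity = reconstruction_loss(queries, (keys, values), (keys, values))
loss_pooled = reconstruction_loss(queries, (keys, values), compact)
```

A trained compactor would replace `mean_pool` and be optimized to push `loss_pooled` toward zero. In a real multi-layer model the comparison would presumably be made on downstream activations rather than on one head's output.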

This maximally reconstructs information from long context that's useful for subsequent outputs.

If you run an agent over extremely long context and run this compaction recursively, the activations of this compressed KV cache become like trained weights. Similar to task-specific LoRA or "cartridges".
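The recursive part can be sketched as a fixed-budget loop. This is an illustration only: `compact`, `step`, and the `budget`, `keep_recent`, and `latent` parameters are hypothetical names, and simple averaging stands in for the learned compactor. The point is that previously compacted entries are part of the next prefix, so each compaction folds older summaries into the new one and the state never exceeds a fixed size.

```python
def compact(prefix, size):
    """Hypothetical stand-in for the learned compactor: pool `prefix` into `size` rows."""
    n = len(prefix)
    out = []
    for i in range(size):
        lo, hi = i * n // size, (i + 1) * n // size
        chunk = prefix[lo:hi]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out

def step(cache, entry, budget=8, keep_recent=4, latent=2):
    """Append one cache entry; if over budget, compress everything but the recent tail."""
    cache = cache + [entry]
    if len(cache) > budget:
        prefix, recent = cache[:-keep_recent], cache[-keep_recent:]
        # The prefix already contains earlier compacted rows, so this is recursive.
        cache = compact(prefix, latent) + recent
    return cache

cache = []
sizes = []
for t in range(100):
    cache = step(cache, [float(t), 1.0])  # toy 2-dimensional "KV entry"
    sizes.append(len(cache))
```

After 100 steps the cache still holds at most `budget` rows: the most recent raw entries plus a couple of rows summarizing everything earlier.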

Would not be surprised if OpenAI is running a similar algorithm for their black-box compaction. Clear benefits here if you can avoid busting the cache, as compacting via writing to a text file, for example, would.

Seeing many emerging approaches to fixed-size latent states for LLMs that seem promising. If building a task-specific KV cache compression ends up being more sample-efficient than running backprop, getting this to "work" feels like one of the 2-3 remaining breakthroughs on the path to true AGI.

6:12 AM · Jul 2, 2026 · 7.5K Views

Sentiment

Users are cheering on the Baseten team for demonstrating MLP compression of LLM KV caches to handle longer contexts.

Pos: 100.0% · Neg: 0.0% (1 comment with sentiment)


Posts from X

Most Activity

Jay Hack (@mathemagic1an)

@baseten @cursor_ai https://www.youtube.com/watch?v=I8YnwUV2C9w

Baseten (@baseten)

@mathemagic1an @cursor_ai Go @mudithj! 🚀
