Interesting approach to long context/continual learning here from @baseten at @cursor_ai
Compact long trajectories by compressing a prefix of the KV cache using an MLP/autoencoder. You can train this "compactor" MLP by learning to reconstruct activations that the original KV cache would produce on subsequent tokens.
So the compactor is optimized to keep whatever information from the long context is useful for subsequent outputs.
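A toy sketch of that reconstruction objective, not the actual method from the post being discussed. Assumptions, all mine: a single attention head, random vectors standing in for real keys/values/queries, a learned linear mixing matrix standing in for the compactor MLP, and finite-difference gradients instead of backprop so it runs on the Python standard library alone. Every function name here is hypothetical.

```python
import math
import random

def attend(q, keys, values):
    """Single-head softmax attention output for one query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def compact(mix, rows):
    """Toy compactor: each compact slot is a learned mix of the prefix rows."""
    return [[sum(w * r[j] for w, r in zip(ws, rows)) for j in range(len(rows[0]))]
            for ws in mix]

def recon_loss(mix, keys, values, queries, targets):
    """MSE between attention over the full cache (targets) and the compacted one."""
    ck, cv = compact(mix, keys), compact(mix, values)
    err = 0.0
    for q, target in zip(queries, targets):
        out = attend(q, ck, cv)
        err += sum((a - b) ** 2 for a, b in zip(target, out))
    return err / len(queries)

def train(keys, values, queries, slots=2, steps=60, lr=0.5, eps=1e-4):
    n = len(keys)
    # What the original (uncompressed) cache produces on the later queries.
    targets = [attend(q, keys, values) for q in queries]
    # Initialise as chunked mean pooling: slot s averages its share of the prefix.
    mix = [[1.0 if i * slots // n == s else 0.0 for i in range(n)]
           for s in range(slots)]
    mix = [[w / sum(ws) for w in ws] for ws in mix]
    loss = recon_loss(mix, keys, values, queries, targets)
    history = [loss]
    for _ in range(steps):
        # Central finite differences stand in for backprop (assumption for brevity).
        grad = [[0.0] * n for _ in range(slots)]
        for s in range(slots):
            for i in range(n):
                old = mix[s][i]
                mix[s][i] = old + eps
                up = recon_loss(mix, keys, values, queries, targets)
                mix[s][i] = old - eps
                down = recon_loss(mix, keys, values, queries, targets)
                mix[s][i] = old
                grad[s][i] = (up - down) / (2 * eps)
        # Backtracking: halve the step until the reconstruction loss goes down.
        step = lr
        while step > 1e-8:
            trial = [[w - step * g for w, g in zip(ws, gs)]
                     for ws, gs in zip(mix, grad)]
            trial_loss = recon_loss(trial, keys, values, queries, targets)
            if trial_loss < loss:
                mix, loss = trial, trial_loss
                break
            step /= 2
        history.append(loss)
    return mix, history

random.seed(0)
dim, prefix_len, n_queries = 4, 8, 16
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(prefix_len)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(prefix_len)]
# Stand-ins for the queries issued by tokens that come after the prefix.
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]

mix, history = train(keys, values, queries)
print(f"reconstruction loss before: {history[0]:.4f}, after: {history[-1]:.4f}")
```

Here 8 cached positions are squeezed into 2 slots, and the only training signal is how well attention over the 2 slots matches attention over the original 8 on later queries. A real compactor would presumably be trained across many trajectories, layers and heads.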
If you run an agent over extremely long context and run this compaction recursively, the activations of this compressed KV cache become like trained weights, similar to a task-specific LoRA or "cartridges".
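A rough sketch of what "recursively" could look like, assuming the agent's trajectory arrives in chunks and the cache is re-compacted to a fixed number of slots after each one. The mean-pooling `mean_pool_compactor` is a hypothetical placeholder for a trained compactor, and flat vectors stand in for per-layer keys and values.

```python
def mean_pool_compactor(rows, slots):
    """Placeholder compactor: average contiguous groups of rows into `slots` rows."""
    n = len(rows)
    if n <= slots:
        return [list(r) for r in rows]
    groups = [[] for _ in range(slots)]
    for i, row in enumerate(rows):
        groups[i * slots // n].append(row)
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]

def rollout(chunks, slots, compactor=mean_pool_compactor):
    """Fold each new chunk into a fixed-size state by re-compacting every time."""
    state, sizes = [], []
    for chunk in chunks:
        # The previous compact state is compacted again along with the new chunk.
        state = compactor(state + chunk, slots)
        sizes.append(len(state))
    return state, sizes

# Five chunks of six 2-dimensional cache entries each (made-up numbers).
chunks = [[[float(t), float(step)] for t in range(6)] for step in range(5)]
state, sizes = rollout(chunks, slots=4)
print(sizes)
```

The state never grows past `slots` rows however many chunks come in, which is what makes it look more like a fixed set of weights being updated than a growing context.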
Would not be surprised if OpenAI is running a similar algorithm for their black-box compaction. Clear benefits here if you can avoid busting the cache, which compacting by writing to a text file, for example, would do.
Seeing many emerging approaches to fixed-size latent states for LLMs that seem promising. If building a task-specific KV cache compression ends up being more sample-efficient than running backprop, getting this to "work" feels like one of the 2-3 remaining breakthroughs on the path to true AGI.
