Rohan should have titled this “A trip down memory lane”, but he can be forgiven.
Lots of folks are conflating continual learning with RL optimization; the latter may be necessary but is likely insufficient.
Here he reviews the different veins of historical and current research that can together enable continually learning and capable agents.
Memory-Augmented Neural Networks aren’t new, but their modern counterparts open new axes for scaling agents. I wrote a post on how architectures like Memory Networks and Neural Turing Machines pave the way for making retrieval and continual learning intrinsic to the model itself.
The silent revolution has already begun. In January, DeepSeek released Engram, a sparse memory module that looks up knowledge within the forward pass. It split where memory is stored from where reasoning happens, analogous to how MoE made transformers capable of conditional computation. Sequential compute is no longer wasted storing facts.
Retrieval has come a long way from single-vector RAG. We use multi-vector embeddings and post-train models to search over filesystems and vector databases, though latency is high and tool-call tokens fill up context. Context compaction methods such as Cartridges reduce this cost, but they also shift the burden to teaching a model when cartridges should be used or updated. Still lots to be done here!
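To make "multi-vector" concrete, here is a minimal sketch of late-interaction scoring, where each query vector is matched against its best document vector and the matches are summed. The thread doesn't name a scoring rule, so MaxSim (the ColBERT-style rule) is an assumption here, and `maxsim_score` and `rank` are hypothetical helper names.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take its best
    match among the document vectors, then sum over query vectors.
    (Assumed scoring rule; the thread does not specify one.)"""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs, docs):
    """Return document ids sorted from highest to lowest MaxSim score.
    `docs` maps a document id to its list of token-level vectors."""
    return sorted(docs, key=lambda doc_id: maxsim_score(query_vecs, docs[doc_id]),
                  reverse=True)


query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "covers_both": [[1.0, 0.0], [0.0, 1.0]],
    "covers_one": [[1.0, 0.0]],
}
ranking = rank(query, docs)
```

A single-vector retriever would have to squash both query aspects into one embedding; here a document only scores well if it covers each of them.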
On a complementary axis, there’s a rich lineage where retrieval is trained into the model mid-layer. Memory didn't always live in context alone! @jaseweston et al.’s Memory Networks (2014) made memory an explicit addressable matrix the model queries mid-forward-pass. RETRO by @borgeaud_s et al. scaled it to transformers in 2021. @GuillaumeLample et al.’s Product Key Memory made sparse KV retrieval over a parameterized memory layer more efficient, and recently this was scaled up in Memory Layers by Berges et al. in 2024.
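A toy sketch of what "querying an addressable memory mid-forward-pass" means: a soft attention read over key/value slots, plus a top-k variant that touches only a few slots. This is illustrative only. It is not the code of any of the papers above, and the top-k version skips the product-key factorization that makes the search cheap in Product Key Memory. All function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_memory_read(query, keys, values):
    """Dense read: attend over every slot, return a weighted sum of values."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, read


def sparse_memory_read(query, keys, values, k):
    """Sparse read: keep only the k best-scoring slots, then softmax over them.
    Scores are still computed against all keys here; product keys avoid that."""
    scores = [dot(query, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    dim = len(values[0])
    read = [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return top, read


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
weights, dense = soft_memory_read([5.0, 0.0], keys, values)
slots, sparse = sparse_memory_read([5.0, 0.0], keys, values, k=1)
```

The dense read mixes in a little of every slot; the sparse read returns exactly the best-matching value, which is what lets the memory grow without every lookup paying for all of it.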
Beyond retrieval, as agent tasks stretch from weeks to months, we’ll increasingly want to write information to the model within an episode. Context engineering can store token-level working information in filesystems, but always-on test-time training enables continuous weight updates, as in Learning to Discover at Test Time by @mertyuksekgonul.
Over long horizons, continual learning faces two main problems: how to distill the right signal to learn from, and how to integrate information without catastrophic forgetting. Most existing work still uses unsupervised learning objectives or updates only shallow subsets of parameters, shifting the problem to choosing which LoRA to swap in and out. There’s even more work to be done here!
On a second complementary axis, what if we could write relevant information to network parameters during a forward pass without catastrophic forgetting? Neural Turing Machines (Graves et al., 2014) made reads AND writes to external memory differentiable, Differentiable Neural Computers ensured only unused slots were updated, and @santoroAI et al. showed in 2016 you could meta-learn a generalizable write policy across many episodes. Imagine a model with an intrinsic scratchpad for planning over long horizons!
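For intuition, here is a minimal sketch of the Neural Turing Machine read and write rules: content-based addressing gives soft weights over slots, a read is a weighted sum of rows, and a write erases then adds in proportion to those weights. Plain Python lists stand in for tensors, and the controller that would produce the key, erase, and add vectors is left out. Function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity with a small epsilon to tolerate zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb + 1e-8)


def content_address(memory, key, beta):
    """Soft weights over slots: softmax of beta * cosine(slot, key)."""
    scores = [beta * cosine(row, key) for row in memory]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ntm_read(memory, w):
    """Read vector: weighted sum of memory rows."""
    dim = len(memory[0])
    return [sum(wi * row[d] for wi, row in zip(w, memory)) for d in range(dim)]


def ntm_write(memory, w, erase, add):
    """Erase-then-add write: M[i] <- M[i] * (1 - w[i] * erase) + w[i] * add.
    Every step is smooth in w, erase, and add, so gradients can flow through."""
    return [
        [m * (1.0 - wi * e) + wi * a for m, e, a in zip(row, erase, add)]
        for wi, row in zip(w, memory)
    ]


memory = [[1.0, 2.0], [3.0, 4.0]]
# A hard one-hot weighting for clarity; a trained model would use soft weights.
updated = ntm_write(memory, w=[1.0, 0.0], erase=[1.0, 1.0], add=[5.0, 6.0])
readback = ntm_read(updated, [1.0, 0.0])
soft_w = content_address(updated, key=[5.0, 6.0], beta=10.0)
```

With a one-hot weighting the write overwrites slot 0 and leaves slot 1 untouched, which is the behavior a learned write policy would have to approximate to avoid clobbering useful memories.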





