Fast-Slow Training outperforms RL baselines on math and code tasks
Fast-Slow Training (FST) interleaves reinforcement learning updates to LLM parameters (slow weights) with GEPA prompt optimization treated as fast weights. Outlined in a blog post on gepa-ai.github.io and discussed by Rishabh Agarwal and Joey Gonzalez, the method surpassed pure RL baselines on math, code, and reasoning tasks. It reached higher performance ceilings with fewer samples (up to 32k), retained greater plasticity after initialization, and reduced catastrophic forgetting compared with RL-only and GEPA-only variants. Rohan Anil asked how close standalone GEPA optimization on an RL-finetuned checkpoint comes to the full FST results.
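For readers skimming the thread, here is a minimal Python sketch of the loop the summary describes: a GEPA-style prompt search as the fast step and an RL parameter update as the slow step. The helper names (`gepa_optimize_prompt`, `rl_update`) and the alternation schedule are illustrative assumptions, not the blog post's actual training code or GEPA's API.

```python
# Minimal fast-slow training sketch (illustrative only, not the blog's code).
# Fast weights: the prompt, optimized with a GEPA-style search over frozen parameters.
# Slow weights: the model parameters, updated with an RL step under the current prompt.

def gepa_optimize_prompt(policy, prompt, batch):
    """Fast step: propose prompt edits, score rollouts on `batch`, keep the best."""
    return prompt  # placeholder for a real GEPA-style prompt optimizer

def rl_update(policy, prompt, batch):
    """Slow step: one RL update (e.g. a policy-gradient step) on rollouts under `prompt`."""
    return policy  # placeholder for a real RL trainer

def fast_slow_train(policy, prompt, data_loader, rounds=10):
    for _, batch in zip(range(rounds), data_loader):
        # Fast phase: adapt the prompt to the current policy.
        prompt = gepa_optimize_prompt(policy, prompt, batch)
        # Slow phase: update parameters while conditioning on the optimized prompt.
        policy = rl_update(policy, prompt, batch)
    return policy, prompt
```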
@agarwl_ A question I had seeing the graphs: if you run GEPA on the final RL'ed ckpt, how far from the combined strategy would its performance be?
Training LLMs is synonymous with updating their weights. However, LLMs can also learn in-context using *frozen* weights. There is no good reason for restricting learning to being in-context or in-weights.
So a natural idea is "Learning, Fast and Slow" (FST). In FST, slow learning trains the LLM weights with RL, while fast learning optimizes the context/prompt (fast weights) with GEPA.
Compared to RL, FST performs better while being more data-efficient, more adaptable (plasticity), and less prone to forgetting (it stays closer to the base model).
I think this idea of learning both fast-slow weights would be a good foundation for continual learning.
PS: Geoff Hinton (the OG) described the idea of fast weights and slow weights several years ago, and back then I remember thinking it's a very cool idea.
See more details here: https://gepa-ai.github.io/gepa/blog/2026/05/11/learning-fast-and-slow/

@agarwl_ Actually pretty sure Schmidhuber coined fast weights in 1991 or so, where your slow weights create fast weights 😅
@BlackHC For me, the inspiration was mostly two timescales of learning and the fact that not everything has to go to network weights. Rest is vibes.
Nice! Thanks for sharing! I enjoy being schooled 🤓 Very interesting, but also a very different concept: it doesn't create fast weights from slow weights like Schmidhuber's prior art (which is equivalent to linear transformers), so it doesn't map to your concept of fast and slow learning via in-context and parameter updates?
@BlackHC Choosing violence today: Hinton and Plaut, 1987. https://www.cs.toronto.edu/~fritz/absps/fastweights.pdf
@agarwl_ Btw what are your thoughts on