
Fast-Slow Training outperforms RL baselines on math and code tasks


Fast-Slow Training (FST) interleaves reinforcement learning updates to an LLM's parameters ("slow" weights) with GEPA prompt optimization, treating the optimized context as "fast" weights. Outlined in a blog post on gepa-ai.github.io and discussed by Rishabh Agarwal and Joey Gonzalez, the method surpassed pure-RL baselines on math, code, and reasoning tasks. It reached a higher performance ceiling with fewer samples at budgets up to 32k, retained greater plasticity after initialization, and reduced catastrophic forgetting relative to RL-only and GEPA-only variants. Rohan Anil asked how close standalone GEPA optimization on the final RL-finetuned checkpoint would come to full FST results.
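The interleaving described above can be sketched as a toy loop, assuming nothing beyond the thread's high-level description: a "slow" step folds knowledge into the weights (a stand-in for an RL update) and a "fast" step edits the prompt (a stand-in for GEPA's prompt optimization). Every function name and the toy reward below are hypothetical illustrations, not the actual GEPA API.

```python
# Toy sketch of fast-slow training, assuming only the thread's high-level
# description. "Tasks" are atoms; a task counts as solved once it is stored
# either in the weights (slow) or in the prompt (fast).

def evaluate(weights, prompt, tasks):
    """Fraction of tasks covered by weights + prompt (a stand-in reward)."""
    return sum(1 for t in tasks if t in weights or t in prompt) / len(tasks)

def rl_update(weights, prompt, tasks):
    """Slow learning: fold ONE unsolved task into the weights per round."""
    for t in tasks:
        if t not in weights and t not in prompt:
            return weights | {t}
    return weights

def optimize_prompt(weights, prompt, tasks, budget=2):
    """Fast learning: a GEPA-like prompt edit covering tasks the weights miss."""
    missing = [t for t in tasks if t not in weights and t not in prompt]
    return prompt | set(missing[:budget])

def fast_slow_train(tasks, rounds=5):
    weights, prompt = set(), set()
    for _ in range(rounds):
        weights = rl_update(weights, prompt, tasks)       # slow weights
        prompt = optimize_prompt(weights, prompt, tasks)  # fast context
    return evaluate(weights, prompt, tasks)

def rl_only_train(tasks, rounds=5):
    weights = set()
    for _ in range(rounds):
        weights = rl_update(weights, set(), tasks)
    return evaluate(weights, set(), tasks)

print(fast_slow_train(list(range(10))))  # 1.0: all 10 tasks covered
print(rl_only_train(list(range(10))))    # 0.5: RL alone covers 5 of 10
```

In this toy model the fast step covers several tasks per round while the slow step covers one, loosely mirroring the sample-efficiency gap the post reports; the real method optimizes prompts with GEPA's reflective search rather than set unions.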

Original post

Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context.

FST vs. RL:
• 3x more sample-efficient
• Higher performance ceiling
• Less KL drift (better plasticity)
• Continual learning: succeeds where RL stalls

8:37 AM · May 13, 2026

@agarwl_ A question I had seeing the graphs: if you have run GEPA on the final RL'ed ckpt, how far from the combined strategy would its performance be?

Rishabh Agarwal @agarwl_

Training LLMs is synonymous with updating their weights. However, LLMs can also learn in-context using *frozen* weights. There is no good reason for restricting learning to being in-context or in-weights.

So a natural idea is "Learning, Fast and Slow" (FST). In FST, slow learning is LLM weights trained with RL while fast learning is context / prompt (fast weights) optimized with GEPA.

Compared to RL, FST performs better while being more data efficient, adaptable (plasticity), and forgetting less (stays closer to base models).

I think this idea of learning both fast-slow weights would be a good foundation for continual learning.

PS: Geoff Hinton (the OG) described the idea of fast weights and slow weights several years ago, and back then I remember thinking it's a very cool idea.

See more details here: https://gepa-ai.github.io/gepa/blog/2026/05/11/learning-fast-and-slow/

12:23 AM · May 15, 2026 · 60.2K Views
1:28 AM · May 15, 2026 · 1.7K Views


@BlackHC Choosing violence today: Hinton and Plaut, 1987. https://www.cs.toronto.edu/~fritz/absps/fastweights.pdf
Andreas Kirsch 🇺🇦 @BlackHC

@agarwl_ Actually pretty sure Schmidhuber coined fast weights in 1991 or so where you slow weights create fast weights 😅

7:02 AM · May 15, 2026 · 1.5K Views
12:41 PM · May 15, 2026 · 2.1K Views

@BlackHC For me, the inspiration was mostly two timescales of learning and the fact that not everything has to go to network weights. Rest is vibes.

Andreas Kirsch 🇺🇦 @BlackHC

Nice! Thanks for sharing! I enjoy being schooled 🤓 Very interesting but also a very different concept and it doesn't create fast weights from slow weights like Schmidhuber's prior art that is equivalent to linear transformers, so it doesn't map to your concept of fast and slow learning via in-context and parameter updates?

2:05 PM · May 15, 2026 · 248 Views
2:35 PM · May 15, 2026 · 216 Views
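The "equivalent to linear transformers" remark in the exchange above refers to the result that Schmidhuber-style fast weights, written by outer products of keys and values and read with a query, compute exactly unnormalized linear attention (Schlag, Irie & Schmidhuber, 2021). A minimal NumPy sketch of that equivalence, with illustrative shapes and names:

```python
# Fast weights vs. linear attention: the slow net emits keys K and values V;
# accumulating outer products v_t k_t^T into a fast weight matrix F and then
# reading F with a query q gives the same result as unnormalized linear
# attention computed directly. Dimensions here are arbitrary toy choices.
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6                      # feature dim, sequence length
K = rng.normal(size=(T, d))      # keys produced by the slow net
V = rng.normal(size=(T, d))      # values produced by the slow net
q = rng.normal(size=d)           # query at the final step

# Fast-weight view: write outer products into F, then read with q.
F = np.zeros((d, d))
for k, v in zip(K, V):
    F += np.outer(v, k)          # fast weights programmed by the slow net
out_fast = F @ q

# Linear-attention view: sum_t (q . k_t) v_t, computed directly.
out_attn = (K @ q) @ V

assert np.allclose(out_fast, out_attn)
```

This is a different mechanism from FST's in-context "fast weights" (an optimized prompt), which is exactly the distinction Kirsch raises in the thread.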
