Tech · 44d ago

Learning Fast and Slow triples LLM sample efficiency

Learning Fast and Slow introduces Fast-Slow Training (FST), which trains large language models by pairing slow reinforcement-learning weight updates with fast optimization of the prompt and context, using mechanisms such as GEPA for rapid in-context adaptation. Its authors report three times the sample efficiency of standard reinforcement learning, a higher performance ceiling, reduced KL drift, and better resistance to catastrophic forgetting while preserving plasticity for later tasks. In a reply, Eero Simoncelli points to gain-tuning, from the paper Adaptive Denoising via GainTuning, as a related idea.

Original post

Patrick Shafto#1621

Eero Simoncelli@EeroSimoncelli

Nice! A related idea is “gain-tuning”, where channels are rescaled to rapidly (and reversibly) improve performance - https://www.cns.nyu.edu/~lcv/pubs/makeAbs.php?loc=Mohan21b
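A minimal sketch of the rescaling idea described in the reply above, assuming one scalar gain per channel applied on top of frozen weights. The activations, the gain values, and the helper `apply_gains` are hypothetical illustrations; how the gains are fitted is not shown and nothing here is taken from the linked paper's code.

```python
def apply_gains(channels, gains):
    """Rescale each channel by its own scalar gain; the weights that
    produced the channels are left untouched."""
    return [[g * x for x in channel] for channel, g in zip(channels, gains)]


# Hypothetical activations: one inner list per channel.
channels = [[1.0, 2.0], [3.0, 4.0]]

identity_gains = [1.0, 1.0]   # no adaptation
tuned_gains = [0.5, 2.0]      # hypothetical adapted gains

adapted = apply_gains(channels, tuned_gains)
restored = apply_gains(channels, identity_gains)  # resetting gains undoes the change
```

Because only the gains change, setting them back to 1 recovers the original outputs, which is the sense in which such an adaptation is reversible.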

Kusha Sareen@KushaSareen

Can LLMs adapt continually without losing base skills?

Fast-Slow Training (FST) pairs "slow" weights with "fast" context.

FST vs. RL:
• 3x more sample-efficient
• Higher performance ceiling
• Less KL drift (better plasticity)
• Continual learning: succeeds where RL stalls
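To make the "slow weights, fast context" pairing concrete, here is a toy two-timescale loop. It is a sketch under loud assumptions, not the paper's algorithm: the "weights" are a single float `theta`, the "context" is picked from a hypothetical list of numeric candidates, and the quadratic `reward` is invented for illustration. It does not reproduce or validate the figures in the post.

```python
def reward(theta, context, target):
    """Toy reward (assumption): highest when theta + context equals the target."""
    return -(theta + context - target) ** 2


def fast_step(theta, candidates, target):
    """Fast update: choose the best context while the weights stay fixed."""
    return max(candidates, key=lambda c: reward(theta, c, target))


def slow_step(theta, context, target, lr=0.05):
    """Slow update: one small gradient-ascent step on the weights."""
    grad = -2.0 * (theta + context - target)  # d(reward)/d(theta)
    return theta + lr * grad


def train(target, steps, candidates, use_fast):
    theta, context = 0.0, 0.0
    for _ in range(steps):
        if use_fast:
            context = fast_step(theta, candidates, target)
        theta = slow_step(theta, context, target)
    return theta, context


CANDIDATES = [0.0, 0.5, 1.0, 1.5, 2.0]  # hypothetical context options
TARGET = 2.3

theta_fs, ctx_fs = train(TARGET, steps=20, candidates=CANDIDATES, use_fast=True)
theta_slow, ctx_slow = train(TARGET, steps=20, candidates=CANDIDATES, use_fast=False)

print("fast+slow:", round(theta_fs, 3), ctx_fs)
print("slow only:", round(theta_slow, 3), ctx_slow)
```

In this toy, the context absorbs most of the task-specific offset, so `theta` moves far less from its starting point than in the weights-only run. That is only an analogy for the "less drift" claim, not evidence for it.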

12:47 PM · May 16, 2026 · 6.3K Views

Sentiment

Many users praised the FST fast-slow training approach for LLMs because it combines prompt optimization with RL to enable continual adaptation without losing base skills.

Pos: 100.0% · Neg: 0.0%

5 comments with sentiment.

Via NYU.EDU

Posts from X

Most Activity

Views 41.2K · Bookmarks 709 · Likes 784 · Retweets 91 · Replies 20

Dan McAteer@daniel_mac8

babe, wake up.

new continual learning breakthrough just dropped.

fast-slow training (fst) treats model params as "slow" weights and optimized context as "fast weights".

"across math, code, and general reasoning benchmarks, fst beats weights-only training on *every* axis we measured."

44d · 41.2K views · 784 likes · 709 bookmarks

Rishabh Agarwal@agarwl_

And see the tweet thread from @KushaSareen here

46d5.4K2516

Rishabh Agarwal@agarwl_

Intuitively, RL followed by GEPA won't have the same benefits as FST because both the task-specific and general purpose reasoning would have to be absorbed by weights.

Deep in the appendix, there's this comparison (last 3 columns):

RL followed by GEPA vs FST (+4%) vs FST followed by GEPA

@LakshyAAAgrawal to add anything more

46d1.2K1916

Rishabh Agarwal@agarwl_

@BlackHC Choosing violence today:

Hinton and Plaut, 1987.

https://www.cs.toronto.edu/~fritz/absps/fastweights.pdf

45d2.3K406

Rishabh Tiwari@rish2k1

Thanks for sharing. I agree with the motivations and ideas you mentioned. For better understanding, it can be seen as an FST instantiation where the *slow weights* update rule is self-distillation and the *fast weights* update rule is GEPA.

We did try one experiment in the same spirit, in which we distilled the FST fast weights (a GEPA-style prompt) back into the model using on-policy reverse KL (similar to the SDFT paper). It leads to some learning but performs worse than FST w/ GEPA+RL (@LakshyAAAgrawal explained this result in more detail here: https://x.com/LakshyAAAgrawal/status/2055164968748859403).

The idea of combining the RLVR signal with a self-distillation signal is also very interesting. We tried that as well some time back in a related project, and we are planning to release it soon.

45d754106

Rishabh Agarwal@agarwl_

@_AndrewZhao My take here is that naive self-distill OPD leaves a lot on the table, so it might be about figuring out some details to get OPD / distillation working.

45d1.6K125

Andreas Kirsch 🇺🇦@BlackHC

@agarwl_ Actually pretty sure Schmidhuber coined fast weights in 1991 or so, where slow weights create fast weights 😅

46d1.6K214

rohan anil@_arohan_

@agarwl_ A question I had on seeing the graphs: if you have run GEPA on the final RL'ed checkpoint and seen its performance, how far from the combined strategy would it be?

46d1.9K191

Rishabh Agarwal@agarwl_

@DibblaX Yes, that's where this is headed. Scaffolds can absorb some of the task specific learning while weights can keep the generally useful behavior.

46d1.2K131

Lakshya A Agrawal@LakshyAAAgrawal

Figure 14 (in appendix) provides a breakdown of various ablations and comparative experiments.

First, we see that the FST-trained model, even without the co-optimized prompt (“Slow only (FST w/o prompt)”, green), gains +4.2pp over the RL-only model.

Next, FST with the co-optimized prompt (“Slow + Fast (FST)”, green) outperforms the RL-only model with a GEPA-optimized prompt (“Slow + Fast (RL + GEPA)”) by +3.8pp. The margin increases further if we run GEPA again on the FST-optimized model instead of just taking the co-optimized prompt.

Further, we see this on other datasets as well:
⦁ On HoVer: FST 30.8% vs RL+GEPA 24.9% (+5.9pp), even using 3x fewer data points
⦁ In easy-to-hard generalization, training on Polaris and testing on HMMT25: FST 39.7% vs RL+GEPA 37.1% (+2.6pp), using 1.4x fewer data points. This is with the co-optimized prompt. Running GEPA on top of the FST model achieves 41.4% (+4.3pp).

46d593101

ar0cket1@ar0cket1

i would assume that mixing it through RL lets you have better exploration and lets the RL focus on broader general reasoning rather than specific narrow failures that make RL more noisy (thus needing more steps). This also likely results in the lower observed KL shift through RL.

I doubt the same thing happens if you apply GEPA sequentially. (It would probably be limited to the GEPA-only gains seen in the initial GEPA paper: https://arxiv.org/abs/2507.19457)

46d46531

Andrew Zhao@_AndrewZhao

@agarwl_ how about self-opd using gepa as privileged info

46d1.1K8

Rishabh Agarwal@agarwl_

@BlackHC For me, the inspiration was mostly two timescales of learning and the fact that not everything has to go to network weights. Rest is vibes.

45d23251

Yinggan Xu@DibblaX

@agarwl_ Sounds like the co-optimization of scaffolds and model weights at the same time 👀

46d1K8

Andreas Kirsch 🇺🇦@BlackHC

@agarwl_ Btw what are your thoughts on

45d76731

rohan anil@_arohan_

@LakshyAAAgrawal @agarwl_ Very cool! Thanks, Nice work!

46d3284

ar0cket1@ar0cket1

@agarwl_ the forgetting gains are quite surprising, wouldn’t have thought it would stay closer to base support

46d6763

Rishabh Agarwal@agarwl_

@heyyalexwang

45d8966

Lakshya A Agrawal@LakshyAAAgrawal

@rish2k1 @BlackHC @agarwl_

45d8331

Lakshya A Agrawal@LakshyAAAgrawal

We tested this. We call it FST-distill in the paper (Appendix H, Figure 13). The GEPA-evolved prompt over the same model with frozen weights forms the teacher, while the student sees just the problem and is trained via on-policy reverse-KL distillation against the teacher.
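The distillation objective named above can be sketched for a single next-token distribution. This is an illustration under assumptions, not the FST-distill implementation: the three-token vocabulary and the logits are made up, and a real on-policy setup would evaluate the term on tokens sampled from the student rather than on a fixed toy distribution.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher): the expectation is taken under the student,
    which is what makes this the 'reverse' direction."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )


# Hypothetical logits over a 3-token vocabulary.
teacher = softmax([2.0, 0.5, -1.0])  # frozen model given evolved prompt + problem
student = softmax([1.0, 1.0, 0.0])   # trainable model given the problem only

loss = reverse_kl(student, teacher)
print(f"reverse KL = {loss:.4f}")
```

Minimizing this quantity pulls the prompt-free student toward the prompted teacher's behavior, so the teacher's quality bounds what the student can learn, as the post goes on to argue.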

On HoVer:
⦁ GEPA-only: 24.8%
⦁ FST-distill: 31%
⦁ RL alone: 45%
⦁ FST: 48%

FST-distill rises 6.2pp above GEPA alone but plateaus 14pp below plain RL and 17pp below FST.
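The percentage-point gaps quoted above follow directly from the four HoVer scores in the post. A short check, with the dictionary layout and the helper `gap_pp` as illustrative choices:

```python
# HoVer accuracies (%) as quoted in the post.
scores = {
    "GEPA-only": 24.8,
    "FST-distill": 31.0,
    "RL alone": 45.0,
    "FST": 48.0,
}


def gap_pp(higher, lower):
    """Difference in percentage points, rounded to one decimal place."""
    return round(scores[higher] - scores[lower], 1)


above_gepa = gap_pp("FST-distill", "GEPA-only")
below_rl = gap_pp("RL alone", "FST-distill")
below_fst = gap_pp("FST", "FST-distill")

print(above_gepa, below_rl, below_fst)
```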

The teacher is upper-bounded by what GEPA can elicit from a frozen base, so the student's reward signal is also indirect and never climbs through the higher-capacity slow-weight channel directly. The fast channel alone is not enough to saturate the joint ceiling, and direct policy-gradient updates on θ against reward, run alongside the fast channel, account for the remaining gain.

In addition to higher rewards, FST maintains high actor entropy throughout training due to the Pareto-frontier prompt-population design.
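The post names a "Pareto-frontier prompt-population design" without spelling it out. As a rough sketch of what a Pareto frontier over prompts could look like, the filter below keeps every prompt that no other prompt beats on all tasks at once. The prompt names, the per-task scores, and both helper functions are hypothetical; this is a generic non-dominated filter, not the authors' or GEPA's code.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b on every task
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(population):
    """Return the prompts whose score vectors are not dominated by any other."""
    return {
        name: scores
        for name, scores in population.items()
        if not any(
            dominates(other_scores, scores)
            for other_name, other_scores in population.items()
            if other_name != name
        )
    }


# Hypothetical per-task scores for four candidate prompts.
population = {
    "prompt_a": (0.9, 0.2, 0.4),
    "prompt_b": (0.3, 0.8, 0.5),
    "prompt_c": (0.2, 0.1, 0.3),  # worse than prompt_a on every task
    "prompt_d": (0.5, 0.5, 0.9),
}

front = pareto_front(population)
print(sorted(front))
```

One plausible reading of the entropy remark is that keeping several complementary prompts, rather than a single best one, preserves diversity in what the policy is conditioned on. That interpretation is an assumption, not something the post states.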

Having said that, distillation is an exciting slow-weight update mechanism, and we'd be excited to see richer combinations of FST with distillation-based approaches.

46d2635