Tech · 33d ago

Pavel Izmailov and Andrew Gordon Wilson find LLMs can prevent catastrophic forgetting using self-generated replay data

The method kept task evaluation accuracies near 100%.

Original post

Andrew Gordon Wilson @andrewgwils · #148 in Tech

How much does a language model forget when finetuned on new tasks? We show both model size and optimization matter and forgetting can be nearly eliminated with self-generated replay! https://arxiv.org/abs/2605.26097 w/@mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov 1/8

7:47 AM · May 27, 2026 · 44.2K Views

Sentiment

Commenters praise the paper, which reports that self-generated replay nearly eliminates forgetting in finetuned LLMs, calling the results impressive and useful.

Positive: 100.0% · Negative: 0.0% (3 comments with sentiment)

See also Andrew's detailed thread:

Andrew Gordon Wilson@andrewgwils

Atula Tejaswi@atu_tej

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov https://arxiv.org/abs/2505.13811

@andrewgwils Interested to know the comparison between your BOS based self-generation approach and the above paper.

Pavel Izmailov@Pavel_Izmailov

Plot shows the loss values achievable on modeling English and Spanish. For a model trained with 20 tokens per param (Chinchilla), we can finetune on Spanish without forgetting English. But a model that's been heavily overtrained has to forget as it cannot cross the frontier.
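
The "20 tokens per param" rule of thumb mentioned above can be turned into quick arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the parameter and token counts are made-up values, not figures from the paper.

```python
# Rule of thumb quoted in the post: roughly 20 training tokens per parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20


def chinchilla_token_budget(n_params: int) -> int:
    """Token count at which a model with n_params sits at 20 tokens per parameter."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params


def overtraining_factor(n_params: int, n_tokens: int) -> float:
    """How many times past the 20 tokens/param budget a training run went."""
    return n_tokens / chinchilla_token_budget(n_params)


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
print(chinchilla_token_budget(2_000_000))           # 40000000
print(overtraining_factor(2_000_000, 400_000_000))  # 10.0
```

A factor near 1 corresponds to the "Chinchilla" model in the plot description, while a large factor corresponds to the heavily overtrained case that, per the post, has to forget in order to learn.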

Pavel Izmailov@Pavel_Izmailov

New paper: https://arxiv.org/abs/2605.26097

The main idea is that we can use an LLM to generate its own replay data to prevent forgetting, as long as we have spare capacity. Very overtrained models have to forget to learn new information.

Andrew Gordon Wilson@andrewgwils

Unfortunately, pretraining data is often unavailable! But since LLMs are generative models, we can use them to directly sample data. In this continual learning experiment with a 2M parameter language model, self-generated replay entirely eliminates forgetting. 3/8
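
To make "directly sample data" concrete, here is a minimal sketch of generating replay sequences from a generative model. It assumes a toy next-token table in place of a real LLM; `TOY_MODEL`, `sample_sequence`, and `generate_replay_data` are hypothetical names, and nothing here is taken from the paper's code.

```python
import random

BOS, EOS = "<bos>", "<eos>"

# Hypothetical toy "language model": next-token probabilities given the previous token.
TOY_MODEL = {
    BOS: {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, EOS: 0.3},
    "dog": {"sleeps": 0.7, EOS: 0.3},
    "sleeps": {EOS: 1.0},
}


def sample_sequence(model, rng, max_len=8):
    """Ancestral sampling: start from BOS and draw one token at a time."""
    tokens, prev = [], BOS
    for _ in range(max_len):
        choices = list(model[prev].keys())
        weights = list(model[prev].values())
        prev = rng.choices(choices, weights=weights, k=1)[0]
        if prev == EOS:
            break
        tokens.append(prev)
    return tokens


def generate_replay_data(model, n_sequences, seed=0):
    """Draw n_sequences samples from the model to use as replay data."""
    rng = random.Random(seed)
    return [sample_sequence(model, rng) for _ in range(n_sequences)]


replay = generate_replay_data(TOY_MODEL, n_sequences=5)
for seq in replay:
    print(" ".join(seq))
```

With a real model the table lookup would be replaced by the model's own sampling routine; the point is only that the replay set comes from the model itself rather than from the original pretraining corpus.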

Afshin Khadangi@AfshinK91

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Replay ⏭️ is a spicy ingredient. TRC2 also uses the same flavor

https://arxiv.org/abs/2602.22479

No try = no fail (Sean Stromsten)@StillTr05207382

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Nice. I remember kicking this idea around the Rumelhart lab 30+ years ago...and not trying it out.

Think_Different_@ThinkDi92468945

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Practical problem is that ‘open’ models rarely share training data (only weights).

Andrew Gordon Wilson@andrewgwils

When does forgetting still happen? When the model has no spare capacity. Small models trained to saturation cannot absorb new information without overwriting old information. 5/8

Andrew Gordon Wilson@andrewgwils

Learning rate matters too. Forgetting can be reduced by using a high pretraining learning rate, making it possible to release pretrained models that are less prone to downstream forgetting. A small finetuning learning rate also mitigates forgetting. 6/8

Ward Plunet@StartupYou

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov @threadreaderapp please #unroll

Andrew Gordon Wilson@andrewgwils

However, a small finetuning learning rate is expensive, increasing the optimizer steps required to reach a target loss. Using replay data in finetuning breaks this tradeoff, enabling the use of a high learning rate while minimizing forgetting! 7/8
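
One common way to use replay data in finetuning is to mix a fixed share of replay examples into every batch. The sketch below shows that mixing step only; `mixed_batches` and the `replay_fraction` knob are hypothetical, and the thread does not say what mixing scheme or ratio the authors used.

```python
import random


def mixed_batches(task_data, replay_data, batch_size, replay_fraction, seed=0):
    """Yield batches in which a fixed share of examples comes from replay data.

    replay_fraction is an assumed hyperparameter, not a value from the paper.
    """
    n_replay = round(batch_size * replay_fraction)
    n_task = batch_size - n_replay
    if n_task < 1:
        raise ValueError("each batch needs at least one new-task example")
    rng = random.Random(seed)
    task = list(task_data)
    rng.shuffle(task)
    # Walk through the new-task data once; replay examples are drawn with replacement.
    for start in range(0, len(task) - n_task + 1, n_task):
        batch = task[start:start + n_task]
        batch += rng.choices(replay_data, k=n_replay)
        rng.shuffle(batch)
        yield batch


# Hypothetical stand-ins for tokenized examples.
task_examples = [("task", i) for i in range(12)]
replay_examples = [("replay", i) for i in range(4)]
batches = list(mixed_batches(task_examples, replay_examples,
                             batch_size=4, replay_fraction=0.25))
```

Each batch here holds three new-task examples and one replay example, so every optimizer step sees some of the old distribution even when the learning rate is high.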

Nathan Roll@nrol_ling

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov An interesting way to cast the forgetting problem! Some finetuning tasks seek to 'forget' certain behaviors and not others, is this solved here by selectively sampling rollouts that should be preserved?

Andrew Gordon Wilson@andrewgwils

We can even generate replay data from an instruction-tuned LLM. For example, when finetuning Llama-3.2-1B, we can prompt the model with a BOS token (without a chat template) and generate pretraining-like data. With a KL penalty, this data significantly reduces forgetting. 4/8
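
A KL penalty of this kind can be sketched as an extra loss term that compares the original model's next-token distribution with the finetuned model's on the replay tokens. The code below is a toy version in plain Python: the direction KL(frozen || current), the weight `beta`, and the function names are all assumptions, since the tweet does not specify them.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regularized_loss(task_loss, frozen_logits, current_logits, beta):
    """task_loss + beta * mean per-token KL(frozen || current) on replay tokens.

    frozen_logits / current_logits: one list of vocabulary logits per replay token.
    beta is an assumed penalty weight.
    """
    kls = [
        kl_divergence(softmax(f), softmax(c))
        for f, c in zip(frozen_logits, current_logits)
    ]
    return task_loss + beta * sum(kls) / len(kls)


# Hypothetical logits for two replay tokens over a three-word vocabulary.
frozen = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
drifted = [[0.0, 2.0, -1.0], [0.5, 0.5, 0.0]]
print(regularized_loss(1.0, frozen, frozen, beta=0.1))   # no drift: penalty is zero
print(regularized_loss(1.0, frozen, drifted, beta=0.1))  # drift on the first token adds a penalty
```

The penalty is zero when the finetuned model still matches the original on the replay tokens and grows as its predictions move away from them.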

Bhaktavaschal Samal@bspectacledGOAT

core idea as per my reading is:

during finetuning, keep a frozen copy of the original model, then generate samples from that frozen model and while training on the new task, add a regularization loss so the updated model does not change its predictions too much on those self-generated samples!

Think_Different_@ThinkDi92468945

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Ah, so as a proxy you use synthetic/generated data. What about capacity? Have you looked at approaches that dynamically decide if a model needs to be expanded (in some way) when capacity is low?

Andrew Gordon Wilson@andrewgwils

Much more in the paper! As models are increasingly being adapted to new settings, it’s especially crucial to understand forgetting. This was an incredible effort with an amazing team led by @mrtnm. Code is available at: https://github.com/martin-marek/forgetting. 8/8

Eduardo C. Garrido-Merchán@edugarmer

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Wow, impressive and very useful. Thanks as always for your research.
