How much does a language model forget when finetuned on new tasks? We show both model size and optimization matter and forgetting can be nearly eliminated with self-generated replay! https://arxiv.org/abs/2605.26097 w/@mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov 1/8
Pavel Izmailov and Andrew Gordon Wilson find LLMs can prevent catastrophic forgetting using self-generated replay data
The method kept task evaluation accuracies near 100%.
Users praise the paper, which reports that self-generated replay nearly eliminates forgetting in finetuned LLMs, calling the results impressive and useful.
New paper: https://arxiv.org/abs/2605.26097
The main idea is that we can use an LLM to generate its own replay data to prevent forgetting, as long as we have spare capacity. Very overtrained models have to forget to learn new information.
We view forgetting as drift in the model's predictions on old data. So the fix is simple: use a KL penalty on past (pretraining) data to keep old outputs fixed while the model fits the new data. 2/8
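The objective described in 2/8 can be sketched as a task loss on the new data plus a weighted KL term on past data. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the `kl_weight` parameter, and the KL direction (frozen model to current model) are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under the logits."""
    return -math.log(softmax(logits)[target_index])

def finetune_loss(new_logits, new_target,
                  replay_logits, frozen_replay_logits, kl_weight=1.0):
    """Task loss on new data plus a KL penalty on past (replay) data.

    Assumption: the penalty is KL(frozen || current), so it is zero when
    the finetuned model's predictions on old data have not drifted.
    """
    task = cross_entropy(new_logits, new_target)
    drift = kl_divergence(softmax(frozen_replay_logits),
                          softmax(replay_logits))
    return task + kl_weight * drift

# No drift on the replay example: the loss reduces to the task loss.
print(finetune_loss([0.0, 0.0], 0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
# Predictions on the replay example have drifted: the loss is larger.
print(finetune_loss([0.0, 0.0], 0, [3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

In a real setup the two sets of logits would come from the finetuned model and a frozen copy of the original model evaluated on the same replay tokens.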
With amazing collaborators @mrtnm @dongkyucho @ShikaiQiu @rumichunara @andrewgwils
See also Andrew's detailed thread:
How much does a language model forget when finetuned on new tasks? We show both model size and optimization matter and forgetting can be nearly eliminated with self-generated replay! https://arxiv.org/abs/2605.26097 w/@mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov 1/8

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov https://arxiv.org/abs/2505.13811
@andrewgwils Interested to know the comparison between your BOS based self-generation approach and the above paper.
Plot shows the loss values achievable on modeling English and Spanish. For a model trained with 20 tokens per param (Chinchilla), we can finetune on Spanish without forgetting English. But a model that's been heavily overtrained has to forget as it cannot cross the frontier.
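For readers unfamiliar with the "20 tokens per param" rule of thumb, the arithmetic is sketched below. The model size and token counts are hypothetical numbers chosen for illustration, and `overtraining_factor` is a name invented here, not a quantity defined in the thread.

```python
def chinchilla_tokens(num_params, tokens_per_param=20):
    """Training tokens implied by a fixed tokens-per-parameter ratio."""
    return tokens_per_param * num_params

def overtraining_factor(num_tokens, num_params, tokens_per_param=20):
    """How many times more tokens were used than the ratio suggests."""
    return num_tokens / chinchilla_tokens(num_params, tokens_per_param)

# Hypothetical 1B-parameter model.
params = 1_000_000_000
print(chinchilla_tokens(params))                         # 20000000000
print(overtraining_factor(2_000_000_000_000, params))    # 100.0
```

Under this reading, a model with a factor near 1 sits at the Chinchilla ratio, while a factor far above 1 corresponds to the "heavily overtrained" regime in the plot.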

Unfortunately, pretraining data is often unavailable! But since LLMs are generative models, we can use them to directly sample data. In this continual learning experiment with a 2M parameter language model, self-generated replay entirely eliminates forgetting. 3/8
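The sampling idea in 3/8 can be illustrated with a toy example: start from a beginning-of-sequence token and repeatedly sample the next token from the model's own predictive distribution. The bigram table below is a hypothetical stand-in for a real language model, and every name in it is invented for this sketch.

```python
import random

BOS, EOS = "<bos>", "<eos>"

# Hypothetical toy "model": next-token probabilities given the previous token.
TOY_MODEL = {
    BOS: {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, EOS: 0.3},
    "dog": {"sleeps": 0.7, EOS: 0.3},
    "sleeps": {EOS: 1.0},
}

def sample_sequence(model, rng, max_len=16):
    """Sample one sequence from the model, starting from BOS."""
    tokens = [BOS]
    while len(tokens) < max_len:
        dist = model[tokens[-1]]
        nxt = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if nxt == EOS:
            break
        tokens.append(nxt)
    return tokens[1:]  # drop BOS from the returned sample

def generate_replay(model, num_samples, seed=0):
    """Build a small replay set by sampling from the model itself."""
    rng = random.Random(seed)
    return [sample_sequence(model, rng) for _ in range(num_samples)]

for seq in generate_replay(TOY_MODEL, 5):
    print(" ".join(seq))
```

The sampled sequences play the role of the unavailable pretraining data: the original model's predictions on them are what the finetuned model is asked to preserve.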

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Replay ⏭️ is a spicy ingredient. TRC2 also uses the same flavor
https://arxiv.org/abs/2602.22479

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Nice. I remember kicking this idea around the Rumelhart lab 30+ years ago...and not trying it out.

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Practical problem is that ‘open’ models rarely share training data (only weights).

When does forgetting still happen? When the model has no spare capacity. Small models trained to saturation cannot absorb new information without overwriting old information. 5/8

Learning rate matters too. Forgetting can be reduced by using a high pretraining learning rate, making it possible to release pretrained models that are less prone to downstream forgetting. A small finetuning learning rate also mitigates forgetting. 6/8

However, a small finetuning learning rate is expensive, increasing the optimizer steps required to reach a target loss. Using replay data in finetuning breaks this tradeoff, enabling the use of a high learning rate while minimizing forgetting! 7/8
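The cost side of this tradeoff shows up even in a toy setting. The sketch below runs plain gradient descent on a one-dimensional quadratic loss and counts the steps needed to reach a target loss. It is an assumed illustration only: it does not model forgetting or replay, and it is not one of the paper's experiments.

```python
def steps_to_target(lr, target_loss=1e-4, w0=1.0, max_steps=100_000):
    """Count gradient-descent steps on L(w) = 0.5 * w**2 until L <= target."""
    w = w0
    for step in range(max_steps):
        if 0.5 * w * w <= target_loss:
            return step
        w -= lr * w  # the gradient of 0.5 * w**2 is w
    return None  # target not reached within max_steps

for lr in (0.1, 0.01, 0.001):
    print(lr, steps_to_target(lr))
```

Each tenfold reduction in the learning rate increases the step count by roughly a factor of ten in this toy problem, which is the expense that replay is said to avoid.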

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov An interesting way to cast the forgetting problem! Some finetuning tasks seek to 'forget' certain behaviors and not others, is this solved here by selectively sampling rollouts that should be preserved?

We can even generate replay data from an instruction-tuned LLM. For example, when finetuning Llama-3.2-1B, we can prompt the model with a BOS token (without a chat template) and generate pretraining-like data. With a KL penalty, this data significantly reduces forgetting. 4/8

core idea as per my reading is:
During finetuning, keep a frozen copy of the original model and generate samples from it. While training on the new task, add a regularization loss so the updated model does not change its predictions too much on those self-generated samples.

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Ah, so as a proxy you use synthetic/generated data. What about capacity? Have you looked at approaches that dynamically decide whether a model needs to be expanded (in some way) when capacity is low?

Much more in the paper! As models are increasingly being adapted to new settings, it’s especially crucial to understand forgetting. This was an incredible effort with an amazing team led by @mrtnm. Code is available at: https://github.com/martin-marek/forgetting. 8/8

@andrewgwils @mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Wow, impressive and very useful. Thanks as always for your research.