Model Capacity Enables Retention of Rare Task Updates Amid Gradient Interference

Original post

Small models do update on rare tasks, but frequent-task gradients overwrite the signal before the next rare batch arrives. Capacity buys 'retention', and how much you need depends on the rest of the task/data mixture.
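The overwrite-versus-retain mechanism described above can be sketched with a toy simulation. This is not the paper's setup: the function name `simulate`, the two quadratic tasks, the targets, the learning rate, and the choice to model "capacity" as whether the rare task gets its own parameter are all assumptions made for illustration.

```python
def simulate(shared, steps=200, rare_every=20, lr=0.3):
    """Toy SGD on two quadratic tasks with conflicting targets.

    shared=True  -> both tasks read and write the same parameter (low capacity).
    shared=False -> the rare task has its own parameter (spare capacity).
    Returns two lists: the rare-task loss just before each rare update,
    and the rare-task loss just after it.
    """
    freq_target, rare_target = -1.0, 1.0   # hypothetical, deliberately conflicting
    w = [0.0, 0.0]                         # w[0]: frequent task, w[1]: spare slot
    i_freq = 0
    i_rare = 0 if shared else 1

    def rare_loss():
        return 0.5 * (w[i_rare] - rare_target) ** 2

    before, after = [], []
    for t in range(1, steps + 1):
        if t % rare_every == 0:
            # Rare batch: the model does take a gradient step on the rare task.
            before.append(rare_loss())
            w[i_rare] -= lr * (w[i_rare] - rare_target)
            after.append(rare_loss())
        else:
            # Frequent batch: in the shared case this pulls the same
            # parameter back toward the frequent task's target.
            w[i_freq] -= lr * (w[i_freq] - freq_target)
    return before, after


if __name__ == "__main__":
    for shared in (True, False):
        before, after = simulate(shared)
        label = "shared" if shared else "separate"
        print(label, "rare loss before last rare batch:", round(before[-1], 4))
```

In the shared case every rare update lowers the rare-task loss, but the frequent batches that follow push it back up before the next rare batch arrives. With a separate parameter the progress from each rare update is kept, so the loss keeps falling across rare batches.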

David Alvarez Melis@elmelis

Detailed threads by @ChrisGPotts @AndrewLampinen @EkdeepL already cover the highlights, so I'll just add the bit I found most interesting from a data-centric angle:

9:14 AM · Jun 5, 2026 · 155 Views

Posts from X

David Alvarez Melis@elmelis

Plenty of open questions remain about how to choose mixtures for a given scale, transfer to real pretraining, post-training as a separate axis, etc., but this paper lays solid foundations for thinking about all of these. Link: https://arxiv.org/abs/2605.29548

David Alvarez Melis@elmelis

It also suggests that the classic (learning theory) way to think about model capacity in isolation misses an important part of the story. We ought to think about capacity 𝘳𝘦𝘭𝘢𝘵𝘪𝘷𝘦 𝘵𝘰 task diversity in the training data.

David Alvarez Melis@elmelis

This (+ lots of recent work on fine-grained scaling laws and data mixtures, by us and others) confirms that scale and data composition aren't independent levers: what you can learn is a **joint** function of both.
