
Model Capacity Enables Retention of Rare Task Updates Amid Gradient Interference

Original post
David Alvarez Melis @elmelis · #1267 in AI

Small models do update on rare tasks, but frequent-task gradients overwrite the signal before the next rare batch arrives. Capacity buys 'retention', and how much you need depends on the rest of the task/data mixture.
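To make the overwrite-vs-retain intuition concrete, here is a toy sketch. It is not the paper's setup: every name and number (`rare_error_after_gap`, `n_frequent`, `gap`, the linear model, the random unit "keys") is an assumption made up for illustration. A linear model with `dim` parameters fits one rare task exactly in a single step, then takes a run of SGD steps on frequent tasks only. We then check how much of the rare fit survives.

```python
import math
import random


def unit_vector(dim, rng):
    """Random direction on the unit sphere in R^dim."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rare_error_after_gap(dim, n_frequent=32, gap=200, lr=1.0, seed=0):
    """Squared error on the rare task after `gap` frequent-task updates.

    Hypothetical toy: task t is a (unit key, +/-1 target) pair and the
    model predicts dot(w, key). Task 0 is the rare one.
    """
    rng = random.Random(seed)
    keys = [unit_vector(dim, rng) for _ in range(n_frequent + 1)]
    targets = [rng.choice([-1.0, 1.0]) for _ in range(n_frequent + 1)]
    w = [0.0] * dim

    def sgd_step(t):
        # Gradient step on 0.5 * (target - prediction)^2 for task t.
        err = targets[t] - dot(w, keys[t])
        for i in range(dim):
            w[i] += lr * err * keys[t][i]

    # The rare batch: with lr=1 and a unit-norm key, one step fits it exactly.
    sgd_step(0)
    # Then only frequent tasks are seen until the next rare batch would arrive.
    for _ in range(gap):
        sgd_step(rng.randrange(1, n_frequent + 1))
    return (targets[0] - dot(w, keys[0])) ** 2


def mean_rare_error(dim, n_seeds=20):
    """Average the rare-task error over several random task draws."""
    return sum(rare_error_after_gap(dim, seed=s) for s in range(n_seeds)) / n_seeds


if __name__ == "__main__":
    for dim in (4, 256):
        print(dim, mean_rare_error(dim))
```

In this toy, both models fit the rare task right after its update (error is zero at `gap=0`). The difference shows up later. With few parameters, the frequent tasks' keys overlap heavily with the rare one and their updates wash it out. With many parameters, random keys are nearly orthogonal and most of the rare fit survives. Changing `n_frequent` shifts how much `dim` you need, which is the "depends on the rest of the mixture" part.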

Detailed threads by @ChrisGPotts @AndrewLampinen @EkdeepL already cover the highlights, so I'll just add the bit I found most interesting from a data-centric angle:

9:14 AM · Jun 5, 2026 · 155 Views

Plenty of open questions remain about how to choose mixtures for a given scale, transfer to real pretraining, post-training as a separate axis, etc., but this paper lays solid foundations for thinking about all of these. Link: https://arxiv.org/abs/2605.29548

It also suggests that the classic (learning theory) way to think about model capacity in isolation misses an important part of the story. We ought to think about capacity 𝘳𝘦𝘭𝘢𝘵𝘪𝘷𝘦 𝘵𝘰 task diversity in the training data.


This (+ lots of recent work on fine-grained scaling laws and data mixtures, by us and others) confirms that scale and data composition aren't independent levers: what you can learn is a **joint** function of both.
