AI · 6h ago

Stanford's Christopher Potts and collaborators find model scaling reduces gradient interference, letting networks learn rarely observed tail tasks

Pretraining runs verified the dynamics up to 1B parameters.

Original post

We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.

7:58 AM · Jun 1, 2026 · 32.4K Views

Christopher Potts (@ChrisGPotts) · #222 in AI

Sentiment

Some users praised the paper, which attributes larger models' advantage to reduced interference and better retention, as progress in scaling research. Others dismissed the findings as an unoriginal retread of the data-bottleneck argument.

Positive: 66.7% · Negative: 33.3% (6 comments with sentiment)

Posts from X

Most Activity

Views: 6.1K · Bookmarks: 37

Goodfire (@GoodfireAI)

New research from Goodfire and collaborators: why do larger models learn more tasks?

(spoiler: it’s bottlenecked by data)

Christopher Potts (@ChrisGPotts)

4h ago

Likes: 74 · Retweets: 8

Ekdeep Singh Lubana (@EkdeepL)

Very excited to have this paper out! We show by having more parameters, larger models see reduced interference between updates. This allows them to retain memories of rarely observed samples of a task, eventually allowing them to learn even the tail-end of the distribution. (1/3)

Christopher Potts (@ChrisGPotts)

5h ago
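
The claim in the post above, that more parameters mean less interference between updates, can be given a rough intuition with a toy calculation. The sketch below is not the paper's analysis: it assumes two tasks produce independent random update directions (real gradients are not independent random vectors) and measures how much those directions overlap, using cosine similarity as a stand-in for interference. The widths 32 and 256 are borrowed from the thread purely as example sizes, and `mean_abs_cosine` is a hypothetical helper name.

```python
import math
import random


def mean_abs_cosine(width, trials=300, seed=0):
    """Average |cosine| between two independent random update directions.

    Assumption: each task's update is modeled as an i.i.d. Gaussian vector
    of length `width`. This is a toy stand-in, not a real gradient.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        g_a = [rng.gauss(0.0, 1.0) for _ in range(width)]
        g_b = [rng.gauss(0.0, 1.0) for _ in range(width)]
        dot = sum(x * y for x, y in zip(g_a, g_b))
        norm_a = math.sqrt(sum(x * x for x in g_a))
        norm_b = math.sqrt(sum(y * y for y in g_b))
        total += abs(dot / (norm_a * norm_b))
    return total / trials


# Compare a narrow and a wide toy "model".
overlap_small = mean_abs_cosine(32)
overlap_large = mean_abs_cosine(256)
print(f"width 32:  mean |cos| = {overlap_small:.3f}")
print(f"width 256: mean |cos| = {overlap_large:.3f}")
```

Under this assumption the average overlap shrinks roughly as one over the square root of the width, so an update for a frequent task is less likely to overwrite what a rare task's update wrote. Whether trained networks behave this way is what the paper itself sets out to test.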

Replies: 2

Christopher Potts (@ChrisGPotts)

We then test our hypothesis by pretraining OLMo-style models (4M to 4B), again by injecting examples for novel tasks at controlled frequencies during training. To ensure novelty, our tasks are comparison and modular addition defined with randomly chosen tokens.
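
The injection setup described in this post could look something like the following. This is a minimal sketch under stated assumptions, not the authors' pipeline: the three-token example format, the vocabulary size, the modulus, and the helper names `make_modular_addition_task` and `inject` are all hypothetical, and the thread does not say how examples are actually serialized or scheduled.

```python
import random


def make_modular_addition_task(vocab_size, modulus, rng):
    """Bind each residue 0..modulus-1 to a distinct, randomly chosen token id.

    Random token ids make the task novel: nothing in the ordinary corpus
    hints at which token stands for which residue.
    """
    residue_tokens = rng.sample(range(vocab_size), modulus)

    def sample_example():
        a = rng.randrange(modulus)
        b = rng.randrange(modulus)
        # Assumed format: [token(a), token(b), token((a + b) mod modulus)]
        return [residue_tokens[a], residue_tokens[b],
                residue_tokens[(a + b) % modulus]]

    return residue_tokens, sample_example


def inject(corpus, sample_example, frequency, rng):
    """Yield corpus items, inserting a task example before each item
    with probability `frequency`."""
    for item in corpus:
        if rng.random() < frequency:
            yield ("task", sample_example())
        yield ("corpus", item)


rng = random.Random(0)
residue_tokens, sample_example = make_modular_addition_task(
    vocab_size=50_000, modulus=7, rng=rng)
corpus = [f"doc-{i}" for i in range(10_000)]
stream = list(inject(corpus, sample_example, frequency=0.01, rng=rng))
task_examples = [payload for kind, payload in stream if kind == "task"]
print(len(task_examples), "task examples injected into", len(corpus), "documents")
```

Lowering `frequency` makes the task rarer in the stream, which is the knob the thread describes varying to see at what model size a tail task is still learned.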

Christopher Potts (@ChrisGPotts)

We also use a protocol from the memorization literature: we inject tasks at different frequencies and study the resulting learning profile. For a highly infrequent task, small models (here, N=32) fail to learn the task (norm signal is never high; solid lines). At the point of injection, the signal briefly increases (green line), but it decays rapidly (gray line). At N=256, there is no such interference and the task is successfully learned.

6h ago