Stanford's Christopher Potts and collaborators find model scaling reduces gradient interference, letting networks learn rarely observed tail tasks

Pretraining runs verified the dynamics up to 1B parameters.

Original post
Christopher Potts@ChrisGPotts

We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.

7:58 AM · Jun 1, 2026 · 32.4K Views
Posts from X
Goodfire@GoodfireAI

New research from Goodfire and collaborators: why do larger models learn more tasks?

(spoiler: it’s bottlenecked by data)

4h · Views 6.1K · Likes 65 · Bookmarks 37
Likes 74 · Retweets 8

Very excited to have this paper out! We show by having more parameters, larger models see reduced interference between updates. This allows them to retain memories of rarely observed samples of a task, eventually allowing them to learn even the tail-end of the distribution. (1/3)

5h · Views 4K · Likes 74 · Bookmarks 28
Replies 2
Christopher Potts@ChrisGPotts

We then test our hypothesis by pretraining OLMo-style models (4M to 4B), again by injecting examples for novel tasks at controlled frequencies during training. To ensure novelty, our tasks are comparison and modular addition defined with randomly chosen tokens.

Christopher Potts@ChrisGPotts

We also use a protocol from the memorization literature: we inject tasks at different frequencies and study the resulting learning profile. For a highly infrequent task, small models (here, N=32) fail to learn the task (norm signal is never high; solid lines). At the point of injection, the signal briefly increases (green line), but it decays rapidly (gray line). At N=256, there is no such interference and the task is successfully learned.

6h · Views 576 · Likes 10 · Bookmarks 1
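
To make the injection protocol described in the thread concrete, here is a minimal Python sketch of one way to inject a modular-addition task, spelled with randomly chosen tokens, into a training stream at a controlled frequency. It is an illustration only, not the authors' code: the function names, the modulus of 7, the vocabulary size, the three-token example format, and the random "background" sequences are all assumptions, and the comparison task is omitted.

```python
import random


def make_modular_addition_example(rng, task_vocab, modulus):
    """Return one [a, b, (a + b) mod modulus] example spelled with task tokens."""
    a = rng.randrange(modulus)
    b = rng.randrange(modulus)
    c = (a + b) % modulus
    return [task_vocab[a], task_vocab[b], task_vocab[c]]


def build_stream(num_steps, inject_prob, modulus=7, vocab_size=1000, seed=0):
    """Build a toy training stream with task examples injected at a set rate.

    All parameters are hypothetical; the thread does not give the real values.
    """
    rng = random.Random(seed)
    # Randomly chosen, distinct token ids stand in for the residues
    # 0..modulus-1, so the task cannot be solved from prior token meaning.
    task_vocab = rng.sample(range(vocab_size), modulus)
    stream = []
    for _ in range(num_steps):
        if rng.random() < inject_prob:
            example = make_modular_addition_example(rng, task_vocab, modulus)
            stream.append(("task", example))
        else:
            # Placeholder for an ordinary pretraining sequence (assumption).
            background = [rng.randrange(vocab_size) for _ in range(3)]
            stream.append(("background", background))
    return task_vocab, stream


if __name__ == "__main__":
    vocab, stream = build_stream(num_steps=10_000, inject_prob=0.01)
    n_task = sum(1 for kind, _ in stream if kind == "task")
    print(f"injected {n_task} task examples out of {len(stream)} steps")
```

Lowering inject_prob makes the task rarer in the stream, which is the knob the thread describes varying in order to compare learning profiles across model sizes.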