We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.
Stanford's Christopher Potts and collaborators find that model scaling reduces gradient interference, letting networks learn rarely observed tail tasks.
Pretraining runs verified the dynamics up to 1B parameters.
New research from Goodfire and collaborators: why do larger models learn more tasks?
(spoiler: it’s bottlenecked by data)
Very excited to have this paper out! We show that, by having more parameters, larger models see reduced interference between updates. This lets them retain memories of rarely observed samples of a task and eventually learn even the tail end of the distribution. (1/3)
We then test our hypothesis by pretraining OLMo-style models (4M to 4B), again by injecting examples for novel tasks at controlled frequencies during training. To ensure novelty, our tasks are comparison and modular addition defined with randomly chosen tokens.
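To make the injection setup concrete, here is a minimal sketch of how one might build a modular addition task over randomly chosen tokens and mix it into a background stream at a controlled frequency. This is not the paper's code: the function names, vocabulary size, modulus, sequence format, and the choice to replace background sequences at random are all assumptions made for illustration.

```python
import random

def make_modular_addition_task(vocab_size, modulus, rng):
    """Assign each residue 0..modulus-1 a distinct, randomly chosen token id.

    Hypothetical setup: random tokens make it unlikely that the task
    resembles anything already present in the background data.
    """
    tokens = rng.sample(range(vocab_size), modulus)

    def example():
        a, b = rng.randrange(modulus), rng.randrange(modulus)
        # Sequence layout [a, b, (a + b) mod m] is an illustrative choice.
        return [tokens[a], tokens[b], tokens[(a + b) % modulus]]

    return tokens, example

def inject(stream, example_fn, frequency, rng):
    """Swap each background sequence for a task example with probability `frequency`."""
    mixed = []
    for seq in stream:
        if rng.random() < frequency:
            mixed.append(("task", example_fn()))
        else:
            mixed.append(("background", seq))
    return mixed

rng = random.Random(0)
tokens, example = make_modular_addition_task(vocab_size=50_000, modulus=7, rng=rng)
background = [[rng.randrange(50_000) for _ in range(3)] for _ in range(10_000)]
mixed = inject(background, example, frequency=0.01, rng=rng)
n_task = sum(1 for kind, _ in mixed if kind == "task")
```

Lowering `frequency` makes the task rarer in the stream, which is the knob the thread describes varying; a comparison task could be built the same way by changing what `example` returns.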
We also use a protocol from the memorization literature: we inject tasks at different frequencies and study the resulting learning profile. For a highly infrequent task, small models (here, N=32) fail to learn the task (norm signal is never high; solid lines). At the point of injection, the signal briefly increases (green line), but it decays rapidly (gray line). At N=256, there is no such interference and the task is successfully learned.
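One rough way to build intuition for why a larger N could reduce interference is the following toy calculation. It is not the paper's analysis. It assumes that updates for different tasks behave like random directions in an N-dimensional space, and it uses the average absolute cosine between two such directions as a stand-in for "interference". Both the model and the helper names are hypothetical.

```python
import math
import random

def random_unit_vector(n, rng):
    """Draw a direction uniformly at random on the unit sphere in n dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_abs_overlap(n, num_pairs, rng):
    """Average |cosine| between pairs of independent random directions."""
    total = 0.0
    for _ in range(num_pairs):
        u = random_unit_vector(n, rng)
        w = random_unit_vector(n, rng)
        total += abs(sum(a * b for a, b in zip(u, w)))
    return total / num_pairs

rng = random.Random(0)
overlap_small = mean_abs_overlap(32, 2000, rng)   # N = 32
overlap_large = mean_abs_overlap(256, 2000, rng)  # N = 256
print(f"N=32: {overlap_small:.3f}  N=256: {overlap_large:.3f}")
```

Under this assumption the typical overlap shrinks roughly like 1/sqrt(N), so an update for a frequent task disturbs what was stored for a rare task less when N is larger. Real gradients are not random directions, so this only illustrates the trend.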