Jing Huang and Chris Potts find larger language models outperform smaller ones by reducing task interference and neuron competition

Smaller models suffer from severe task interference during training.

Original post

Tomek Korbak@tomekkorbak · #1513 in Tech

such a good twitter thread!

Christopher Potts@ChrisGPotts

We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.

10:28 AM · Jun 3, 2026 · 12.5K Views

Sentiment

Many users praised the new paper explaining why larger models outperform smaller ones through neuron competition, calling it a big step forward in understanding scaling and a cool synthesis of ideas.

Pos: 100.0% · Neg: 0.0% (9 comments with sentiment)

Posts from X

Christopher Potts@ChrisGPotts

Link to the paper: https://arxiv.org/abs/2605.29548

Christopher Potts@ChrisGPotts

The following animation conveys the intuition: when a 1-neuron model tries to learn two tasks, the frequent task updates suppress the infrequent task updates. The 2-neuron model can dedicate a neuron to the infrequent task once the frequent one is fully learned.
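A minimal sketch of that intuition, under assumptions that are not from the thread: a two-layer linear network with N hidden units, two tasks given as one-hot inputs with one-hot targets, full-batch gradient descent on the frequency-weighted loss (standing in for the expected SGD update), and a hand-picked symmetric initialization. The function name `train` and all hyperparameters are hypothetical.

```python
def train(n_neurons, p_frequent=0.9, lr=0.1, steps=5000, eps=0.1):
    """Return [loss_frequent, loss_rare] for a toy linear model (illustrative only)."""
    p = [p_frequent, 1.0 - p_frequent]  # assumed task frequencies
    # Hidden unit i reads the input with w[i] and writes the output with a[i].
    a = [[eps, eps * (-1) ** i] for i in range(n_neurons)]
    w = [[eps, eps * (-1) ** i] for i in range(n_neurons)]

    def outputs():
        # M[o][k]: output coordinate o when the input is task k's one-hot vector.
        return [[sum(a[i][o] * w[i][k] for i in range(n_neurons))
                 for k in range(2)] for o in range(2)]

    for _ in range(steps):
        M = outputs()
        # Frequency-weighted residuals; the target for task k is the one-hot vector e_k.
        R = [[p[k] * (M[o][k] - (1.0 if o == k else 0.0))
              for k in range(2)] for o in range(2)]
        new_a, new_w = [], []
        for i in range(n_neurons):
            grad_a = [2 * sum(R[o][k] * w[i][k] for k in range(2)) for o in range(2)]
            grad_w = [2 * sum(R[o][k] * a[i][o] for o in range(2)) for k in range(2)]
            new_a.append([a[i][o] - lr * grad_a[o] for o in range(2)])
            new_w.append([w[i][k] - lr * grad_w[k] for k in range(2)])
        a, w = new_a, new_w

    M = outputs()
    # Unweighted squared error on each task's single example.
    return [sum((M[o][k] - (1.0 if o == k else 0.0)) ** 2 for o in range(2))
            for k in range(2)]


if __name__ == "__main__":
    print("1 hidden unit :", train(1))
    print("2 hidden units:", train(2))
```

In this sketch the one-unit model should end with low loss on the frequent task and a loss near 1 on the rare task, while the two-unit model fits both. Because the network is linear, the two units represent the tasks jointly rather than one unit per task, so it illustrates only the capacity effect, not the neuron-level dedication shown in the animation.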

Ekdeep Singh Lubana@EkdeepL

@jiaxinwen22 @ChrisGPotts Yeah agree with this. We made an intentional choice early on in the project to not bother with shared task structures just yet, since we didn't know what the dynamics without shared structure looked like. However, we fully intend to follow up soon. :)

Christopher Potts@ChrisGPotts

Our central hypothesis is that only larger models can learn rare/complex tasks because only those models can successfully drive the gradients for frequent/easy tasks to 0 and thus actually accumulate rare/complex task updates:
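One way to write the hypothesis down, in notation that is assumed here rather than taken from the thread or the paper: treat the training loss as a frequency-weighted mixture of per-task losses, so the expected SGD update is the same mixture of per-task gradients.

```latex
% Assumed notation: p_k = frequency of task k, \mathcal{L}_k = loss on task k,
% \theta = model parameters, \eta = learning rate.
\mathcal{L}(\theta) = \sum_{k} p_k \, \mathcal{L}_k(\theta),
\qquad
\mathbb{E}[\Delta\theta] = -\eta \sum_{k} p_k \, \nabla_\theta \mathcal{L}_k(\theta).

% If the gradient of a frequent task f has been driven to (approximately) zero,
% the expected update is left to the remaining, rarer tasks:
\nabla_\theta \mathcal{L}_f(\theta) \approx 0
\;\Longrightarrow\;
\mathbb{E}[\Delta\theta] \approx -\eta \sum_{k \neq f} p_k \, \nabla_\theta \mathcal{L}_k(\theta).
```

Under this reading, a model too small to drive the frequent-task gradient to zero keeps receiving updates dominated by the large p_f term, which is the competition the thread describes.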

Christopher Potts@ChrisGPotts

We expect only the larger models to learn the most infrequent tasks. This is exactly what we find. Here are the modular arithmetic task results:

Christopher Potts@ChrisGPotts

Speaking of blog posts, our coauthor @AndrewLampinen just did a post that, among other things, relates our results to themes of continual learning and catastrophic interference: https://infinitefaculty.substack.com/p/what-are-the-real-problems-of-continual

Christopher Potts@ChrisGPotts

Another link to the paper for good measure: https://arxiv.org/abs/2605.29548

Christopher Potts@ChrisGPotts

We first observe that scaling laws already predict that smaller models will fail to learn data mixtures that larger models do learn, even with infinite training data:
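The thread does not say which scaling-law form the paper uses. As an illustration only, assuming the widely used Chinchilla-style parametric form, the infinite-data limit leaves a loss term that depends on model size alone:

```latex
% Assumed form; E, A, B, \alpha, \beta are fitted constants,
% N = number of parameters, D = number of training tokens.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
\lim_{D \to \infty} L(N, D) = E + \frac{A}{N^{\alpha}}.
```

With positive A and \alpha, the residual A / N^{\alpha} is larger for smaller N, so under this assumed form more data cannot close the gap between a small model and a large one.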

Christopher Potts@ChrisGPotts

Overall, I think this paper is a big step forward in our understanding of why scaling is so effective: we almost always ask our models to learn very complex mixtures of tasks of different frequency and complexity. Only large models have the capacity to do this, and we now know why.

Christopher Potts@ChrisGPotts

We then test our hypothesis by pretraining OLMo-style models (4M to 4B), again by injecting examples for novel tasks at controlled frequencies during training. To ensure novelty, our tasks are comparison and modular addition defined with randomly chosen tokens.
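A rough sketch of how a modular addition task over randomly chosen tokens could be generated. The thread only states that the tasks are defined with randomly chosen tokens; the sequence layout (`x op y eq answer`), the vocabulary size, the modulus, and the function name `make_modular_addition_task` are all assumptions made for illustration.

```python
import random


def make_modular_addition_task(modulus, vocab_size, seed):
    """Pick random token ids for the numbers 0..modulus-1, an operator, and an equals sign.

    Hypothetical helper: the real tokenization and format may differ.
    """
    rng = random.Random(seed)
    # Sample without replacement so every role gets a distinct token id.
    ids = rng.sample(range(vocab_size), modulus + 2)
    number_tokens = ids[:modulus]
    op_token, eq_token = ids[modulus], ids[modulus + 1]

    def sample_example(example_rng):
        x = example_rng.randrange(modulus)
        y = example_rng.randrange(modulus)
        answer = (x + y) % modulus
        return [number_tokens[x], op_token, number_tokens[y], eq_token,
                number_tokens[answer]]

    return number_tokens, op_token, eq_token, sample_example


if __name__ == "__main__":
    numbers, op, eq, sample = make_modular_addition_task(modulus=7, vocab_size=50000, seed=0)
    print(sample(random.Random(1)))
```

Because the token ids carry no prior meaning, a model can only solve the task from the injected examples, which is one way to keep the task novel relative to the pretraining data.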

Christopher Potts@ChrisGPotts

The following summarizes the core result: model size on the x-axis, k=32 regression tasks on the y-axis (most frequent on top). Orange indicates that the task was learned, and the dashed lines give our predictions (from our Theorem 3) for the smallest model that will learn at least m features of each task.

Christopher Potts@ChrisGPotts

In our idealized setting, identifying features is straightforward. For these OLMo models, it is more challenging, but we can use causal interventions. These allow us to identify the feature geometry, and we show that frequency and size both correlate with more feature learning, just as our theory predicts. Results for modular addition, for which the model learns Fourier-mode features:
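For readers unfamiliar with Fourier-mode features, the sketch below shows why they suffice for modular addition: summing cos(2πk(x + y − z)/p) over a set of frequencies k gives a score that peaks at z = (x + y) mod p. This is a standard trigonometric identity, not the paper's causal-intervention procedure, and the function names are hypothetical.

```python
import math


def fourier_scores(x, y, p, freqs):
    """Score every candidate answer z using cosine features at the given frequencies."""
    return [sum(math.cos(2 * math.pi * k * (x + y - z) / p) for k in freqs)
            for z in range(p)]


def predict(x, y, p, freqs):
    """Return the candidate answer with the highest score."""
    scores = fourier_scores(x, y, p, freqs)
    return max(range(p), key=lambda z: scores[z])


if __name__ == "__main__":
    p = 7
    print(all(predict(x, y, p, freqs=[1, 2, 3]) == (x + y) % p
              for x in range(p) for y in range(p)))
```

Each cosine term equals 1 exactly when x + y − z is a multiple of p, so the correct answer collects the maximum score at every frequency. Using more frequencies widens the margin between the correct answer and the rest.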

Christopher Potts@ChrisGPotts

Since I am posting on everyone's behalf, I get the chance to say that it was so wonderful to be part of this project! It was my first time working closely with @EkdeepL, and it is always amazing to collaborate with Jing. A big thanks to the rest of the team as well: @danielwurgaft, @rach_it_, @LauraRuis, @nsaphra, @elmelis, and @AndrewLampinen.

Christopher Potts@ChrisGPotts

We first explore our hypothesis in an idealized setting in which we tightly control task frequency and complexity. Here, we can show analytically that scaling will reduce competition in the way that our hypothesis predicts, and we support this experimentally.

Christopher Potts@ChrisGPotts

We also use a protocol from the memorization literature: we inject tasks at different frequencies and study the resulting learning profile. For a highly infrequent task, small models (here, N=32) fail to learn the task (norm signal is never high; solid lines). At the point of injection, the signal briefly increases (green line), but it decays rapidly (gray line). At N=256, there is no such interference and the task is successfully learned.

Christopher Potts@ChrisGPotts

Remarkably, we see the same gradient interference pattern we saw with the idealized tasks. For example, here we inject task examples every 100 batches. For the largest models, the task loss drops at these injection points. The task is then partially overwritten (loss goes back up), but the overall loss trends downward, which corresponds to successful learning. The small models never get traction.
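A small sketch of an injection schedule like the one described, assuming the task batch replaces every 100th pretraining batch. Whether injected examples replace or are added to pretraining data is not stated in the thread, and `mixed_stream` is a hypothetical name.

```python
from itertools import cycle


def mixed_stream(pretrain_batches, task_batches, inject_every):
    """Yield (source, batch) pairs, swapping in a task batch at every inject_every-th step.

    Assumes task_batches is a non-empty list; it is cycled if it runs out.
    """
    task_iter = cycle(task_batches)
    for step, batch in enumerate(pretrain_batches, start=1):
        if step % inject_every == 0:
            yield "task", next(task_iter)
        else:
            yield "pretrain", batch


if __name__ == "__main__":
    stream = list(mixed_stream(range(1000), ["task_batch_a", "task_batch_b"], inject_every=100))
    print(sum(1 for source, _ in stream if source == "task"))
```

Logging the task loss right after each "task" step and again shortly before the next one would show the drop-then-partial-overwrite pattern described above, if it occurs.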

Christopher Potts@ChrisGPotts

Fun fact: the above is an instance of the memorization profiles that I discuss extensively in this blog post about the path to the paper "Blackbox model provenance via palimpsestic membership inference": https://web.stanford.edu/~cgpotts/blog/palimpsest/

deep Manifold@BetaTomorrow

Very interesting paper.

I read it through learning complexity, which we discussed in Deep Manifold Part 2 (Neural Network Mathematics) as the interaction between learning space and learning capacity.

In Deep Manifold Part 1 (Anatomy of Neural Network Manifold), we define learning space as task-defined: same dataset, different task, different learning space. A rare task may barely change data volume, but it changes the effective learning space.

Learning capacity is shaped by architecture and training strategy. More layers/channels provide more manifold-cover resources, but training dynamics decide whether rare-task pathways have enough time to mature into stable fixed-point classes before frequent-task updates erase them.

Oleg kAI@oleg_kai

@ChrisGPotts @EkdeepL neuron competition under data pressure is the cleanest mechanism for why parameter count buys generalization. small models can't afford to specialize and generalize at once. the budget binds.
