Great Stanford + MIT + Harvard + Anthropic paper.
Gives a clear training-based reason for why larger models learn abilities smaller models miss.
Says bigger AI models learn rare skills because they forget them less during training: their extra space protects weak learning signals.
The authors say the issue is not just whether a small model could represent the task, but whether training lets it keep that task while many common tasks keep pushing on the same limited parts.
Their core idea is that common tasks take up the model’s neurons first, so rare tasks get overwritten before they appear often enough to build into stable knowledge.
In a crowded data mixture, common patterns get first claim on the model’s internal machinery.
Small models may briefly pick up a rare signal, but the next wave of common-task updates overwrites it before the signal appears again.
They tested this first with controlled toy tasks where they could change how rare and complex each task was, then with OLMo language models from 4M to 4B parameters.
The main result is that bigger models learned low-frequency tasks much better, kept more task features inside their representations, and showed less gradient interference, which means common-task updates disturbed rare-task learning less.
Larger models can remember weak rare signals long enough to turn them into real learned skills.
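To make "gradient interference" concrete, here is a minimal sketch, not the paper's code or metric. It assumes interference is scored as the cosine similarity between the loss gradients of two tasks on shared weights. The toy linear model, `make_task`, and the "crowded" versus "roomy" setup are all hypothetical, chosen only to show that tasks sharing the same weights produce coupled gradients while tasks with weights of their own do not.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)**2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def gradient_cosine(g_a, g_b, eps=1e-12):
    """Cosine similarity between two gradients; negative values mean the tasks conflict."""
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)
    return float(g_a @ g_b / (denom + eps))

def make_task(rng, n, dim, active):
    """Hypothetical toy task: inputs are nonzero only on the `active` coordinates."""
    X = np.zeros((n, dim))
    X[:, active] = rng.normal(size=(n, len(active)))
    w_true = np.zeros(dim)
    w_true[active] = rng.normal(size=len(active))
    return X, X @ w_true

rng = np.random.default_rng(0)
dim = 8
w = np.zeros(dim)  # shared weights, evaluated at initialization

# Crowded case: the common and rare task compete for the same coordinates.
Xc, yc = make_task(rng, 256, dim, [0, 1, 2, 3])
Xr, yr = make_task(rng, 256, dim, [0, 1, 2, 3])
crowded = gradient_cosine(mse_grad(w, Xc, yc), mse_grad(w, Xr, yr))

# Roomy case: the rare task has coordinates of its own.
Xr2, yr2 = make_task(rng, 256, dim, [4, 5, 6, 7])
roomy = gradient_cosine(mse_grad(w, Xc, yc), mse_grad(w, Xr2, yr2))

print(f"crowded cosine: {crowded:+.3f}")
print(f"roomy cosine:   {roomy:+.3f}")
```

In the roomy case the two gradients touch disjoint coordinates, so the cosine is exactly zero and a common-task update cannot move the rare task's weights. In the crowded case the cosine is generally nonzero, so each common-task step also pushes on whatever the rare task has learned.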
----
Link – arxiv.org/abs/2605.29548
Title: "Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention"