A new paper from Stanford, MIT, Harvard, and Anthropic offers an explanation for why larger models learn rare skills that smaller models miss. The key insight: bigger models forget less during training, because their extra capacity protects weak learning signals from being overwritten.
Key facts
- Paper from Stanford, MIT, Harvard, and Anthropic researchers.
- Tested with OLMo models from 4M to 4B parameters.
- Larger models showed less gradient interference on rare tasks.
- Common tasks claim neurons first, overwriting rare signals.
- Small models briefly pick up rare signals but forget them.
Researchers at Stanford, MIT, Harvard, and Anthropic have published a paper titled "Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention" (arXiv:2605.29548). The work proposes a training-based mechanism for why larger language models acquire abilities that smaller models miss, even when both are trained on the same data.
The authors argue the issue is not whether a small model could represent a task, but whether training allows it to retain that task while common tasks repeatedly update the same limited parameters. Their core idea: in a crowded data mixture, common tasks claim the model's neurons first, so rare tasks get overwritten before they appear often enough to become stable knowledge. A small model may briefly pick up a rare signal, but the next wave of common-task updates erases it before the signal appears again.
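The overwrite mechanism described above can be illustrated with a deliberately contrived toy. This is a minimal sketch, not the paper's experimental setup: the `train` function, the two-task data, the learning rate, and the schedule (one rare example every 20 steps) are all assumptions chosen for illustration. In the "small" model both tasks must share a single weight and pull it in opposite directions; in the "large" model each task has its own weight.

```python
def train(features, targets, steps=210, rare_every=20, lr=0.5):
    """Run plain SGD on squared error and return the final rare-task loss.

    features / targets are dicts keyed by "common" and "rare" (hypothetical
    toy data). The rare task is seen once every `rare_every` steps.
    """
    dim = len(features["common"])
    w = [0.0] * dim
    for step in range(1, steps + 1):
        task = "rare" if step % rare_every == 0 else "common"
        x, y = features[task], targets[task]
        # Prediction error of the linear model w . x on this example.
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        # Gradient of 0.5 * err**2 with respect to w is err * x.
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    x, y = features["rare"], targets["rare"]
    rare_err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return 0.5 * rare_err ** 2


targets = {"common": 1.0, "rare": -1.0}

# "Small" model: one weight shared by both tasks, which want opposite values.
small_rare_loss = train({"common": [1.0], "rare": [1.0]}, targets)

# "Large" model: each task reads its own weight, so updates do not collide.
large_rare_loss = train({"common": [1.0, 0.0], "rare": [0.0, 1.0]}, targets)

print(f"small model rare-task loss: {small_rare_loss:.4f}")
print(f"large model rare-task loss: {large_rare_loss:.2e}")
```

In this toy the small model's rare-task loss ends close to its worst-case value because every rare update is undone by the common updates that follow, while the large model's rare-task loss shrinks toward zero. Real networks share parameters in far messier ways, so this shows only the direction of the effect.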
Controlled experiments and OLMo tests
The team first tested this hypothesis on controlled toy tasks where they could vary rarity and complexity independently. They then scaled up to OLMo language models ranging from 4 million to 4 billion parameters. The main result: bigger models learned low-frequency tasks much better, kept more task features in their representations, and showed less gradient interference, meaning that common-task updates disturbed rare-task learning less. Larger models retain weak rare signals long enough to turn them into learned skills.
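This summary does not spell out how the paper quantifies gradient interference. One common proxy, assumed here purely for illustration, is the cosine similarity between the common-task and rare-task gradients, together with the first-order change in rare-task loss caused by one SGD step on the common task. The function names and example vectors below are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def rare_loss_change(g_common, g_rare, lr):
    """First-order estimate of how one SGD step on the common task
    (w <- w - lr * g_common) changes the rare-task loss.
    Positive means the step makes the rare task worse."""
    return -lr * dot(g_common, g_rare)


# Conflicting gradients: the common step pushes the rare loss up.
conflict_cos = cosine([1.0, 0.0], [-1.0, 0.0])
conflict_delta = rare_loss_change([1.0, 0.0], [-1.0, 0.0], lr=0.1)

# Orthogonal gradients: the common step leaves the rare loss unchanged.
orthogonal_cos = cosine([1.0, 0.0], [0.0, 1.0])
orthogonal_delta = rare_loss_change([1.0, 0.0], [0.0, 1.0], lr=0.1)
```

Under this proxy, a negative cosine signals interference, and values near zero suggest the two tasks are updating largely separate directions in parameter space, which is the regime the article attributes to larger models.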
The paper gives a training-based explanation for emergent abilities that are often attributed to scale alone. It suggests the bottleneck is not whether a model can represent a task in principle but how training unfolds: frequent patterns interfere with infrequent ones during gradient descent, and extra capacity reduces that interference.
What to watch
Watch for follow-up work testing these findings on frontier models like GPT-4 or Claude, and for evidence on whether training curricula that explicitly schedule rare-task exposure can close the gap for smaller models without additional parameters.





