A joint study from Stanford, Harvard’s Kempner Institute, MIT, and Anthropic offers an explanation for why larger AI models outperform smaller ones on rare, complex tasks: reduced gradient interference.
In neural networks, frequent tasks generate strong gradient updates that dominate training. Rare tasks produce weak signals, and in smaller models those signals are overwritten before learning solidifies. Larger models, however, master common tasks early, which shrinks the gradient updates those common tasks produce. This opens a window in which rare-task signals can persist and be learned.
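The mechanism above can be illustrated with a toy calculation. This is a minimal sketch, not the paper's method: the two-dimensional gradients, the 5% rare-task share, and the function name `rare_task_progress` are all assumptions chosen for illustration. It treats the expected update as a frequency-weighted mix of the two task gradients and checks whether that mix still points in a direction that reduces rare-task loss.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rare_task_progress(g_freq, g_rare, p_rare):
    """Alignment of the mixed update with the rare-task gradient.

    The expected gradient is a frequency-weighted mix of both tasks.
    A positive result means a gradient-descent step lowers rare-task
    loss to first order; a negative result means the frequent task's
    gradient is interfering with the rare task.
    """
    mixed = [(1.0 - p_rare) * f + p_rare * r for f, r in zip(g_freq, g_rare)]
    return dot(mixed, g_rare)


# Hypothetical gradients (illustrative values, not from the paper).
g_rare = [1.0, 0.0]
g_freq_early = [-0.5, 2.0]                      # large, partly opposed to g_rare
g_freq_late = [0.01 * x for x in g_freq_early]  # frequent task mostly learned

early = rare_task_progress(g_freq_early, g_rare, p_rare=0.05)
late = rare_task_progress(g_freq_late, g_rare, p_rare=0.05)

print("early:", "interferes" if early < 0 else "helps")
print("late:", "interferes" if late < 0 else "helps")
```

With these assumed values the early update works against the rare task, while the late update, taken after the frequent-task gradient has shrunk, works in its favor.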
Testing spanned OLMo models from 4 million to 4 billion parameters trained on the Dolma corpus. Only the larger models succeeded on infrequent tasks. The smaller models failed not because they lacked the capability, but because training dynamics erased rare features before they could be retained.
The researchers suggest that increasing the frequency of rare tasks in training data could help smaller models acquire skills that currently require larger systems. The paper, led by Jing Huang, Ekdeep Singh Lubana, Rachit Bansal, Naomi Saphra, and Laura Ruis, was published on arXiv (v1: May 28, 2026; v2: June 1, 2026).
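One way to picture the suggestion about raising rare-task frequency is temperature-style reweighting of a data mixture. This is a generic technique swapped in for illustration, not a procedure taken from the paper. The task names, counts, and the `alpha` exponent below are hypothetical.

```python
def upsample_weights(counts, alpha=0.5):
    """Return sampling probabilities proportional to count ** alpha.

    alpha = 1.0 reproduces the natural task frequencies, while
    0 < alpha < 1 flattens the distribution so rare tasks are
    sampled more often than their raw share of the data.
    """
    scaled = {task: count ** alpha for task, count in counts.items()}
    total = sum(scaled.values())
    return {task: value / total for task, value in scaled.items()}


# Hypothetical corpus: one common task and one rare task.
counts = {"common": 9900, "rare": 100}

natural = upsample_weights(counts, alpha=1.0)    # rare task keeps its 1% share
flattened = upsample_weights(counts, alpha=0.5)  # rare task share increases

print(natural["rare"], flattened["rare"])
```

Whether a given `alpha` is enough for a small model to retain a rare skill is an empirical question that this sketch does not answer.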