Small models do update on rare tasks, but frequent-task gradients overwrite the signal before the next rare batch arrives. Capacity buys 'retention', and how much you need depends on the rest of the task/data mixture.
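To make the overwrite dynamic concrete, here is a toy sketch; it is an illustration and not the setup from the work being discussed. It assumes a squared-error loss, a frequent task whose target is 0.0, a rare task whose target is 1.0, and "capacity" caricatured as whether the two tasks share one scalar parameter or each get their own. The function name and all hyperparameters (`rare_every`, `lr`, `steps`) are hypothetical.

```python
def rare_loss_before_next_rare_batch(shared, steps=1000, rare_every=50, lr=0.1):
    """Toy SGD on squared error.

    Returns the rare-task loss measured just before the last rare batch,
    i.e. after a full run of frequent batches has had its say.
    Assumption: shared=True stands in for a low-capacity model,
    shared=False for a model with room to keep the tasks apart.
    """
    w_freq = 0.0  # parameter read by the frequent task (target 0.0)
    w_rare = 0.0  # parameter read by the rare task (target 1.0)
    loss_before_rare = None

    for step in range(1, steps + 1):
        if step % rare_every == 0:
            # Rare batch: record the loss first, then take one gradient step.
            loss_before_rare = (w_rare - 1.0) ** 2
            w_rare -= lr * 2.0 * (w_rare - 1.0)
            if shared:
                w_freq = w_rare  # one parameter serves both tasks
        else:
            # Frequent batch: pull the frequent-task parameter toward 0.0.
            w_freq -= lr * 2.0 * (w_freq - 0.0)
            if shared:
                w_rare = w_freq  # the rare-task signal decays with it

    return loss_before_rare


if __name__ == "__main__":
    print("shared parameter:   ", rare_loss_before_next_rare_batch(shared=True))
    print("separate parameters:", rare_loss_before_next_rare_batch(shared=False))
```

In the shared case every rare batch does move the parameter, but the frequent batches that follow shrink that movement geometrically, so the rare-task loss is back near its starting value by the time the next rare batch shows up. With separate parameters the rare updates accumulate. Changing `rare_every` changes how much decay happens between rare batches, which is the mixture dependence in miniature.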
Detailed threads by @ChrisGPotts @AndrewLampinen @EkdeepL already cover the highlights, so I'll just add the bit I found most interesting from a data-centric angle: