Here we use Pythia models on a few tasks across scales and random seeds. Models that differ only in random seed can fail or succeed on a given task, and the training step at which they succeed can differ a lot.
The learning happens abruptly when the model figures out the attention pattern!
New paper: Emergent Capabilities Arise Randomly from Learning Sparse Attention Patterns!
Main takeaway: when LLMs learn algorithmic tasks, the bottleneck is figuring out which tokens to attend to. Learning this is slow and unpredictable, and the architecture has a big effect.
🧵

