New Paper Traces LLM Capability Emergence To Attention Token Focus


New paper: Emergent Capabilities Arise Randomly from Learning Sparse Attention Patterns!

Main takeaway: when LLMs learn algorithmic tasks, the bottleneck is figuring out which tokens to attend to. This learning is slow and unpredictable, and architectures have a big effect.

🧵


Andrew Gordon Wilson@andrewgwils

Excited about this new work showing emergent capabilities follow from discovery of attention patterns!

Vatsal Baherwani@vatsalbaherwani

Scaling laws predict an LLM's pretraining loss, but not its capabilities. Abilities like in-context learning emerge abruptly and only past a certain scale. Our new paper traces this to one bottleneck: learning which tokens attention should focus on. 🧵 https://arxiv.org/abs/2606.25010


Pavel Izmailov@Pavel_Izmailov

Paper: http://arxiv.org/abs/2606.25010 Blogpost: http://vatsal0.github.io/blog/emergence.html

Led by @vatsalbaherwani with awesome collaborators @charllechen, @ShikaiQiu and @andrewgwils

Pavel Izmailov@Pavel_Izmailov

My personal main takeaway is that attention learning is hard, and it can be a big bottleneck in training, especially on long-context tasks. Architectures and losses can have a big impact, so I am optimistic we can improve on this.


Pavel Izmailov@Pavel_Izmailov

We find that on these purely algorithmic tasks, the learning is a combination of many abrupt jumps and plateaus, where each jump corresponds to figuring out one of the attention patterns.

Pavel Izmailov@Pavel_Izmailov

To study the phenomenon in more detail, we train LLMs on synthetic tasks, where we know the correct attention patterns and have complete control over various aspects of the task.
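
The thread does not say which synthetic tasks were used, so here is a purely hypothetical one to make "we know the correct attention patterns" concrete: a key-value lookup where the final token is a query key and the right place to attend is the position holding its value. The task, the function name `make_retrieval_example`, and its parameters are all assumptions for illustration, not the paper's setup.

```python
import random

def make_retrieval_example(num_pairs, vocab_size, rng):
    """Build one hypothetical key-value lookup sequence.

    Layout: k0 v0 k1 v1 ... k_{n-1} v_{n-1} query_key
    Returns the tokens, the expected answer, and the single position
    the final token would need to attend to in order to read the answer.
    """
    keys = rng.sample(range(vocab_size), num_pairs)   # distinct keys
    values = [rng.randrange(vocab_size) for _ in keys]

    tokens = []
    for k, v in zip(keys, values):
        tokens.extend([k, v])

    query_index = rng.randrange(num_pairs)
    tokens.append(keys[query_index])                  # query goes last

    target_position = 2 * query_index + 1             # slot of the matching value
    return tokens, values[query_index], target_position

rng = random.Random(0)
tokens, answer, target_position = make_retrieval_example(4, 50, rng)
print(tokens, answer, target_position)
```

Because the generator returns `target_position`, a learned attention map can be compared against the known target at every training step, and context length is controlled directly through `num_pairs`.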


Pavel Izmailov@Pavel_Izmailov

One cool thing: if you tell the model the correct tokens to attend to through an attention bias, learning on these algorithmic tasks becomes exponentially faster. Figuring out the attention patterns really is the bottleneck.
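
The thread does not spell out how the bias is applied, so the following is only a minimal sketch of one common reading: add a constant to the pre-softmax scores of the positions known to be relevant. The function `attention_weights` and the `bias_strength` value are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, relevant_positions=None, bias_strength=5.0):
    """Attention weights for one query.

    `scores` are raw query-key scores. If `relevant_positions` is given,
    an additive bias (an assumed mechanism) is applied to those keys
    before the softmax, nudging attention toward the known-correct tokens.
    """
    relevant = set(relevant_positions or [])
    biased = [s + (bias_strength if i in relevant else 0.0)
              for i, s in enumerate(scores)]
    return softmax(biased)

scores = [0.0] * 8                       # stand-in for an untrained head
plain = attention_weights(scores)
hinted = attention_weights(scores, relevant_positions=[2, 5])

mass_plain = plain[2] + plain[5]         # 0.25: uniform over 8 keys
mass_hinted = hinted[2] + hinted[5]      # roughly 0.98 with bias_strength=5.0
print(mass_plain, mass_hinted)
```

With the hint, almost all attention mass already sits on the right tokens before any training, which is the sense in which the bias removes the search for the pattern.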

Pavel Izmailov@Pavel_Izmailov

The learning difficulty depends heavily on the properties of the task. It is easy to find extremely sparse or extremely dense attention patterns, but everything else is very hard. And increasing the context length (S) makes the task much harder.


Pavel Izmailov@Pavel_Izmailov

Architecture can have a big effect on learning as well. More attention heads help the model discover the correct attention patterns much faster. And an MLP-Mixer can also learn exponentially faster on one of the tasks!


Pavel Izmailov@Pavel_Izmailov

Here we use Pythia models on a few tasks across scales and random seeds. Models that only differ by random seed can fail or succeed on the task, and the step when they succeed can differ a lot.

The learning happens abruptly when the model figures out the attention pattern!


Pavel Izmailov@Pavel_Izmailov