There's a mental model of LLMs that fits this narrative, and it also follows the rough history of language model development from 1948 to today: Transformers are basically n-gram models + embeddings + attention + clean_data + scale.
THE BEGINNING (1948 - 2003):
N-gram models: count words. Make new predictions by accepting a prompt (e.g. [the, cat, and, the]) and if the next word was "hat" 80% of the time in the training data, then the n-gram model will output an 80% probability that "hat" is the next word. Simple stuff.
Claude Shannon came up with this idea in his paper "A Mathematical Theory of Communication" in 1948.
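Here's a minimal sketch of that counting idea in Python. The toy corpus and the function names (train_ngram, next_word_probs) are made up for illustration; this is just one simple way to write it, not how any particular system did it. It also shows what comes back for a context that never appeared in the training text.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count which word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        counts[context][next_word] += 1
    return counts

def next_word_probs(counts, context):
    """Turn the raw counts for one context into probabilities."""
    followers = counts.get(tuple(context))
    if not followers:
        return {}  # context never seen: the model has nothing to say
    total = sum(followers.values())
    return {word: c / total for word, c in followers.items()}

# Made-up corpus: "the cat and the" is followed by "hat" 4 times out of 5.
corpus = ("the cat and the hat . " * 4 + "the cat and the bat . ").split()
model = train_ngram(corpus, n=5)

probs = next_word_probs(model, ["the", "cat", "and", "the"])
unseen = next_word_probs(model, ["the", "dog", "and", "the"])

print(probs)   # "hat" gets 0.8, "bat" gets 0.2
print(unseen)  # empty: this exact context was never counted
```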
THE PROBLEM: sparsity
Let's say your language model saw the phrase "the cat and the" many times... but now someone presents the phrase "the dog and the"... the problem is... even though "cat" and "dog" are similar words... the n-gram language model doesn't care. They're entirely different words... so they get entirely different counts... and the language model hasn't EVER SEEN the new phrase... so it doesn't know that "hat" is still a plausible next word.
THE SUBPROBLEM: inefficiency of training signal
This is also an efficiency problem. It means that as the language model learns more about the word "cat"... it doesn't get to transfer that learning to also know about the word "dog" or "mouse" or whatever... learning about "cat" happens in pure isolation. This wastes a lot of training signal. This means that... in order to have the intelligence of today's AI systems... an n-gram model would need WAAAAY more training data... it would literally need to see every possible phrase many times (even phrases like... 10,000 words long).
THE SOLUTION (2003 - 2013): embeddings
Bengio and colleagues tackled this problem by training language models as neural networks, starting with the paper "A Neural Probabilistic Language Model" in 2003. In these language models, instead of counting words, each word was mapped to a list of numbers with an important property:
similar words had similar lists of numbers
This meant that all of a sudden... dog and cat were "similar" things in the neural network. And the more that a neural network learned about "dog" and its use in language... the more it *also* learned about "cat".
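A rough sketch of what "similar lists of numbers" buys you. The three-number embeddings and the hat_weights below are hand-picked for illustration (real embeddings are learned and much longer), and hat_score is a hypothetical stand-in for one output of a network. The point is only that one set of weights, applied to similar vectors, gives similar answers.

```python
import math

# Hypothetical embeddings, hand-picked for illustration (not learned).
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "tax": [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_tax = cosine_similarity(embeddings["cat"], embeddings["tax"])

# Hypothetical weights for scoring "hat" as the next word (also made up).
hat_weights = [1.0, 1.0, -1.0]

def hat_score(word):
    """Score "hat" using the word's embedding and the shared weights."""
    return sum(e * w for e, w in zip(embeddings[word], hat_weights))

print(cat_dog > cat_tax)                   # True: cat is closer to dog than to tax
print(hat_score("cat"), hat_score("dog"))  # nearly the same score
print(hat_score("tax"))                    # much lower
```

Because "cat" and "dog" sit close together, anything that nudges hat_weights based on "cat" examples moves the "dog" score in the same direction.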
ANALOGY AT THIS POINT: it's an imperfect analogy... but you can think of this as like "n-gram language models with word similarity". There wasn't really complex logic going on during training (training was still roughly analogous to "counting things")... it was just that words weren't treated as totally separate things anymore.
THE PROBLEM: low-scale
But at the time, neural language models couldn't be trained on large amounts of text, so n-gram (and Bayesian) language models still offered better capability. This started to change when Mikolov relaxed some assumptions to create a much higher-scale neural network.
SOLUTION (2013 - 2017): scale (word2vec)
Now you could train these embeddings on billions of tokens (the released Google News vectors used roughly 100 billion words), and the embeddings got really good... king - man + woman = queen... that kind of stuff.
ANALOGY AT THIS POINT: the analogy hasn't really changed... if anything it got tighter... because word2vec actually *simplified* the neural network even more... and it behaved even MORE like gathering counts. In fact, you could compute cosine similarity on the counts directly and get *similar* properties to the word embeddings... but the word embeddings were doing it better.
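To make the "cosine on the counts directly" point concrete, here's a small sketch. The sentences, the window size, and the helper names are all made up for illustration, and this is plain co-occurrence counting, not word2vec itself.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words that appear within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Made-up toy corpus.
sentences = [
    "the cat chased the mouse",
    "the dog chased the ball",
    "the cat ate the food",
    "the dog buried the bone",
    "we paid the tax bill",
]
vectors = cooccurrence_vectors(sentences)

cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
cat_tax = cosine_similarity(vectors["cat"], vectors["tax"])
print(cat_dog > cat_tax)  # True: cat and dog share more neighbors
```

With raw counts, a frequent word like "the" dominates every vector, which is part of why count-based methods usually reweight the counts (for example with PMI) before comparing them.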
THE PROBLEM: while we got really good embeddings, we still didn't have long context windows. Everyone was trying to get LSTMs to use long context, but the inductive bias of the network wasn't good enough (RNN/LSTMs were biased towards the most recent tokens).
THE SUBPROBLEM: the RNN/LSTMs had a difficult bias for deciding what to pay attention to... which really just means they had to try to pay attention to too much... while at the same time their capacity was too small (because we couldn't scale them on GPUs).
SOLUTION (2017-2018): Attention is basically the idea of "don't pay attention to everything... grab different latent features from different parts of the context window at different times". This wasn't an entirely new concept (attention had already been used with LSTMs) but Transformers did something similar to word2vec... they dumbed down the algorithm so we could scale it up on computers.
ANALOGY UP TO THIS POINT: you can think of this like applying a filter on the word counts/statistics... so that "only relevant counts matter" when you're making a prediction. This has the dual impact of increasing your signal-to-noise ratio (which makes all your training data more useful) while also letting you scale things up.
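Here's a tiny sketch of the "filter" idea, written as single-query scaled dot-product attention in plain Python. The 2-number keys, values, and query are made up so that one context word clearly dominates; in a real Transformer all of these are learned.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    output = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

# Made-up vectors for the context [the, cat, and, the].
context = ["the", "cat", "and", "the"]
keys = [[0.0, 0.1], [4.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
query = [3.0, 0.0]  # a query that is "looking for" the cat-like direction

weights, output = attend(query, keys, values)
most_attended = context[weights.index(max(weights))]
print(most_attended)  # cat
```

Almost all of the weight lands on "cat", so the output is almost entirely that word's value vector... the other words are filtered out of the prediction.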
PROBLEM: our data was crappy and limited. Everyone just trained on a subset of Wikipedia or the One Billion Word Benchmark.
SOLUTION (GPT-1, 2, 3, 4, 5): scrape the web and get huge amounts of clean data. Hire Mechanical Turk workers and get even more clean data. Get user logs and get even more clean data.
A lot has changed... but maybe not so much:
- counts: count words to figure out "what word comes next"
- synonyms: allow similar words to share counts
- attention: only focus on the counts that matter
- scale/data: get more/better data at bigger scale
Here's the thing about counts... when you're in the middle of counting... you're counting *everything*. You're just... counting... so you don't have any filter on what is true/false/etc... it all goes in the "big bag of counts".
And that's why this analogy fits Owain's work. The logic we see from context_window -> output... isn't happening during pre-training. Pre-training is counting words. Once you have the counts, then you can sort of... "paint by number" to do logic at inference time. It's easy to get these two processes backwards.
TLDR: LLMs learn everything they see.