Stepping back: in 2019, the best LLM was GPT-2. It was an insane advance for its time, and more important than almost anyone thought it would be (myself included). But compared to modern LLMs, it has a few major drawbacks.
GPT-2 is ~1000x smaller than frontier models (DeepSeek V4 is 1.6T, others are rumored to be even bigger), it is undertrained (6-7 tokens/param vs Chinchilla's 20), it was trained on worse data (Reddit outlinks instead of web scrapes), and it uses a 1024-token context, which makes a long chain of thought impossible.
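As a rough sanity check on the size and training-budget numbers, here is a back-of-envelope sketch. The 1.5B parameter count is the largest GPT-2 checkpoint and the 1.6T figure is the one quoted in this post. The training token count (`GPT2_TOKENS`) is my own assumption, a round estimate picked to be consistent with the 6-7 tokens/param figure, not a published number.

```python
# Back-of-envelope comparison of GPT-2 against the figures quoted in the post.
GPT2_PARAMS = 1.5e9        # largest GPT-2 checkpoint
FRONTIER_PARAMS = 1.6e12   # frontier model size as quoted in the post
GPT2_TOKENS = 10e9         # ASSUMPTION: rough estimate of GPT-2's training tokens
CHINCHILLA_TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb

# How many times larger the frontier model is.
size_ratio = FRONTIER_PARAMS / GPT2_PARAMS

# How many training tokens GPT-2 saw per parameter (under the assumption above).
tokens_per_param = GPT2_TOKENS / GPT2_PARAMS

# How many tokens a Chinchilla-style budget would call for at GPT-2's size.
chinchilla_tokens = CHINCHILLA_TOKENS_PER_PARAM * GPT2_PARAMS

print(f"size ratio: {size_ratio:.0f}x")
print(f"tokens per param: {tokens_per_param:.1f}")
print(f"Chinchilla-style token budget: {chinchilla_tokens:.1e}")
```

Under these assumptions the size gap lands near three orders of magnitude, and GPT-2 saw about a third of the tokens a Chinchilla-style budget would suggest for its size.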
