Hugh Zhang outlines GPT-2's architectural constraints, noting it is 1,000 times smaller than DeepSeek V4 and severely undertrained

Because of GPT-2’s small context window, I could only fit at most 4 few-shot examples. And even with these examples, the acceptance rate for the base untrained model was basically 0, so the model wouldn’t learn anything from RL.

Hugh Zhang@hughbzhang

So GPT-2, while again amazing for its time, is not a good “pretrain” by modern standards. Nevertheless, if you know even a rudimentary version of what we now know about RL, you can match or exceed performance from models that were pre-trained with significantly more compute!

Hugh Zhang@hughbzhang

Full blog post, including links to the model weights for the SFT / RL GPT-2 checkpoints, here, along with a few additional experiments / comments.

https://hughbzhang.com/blog/rl-on-cot-gpt2-era.html

Hugh Zhang@hughbzhang

Overall, I quite enjoyed doing this project while funemployed! The overarching lesson for me was that algorithmic advances (if you can find them) are a very, very real thing. And as fast as progress in AI is now, there is a very real chance that things may go even faster soon.

Hugh Zhang@hughbzhang

GPT-3 12B was also trained on 300B tokens (assuming it's the same model referenced in the GPT-3 paper). This would make it around 240x the FLOPs of GPT-2 (8x params, 30x data). This is assuming that GPT-2 was trained on 1 epoch. I could not find definitive evidence though!

Hugh Zhang@hughbzhang

Now to account for compute. GPT-3 175B was trained on 300B tokens. GPT-2 1.5B was trained on 40GB of web text, which roughly maps to around 10B tokens at a 4:1 byte-to-token ratio. So GPT-3 had ~3500x more FLOPs than GPT-2 (~117x params, 30x data).
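
The compute comparison can be reproduced with a few lines of arithmetic. This is a rough sketch, not the author's calculation: it assumes the common rule of thumb that training compute scales as 6 x parameters x tokens, and it takes the thread's own estimates (40GB at about 4 bytes per token, one epoch) at face value. The function and constant names are illustrative.

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the C ~ 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * params * tokens


# Figures quoted in the thread; the one-epoch assumption for GPT-2 is unverified.
GPT2_PARAMS = 1.5e9
GPT2_TOKENS = 40e9 / 4      # 40 GB of text at ~4 bytes per token -> ~10B tokens
GPT3_TOKENS = 300e9


def flops_ratio(params: float, tokens: float) -> float:
    """Training compute of a model relative to GPT-2 1.5B."""
    return train_flops(params, tokens) / train_flops(GPT2_PARAMS, GPT2_TOKENS)


ratio_175b = flops_ratio(175e9, GPT3_TOKENS)  # roughly 3500x
ratio_12b = flops_ratio(12e9, GPT3_TOKENS)    # roughly 240x

print(f"GPT-3 175B vs GPT-2: {ratio_175b:.0f}x")
print(f"GPT-3 12B  vs GPT-2: {ratio_12b:.0f}x")
```

Because the constant 6 cancels in the ratio, the result depends only on the parameter and token counts, so the uncertainty sits almost entirely in the 10B-token estimate for GPT-2.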

Hugh Zhang@hughbzhang

After SFT (before the tokenization fixes), the model gets 11.68% accuracy on GSM8K. You can bump that by ~1.6 percentage points to 13.27% (over 10% relative gain!) just by fixing the tokenization!
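
For reference, the absolute and relative gains follow directly from the two accuracies quoted above. The variable names are illustrative.

```python
acc_before_fix = 11.68  # GSM8K accuracy (%) after SFT, before the tokenization fix
acc_after_fix = 13.27   # GSM8K accuracy (%) after SFT, with the tokenization fix

absolute_gain = acc_after_fix - acc_before_fix   # in percentage points
relative_gain = absolute_gain / acc_before_fix   # as a fraction of the baseline

print(f"absolute: {absolute_gain:.2f} points, relative: {relative_gain:.1%}")
```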

Hugh Zhang@hughbzhang

I tried a few more ablations, but it turned out nothing did as well as (surprise, surprise) simply running the RL for longer. Since I only had the GSM8K problems, I ended up running the RL for 3 epochs. Further epochs mostly plateaued.

Hugh Zhang@hughbzhang

The bitter lesson says that the thing that matters is finding the right thing to scale up. It does not say that scaling up the wrong recipe will always beat a slightly less scaled up, but more correct recipe.

Hugh Zhang@hughbzhang

GPT-2 is ~1000x smaller than frontier models (DeepSeek V4 is 1.6T, others rumored even bigger), undertrained (6-7 tokens/param vs Chinchilla's 20), on worse data (Reddit outlinks instead of web scrapes), and uses a 1024-token context (impossible to have a long chain of thought).
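
The size and undertraining figures can be checked with quick arithmetic. This sketch takes the thread's numbers as given (including the DeepSeek V4 parameter count and the ~10B-token estimate for GPT-2's training set, neither of which is verified here). The constant names are illustrative.

```python
GPT2_PARAMS = 1.5e9
GPT2_TOKENS = 10e9                 # thread's estimate: 40GB at ~4 bytes per token
FRONTIER_PARAMS = 1.6e12           # DeepSeek V4 size as quoted in the thread
CHINCHILLA_TOKENS_PER_PARAM = 20   # ratio cited in the thread

tokens_per_param = GPT2_TOKENS / GPT2_PARAMS                       # ~6.7
size_ratio = FRONTIER_PARAMS / GPT2_PARAMS                         # ~1067
chinchilla_tokens = CHINCHILLA_TOKENS_PER_PARAM * GPT2_PARAMS      # 30B tokens

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"frontier / GPT-2 size: {size_ratio:.0f}x")
print(f"tokens at 20 per parameter: {chinchilla_tokens / 1e9:.0f}B")
```

Under these assumptions, GPT-2 saw about a third of the tokens that a 20 tokens-per-parameter budget would call for at its size.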

Hugh Zhang@hughbzhang

One important issue that I found after looking at a few of the generations was that GPT-2 didn’t tokenize digits correctly. Concretely, “200” would not get tokenized as “2” “0” “0”, but rather as its own token. This made it very hard for the model to learn arithmetic.

Hugh Zhang@hughbzhang

The assumption is that these will produce similar results, but one could also believe that the effect is due to distillation from a stronger model. I personally don’t — I think the purpose is just to kickstart the RL by getting a nonzero solve rate. But I can't rule this out.

Hugh Zhang@hughbzhang

The solution was simple. Especially since we needed an SFT phase anyways, I just added spacing around various digits and other math symbols that the model might encounter. This was an easy post-hoc tokenization fix that just involved changing the data, no code. Example below.
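
A minimal sketch of what such a data-side fix could look like. The thread does not give the actual preprocessing rules, so the symbol set and the function name `space_out_math` are assumptions for illustration only.

```python
import re

# Hypothetical rule set: the thread only says "digits and other math symbols".
_MATH_CHAR = re.compile(r"([0-9+\-*/=$%])")
_WHITESPACE_RUN = re.compile(r"\s+")


def space_out_math(text: str) -> str:
    """Surround every digit and math symbol with spaces, then collapse extra whitespace.

    The goal is that a tokenizer sees each digit separately instead of
    merging a number such as "200" into a single token.
    """
    spaced = _MATH_CHAR.sub(r" \1 ", text)
    return _WHITESPACE_RUN.sub(" ", spaced).strip()


print(space_out_math("200 + 35 = 235"))
print(space_out_math("She has 200 apples."))
```

Applied to both the SFT targets and the prompts, a rewrite like this changes only the training text, which matches the "no code" framing: the tokenizer and model stay untouched.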

Hugh Zhang@hughbzhang

To fix this, I started with a brief SFT phase where GPT-5.4 mini generated solutions for 2000 GSM8K problems. A few problems that were too hard fell back to GPT-5.5. This cost less than $100. Example below.

Hugh Zhang@hughbzhang

I ran these experiments mostly on interruptible A100 instances rented on the cloud. The final SFT + RL run took a few days on a single A100 (less than $100 total at today's prices). I probably spent a few hundred more running various experiments (but less than $1000 for all experiments total).

Hugh Zhang@hughbzhang

One huge note here. In 2019, the way to do SFT would be to hire a number of humans to write solutions manually for you. This is a bit tedious / expensive, so I shortcutted this by asking GPT-5.4 mini to write solutions for me.

Hugh Zhang@hughbzhang

On the RL side, for simplicity, I just did basic GRPO. Data batch size 1, group size 128, fully on policy. After one epoch, accuracy climbs to 20%. After three, it climbs to over 24%! This roughly matches the 12B GPT-3 version from the original GSM8K paper.
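
A minimal sketch of the group-relative advantage at the core of GRPO, assuming a binary correct/incorrect reward per sampled solution and the usual mean and standard-deviation normalization within a group. The thread does not show its training code, so the function name, the epsilon, and the reward scheme here are illustrative assumptions rather than the author's implementation.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sample's reward against its own group.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


GROUP_SIZE = 128  # group size quoted in the thread

# Case 1: no sampled solution is correct, so every advantage is zero.
no_solves = grpo_advantages([0.0] * GROUP_SIZE)

# Case 2: 3 of 128 samples are correct (a made-up example, not a reported result).
few_solves = grpo_advantages([1.0] * 3 + [0.0] * (GROUP_SIZE - 3))

print(max(no_solves), min(no_solves))
print(round(few_solves[0], 2), round(few_solves[-1], 2))
```

The first case illustrates why a near-zero solve rate stalls RL: when every reward in the group is identical, all advantages vanish and the policy gradient carries no signal. Once even a few samples succeed, the correct ones receive large positive advantages and the rest receive small negative ones.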

Hugh Zhang@hughbzhang

To be more precise, the bitter lesson claims that FLOPs are necessary but not sufficient alone.

Hugh Zhang@hughbzhang

If this is right (take with a grain of salt), then RLVR on GSM8K gets you a performance boost that matches that of pre-training a better model with >100x the FLOPs! And this is not even factoring in better data quality of GPT-3’s web scrape compared to GPT-2’s Reddit links!

Hugh Zhang@hughbzhang

There are many examples of companies that held large compute leads and did not succeed in training good models. Why? Because they did not scale the right way.

Hugh Zhang@hughbzhang

One reflection at the end. It is common (especially on Twitter) to use the bitter lesson to argue that whoever owns the FLOPs will win. But this is an incorrect interpretation of the bitter lesson.
