Hugh Zhang used SFT and RL to train GPT-2 1.5B to 24% GSM8K accuracy, roughly matching a fine-tuned GPT-3 12B

Story Overview

A single-researcher experiment first fine-tuned the 2019-era GPT-2 1.5B model on chain-of-thought solutions that a newer model generated for 2,000 GSM8K training problems, then applied reinforcement learning (GRPO) on the remaining ~5,500. The resulting system reached just over 24 percent accuracy on the held-out test set while performing every arithmetic step inside its own outputs, roughly matching the score previously reported for a fine-tuned GPT-3 12B that had access to an external calculator.

Original post

Hugh Zhang@hughbzhang

A question I’ve been pondering: what if we'd known about o1 / RL on chain-of-thought back in the early days of LLMs?

It turns out SFT + a bit of RL on GPT-2 almost matches the performance of a fine-tuned GPT-3 (12b) on GSM8K — a model with >100x the pre-training compute.

8:24 AM · Jun 29, 2026 · 4.1K Views

Efficiency Gain

RL delivers most of the lift

The supervised stage alone reached roughly 12 to 13 percent accuracy; the subsequent reinforcement-learning pass on chain-of-thought traces lifted it to 20 percent after one epoch and to over 24 percent after three. The author equates that jump with the gain from pre-training a base model with more than 100 times the FLOPs.

Open Question

Replication remains an open variable

The author flags the result with an explicit “take with a grain of salt,” noting that all training data, hyperparameters, and evaluations come from one side project with no third-party confirmation yet reported.

Sentiment

Commenters expressed surprise and approval that RL on chain-of-thought lets GPT-2 approach GPT-3's GSM8K performance, citing the low cost of the runs and the gains from the tokenization fix.

Positive: 100.0% · Negative: 0.0%

5 comments with sentiment.

Posts from X

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Faster algorithmic progress could have made things much weirder. It still can.

Replies

Hugh Zhang@hughbzhang

As a fun side project, I took the GPT-2 1.5B weights and did a basic SFT + RL run on GSM8K. For SFT, I generated solutions using GPT-5.4 mini for <$100 on 2k problems from the train set. I used the remaining ~5.5k problems for RL. Accuracy was on the held-out GSM8K test set.
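A minimal sketch of the data split described above, assuming a seeded random shuffle (the thread does not say how the 2k SFT problems were chosen). The 7,473 figure is the size of the public GSM8K train split, which is where the "~5.5k" remainder comes from; `split_gsm8k_train` and its parameters are illustrative names, not the author's code.

```python
import random

GSM8K_TRAIN_SIZE = 7473  # size of the public GSM8K train split


def split_gsm8k_train(num_problems=GSM8K_TRAIN_SIZE, sft_size=2000, seed=0):
    """Return (sft_indices, rl_indices) as two disjoint lists of problem indices."""
    indices = list(range(num_problems))
    # Assumption: a seeded shuffle; the thread does not give the selection rule.
    random.Random(seed).shuffle(indices)
    return indices[:sft_size], indices[sft_size:]


sft_idx, rl_idx = split_gsm8k_train()
print(len(sft_idx), len(rl_idx))  # 2000 5473
```

Keeping the two index sets disjoint means the RL stage only sees problems for which the model was never shown a reference solution during SFT.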

Hugh Zhang@hughbzhang

In effect, the question asks how much of recent progress in AI is due to pure scaling versus both scaling + algorithmic advances.

Imagine you time travel back into 2019 with today’s knowledge. You don’t have access to modern data, compute or models. But what could you do?

Hugh Zhang@hughbzhang

To fix this, I started with a brief SFT phase where GPT-5.4 mini generated solutions for 2000 GSM8K problems. A few problems that were too hard fell back to GPT-5.5. This cost less than $100. Example below.

Hugh Zhang@hughbzhang

Because of GPT-2’s small context window, I could only fit at most 4 few-shot examples. And even with these examples, the acceptance rate for the base untrained model was basically 0, so the model wouldn’t learn anything from RL.

Hugh Zhang@hughbzhang

Notably, the baseline GPT-3 results (from the GSM8K paper) all allow access to an external calculator. I don't do this for GPT-2 — it does the full computation inside its chain-of-thought. So the gain from RL might actually be underestimated and perhaps more than 100x!

Hugh Zhang@hughbzhang

Stepping back, in 2019, the best LLM was GPT-2. It was an insane advance for its time and more important than almost anyone thought it would be (myself included). But compared to modern LLMs, it has a few major drawbacks.

Hugh Zhang@hughbzhang

One huge note here. In 2019, the way to do SFT would be to hire a number of humans to write solutions manually for you. This is a bit tedious / expensive, so I shortcutted this by asking GPT-5.4 mini to write solutions for me.

Hugh Zhang@hughbzhang

The assumption is that these will produce similar results, but one could also believe that the effect is due to distillation from a stronger model. I personally don’t — I think the purpose is just to kickstart the RL by getting a nonzero solve rate. But I can't rule this out.

Hugh Zhang@hughbzhang

After SFT (before the tokenization fixes), the model gets 11.68% accuracy on GSM8K. You can bump that by ~1.6 percentage points to 13.27% (over 10% relative gain!) just by fixing the tokenization!
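The absolute and relative gains quoted above can be checked directly from the two reported accuracies. This is plain arithmetic on the numbers in the post; the variable names are illustrative.

```python
before = 11.68  # accuracy (%) after SFT, before the tokenization fix
after = 13.27   # accuracy (%) after the tokenization fix

absolute_gain = after - before           # in percentage points
relative_gain = absolute_gain / before   # as a fraction of the starting accuracy

print(f"absolute: {absolute_gain:.2f} points, relative: {relative_gain:.1%}")
# absolute: 1.59 points, relative: 13.6%
```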

Hugh Zhang@hughbzhang

The solution was simple. Especially since we needed an SFT phase anyways, I just added spacing around various digits and other math symbols that the model might encounter. This was an easy post-hoc tokenization fix that just involved changing the data, no code. Example below.
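A rough sketch of what a data-only spacing fix like this could look like. The thread's own example did not survive here, so the exact character set and spacing scheme below are assumptions, and `space_out_math` is a hypothetical helper rather than the author's preprocessing. The presumed motivation is that GPT-2's BPE vocabulary splits multi-digit numbers into irregular chunks, while space-separated digits tokenize one digit at a time.

```python
import re

# Assumption: which characters to isolate. The thread only says
# "digits and other math symbols".
_MATH_CHARS = re.compile(r"([0-9+\-*/=<>()%$])")


def space_out_math(text: str) -> str:
    """Surround every digit and math symbol with single spaces."""
    spaced = _MATH_CHARS.sub(r" \1 ", text)
    # Collapse the runs of spaces introduced above; newlines are left alone.
    return re.sub(r" {2,}", " ", spaced).strip()


print(space_out_math("48/2 = 24 clips"))  # 4 8 / 2 = 2 4 clips
```

Because this only rewrites the training text, the same transform would also have to be applied to prompts at evaluation time, and any answer checker would need to strip the spaces back out of numbers. Decimal points and thousands separators are not handled in this sketch.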

Hugh Zhang@hughbzhang

I tried a few more ablations, but it turned out nothing did as well as (surprise, surprise) simply running the RL for longer. Since I only had the GSM8K problems, I ended up running the RL for 3 epochs. Further epochs mostly plateaued.

Hugh Zhang@hughbzhang

If this is right (take with a grain of salt), then RLVR on GSM8K gets you a performance boost that matches that of pre-training a better model with >100x the FLOPs! And this is not even factoring in better data quality of GPT-3’s web scrape compared to GPT-2’s Reddit links!

Hugh Zhang@hughbzhang

One reflection at the end. It is common (especially on Twitter) to use the bitter lesson to argue that whoever owns the FLOPs will win. But this is an incorrect interpretation of the bitter lesson.

Hugh Zhang@hughbzhang

Full blog post, including links to the model weights for the SFT / RL GPT-2 checkpoints, here, along with a few additional experiments / comments.

https://hughbzhang.com/blog/rl-on-cot-gpt2-era.html

Hugh Zhang@hughbzhang

Overall, I quite enjoyed doing this project while funemployed! The overarching lesson for me was that algorithmic advances (if you can find them) are a very, very real thing. And as fast as progress in AI is now, there is a very real chance that things may go even faster soon.

Hugh Zhang@hughbzhang

On the RL side, for simplicity, I just did basic GRPO. Data batch size 1, group size 128, fully on policy. After one epoch, it climbs to 20%. After three, it climbs to over 24%! This roughly matched the 12B GPT-3 version from the original GSM8K paper.
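To make the "group size 128" setup concrete, here is a small sketch of the group-relative advantage computation at the core of GRPO, assuming a binary reward (1.0 when a sampled solution's final answer is correct, 0.0 otherwise) and the usual mean/standard-deviation normalization. Whether the author used exactly this normalization is not stated; `group_relative_advantages` and `eps` are illustrative names.

```python
from statistics import mean, pstdev

GROUP_SIZE = 128  # samples drawn for a single problem, per the thread


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# If no sample solves the problem, every advantage is zero: no learning signal.
no_solves = group_relative_advantages([0.0] * GROUP_SIZE)

# With 16 of 128 samples correct, correct samples get a positive advantage
# and incorrect ones a smaller negative advantage.
some_solves = group_relative_advantages([1.0] * 16 + [0.0] * 112)
```

The all-zero case illustrates why a base model with an acceptance rate of essentially zero cannot learn from RL alone: the policy-gradient term is weighted by these advantages, so a group with identical rewards contributes nothing.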

Hugh Zhang@hughbzhang

I ran these experiments mostly on interruptible A100 instances rented on the cloud. The final SFT + RL run took a few days on a single A100 (less than $100 total at today's prices). I probably spent a few hundred more running various experiments (but less than $1000 for all experiments total).

Hugh Zhang@hughbzhang

GPT-3 12B was also trained on 300B tokens (assuming it's the same model referenced in the GPT-3 paper). This would make it around 240x the FLOPs of GPT-2 (8x params, 30x data). This is assuming that GPT-2 was trained on 1 epoch. I could not find definitive evidence though!
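The 240x figure follows from the common rule of thumb that training compute is roughly 6 × parameters × tokens. The sketch below reproduces the arithmetic using only the numbers in the post; the ~10B-token figure for GPT-2 is derived from the stated "30x data" ratio and, like the post, assumes a single epoch.

```python
def training_flops(params, tokens):
    """Approximate training compute with the C ~ 6 * N * D rule of thumb."""
    return 6 * params * tokens


GPT3_PARAMS = 12e9               # "GPT-3 12B", as named in the thread
GPT3_TOKENS = 300e9              # 300B tokens, per the thread
GPT2_PARAMS = 1.5e9
GPT2_TOKENS = GPT3_TOKENS / 30   # "30x data" => ~10B tokens (assumes 1 epoch)

ratio = training_flops(GPT3_PARAMS, GPT3_TOKENS) / training_flops(GPT2_PARAMS, GPT2_TOKENS)
print(round(ratio))  # 240
```

The constant 6 cancels in the ratio, so the result is just 8 × 30. The implied ~6.7 tokens per parameter for GPT-2 is also consistent with the "6-7 tokens/param" figure quoted elsewhere in the thread.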

Hugh Zhang@hughbzhang

The bitter lesson says that the thing that matters is finding the right thing to scale up. It does not say that scaling up the wrong recipe will always beat a slightly less scaled up, but more correct recipe.

Hugh Zhang@hughbzhang

GPT-2 is ~1000x smaller than frontier models (DeepSeek V4 is 1.6T, others rumored even bigger), undertrained (6-7 tokens/param vs Chinchilla's 20), on worse data (Reddit outlinks instead of web scrapes), and uses a 1024-token context (impossible to have a long chain of thought).

Hugh Zhang@hughbzhang

So GPT-2, while again amazing for its time, is not a good “pretrain” by modern standards. Nevertheless, if you know even a rudimentary version of what we now know about RL, you can match or exceed performance from models that were pre-trained with significantly more compute!
