A question I’ve been pondering: what if we'd known about o1 / RL on chain-of-thought back in the early days of LLMs?
It turns out SFT + a bit of RL on GPT-2 almost matches the performance of a fine-tuned GPT-3 (12B), a model with >100x the pre-training compute, on GSM8K.
