Hugh Zhang used SFT and RL to train GPT-2 1.5B to 24% GSM8K accuracy, matching a fine-tuned GPT-3 12B · Digg