It's obvious that eventually a speedrun for RL will stick.
I currently think the biggest bottleneck is price: an individual entry is too noisy because RL is unstable, so each entry needs multiple seeds, which puts the cost at O($100).
Glad to see attempts!
With RSI around the corner, it's time for an RL speedrun.
Introducing Sokoban Speedrun: training Qwen3-4B-Instruct with RL to solve Sokoban puzzles.
We start by modding @karpathy’s nanochat RL pipeline; the GRPO baseline takes 87 minutes on 8×H100s. 1/
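
As a rough sanity check on the O($100) figure, here is a back-of-the-envelope sketch that uses the 87-minute, 8×H100 baseline quoted above. The hourly GPU price and the seed count are assumptions chosen for illustration, not numbers from the thread, and `run_cost_usd` is a hypothetical helper.

```python
def run_cost_usd(minutes: float, gpus: int, usd_per_gpu_hour: float) -> float:
    """Cost of one training run, billed per GPU-hour."""
    gpu_hours = minutes / 60 * gpus
    return gpu_hours * usd_per_gpu_hour


# From the thread: the GRPO baseline takes 87 minutes on 8 H100s.
BASELINE_MINUTES = 87
NUM_GPUS = 8

# Assumptions (not from the thread): rental price and number of seeds.
ASSUMED_USD_PER_GPU_HOUR = 3.0
ASSUMED_SEEDS = 3

per_run = run_cost_usd(BASELINE_MINUTES, NUM_GPUS, ASSUMED_USD_PER_GPU_HOUR)
total = per_run * ASSUMED_SEEDS

print(f"one run: ${per_run:.2f}, {ASSUMED_SEEDS} seeds: ${total:.2f}")
```

Under these assumptions a single run costs roughly $35 and three seeds come to about $100, so the order of magnitude holds at baseline speed. A faster entry would cost proportionally less.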