Jean Kaddour releases Sokoban Speedrun, an RL benchmark that fine-tunes Qwen3-4B-Instruct in 87 minutes using GRPO

Story Overview

Jean Kaddour has released Sokoban Speedrun, a new RL benchmark that fine-tunes the Qwen3-4B-Instruct model on Sokoban puzzles until held-out pass@1 rises from 57 percent to above 80 percent. The provided GRPO baseline, built by modifying Andrej Karpathy's nanochat RL pipeline, completes the full run in 87 minutes on eight H100 GPUs and lifts pass@1 from 57 percent to roughly 89 percent.
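For readers unfamiliar with GRPO, the core idea can be sketched in a few lines: several solutions are sampled for the same puzzle, and each one is scored relative to the others in its group rather than against a learned value function. The snippet below is a minimal illustration of that group-relative advantage, not the benchmark's actual code. The function name, the binary reward values, and the choice to divide by the standard deviation are assumptions; some implementations only subtract the group mean.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled completion relative to its own group.

    rewards: one scalar reward per completion sampled for the SAME prompt.
    Returns (r - group_mean) / (group_std + eps) for every reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four attempts at one puzzle, reward 1.0 if solved.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # solved attempts get positive values, failures negative

# If every attempt gets the same reward, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Solved attempts are pushed up and failed attempts pushed down, and a group where every attempt succeeds or every attempt fails contributes zero advantage.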

Original post

Nathan Lambert@natolambert

It's obvious that eventually a speedrun for RL will stick.

I currently think the biggest bottleneck is price, as an individual entry currently has too much noise from instability of RL, so running multiple seeds makes it cost O($100).

Glad to see attempts!

Jean Kaddour@jeankaddour

With RSI around the corner, it's time for an RL speedrun.

Introducing Sokoban Speedrun: training Qwen3-4B-Instruct with RL to solve Sokoban puzzles.

We start by modding @karpathy’s nanochat RL pipeline; the GRPO baseline takes 87 minutes on 8×H100s. 1/

7:25 AM · Jun 19, 2026 · 21.1K Views
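The O($100) figure in Lambert's post can be sanity-checked against the 87-minute, 8×H100 baseline with simple arithmetic. The sketch below does that. The hourly GPU price and the seed counts are assumptions chosen for illustration, not numbers from either post, and a faster entry would cost proportionally less.

```python
def run_cost_usd(num_gpus, minutes, usd_per_gpu_hour):
    """Cost of one training run: GPU-hours times an hourly rental price."""
    gpu_hours = num_gpus * minutes / 60
    return gpu_hours * usd_per_gpu_hour


# From the post: the GRPO baseline takes 87 minutes on 8 H100s.
NUM_GPUS = 8
BASELINE_MINUTES = 87

# ASSUMPTION: hypothetical rental price per H100-hour; real prices vary widely.
ASSUMED_USD_PER_H100_HOUR = 2.50

single_run = run_cost_usd(NUM_GPUS, BASELINE_MINUTES, ASSUMED_USD_PER_H100_HOUR)
print(f"GPU-hours per run: {NUM_GPUS * BASELINE_MINUTES / 60:.1f}")
print(f"One run:  ${single_run:.0f}")

# ASSUMPTION: 3 to 5 seeds to average out RL instability.
for seeds in (3, 5):
    print(f"{seeds} seeds: ${seeds * single_run:.0f}")
```

Under these assumed prices a single baseline-length run lands in the tens of dollars, and a handful of seeds pushes it to roughly the order of $100.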

Developer Impact

Fixed constraints leave room for clever recipes

The benchmark locks the model, datasets, reward function, and hardware while leaving algorithms, schedulers, and rollout engines wide open. That setup invites fast experimentation on post-training tricks without letting anyone sneak in puzzle-specific shortcuts.

Open Question

Speedrun format targets quick iteration loops

By scoring entries on the wall-clock time needed to reach the pass@1 threshold (above 80 percent, per the announcement), the project pushes researchers toward recipes that train efficiently rather than ones that only chase final scores. A public leaderboard and verification process keep submissions comparable across runs.
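The scoring rule described here can be sketched as two small functions: one that computes pass@1 as the fraction of held-out puzzles solved with a single attempt each, and one that reports the first evaluation time at which that score exceeds the threshold. This is an illustrative sketch only. The function names, the evaluation log format, and the example numbers are hypothetical, and the benchmark's real evaluation and verification procedure may differ.

```python
def pass_at_1(solved):
    """Fraction of puzzles solved when each puzzle gets exactly one attempt."""
    return sum(solved) / len(solved)


def time_to_threshold(eval_log, threshold=0.80):
    """Return the first elapsed time whose pass@1 is strictly above threshold.

    eval_log: (elapsed_minutes, pass_at_1) pairs in chronological order.
    Returns None if the run never clears the threshold.
    """
    for elapsed_minutes, score in eval_log:
        if score > threshold:
            return elapsed_minutes
    return None


# Hypothetical held-out results: True means the single attempt solved the puzzle.
print(pass_at_1([True, True, False, True, True]))  # 0.8

# Hypothetical evaluation log for one run (minutes, pass@1).
log = [(0, 0.57), (30, 0.72), (60, 0.79), (75, 0.83)]
print(time_to_threshold(log))  # 75
```

Because the comparison is strict, a run that plateaus at exactly 80 percent would not count as finished under this sketch, matching the ">80%" wording in the announcement thread.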

Sentiment

Positive users praise Sokoban speedrun benchmarks for enabling efficient RL training on tiny models with strong sim2real potential, while negative users doubt their generality and criticize information leakage or model degradation.

Positive: 55.9%. Negative: 44.1%. Based on 21 comments with sentiment.

Posts from X

Joseph Suarez 🐡@jsuarez

Puzzles of this complexity are solved in <10 seconds with a tiny model in PufferLib, with <50k parameters instead of 4B

Joseph Suarez 🐡@jsuarez

Two of our most important benchmarks going forward in PufferLib's research are mazes and Sokoban. Sokoban may look impressive to some people (the actually hard levels at least), but mazes definitely don't. It's one of the first problems you solve in most intro CS classes introducing search algorithms. But it's still a really good benchmark for RL! The problem is that it's also really, really easy to cheat in information. I am confident that we are disciplined enough in our research to avoid doing this. But it does mean that there is a significant lag between our major breakthroughs and you seeing actually impressive demos. We have to tune, validate, and polish on our internal tasks that look pretty dumb for months before we get to show it off on flashier problems that, from a learning perspective, are actually easier than just solving mazes.

Joseph Suarez 🐡@jsuarez

@natolambert Then they have to pick one that isn't pathetic. We use Sokoban as one of our most important benchmarks right now. Puzzles of this complexity are solved in <10 seconds with 50k params, O($0.001)

kache@yacineMTB

@jeankaddour @karpathy you could do this with an RNN and a single 4090

murat 🍥@mayfer

@jsuarez yes but i think the point is specifically to add specific capability to a generally capable model

Joseph Suarez 🐡@jsuarez

@natolambert Actually, we already have one of these. Min wall-clock time to solve breakout in PufferLib. It's 4.5s on 4.0, around 3s in our best 5.0 build so far. Get it under 1s without making the algo worse on held out envs!

Jean Kaddour@jeankaddour

We start with easy, quick puzzles and will expand to harder ones in the future.

Huge shoutout to @kellerjordan0 and @industriaalist for pioneering amazing speedruns.

Link: https://github.com/JeanKaddour/sokoban_speedrun

Play Sokoban: https://www.jeankaddour.com/sokoban

Jean Kaddour@jeankaddour

The goal is simple: achieve the fastest wall-clock time to lift pass@1 from 57% to >80%.

You can tweak the RL algorithm, rollout engine, etc.

Almost anything goes, as long as it’s not Sokoban-specific. 2/

Jean Kaddour@jeankaddour

@yacineMTB @karpathy You mean training from scratch?

murat 🍥@mayfer

@jsuarez yes but these generally (in)capable models are what led to mythos/fable quite directly so it's clearly worthwhile to research RL on them. chess ELO on LLMs is just as meaningful for the same reason even tho brute forcing moves works just fine

kache@yacineMTB

@rayjyotir5 @jeankaddour @karpathy what do you mean.. this is trivial...

Josh@JoshPurtell

@jeankaddour @karpathy Why not pick a task that uses a 2B and 2 gpus?

qs400.3@huhwenjie

@jeankaddour @karpathy @grok Tell me the details of karpathy's nanochat RL pipeline

Joseph Suarez 🐡@jsuarez

@rayjyotir5 http://puffer.ai. It's open source. Boxoban in 4.0

Dan Advantage@DanAdvantage

@mayfer @jsuarez we know language models scale with compute, though. i agree it's worthwhile but ultimately the answer lies in some combination of language models and other forms of "intelligence." pufferlib is one such intelligence

xlr8harder@xlr8harder

@jeankaddour @karpathy Cool idea! Looking forward to seeing how it goes.

rarply@rarply

@mayfer @jsuarez But we know that works. GPT-3 proved that multi-task learning works.

But why choose these rl tasks? It’s been almost two years of the o1 style rl. Is life at the LLM labs just going to be an endless cycle of inventing rl tasks until they retire?

Jyotirmoy Ray@rayjyotir5

@yacineMTB @jeankaddour @karpathy @yacineMTB how would you set this RL env up to scale to millions of sps? This is discrete event and not physics timesteps

Jean Kaddour@jeankaddour

@kellerjordan0 @industriaalist And special shoutout to @__josh_harris__ who built the first RL speedrun and generously shared plenty of advice and lessons: https://joshuaharrissite.substack.com/p/nanorl

Denis@YouFollowDenis

@jeankaddour @karpathy I actually got a bit addicted to this game lol
