Jean Kaddour releases Sokoban Speedrun, an RL benchmark that fine-tunes Qwen3-4B-Instruct in 87 minutes using GRPO · Digg