6h ago

Developer Releases Synthetic Self-Improve RL Tool for Small Model Post-Training

Original post

releasing /synthetic-self-improve-rl. claude code (teacher) skill that designs/writes the synthetic data, env and rewards to post-train a smaller model (student).

it post-trains the student on a real dataset, reads its failure traces, then writes the synthetic data, the verifiers env and the reward function to patch the gaps. re-trains. loops.

loop:
-> baseline on real data
-> analyze low-reward rollouts
-> generate ~500-1000 row synthetic dataset
-> write a verifiers env + rubric around it
-> resume from the post-trained checkpoint
-> eval on the real test split
-> keep what helps, iterate on what doesn't

1. result: qwen3-0.6B-base on gsm8k. 700 synth rows bumped it from 0.7854 -> 0.8158 on the full test set.
2. run it for any wall-clock budget or iteration cap you set. the loop keeps running until the budget expires.
3. built on @willccbb verifiers and @PrimeIntellect for training. works on any env that has a train and eval dataset.

p.s. still figuring out what to call this. feels adjacent to @karpathy autoresearch or synthetic envs?
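The loop described in the post can be sketched roughly as below. This is a minimal illustration, not the tool's actual code: `self_improve_loop`, `LoopState`, and all four callables (`evaluate`, `collect_failures`, `build_synthetic_env`, `train_from`) are hypothetical names, and the real skill's interfaces to verifiers and the training backend are not shown in the post. The "keep what helps" step is assumed here to mean keeping a new checkpoint only when its eval score beats the best so far.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class LoopState:
    """Best checkpoint so far plus a log of every attempt (hypothetical)."""
    checkpoint: str
    best_score: float
    history: List[Tuple[int, float, bool]] = field(default_factory=list)


def self_improve_loop(
    baseline_checkpoint: str,
    evaluate: Callable[[str], float],              # score on the real eval split
    collect_failures: Callable[[str], list],       # low-reward rollouts
    build_synthetic_env: Callable[[list], Any],    # synthetic data + env + rubric
    train_from: Callable[[str, Any], str],         # resume training, return new checkpoint
    max_iterations: Optional[int] = None,
    budget_seconds: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> LoopState:
    if max_iterations is None and budget_seconds is None:
        raise ValueError("set max_iterations, budget_seconds, or both")

    start = clock()
    state = LoopState(baseline_checkpoint, evaluate(baseline_checkpoint))
    iteration = 0
    while True:
        # Stop on whichever limit is reached first.
        if max_iterations is not None and iteration >= max_iterations:
            break
        if budget_seconds is not None and clock() - start >= budget_seconds:
            break

        failures = collect_failures(state.checkpoint)
        env = build_synthetic_env(failures)
        candidate = train_from(state.checkpoint, env)
        score = evaluate(candidate)

        # Assumption: a candidate is kept only if it strictly improves the score.
        kept = score > state.best_score
        state.history.append((iteration, score, kept))
        if kept:
            state.checkpoint, state.best_score = candidate, score
        iteration += 1
    return state
```

In this sketch a rejected candidate is discarded and the next iteration resumes from the last kept checkpoint, which is one plausible reading of "keep what helps, iterate on what doesn't". The budget is only checked between iterations, so a single long training run can overshoot it.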

10:38 AM · May 20, 2026