This is quite nice. I've been looking for something in the vein of the old learning-to-search work (http://hunch.net/~l2s).
The really nice thing here is that, in some cases, this kind of change can yield gains that are exponential in the number of turns.
Ankur Samanta (@Ankur_Samanta_)
🚀New work on credit assignment in multi-step reasoning RL post-training🚀 Introducing Self-Reset Policy Optimization (SRPO): i) localize the first wrong reasoning step, ii) reset to that step, iii) learn from counterfactual continuations from there – no external supervision.🧵
2:46 PM · Jun 26, 2026
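
To make the "exponential in the number of turns" point concrete, here is a toy sketch. It is not SRPO and not taken from the paper. It assumes a hypothetical task with `k` choices per step and a single correct sequence, and it assumes an oracle that reveals whether the current step is right, which the real method would have to replace with its own self-localization. Under those assumptions, blind search over whole sequences costs up to `k**T` tries, while resetting to the first wrong step and retrying only that step costs at most `k*T`.

```python
from itertools import product


def tries_without_reset(target, k):
    """Enumerate whole sequences until one matches the full target.

    Feedback arrives only at the end, so every miss restarts from step 0.
    """
    for n, cand in enumerate(product(range(k), repeat=len(target)), start=1):
        if list(cand) == target:
            return n


def tries_with_reset(target, k):
    """Keep the correct prefix, reset to the first wrong step, retry that step.

    Assumption: a hypothetical oracle tells us whether the step is correct.
    """
    tries = 0
    for correct in target:
        for action in range(k):
            tries += 1
            if action == correct:
                break  # step fixed; move on to the next one
    return tries


target = [2, 2, 2, 2, 2, 2]  # hypothetical 6-step task, worst case for k = 3
print(tries_without_reset(target, 3))  # 729 = 3**6
print(tries_with_reset(target, 3))     # 18  = 3*6
```

The gap between the two counts grows as `k**T / (k*T)`, which is the sense in which localizing the error and resetting can help exponentially with the number of steps in this toy setting.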
