Microsoft Research's John Langford links the new Self-Reset Policy Optimization method to 2015 'learning to search' frameworks

Original post

This is quite nice. I've been looking for something in the vein of the old learning to search work (http://hunch.net/~l2s).

The really nice thing here is that these kinds of techniques can, in some cases, yield improvements that are exponential in the number of turns.

Ankur Samanta (@Ankur_Samanta_)

🚀New work on credit assignment in multi-step reasoning RL post-training🚀 Introducing Self-Reset Policy Optimization (SRPO): i) localize the first wrong reasoning step, ii) reset to that step, iii) learn from counterfactual continuations from there – no external supervision.🧵

2:46 PM · Jun 26, 2026
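To make the "exponential in the number of turns" remark concrete, here is a toy calculation, not SRPO itself. It assumes every reasoning step is independently correct with probability `p`, and that the first wrong step can be localized perfectly. Both are simplifying assumptions made for illustration, and the function names are hypothetical. Under those assumptions, restarting the whole chain after any mistake costs a number of step attempts that grows exponentially with the number of turns, while resetting to the failing step costs a number that grows only linearly.

```python
def _check(p: float, turns: int) -> None:
    """Validate the toy model's parameters."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if turns < 1:
        raise ValueError("turns must be at least 1")


def expected_steps_full_restart(p: float, turns: int) -> float:
    """Expected step attempts to get `turns` correct steps in a row
    when any wrong step forces a restart from the first step.

    This is the standard expected waiting time for a run of
    `turns` consecutive successes in independent Bernoulli(p) trials.
    """
    _check(p, turns)
    if p == 1.0:
        return float(turns)
    return (1.0 - p**turns) / ((1.0 - p) * p**turns)


def expected_steps_with_reset(p: float, turns: int) -> float:
    """Expected step attempts when a wrong step is retried in place.

    Assumes (for illustration) that the first wrong step is always
    identified, so each step costs 1/p attempts on average.
    """
    _check(p, turns)
    return turns / p


if __name__ == "__main__":
    p = 0.5  # assumed per-step success probability
    for turns in (2, 5, 10, 20):
        restart = expected_steps_full_restart(p, turns)
        reset = expected_steps_with_reset(p, turns)
        print(f"turns={turns:2d}  full restart={restart:12.1f}  reset={reset:6.1f}")
```

The gap between the two columns widens rapidly as `turns` grows, which is the sense in which a reset-style method can, in favorable cases, give an exponential saving. Real reasoning steps are not independent and localization is imperfect, so this should be read as intuition rather than as a prediction about SRPO's measured gains.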