Grading an agent step by step is hard: you can't Monte Carlo irreversible actions, hand-labeling is prohibitive, and dedicated PRMs don't transfer across tasks.
Check out this work, which shows that a free step-level grader called "Progress Advantage" hides in every RL training run. It's pretty cool and can be used for test-time scaling, uncertainty quantification, and failure attribution.
Soon after we released the work on arXiv, it was already featured as a podcast: https://paperdive.ai/episodes/173-neglected-free-lunch-from-post-training-progress-advantage-f.html (thanks to paperdive)