
Progress Advantage Delivers Ready-To-Use Process Reward Models After RL Training


Original post

Changdae Oh ✈️ ACL 2026@Changdae_Oh

Outcome reward models: cheap, but vulnerable to spurious shortcuts 😣 Process reward models (PRMs): robust, but too expensive to build from scratch 😫

What if you could get a ready-to-use PRM right after any RL post-training? Introducing 'Progress Advantage' 🧵

4:31 PM · Jun 26, 2026 · 7.1K Views

Sentiment

Commenters praise Progress Advantage for delivering ready-to-use process reward models right after RL training, calling it practical work that saves substantial time and effort. The author credits collaborators for the joint effort.

Positive: 100.0% · Negative: 0.0% (4 comments with sentiment)



Posts from X


Sharon Li@SharonYixuanLi

Grading an agent step by step is hard: you can't Monte Carlo irreversible actions, hand-labeling is prohibitive, and dedicated PRMs don't transfer across tasks.

Check out this work, which shows that a free step-level grader called "Progress Advantage" hides in every RL training run. It's pretty cool and can be used for test-time scaling, uncertainty quantification, and failure attribution.

Soon after we released the work on arXiv, it was already featured as a podcast: https://paperdive.ai/episodes/173-neglected-free-lunch-from-post-training-progress-advantage-f.html (thanks to paperdive)



Changdae Oh ✈️ ACL 2026@Changdae_Oh

@Wendi_Li_ @seongheon_96 @Samuel861025 @tanwimallick @SharonYixuanLi
Paper: https://arxiv.org/abs/2606.26080
Code: https://github.com/deeplearning-wisc/progress-advantage


Changdae Oh ✈️ ACL 2026@Changdae_Oh

@Wendi_Li_ @seongheon_96 @Samuel861025 @tanwimallick @SharonYixuanLi and thanks to @JiatongLi0418 @LeitianT @sang_yun_lee @jiaying_fang0 for their insightful comments on the draft🙏


Changdae Oh ✈️ ACL 2026@Changdae_Oh

This was a joint effort with many brilliant collaborators. Huge thanks to @Wendi_Li_ @seongheon_96 @Samuel861025 @tanwimallick @SharonYixuanLi for the guidance and support throughout🥰


Changdae Oh ✈️ ACL 2026@Changdae_Oh

Why are agent PRMs so hard?

Agent trajectories span hundreds of steps and pass through irreversible actions — sending an email, deleting a file. That breaks the Monte Carlo rollouts used in non-agentic reasoning tasks, and per-step human annotation is prohibitively expensive.


Changdae Oh ✈️ ACL 2026@Changdae_Oh

Key insight: the log-probability ratio between an RL-trained policy and its reference policy exactly recovers the optimal advantage function under a stochastic MDP.

We call this progress advantage — a step-level signal for whether the agent is making progress toward the goal.
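A minimal sketch of how such a step-level score could be computed from two checkpoints, assuming you already have per-token log-probabilities of each agent action under both the RL-trained policy and its reference. The function name `progress_advantage`, the `beta` scaling, and the choice to sum the log-ratio over an action's tokens are illustrative assumptions, not details confirmed by the thread.

```python
from typing import List, Sequence


def progress_advantage(
    policy_logprobs: Sequence[Sequence[float]],
    reference_logprobs: Sequence[Sequence[float]],
    beta: float = 1.0,
) -> List[float]:
    """Score each agent step by the token-summed log-ratio of two checkpoints.

    policy_logprobs[i][j] is the log-probability of token j of step i under
    the RL-trained policy; reference_logprobs uses the reference policy.
    beta is a hypothetical scale standing in for a KL coefficient.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both checkpoints must score the same number of steps")

    scores = []
    for step_policy, step_reference in zip(policy_logprobs, reference_logprobs):
        if len(step_policy) != len(step_reference):
            raise ValueError("token counts differ within a step")
        # log pi(a|s) - log pi_ref(a|s), summed over the tokens of the action
        log_ratio = sum(p - r for p, r in zip(step_policy, step_reference))
        scores.append(beta * log_ratio)
    return scores


# Toy numbers, made up for illustration: two steps of two tokens each.
policy = [[-0.2, -0.1], [-1.5, -0.9]]
reference = [[-0.7, -0.4], [-0.5, -0.4]]
scores = progress_advantage(policy, reference)
```

With these toy inputs the first step scores positive (the trained policy prefers the action more than the reference does) and the second scores negative, which is the sense in which the signal reads as "progress" or "regress".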


Changdae Oh ✈️ ACL 2026@Changdae_Oh

Prior DPO-style work used similar likelihood signals, but only in deterministic reasoning settings. Agents inject stochastic transitions (tool outputs, user replies) that break those interpretations.


Changdae Oh ✈️ ACL 2026@Changdae_Oh

It's:
– Annotation-free: just needs checkpoint pairs (base + final) that already exist after post-training
– Domain-agnostic: no task-specific retraining
– General: valid across most RL algorithms, from explicit-KL methods like GRPO to KL-free ones like PPO/DAPO (Proposition 2)


Changdae Oh ✈️ ACL 2026@Changdae_Oh

Step-level evaluation of LLM agents via process reward models (PRMs) is incredibly useful — but notoriously hard to build in agentic settings.

We make a simple claim: you don't need to build one. RL post-training is already handing you that signal for free.


Changdae Oh ✈️ ACL 2026@Changdae_Oh

Our trick: Don't try to recover the absolute reward — target the advantage. The log-ratio then naturally absorbs the expected future values. (Proposition 1)
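To see why targeting the advantage sidesteps the expectation over future states, here is the standard KL-regularized RL identity, written as a sketch. The symbols $\beta$ (KL coefficient), $\gamma$ (discount), $r$ (reward), and $P$ (transition kernel) are assumptions introduced for illustration, and this textbook form may differ in detail from the paper's Proposition 1.

```latex
% Soft optimal values under stochastic transitions P(s' | s, a)
Q^*(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[ V^*(s') \big],
\qquad
V^*(s) = \beta \log \sum_{a} \pi_{\mathrm{ref}}(a \mid s) \exp\!\big( Q^*(s,a) / \beta \big)

% Optimal policy of the KL-regularized objective
\pi^*(a \mid s) = \pi_{\mathrm{ref}}(a \mid s) \exp\!\Big( \tfrac{Q^*(s,a) - V^*(s)}{\beta} \Big)

% Taking logs: the log-ratio is the advantage, with no expectation left to estimate
\beta \log \frac{\pi^*(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)} = Q^*(s,a) - V^*(s) = A^*(s,a)

% Recovering the reward instead would still require the expected next-state value
r(s,a) = \beta \log \frac{\pi^*(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)} + V^*(s) - \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[ V^*(s') \big]
```

The last line shows the contrast: the reward form keeps an expectation over stochastic transitions, while the advantage form has it folded into $Q^*$.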


Changdae Oh ✈️ ACL 2026@Changdae_Oh

Validated across 3 applications: test-time scaling, uncertainty quantification, and failure attribution. All with zero task-specific training.
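As a rough illustration of two of these uses, the sketch below takes step-level scores that have already been computed and applies them to candidate selection (one simple form of test-time scaling) and to failure attribution. The helper names `pick_best_candidate` and `attribute_failure`, and the plain argmax/argmin rules, are hypothetical simplifications rather than the procedures used in the paper; uncertainty quantification is left out because the thread gives no detail on it.

```python
from typing import Sequence, Tuple


def pick_best_candidate(candidate_scores: Sequence[float]) -> int:
    """Return the index of the candidate action with the highest step score."""
    if not candidate_scores:
        raise ValueError("need at least one candidate")
    return max(range(len(candidate_scores)), key=lambda i: candidate_scores[i])


def attribute_failure(step_scores: Sequence[float]) -> Tuple[int, float]:
    """Return (index, score) of the lowest-scoring step in a failed trajectory."""
    if not step_scores:
        raise ValueError("need at least one step")
    worst = min(range(len(step_scores)), key=lambda i: step_scores[i])
    return worst, step_scores[worst]


# Toy scores, made up for illustration.
best = pick_best_candidate([0.1, 0.8, -0.3])           # three sampled next actions
culprit = attribute_failure([0.8, 0.2, -1.5, 0.1])     # one four-step trajectory
```

Here `best` points at the second candidate and `culprit` at the third step, the one with the most negative score.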


Changdae Oh ✈️ ACL 2026@Changdae_Oh

The takeaway – there was a "free lunch" sitting inside your post-training pipeline this whole time.

No dedicated reward model training, no process labels. Just an RL checkpoint pair, and you can score & monitor agents at the step level.


Nick Venturi@nickventuri

@Changdae_Oh this saves a shitload of time


Rami Sufian@Rami_Bball_Fan

@Changdae_Oh This is the kind of practical AI work I want to see more of. If Progress Advantage gives you usable PRM signals right after RL post-training, that’s a big deal.