Continuous reward models can measure partial progress on a task, unlike binary rewards (e.g. RLVR). But we find they inevitably assign wildly different scores to equally-good responses, which can lead to bad policies. Surprisingly, dense rewards are often better if discretized!🧵
Graham Neubig finds continuous reward models introduce noise and recommends discretizing dense rewards to stabilize LLM training
Continuous models assign inconsistent scores to equivalent responses.
Check out our new work on better understanding reward models for RL to reduce sensitivity to spurious differences!
This helps more robustly learn LLMs from imperfect reward signals.

We call this phenomenon "oversensitivity". We find that many top reward models are highly oversensitive. In fact, they are *so rich* that you can discretize them (reducing oversensitivity) while retaining most of their ability to distinguish good responses from bad

Benchmarks like RewardBench measure reward models almost solely on "distinguishing good from bad responses"*. We suggest adding a second axis: minimizing oversensitivity.
(* the sole exception is RewardBench 2's "Ties" subset, which is a small fraction of their overall score)

Instead of a continuous spectrum of useful ratings, the differences within groups of equally-good responses may be learnable noise. Reward hacking on this noise might explain why RL can cause models to frequently mention goblins (weird) or endlessly agree with the user (bad!!)

Our work aims to provide an improved answer to a question posed a year ago by Noam Razin et al.: “what makes a reward model a good teacher?” They argue you want a reward model to show variance over the response space. We argue it matters _where_ the variance comes from.

We introduce a new algorithm for discretizing any neural reward model at inference time. We estimate each reward's uncertainty using Monte Carlo dropout and then cluster each batch to group responses that are effectively tied. The discretized reward is a drop-in replacement for the raw score in RL from reward models.
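To make the idea concrete, here is a minimal sketch of uncertainty-aware tie grouping. It is not the algorithm from the paper: the function name `discretize_batch`, the threshold `z`, and the adjacent-gap merging rule are all assumptions chosen for illustration, and the Monte Carlo dropout samples are simulated with random noise rather than taken from a real reward model.

```python
import numpy as np

def discretize_batch(samples, z=1.0):
    """Group responses whose mean scores are within their combined uncertainty.

    samples: array of shape (n_passes, n_responses) holding stochastic reward
    scores, e.g. repeated forward passes with dropout left on. Needs at least
    two passes and a non-empty batch. Returns (tied_rewards, labels).
    """
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    std = samples.std(axis=0, ddof=1)      # per-response uncertainty estimate
    order = np.argsort(mean)               # walk responses from lowest to highest
    labels = np.empty(len(mean), dtype=int)
    cluster = 0
    labels[order[0]] = cluster
    for prev, cur in zip(order[:-1], order[1:]):
        gap = mean[cur] - mean[prev]
        noise = z * np.hypot(std[prev], std[cur])
        if gap > noise:                    # gap exceeds the noise: start a new tier
            cluster += 1
        labels[cur] = cluster
    # Every response in a tier receives the tier's average score.
    tier_means = np.array([mean[labels == k].mean() for k in range(cluster + 1)])
    return tier_means[labels], labels

# Simulated stand-in for 16 dropout passes over a batch of 5 responses.
rng = np.random.default_rng(0)
true_scores = np.array([0.20, 0.22, 0.80, 0.21, 0.79])
samples = true_scores + rng.normal(0.0, 0.05, size=(16, 5))
tied, labels = discretize_batch(samples)
print(labels, tied)
```

Merging on adjacent gaps is a single-linkage rule, so a long chain of closely spaced scores can collapse into one tier; a real implementation might cap tier width or use a different clustering method.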

Could oversensitivity explain why rubrics are useful for RL? Our earlier work on rubric-based RL (https://arxiv.org/abs/2507.18624) showed that rubrics weren't as good as reward models on RewardBench, but *were* good teachers for RL. Maybe rubrics, by design, manage oversensitivity.

This leads to less reward hacking and more reliable RL, even with strong reward models. The same concept applies to any imperfect reward (LM-as-a-judge, test case pass rate, rubrics; maybe all rewards are "imperfect rewards" in practice?)

We also show that discretizing reward models makes sense in theory. If we assume a reward model is a perfect (discrete) utility function plus continuous noise, then there always exist discretizations that improve the tradeoff between oversensitivity and "discriminative ability"
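A toy simulation can illustrate that setup. This is a sketch under stated assumptions, not the paper's analysis: the true utility takes integer levels 0, 1, 2, the noise is Gaussian with standard deviation 0.1, rounding is used as the discretization because the levels are integers by construction, and the two metric functions are illustrative stand-ins for "oversensitivity" and "discriminative ability" rather than the paper's definitions.

```python
import numpy as np

def oversensitivity(scores, utility):
    """Average within-tie spread: std of scores among equally good responses."""
    return float(np.mean([scores[utility == u].std() for u in np.unique(utility)]))

def pairwise_accuracy(scores, utility):
    """Fraction of pairs with different true utility that the scores rank correctly."""
    du = utility[:, None] - utility[None, :]
    ds = scores[:, None] - scores[None, :]
    better = du > 0                        # pairs where the row response is truly better
    return float((ds[better] > 0).mean())

rng = np.random.default_rng(0)
utility = rng.integers(0, 3, size=300)                 # perfect discrete utility
noisy = utility + rng.normal(0.0, 0.1, size=300)       # continuous reward = utility + noise
discretized = np.round(noisy)                          # one possible discretization

for name, scores in [("continuous", noisy), ("discretized", discretized)]:
    print(name, oversensitivity(scores, utility), pairwise_accuracy(scores, utility))
```

In this idealized case the noise is small relative to the spacing between utility levels, so rounding removes within-tie variation without giving up ranking ability. With larger noise, some responses land in the wrong tier and the tradeoff becomes a real one.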

I did this work during an internship with @yuning_pro at Meta, alongside an amazing team there (@nagpalchirag, @ShiqiWang10, and @devamanyu) and my advisors at CMU, @tongshuangwu and @gneubig. 📄 https://arxiv.org/abs/2606.21795