Graham Neubig finds continuous reward models introduce noise and recommends discretizing dense rewards to stabilize LLM training

Continuous models assign inconsistent scores to equivalent responses.

Original post

Vijay V.@vijaytarian

Continuous reward models can measure partial progress on a task, unlike binary rewards (e.g. RLVR). But we find they inevitably assign wildly different scores to equally-good responses, which can lead to bad policies. Surprisingly, dense rewards are often better if discretized!🧵

8:10 AM · Jun 23, 2026 · 2.9K Views

Posts from X

Graham Neubig@gneubig

Check out our new work on better understanding reward models for RL to reduce sensitivity to spurious differences!

This helps more robustly learn LLMs from imperfect reward signals.

Vijay V.@vijaytarian

We call this phenomenon "oversensitivity". We find that many top reward models are highly oversensitive. In fact, they are *so rich* that you can discretize them (reducing oversensitivity) while retaining most of their ability to distinguish good responses from bad

Vijay V.@vijaytarian

Benchmarks like RewardBench measure reward models almost solely on "distinguishing good from bad responses"*. We suggest adding a second axis: minimizing oversensitivity.

(* the sole exception is RewardBench 2's "Ties" subset, which is a small fraction of their overall score)

Vijay V.@vijaytarian

Instead of a continuous spectrum of useful ratings, the differences within groups of equally-good responses may be learnable noise. Reward hacking on this noise might explain why RL can cause models to frequently mention goblins (weird) or endlessly agree with the user (bad!!)

Vijay V.@vijaytarian

Our work aims to provide an improved answer to a question posed a year ago by Noam Razin et al.: “what makes a reward model a good teacher?” They argue you want a reward model to show variance over the response space. We argue it matters _where_ the variance comes from.

Vijay V.@vijaytarian

We introduce a new algorithm for discretizing any neural reward model at inference time. We estimate each reward's uncertainty using Monte Carlo dropout and then cluster each batch to group responses that are effectively tied. It's a drop-in replacement for RL from reward models.
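
The idea of grouping effectively tied responses can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the Monte Carlo dropout passes have already been run and are passed in as a list of sampled scores per response, and it swaps in a simple one-dimensional gap rule for the clustering step. The function name `discretize_batch` and the threshold parameter `z` are hypothetical.

```python
import statistics


def discretize_batch(samples, z=1.0):
    """Return one reward per response, with effectively tied responses sharing a value.

    samples[i] holds several stochastic scores for response i, e.g. from
    repeated reward-model forward passes with dropout left on (assumed input).
    z is a hypothetical threshold on how many standard deviations count as a tie.
    """
    if not samples:
        return []
    means = [statistics.fmean(s) for s in samples]
    stds = [statistics.stdev(s) for s in samples]  # needs >= 2 samples each

    # Walk the responses in order of mean score and merge neighbours whose
    # gap is within their combined uncertainty.
    order = sorted(range(len(samples)), key=lambda i: means[i])
    groups = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        gap = means[cur] - means[prev]
        if gap <= z * (stds[prev] + stds[cur]):
            groups[-1].append(cur)
        else:
            groups.append([cur])

    # Every member of a group receives the group's average score.
    rewards = [0.0] * len(samples)
    for group in groups:
        level = statistics.fmean([means[i] for i in group])
        for i in group:
            rewards[i] = level
    return rewards


scores = discretize_batch([
    [0.80, 0.82, 0.78],  # response 0
    [0.81, 0.79, 0.83],  # response 1, within noise of response 0
    [0.20, 0.22, 0.18],  # response 2, clearly worse
])
```

Because neighbours are merged one at a time, a long chain of slightly different scores can collapse into a single group; a real implementation would likely need a more careful clustering rule.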

Vijay V.@vijaytarian

Could oversensitivity explain why rubrics are useful for RL? Our earlier work on rubric-based RL (https://arxiv.org/abs/2507.18624) showed that rubrics aren’t as good as reward models on RewardBench, but *were* good teachers for RL. Maybe rubrics, by design, manage oversensitivity.

Vijay V.@vijaytarian

This leads to less reward hacking and more reliable RL, even with strong reward models. The same concept applies to any imperfect reward (LM-as-a-judge, test case pass rate, rubrics; maybe all rewards are "imperfect rewards" in practice?)
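
As one illustration of applying the same concept to a non-neural reward, a test case pass rate could be snapped to a few coarse levels. This is only a sketch under stated assumptions: fixed-width bucketing is a stand-in chosen for simplicity, not the method from the paper, and `bucket_pass_rate` and `levels` are hypothetical names.

```python
def bucket_pass_rate(passed, total, levels=3):
    """Snap a test-case pass rate to one of `levels` evenly spaced values in [0, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    if levels < 2:
        raise ValueError("levels must be at least 2")
    rate = passed / total
    # Round to the nearest of the evenly spaced levels 0, 1/(levels-1), ..., 1.
    return round(rate * (levels - 1)) / (levels - 1)


# With three levels, 6/10 and 7/10 passing tests receive the same reward,
# while a fully passing solution still stands apart.
mid_a = bucket_pass_rate(6, 10)
mid_b = bucket_pass_rate(7, 10)
full = bucket_pass_rate(10, 10)
```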

Vijay V.@vijaytarian

We also show that discretizing reward models makes sense in theory. If we assume a reward model is a perfect (discrete) utility function plus continuous noise, then there always exist discretizations that improve the tradeoff between oversensitivity and "discriminative ability"
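
The modeling assumption above can be explored with a toy simulation. This is a sketch, not the paper's proof or its metrics: it assumes three utility levels, Gaussian noise, and rounding as the discretization, and the two measures (`tie_spread` as a stand-in for oversensitivity, `pairwise_accuracy` as a stand-in for discriminative ability) are hypothetical definitions chosen for illustration.

```python
import random
from itertools import combinations


def simulate(noise=0.2, n=300, seed=0):
    """Draw discrete utilities, add continuous noise, then round the noisy scores."""
    rng = random.Random(seed)
    utility = [rng.choice([0, 1, 2]) for _ in range(n)]
    raw = [u + rng.gauss(0.0, noise) for u in utility]
    binned = [float(round(r)) for r in raw]
    return utility, raw, binned


def tie_spread(utility, scores):
    """Mean absolute score gap between pairs that are truly equally good."""
    diffs = [abs(scores[i] - scores[j])
             for i, j in combinations(range(len(scores)), 2)
             if utility[i] == utility[j]]
    return sum(diffs) / len(diffs)


def pairwise_accuracy(utility, scores):
    """Fraction of truly unequal pairs that the scores rank in the right order."""
    pairs = [(i, j) for i, j in combinations(range(len(scores)), 2)
             if utility[i] != utility[j]]
    correct = 0
    for i, j in pairs:
        if scores[i] == scores[j]:
            continue  # a tie between unequal responses counts as a miss
        if (scores[i] < scores[j]) == (utility[i] < utility[j]):
            correct += 1
    return correct / len(pairs)


utility, raw, binned = simulate()
print("spread   raw:", tie_spread(utility, raw), "binned:", tie_spread(utility, binned))
print("accuracy raw:", pairwise_accuracy(utility, raw), "binned:", pairwise_accuracy(utility, binned))
```

With noise that is small relative to the spacing between utility levels, rounding removes most of the score variation among equally good responses while keeping nearly all of the ordering between unequal ones. Larger noise shifts that tradeoff.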

Vijay V.@vijaytarian

I did this work during an internship with @yuning_pro at Meta. I worked with an amazing team at Meta, @nagpalchirag, @ShiqiWang10, and @devamanyu, along with my advisors at CMU: @tongshuangwu and @gneubig 📄 https://arxiv.org/abs/2606.21795
