We’ve been approaching reward supervision for robots the wrong way.
I think freeform preferences are part of the answer.
A short 🧵
Project led by @marceltornev, @anubhamahajan01, @AbhijnyaBhat
Paper: https://arxiv.org/abs/2606.32027 Code & videos: https://freeform-pl.github.io/fpl.website/
Check out Marcel’s thread for more details!
We should stop optimizing robot policies against a single overall reward. Trajectories differ along many axes, such as speed, precision, and subtask completion, and one trajectory can be better on some axes while worse on others. If we collapse all of that into a single overall axis, we lose this structure, which makes the reward ambiguous and harder to optimize.
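To make the ambiguity concrete, here is a minimal sketch. The trajectories, scores, and weights are all made up for illustration, and the weighted sum is just one assumed way an annotator might collapse axes into a single judgment. Each trajectory wins on one axis, so the overall winner depends entirely on how the annotator implicitly weighs the axes.

```python
# Hypothetical per-axis scores for two trajectories (higher is better).
traj_a = {"speed": 0.9, "precision": 0.4}
traj_b = {"speed": 0.5, "precision": 0.8}


def scalar_reward(scores, weights):
    """Collapse per-axis scores into one number with a weighted sum."""
    return sum(weights[axis] * scores[axis] for axis in scores)


def preferred(a, b, weights):
    """Return which trajectory wins once everything is collapsed to one axis."""
    return "A" if scalar_reward(a, weights) > scalar_reward(b, weights) else "B"


# Two annotators who implicitly weigh the axes differently (assumed weights).
speed_minded = {"speed": 0.7, "precision": 0.3}
precision_minded = {"speed": 0.3, "precision": 0.7}

print(preferred(traj_a, traj_b, speed_minded))      # A
print(preferred(traj_a, traj_b, precision_minded))  # B
```

Asked per axis, both annotators would agree: A is faster, B is more precise. The disagreement only appears after collapsing to a single overall preference.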

Freeform preference learning has multiple nice properties:
(1) It works better
When controlling for the number of preference queries, learning with multi-axis preferences yields far more performant policies than learning with single-axis rewards.
Long term, we need reward models to capture all aspects of performance, like success, outcome quality, and speed.
This also includes task-specific axes like:
- was the PB spread evenly?
- was the apple slightly bruised while bagging it?
- was the furniture bumped or scratched?
Sparse rewards, progress metrics, and preferences are popular, but they:
- often neglect many aspects of a task
- collapse many axes into one measure
- frequently yield ambiguity and disagreement across annotators
We instead propose freeform preference learning (FPL).

(3) Long-horizon credit assignment
Most robot RL focuses on short-horizon tasks because dense temporal rewards are hard to get.
Freeform preferences yield dense rewards for subtasks without subtask segmentation.

With freeform preference data, we train a language-conditioned reward model that captures all axes of a task.
We then train a policy conditioned on each reward axis and the corresponding reward.
FPL allows the robot to maximally leverage and learn from each axis of supervision.
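As a rough illustration of learning a reward from per-axis preferences, the sketch below uses a Bradley-Terry style loss, which is a standard choice for preference-based reward learning. The thread does not say which loss the paper uses, so treat that as an assumption. `toy_reward`, `params`, and the two-number trajectory features are hypothetical stand-ins: here the axis string simply selects a weight vector, whereas a real language-conditioned model would encode the axis text.

```python
import math


def bradley_terry_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood that the preferred trajectory wins.

    Equals -log(sigmoid(margin)), computed in a numerically stable form.
    """
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def toy_reward(trajectory, axis, params):
    """Hypothetical axis-conditioned reward: the axis picks a weight vector."""
    weights = params[axis]
    return sum(w * x for w, x in zip(weights, trajectory))


# Assumed toy parameters and trajectory features: [speed-like, precision-like].
params = {"speed": [1.0, 0.0], "precision": [0.0, 1.0]}
fast_sloppy = [0.9, 0.2]
slow_careful = [0.3, 0.8]

# One trajectory pair, labeled on two axes: (winner, loser, axis).
comparisons = [
    (fast_sloppy, slow_careful, "speed"),
    (slow_careful, fast_sloppy, "precision"),
]

total = sum(
    bradley_terry_loss(toy_reward(w, ax, params), toy_reward(l, ax, params))
    for w, l, ax in comparisons
)
print(total)
```

The same pair of trajectories contributes two consistent training signals, one per axis, instead of a single overall label that would have to pick one winner.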

Freeform preferences let the supervisor define relevant axes and then specify preferences along those axes.
Axes can be either a fixed rubric or freeform language.
This eliminates ambiguity, allows for thorough coverage of all axes, and provides more dense supervision.
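One way to picture a freeform preference label is as a record holding one trajectory pair plus a list of per-axis judgments. The sketch below is only an assumed data layout, not the project's actual format: `AxisPreference`, `FreeformPreference`, `to_comparisons`, and the episode names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AxisPreference:
    axis: str       # freeform text or an entry from a fixed rubric
    preferred: str  # "a", "b", or "tie"


@dataclass
class FreeformPreference:
    trajectory_a: str
    trajectory_b: str
    axes: list = field(default_factory=list)


def to_comparisons(record):
    """Flatten one record into (winner, loser, axis) tuples, skipping ties."""
    out = []
    for p in record.axes:
        if p.preferred == "a":
            out.append((record.trajectory_a, record.trajectory_b, p.axis))
        elif p.preferred == "b":
            out.append((record.trajectory_b, record.trajectory_a, p.axis))
    return out


record = FreeformPreference(
    trajectory_a="episode_001",
    trajectory_b="episode_002",
    axes=[
        AxisPreference("speed", "a"),
        AxisPreference("was the PB spread evenly?", "b"),
        AxisPreference("was the furniture bumped or scratched?", "tie"),
    ],
)

for comparison in to_comparisons(record):
    print(comparison)
```

A single query on one pair yields several axis-specific comparisons, which is where the denser supervision comes from.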

(2) Compositional generalization
By supervising with multiple axes, the robot can learn behavior that’s not in the data, e.g. fast behavior for a task that has only slow episodes in the data.
This isn’t possible with traditional reward models.