Chelsea Finn Proposes Freeform Preferences for Robot Reward Supervision · Digg




Original post

Chelsea Finn @chelseabfinn, #11 in /Tech

We’ve been approaching reward supervision for robots the wrong way.

I think freeform preferences are part of the answer.

A short 🧵

5:33 PM · Jul 2, 2026 · 7.7K Views


Related links

Freeform Preference Learning for Robotic Manipulation (arxiv.org)

Freeform Preference Learning for Robotic Manipulation (freeform-pl.github.io)

Posts from X


Chelsea Finn @chelseabfinn

Project led by @marceltornev, @anubhamahajan01, @AbhijnyaBhat

Paper: https://arxiv.org/abs/2606.32027
Code & videos: https://freeform-pl.github.io/fpl.website/

Check out Marcel’s thread for more details!

Marcel Torné @marceltornev

We should stop optimizing robot policies against a single overall reward. Trajectories differ along many axes, such as speed, precision, and subtask completion, and one trajectory can be better on some axes while worse on others. If we collapse all of that into a single overall axis, we lose this structure, making the reward ambiguous and harder to optimize.

Blog: http://freeform-pl.github.io/fpl.website/
Paper: https://arxiv.org/abs/2606.32027

2h | Views 1.5K | Likes 8 | Bookmarks 0
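
A toy illustration of the ambiguity described in Marcel's post. Everything here is assumed for the sake of the example: the two trajectories, their per-axis scores, and the annotator weights are invented and do not come from the paper. The sketch only shows that a single "overall" judgment can flip depending on how an annotator silently weighs the axes, while per-axis comparisons stay the same.

```python
# Hypothetical per-axis scores in [0, 1]; higher is better on every axis.
traj_a = {"speed": 0.9, "precision": 0.4, "subtask_completion": 0.8}
traj_b = {"speed": 0.5, "precision": 0.9, "subtask_completion": 0.8}


def compare(x, y):
    """Return 'a' if x is larger, 'b' if y is larger, otherwise 'tie'."""
    if x == y:
        return "tie"
    return "a" if x > y else "b"


def scalar_reward(scores, weights):
    """Collapse per-axis scores into one number with a weighted sum."""
    return sum(weights[axis] * scores[axis] for axis in scores)


# Two annotators who weigh the axes differently (assumed weights).
speed_minded = {"speed": 0.6, "precision": 0.2, "subtask_completion": 0.2}
care_minded = {"speed": 0.2, "precision": 0.6, "subtask_completion": 0.2}

# "Which trajectory is better overall?" depends on the hidden weighting.
overall_1 = compare(scalar_reward(traj_a, speed_minded),
                    scalar_reward(traj_b, speed_minded))
overall_2 = compare(scalar_reward(traj_a, care_minded),
                    scalar_reward(traj_b, care_minded))

# Per-axis comparisons do not depend on any weighting.
per_axis = {axis: compare(traj_a[axis], traj_b[axis]) for axis in traj_a}

print(overall_1, overall_2, per_axis)
```

In this made-up case the two annotators disagree on the overall winner, yet both would give identical answers if asked about speed, precision, and subtask completion separately.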


Chelsea Finn @chelseabfinn

Freeform preference learning has multiple nice properties:

(1) It works better

When controlling for the number of preference queries, learning with multi-axis preferences yields far more performant policies than single-axis rewards.

2h | Views 235 | Likes 2 | Bookmarks 1


Chelsea Finn @chelseabfinn

Long term, we need reward models to capture all aspects of performance, like success, outcome quality, and speed.

This also includes task-specific axes like:
- was the PB spread evenly?
- was the apple slightly bruised while bagging it?
- was the furniture bumped or scratched?


2h | Views 1.3K | Likes 12 | Bookmarks 0

Chelsea Finn @chelseabfinn

Sparse rewards, progress metrics, and preferences are popular, but they:
- often neglect many aspects of a task
- collapse many axes into one measure
- frequently yield ambiguity and disagreement across annotators

We instead propose freeform preference learning.


2h | Views 948 | Likes 5 | Bookmarks 0

Chelsea Finn @chelseabfinn

(3) Long-horizon credit assignment

Most robot RL focuses on short-horizon tasks because dense temporal rewards are hard to get.

Freeform preferences yield dense rewards for subtasks without subtask segmentation.

2h | Views 801 | Likes 3

Chelsea Finn @chelseabfinn

With freeform preference data, we train a language-conditioned reward that captures all axes of a task.

We then train a policy conditioned on each reward axis and the corresponding reward.

FPL allows the robot to maximally leverage and learn from each axis of supervision.

2h | Views 128 | Likes 1
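
The post does not say how the reward is fit. One common choice for pairwise preference data is the Bradley-Terry model, sketched below as an assumption rather than as FPL's actual objective. The linear `reward` function, the per-axis weight table, the feature vectors, and the axis names are all hypothetical stand-ins for a language-conditioned network.

```python
import math

# Hypothetical stand-in for a language-conditioned reward model:
# one weight vector per axis name instead of a text encoder.
weights = {"speed": [0.0, 0.0], "precision": [0.0, 0.0]}


def reward(features, axis):
    """Score a trajectory's feature vector along one named axis."""
    return sum(w * x for w, x in zip(weights[axis], features))


def bt_loss(winner, loser, axis):
    """Bradley-Terry negative log-likelihood that `winner` beats `loser`."""
    margin = reward(winner, axis) - reward(loser, axis)
    return math.log1p(math.exp(-margin))


def sgd_step(winner, loser, axis, lr=0.5):
    """One gradient step on the loss for a single labeled preference."""
    margin = reward(winner, axis) - reward(loser, axis)
    grad_scale = -1.0 / (1.0 + math.exp(margin))  # d(loss)/d(margin)
    for i, (xw, xl) in enumerate(zip(winner, loser)):
        weights[axis][i] -= lr * grad_scale * (xw - xl)


# Toy features: [inverse duration, placement accuracy] (assumed).
fast_sloppy = [0.9, 0.3]
slow_careful = [0.4, 0.9]

# Freeform-style labels: one preference per axis for the same pair.
labels = [
    (fast_sloppy, slow_careful, "speed"),
    (slow_careful, fast_sloppy, "precision"),
]

loss_before = sum(bt_loss(win, lose, axis) for win, lose, axis in labels)
for _ in range(50):
    for win, lose, axis in labels:
        sgd_step(win, lose, axis)
loss_after = sum(bt_loss(win, lose, axis) for win, lose, axis in labels)

print(loss_before, loss_after)
```

Because each label names its axis, the same pair of trajectories can supply two consistent training signals (fast beats slow on speed, careful beats sloppy on precision) instead of one contested "overall" label.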

Chelsea Finn @chelseabfinn

Freeform preferences let the supervisor define relevant axes and then specify preferences along those axes.

Axes can be either a fixed rubric or freeform language.

This eliminates ambiguity, allows for thorough coverage of all axes, and provides denser supervision.

2h | Views 90 | Likes 1
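
One way to picture such a label is as a record that names its own axis. The schema below is a guess for illustration only: the `AxisPreference` class, its field names, and the episode IDs are hypothetical and are not taken from the FPL code or paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisPreference:
    axis: str       # fixed rubric entry or freeform language
    preferred: str  # ID of the episode judged better on this axis
    other: str      # ID of the episode judged worse on this axis


# One comparison of two episodes can yield several labels.
labels = [
    AxisPreference("speed", preferred="ep_017", other="ep_042"),
    AxisPreference("was the PB spread evenly?",
                   preferred="ep_042", other="ep_017"),
    AxisPreference("was the furniture bumped or scratched?",
                   preferred="ep_042", other="ep_017"),
]

# Group labels by axis so each axis can be supervised separately.
by_axis = defaultdict(list)
for label in labels:
    by_axis[label.axis].append((label.preferred, label.other))

labels_per_comparison = len(labels)
print(labels_per_comparison, dict(by_axis))
```

In this sketch a single pair of episodes produces three labels instead of one, which is one reading of the "denser supervision" claim in the post.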

Chelsea Finn @chelseabfinn

(2) Compositional generalization

By supervising with multiple axes, the robot can learn behavior that’s not in the data, e.g. fast behavior for a task that has only slow episodes in the data.

This isn’t possible with traditional reward models.

2h | Views 126
