Freeform Preference Learning Improves AI Supervision With Multi-Axis Feedback

Original post

Chelsea Finn @chelseabfinn

Freeform preferences let the supervisor define relevant axes and then specify preferences along those axes.

Axes can be either a fixed rubric or freeform language.

This eliminates ambiguity, allows for thorough coverage of all axes, and provides more dense supervision.

Chelsea Finn @chelseabfinn

Sparse rewards, progress metrics, and preferences are popular, but they
- often neglect many aspects of a task
- collapse many axes into one measure
- frequently yield ambiguity and disagreement across annotators

We instead propose freeform preference learning.

5:33 PM · Jul 2, 2026 · 695 Views
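To make the idea of per-axis preferences concrete, here is a minimal sketch of what one freeform preference query could look like as data, next to the single overall label it would collapse into. The class name, field names, axis phrases, and the majority-vote collapse are all illustrative assumptions; they are not taken from the paper or its code.

```python
from dataclasses import dataclass, field


@dataclass
class FreeformPreference:
    """One comparison between two episodes, labeled along several axes (hypothetical schema)."""
    episode_a: str
    episode_b: str
    # Maps a free-text axis description to the preferred episode: "a", "b", or "tie".
    axis_labels: dict[str, str] = field(default_factory=dict)


def collapse_to_single_label(pref: FreeformPreference) -> str:
    """Majority vote over axes: roughly what a single overall preference keeps."""
    a_votes = sum(1 for v in pref.axis_labels.values() if v == "a")
    b_votes = sum(1 for v in pref.axis_labels.values() if v == "b")
    if a_votes == b_votes:
        return "tie"
    return "a" if a_votes > b_votes else "b"


# Axis names here are made up for illustration.
pref = FreeformPreference(
    episode_a="episode_017",
    episode_b="episode_042",
    axis_labels={
        "completes the task": "a",
        "moves quickly": "b",
        "handles the object gently": "a",
    },
)

single_label = collapse_to_single_label(pref)
```

The collapsed label records only that episode A won overall; the fact that episode B was faster survives only in the per-axis record.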

Sentiment

Users are praising freeform preference learning with multi-axis feedback because it makes rewards more interpretable than binary comparisons.

Positive: 100.0%, Negative: 0.0% (1 comment with sentiment)

Posts from X

Chelsea Finn @chelseabfinn

Project led by @marceltornev, @anubhamahajan01, @AbhijnyaBhat

Paper: https://arxiv.org/abs/2606.32027
Code & videos: https://freeform-pl.github.io/fpl.website/

Check out Marcel’s thread for more details!

Chelsea Finn @chelseabfinn

Freeform preference learning has multiple nice properties:

(1) It works better

When controlling for the number of preference queries, learning with multi-axis preferences yields far more performant policies than learning with single-axis rewards.

Chelsea Finn @chelseabfinn

(3) Long-horizon credit assignment

Most robot RL focuses on short-horizon tasks because dense temporal rewards are hard to get.

Freeform preferences yield dense rewards for subtasks without subtask segmentation.

Chelsea Finn @chelseabfinn

(2) Compositional generalization

By supervising with multiple axes, the robot can learn behavior that’s not in the data, e.g. fast behavior for a task that has only slow episodes in the data.

This isn’t possible with traditional reward models.

Chelsea Finn @chelseabfinn

With freeform preference data, we train a language-conditioned reward that captures all axes of a task.

We then train a policy conditioned on each reward axis and the corresponding reward.

FPL allows the robot to maximally leverage and learn from each axis of supervision.
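The thread does not say which loss is used to fit the reward. As a rough sketch only, one common way to learn a reward from pairwise preferences is the Bradley-Terry model, applied here once per axis with the reward conditioned on the axis name. The Bradley-Terry choice, the `toy_reward` lookup standing in for a learned model, and all names and numbers are assumptions for illustration, not the authors' implementation.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(reward_a: float, reward_b: float, label: float) -> float:
    """Negative log-likelihood of one preference.

    label is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    The model assumes P(A preferred) = sigmoid(reward_a - reward_b).
    """
    diff = reward_a - reward_b
    log_p_a = -softplus(-diff)  # log sigmoid(diff)
    log_p_b = -softplus(diff)   # log sigmoid(-diff)
    return -(label * log_p_a + (1.0 - label) * log_p_b)


def toy_reward(episode_scores: dict[str, float], axis: str) -> float:
    """Hypothetical stand-in for a learned reward conditioned on an axis description."""
    return episode_scores[axis]


def multi_axis_loss(scores_a: dict[str, float],
                    scores_b: dict[str, float],
                    axis_labels: dict[str, float]) -> float:
    """Average the per-axis preference loss over every labeled axis."""
    losses = [
        bradley_terry_loss(toy_reward(scores_a, axis), toy_reward(scores_b, axis), label)
        for axis, label in axis_labels.items()
    ]
    return sum(losses) / len(losses)


# Made-up scores: episode A is more successful, episode B is faster.
scores_a = {"success": 2.0, "speed": -1.0}
scores_b = {"success": 0.0, "speed": 1.0}
labels = {"success": 1.0, "speed": 0.0}  # A preferred on success, B on speed

loss_consistent = multi_axis_loss(scores_a, scores_b, labels)
loss_flipped = multi_axis_loss(scores_a, scores_b, {"success": 0.0, "speed": 1.0})
```

In this toy setup the loss is lower when the per-axis rewards agree with the per-axis labels, so each labeled axis contributes its own training signal instead of being folded into one comparison.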

Ibrahim Ahmed @hazrmard

@chelseabfinn @marceltornev @anubhamahajan01 @AbhijnyaBhat Can the VLM be trained to output Pareto-optimal rewards? If I understand this work: a human in the loop defines criteria in natural language, the VLM scores video on the criteria, and the VLM and policy get continually trained on an evolving loss function as the criteria dimensions change.
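For readers unfamiliar with the term in the question above: with several reward axes, an episode is Pareto-optimal if no other episode is at least as good on every axis and strictly better on at least one. The sketch below filters a set of per-axis scores down to that front. The episode names, axis names, and scores are invented for illustration and say nothing about what the paper's model outputs.

```python
def dominates(u: dict[str, float], v: dict[str, float]) -> bool:
    """True if u is at least as good as v on every axis and strictly better on one."""
    return all(u[k] >= v[k] for k in u) and any(u[k] > v[k] for k in u)


def pareto_front(scores: dict[str, dict[str, float]]) -> list[str]:
    """Return the names of episodes that no other episode dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(
            dominates(other, s)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Made-up per-axis scores for three episodes.
scores = {
    "episode_a": {"success": 0.9, "speed": 0.2},
    "episode_b": {"success": 0.6, "speed": 0.8},
    "episode_c": {"success": 0.5, "speed": 0.1},
}

front = pareto_front(scores)
```

Here `episode_c` is dominated by `episode_a`, while `episode_a` and `episode_b` trade success against speed and both stay on the front.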

Miguel Guau! @ai_futures_mh

@chelseabfinn @marceltornev @anubhamahajan01 @AbhijnyaBhat Multi-axis preferences making rewards actually interpretable, love this. Binary comparisons always felt like throwing away signal. 38% improvement is no joke either.
