Roboticists Yifan Hou and Vaishakh Viswanathan present MuSe to integrate force-torque sensors into vision-only policies with minimal data

It trains a world model to predict future video, force readings, and actions together.

Original post

Yifan Hou@YifanHou2

Tactile/force data are critical but rare. They can never reach the scale of pretraining tasks, so we have to find intelligence in other ways.

In MuSe, we show that finetuning with a small amount of force data can even improve pretraining tasks, giving the model the ability to do force prediction on tasks that had no force data before.

The ability is enabled by three key modeling designs. Check out Jaden's post for details!

Jaden Clark@jadenvclark

Can we enable robots to develop a sense of touch without forgetting what they learned from large-scale vision-only pretraining?

Introducing MultiSensory World Model (MuSe) 🌍: A new approach for finetuning visuomotor policies on minimal data from new sensor modalities, such as force/torque (F/T)

With MuSe, touch learned later improves skills learned earlier: a small amount of F/T data on new tasks improves zero-shot performance on diverse pretraining tasks that were never supervised with F/T

We believe MuSe provides a practical pathway towards training multisensory foundation models that leverage both abundant vision data and smaller multisensory datasets 🧵👇

9:39 PM · Jul 2, 2026 · 3.2K Views

Sentiment

Users love MuSe's finetuning of pretrained vision models with limited force data because the vision-touch intersection makes embodied AI far more useful in the real world.

Pos: 100.0%

Neg: 0.0%

1 comment with sentiment.

Posts from X

Vai Viswanathan@vai_viswanathan

You pretrained a robot policy on millions of camera frames. Now you want to add a force sensor. Do you really have to retrain on everything from scratch?

MuSe adds a force-torque sensor to a frozen vision-only policy using a tiny amount of contact data.

- lifts contact-rich task success (peg insertion 60% to 87%, vase wiping 33% to 77%)

- fuses the new sensor both early (shared token space) and late (cross-attention), trains the policy as a world model that predicts future video, future force, and actions together, and replays old vision-only data with the force input masked to prevent forgetting

- adding new sensor modalities to a pretrained world model can be cheap and can improve performance significantly
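
The replay idea in the post (mixing old vision-only data into finetuning with the force input masked) can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: the `Sample` type, `to_model_input`, `mixed_batch`, the zero-fill convention, and the `replay_fraction` parameter are all hypothetical names and choices made for this example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

FT_DIM = 6  # assumed layout: Fx, Fy, Fz, Tx, Ty, Tz


@dataclass
class Sample:
    image: List[float]             # stand-in for camera features
    force: Optional[List[float]]   # F/T reading, or None for vision-only data
    action: List[float]


ModelInput = Tuple[List[float], List[float], int, List[float]]


def to_model_input(sample: Sample) -> ModelInput:
    """Return (image, force, force_mask, action).

    Vision-only samples get a zero-filled force vector and mask 0, so the
    model can be told that the force input is absent rather than zero.
    """
    if sample.force is None:
        return sample.image, [0.0] * FT_DIM, 0, sample.action
    return sample.image, list(sample.force), 1, sample.action


def mixed_batch(
    ft_data: List[Sample],
    vision_data: List[Sample],
    batch_size: int,
    replay_fraction: float,
    rng: random.Random,
) -> List[ModelInput]:
    """Draw a batch mixing new F/T samples with replayed vision-only samples."""
    n_replay = int(batch_size * replay_fraction)
    n_ft = batch_size - n_replay
    picks = [rng.choice(ft_data) for _ in range(n_ft)]
    picks += [rng.choice(vision_data) for _ in range(n_replay)]
    rng.shuffle(picks)  # avoid a fixed ordering of the two sources
    return [to_model_input(s) for s in picks]
```

Calling `mixed_batch(ft_data, vision_data, 8, 0.5, random.Random(0))` would yield eight inputs, half of them carrying a real force reading with mask 1 and half carrying a zero-filled reading with mask 0. How the mask is consumed by the model (for example, by swapping in a learned placeholder token) is left open here, since the post does not specify it.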

19h

JJ Walker@jtechlover

@YifanHou2 Love this direction. The intersection of vision and touch is where I think embodied AI starts to become far more useful in the real world.

13h