Stanford's Shuran Song releases MuSe to integrate tactile and force sensors into pretrained, vision-only robot policies

A multimodal world model prevents catastrophic forgetting.

Original post

Shuran Song @SongShuran

I'm increasingly interested in the problem of *multisensory continual learning*, since it feels inevitable for robotics.

Unlike vision, many robot sensors (e.g., force/torque, tactile, audio) are highly task- and system-specific. It's unrealistic to expect a single pretraining dataset to contain every future sensor. And as robotics evolves, we'll keep building new sensors.

So the question is: Can we plug a new sensor into a pretrained vision-only foundation model without forgetting everything it already knows?

Better yet, can the new sensor actually improve the model's existing vision-based skills?

That's exactly the question that motivated MuSe 👇

Jaden Clark @jadenvclark

Can we enable robots to develop a sense of touch without forgetting what they learned from large-scale vision-only pretraining?

Introducing MultiSensory World Model (MuSe) 🌍: A new approach for finetuning visuomotor policies on minimal data from new sensor modalities, such as force/torque (F/T)

With MuSe, touch learned later improves skills learned earlier: a small amount of F/T data on new tasks improves zero-shot performance on diverse pretraining tasks that were never supervised with F/T.

We believe MuSe provides a practical pathway towards training multisensory foundation models that leverage both abundant vision data and smaller multisensory datasets 🧵👇

7:53 PM · Jul 2, 2026 · 7.6K Views

Posts from X

Vai Viswanathan @vai_viswanathan

You pretrained a robot policy on millions of camera frames. Now you want to add a force sensor. Do you really have to retrain on everything from scratch?

MuSe adds a force-torque sensor to a frozen vision-only policy using a tiny amount of contact data.

- Lifts contact-rich task success (peg insertion from 60% to 87%, vase wiping from 33% to 77%).

- Fuses the new sensor both early (shared token space) and late (cross-attention), trains the policy as a world model that jointly predicts future video, future force, and actions, and replays old vision-only data with the force input masked to prevent forgetting.

- Shows that adding a new sensor modality to a pretrained world model can be cheap and can improve performance significantly.
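The replay idea in the list above (mixing old vision-only data back in with the force input masked) can be illustrated with a small NumPy sketch. This is not the MuSe implementation: the function names `make_replay_batch` and `masked_force_loss`, the feature shapes, and the choice of a zero placeholder plus a 0/1 mask are all assumptions made for illustration.

```python
import numpy as np

def make_replay_batch(new_vision, new_force, old_vision):
    """Mix new vision+force samples with replayed vision-only samples.

    Hypothetical helper: replayed samples get an all-zero force placeholder
    and a mask value of 0, so "no force sensor" can be told apart from
    "zero force measured".
    """
    n_new, n_old = len(new_vision), len(old_vision)
    force_shape = new_force.shape[1:]
    vision = np.concatenate([new_vision, old_vision], axis=0)
    force = np.concatenate([new_force, np.zeros((n_old, *force_shape))], axis=0)
    force_mask = np.concatenate([np.ones(n_new), np.zeros(n_old)])
    return vision, force, force_mask

def masked_force_loss(pred_force, target_force, force_mask):
    """Mean squared error averaged only over samples that have force data."""
    sq_err = (pred_force - target_force) ** 2
    per_sample = sq_err.reshape(len(force_mask), -1).mean(axis=1)
    denom = max(force_mask.sum(), 1.0)  # avoid dividing by zero
    return float((per_sample * force_mask).sum() / denom)

# Toy data: shapes are assumptions (8-dim vision features, 6-axis F/T).
rng = np.random.default_rng(0)
new_vision = rng.normal(size=(4, 8))    # small new dataset with force
new_force = rng.normal(size=(4, 6))
old_vision = rng.normal(size=(12, 8))   # replayed vision-only pretraining data

vision, force, force_mask = make_replay_batch(new_vision, new_force, old_vision)

# A fake prediction that is off by exactly 1 on every force channel.
loss = masked_force_loss(force + 1.0, force, force_mask)
```

In this sketch the replayed rows contribute nothing to the force loss, so vision-only data can only be supervised through whatever other targets a model predicts (for example future video and actions), which is the role the post attributes to replay.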
