Tactile/force data are critical but rare. They can never reach the scale of pretraining tasks, so we have to find intelligence in other ways.
In MuSe, we show that finetuning with a small amount of force data can even improve pretraining tasks, giving the model the ability to do force prediction on tasks that had no force data before.
This ability is enabled by three key modeling designs. Check out Jaden's post for details!
Can we enable robots to develop a sense of touch without forgetting what they learned from large-scale vision-only pretraining?
Introducing MultiSensory World Model (MuSe) 🌍: A new approach for finetuning visuomotor policies on minimal data from new sensor modalities, such as force/torque (F/T)
With MuSe, touch learned later improves skills learned earlier: a small amount of F/T data on new tasks improves zero-shot performance on diverse pretraining tasks that were never supervised with F/T
We believe MuSe provides a practical pathway towards training multisensory foundation models that leverage both abundant vision data and smaller multisensory datasets 🧵👇
