I'm increasingly interested in the problem of *multisensory continual learning*, since it feels inevitable for robotics.
Unlike vision, many robot sensors (e.g., force/torque, tactile, audio) are highly task- and system- specific. It's unrealistic to expect a single pretraining dataset to contain every future sensor. And as robotics evolves, we'll keep building new sensors.
So the question is: Can we plug a new sensor into a pretrained vision-only foundation model without forgetting everything it already knows?
Better yet, can the new sensor actually improve the model's existing vision-based skills?
That's exactly the question that motivated MuSe 👇
Can we enable robots to develop a sense of touch without forgetting what they learned from large-scale vision-only pretraining?
Introducing MultiSensory World Model (MuSe) 🌍: A new approach for finetuning visuomotor policies on minimal data from new sensor modalities, such as force/torque (F/T)
With Muse, touch learned later improves skills learned earlier — a small amount of F/T data on new tasks improves zero-shot on diverse pretraining tasks that were never supervised with F/T
We believe MuSe provides a practical pathway towards training multisensory foundation models that leverage both abundant vision data, and smaller multisensory datasets 🧵👇