Appreciate Jitendra's takes on world models/VLMs. His words below are why, back in 2019-2021, we chose assembly instead of VLAs for simple pick-and-place.
Dexterity = mutual info between your intent and forces/torques on objects via contacts.
I want to offer some unsolicited advice to computer vision researchers jumping into robotics. Don't focus too much on VLMs, VLAs etc. That's fine, but the real action is at the sensorimotor level. Most of the open problems in robotics are in manipulation, which is about hand-object interaction, and contacts and forces are central. Proprioception and tactile sensing are as important as vision. Don't get seduced by cherry-picked demos. You can't do robotics without doing robotics.















