We are open-sourcing a fine-tuned video policy and a pre-trained inverse dynamics model (IDM)! In our paper, we demonstrate that this paradigm has the potential for plug-and-play manipulation across embodiments. Very exciting :)
Robot learning is moving beyond policies built for one robot, one scene, one task.
At MIT, we’re exploring a different path: turning video world models into embodiment-agnostic robot policies.
Introducing VERA: a 14B video-to-action system that controls robots across embodiments, skills, and environments.
From zero-shot pick-and-place on a real Panda arm to contact-rich cube reorientation with a 16-DoF robotic hand.
Different robots. Different environments. Different tasks. Same video planner. Same weights.
We’re open-sourcing everything so you can fine-tune VERA for your own robot setup too. Deep dive in the thread:
🔗 http://vera.csail.mit.edu 🧵 (1/7)





