MIT researchers open-source VERA, a 14-billion-parameter video-to-action model for generalist robot control

It pairs a video planner with an inverse dynamics model

Original post

Vincent Sitzmann@vincesitzmann

We are open-sourcing a fine-tuned video policy + pre-trained IDM! In our paper, we demonstrate that this paradigm has the potential for plug-and-play manipulation across embodiments - very exciting :)

Lester Li@sizhe_lester_li

Robot learning is moving beyond policies built for one robot, one scene, one task.

At MIT, we’re exploring a different path: turning video world models into embodiment-agnostic robot policies.

Introducing VERA: a 14B video-to-action system that controls robots across embodiments, skills, and environments.

From zero-shot pick-and-place on a real Panda arm to contact-rich cube reorientation with a 16-DoF robotic hand.

Different robots. Different environments. Different tasks. Same video planner. Same weights.

We’re open-sourcing everything so you can fine-tune VERA for your own robot setup too. Deep dive in the thread:

🔗 http://vera.csail.mit.edu 🧵 (1/7)

9:39 AM · Jun 23, 2026 · 12.1K Views

Sentiment

Users are excited about MIT's open-source 14B VERA video model for cross-embodiment robot control because of its clean concept, strong execution, and promising potential in video world models.

Positive: 100.0% · Negative: 0.0% (13 comments with sentiment)

Via MIT.EDU

Posts from X

Chris Paxton@chris_j_paxton

I really feel like we aren't bitter lesson-pilled enough sometimes: training larger and larger models really is going to make all our robotics dreams come true. Not that this is easy!

Lester Li@sizhe_lester_li

We’ve open-sourced everything: model training/inference code, robot infrastructure, and simulation environments.

📄 Paper: http://arxiv.org/abs/2605.27817 🌐 Project: http://vera.csail.mit.edu 💻 Code: https://github.com/sizhe-li/VERA

(7/7)

Lester Li@sizhe_lester_li

Also, shoutout to the team and friends at @RhodaAI, who are pursuing this bet head-on in industry.

I’ll be spending the summer there. If you’re in the Bay Area and thinking about video/world models for robotics, DM me — I’d love to connect :D

Lester Li@sizhe_lester_li

The core challenge in robot intelligence is generalization: one policy that can work across robots, scenes, and tasks.

Our key idea is to use video chunks as an embodiment-agnostic action representation. Given memory and (text) conditioning, a video planner predicts the next chunk of frames.

An inverse-dynamics model then turns those predicted video chunks into action chunks for the robot to execute.

(2/7)
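
The loop described in this post can be sketched in a few lines. This is a toy illustration only, not VERA's code: a "frame" is reduced to a single number, the text prompt is replaced by a numeric goal, and `toy_planner`, `toy_idm`, and `run_episode` are hypothetical names invented for the example. It only shows the control flow: plan a chunk of frames from memory, convert the chunk to actions, execute, then replan from the new observations.

```python
from typing import List, Sequence

Frame = float   # toy stand-in: a "frame" is the 1-D position it depicts
Action = float  # toy stand-in: an action is a 1-D displacement


def toy_planner(memory: Sequence[Frame], goal: Frame, chunk: int = 4) -> List[Frame]:
    """Hypothetical planner: predict the next `chunk` frames from memory.

    `goal` stands in for the text conditioning; each predicted frame
    closes half of the remaining distance to it.
    """
    frames, current = [], memory[-1]
    for _ in range(chunk):
        current += 0.5 * (goal - current)
        frames.append(current)
    return frames


def toy_idm(last_frame: Frame, predicted: Sequence[Frame]) -> List[Action]:
    """Hypothetical inverse dynamics: the action that takes each frame to the next."""
    actions, prev = [], last_frame
    for frame in predicted:
        actions.append(frame - prev)
        prev = frame
    return actions


def run_episode(start: Frame, goal: Frame, replans: int = 3) -> List[Frame]:
    """Plan a frame chunk, turn it into an action chunk, execute, repeat."""
    memory = [start]
    for _ in range(replans):
        frame_chunk = toy_planner(memory, goal)
        action_chunk = toy_idm(memory[-1], frame_chunk)
        for action in action_chunk:
            # Toy "robot + camera": executing an action yields the next observed frame.
            memory.append(memory[-1] + action)
    return memory


if __name__ == "__main__":
    trajectory = run_episode(start=0.0, goal=1.0)
    print(len(trajectory), trajectory[-1])
```

In this sketch only `toy_idm` knows how actions relate to frames, so the planner would not need to change for a different embodiment, which is the property the post attributes to video chunks.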

Lester Li@sizhe_lester_li

On a real Franka Panda arm, VERA performs zero-shot manipulation in unseen scenes, with no task-specific fine-tuning:

✅ follows language prompts
✅ reasons across camera views to recover hidden objects
✅ handles changes in lighting, camera placement, and scene layout

(3/7)

Zechen Bai@ZechenBai

@sizhe_lester_li Very interesting! The idea is so natural yet effective. I’m curious: what was the main bottleneck that kept prior research from succeeding at this? Or had no one tried it before? 😆

Lester Li@sizhe_lester_li

Joint work with co-first authors @evnkimm and @SimulatedAnneal, plus Tong Zhao, @pangtao22, @max_simchowitz, and my PhD advisor @vincesitzmann.

This work would not have been possible without this team: from robot learning and video modeling to simulation, systems, hardware, experiments, and countless late nights debugging real robots.

(6/7)

Lester Li@sizhe_lester_li

How does VERA turn video into action?

The world model predicts the next visual chunk: what the scene should look like next.

A Jacobian inverse-dynamics model asks the inverse question: what robot command would create that pixel motion?

That is the bridge from dreaming in pixels to acting in the world.

(5/7)
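
One way to read the "inverse question" is as a small least-squares problem. The sketch below is an assumption for illustration, not VERA's implementation: it supposes the pixel motion f of a few tracked points is locally linear in the command u, f ≈ J u, with a known Jacobian J, and recovers u by damped least squares. The function names, the damping value, and the example numbers are all made up for this example.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def solve_linear(a: Matrix, b: Vector) -> Vector:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def command_from_pixel_motion(jacobian: Matrix, pixel_motion: Vector,
                              damping: float = 1e-6) -> Vector:
    """Damped least squares: argmin_u ||J u - f||^2 + damping * ||u||^2.

    Solved through the normal equations (J^T J + damping * I) u = J^T f.
    """
    rows, dof = len(jacobian), len(jacobian[0])
    jtj = [[sum(jacobian[k][i] * jacobian[k][j] for k in range(rows))
            + (damping if i == j else 0.0)
            for j in range(dof)] for i in range(dof)]
    jtf = [sum(jacobian[k][i] * pixel_motion[k] for k in range(rows))
           for i in range(dof)]
    return solve_linear(jtj, jtf)


if __name__ == "__main__":
    # Made-up example: two tracked points (x and y motion each), 2-DoF command.
    J = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, -1.0]]
    f = [0.5, -0.5, 0.25, 1.25]  # pixel motion produced by u = [0.5, -0.25]
    print(command_from_pixel_motion(J, f))
```

The damping term keeps the solve well-posed when some command directions barely move any pixels; how the real model obtains and uses its Jacobian is described in the paper, not here.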

Lester Li@sizhe_lester_li

The same video planner can scale from a 7-DoF Panda arm to a 16-DoF Allegro hand.

With domain-specific post-training data, VERA learns to control dexterous fingers for cube reorientation.

(4/7)

Jayden Teoh@jayden_teoh_

@sizhe_lester_li really cool work!

Yuejiang Liu@liu_yuejiang

@sizhe_lester_li Very cool work! Congrats!

Lester Li@sizhe_lester_li

Thanks so much!!! My take: the idea is natural, but the recipe is pretty unforgiving.

When I started a year ago, on the PushT environment, small video models trained on ~200 demos gave us basically nothing 😅 What worked was broader visual coverage first, then task post-training, frame-chunk → action-chunk prediction, and careful controller design.

Lots of pieces had to click together :D

The thing that kept me going was seeing the model drive a two-finger robot with enough precision to solve reorientation.

ali@aliuahma

@sizhe_lester_li awesome work! did you compare neural jacobian fields to another generative-modeling-based approach or was it just the UniPi baseline?

Lester Li@sizhe_lester_li

I’d like to acknowledge my two amazing co-first authors.

Evan (@evnkimm) is one of the most cracked undergrads on the planet: an avid attention physicist, video-transformer gymnast, and Sherlock Holmes of scaling laws.

Xingjian (@SimulatedAnneal), together with Evan, contributed to the core design of our video models and gave me so much perspective on how to rethink robotics, from top to bottom, as a core ML problem.

I’m forever grateful that they took a chance on us when the project was still only working in PushT and a few very basic-looking simulation environments.

Lester Li@sizhe_lester_li

@jayden_teoh_ Thank you Jayden! :D

Lester Li@sizhe_lester_li

@liu_yuejiang Thank you so much Yuejiang! Super looking forward to what your lab will be building next!

Mingkai Deng@mdeng34

Amazing work! Great demonstration that we should disentangle the world model from the agent model for generalizable decision-making.

In our recent paper "Critique of Agent Model" coauthored with Prof. @ericxing and @jinyuhou0, we formally analyzed existing approaches to agent modeling, and proposed the next steps for building autonomous agents.

A major conclusion is that there's real, general benefit to using a world model inside an agent model, but *only if* the world model simulates faithfully.

If you fine-tune the WM together with the AM, the guarantee is lost.

https://arxiv.org/abs/2606.23991

Lester Li@sizhe_lester_li

Hi Jay! We compare to open-source VLA/WAM baselines in Fig. 6. The leading industry models are strong on DROID.

VERA bets video world-model knowledge can transfer across embodiments/domains, from driving games to Waymo cars or human grasps to robot grasps. Curious to study this at scale :D

Lester Li@sizhe_lester_li

@aliuahma Thank you Ali! We compared with an in-house UniPi baseline, where we used the same backbone as our jacobian IDM :)

One thing that might be interesting for the community to look into is that different approaches might scale differently when you have high-DoF systems like hands!
