Jitendra Malik's research group open-sources Do as I Do, translating monocular RGB human videos into 4D robotic trajectories

Original post

Learning motion directly from videos, rather than using them for action supervision, is a stronger approach and likely more scalable.

While it is early, this line of work suggests replicating the playbook that made robots walk: real videos provide state supervision (not action supervision), retargeting provides reference trajectories, and RL tracks these trajectories.

This is a very good example of the separation of the "What" and the "How"
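The playbook above can be made concrete with a small sketch. Assuming the retargeting step has already produced a reference trajectory of robot states, one common way to have RL track it is to reward the policy for staying close to the reference at each timestep. The function names, the exponential reward shape, and the `scale` parameter below are illustrative assumptions, not details taken from the Do as I Do paper or code.

```python
import math


def tracking_reward(state, reference, scale=5.0):
    """Return a reward in (0, 1] that is highest when state equals reference.

    `state` and `reference` are equal-length sequences of floats, for example
    joint angles or object positions (hypothetical layout for illustration).
    """
    if len(state) != len(reference):
        raise ValueError("state and reference must have the same length")
    # Squared Euclidean distance between the current and the reference state.
    sq_err = sum((s - r) ** 2 for s, r in zip(state, reference))
    # Exponential shaping: 1.0 at zero error, decaying toward 0 as error grows.
    return math.exp(-scale * sq_err)


def episode_return(states, reference_trajectory, scale=5.0):
    """Sum per-timestep tracking rewards over one rollout."""
    return sum(
        tracking_reward(s, r, scale)
        for s, r in zip(states, reference_trajectory)
    )


# The "what": a reference trajectory, here a hand-written stand-in for
# the output of retargeting a human video.
reference_trajectory = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.4]]

# The "how": states visited by some policy; RL would adjust the policy
# to increase the return computed below.
rollout = [[0.0, 0.0], [0.1, 0.1], [0.3, 0.4]]

total = episode_return(rollout, reference_trajectory)
```

In this framing the video only ever contributes the reference states. The actions that reach them are left to the RL policy, which is the "What" versus "How" split described in the post.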

Mahi Shafiullah 🏠🤖@notmahi

Robots are the bottleneck in scaling robotics, and learning from human video promises to solve it. But how can chaotic human data ever measure up to sanitized, lab-made teleoperation data? Introducing Do as I Do: establishing a much needed correspondence between human videos and dexterous robot data. Some fun insights below: 🧵

9:31 AM · Jun 18, 2026 · 4.8K Views

Views: 10.5K · Bookmarks: 130 · Likes: 235 · Replies: 3

Jitendra MALIK@JitendraMalikCV

We can convert human videos to robot hand-object interaction trajectories in 4D. Enjoy! Paper: https://arxiv.org/abs/2606.19333 Website: https://do-as-i-do.com Code: https://github.com/malik-group/do-as-i-do Authors: @bhawna_paliwal_, @HarithejaE, @willjhliang, @pabbeel, @notmahi, @JitendraMalikCV

Retweets: 27

Animesh Garg@animesh_garg

Multiple folks buying into the WHAT and HOW framework

The slide is from the talk at last year's ICRA keynote debate. (https://www.youtube.com/watch?t=196&v=PfvctjoMPk8&feature=youtu.be)

Another great example of using human data primarily for reference generation, not action supervision.

Harsh Gupta 🇺🇸@hgupt3

We trained a dexterous hand 🤖 from internet-scale human videos 📺 with ZERO real-world robot or teleoperation data

We pair a world model 🌎 that predicts task intent with a generalist sensorimotor policy ✋ trained sim2real to zero-shot fulfill ANY intent

Introducing LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition 🧵

Chris Paxton@chris_j_paxton

I think we shouldn't expect this correspondence to magically appear, at least at the frontier of performance. Cool work

Animesh Garg@animesh_garg

Earlier today, @notmahi and team's paper debuted with a similar idea for learning from videos.

Addicted@Just2Addicted

@animesh_garg Hey sir, could you please check your DMs.

Thank you

Marcel Bakery 🦖@KoosCryptoo

This separation of “What” (from human video) and “How” (via RL) looks like a promising direction for scaling dexterous manipulation. Human videos provide rich state and intent, but they still lack precise contact dynamics that real robot data captures well.

Curious how far this approach can go before it needs grounding in robot interaction data.

Rv@InvestorRVD

@animesh_garg Insightful !

Nikhil Nakhate@nikx_NN

@JitendraMalikCV @bhawna_paliwal_ @HarithejaE @willjhliang @pabbeel @notmahi This is amazing! What’s the speed up on the robot video side?

Willy Wonka 🍫@WillyWonkaCT

@animesh_garg Very cool

Samuel Shvartsman@SamuelSBlackman

@JitendraMalikCV @bhawna_paliwal_ @HarithejaE @willjhliang @pabbeel @notmahi Impressive, will test out taking selfies on a unitree g1!

Analyz3R@Archd3vill

@animesh_garg 👀

yash@yashetal

@JitendraMalikCV @bhawna_paliwal_ @HarithejaE @willjhliang @pabbeel @notmahi woah this looks good

Soul 4 a SOL@soul4aSOL

@animesh_garg Neat! Just an FYI but I think there is thousands of dollars waiting for you to help out with Cobalt

Jakie PLA@3DPrintAficio

@animesh_garg @notmahi THIS. Video as the training substrate, what/how split, sim2real bridge. NO teleoperation bottleneck. This is how embodied AI scales.

Rv@InvestorRVD

@animesh_garg @notmahi Kindly check dm
