Learning motion directly from videos, rather than using them for action supervision, is a superior approach and likely more scalable.
While it is early, this line of work suggests replicating the playbook that made robots walk: real videos provide state supervision (not action supervision) --> retargeting provides reference trajectories --> RL tracks these trajectories.
This is a very good example of separating the "What" from the "How": the video specifies what motion to produce, and RL works out how the robot produces it.
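To make the "RL tracks these trajectories" step concrete, here is a minimal sketch of a tracking reward, assuming the retargeted reference is a list of joint-angle vectors, one per frame. The function name `tracking_reward`, the `sharpness` parameter, the exponential form, and the toy numbers are all illustrative assumptions, not the method from the quoted work.

```python
import math

def tracking_reward(robot_state, reference_state, sharpness=2.0):
    """Return a reward in (0, 1]; 1.0 means the state matches the reference exactly.

    Assumed form: exp(-sharpness * squared error), a common choice for
    motion-tracking rewards. Not taken from the quoted work.
    """
    sq_err = sum((q - q_ref) ** 2 for q, q_ref in zip(robot_state, reference_state))
    return math.exp(-sharpness * sq_err)

# Hypothetical retargeted reference: one joint-angle vector per video frame (the "What").
reference_trajectory = [[0.0, 0.1], [0.1, 0.2], [0.2, 0.3]]

# Hypothetical robot rollout that drifts away from the reference over time.
rollout = [[0.0, 0.1], [0.1, 0.25], [0.3, 0.3]]

# Per-frame rewards: the policy is scored on states only; actions are left to RL (the "How").
rewards = [tracking_reward(s, ref) for s, ref in zip(rollout, reference_trajectory)]
```

Note that no action labels appear anywhere in the sketch: the reward depends only on how closely the robot's state follows the reference.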
Robots are the bottleneck in scaling robotics, and learning from human video promises to solve it. But how can chaotic human data ever measure up to sanitized, lab-made teleoperation data? Introducing Do as I Do: establishing a much needed correspondence between human videos and dexterous robot data. Some fun insights below: 🧵









