Human Videos Enable Robot Self-Improvement Across Embodiments

Step 1: Pretrain shared policy, dynamics, and value models from human videos -- representations that transfer across robot embodiments.
• Policy predicts 6-DoF wrist poses and hand closure variables
• Dynamics predicts the action-conditioned world state, represented by DINO-v3 visual tokens and point trajectories
• Value predicts task progress
[1/n]
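
A rough sketch of what the three model interfaces in Step 1 could look like. Every name here (`WristAction`, `WorldState`, `Policy`, `Dynamics`, `Value`) is hypothetical, and the rotation parameterisation, units, and closure range are assumptions; the thread only states what each model predicts.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class WristAction:
    """Embodiment-agnostic action: 6-DoF wrist pose plus hand closure."""
    position: tuple[float, float, float]  # wrist translation (metres assumed)
    rotation: tuple[float, float, float]  # wrist orientation (axis-angle assumed)
    closure: float                        # 0.0 = open, 1.0 = closed (assumed range)


@dataclass(frozen=True)
class WorldState:
    """World state as visual tokens plus tracked point positions."""
    visual_tokens: Sequence[Sequence[float]]     # e.g. DINO-v3 patch features
    point_tracks: Sequence[tuple[float, float]]  # 2-D point positions (assumed)


class Policy(Protocol):
    def act(self, state: WorldState, instruction: str) -> list[WristAction]:
        """Predict a chunk of wrist poses and closure values."""
        ...


class Dynamics(Protocol):
    def predict(self, state: WorldState,
                actions: Sequence[WristAction]) -> WorldState:
        """Predict the action-conditioned future world state."""
        ...


class Value(Protocol):
    def progress(self, state: WorldState, instruction: str) -> float:
        """Predict task progress for the given state."""
        ...
```

Because the action is defined at the wrist rather than in joint space, the same interface could in principle be shared by a human hand and different robot arms, which is the property the thread relies on.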

Hanzhi Chen@hanzhic678

Authors: @hanzhic678*, @Anran_zh*, Simon Schaefer, Kejia Chen, @__csxxx__, Daniel Cremers, @oier_mees†, @StefanLeuteneg1†

@ETH_en @TU_Muenchen @Microsoft @MunichCenterML

📄 Paper: https://arxiv.org/abs/2606.21406 📹 Video: https://youtu.be/ZW3ZHjrllJA 🌐 Project: https://ethz-mrl.github.io/robot-self-improvement-website/

ζ Pedram ζ@zenstyle

@oier_mees Fail fast, Learn from Erlang ;)

Oier Mees@oier_mees

Robots don’t need more demonstrations. They need to learn from failure, after watching humans.

Most robot learning systems assume failure is the end of learning. In our new work, we study whether robots can improve after deployment by learning from their own failures, without any human intervention, teleoperation, or corrective labels.

The key idea is simple: human videos contain structure about how the world works. We use them to learn cross-embodiment representations of action, dynamics, and value, enabling a shared predictive space between human behavior and robot experience.

This allows a new learning loop:
👉 pretrain on human videos
👉 deploy robot policy
👉 observe failures
👉 reinterpret failures using human priors
👉 improve autonomously

We evaluate this across 7 real-world manipulation tasks, showing:
📈 40% → 81% success rate
🏆 Strong improvements over π0.6 RECAP and RISE
✔️ Zero human intervention during post-deployment improvement
🧬 Generalizes across robot embodiments and policy backbones

A key finding is that explicit failure repair significantly outperforms failure reweighting, yielding substantially larger gains under identical data conditions (+25 pts vs +5 pts on the same π0.5 base policy). Overall, the results suggest a shift in how we think about robot learning: Human videos are not only for pretraining policies. They can provide the structure needed for continual self-improvement after deployment.
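
To make the repair-versus-reweighting distinction concrete, here is a toy sketch of how the two could build a supervised fine-tuning set from the same rollouts. This is one reading of the terms, not the paper's implementation: `reweight`, `repair`, `fail_weight`, and the `(state, action, succeeded)` tuple layout are all hypothetical, and dropping uncorrected failures is an assumption.

```python
def reweight(transitions, fail_weight=0.1):
    """Keep the original action labels; shrink the loss weight of failures."""
    return [(s, a, 1.0 if ok else fail_weight) for s, a, ok in transitions]


def repair(transitions, corrections):
    """Replace the action label of each failed transition with a correction.

    `corrections` maps a transition index to its corrected action.
    Failed transitions without a correction are dropped (assumption).
    """
    out = []
    for i, (s, a, ok) in enumerate(transitions):
        if ok:
            out.append((s, a, 1.0))
        elif i in corrections:
            out.append((s, corrections[i], 1.0))
    return out


# Toy rollouts: (state, action, succeeded)
rollouts = [(0.0, 0.25, True), (0.5, -0.25, False)]
print(reweight(rollouts))
print(repair(rollouts, {1: 0.25}))
```

Under reweighting the failed sample still carries its bad action, only with less influence. Under repair the same state is paired with a better action at full weight, so the failure becomes a positive training signal.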

📄 Paper: https://arxiv.org/pdf/2606.21406 🌐 Project: https://ethz-mrl.github.io/robot-self-improvement-website/

I am grateful for working with the fantastic leads @hanzhic678 and @Anran_zh, and our collaborators Simon Schaefer, Kejia Chen, Shi Chen, Daniel Cremers. Special thanks to @StefanLeuteneg1 for co-advising this project with me. @ETH @TU_Muenchen @Microsoft

Check out Hanzhi's 🧵 for more details

Hanzhi Chen@hanzhic678

🤖🎥 We have recently seen some cool works that leverage human videos to learn robot policies, even without robot demonstrations.

But what if human videos could do more than teach robots what to imitate?

We show that human videos can teach robots predictive representations of action, dynamics, and value. These embodiment-agnostic representations transfer across robot embodiments, enabling robots to self-improve from their own rollouts and failures - without online human intervention. Introducing:

📄 Robot Self-Improvement via Human-Video Dynamics Models

Our method enables two different robots to self-improve across 7 real-world manipulation tasks:

🚀 40% → 81% success rate
✅ zero human intervention during improvement
🌈 works across robot embodiments and different policy backbones

Human videos are not just data for imitation; they can support robot self-improvement.

🧵👇

Hanzhi Chen@hanzhic678

We further tested our framework with a Franka Panda robot, showing consistent improvement on a different embodiment. [6/n]

Hanzhi Chen@hanzhic678

Result: across 5 real-world manipulation tasks, our framework achieves the highest average success rate among 6 representative baselines. [4/n]

Hanzhi Chen@hanzhic678

Step 2: Human videos give useful priors, but robots still move differently and fail in their own ways. So we let the robot collect its own interaction data: a VLM proposes atomic tasks (e.g., close the drawer), and the frozen human-pretrained policy tries them. These rollouts adapt the dynamics and value models to real robot successes and failures. [2/n]
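
A minimal sketch of the Step 2 collection loop, assuming a toy scalar state. `Transition`, `collect_rollouts`, and the four callables are hypothetical stand-ins for the VLM task proposer, the frozen policy, the robot, and a success check; none of them come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Transition:
    task: str          # atomic task proposed for this episode
    state: float       # toy scalar stand-in for the visual state
    action: float
    next_state: float
    success: bool      # outcome of the whole rollout


def collect_rollouts(
    propose_task: Callable[[], str],
    policy: Callable[[float, str], float],
    env_step: Callable[[float, float], float],
    is_success: Callable[[float, str], bool],
    n_episodes: int,
    horizon: int,
) -> List[Transition]:
    """Run the frozen policy on proposed tasks and log every transition."""
    data: List[Transition] = []
    for _ in range(n_episodes):
        task = propose_task()
        state = 0.0
        episode = []
        for _ in range(horizon):
            action = policy(state, task)       # policy weights stay frozen
            next_state = env_step(state, action)
            episode.append((state, action, next_state))
            state = next_state
        success = is_success(state, task)      # label the rollout once it ends
        data.extend(Transition(task, s, a, ns, success) for s, a, ns in episode)
    return data


# Toy usage: the drawer counts as closed once the state reaches 1.0.
rollouts = collect_rollouts(
    propose_task=lambda: "close the drawer",
    policy=lambda s, t: 0.25,
    env_step=lambda s, a: s + a,
    is_success=lambda s, t: s >= 1.0,
    n_episodes=2,
    horizon=4,
)
```

Both successful and failed rollouts are kept: the logged `(state, action, next_state)` triples would feed dynamics adaptation, and the success labels would feed value adaptation.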

Hanzhi Chen@hanzhic678

Step 3: After robot-specific adaptation, we let the robot run on its own -- no human intervention. When the robot fails, our Dynamics-Guided Action Correction (DGAC) module retrieves progress-aligned successful experiences, generates and ranks corrective action chunks with learned dynamics and value models, and uses the best correction to relabel the failed transition, turning the robot’s own failures into supervision for policy improvement. [3/n]
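
The retrieve, generate, rank, relabel sequence described above can be sketched on a 1-D toy problem. This is only an illustration under stated assumptions: all function names are hypothetical, candidates are generated by adding Gaussian noise to retrieved chunks (the thread does not say how DGAC generates them), and the check that the correction must beat the failed chunk is an added assumption.

```python
import random
from dataclasses import dataclass
from typing import List

Chunk = List[float]  # an action chunk: a short sequence of toy 1-D actions


@dataclass
class Experience:
    progress: float  # task progress at which the chunk was executed
    chunk: Chunk     # chunk taken from a successful rollout


def retrieve(bank, progress, k):
    """Return the k successful experiences closest in task progress."""
    return sorted(bank, key=lambda e: abs(e.progress - progress))[:k]


def propose(seeds, n_per_seed, noise, rng):
    """Candidates: retrieved chunks plus noisy copies (assumed generator)."""
    cands = [list(e.chunk) for e in seeds]
    for e in seeds:
        for _ in range(n_per_seed):
            cands.append([a + rng.gauss(0.0, noise) for a in e.chunk])
    return cands


def rank(state, cands, dynamics, value):
    """Pick the candidate whose predicted next state has the highest value."""
    return max(cands, key=lambda c: value(dynamics(state, c)))


def correct_failure(state, failed_chunk, bank, dynamics, value, rng, k=2):
    """Return a relabeled (state, chunk) pair, or None if nothing is better."""
    seeds = retrieve(bank, value(state), k)
    best = rank(state, propose(seeds, 4, 0.05, rng), dynamics, value)
    # Assumption: relabel only when the models predict an improvement.
    if value(dynamics(state, best)) > value(dynamics(state, failed_chunk)):
        return state, best
    return None


# Toy models: the state moves by the sum of the chunk; the goal is state 1.0.
def toy_dynamics(state, chunk):
    return state + sum(chunk)


def toy_value(state):
    return max(0.0, 1.0 - abs(1.0 - state))


bank = [Experience(0.0, [0.25, 0.25]),
        Experience(0.5, [0.25, 0.25]),
        Experience(0.9, [0.05, 0.05])]
fixed = correct_failure(0.5, [-0.25, -0.25], bank,
                        toy_dynamics, toy_value, random.Random(0))
```

The returned pair would replace the failed transition in the fine-tuning set, so the policy is trained on the corrected chunk rather than on the action that failed.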

Hanzhi Chen@hanzhic678

Self-improvement in action: After DGAC, those failed rollouts become training signals, helping the robot recover and complete the task consistently. [5/n]

Hanzhi Chen@hanzhic678

We also show that our approach is policy-agnostic. Bolted onto π0.5, DGAC alone adds +25.3 percentage points of success rate over SFT. [7/n]
