
Jia-Bin Huang and Furong Huang release DynaFLIP, a dynamics-guided visual backbone that outperforms DINOv2 and SigLIP in robot action planning

Training leveraged 260,000 robot and human video trajectories.

Original post
Jusuk Lee (@jusukle)

Are you still running your robot policies on vision encoders trained purely on static images?

Nowadays, the standard practice in robot learning is to plug in powerful vision models like CLIP, SigLIP, or DINOv2. This inherits a quiet, convenient assumption: “Let mainstream computer vision handle perception, and the downstream policy will figure out the dynamics.”

But let’s be real for a moment. Is this truly the best we can do?

We introduce DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation.⬇️

🔷 Dynamics upstream: we push motion understanding into perception.
🔷 Tri-modal-dynamics supervision: image transitions × language × 3D flow, fused via simplex-volume alignment (260K trajectories from robot & human video)
🔷 Transfers everywhere: a visual backbone for diverse policies (MLP, Diffusion Policy, VLA)
🔷 +22.5% over the strongest baseline (DINOv2, SigLIP) under real-world OOD
🔷 Open-Source & easy to use

🌐 Website: https://dynaflip-robotics.github.io
📄 Paper: https://arxiv.org/abs/2605.30350
💻 Code: https://github.com/JU-SUK/DynaFLIP
🤗 Hugging Face: https://huggingface.co/jlee-larr/dynaflip-base

7:31 AM · Jun 1, 2026 · 3.9K Views
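
The post names "simplex-volume alignment" as the way the three modalities are fused but does not spell it out. One common way to picture such an objective, offered here only as a rough sketch and not as DynaFLIP's actual loss, is to measure the volume spanned by three unit-normalized embeddings using their Gram matrix: the volume shrinks toward zero as the embeddings line up. The function names and the toy vectors below are hypothetical and chosen for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def gram_volume(a, b, c):
    """Volume of the parallelotope spanned by three unit-normalized vectors.

    The tetrahedron (3-simplex) formed with the origin has 1/6 of this
    volume, so the two quantities rank alignment identically.
    Illustrative assumption only: not taken from the DynaFLIP paper or code.
    """
    a, b, c = normalize(a), normalize(b), normalize(c)
    ab, ac, bc = dot(a, b), dot(a, c), dot(b, c)
    # Determinant of the 3x3 Gram matrix [[1, ab, ac], [ab, 1, bc], [ac, bc, 1]].
    det = 1 + 2 * ab * ac * bc - ab * ab - ac * ac - bc * bc
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(det, 0.0))

# Hypothetical embeddings standing in for image-transition, language, and 3D-flow features.
misaligned = gram_volume([1, 0, 0], [0, 1, 0], [0, 0, 1])  # mutually orthogonal
aligned = gram_volume([1, 2, 3], [2, 4, 6], [1, 2, 3])     # all pointing the same way
```

Under this reading, a training objective would push the volume down for matching triplets of image transition, language, and flow, and keep it large for mismatched ones.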
Posts from X
Views 2.9K · Bookmarks 21 · Likes 37 · Retweets 3 · Replies 2
Furong Huang (@furongh)

AI can write code, pass exams, and generate videos.

But ask a robot to pour almonds into a bowl, and it may still fail.

Why?

One reason: robots are often using visual encoders trained for the internet — not for action.

Our new work asks: do robots have the wrong eyes?

7h · Views 2.9K · Likes 37 · Bookmarks 21