AI · 8h ago

Jia-Bin Huang and Furong Huang release DynaFLIP, a dynamics-guided visual backbone that outperforms DINOv2 and SigLIP in robot action planning

Training leveraged 260,000 robot and human video trajectories.

Original post

Jusuk Lee@jusukle

Are you still running your robot policies on vision encoders trained purely on static images?

Nowadays, the standard practice in robot learning is to plug in powerful vision models like CLIP, SigLIP, or DINOv2. This inherits a quiet, convenient assumption: “Let mainstream computer vision handle perception, and the downstream policy will figure out the dynamics.”

But let’s be real for a moment. Is this truly the best we can do?

We introduce DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation.⬇️

🔷 Dynamics upstream: we push motion understanding into perception.
🔷 Tri-modal-dynamics supervision: image transitions × language × 3D flow, fused via simplex-volume alignment (260K trajectories from robot & human video)
🔷 Transfers everywhere: a visual backbone for diverse policies (MLP, Diffusion Policy, VLA)
🔷 +22.5% over the strongest baseline (DINOv2, SigLIP) under real-world OOD
🔷 Open-Source & easy to use

🌐 Website: https://dynaflip-robotics.github.io
📄 Paper: https://arxiv.org/abs/2605.30350
💻 Code: https://github.com/JU-SUK/DynaFLIP
🤗 Hugging Face: https://huggingface.co/jlee-larr/dynaflip-base

7:31 AM · Jun 1, 2026 · 3.9K Views
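
The post names "simplex-volume alignment" across three modalities but does not spell out the objective. As a rough geometric illustration only, and not the authors' implementation, the sketch below measures the volume of the simplex spanned by the origin and three unit-normalized embeddings using the Gram determinant. The function name `simplex_volume` and the toy vectors are hypothetical; the loss in the paper may be defined differently.

```python
import math

import numpy as np


def simplex_volume(embeddings: np.ndarray) -> float:
    """Volume of the simplex whose vertices are the origin and the k rows.

    Assumption: each row is one modality's embedding (for example image
    transition, language, 3D flow). Rows are L2-normalized first, so the
    volume depends only on the angles between them.
    """
    a = np.asarray(embeddings, dtype=np.float64)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    gram = a @ a.T  # k x k matrix of pairwise cosine similarities
    # det(gram) is the squared volume of the parallelotope spanned by the rows;
    # clamp tiny negative values caused by floating-point round-off.
    det = max(float(np.linalg.det(gram)), 0.0)
    return math.sqrt(det) / math.factorial(a.shape[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=16)
    # Three nearly identical embeddings: small volume (well aligned).
    aligned = np.stack([base + 0.01 * rng.normal(size=16) for _ in range(3)])
    # Three mutually orthogonal embeddings: largest possible volume.
    orthogonal = np.eye(3, 16)
    print("aligned:   ", simplex_volume(aligned))
    print("orthogonal:", simplex_volume(orthogonal))
```

A smaller volume means the three embeddings point in closer directions, so a training objective built on this quantity would push matched triplets toward small volumes. How DynaFLIP weights or contrasts these volumes is not stated in the post.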

Sentiment

Users express gratitude to collaborators on the DynaFLIP research for advancing dynamics-aware encoders in robot vision and manipulation.

Positive: 100.0%
Negative: 0.0%

2 comments with sentiment.

Posts from X

Most Activity

Views: 2.9K · Bookmarks: 21 · Likes: 37 · Retweets: 3 · Replies: 2

Furong Huang@furongh

AI can write code, pass exams, and generate videos.

But ask a robot to pour almonds into a bowl, and it may still fail.

Why?

One reason: robots are often using visual encoders trained for the internet — not for action.

Our new work asks: do robots have the wrong eyes?

7h ago