
DynaFLIP, a dynamics-guided vision encoder, reports a 22.5% gain over the strongest static baseline (DINOv2, SigLIP) in real-world out-of-distribution robot learning evaluations

It is trained on 260K robot and human trajectories.

Chris Paxton (@chris_j_paxton)

Motion understanding is key to robotics

Jusuk Lee (@jusukle)

Are you still running your robot policies on vision encoders trained purely on static images?

Nowadays, the standard practice in robot learning is to plug in powerful vision models like CLIP, SigLIP, or DINOv2. This inherits a quiet, convenient assumption: “Let mainstream computer vision handle perception, and the downstream policy will figure out the dynamics.”

But let’s be real for a moment. Is this truly the best we can do?

We introduce DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation.⬇️

🔷 Dynamics upstream: we push motion understanding into perception.
🔷 Tri-modal-dynamics supervision: image transitions × language × 3D flow, fused via simplex-volume alignment (260K trajectories from robot & human video)
🔷 Transfers everywhere: a visual backbone for diverse policies (MLP, Diffusion Policy, VLA)
🔷 +22.5% over the strongest baseline (DINOv2, SigLIP) under real-world OOD
🔷 Open-Source & easy to use

🌐 Website: https://dynaflip-robotics.github.io
📄 Paper: https://arxiv.org/abs/2605.30350
💻 Code: https://github.com/JU-SUK/DynaFLIP
🤗 Hugging Face: https://huggingface.co/jlee-larr/dynaflip-base

10:32 AM · Jun 2, 2026 · 2.3K Views
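
The post names "simplex-volume alignment" as the way the three modalities are fused but does not spell out the computation. As a rough illustration only, here is one way such a volume could be measured: treat the image-transition, language, and 3D-flow embeddings as unit vectors and compute the volume of the simplex they span with the origin, using the determinant of their Gram matrix. The function names, the unit-normalization step, and the use of this volume as an alignment score are assumptions for this sketch, not details confirmed by the post or taken from the DynaFLIP code.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def simplex_volume(image_emb, text_emb, flow_emb):
    """Volume of the simplex whose vertices are the origin and the three
    unit-normalized embeddings (hypothetical helper, not the DynaFLIP API).

    For k vectors the volume is sqrt(det(Gram)) / k!; here k = 3.
    """
    u, v, w = (normalize(e) for e in (image_emb, text_emb, flow_emb))
    uv, uw, vw = dot(u, v), dot(u, w), dot(v, w)
    # Determinant of the Gram matrix [[1, uv, uw], [uv, 1, vw], [uw, vw, 1]].
    gram_det = 1.0 + 2.0 * uv * uw * vw - uv * uv - uw * uw - vw * vw
    # Clamp small negative values caused by floating-point rounding.
    return math.sqrt(max(gram_det, 0.0)) / 6.0


# Mutually orthogonal embeddings span the largest volume for unit vectors.
orthogonal = simplex_volume([1, 0, 0], [0, 1, 0], [0, 0, 1])
# Embeddings pointing the same way collapse the simplex to (near) zero volume.
aligned = simplex_volume([1, 2, 3], [2, 4, 6], [0.5, 1.0, 1.5])
```

Under these assumptions, a smaller volume means the three embeddings point in closer directions, so a training objective could shrink the volume for matched (transition, instruction, flow) triplets and keep it larger for mismatched ones. Whether DynaFLIP uses this exact formulation is not stated in the post.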
Original post by Jusuk Lee (@jusukle): 1d · 41.1K views · 165 likes · 165 bookmarks