Are you still running your robot policies on vision encoders trained purely on static images?
Standard practice in robot learning today is to plug in a powerful vision model like CLIP, SigLIP, or DINOv2. That choice carries a quiet, convenient assumption: “Let mainstream computer vision handle perception, and the downstream policy will figure out the dynamics.”
But let’s be real for a moment. Is this truly the best we can do?
We introduce DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation.⬇️
🔷 Dynamics upstream: we push motion understanding into perception.
🔷 Tri-modal-dynamics supervision: image transitions × language × 3D flow, fused via simplex-volume alignment (260K trajectories from robot & human video)
🔷 Transfers everywhere: a visual backbone for diverse policies (MLP, Diffusion Policy, VLA)
🔷 +22.5% over the strongest baselines (DINOv2, SigLIP) in real-world out-of-distribution (OOD) settings
🔷 Open-source & easy to use
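For intuition on the simplex-volume idea, here is a rough sketch of how a volume-based alignment score over three modality embeddings can be computed. It is an illustration only, not the DynaFLIP training code: the function name `simplex_volume`, the unit normalization, and the choice of the origin as the fourth vertex are all assumptions made for this example. The volume comes from the Gram determinant of the three embeddings, so it shrinks toward zero as the image-transition, language, and 3D-flow vectors point in the same direction.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def simplex_volume(image_emb, text_emb, flow_emb):
    """Volume of the simplex spanned by the origin and three unit embeddings.

    Hypothetical helper for illustration. Uses the identity
    volume = sqrt(det(G)) / 3!, where G is the 3x3 Gram matrix.
    """
    vecs = [normalize(v) for v in (image_emb, text_emb, flow_emb)]
    g = [[dot(a, b) for b in vecs] for a in vecs]
    # Determinant of the 3x3 Gram matrix by cofactor expansion.
    det = (
        g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
        - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
        + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0])
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(det, 0.0)) / 6.0


# Three embeddings pointing the same way collapse the simplex.
aligned = simplex_volume([1, 0, 0], [2, 0, 0], [3, 0, 0])
# Mutually orthogonal embeddings give the largest volume for unit vectors.
orthogonal = simplex_volume([1, 0, 0], [0, 1, 0], [0, 0, 1])
print(aligned, orthogonal)
```

In a contrastive setup, a score like this could be minimized for matching triplets and kept large for mismatched ones, though the exact loss is a design choice this sketch does not cover.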
🌐 Website: https://dynaflip-robotics.github.io
📄 Paper: https://arxiv.org/abs/2605.30350
💻 Code: https://github.com/JU-SUK/DynaFLIP
🤗 Hugging Face: https://huggingface.co/jlee-larr/dynaflip-base