We open-source Action Images — a new representation that translates 7-DoF robot actions into interpretable images.
Video models are emerging as powerful robotic foundation models, but a key challenge remains: how can we seamlessly integrate robot policies into video models?
Instead of representing actions as low-dimensional control tokens, Action Images provide a pixel-grounded action representation, reframing policy learning as a visual tracking problem!
By placing observations and actions in the same video space, Action Images enable a unified robotics world model that supports joint video-action generation, action-conditioned video generation, and action labeling!
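To make the idea of a pixel-grounded action concrete, here is a minimal sketch of one way a robot action could be drawn into image space. It is an illustrative assumption, not the encoding used in the paper or the released code: it takes only the 3D end-effector position (ignoring rotation and gripper state), assumes the position is already expressed in the camera frame, and uses a simple pinhole camera model. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical.

```python
import numpy as np


def project_point(xyz_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (pinhole model)."""
    x, y, z = xyz_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return fx * x / z + cx, fy * y / z + cy


def render_action_heatmap(xyz_cam, height, width, fx, fy, cx, cy, sigma=3.0):
    """Draw the projected end-effector position as a 2D Gaussian heatmap.

    Hypothetical helper: the result is an image aligned with the camera view,
    so the action lives in the same pixel space as the observation.
    """
    u, v = project_point(xyz_cam, fx, fy, cx, cy)
    cols = np.arange(width)[None, :]   # pixel x coordinates
    rows = np.arange(height)[:, None]  # pixel y coordinates
    heat = np.exp(-((cols - u) ** 2 + (rows - v) ** 2) / (2.0 * sigma ** 2))
    return heat.astype(np.float32)


def decode_pixel(heat):
    """Recover the (x, y) pixel of the heatmap peak, i.e. 'track' the action."""
    row, col = np.unravel_index(np.argmax(heat), heat.shape)
    return int(col), int(row)


# Example: a point 1 m in front of a 64x48 camera (all values are made up).
heat = render_action_heatmap(
    (0.125, -0.0625, 1.0), height=48, width=64, fx=80.0, fy=80.0, cx=32.0, cy=24.0
)
peak = decode_pixel(heat)
```

Under these assumptions, reading an action back out reduces to locating the heatmap peak, which is one way to see how policy learning can be treated as a visual tracking problem.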
Code: http://github.com/UMass-Embodied-AGI/ActionImages
Paper: https://arxiv.org/abs/2604.06168


