VSTAT highlights the substantial perceptual gap between humans and MLLMs, but it goes far beyond that. Its diverse tasks are designed not merely to assess simple pixel-space tracking, but to evaluate how well models capture and understand evolving world states in the latent space of videos. Text is only one way to probe this capability, and we are excited to see future evaluations explore new modalities such as pixels, actions, and beyond!
Working on this benchmark has been a lot of fun along the way—huge shout-out to my amazing collaborators!