
VSTAT Benchmark Exposes Multimodal LLMs' Failures in Video State Tracking

Original post · Saining Xie · #158

VSTAT highlights the substantial perceptual gap between humans and MLLMs, but it goes far beyond that. Its diverse tasks are designed not merely to assess simple pixel-space tracking, but to evaluate how well models capture and understand evolving world states in the latent space of videos. Text is only one way to probe this capability, and we are excited to see future evaluations explore new modalities such as pixels, actions, and beyond!

Working on this benchmark has been a lot of fun along the way—huge shout-out to my amazing collaborators!

7:48 PM · Jun 2, 2026 · 2.2K Views
Posts from X
Views 10.7K · Bookmarks 54 · Likes 132 · Retweets 14 · Replies 6
Saining Xie (@sainingxie)

how does the brain build and track an internal state of the world from (possibly incomplete and noisy) visual observations? i believe visual state tracking will be the grand challenge for vision in the coming years, and i hope this benchmark can be a useful starting line. enjoy!

5h · Views 10.7K · Likes 132 · Bookmarks 54