/AI6h ago

VSTAT Benchmark Exposes Multimodal LLMs' Failures in Video State Tracking

10161266214.1K

Quote posts

#158

Reposts

#158

Original post

Saining Xie#158

Willis (Nanye) Ma@ma_nanye

VSTAT highlights the substantial perceptual gap between humans and MLLMs, but it goes far beyond that. Its diverse tasks are designed not merely to assess simple pixel-space tracking, but to evaluate how well models capture and understand evolving world states in the latent space of videos. Text is only one way to probe this capability, and we are excited to see future evaluations explore new modalities such as pixels, actions, and beyond!

Working on this benchmark has been a lot of fun along the way—huge shout-out to my amazing collaborators!

7:48 PM · Jun 2, 2026 · 2.2K Views

/AI6h ago

VSTAT Benchmark Exposes Multimodal LLMs' Failures in Video State Tracking

--0--

Quote posts

#158

Reposts

#158

Original post

Saining Xie#158

Willis (Nanye) Ma@ma_nanye

Working on this benchmark has been a lot of fun along the way—huge shout-out to my amazing collaborators!

7:48 PM · Jun 2, 2026 · 2.2K Views

Sentiment

Users express gratitude to collaborators and advisors on the VSTAT benchmark exposing multimodal LLM failures in video state tracking.

Pos

100.0%

Neg

0.0%

1 comments with sentiment.

Cluster Engagement

Sentiment

Sentiment building, check back later.

Cluster Engagement

Views

Comments

Reposts

Bookmarks

Expand data

Posts from X

Most Activity

VIEWS10.7KBOOKMARKS54LIKES132RETWEETS14REPLIES6

Saining Xie@sainingxie

how does the brain build and track an internal state of the world from (possibly incomplete and noisy) visual observations? i believe visual state tracking will be the grand challenge for vision in the coming years, and i hope this benchmark can be a useful starting line. enjoy!

5h10.7K13254