
Cristóbal Eyzaguirre Ercilla introduces StateKV, an inference-time method that lets pretrained video VLMs scale linearly with video length

It maintains VideoMME benchmark accuracy without model retraining.

Original post · Jiajun Wu · #358

1/ The biggest problem in video understanding today isn't the models. It's that we can barely run them.

Introducing StateKV: an inference-time method that makes pretrained video VLMs scale linearly with video length.🧵

🔗 http://ceyzaguirre4.github.io/StateKV

8:49 AM · Jun 2, 2026 · 5.3K Views
Posts from X
Views 1.5K · Bookmarks 3 · Likes 8 · Retweets 2

Processing long videos with VLMs shouldn't scale quadratically. Enter StateKV! 🎬💡

By framing streaming prefill as a fixed-capacity temporal state, we unlock linear-time prefill while keeping full per-frame detail.

Paper by @CristbalEyzagu2 and team👇

https://arxiv.org/abs/2605.31598
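
To make the quadratic-versus-linear claim concrete, here is a rough cost-counting sketch. It is not StateKV's algorithm: the thread does not describe how the state is built or updated, so the sketch only assumes a cache capped at a hypothetical `capacity` entries and counts how many query-key pairs attention would score. All function and parameter names are illustrative.

```python
def full_prefill_cost(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs scored when every token attends causally to all earlier tokens."""
    n = num_frames * tokens_per_frame
    return n * (n + 1) // 2


def fixed_state_prefill_cost(num_frames: int, tokens_per_frame: int, capacity: int) -> int:
    """Query-key pairs scored when frames stream through a state of at most `capacity` entries.

    Assumption: each frame's tokens attend to the current state plus, causally,
    to the tokens of their own frame. Which entries the state keeps is not
    modeled here; only its size matters for the count.
    """
    t = tokens_per_frame
    state_size = 0
    cost = 0
    for _ in range(num_frames):
        # t queries against the cached state, plus causal attention within the frame.
        cost += t * state_size + t * (t + 1) // 2
        # Hypothetical update rule: the state grows until it reaches the cap.
        state_size = min(capacity, state_size + t)
    return cost


if __name__ == "__main__":
    for frames in (10, 20, 40):
        print(
            frames,
            full_prefill_cost(frames, tokens_per_frame=2),
            fixed_state_prefill_cost(frames, tokens_per_frame=2, capacity=4),
        )
```

Under these assumptions, doubling the number of frames roughly quadruples the full-attention count, while the capped count grows by a constant amount per extra frame once the state is full. With a cap large enough to hold every token, the two counts coincide.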
