1/ The biggest problem in video understanding today isn't the models. It's that we can barely run them.
Introducing StateKV: an inference-time method that makes pretrained video VLMs scale linearly with video length.🧵
🔗 http://ceyzaguirre4.github.io/StateKV