
VQAScore Upgrades To Evaluate Text-To-Video Generation With 20+ VLMs

Zhiqiu Lin (@ZhiqiuLin)

Two years ago we released VQAScore with a simple idea: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score.
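The scoring idea above can be sketched in a few lines. This is only a toy illustration, not the t2v_metrics implementation: `build_question` and `p_yes` are hypothetical helper names, the question wording follows the post rather than the library's actual template, and the logits are made up in place of a real VLM's next-token output.

```python
import math


def build_question(prompt: str) -> str:
    # Hypothetical template, following the wording quoted in the post.
    return f'Does this image show "{prompt}"?'


def p_yes(next_token_logits: dict[str, float]) -> float:
    """Return the softmax probability of the token "Yes".

    `next_token_logits` stands in for the logits a VLM would assign to
    candidate first answer tokens (an assumption for this sketch).
    """
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(next_token_logits.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in next_token_logits.items()}
    total = sum(exps.values())
    return exps.get("Yes", 0.0) / total


# Made-up logits; a real system would obtain these from a VLM
# conditioned on the image (or video) and the question.
question = build_question("a dog entering the background")
toy_logits = {"Yes": 2.0, "No": 0.0, "Maybe": -1.0}
score = p_yes(toy_logits)
print(question, round(score, 3))
```

Because the score is a probability read off the model's answer distribution, a stronger VLM can be swapped in without changing the scoring step.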

It has since become the go-to evaluation metric and reward model for visual generation, replacing CLIPScore across the field. It has been adopted by Google DeepMind (Imagen 3 & 4), NVIDIA, ByteDance, and other frontier labs. Our open-source model has 2M+ downloads on Hugging Face.

Today: a major upgrade. VQAScore now supports text-to-video evaluation using 20+ SOTA VLMs including GPT, Gemini, and Qwen, capturing generation accuracy for prompts like "a shallow depth of field shot rack focusing from a foreground crumpet to a dog entering the background, catching it at the focus shift."

As VLMs get stronger, VQAScore gets stronger. For free.

📄 Paper: https://arxiv.org/abs/2404.01291
💻 Code: https://github.com/linzhiqiu/t2v_metrics

Thanks to @chancharikm for driving this upgrade, and @gneubig and @RamananDeva.

5:00 AM · Jun 9, 2026 · 12K Views