VQAScore Upgrades To Evaluate Text-To-Video Generation With 20+ VLMs


Zhiqiu Lin (@ZhiqiuLin)

Two years ago we released VQAScore with a simple idea: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score.
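To make the idea concrete, here is a minimal sketch of the scoring step, not the t2v_metrics implementation. It assumes a VLM has already produced logits for the first answer token; the helper names `build_question` and `p_yes`, the three-token answer vocabulary, and the logit values are all hypothetical and chosen only for illustration. A real model would normalize over its full vocabulary.

```python
import math


def build_question(prompt: str) -> str:
    """Fill the question template quoted in the post with a generation prompt."""
    return f'does this image show "{prompt}"?'


def p_yes(logits: dict[str, float], yes_token: str = "Yes") -> float:
    """Softmax probability of `yes_token` over the supplied answer-token logits."""
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits.values())
    exps = {tok: math.exp(val - max_logit) for tok, val in logits.items()}
    return exps[yes_token] / sum(exps.values())


# Hypothetical logits for the first answer token (illustrative values only).
toy_logits = {"Yes": 2.0, "No": 0.0, "Maybe": -1.0}

question = build_question("a dog entering the background")
score = p_yes(toy_logits)
print(question)
print(f"P(Yes) = {score:.3f}")
```

Because the score is a probability rather than a hard yes/no answer, it can rank several generations for the same prompt, which is what lets it serve as both a metric and a reward signal.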

It has since become the go-to evaluation metric and reward model for visual generation, replacing CLIPScore across the field. Adopted by Google DeepMind (Imagen 3 & 4), NVIDIA, ByteDance, and other frontier labs. Our open-source model has 2M+ downloads on HuggingFace.

Today: a major upgrade. VQAScore now supports text-to-video evaluation using 20+ SOTA VLMs including GPT, Gemini, and Qwen, capturing generation accuracy for prompts like "a shallow depth of field shot rack focusing from a foreground crumpet to a dog entering the background, catching it at the focus shift."

As VLMs get stronger, VQAScore gets stronger. For free.

📄 Paper: https://arxiv.org/abs/2404.01291
💻 Code: https://github.com/linzhiqiu/t2v_metrics

Thanks to @chancharikm for driving this upgrade, and @gneubig and @RamananDeva.

5:00 AM · Jun 9, 2026 · 12K Views

