
VQAScore Upgrades To Evaluate Text-To-Video Generation With 20+ VLMs


Zhiqiu Lin@ZhiqiuLin

Two years ago we released VQAScore with a simple idea: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score.
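To make the idea concrete, here is a minimal sketch of the scoring step, not the t2v_metrics implementation. It assumes you already have the VLM's logits for the first answer token; the names `build_question`, `p_yes`, and the toy logits are hypothetical and only for illustration.

```python
import math
from typing import Mapping

# Question wording taken from the post; the exact template used by a
# given model may differ (assumption).
QUESTION_TEMPLATE = 'Does this image show "{prompt}"?'


def build_question(prompt: str) -> str:
    """Wrap a generation prompt in the yes/no question."""
    return QUESTION_TEMPLATE.format(prompt=prompt)


def p_yes(answer_logits: Mapping[str, float], yes_token: str = "Yes") -> float:
    """Softmax over the candidate answer-token logits, then return P(yes_token).

    `answer_logits` stands in for the logits a real VLM would produce
    for the first generated token (hypothetical input format).
    """
    max_logit = max(answer_logits.values())  # subtract max for numerical stability
    exps = {tok: math.exp(val - max_logit) for tok, val in answer_logits.items()}
    return exps[yes_token] / sum(exps.values())


# Toy logits, made up for illustration; a real model scores its full vocabulary.
toy_logits = {"Yes": 2.0, "No": 0.0, "Maybe": -1.0}
question = build_question("a dog entering the background")
score = p_yes(toy_logits)
print(question, round(score, 3))
```

A higher score means the model considers "Yes" the more likely answer, so the image (or video) is judged a closer match to the prompt.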

It has since become the go-to evaluation metric and reward model for visual generation, replacing CLIPScore across the field. It has been adopted by Google DeepMind (Imagen 3 & 4), NVIDIA, ByteDance, and other frontier labs. Our open-source model has 2M+ downloads on Hugging Face.

Today: a major upgrade. VQAScore now supports text-to-video evaluation using 20+ SOTA VLMs including GPT, Gemini, and Qwen, capturing generation accuracy for prompts like "a shallow depth of field shot rack focusing from a foreground crumpet to a dog entering the background, catching it at the focus shift."

As VLMs get stronger, VQAScore gets stronger. For free.

📄 Paper: https://arxiv.org/abs/2404.01291
💻 Code: https://github.com/linzhiqiu/t2v_metrics

Thanks to @chancharikm for driving this upgrade, and @gneubig and @RamananDeva.