Two years ago we released VQAScore with a simple idea: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score.
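A minimal sketch of that idea, assuming the VLM exposes next-token logits for its answer. The function names and toy logits below are hypothetical and for illustration only; they are not the t2v_metrics API.

```python
import math
from typing import Mapping


def build_question(prompt: str) -> str:
    # Question template taken from the description above.
    return f'Does this image show "{prompt}"?'


def p_yes(logits: Mapping[str, float], yes_token: str = "Yes") -> float:
    # Softmax over the candidate answer tokens, then read off P(Yes).
    # Subtracting the max logit keeps the exponentials numerically stable.
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    return exps[yes_token] / sum(exps.values())


# Toy logits standing in for a real VLM's output (assumed values).
toy_logits = {"Yes": 2.0, "No": 0.5, "Maybe": -1.0}
question = build_question("a dog entering the background")
score = p_yes(toy_logits)
```

A higher `score` means the model is more confident the image matches the prompt. A real VLM would compute the softmax over its full vocabulary rather than three tokens.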
It has since become the go-to evaluation metric and reward model for visual generation, replacing CLIPScore across the field. It has been adopted by Google DeepMind (Imagen 3 & 4), NVIDIA, ByteDance, and other frontier labs, and our open-source model has 2M+ downloads on Hugging Face.
Today: a major upgrade. VQAScore now supports text-to-video evaluation using 20+ SOTA VLMs including GPT, Gemini, and Qwen, measuring how accurately a generated video matches prompts like "a shallow depth of field shot rack focusing from a foreground crumpet to a dog entering the background, catching it at the focus shift."
As VLMs get stronger, VQAScore gets stronger. For free.
📄 Paper: https://arxiv.org/abs/2404.01291 💻 Code: https://github.com/linzhiqiu/t2v_metrics
Thanks to @chancharikm for driving this upgrade, and @gneubig and @RamananDeva.