ConvNeXt co-creator Zhuang Liu releases WorldBench, a vision-language benchmark where Gemini-3.1-Pro scored just 64% accuracy

The dataset contains 2,000 hand-written, human-verified VQA items.

Original post
Zhuang Liu (@liuzhuang1234)

Are VLMs nearing saturation on vision benchmarks?

Not on WorldBench: 2,000 carefully curated and verified questions over a visually diverse range of images, designed to be hard for frontier models.

The strongest still gets only 64%.

Led by @DavidYin0609 and @harishkrik

8:33 PM · Jun 7, 2026 · 3.9K Views
Posts from X
Wenhu Chen (@WenhuChen)

A joyful collaboration with @liuzhuang1234. WorldBench indicates that VQA is still far from being solved yet. Hope this new benchmark can provide new insights to the community!

5h · 2.7K Views · 14 Likes · 4 Bookmarks
Zhuang Liu (@liuzhuang1234)

This grew out of our earlier "dataset bias" work: modern training datasets still live in surprisingly narrow visual distributions.

WorldBench asks the same question on the benchmark side.

http://arxiv.org/abs/2403.08632

6h · 371 Views · 4 Likes · 0 Bookmarks

@liuzhuang1234 Narrow training distributions are sneaky because benchmarks usually share the same blind spots, so models look fine until the real world shows up.

6h · 2 Views