/AI · 6h ago

ConvNeXt co-creator Zhuang Liu releases WorldBench, a vision-language benchmark where Gemini-3.1-Pro scored just 64% accuracy

The dataset contains 2,000 hand-written, human-verified VQA items.

Original post

Zhuang Liu @liuzhuang1234 · #686 in AI

Are VLMs nearing saturation on vision benchmarks?

Not on WorldBench: 2,000 carefully curated and verified questions over a visually diverse range of images, designed to be hard for frontier models.

The strongest still gets only 64%.

Led by @DavidYin0609 and @harishkrik

8:33 PM · Jun 7, 2026 · 3.9K Views

Cluster Engagement

Posts from X

Most Activity

Views 2.7K · Bookmarks 4 · Likes 14 · Retweets 2

Wenhu Chen @WenhuChen

A joyful collaboration with @liuzhuang1234. WorldBench indicates that VQA is still far from being solved yet. Hope this new benchmark can provide new insights to the community!

5h

Replies: 1

Zhuang Liu @liuzhuang1234

This grew out of our earlier "dataset bias" work: modern training datasets still live in surprisingly narrow visual distributions.

WorldBench asks the same question on the benchmark side.

http://arxiv.org/abs/2403.08632

Logic Lab AI 🧪 @LogicLabAI

@liuzhuang1234 Narrow training distributions are sneaky because benchmarks usually share the same blind spots, so models look fine until the real world shows up.

6h · 2