Tech · 3h ago

Reka AI launches PhysicalRealismBench-U to evaluate VLM physical reasoning, with GPT-5.5 leading at 57.7%

No tested model demonstrated reliable physical reasoning in video.
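
For context, the 57.7% headline figure is reported in the original post as a "realism F1". The post does not say how that score is aggregated, so the following is only a minimal sketch of the standard F1 definition (the harmonic mean of precision and recall), not Reka's scoring code. The function name and the example counts are hypothetical.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    tp, fp, fn are counts of true positives, false positives and
    false negatives. Returns 0.0 when there are no true positives,
    where precision and recall are zero or undefined.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of flagged items that were correct
    recall = tp / (tp + fn)     # share of real items that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = f1_score(tp=50, fp=25, fn=50)
print(f"{example:.3f}")
```

Because F1 combines precision and recall, a model that flags nearly every video as unrealistic does not score well just by catching every violation.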

Original post
Reka@RekaAILabs

We just released PhysicalRealismBench-U — a benchmark for testing whether VLMs actually understand physics in programmatically generated videos, fully attributable.

This is an important step toward models that understand and generate physically realistic outputs.

Best result across 9 frontier models: 57.7% realism F1.

Read our blog post here: https://reka.ai/news/physicalrealismbench-attributable-physical-realism-evaluation-for-video-world-models

Visit the benchmark: https://link.reka.ai/physical-realism-benchmarks-VLM

7:04 AM · Jun 11, 2026 · 1.1K Views
Sentiment

Positive commenters praise Reka's PhysicalRealismBench-U as important for VLM progress; negative commenters call the baseline rough, citing a 14% figure (the post itself reports a best score of 57.7% realism F1).

Positive: 50.0% · Negative: 50.0%
2 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 259 · Bookmarks 1 · Likes 2 · Retweets 1
Mikel Artetxe@artetxem

We're releasing a new benchmark for evaluating whether VLMs can detect, localize, and explain physical violations in video.

We see large gaps across frontier models, with GPT-5.5 as the clear winner, but none are close to reliable yet.

Learn more: https://reka.ai/news/physicalrealismbench-attributable-physical-realism-evaluation-for-video-world-models

2h · Views 259 · Likes 2 · Bookmarks 1
Strata@ChainZenit

@RekaAILabs that actually sounds like a super important benchmark for vlm progress.

3h · Views 6
Rugbist@rugbist_

@RekaAILabs so youre saying best VLM got a 14%? thats rough for the baseline lol

3h