
Physical Intelligence and NVIDIA demonstrate a method to evaluate robot policies entirely within a learned world model

The data-driven approach aims to replace manual physics simulators


Original post

Physical Intelligence@physical_int

New work with @nvidia: evaluating robot policies entirely inside a world model. The policy acts, the model imagines the consequences, and the imagined evals predict real-world results. 🧵

real vs world-model rollout side by side📷

11:38 AM · Jun 18, 2026 · 50.4K Views

Sentiment

Users praise Physical Intelligence and NVIDIA's work on evaluating robot policies inside world models, saying it promises to ease evaluation bottlenecks, speed up iteration dramatically, and make robotics development safer and cheaper.

Pos: 100.0% · Neg: 0.0% (7 comments with sentiment)


Posts from X

Most Activity

Views: 4.1K · Likes: 47 · Retweets: 3 · Replies: 5

Chris Paxton@chris_j_paxton

The future is the data-driven simulator: take in real-world data and use it to build a model. Simulators are essential but building them by-hand sucks. And thus world models.



Bookmarks: 25

Physical Intelligence@physical_int

Joint work with the NVIDIA Cosmos team. Paper: https://arxiv.org/abs/2606.18610 Website: https://weichengtseng.github.io/sc3-eval/


Physical Intelligence@physical_int

We found that the key to making this work is self-consistency. The model is trained to be consistent between a forward and an inverse model (i.e., actions predicted with inverse dynamics lead to states consistent with the inverse-dynamics input), consistent across different cameras, and consistent between training and test, by halting episodes that go out of distribution.

With and without self consistency comparison video
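The forward/inverse consistency described in this post can be illustrated with a toy check. This is only a sketch under stated assumptions: `cycle_consistency_error`, `toy_forward`, `toy_inverse`, and `biased_inverse` are hypothetical names, the states are plain lists of floats rather than camera frames, and the thread does not say how the actual training objective is computed.

```python
from typing import Callable, List

State = List[float]
Action = List[float]


def cycle_consistency_error(
    forward: Callable[[State, Action], State],
    inverse: Callable[[State, State], Action],
    state: State,
    next_state: State,
) -> float:
    """Largest gap between next_state and forward(state, inverse(state, next_state))."""
    action = inverse(state, next_state)      # action the inverse model infers
    reconstructed = forward(state, action)   # state the forward model predicts from it
    return max(abs(r - n) for r, n in zip(reconstructed, next_state))


# Hypothetical toy dynamics: the next state is the state plus the action.
def toy_forward(state: State, action: Action) -> State:
    return [s + a for s, a in zip(state, action)]


# Inverse model that agrees with toy_forward.
def toy_inverse(state: State, next_state: State) -> Action:
    return [n - s for s, n in zip(state, next_state)]


# Inverse model that underestimates the action by half, so the cycle does not close.
def biased_inverse(state: State, next_state: State) -> Action:
    return [0.5 * (n - s) for s, n in zip(state, next_state)]


consistent_error = cycle_consistency_error(toy_forward, toy_inverse, [0.0, 1.0], [0.5, 1.5])
inconsistent_error = cycle_consistency_error(toy_forward, biased_inverse, [0.0, 1.0], [0.5, 1.5])
print(consistent_error, inconsistent_error)
```

A training loss could penalize this kind of error so that the two models agree, but that reading is an assumption drawn from the post's wording, not a description of the paper's method.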


Physical Intelligence@physical_int

Setup:
• world model trained on our robot data generates what the cameras would see
• VLA in the loop: sees generated frames, outputs actions, world model predicts what happens next
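The closed loop in this setup can be sketched in a few lines. Everything here is a hypothetical stand-in: `rollout`, `toy_policy`, and `toy_world_model` are illustrative names, and a "frame" is reduced to a single float (a 1-D position) instead of generated camera images.

```python
from typing import Callable, List


def rollout(
    policy: Callable[[float], float],
    world_model: Callable[[float, float], float],
    first_frame: float,
    horizon: int,
) -> List[float]:
    """Alternate policy and world model: the policy only ever sees generated frames."""
    frames = [first_frame]
    for _ in range(horizon):
        action = policy(frames[-1])                     # policy acts on the latest frame
        frames.append(world_model(frames[-1], action))  # model imagines the consequence
    return frames


# Hypothetical policy: step toward a goal position of 3.0, at most 1.0 per step.
def toy_policy(frame: float) -> float:
    return max(-1.0, min(1.0, 3.0 - frame))


# Hypothetical world model: the action is added to the current position.
def toy_world_model(frame: float, action: float) -> float:
    return frame + action


frames = rollout(toy_policy, toy_world_model, first_frame=0.0, horizon=5)
print(frames)
```

No real robot or simulator appears in the loop, which is the point of the setup: the only inputs are a starting frame and the two learned components.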


Physical Intelligence@physical_int

Why it matters: evals are critical to progress. As our models become more general and more capable, evaluating all the different things they can do takes longer and longer. The world model eval needs a couple of hours of GPU time.


Physical Intelligence@physical_int

Favorite finding: physical failures transfer. A policy that fumbles a grasp in the real world fumbles it in the world model too.


Physical Intelligence@physical_int

The model also knows when not to trust itself: its uncertainty estimates track how far generations deviate from reality.

The uncertainty accurately predicts when model generation degrades
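One way such an uncertainty estimate could be used is to stop a rollout once the model no longer trusts its own generations. The sketch below assumes this link between uncertainty and halting; the thread mentions both but does not state that they are connected this way. `rollout_with_halting`, `toy_uncertain_model`, and the `max_uncertainty` threshold are hypothetical.

```python
from typing import Callable, List, Tuple


def rollout_with_halting(
    policy: Callable[[float], float],
    world_model: Callable[[float, float], Tuple[float, float]],
    first_frame: float,
    horizon: int,
    max_uncertainty: float,
) -> Tuple[List[float], bool]:
    """Roll out until the horizon, or halt early when uncertainty exceeds the threshold."""
    frames = [first_frame]
    for _ in range(horizon):
        action = policy(frames[-1])
        next_frame, uncertainty = world_model(frames[-1], action)
        if uncertainty > max_uncertainty:
            return frames, True   # halted: the untrusted frame is discarded
        frames.append(next_frame)
    return frames, False          # reached the horizon without halting


# Hypothetical model whose uncertainty grows as the state drifts away from 0.
def toy_uncertain_model(frame: float, action: float) -> Tuple[float, float]:
    next_frame = frame + action
    return next_frame, abs(next_frame) / 10.0


# Hypothetical policy that always moves one unit to the right.
def constant_policy(frame: float) -> float:
    return 1.0


kept_frames, halted = rollout_with_halting(
    constant_policy, toy_uncertain_model, first_frame=0.0, horizon=10, max_uncertainty=0.35
)
print(kept_frames, halted)
```

In this toy run the fourth step would reach position 4.0 with uncertainty 0.4, so the episode is cut after three generated frames.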


Physical Intelligence@physical_int

Across VLA checkpoints of varying performance on our table bussing benchmark, world model evals reproduce the real-world policy ranking.

correlation plot: predicted vs ground truth success rate
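Whether imagined evals "reproduce the ranking" is commonly checked with a rank correlation. The sketch below computes Spearman's coefficient by hand; it is not necessarily the metric used in the paper, and the success rates are made-up illustrative numbers, not results from the thread.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined when all values are tied")
    return cov / (var_x * var_y) ** 0.5


# Made-up per-checkpoint success rates, for illustration only.
world_model_rates = [0.20, 0.45, 0.60, 0.85]
real_rates_same_order = [0.25, 0.40, 0.70, 0.80]
real_rates_one_swap = [0.40, 0.25, 0.70, 0.80]

rho_same = spearman(world_model_rates, real_rates_same_order)
rho_swap = spearman(world_model_rates, real_rates_one_swap)
print(rho_same, rho_swap)
```

A coefficient of 1.0 means the two evaluations order the checkpoints identically even if the absolute success rates differ, which is the property the post claims.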


Ahmad Baracat@AhmadBaracat

@physical_int @nvidia Curious if @physical_int might be interested in this kind of real world data coming from cobots in factories and machine shops.


TechniaHQ | humanoid robots@techniahq

@physical_int @nvidia This self-consistent world model eval is a breakthrough for robotics scaling. Imagined rollouts predicting real policy performance in hours of GPU time instead of lengthy hardware tests will speed iteration massively.


Jakie PLA@3DPrintAficio

@physical_int @nvidia Eval inside a world model instead of on hardware. Iteration speed through the roof. THIS.


Ashok@ashokM93

@physical_int @nvidia Cool work 🙌 Going to read the paper!

I tried eval on Droid dataset with Pi0 driving. Only issue was reliable future frames, got only about 1.5-2.5s window with Cosmos 3 Nano after which I get visual degradation.


Tradeye@TradeyeLLC

@physical_int This is a big step forward. World model evals should dramatically speed up iteration. We’re building something complementary on the data side — capturing narrated first-person video from licensed HVAC, plumbing & electrical techs working inside real occupied homes. The kind of grounded, long-horizon human behavior that world models will eventually need to accurately simulate messy real environments. http://tradeyecapture.com


Jinyu Hou@jinyuhou0

@physical_int @nvidia This is cool. Would be interesting to see how the wm-as-a-judge generalizes to unseen tasks.


Ferbin@Ferbin08

@physical_int @nvidia Ran into this exact wall. Model destroys the test video, then fails the second you change camera angle or lighting.

Self-consistency across views forces it to learn the actual geometry instead of just memorizing what worked in the test set.


Ferbin@Ferbin08

@physical_int @nvidia If this catches real failure modes, you've solved the evaluation bottleneck for robotics. That's the whole scaling game.


Ferbin@Ferbin08

@physical_int @nvidia Fast evals only work if your simulation is honest about what it doesn't know.

Most just confidently break in the real world.


Luke Hansen@luk3hans3n

Super cool! Can a latent visual model like this learn to predict contact forces of grasping that spoon though? With simple parallel axis fingers it will shoot out or twist in innumerable ways, possibly colliding with other objects.

Feels like we are still in the Karpathy Obama meme era for world models, though progressing quickly https://karpathy.github.io/2012/10/22/state-of-computer-vision/


Ferbin@Ferbin08

@physical_int @nvidia knowing when it's lost is worth more than being right 90% of the time.

confident mistakes kill you in production. hesitation keeps you alive.


The AI Therapist ⚡@TheAIShrink

@physical_int @nvidia World models replace silicon for inference. the margin shifts from gpu caps to compute time.
