
The model also knows when not to trust itself: its uncertainty estimates track how far generations deviate from reality.
The uncertainty accurately predicts when model generation degrades
Users praise NVIDIA's world model evaluations of robot policies as a breakthrough that solves the evaluation bottleneck and dramatically speeds iteration for robotics development.


Joint work with the NVIDIA Cosmos team. Paper: https://arxiv.org/abs/2606.18610 Website: https://weichengtseng.github.io/sc3-eval/

Favorite finding: physical failures transfer. A policy that fumbles a grasp in the real world fumbles it in the world model too.

We found that the key to making this work is self-consistency. The model is trained to be consistent between its forward and inverse models (i.e., actions predicted by inverse dynamics lead to states consistent with the inverse-dynamics input), consistent across different cameras, and consistent between training and test, the last by halting episodes that go out of distribution.
[Video: comparison with and without self-consistency]
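To make the forward/inverse idea concrete, here is a minimal sketch of a cycle check: infer the action between two states with an inverse model, replay it through a forward model, and measure how far the result lands from the original next state. Everything here is an assumption for illustration (scalar states, the toy `inverse_model` and `biased_forward_model`, the `halt_threshold` value); the thread describes self-consistency as a training objective and does not give its form, so this is not the paper's loss.

```python
from typing import Callable

State = float   # hypothetical stand-in for a camera frame or latent
Action = float  # hypothetical stand-in for a robot action


def inverse_model(state: State, next_state: State) -> Action:
    """Toy inverse dynamics: the action that explains state -> next_state."""
    return next_state - state


def biased_forward_model(state: State, action: Action) -> State:
    """Toy forward model that undershoots, so the cycle does not close exactly."""
    return state + 0.9 * action


def cycle_error(
    forward: Callable[[State, Action], State],
    inverse: Callable[[State, State], Action],
    state: State,
    next_state: State,
) -> float:
    """Distance between next_state and the forward model's replay of the inferred action."""
    action = inverse(state, next_state)
    replayed = forward(state, action)
    return abs(replayed - next_state)


def should_halt(error: float, halt_threshold: float = 0.05) -> bool:
    """Assumed halting rule: stop a rollout once the cycle error exceeds a threshold."""
    return error > halt_threshold


err = cycle_error(biased_forward_model, inverse_model, 0.0, 1.0)
halt = should_halt(err)
```

In this toy setup the forward model undershoots by 10%, so the cycle error is about 0.1 and the assumed halting rule fires.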

Setup:
• world model trained on our robot data generates what the cameras would see
• VLA in the loop: sees generated frames, outputs actions, world model predicts what happens next
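The loop in the setup above can be sketched roughly as follows. This is only an illustration under strong assumptions: frames and actions are plain floats, and `toy_policy`, `toy_world_model`, and the success check are hypothetical stand-ins, not the actual VLA or world model.

```python
from typing import Callable, List

Frame = float   # hypothetical stand-in for a generated camera image
Action = float  # hypothetical stand-in for a robot action


def rollout(
    policy: Callable[[Frame], Action],
    world_model: Callable[[Frame, Action], Frame],
    first_frame: Frame,
    horizon: int,
) -> List[Frame]:
    """Closed loop: the policy sees a generated frame, the world model predicts the next one."""
    frames = [first_frame]
    for _ in range(horizon):
        action = policy(frames[-1])                     # policy acts on the generated frame
        frames.append(world_model(frames[-1], action))  # world model imagines the outcome
    return frames


def toy_world_model(frame: Frame, action: Action) -> Frame:
    """Toy dynamics: the action shifts the frame."""
    return frame + action


def toy_policy(frame: Frame, goal: Frame = 1.0) -> Action:
    """Toy policy: move halfway toward the goal each step."""
    return 0.5 * (goal - frame)


frames = rollout(toy_policy, toy_world_model, first_frame=0.0, horizon=10)
success = abs(frames[-1] - 1.0) < 0.01  # assumed success criterion for the toy task
```

No hardware appears anywhere in the loop, which is where the speedup comes from: every observation the policy receives is generated.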

Why it matters: evals are critical to progress. As our models become more general and more capable, evaluating all the different things they can do takes longer and longer. The world model eval needs a couple hours of GPU time.

Across VLA checkpoints of varying performance on our table bussing benchmark, world model evals reproduce the real-world policy ranking.
[Figure: correlation plot of predicted vs. ground-truth success rate]
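One common way to check whether two evaluations "reproduce the ranking" is Spearman rank correlation. The sketch below is an assumption about how such a comparison could be scored, not the metric reported in the paper, and the success rates are made-up placeholders rather than results from the thread.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank of each value (0 = smallest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation for tie-free data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Placeholder success rates per checkpoint (illustrative only, not real data).
real_world = [0.20, 0.50, 0.80, 0.35]
world_model = [0.30, 0.55, 0.70, 0.40]

rho = spearman(real_world, world_model)
```

A value of 1.0 means the two evaluations order the checkpoints identically even when the absolute success rates differ, which is the property that matters when choosing which checkpoint to keep.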

Joint work with the Physical Intelligence team. Paper: https://arxiv.org/abs/2606.18610 Website: https://weichengtseng.github.io/sc3-eval/

@physical_int @nvidia Curious if @physical_int might be interested in this kind of real world data coming from cobots in factories and machine shops.


@physical_int This is a big step forward. World model evals should dramatically speed up iteration. We’re building something complementary on the data side — capturing narrated first-person video from licensed HVAC, plumbing & electrical techs working inside real occupied homes. The kind of grounded, long-horizon human behavior that world models will eventually need to accurately simulate messy real environments. http://tradeyecapture.com

@physical_int @nvidia 🦾

@physical_int @nvidia This is cool. Would be interesting to see how the wm-as-a-judge generalizes to unseen tasks.


Why it matters: evals are critical to progress. As policies become more general and more capable, evaluating all the different things they can do takes longer and longer. The world model eval needs a couple hours of GPU time.

Setup:
• world model trained on our robot data generates what the cameras would see
• policy in the loop: sees generated frames, outputs actions, model predicts what happens next

Across policy checkpoints of varying performance on our table bussing benchmark, world model evals reproduce the real-world policy ranking.


@physical_int @nvidia This self consistent world model eval is a breakthrough for robotics scaling. Imagined rollouts predicting real policy performance in hours of GPU time instead of lengthy hardware tests will speed iteration massively.

@physical_int @nvidia Eval inside a world model instead of on hardware. Iteration speed through the roof. THIS.