New work with @nvidia: evaluating robot policies entirely inside a world model. The policy acts, the model imagines the consequences, and the imagined evals predict real-world results. 🧵
real vs world-model rollout side by side📷
The data-driven approach aims to replace manual physics simulators
Users praise NVIDIA evaluating robot policies inside world models because it promises to solve evaluation bottlenecks, speed iteration dramatically, and enable safer cheaper robotics development.
The future is the data-driven simulator: take in real-world data and use it to build a model. Simulators are essential, but building them by hand sucks. And thus world models.

Joint work with the NVIDIA Cosmos team. Paper: https://arxiv.org/abs/2606.18610 Website: https://weichengtseng.github.io/sc3-eval/

We found that the key to making this work is self-consistency. The model is trained to be consistent in three ways: between a forward and an inverse model (i.e., actions predicted by inverse dynamics lead to states consistent with the inverse-dynamics input), across different cameras, and between training and test, by halting episodes that go out of distribution.
With and without self consistency comparison video
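
A minimal sketch of the forward/inverse consistency idea, not the paper's implementation. It assumes states and actions are plain float vectors and uses hypothetical toy models (`toy_forward`, `toy_inverse`, `biased_inverse`) in place of the real networks. The check: infer an action from a state pair with the inverse model, push it back through the forward model, and measure how far the result lands from the state the inverse model was given.

```python
import math
from typing import Callable, List

State = List[float]
Action = List[float]


def cycle_consistency_error(
    forward: Callable[[State, Action], State],
    inverse: Callable[[State, State], Action],
    state: State,
    next_state: State,
) -> float:
    """Euclidean distance between next_state and forward(state, inverse(state, next_state))."""
    action = inverse(state, next_state)      # action inferred by inverse dynamics
    reconstructed = forward(state, action)   # where the forward model says it leads
    return math.sqrt(sum((r - n) ** 2 for r, n in zip(reconstructed, next_state)))


# Hypothetical toy dynamics: the next state is the state plus the action.
def toy_forward(state: State, action: Action) -> State:
    return [s + a for s, a in zip(state, action)]


def toy_inverse(state: State, next_state: State) -> Action:
    return [n - s for s, n in zip(state, next_state)]


# A deliberately wrong inverse model that underestimates every action by half.
def biased_inverse(state: State, next_state: State) -> Action:
    return [0.5 * (n - s) for s, n in zip(state, next_state)]


consistent_err = cycle_consistency_error(toy_forward, toy_inverse, [0.0, 1.0], [1.0, 3.0])
inconsistent_err = cycle_consistency_error(toy_forward, biased_inverse, [0.0, 1.0], [1.0, 3.0])
print(consistent_err, inconsistent_err)
```

In a training setup, a term like this would presumably be minimized as a loss; here it is only computed, to show what "consistent with the inverse-dynamics input" means.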

Setup:
• world model trained on our robot data generates what the cameras would see
• VLA in the loop: sees generated frames, outputs actions, world model predicts what happens next
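
The loop described in the setup can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: frames are stand-in float vectors rather than images, and `toy_policy` and `toy_world_model` are hypothetical placeholders for the VLA and the learned world model.

```python
from typing import Callable, List, Tuple

Frame = List[float]   # stand-in for a generated camera image
Action = List[float]


def rollout(
    policy: Callable[[Frame], Action],
    world_model: Callable[[Frame, Action], Frame],
    first_frame: Frame,
    horizon: int,
) -> Tuple[List[Frame], List[Action]]:
    """Closed loop: the policy only ever sees frames the world model generated."""
    frames = [first_frame]
    actions: List[Action] = []
    for _ in range(horizon):
        action = policy(frames[-1])                     # policy acts on the imagined frame
        actions.append(action)
        frames.append(world_model(frames[-1], action))  # model imagines the consequence
    return frames, actions


# Hypothetical toy policy: move every coordinate halfway toward zero.
def toy_policy(frame: Frame) -> Action:
    return [-0.5 * x for x in frame]


# Hypothetical toy world model: the action is simply added to the frame.
def toy_world_model(frame: Frame, action: Action) -> Frame:
    return [x + a for x, a in zip(frame, action)]


frames, actions = rollout(toy_policy, toy_world_model, [4.0, -2.0], horizon=3)
print(frames[-1])
```

The point of the structure is that no real robot or camera appears anywhere in the loop; only the first frame would come from real data.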

Why it matters: evals are critical to progress. As our models become more general and more capable, evaluating all the different things they can do takes longer and longer. The world model eval needs a couple hours of GPU time.

Favorite finding: physical failures transfer. A policy that fumbles a grasp in the real world fumbles it in the world model too.

The model also knows when not to trust itself: its uncertainty estimates track how far generations deviate from reality.
The uncertainty accurately predicts when model generation degrades
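
One way such an uncertainty signal could be used is to cut an imagined episode at the first step the model stops trusting. The thread does not say how this is implemented, so everything here is an assumption: a scalar uncertainty per step, a hypothetical `threshold`, and made-up numbers in the example.

```python
from typing import List, Optional, Sequence


def first_untrusted_step(uncertainties: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first step whose uncertainty exceeds the threshold, or None."""
    for step, value in enumerate(uncertainties):
        if value > threshold:
            return step
    return None


def trusted_prefix(frames: List[str], uncertainties: Sequence[float], threshold: float) -> List[str]:
    """Keep only the frames generated before uncertainty first crossed the threshold."""
    cut = first_untrusted_step(uncertainties, threshold)
    return list(frames) if cut is None else list(frames[:cut])


# Made-up per-step uncertainties for a four-frame imagined episode.
example_frames = ["f0", "f1", "f2", "f3"]
example_uncertainty = [0.1, 0.2, 0.9, 0.3]

kept = trusted_prefix(example_frames, example_uncertainty, threshold=0.5)
print(kept)
```

Note that the cut is permanent: step 3 drops back under the threshold, but it is still discarded because it was generated from a frame the model had already flagged.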

Across VLA checkpoints of varying performance on our table bussing benchmark, world model evals reproduce the real-world policy ranking.
correlation plot: predicted vs ground truth success rate
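
"Reproducing the ranking" can be quantified with a rank correlation between imagined and real success rates. The sketch below uses Spearman's rho in plain Python; whether the paper reports this exact statistic is an assumption, and the success rates are made-up placeholders, not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder success rates for four hypothetical checkpoints (NOT paper data).
imagined_success = [0.20, 0.45, 0.60, 0.85]
real_success = [0.10, 0.50, 0.55, 0.90]

rho = spearman(imagined_success, real_success)
print(rho)
```

A rho near 1 means the world model orders checkpoints the same way real-world evals do, even if the absolute success rates differ.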

@physical_int @nvidia Curious if @physical_int might be interested in this kind of real world data coming from cobots in factories and machine shops.

@physical_int @nvidia This self consistent world model eval is a breakthrough for robotics scaling. Imagined rollouts predicting real policy performance in hours of GPU time instead of lengthy hardware tests will speed iteration massively.

@physical_int @nvidia Eval inside a world model instead of on hardware. Iteration speed through the roof. THIS.

@physical_int @nvidia Cool work 🙌 Going to read the paper!
I tried this kind of eval on the DROID dataset with Pi0 driving. The only issue was getting reliable future frames: I got only about a 1.5-2.5 s window with Cosmos 3 Nano before visual degradation set in.

@physical_int This is a big step forward. World model evals should dramatically speed up iteration. We’re building something complementary on the data side — capturing narrated first-person video from licensed HVAC, plumbing & electrical techs working inside real occupied homes. The kind of grounded, long-horizon human behavior that world models will eventually need to accurately simulate messy real environments. http://tradeyecapture.com

@physical_int @nvidia This is cool. Would be interesting to see how the wm-as-a-judge generalizes to unseen tasks.

@physical_int @nvidia Ran into this exact wall. Model destroys the test video, then fails the second you change camera angle or lighting.
Self-consistency across views forces it to learn the actual geometry instead of just memorizing what worked in the test set.

@physical_int @nvidia If this catches real failure modes, you've solved the evaluation bottleneck for robotics. That's the whole scaling game.

@physical_int @nvidia Fast evals only work if your simulation is honest about what it doesn't know.
Most just confidently break in the real world.

Super cool! Can a latent visual model like this learn to predict contact forces of grasping that spoon though? With simple parallel axis fingers it will shoot out or twist in innumerable ways, possibly colliding with other objects.
Feels like we are still in the Karpathy Obama meme era for world models, though progressing quickly https://karpathy.github.io/2012/10/22/state-of-computer-vision/

@physical_int @nvidia knowing when it's lost is worth more than being right 90% of the time.
confident mistakes kill you in production. hesitation keeps you alive.

@physical_int @nvidia World models replace silicon for inference. the margin shifts from gpu caps to compute time.