
NVIDIA And Physical Intelligence Launch SC3-Eval For Robot Policy Testing



Sentiment

Users praise NVIDIA's world model evaluations of robot policies as a breakthrough that solves the evaluation bottleneck and dramatically speeds iteration for robotics development.

Pos: 100.0%

Neg: 0.0%

8 comments with sentiment.


Posts from X


Physical Intelligence@physical_int

The model also knows when not to trust itself: its uncertainty estimates track how far generations deviate from reality.

[Image: the uncertainty accurately predicts when model generation degrades]
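One way to check a claim like this on your own rollouts is to correlate the model's per-step uncertainty with its measured generation error. The snippet below is a minimal sketch, not the authors' method: the `pearson` helper, the variable names, and all numbers are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)

# Hypothetical per-step values for one generated rollout (made-up numbers).
uncertainty = [0.05, 0.06, 0.10, 0.22, 0.41]   # model's own estimate
frame_error = [0.01, 0.02, 0.04, 0.09, 0.20]   # e.g. distance to real frames

# A value close to 1 means uncertainty rises as generations drift from reality.
r = pearson(uncertainty, frame_error)
```

A high correlation on held-out episodes would be one piece of evidence that the uncertainty signal is usable as a trust indicator; it says nothing about how the uncertainty is computed.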


Physical Intelligence@physical_int

Joint work with the NVIDIA Cosmos team. Paper: https://arxiv.org/abs/2606.18610 Website: https://weichengtseng.github.io/sc3-eval/


Physical Intelligence@physical_int

Favorite finding: physical failures transfer. A policy that fumbles a grasp in the real world fumbles it in the world model too.


Physical Intelligence@physical_int

We found that the key to making this work is self-consistency: the model is trained to be self-consistent between a forward and an inverse model (i.e., actions predicted with inverse dynamics lead to states consistent with the inverse-dynamics input), consistent across different cameras, and consistent between training and test by halting episodes that go out of distribution.

[Video: comparison with and without self-consistency]
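The forward/inverse part of that idea can be illustrated with a toy cycle check: infer the action from a transition with the inverse model, replay it through the forward model, and measure how far the result lands from the original next state. This is only a sketch under strong assumptions. The additive dynamics and every function name here are hypothetical stand-ins, not the training objective from the paper.

```python
def forward_model(state, action):
    """Toy forward dynamics (assumed): next state = state + action."""
    return [s + a for s, a in zip(state, action)]

def inverse_model(state, next_state):
    """Toy inverse dynamics (assumed): the action that explains the transition."""
    return [n - s for s, n in zip(state, next_state)]

def biased_inverse_model(state, next_state):
    """Deliberately inconsistent inverse model that underestimates actions."""
    return [0.5 * (n - s) for s, n in zip(state, next_state)]

def cycle_error(state, next_state, inverse, forward):
    """Largest gap between next_state and the state re-predicted
    from the action the inverse model inferred."""
    action = inverse(state, next_state)
    repredicted = forward(state, action)
    return max(abs(r - n) for r, n in zip(repredicted, next_state))

state, next_state = [0.0, 1.0], [0.5, 0.0]

# Zero gap: the forward and inverse models agree on this transition.
consistent_gap = cycle_error(state, next_state, inverse_model, forward_model)

# Non-zero gap: the inferred action does not reproduce the transition.
inconsistent_gap = cycle_error(state, next_state, biased_inverse_model, forward_model)
```

In a learned system the same gap would presumably be computed on generated frames and used as a loss or a diagnostic; the thread does not spell out which.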


Physical Intelligence@physical_int

Setup:
• world model trained on our robot data generates what the cameras would see
• VLA in the loop: sees generated frames, outputs actions, world model predicts what happens next
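The loop described in this setup can be sketched in a few lines. Everything below is a hypothetical stand-in: `toy_policy` and `toy_world_model` replace the VLA and the learned video model, a "frame" is a single number, and halting on an uncertainty threshold is an assumption made here for illustration rather than a detail confirmed by the thread.

```python
def rollout(world_model, policy, first_frame, max_steps, max_uncertainty):
    """Closed-loop evaluation: the policy only ever sees generated frames.

    Returns (frames, halted). halted is True when the world model's
    uncertainty exceeded max_uncertainty and the episode was cut short.
    """
    frames = [first_frame]
    for _ in range(max_steps):
        action = policy(frames[-1])                       # policy acts on the generated frame
        next_frame, uncertainty = world_model(frames[-1], action)
        if uncertainty > max_uncertainty:                 # assumed halting rule
            return frames, True
        frames.append(next_frame)
    return frames, False

def toy_policy(frame):
    """Stand-in policy: move halfway toward a target at 0."""
    return -0.5 * frame

def toy_world_model(frame, action):
    """Stand-in world model: additive dynamics, uncertainty grows with action size."""
    return frame + action, 0.1 * abs(action)

frames, halted = rollout(toy_world_model, toy_policy, 8.0, max_steps=4, max_uncertainty=1.0)
short_frames, short_halted = rollout(toy_world_model, toy_policy, 8.0, max_steps=4, max_uncertainty=0.3)
```

With the looser threshold the episode runs all four steps; with the tighter one the first predicted step is already too uncertain, so the rollout stops with only the initial frame.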


Physical Intelligence@physical_int

Why it matters: evals are critical to progress. As our models become more general and more capable, evaluating all the different things they can do takes longer and longer. The world model eval needs a couple hours of GPU time.


Physical Intelligence@physical_int

Across VLA checkpoints of varying performance on our table bussing benchmark, world model evals reproduce the real-world policy ranking.

[Figure: correlation plot of predicted vs. ground-truth success rate]
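"Reproducing the ranking" can be quantified as the fraction of checkpoint pairs that both evaluations order the same way, a simple relative of Kendall's tau. The sketch below assumes that framing. The `pairwise_agreement` helper and all success rates are invented for illustration and are not results from the paper.

```python
from itertools import combinations

def pairwise_agreement(real, predicted):
    """Fraction of checkpoint pairs ordered the same way by both evals.

    Ties in either list count as disagreement in this simple version.
    """
    if len(real) != len(predicted) or len(real) < 2:
        raise ValueError("need two lists of equal length >= 2")
    pairs = list(combinations(range(len(real)), 2))
    agree = sum(
        1 for i, j in pairs
        if (real[i] - real[j]) * (predicted[i] - predicted[j]) > 0
    )
    return agree / len(pairs)

# Made-up success rates for four hypothetical checkpoints.
real_success = [0.20, 0.45, 0.60, 0.80]        # measured on hardware
predicted_success = [0.25, 0.40, 0.70, 0.75]   # measured in the world model
swapped_prediction = [0.25, 0.70, 0.40, 0.75]  # two checkpoints ranked the wrong way round

full_agreement = pairwise_agreement(real_success, predicted_success)
partial_agreement = pairwise_agreement(real_success, swapped_prediction)
```

An agreement of 1.0 means the world model ranks every pair of checkpoints as the hardware does, even when the absolute success rates differ.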


Wei-Cheng Tseng@WeiChengTseng1

Joint work with the Physical Intelligence team. Paper: https://arxiv.org/abs/2606.18610 Website: https://weichengtseng.github.io/sc3-eval/


Ahmad Baracat@AhmadBaracat

@physical_int @nvidia Curious if @physical_int might be interested in this kind of real world data coming from cobots in factories and machine shops.


Wei-Cheng Tseng@WeiChengTseng1

The model also knows when not to trust itself: its uncertainty estimates track how far generations deviate from reality.


Tradeye@TradeyeLLC

@physical_int This is a big step forward. World model evals should dramatically speed up iteration. We’re building something complementary on the data side — capturing narrated first-person video from licensed HVAC, plumbing & electrical techs working inside real occupied homes. The kind of grounded, long-horizon human behavior that world models will eventually need to accurately simulate messy real environments. http://tradeyecapture.com


NVIDIA AI@NVIDIAAI

@physical_int @nvidia 🦾


Jinyu Hou@jinyuhou0

@physical_int @nvidia This is cool. Would be interesting to see how the wm-as-a-judge generalizes to unseen tasks.


Wei-Cheng Tseng@WeiChengTseng1

Favorite finding: physical failures transfer. A policy that fumbles a grasp in the real world fumbles it in the world model too.


Wei-Cheng Tseng@WeiChengTseng1

Why it matters: evals are critical to progress. As policies become more general and more capable, evaluating all the different things they can do takes longer and longer. The world model eval needs a couple hours of GPU time.


Wei-Cheng Tseng@WeiChengTseng1

Setup:
• world model trained on our robot data generates what the cameras would see
• policy in the loop: sees generated frames, outputs actions, model predicts what happens next


Wei-Cheng Tseng@WeiChengTseng1

Across policy checkpoints of varying performance on our table bussing benchmark, world model evals reproduce the real-world policy ranking.


TechniaHQ | humanoid robots@techniahq

@physical_int @nvidia This self-consistent world model eval is a breakthrough for robotics scaling. Imagined rollouts predicting real policy performance in hours of GPU time instead of lengthy hardware tests will speed iteration massively.


Jakie PLA@3DPrintAficio

@physical_int @nvidia Eval inside a world model instead of on hardware. Iteration speed through the roof. THIS.
