μ₀ World Model Predicts 3D Motion Traces For Efficient Robot Learning

Original post

Furong Huang@furongh

Hot take: robots should not dream in pixels.

Pixels are too low-level. Latents are too opaque.

μ₀ predicts a third thing: 3D motion traces.

On real robots, it beats π₀.₅ — with ~1/100 the data scale and no action labels for world-model pretraining. 🧵

https://mu0-wm.github.io/

(this video features voiceover narration)

9:57 AM · Jun 14, 2026 · 2.8K Views

Sentiment

Users praise the μ₀ World Model for arguing that robotics benefits from physical language predictors rather than bigger pixel or black-box latent models.

Pos: 100.0%

Neg: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X


Furong Huang@furongh

The internet is full of videos of humans doing things.

But robot action data is scarce, expensive, embodiment-specific, and hard to scale.

World models should help us learn from abundant video first, then ground to robots later.

Furong Huang@furongh

Think of traces as a possible “unified words space” for robotics.

LLMs scale because text gives them a shared symbolic space.

Robotics has no obvious equivalent: different bodies, sensors, actions, tools, tasks, and environments.

Furong Huang@furongh

μ₀ takes a different route.

Instead of predicting pixels or black-box latents, it predicts:

3D interaction traces.

The motion of semantic physical points: objects, tools, hands, and contact regions.
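To make "the motion of semantic physical points" concrete, here is a minimal sketch of how one such trace could be stored. The `Trace` class, its field names, and the toy coordinates are illustrative assumptions, not the μ₀ data format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class Trace:
    """Motion of one semantic physical point over time (hypothetical layout)."""
    label: str             # what the point is, e.g. "mug handle"
    points: List[Point3D]  # one (x, y, z) position per time step

    def displacement(self) -> Point3D:
        """Net motion from the first to the last time step."""
        (x0, y0, z0), (x1, y1, z1) = self.points[0], self.points[-1]
        return (x1 - x0, y1 - y0, z1 - z0)


# Toy scene (made-up numbers): a mug is lifted 10 cm while the table stays still.
scene = [
    Trace("mug handle", [(0.5, 0.0, 0.0), (0.5, 0.0, 0.05), (0.5, 0.0, 0.10)]),
    Trace("table corner", [(0.0, 0.0, 0.0)] * 3),
]

# Keep only the points that actually moved.
moving = [t.label for t in scene if any(abs(d) > 1e-9 for d in t.displacement())]
print(moving)
```

Nothing in this representation mentions texture, lighting, or a specific robot body, which is the property the thread is arguing for.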

Furong Huang@furongh

14/

The takeaway:

Robotics may not need bigger pixel predictors.

And it may not want black-box latent predictors either.

It may need a physical language.

μ₀ proposes one: 3D motion traces as the symbolic world-model space for robot learning.

https://mu0-wm.github.io/

Big thanks to the coolest team: @JayLEE_0301, @YoonkyoJung, @jusukle, Jonghun Shin, Amir-Hossein Shahidzadeh, @YaoChihLee, H. Jin Kim, @jbhuang0604 💐

Furong Huang@furongh

13/

Some real-world experiment demos here.

Furong Huang@furongh

But this only works if we choose the right prediction space.

Pixels are scalable, but too low-level.

Latents are compact, but often too brittle.

So what should a robot world model actually predict?

Furong Huang@furongh

Latents seem like the natural alternative.

But black-box latent spaces are hard to interpret, hard to intervene on, and hard to correct.

They can also collapse.

After 15+ years working with spectral / latent methods, I’ve learned this the hard way: latent spaces are brittle.

Furong Huang@furongh

Pixels force the model to spend capacity on texture, lighting, background, and camera motion.

But robotics cares about something more physical:

geometry, contact, motion, and how objects change under interaction.

Furong Huang@furongh

12/

And it works.

In simulation and on a real robot arm, μ₀ achieves performance comparable to strong action-trained VLA policies,

while using roughly 1/100 of the data scale,

and no action-labeled data for world-model pretraining.

Furong Huang@furongh

So what are the “words” of physical interaction?

Our answer:

motion traces.

They are grounded in video, structured in 3D, interpretable, and more transferable across embodiments than robot-specific actions.

Furong Huang@furongh

μ₀ then learns a trace-space world model.

It uses a frozen vision-language backbone for semantics, and a trace expert for physical motion.

Instead of dense future pixels, it predicts smooth 3D motion traces.
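A rough back-of-the-envelope comparison shows why predicting traces is a much smaller target than predicting dense future pixels. Every number below (frame size, horizon, keypoint count) is an assumption chosen for illustration; none comes from the thread.

```python
# All sizes are assumed for illustration only.
height, width, channels = 224, 224, 3   # one RGB frame
future_steps = 16                       # prediction horizon
keypoints = 32                          # tracked semantic points

# Dense pixel prediction: every channel of every pixel in every future frame.
pixel_outputs = height * width * channels * future_steps

# Trace prediction: one (x, y, z) per keypoint per future step.
trace_outputs = keypoints * future_steps * 3

ratio = pixel_outputs / trace_outputs
print(pixel_outputs, trace_outputs, ratio)
```

Under these assumptions the pixel target is over a thousand times larger, and most of it describes appearance rather than motion.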

Furong Huang@furongh

To learn traces at scale, we built TraceExtract.

It turns ordinary human and robot videos into trace supervision:

• what moves: semantic keypoints
• where it moves: shared 3D reconstruction
• how it moves: event-level motion traces
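As a toy illustration of the "how it moves" step, the sketch below splits one keypoint's 3D track into motion events wherever the point comes to rest. The thresholding rule, the `split_into_events` name, and the sample track are assumptions made here; the thread does not describe how TraceExtract actually segments events.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def split_into_events(track: List[Point3D], min_step: float = 0.01) -> List[List[Point3D]]:
    """Group consecutive moving steps of one keypoint into motion events.

    A step counts as moving when the point travels at least `min_step`
    (an assumed threshold, in the units of the 3D reconstruction).
    """
    events: List[List[Point3D]] = []
    current: List[Point3D] = []
    for prev, curr in zip(track, track[1:]):
        if math.dist(prev, curr) >= min_step:
            if not current:
                current = [prev]      # event starts at the last resting position
            current.append(curr)
        elif current:
            events.append(current)    # motion stopped: close the event
            current = []
    if current:
        events.append(current)        # track ended while still moving
    return events


# Made-up track: rest, lift 10 cm, pause, then slide 5 cm sideways.
track = [(0, 0, 0), (0, 0, 0), (0, 0, 0.05), (0, 0, 0.10), (0, 0, 0.10), (0.05, 0, 0.10)]
events = split_into_events(track)
print([len(e) for e in events])
```

Each resulting event is a short trace that can serve as a supervision target on its own.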

Furong Huang@furongh

11/

The key test:

Can these traces actually help a robot act?

We freeze μ₀ and train only a lightweight action expert on top.

The world model itself never sees action labels during pretraining.
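The freeze-then-ground recipe can be sketched in a few lines of plain Python. Everything here is a stand-in: `frozen_world_model` is a fixed toy function rather than μ₀, the action expert is a two-weight linear map, and the action-labeled data is synthetic.

```python
from typing import List


def frozen_world_model(observation: float) -> List[float]:
    """Stand-in for a pretrained trace predictor; it has no trainable state."""
    return [observation, observation ** 2]  # hypothetical "trace features"


# Trainable action expert: a linear map from trace features to a 1-D action.
weights = [0.0, 0.0]


def action_expert(features: List[float]) -> float:
    return sum(w * f for w, f in zip(weights, features))


# Synthetic action-labeled data: the target action is 2*x + 3*x^2.
data = [(k / 10, 2 * (k / 10) + 3 * (k / 10) ** 2) for k in range(-10, 11)]


def mean_squared_error() -> float:
    return sum((action_expert(frozen_world_model(x)) - a) ** 2 for x, a in data) / len(data)


loss_before = mean_squared_error()
learning_rate = 0.1
for _ in range(2000):
    for x, target in data:
        features = frozen_world_model(x)
        error = action_expert(features) - target
        # The gradient step touches only the action expert's weights.
        for i, f in enumerate(features):
            weights[i] -= learning_rate * error * f
loss_after = mean_squared_error()
print(loss_before, loss_after, weights)
```

The point of the design is that action labels are needed only for the small head on top; the predictor underneath can be trained on video alone.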

Furong Huang@furongh

10/ On 3D trace forecasting, μ₀ is the most accurate across metrics and horizons.

It outperforms prior trace methods, surpasses large API models overall, and runs in 0.29 seconds per prediction.
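The thread does not name the forecasting metrics, so the following is only one plausible way to score a trace forecast "across horizons": the mean Euclidean error at each future step, averaged over keypoints. The function name and the toy numbers are assumptions.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def error_by_horizon(predicted: List[List[Point3D]], actual: List[List[Point3D]]) -> List[float]:
    """Mean Euclidean error per future step, averaged over keypoints.

    predicted[k][t] and actual[k][t] hold the position of keypoint k at step t.
    """
    steps = len(actual[0])
    return [
        sum(math.dist(p[t], a[t]) for p, a in zip(predicted, actual)) / len(actual)
        for t in range(steps)
    ]


# Two keypoints, two future steps (made-up values).
actual = [[(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)], [(1.0, 0.0, 0.0), (1.0, 0.0, 1.0)]]
predicted = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.5)], [(1.0, 0.0, 0.0), (1.0, 0.5, 1.0)]]

errors = error_by_horizon(predicted, actual)
print(errors)
```

Reporting the error per step, rather than one average, shows whether a model degrades as it predicts further into the future.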

Furong Huang@furongh

#Robotics #EmbodiedAI #WorldModels #AI

Furong Huang@furongh

This is a major step forward from our previous work, TraceGen. We achieved significant improvements and put in a tremendous amount of engineering effort. Hats off to the team—especially @JayLEE_0301, @YoonkyoJung, and @jusukle ❤️
