Tech · 3h ago

Study finds diffusion video models like WAN-1.3B internalize physical properties better than V-JEPA and VideoMAE baselines

Probes measured the models' understanding of gravity, permanence, and solidity

Lucas Beyer (bl16)@giffmana

At the beginning of the year, another paper ran this check only for VideoMAE and V-JEPA and showed that both have some understanding of physics. This new paper essentially extends the study to diffusion-based "pure videogen" models and shows they understand physics very well.

Lucas Beyer (bl16)@giffmana

You may have recently heard claims that video generation models are "dumb" about physics, and only "world models" (V-JEPA, specifically) have a valid internal model of physics.

This turns out to be false. In a recent paper, researchers show that a LINEAR probe of diffusion videogen models predicts various "physics" properties very well, significantly better than V-JEPA or VideoMAE (and a plain VAE just sucks).

This is noteworthy, because a *linear* probe being this accurate shows that the model has a pretty explicit internal representation of the physics!

7:38 AM · Jun 10, 2026 · 3.8K Views
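For readers unfamiliar with linear probing, here is a minimal sketch of the idea in Python. It is not the paper's setup: the features are synthetic stand-ins for frozen model activations, and the names `fit_linear_probe`, `make_features`, and `run_probe` are hypothetical. It only illustrates why high linear-probe accuracy suggests a property is encoded explicitly (along a direction in feature space), whereas a property that is present but entangled nonlinearly leaves a linear probe near chance.

```python
import numpy as np

def fit_linear_probe(feats, labels, l2=1e-2):
    """Closed-form ridge regression from frozen features to +/-1 targets."""
    X = np.hstack([feats, np.ones((len(feats), 1))])  # append a bias column
    y = 2.0 * labels - 1.0                            # map {0, 1} -> {-1, +1}
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def probe_accuracy(w, feats, labels):
    """Classify by the sign of the linear readout and compare to labels."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    preds = (X @ w > 0).astype(int)
    return float((preds == labels).mean())

def make_features(n, dim, explicit, rng):
    """Synthetic stand-in for frozen activations (an assumption, not real data).

    explicit=True:  the label is a linear function of the features.
    explicit=False: the label is an XOR of two coordinates' signs, so the
                    information is present but not linearly readable.
    """
    feats = rng.normal(size=(n, dim))
    if explicit:
        direction = rng.normal(size=dim)
        labels = (feats @ direction > 0).astype(int)
    else:
        labels = ((feats[:, 0] > 0) ^ (feats[:, 1] > 0)).astype(int)
    return feats, labels

def run_probe(explicit, seed=0, n=2000, dim=32):
    """Fit the probe on half the samples and report held-out accuracy."""
    rng = np.random.default_rng(seed)
    feats, labels = make_features(n, dim, explicit, rng)
    split = n // 2
    w = fit_linear_probe(feats[:split], labels[:split])
    return probe_accuracy(w, feats[split:], labels[split:])

if __name__ == "__main__":
    print("explicit encoding: ", run_probe(explicit=True))
    print("entangled encoding:", run_probe(explicit=False))
```

In a real study the `feats` array would be replaced by activations extracted from a frozen video model and `labels` by a physical property of each clip. The probe itself stays this simple, which is what makes its accuracy informative about the representation rather than about the probe.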

Posts from X

Most activity: 3.9K views · 32 bookmarks · 59 likes · 4 retweets · 2 replies

Lucas Beyer (bl16)@giffmana

The paper is "The invisible hand of physics", from a surprisingly diverse set of authors (Parsa Esmati, @Somjit77): https://arxiv.org/abs/2606.05328. It came out just a few days ago. I learned about it from a nice talk by @katjahofmann today.

The paper from earlier in the year is by @soniajoseph_ et al.: https://arxiv.org/abs/2602.07050

3h ago

Lucas Beyer (bl16)@giffmana

To be clear, this is not a V-JEPA or VideoMAE diss; I'm just resurrecting the fact that "pure videogen" models may indeed learn an explicit model of the world/physics as a byproduct.

Also cc @mapo1: we chatted about this, and you also intuitively pushed back against such a claim.

3h ago

Alexander Doria@Dorialexander

Not surprised. It's totally true that we haven't nailed the most efficient arch for many modalities/data representations; it just isn't JEPA.

1h ago

Adam Goldstein@goldstein_aa

@giffmana @mapo1 But isn't that indeed a JEPA diss? If what you say is true, what's the justification for JEPA?

3h ago

Lucas Beyer (bl16)@giffmana

@goldstein_aa @mapo1 For example, it may be much more efficient, since it works in just one forward pass.

3h ago

Wenyao (Wayne) Zhang@zhang_weny92997

@giffmana @Somjit77 @katjahofmann This result may be highly dependent on the data scale?

2h ago