/Tech3h ago

Paper finds video diffusion models encode physical laws more accurately than specialized world models like V-JEPA

Academic Ravid Shwartz Ziv questioned the linear probing methodology.

295144927625.1K

#87

Original post

Lucas Beyer (bl16)@giffmana#87inTech

You may have recently heard claims that video generation models are "dumb" about physics, and only "world models" (V-JEPA, specifically) have a valid internal model of physics.

This turns out to be false. In a recent paper, researchers show that a LINEAR probe of diffusion videogen models predicts various "physics" properties very well, significantly better than V-JEPA or VideoMAE (and plain VAE just sucks).

This is noteworthy, because a *linear* probe being this accurate shows that the model has a pretty explicit internal representation of the physics!

7:38 AM · Jun 10, 2026 · 22.3K Views
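
For readers unfamiliar with the term, a linear probe is a single linear classifier trained on frozen model activations. The sketch below is only an illustration of that idea, not the paper's protocol: `fake_frozen_features` is a hypothetical stand-in that generates synthetic vectors in place of real video-model activations, and the dimensions, shift size, and training settings are all assumptions.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_linear_probe(feats, labels, epochs=200, lr=0.5):
    """Logistic regression trained by full-batch gradient descent.
    Only w and b are learned; the features themselves stay fixed."""
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def accuracy(w, b, feats, labels):
    hits = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in zip(feats, labels)
    )
    return hits / len(labels)

def fake_frozen_features(n, direction, rng):
    """Hypothetical stand-in for pooled activations of a frozen video model:
    Gaussian noise plus a shift along `direction` whose sign encodes the label
    (1 = physically plausible clip, 0 = implausible)."""
    feats, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 2.0 if y == 1 else -2.0
        feats.append([rng.gauss(0, 1) + shift * d for d in direction])
        labels.append(y)
    return feats, labels

rng = random.Random(0)
raw = [rng.gauss(0, 1) for _ in range(8)]
norm = math.sqrt(sum(v * v for v in raw))
direction = [v / norm for v in raw]  # unit vector carrying the label signal

train_x, train_y = fake_frozen_features(300, direction, rng)
test_x, test_y = fake_frozen_features(300, direction, rng)
w, b = fit_linear_probe(train_x, train_y)
probe_acc = accuracy(w, b, test_x, test_y)
print(f"held-out linear probe accuracy: {probe_acc:.2f}")
```

Because the probe has no hidden layers, high held-out accuracy can only come from structure that is already linearly laid out in the features, which is why a strong linear-probe result is read as evidence of an explicit representation.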

Sentiment

Some users praise diffusion video models for encoding physics efficiently in one forward pass and for the value of linear probes, whereas others consider the framing incorrect and doubt that the results prove superiority over V-JEPA.

Pos

57.1%

Neg

42.9%

11 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 3.7K · Bookmarks 31 · Likes 55 · Retweets 3

Lucas Beyer (bl16)@giffmana

The paper is "The invisible hand of physics" from a surprisingly diverse set of authors (Parsa Esmati, @Somjit77): https://arxiv.org/abs/2606.05328. It's from just a few days ago. I learned about it from a nice talk by @katjahofmann today.

The paper from earlier in the year is by @soniajoseph_ etal: https://arxiv.org/abs/2602.07050

3h3.7K5531

REPLIES2

Ravid Shwartz Ziv@ziv_ravid

@giffmana Without getting into the who has a good world model debate, I'm not sure linear probe is a good way to measure the quality of the representation (but it's the best we have, though)

3h86492

kache@yacineMTB

@giffmana i love linear probes. have you seen RAPTOR?

2h1.8K3115

Lucas Beyer (bl16)@giffmana

To be clear, this is not a V-JEPA or VideoMAE diss, just resurrecting the fact that "pure videogen" models may indeed learn an explicit model of the world/physics as a byproduct.

Also cc @mapo1, we chatted about this and you also intuitively pushed back against such a claim.

3h2.3K282

Lucas Beyer (bl16)@giffmana

At the beginning of the year, there was another paper that ran this check only for VideoMAE and V-JEPA, and showed both have some understanding of physics. This new paper essentially extends the study to diffusion-based "pure videogen" models and shows they understand physics very well.

3h1.2K93

kache@yacineMTB

@giffmana basically; RL trained enough with domain randomization allows you to predict state from the hidden state

accidental state estimator

2h884161

Hila Chefer@hila_chefer

@giffmana Haven’t read the paper but this feels a bit suspicious 🤨 the video models examined here are tiny and mostly not very good at producing physically plausible videos, how is plausibility defined in this context?

2h84112

Hila Chefer@hila_chefer

Yeah, I’m not familiar with the benchmark and should probably read more about it, but my vibe from the figure above is that most properties that are strictly better in diffusion are appearance preservation over frames (permanence, shape, etc.) rather than “hard core temporal” properties. Definitely not saying V-JEPA solves these, but I would say that the results on temporal properties are less convincing.

2h21151

Lucas Beyer (bl16)@giffmana

@hila_chefer Same bench (but slightly different protocol) as Sonia's paper from earlier in the year.

2h68331

Khai Loong Aw@khai_loong_aw

Linear probes trained on plausible vs. implausible videos may latch onto spurious features unrelated to physics, e.g., counting the number of unique objects. Results also depend heavily on which frames are fed to the model and how. A lot of care is needed to ensure the probe is actually about physics.

Alternative test: have models make predictions, then check those for physics violations. Qualitatively first, then with a metric.

That said, I agree with you that today's models can learn intuitive physics. Along these lines, we show a zero-shot world model makes predictions that satisfy short-timescale intuitive physics: https://arxiv.org/abs/2604.10333. Large pretrained V-JEPA2 also scores decently here, though its predictions are harder to interpret.

1h31141

Oscar Mañas@oscmansan

@giffmana This is evidence for what we were discussing today @Germs96: that the information contained in V-JEPA representations is a subset of that of video generators

38m2121

Lucas Beyer (bl16)@giffmana

If linear probe is positive, it's a very strong signal the information is there *and explicit* (in whatever is the input to the probe)

If it's negative, it doesn't mean the info isn't there, but at least that the info isn't there in an easily extractable way.

The next step is an attention probe, which we used a lot in the CapPa paper and which Sonia also used in her paper on the topic earlier this year.

2h5741
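
The asymmetry described in the post above (a positive probe is strong evidence, a negative one is weak evidence) can be illustrated with a toy case. This is a sketch under stated assumptions, not anything from the paper: the two-dimensional features, the sign-agreement label, and the hand-added product feature are all made up for illustration. The label is fully determined by the features, yet a linear probe stays close to chance until a nonlinear feature is exposed to it.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def probe_accuracy(train, test, epochs=300, lr=1.0):
    """Fit a logistic-regression probe on `train` and return accuracy on `test`.
    Each dataset is a list of (feature_vector, label) pairs."""
    dim, n = len(train[0][0]), len(train)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in train:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    hits = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in test
    )
    return hits / len(test)

rng = random.Random(0)

def sample(n):
    # Toy features: two Gaussian coordinates. The label is 1 when their signs
    # agree (an XOR-like rule), so the information is present in the features
    # but no linear readout does much better than chance.
    data = []
    for _ in range(n):
        a, c = rng.gauss(0, 1), rng.gauss(0, 1)
        data.append(([a, c], 1 if a * c > 0 else 0))
    return data

def with_product(data):
    # Append the product a*c, i.e. hand the probe one nonlinear feature.
    return [(x + [x[0] * x[1]], y) for x, y in data]

train, test = sample(400), sample(400)
linear_acc = probe_accuracy(train, test)
nonlinear_acc = probe_accuracy(with_product(train), with_product(test))
print(f"linear probe on raw features:      {linear_acc:.2f}")
print(f"probe with one nonlinear feature:  {nonlinear_acc:.2f}")
```

A more expressive readout, such as the attention probe mentioned in the post, plays a similar role to the added product feature here: it can recover information that a purely linear probe misses, at the cost of making a positive result less conclusive about how explicit the representation is.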

Hila Chefer@hila_chefer

> So a model might be shit at generating but still have a grasp of what it *should* be generating.

The infamous knowing-doing gap 🤓 I actually think if that were the case, methods like REPA wouldn't help, because the "know" that the model gains would not lead to "do".

I'm mostly basing this on the metrics: the ones that seem much better seem to be appearance-based, whereas the more challenging temporal ones (e.g., gravity, solidity here) are not as impressive. But will take a look!

1h7931

Artem Zholus@artemZholus

@giffmana This paper compares 1B-2B diffusion models with 300M semantic models! No wonder they are better, since they are just bigger. There are larger versions of V-JEPA, and the results might be different if those were included!

50m916

Lucas Beyer (bl16)@giffmana

@artemZholus Yeah that's a fair point! Cc @Somjit77

Note my point here is the positive signal on "pure video models" so that's unaffected by this point.

46m714

Lucas Beyer (bl16)@giffmana

(I'm not super deeply familiar either) This page shows examples of positives/negatives for IntPhys: https://intphys.cognitive-ml.fr/benchmark/test_blocks.html#block-o1-object-permanence

And this page for InfLevel: https://github.com/allenai/inflevel

Just from looking at them, I do feel that it's more than just good appearance preservation.

What I suspect causes at least your initial skepticism is that these are not generation benchmarks, but discrimination benchmarks. So a model might be shit at generating but still have a grasp of what it *should* be generating.

1h1753

Simon@SimonGoodman_

@ziv_ravid @giffmana Yeah, but it's a good way to see if the information you want is easily extractable, which *kind of* suggests that the model represents it well

2h861

Adam Goldstein@goldstein_aa

@giffmana @mapo1 But isn't that indeed a JEPA diss? If what you say is true, what's the justification for JEPA?

3h106

Rohit Bandaru@rohit_bandaru

@giffmana Video generation models need to understand physics for accurate generation. I think representation learning methods could learn physics more efficiently, but that’s hard to evaluate.

2h3282

Lucas Beyer (bl16)@giffmana

@yacineMTB Nice, i had not seen it

2h12520