
Linear-probing study finds diffusion video models encode physics more accurately than specialized world models

Story Overview

Linear probes trained on the internal activations of video diffusion models during reverse sampling can decode physical concepts like object permanence and gravity more accurately than probes on dedicated world models, with WAN-1.3B reaching roughly 81 percent average accuracy versus about 71 percent for V-JEPA on the IntPhys and InfLevel benchmarks.

Original post
Lucas Beyer (bl16)@giffmana · #62 in Tech

You may have recently heard claims that video generation models are "dumb" about physics, and only "world models" (V-JEPA, specifically) have a valid internal model of physics.

This turns out to be false. In a recent paper, researchers show that a LINEAR probe of diffusion videogen models predicts various "physics" very well, significantly better than V-JEPA or VideoMAE (and plain VAE just sucks).

This is noteworthy, because a *linear* probe being this accurate shows that the model has a pretty explicit internal representation of the physics!

7:38 AM · Jun 10, 2026 · 29.8K Views
Research Insight

Probes pull physics from the denoising path

The signal appears inside the transformer blocks along the reverse trajectory and is absent from the VAE latents themselves, showing that generative flow-matching training alone can produce representations of continuity and solidity without any explicit physics objective.
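To make the probing idea concrete, here is a minimal sketch of a linear probe: logistic regression fitted on frozen feature vectors and scored on held-out clips. Everything in it is an assumption for illustration. The synthetic features from the hypothetical `make_synthetic_activations` stand in for pooled transformer-block activations, and the layer, timestep, pooling, and training recipe used in the paper are not reproduced here.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit logistic regression on frozen features with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of clips where the sign of the linear score matches the label."""
    hits = 0
    for x, y in zip(features, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((score > 0) == (y == 1))
    return hits / len(labels)


def make_synthetic_activations(n, dim, seed):
    """Hypothetical stand-in for pooled activations: the label shifts every
    feature along one fixed direction, plus unit Gaussian noise."""
    rng = random.Random(seed)
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    feats, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)  # 1 = plausible clip, 0 = implausible clip
        sign = 1.0 if y == 1 else -1.0
        feats.append([rng.gauss(0.0, 1.0) + sign * d for d in direction])
        labels.append(y)
    return feats, labels


# Train on one synthetic split, evaluate on another; the backbone is never updated.
train_x, train_y = make_synthetic_activations(300, 8, seed=0)
test_x, test_y = make_synthetic_activations(300, 8, seed=1)
w, b = train_linear_probe(train_x, train_y)
held_out_accuracy = probe_accuracy(w, b, test_x, test_y)
```

Because the probe has only a weight vector and a bias, high held-out accuracy can only come from structure that is already linearly present in the features, which is why the result is read as evidence about the representation rather than about the probe.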

Open Question

Specialized models still lag on the same tests

When the identical linear-probe protocol is applied to V-JEPA 2 ViT-L and VideoMAE-Large, accuracy is lower, leaving open whether the gap would persist under different probing methods or translate into better downstream video generation.

Sentiment

Positive users appreciate the evidence that diffusion video models encode physics better than V-JEPA, while negative users call the framing, measurements, and size comparisons flawed or misleading.

Positive: 50.0% · Negative: 50.0% (13 comments with sentiment)
Posts from X

The paper is "The invisible hand of physics" from a surprisingly diverse set of authors (Parsa Esmati, @Somjit77): https://arxiv.org/abs/2606.05328 ; It's from just a few days ago. I learned about it from a nice talk by @katjahofmann today.

The paper from earlier in the year is by @soniajoseph_ etal: https://arxiv.org/abs/2602.07050

4h · Views 3.7K · Likes 55 · Bookmarks 31

@giffmana Without getting into the who has a good world model debate, I'm not sure linear probe is a good way to measure the quality of the representation (but it's the best we have, though)

4h · Views 1K · Likes 11 · Bookmarks 2
kache@yacineMTB

@giffmana i love linear probes. have you seen RAPTOR?

4h · Views 2.1K · Likes 35 · Bookmarks 17
Alexander Doria@Dorialexander

Not surprised. It's totally true that we haven't nailed the most efficient arch for many modalities/data representation, just it isn't JEPA.

3h · Views 2.6K · Likes 27 · Bookmarks 6

To be clear, this is not a V-JEPA or VideoMAE diss, just resurrecting the fact that "pure videogen" models may indeed learn an explicit model of the world/physics as a byproduct.

Also cc @mapo1, we chatted about this and you also intuitively pushed back against such a claim.

4h · Views 2.3K · Likes 28 · Bookmarks 2

At the beginning of the year, there was another paper that ran this check only for VideoMAE and V-JEPA, and showed that both have some understanding of physics. This new paper essentially extends the study to diffusion-based "pure videogen" models and shows they understand physics very well.

4h · Views 1.2K · Likes 9 · Bookmarks 3
Khai Loong Aw@khai_loong_aw

Linear probes trained on plausible vs. implausible videos may latch onto spurious features unrelated to physics, e.g., counting the number of unique objects. Results also depend heavily on which frames are fed to the model and how. A lot of care is needed to ensure the probe is actually about physics.

Alternative test: have models make predictions, then check those for physics violations. Qualitatively first, then with a metric.

That said, I agree with you that today's models can learn intuitive physics. Along these lines, we show a zero-shot world model makes predictions that satisfy short-timescale intuitive physics: https://arxiv.org/abs/2604.10333. Large pretrained V-JEPA2 also scores decently here, though its predictions are harder to interpret.
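One simple way to act on the spurious-feature concern is to ask how well a suspected confound predicts the label with no model in the loop. The sketch below is an illustrative assumption, not a procedure from the paper or from the reply: the hypothetical `confound_only_accuracy` finds the best single-threshold rule on one scalar confound (object count is used as the example) and reports its in-sample accuracy, which is an optimistic upper bound for that rule.

```python
def confound_only_accuracy(confound, labels):
    """Best in-sample accuracy of a one-threshold rule on a single scalar confound."""
    n = len(labels)
    positives = sum(labels)
    best = max(positives, n - positives) / n  # always-predict-one-class baseline
    for t in sorted(set(confound)):
        # Rule: predict 1 when confound >= t. Its negation scores (n - hits).
        hits = sum(int((c >= t) == (y == 1)) for c, y in zip(confound, labels))
        best = max(best, hits / n, (n - hits) / n)
    return best


# Toy labels: 1 = plausible clip, 0 = implausible clip.
labels = [0, 0, 1, 1]

# Object count separates the classes perfectly: a probe could be "cheating".
leaky_counts = [2, 2, 3, 3]
leaky_score = confound_only_accuracy(leaky_counts, labels)

# Object count carries no information about the label.
clean_counts = [2, 3, 2, 3]
clean_score = confound_only_accuracy(clean_counts, labels)
```

If the confound alone scores close to the probe, the probe's accuracy says little about physics; if it stays near chance, that particular shortcut is at least ruled out.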

3h · Views 569 · Likes 7 · Bookmarks 3
kache@yacineMTB

@giffmana basically; RL trained enough with domain randomization allows you to predict state from the hidden state

accidental state estimator

4h · Views 959 · Likes 17 · Bookmarks 1
Hila Chefer@hila_chefer

@giffmana Haven’t read the paper but this feels a bit suspicious 🤨 the video models examined here are tiny and mostly not very good at producing physically plausible videos, how is plausibility defined in this context?

4h · Views 841 · Likes 12
Hila Chefer@hila_chefer

Yeah I’m not familiar with the benchmark and should probably read more about it, but my vibe from the figure above is that most properties that are strictly better in diffusion are appearance preservation over frames (permanence, shape etc) rather than “hard core temporal” properties. Definitely not saying VJEPA solves these, but I would say that the results on temporal properties are less convincing

3h · Views 211 · Likes 5 · Bookmarks 1

@hila_chefer Same bench (but slightly different protocol) as Sonia's paper from earlier in the year.

3h · Views 683 · Likes 3 · Bookmarks 1
Oscar Mañas@oscmansan

@giffmana This is evidence for what we were discussing today @Germs96: that the information contained in V-JEPA representations is a subset of that of video generators

2h · Views 21 · Likes 2 · Bookmarks 1

If linear probe is positive, it's a very strong signal the information is there *and explicit* (in whatever is the input to the probe)

If it's negative, it doesn't mean the info isn't there, but at least that the info isn't there in an easily extractable way.
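The asymmetry between a positive and a negative probe result can be illustrated with a toy case. In this sketch, which uses assumed synthetic data rather than anything from the paper, the label depends on whether two feature signs agree. The information is fully present, yet a linear probe on the raw features cannot read it out, while the same probe succeeds once a single product feature is added.

```python
import math
import random


def fit_logistic(xs, ys, epochs=200, lr=0.5):
    """Linear probe: logistic regression trained by batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            err = 1.0 / (1.0 + math.exp(-z)) - y
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def accuracy(w, b, xs, ys):
    hits = sum(int((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1))
               for x, y in zip(xs, ys))
    return hits / len(ys)


def make_sign_agreement_features(n, seed):
    """Toy features in four balanced clusters; label is 1 when the two signs agree."""
    rng = random.Random(seed)
    corners = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
    xs, ys = [], []
    for i in range(n):
        cx, cy = corners[i % 4]
        xs.append([cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)])
        ys.append(1 if cx * cy > 0 else 0)
    return xs, ys


def add_product(xs):
    # One nonlinear feature (x1 * x2) makes the same label linearly decodable.
    return [[x1, x2, x1 * x2] for x1, x2 in xs]


train_x, train_y = make_sign_agreement_features(400, seed=0)
test_x, test_y = make_sign_agreement_features(400, seed=1)

# Negative result: the linear probe on raw features cannot separate the classes.
w, b = fit_logistic(train_x, train_y)
linear_acc = accuracy(w, b, test_x, test_y)

# Same information, one nonlinear step away: now the probe succeeds.
w2, b2 = fit_logistic(add_product(train_x), train_y)
nonlinear_acc = accuracy(w2, b2, add_product(test_x), test_y)
```

No line can classify more than three of the four clusters correctly, so the raw-feature probe is capped near 75 percent however it is trained, even though nothing about the underlying information has changed.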

The next step is an attention probe, which we used a lot in the CapPa paper and which Sonia also used in her paper on the topic earlier this year.
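For readers unfamiliar with the term, an attention probe replaces fixed pooling with a learned query that decides which tokens to read before the linear readout. The following is a simplified single-head forward pass with no key or value projections, written as an assumption-laden sketch; it is not the exact probing head used in the papers mentioned above, and all names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, query):
    """Pool token vectors with weights softmax(token . query / sqrt(dim))."""
    dim = len(query)
    scores = [sum(t_i * q_i for t_i, q_i in zip(tok, query)) / math.sqrt(dim)
              for tok in tokens]
    weights = softmax(scores)
    pooled = [sum(wt * tok[i] for wt, tok in zip(weights, tokens))
              for i in range(dim)]
    return pooled, weights


def attention_probe_logit(tokens, query, w, b):
    """Probe score: a linear readout applied to the attention-pooled vector."""
    pooled, _ = attention_pool(tokens, query)
    return sum(wi * pi for wi, pi in zip(w, pooled)) + b


tokens = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]  # three toy token vectors

# A zero query gives uniform weights, so pooling reduces to a plain mean.
uniform_pooled, uniform_weights = attention_pool(tokens, [0.0, 0.0])

# A query aligned with the third token concentrates the weight on it.
focused_pooled, focused_weights = attention_pool(tokens, [3.0, 3.0])

logit = attention_probe_logit(tokens, [0.0, 0.0], [1.0, -1.0], 0.5)
```

In a trained probe the query, `w`, and `b` would all be learned while the backbone stays frozen. The extra capacity lets the probe find information that mean pooling washes out, at the cost of a weaker claim about how explicit that information is.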

3h · Views 57 · Likes 4 · Bookmarks 1
Hila Chefer@hila_chefer

> So a model might be shit at generating but still have a grasp of what it *should* be generating.

The infamous knowing-doing gap 🤓 I actually think that if that were the case, methods like REPA wouldn't help, because the "know" that the model gains would not lead to "do".

I'm mostly basing this on the metrics: the ones that seem much better seem to be appearance-based, whereas the more challenging temporal ones (e.g., gravity and solidity here) are not as impressive. But I will take a look!

2h · Views 79 · Likes 3 · Bookmarks 1
Artem Zholus@artemZholus

@giffmana This paper compares 1B-2B diffusion models with 300M semantic models! No wonder they are better, since they are just bigger. There are larger versions of V-JEPA, and the results might be different if those were included!

2h · Views 91 · Likes 6

@artemZholus Yeah that's a fair point! Cc @Somjit77

Note my point here is the positive signal on "pure video models" so that's unaffected by this point.

2h · Views 71 · Likes 4

(I'm not super deeply familiar either) This page shows examples of positives/negatives for IntPhys: https://intphys.cognitive-ml.fr/benchmark/test_blocks.html#block-o1-object-permanence

And this page for InfLevel: https://github.com/allenai/inflevel

Just from looking at them, I do feel that it's more than just good appearance preservation.

What I have the feeling might cause at least your initial skepticism, is that these are not generation benchmarks, but discrimination benchmarks. So a model might be shit at generating but still have a grasp of what it *should* be generating.

3h · Views 175 · Likes 3
Simon@SimonGoodman_

@ziv_ravid @giffmana Yeah, but it's a good way to see if the information you want is easily extractable, which *kind of* suggests that the model represents it well

4h · Views 86 · Likes 1
Adam Goldstein@goldstein_aa

@giffmana @mapo1 But isn't that indeed a JEPA diss? If what you say is true, what's the justification for JEPA?

4h · Views 107
Rohit Bandaru@rohit_bandaru

@giffmana Video generation models need to understand physics for accurate generation. I think representation learning methods could learn physics more efficiently, but that’s hard to evaluate.

3h · Views 328 · Likes 2