Tech · 4h ago

Study finds diffusion video models encode physics more accurately than specialized world models via linear probing

Story Overview

Linear probes trained on the internal activations of video diffusion models during reverse sampling decode physical concepts such as object permanence and gravity more accurately than the same probes on dedicated world models. On the IntPhys and InfLevel benchmarks, WAN-1.3B reaches roughly 81 percent average accuracy versus about 71 percent for V-JEPA.

Original post

Lucas Beyer (bl16)@giffmana

You may have recently heard claims that video generation models are "dumb" about physics, and only "world models" (V-JEPA, specifically) have a valid internal model of physics.

This turns out to be false. In a recent paper, researchers show that a LINEAR probe of diffusion videogen models predicts various "physics" very well, significantly better than V-JEPA or VideoMAE (and plain VAE just sucks).

This is noteworthy, because a *linear* probe being this accurate shows that the model has a pretty explicit internal representation of the physics!

7:38 AM · Jun 10, 2026 · 29.8K Views

Research Insight

Probes pull physics from the denoising path

The signal appears inside the transformer blocks along the reverse trajectory and is absent from the VAE latents themselves, showing that generative flow-matching training alone can produce representations of continuity and solidity without any explicit physics objective.
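As a rough illustration of what "linear probe" means here, the sketch below fits a logistic-regression classifier on frozen feature vectors and reports held-out accuracy. Everything in it is an assumption made for illustration: the features are synthetic stand-ins for pooled transformer-block activations, the binary labels stand in for plausible vs implausible clips, and the function names are hypothetical. It is not the paper's protocol.

```python
import numpy as np

def train_linear_probe(feats, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = feats.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        err = probs - labels                       # gradient of the log loss w.r.t. the logits
        w -= lr * (feats.T @ err / n + l2 * w)     # small L2 penalty keeps weights bounded
        b -= lr * err.mean()
    return w, b

def probe_accuracy(feats, labels, w, b):
    """Fraction of examples whose predicted class matches the label."""
    return float(((feats @ w + b > 0) == (labels > 0.5)).mean())

def synthetic_features(n, d, signal, rng):
    """Hypothetical stand-in for frozen activations: Gaussian noise plus a
    label-dependent shift of size `signal` along one fixed direction."""
    labels = rng.integers(0, 2, size=n).astype(float)
    direction = np.ones(d) / np.sqrt(d)
    feats = rng.normal(size=(n, d)) + signal * np.outer(2 * labels - 1, direction)
    return feats, labels

def held_out_accuracy(signal, seed=0, n=2000, d=64):
    """Train the probe on half of the synthetic data, score it on the other half."""
    rng = np.random.default_rng(seed)
    feats, labels = synthetic_features(n, d, signal, rng)
    half = n // 2
    w, b = train_linear_probe(feats[:half], labels[:half])
    return probe_accuracy(feats[half:], labels[half:], w, b)

if __name__ == "__main__":
    print("label encoded along one direction:", held_out_accuracy(signal=2.0))
    print("label absent from the features:   ", held_out_accuracy(signal=0.0))
```

The only trainable parameters are `w` and `b`, so high held-out accuracy means the label is readable from the features through a single weighted sum. When the label is absent from the features (`signal=0.0`), the same probe should stay near chance.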

Open Question

Specialized models still lag on the same tests

When the identical linear-probe protocol is applied to V-JEPA 2 ViT-L and VideoMAE-Large, the accuracy drops. This leaves open whether the gap would persist under different probing methods or translate into better downstream video generation.

Sentiment

Positive users appreciate evidence that diffusion video models encode physics better than V-JEPA, while negative users call the framing, measurements, and size comparisons flawed or misleading.

Pos: 50.0% · Neg: 50.0%

13 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 3.7K · Bookmarks: 31 · Likes: 55 · Retweets: 3

Lucas Beyer (bl16)@giffmana

The paper is "The invisible hand of physics" from a surprisingly diverse set of authors (Parsa Esmati, @Somjit77): https://arxiv.org/abs/2606.05328 (it's from just a few days ago). I learned about it from a nice talk by @katjahofmann today.

The paper from earlier in the year is by @soniajoseph_ et al.: https://arxiv.org/abs/2602.07050

Replies: 2

Ravid Shwartz Ziv@ziv_ravid

@giffmana Without getting into the who has a good world model debate, I'm not sure linear probe is a good way to measure the quality of the representation (but it's the best we have, though)

kache@yacineMTB

@giffmana i love linear probes. have you seen RAPTOR?

Alexander Doria@Dorialexander

Not surprised. It's totally true that we haven't nailed the most efficient arch for many modalities/data representation, just it isn't JEPA.

Lucas Beyer (bl16)@giffmana

To be clear, this is not a V-JEPA or VideoMAE diss, just resurrecting the fact that "pure videogen" models may indeed learn an explicit model of the world/physics as a byproduct.

Also cc @mapo1: we chatted about this and you also intuitively pushed back against such claims.

Lucas Beyer (bl16)@giffmana

At the beginning of the year, there was another paper that did this check only for V-MAE and V-JEPA, and showed both have some understanding of physics. This new paper essentially extends the study to diffusion-based "pure videogen" models and shows they understand physics very well.

Khai Loong Aw@khai_loong_aw

Linear probes trained on plausible vs implausible videos may latch onto spurious features unrelated to physics, e.g., counting the number of unique objects. Results also depend heavily on what/how frames are fed to the model. A lot of care is needed to ensure the probe is actually about physics.

Alternative test: have models make predictions, then check those for physics violations. Qualitatively first, then with a metric.

That said, I agree with you that today's models can learn intuitive physics. Along these lines, we show a zero-shot world model makes predictions that satisfy short-timescale intuitive physics: https://arxiv.org/abs/2604.10333. Large pretrained V-JEPA2 also scores decently here, though its predictions are harder to interpret.

kache@yacineMTB

@giffmana basically; RL trained enough with domain randomization allows you to predict state from the hidden state

accidental state estimator

Hila Chefer@hila_chefer

@giffmana Haven’t read the paper but this feels a bit suspicious 🤨 the video models examined here are tiny and mostly not very good at producing physically plausible videos, how is plausibility defined in this context?

Hila Chefer@hila_chefer

Yeah I’m not familiar with the benchmark and should probably read more about it, but my vibe from the figure above is that most properties that are strictly better in diffusion are appearance preservation over frames (permanence, shape etc) rather than “hard core temporal” properties. Definitely not saying VJEPA solves these, but I would say that the results on temporal properties are less convincing

Lucas Beyer (bl16)@giffmana

@hila_chefer Same bench (but slightly different protocol) as Sonia's paper from earlier in the year.

Oscar Mañas@oscmansan

@giffmana This is evidence for what we were discussing today @Germs96: that the information contained in V-JEPA representations is a subset of that of video generators

Lucas Beyer (bl16)@giffmana

If linear probe is positive, it's a very strong signal the information is there *and explicit* (in whatever is the input to the probe)

If it's negative it doesn't mean the info isn't there, but at least that the info isn't there in an easily extractable way.

The next step is attention probe, which we used a lot in the CapPa paper and also Sonia used in her paper on the topic earlier this year.
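The asymmetry described above (a positive linear-probe result is informative, a negative one is not) can be seen in a toy case. In the sketch below, a hypothetical two-dimensional "representation" encodes the label as the XOR of the signs of its two coordinates. The information is fully present, yet no linear boundary can classify more than three of the four clusters correctly, while the same classifier given one extra nonlinear feature recovers it. The data, the function names and the product feature are illustrative assumptions, not anything taken from the papers discussed.

```python
import numpy as np

def fit_logistic(feats, labels, lr=0.5, steps=500):
    """Logistic regression trained with plain gradient descent."""
    n, d = feats.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        err = probs - labels
        w -= lr * (feats.T @ err) / n
        b -= lr * err.mean()
    return w, b

def accuracy(feats, labels, w, b):
    return float(((feats @ w + b > 0) == (labels > 0.5)).mean())

def xor_features(n_per_corner, noise, rng):
    """Four balanced clusters at (+-1, +-1); label is 1 when the signs differ."""
    corners = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
    clean = np.repeat(corners, n_per_corner, axis=0)
    labels = (clean[:, 0] * clean[:, 1] < 0).astype(float)
    feats = clean + noise * rng.normal(size=clean.shape)
    return feats, labels

def with_product(feats):
    """Append the product of the two coordinates as one nonlinear feature."""
    return np.column_stack([feats, feats[:, 0] * feats[:, 1]])

def compare_probes(seed=0, n_per_corner=250, noise=0.1):
    rng = np.random.default_rng(seed)
    train_x, train_y = xor_features(n_per_corner, noise, rng)
    test_x, test_y = xor_features(n_per_corner, noise, rng)
    w, b = fit_logistic(train_x, train_y)
    linear_acc = accuracy(test_x, test_y, w, b)
    w2, b2 = fit_logistic(with_product(train_x), train_y)
    nonlinear_acc = accuracy(with_product(test_x), test_y, w2, b2)
    return linear_acc, nonlinear_acc

if __name__ == "__main__":
    linear_acc, nonlinear_acc = compare_probes()
    print("linear probe accuracy:      ", linear_acc)
    print("with one nonlinear feature: ", nonlinear_acc)
```

A low linear score here says only that the label is not linearly readable from the raw coordinates. That is the kind of gap a more expressive probe is meant to close.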

Hila Chefer@hila_chefer

> So a model might be shit at generating but still have a grasp of what it *should* be generating.

the infamous knowing-doing gap 🤓 I actually think if that were the case, methods like REPA wouldn't help, because the "know" that the model gains would not lead to "do".

I'm mostly basing this on the metrics: the ones that seem much better seem to be appearance-based, whereas the more challenging temporal ones (e.g., gravity, solidity here) are not as impressive. But will take a look!

Artem Zholus@artemZholus

@giffmana This paper compares 1B-2B diffusion models with 300M semantic models! No wonder they are better, since they are just bigger. There are larger versions of V-JEPA and the results might be different if those were included!

Lucas Beyer (bl16)@giffmana

@artemZholus Yeah that's a fair point! Cc @Somjit77

Note my point here is the positive signal on "pure video models" so that's unaffected by this point.

Lucas Beyer (bl16)@giffmana

(I'm not super deeply familiar either) This page shows examples of positives/negatives for IntPhys: https://intphys.cognitive-ml.fr/benchmark/test_blocks.html#block-o1-object-permanence

And this page for InfLevel: https://github.com/allenai/inflevel

Just from looking at them, I do feel that it's more than just good appearance preservation.

What I have the feeling might cause at least your initial skepticism, is that these are not generation benchmarks, but discrimination benchmarks. So a model might be shit at generating but still have a grasp of what it *should* be generating.

Simon@SimonGoodman_

@ziv_ravid @giffmana Yeah, but it's a good way to see if the information you want is easily extractable, which *kind of* suggests that the model represents it well

Adam Goldstein@goldstein_aa

@giffmana @mapo1 But isn't that indeed a JEPA diss? If what you say is true, what's the justification for JEPA?

Rohit Bandaru@rohit_bandaru

@giffmana Video generation models need to understand physics for accurate generation. I think representation learning methods could learn physics more efficiently, but that’s hard to evaluate.
