
Meta's Lucas Beyer argues pixel-only video models encode physical plausibility, while Sonia Joseph frames the issue around data efficiency

Upcoming research decodes physical plausibility from diffusion models

Original post

Lucas Beyer (bl16) @giffmana

@soniajoseph_ I agree with you, but here (at least for me) it's not about "which loss has it better"; it's an existence proof that "pure videogen pixel models got this too", directly disproving claims to the contrary. For this, size doesn't matter.

Sonia Joseph @soniajoseph_

Nice results and paper!

That said, the diffusion models tested are an order of magnitude larger than the V-JEPA/VideoMAEv2 models (~2B vs 300M). To my knowledge, there is no clean ablation that fixes model parameters and dataset size and varies only the objective function, in order to cleanly get physical plausibility scaling laws. Even V-JEPA2 and VideoMAEv2 are trained on datasets that differ by an order of magnitude.

We faced this issue in our rebuttals for "Interpreting Physics in Video World Models", where reviewers wanted to see a consistent ablation, but that would require training a suite of VideoMAEv2, V-JEPA2, diffusion, and autoregressive models at fixed dataset and model sizes, which was totally out of scope. But the first study on this will be highly impactful.

2:31 PM · Jun 13, 2026 · 182 Views

Sentiment

Users agree diffusion models can learn physics from video pixels, urging the AI community to explore implementation rather than question feasibility.

Pos: 100.0% · Neg: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 56 · Likes: 4 · Replies: 1

Sonia Joseph @soniajoseph_

i see, it may depend on who you ask :P. our original paper does show physical plausibility is possible from pixel reconstruction. our rebuttal/extended version (to be released soon) shows physical plausibility can also be decoded from diffusion models. the question, in my mind, was always about how data efficiency changes with your objective function.

@artemZholus once speculated that V-JEPA 2 was acting as a large "kernel function" making important visual/motion features sparse and linear with orders of magnitude more efficiency than pixel reconstruction / diffusion objectives. but to be tested.

Artem Zholus @artemZholus

@soniajoseph_ @giffmana +1 The question (that we as a community should ask) is not whether diffusion can learn or cannot learn physics. It can. The question is what is the most flops-efficient way of learning physics of the real world and i think it is not pixel-level end-to-end diffusion/fm.
