At the beginning of the year, another paper ran this check only for VideoMAE and V-JEPA, and showed that both have some understanding of physics. This new paper essentially extends the study to diffusion-based "pure videogen" models and shows they understand physics very well.
You may have recently heard claims that video generation models are "dumb" about physics, and only "world models" (V-JEPA, specifically) have a valid internal model of physics.
This turns out to be false. In a recent paper, researchers show that a LINEAR probe on diffusion videogen models predicts various "physics" very well, significantly better than V-JEPA or VideoMAE (and a plain VAE just sucks).
This is noteworthy, because a *linear* probe being this accurate shows that the model has a pretty explicit internal representation of the physics!
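To make "linear probe" concrete, here is a minimal sketch on synthetic data. It is not the paper's setup: the features are random stand-ins for frozen model activations, `velocity` is a hypothetical physical quantity, and the helper names `fit_linear_probe` and `probe_r2` are made up for illustration. The point is only that a linear probe scores well when the quantity is encoded along a direction in feature space, and fails when the information is there only in a nonlinear form.

```python
import numpy as np

def fit_linear_probe(features, targets, l2=1e-3):
    """Closed-form ridge regression from frozen features to a scalar target."""
    # Append a constant column so the probe can learn a bias term
    # (the small L2 penalty is applied to the bias as well, for simplicity).
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)

def probe_r2(weights, features, targets):
    """Coefficient of determination (R^2) of the probe on held-out data."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    pred = X @ weights
    ss_res = np.sum((targets - pred) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n, d = 2000, 64
velocity = rng.uniform(-1.0, 1.0, size=n)  # hypothetical physical quantity
direction = rng.normal(size=d)             # hypothetical encoding direction

# Case 1: the quantity is written linearly into the features (plus noise).
linear_feats = np.outer(velocity, direction) + 0.1 * rng.normal(size=(n, d))
# Case 2: only velocity**2 is present, so the sign is not linearly decodable.
squared_feats = np.outer(velocity ** 2, direction) + 0.1 * rng.normal(size=(n, d))

train, test = slice(0, 1500), slice(1500, None)

w_linear = fit_linear_probe(linear_feats[train], velocity[train])
r2_linear = probe_r2(w_linear, linear_feats[test], velocity[test])

w_squared = fit_linear_probe(squared_feats[train], velocity[train])
r2_squared = probe_r2(w_squared, squared_feats[test], velocity[test])

print(f"R^2, linearly encoded features: {r2_linear:.3f}")
print(f"R^2, squared-only features:     {r2_squared:.3f}")
```

In a real probing study the synthetic arrays would be replaced by activations extracted from a frozen video model, and the model itself is never fine-tuned. Only the small linear map is trained, which is why a high score says something about the representation rather than about the probe.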


