@soniajoseph_ I agree with you, but here (at least for me) it's not about "which loss has it better"; it's an existence proof that "pure videogen pixel models got this too", directly disproving claims to the contrary. For this, size doesn't matter.
Nice results and paper!
That said, the diffusion models tested are an order of magnitude larger than the V-JEPA/VideoMAEv2 (~2B vs 300M). To my knowledge, there is no clean ablation that fixes model size and dataset size and varies only the objective function, which is what you would need to get clean scaling laws for physical plausibility. Even V-JEPA2 and VideoMAEv2 are trained on datasets that differ in size by an order of magnitude.
We faced this issue in our rebuttals for "Interpreting Physics in Video World Models", where reviewers wanted to see a consistent ablation. That would require training a suite of VideoMAEv2, V-JEPA2, diffusion, and autoregressive models at fixed dataset and model sizes, which was totally out of scope. But the first study to do this will be highly impactful.
