Introducing MilliVid, our new method for long-context video generation! MilliVid creates videos that are consistent over long time spans, without using retrieval heuristics or 3D maps! (1/n) https://davidcharatan.com/millivid/#
Vincent Sitzmann of MIT releases MilliVid, a long-context video generation method that maintains consistency without explicit 3D maps
The method outperformed existing baselines in Minecraft environment tests.
Users are excited about MilliVid's hierarchical latents enabling consistent long video generation, praising the cool compression, tokenization, and team's thoughtful replies.
Sold! 😊🥰

The core idea is to use a hierarchical tokenizer that encodes frames into varying numbers of tokens for variable compression rates. We then build a coarse-to-fine diffusion model that only keeps the most recent frames at full token budget and uses compressed past frames! (2/n)
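To make the variable-compression idea concrete, here is a minimal sketch of how a per-frame token budget could shrink with frame age. It is not taken from the paper: the function name `token_schedule`, the halving rule, and every parameter value (`full_tokens`, `recent_full`, `halve_every`, `min_tokens`) are assumptions chosen for illustration only.

```python
def token_schedule(num_frames, full_tokens=256, recent_full=2,
                   halve_every=4, min_tokens=1):
    """Return the token count for each context frame, oldest frame first.

    Hypothetical rule (an assumption, not MilliVid's actual schedule):
    the `recent_full` newest frames keep the full budget, and the budget
    is halved once for every `halve_every` frames of additional age.
    """
    budgets = []
    for age in range(num_frames):  # age 0 is the most recent frame
        if age < recent_full:
            tokens = full_tokens
        else:
            level = (age - recent_full) // halve_every + 1
            tokens = max(min_tokens, full_tokens >> level)
        budgets.append(tokens)
    return budgets[::-1]  # reverse so the oldest frame comes first


if __name__ == "__main__":
    schedule = token_schedule(6, full_tokens=256, recent_full=2, halve_every=2)
    print(schedule)       # per-frame budgets, oldest first
    print(sum(schedule))  # total context tokens under this toy rule
```

Under a rule like this, the total context cost grows far more slowly than the number of frames, which is the property that lets a long history fit in a fixed token budget.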

Definitely an open question! I think what is interesting about this approach is that the decomposition is *not* frequency-based. It is learned. I.e., if you looked at the images that are auto-encoded at the coarsest level and compared them to the low-pass filtered version of equivalent compression ratio, you'd find that they look entirely different. Now, what information you encode depends on the objective of the hierarchical auto-encoder! We simply use least-squares, but there are definitely other things to explore here!

Unlike existing variable-compression methods (e.g., FramePack), we not only variably compress the past, but also generate the future from coarse to fine. This is what rollout with our model looks like: we first denoise a large chunk of the most compressed latents, then superresolve them with a lossless recent context and progressively more compressed past frames. Every rollout step uses the same diffusion transformer weights! (3/n)
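The coarse-to-fine rollout described above can be pictured as a sequence of stages, each one run with the same model. The sketch below only plans such stages. It assumes, purely for illustration, that every stage spends the same number of tokens, so coarser stages cover more future frames. The name `rollout_plan`, the level sizes, and the equal-budget rule are all hypothetical and do not reflect the paper's actual schedule.

```python
def rollout_plan(stage_budget, tokens_per_level):
    """Plan coarse-to-fine generation stages.

    `tokens_per_level` lists tokens per frame from coarsest to finest.
    Assumption (not from the paper): each stage has the same token
    budget, so a stage covers `stage_budget // tokens` future frames.
    Returns a list of (tokens_per_frame, frames_covered) tuples.
    """
    plan = []
    for tokens in tokens_per_level:
        frames = max(1, stage_budget // tokens)
        plan.append((tokens, frames))
    return plan


if __name__ == "__main__":
    # Coarsest level first: many frames at few tokens each, then
    # progressively fewer frames at higher token counts.
    for tokens, frames in rollout_plan(1024, [16, 64, 256]):
        print(f"denoise {frames} frames at {tokens} tokens/frame")
```

In a real system each stage would be a call to the shared diffusion transformer, conditioned on the coarser output of the previous stage and on the compressed context.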

This paper was led by my students @DavidCharatan and @ishaanpreetam, in collaboration with @phillip_isola (also co-advising Ishaan!) and our friends from Toyota Research, @ZakharovSergeyN, @vitorguizilini and @basilevanh as part of the University 3.0 TRI collaboration :) (7/n)

We are very excited about the results: we show that on Minecraft, our model can memorize 3D scene geometry for *hundreds* of frames, without retrieval or expert-crafted 3D map heuristics, at the same token budget as a conventional five-frame-context diffusion model! (4/n)

See all the details, videos, and the paper under: https://davidcharatan.com/millivid/#
Code will be released in the next few days! The arXiv submission is currently pending; for now, we're self-hosting the PDF: https://davidcharatan.com/millivid/millivid.pdf

Quantitatively, this shows up as a dramatic improvement of PSNR and FVD over rollouts of hundreds of frames. Note that even a diffusion model with perfect memory can't achieve a flat consistency curve, as the model has to generate unseen scene content! (5/n)

The FVD curve further shows that our model significantly alleviates exposure bias, resulting in rollouts that are stable over hundreds of frames without any self-forcing, diffusion forcing, or other mitigation strategies! (6/n)

Great work! Curious how sensitive the method is to the frequency decomposition being clean. The coarse-to-fine approach works when low-frequency latents capture the consistency-relevant content, which Minecraft basically guarantees. On in-the-wild video, where high-frequency detail carries semantic info, do you expect the coarse levels to remain informative enough to drive long-range recall?

Also, shoutout to some related / relevant work: Of course, FramePack by Lvming Zhang! Then, inspiring work on flexible tokenization by folks such as @ShivamDuggal4, Roman Bachmann, @JRAllardice, David Mizrahi, @andrew_atanov, @_xwen_, @BingchenZhao, some of it in @zamir_ar lab!

@vincesitzmann very cool work team!! @DavidCharatan @ishaanpreetam @vincesitzmann. Compression and Tokenization ✨

@vincesitzmann Great work! I wonder if predicting many coarse future frames (which is hypothesized to be why it's better than FramePack) will hurt the action rate at which you can interact with the model.

@vincesitzmann Ah, that's a great clarification! Thanks for the thoughtful reply! Again, great work!