/AI · 6h ago

MIT's Vincent Sitzmann releases MilliVid, using hierarchical latents to maintain consistency in video generation up to 744 frames

The method avoids retrieval heuristics and explicit 3D mapping.

Original post
Vincent Sitzmann@vincesitzmann

Introducing MilliVid, our new method for long-context video generation! MilliVid creates videos that are consistent over long time spans, without using retrieval heuristics or 3D maps! (1/n) https://davidcharatan.com/millivid/#

9:28 AM · Jun 8, 2026 · 11.9K Views
Sentiment

Users praise MilliVid for generating long consistent videos via hierarchical latents, highlighting its success at memorizing 3D scene geometry across hundreds of frames in Minecraft.

Positive: 100.0% · Negative: 0.0% · 5 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity · Views 2.2K · Bookmarks 10 · Retweets 1

Sold! 😊🥰

Vincent Sitzmann@vincesitzmann

Introducing MilliVid, our new method for long-context video generation! MilliVid creates videos that are consistent over long time spans, without using retrieval heuristics or 3D maps! (1/n) https://davidcharatan.com/millivid/#

5h · Views 2.2K · Likes 11 · Bookmarks 10
LIKES 16
Vincent Sitzmann@vincesitzmann

The core idea is to use a hierarchical tokenizer that encodes frames into varying numbers of tokens for variable compression rates. We then build a coarse-to-fine diffusion model that only keeps the most recent frames at full token budget and uses compressed past frames! (2/n)

6h · Views 1.2K · Likes 16 · Bookmarks 3
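
The context layout described in (2/n) can be sketched as follows. This is a minimal illustration and not the authors' code: the `Level` class, the `plan_context` function, and every token count below are hypothetical, since the thread does not give the actual numbers of levels or tokens.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """One compression level of a hypothetical hierarchical tokenizer."""
    tokens_per_frame: int  # assumed token count per frame at this level
    num_frames: int        # assumed number of past frames kept at this level


def plan_context(full_tokens: int, recent_frames: int,
                 past_levels: list[Level]) -> list[int]:
    """Return per-frame token counts, ordered from oldest to newest frame.

    `past_levels` is listed from the most recent (least compressed) past
    frames to the oldest (most compressed) ones.
    """
    counts: list[int] = []
    # Walk the levels in reverse so the oldest frames come first.
    for level in reversed(past_levels):
        if not 0 < level.tokens_per_frame <= full_tokens:
            raise ValueError("a compressed frame cannot exceed the full budget")
        counts.extend([level.tokens_per_frame] * level.num_frames)
    # The most recent frames keep the full token budget.
    counts.extend([full_tokens] * recent_frames)
    return counts


# Hypothetical schedule: 2 recent frames at 256 tokens each, then
# 8, 32, and 128 older frames at 64, 16, and 4 tokens each.
example_levels = [Level(64, 8), Level(16, 32), Level(4, 128)]
example_counts = plan_context(256, 2, example_levels)
print(len(example_counts), "frames,", sum(example_counts), "tokens")
```

With these made-up numbers, each group of frames costs the same number of tokens, so the context grows in frames much faster than it grows in tokens.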
REPLIES 1
Vincent Sitzmann@vincesitzmann

Definitely an open question! I think what is interesting about this approach is that the decomposition is *not* frequency-based. It is learned. I.e., if you looked at the images that are auto-encoded at the coarsest level and compared them to the low-pass filtered version of equivalent compression ratio, you'd find that they look entirely different. Now, what information you encode depends on the objective of the hierarchical auto-encoder! We simply use least-squares, but there are definitely other things to explore here!

5h · Views 45 · Likes 2
Vincent Sitzmann@vincesitzmann

Unlike existing variable-compression methods (e.g., FramePack), we not only variably compress the past, but also generate the future from coarse to fine. This is what rollout with our model looks like: we first denoise a large chunk of the most compressed latents, then superresolve them with a lossless recent context and progressively more compressed past frames. Every rollout step uses the same diffusion transformer weights! (3/n)

6h · Views 794 · Likes 15 · Bookmarks 2
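
The control flow of one rollout step in (3/n) might look roughly like the sketch below. It only shows the ordering (coarsest level first, then successive super-resolution) and the reuse of a single model at every level. The `rollout_chunk` function, the `Model` signature, and `toy_model` are hypothetical stand-ins, and the thread does not say which frames are refined at each level.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical model signature:
# (tokens_per_frame, coarser_output_or_None, context) -> output
Model = Callable[[int, Optional[str], str], str]


def rollout_chunk(model: Model, level_tokens: Sequence[int],
                  context: str) -> List[str]:
    """Run one coarse-to-fine rollout step and return each level's output.

    `level_tokens` lists tokens per frame from coarsest to finest. The same
    `model` callable is used at every level, mirroring the shared weights
    mentioned in the post.
    """
    if list(level_tokens) != sorted(level_tokens):
        raise ValueError("levels must be ordered from coarse to fine")
    outputs: List[str] = []
    previous: Optional[str] = None
    for tokens in level_tokens:
        # The first call has no coarser output to condition on; later calls
        # refine the result of the previous, coarser level.
        previous = model(tokens, previous, context)
        outputs.append(previous)
    return outputs


def toy_model(tokens: int, coarser: Optional[str], context: str) -> str:
    """Stand-in for a diffusion transformer; it only labels what it was asked."""
    action = "denoise" if coarser is None else "super-resolve"
    return f"{action}@{tokens}"


stages = rollout_chunk(toy_model, [4, 16, 64, 256], context="compressed past")
print(stages)
```

A real implementation would pass latent tensors instead of strings and would build the context from the recent full-budget frames plus the compressed past.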
Vincent Sitzmann@vincesitzmann

This paper was led by my students @DavidCharatan and @ishaanpreetam, in collaboration with @phillip_isola (also co-advising Ishaan!) and our friends from Toyota Research, @ZakharovSergeyN, @vitorguizilini and @basilevanh as part of the University 3.0 TRI collaboration :) (7/n)

6h · Views 395 · Likes 6
Vincent Sitzmann@vincesitzmann

We are very excited about the results: we show that on Minecraft, our model can memorize 3D scene geometry for *hundreds* of frames, without retrieval or expert-crafted 3D map heuristics, at the same token budget as a conventional five-frame-context diffusion model! (4/n)

6h · Views 279 · Likes 6
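
To see how a five-frame token budget could stretch to hundreds of frames, as claimed in (4/n), here is a small arithmetic sketch. The function name and all token counts are assumptions chosen for illustration; the thread does not state the real values.

```python
def compressed_frames_within_budget(full_tokens: int, baseline_frames: int,
                                    recent_frames: int,
                                    compressed_tokens: int) -> int:
    """Count how many compressed past frames fit in a fixed token budget.

    The budget equals `baseline_frames` frames at `full_tokens` each. After
    reserving `recent_frames` at full resolution, the remainder is spent on
    past frames that cost `compressed_tokens` each.
    """
    if compressed_tokens <= 0 or full_tokens <= 0:
        raise ValueError("token counts must be positive")
    budget = full_tokens * baseline_frames
    remaining = budget - full_tokens * recent_frames
    if remaining < 0:
        raise ValueError("recent frames alone exceed the budget")
    return remaining // compressed_tokens


# Hypothetical numbers: 256 tokens per full frame, a five-frame budget,
# 2 recent frames kept at full resolution, 4 tokens per compressed frame.
past_frames = compressed_frames_within_budget(256, 5, 2, 4)
print(past_frames)
```

Under these assumed numbers, the three frames' worth of tokens not spent on recent frames covers 192 heavily compressed past frames.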
Vincent Sitzmann@vincesitzmann

See all the details, videos, and the paper under: https://davidcharatan.com/millivid/#

Code will be released in the next few days! The arXiv submission is currently pending; for now, we're self-hosting the PDF: https://davidcharatan.com/millivid/millivid.pdf

6h · Views 408 · Likes 4
Vincent Sitzmann@vincesitzmann

Quantitatively, this shows up as a dramatic improvement in PSNR and FVD over rollouts of hundreds of frames. Note that even a diffusion model with perfect memory can't achieve a flat consistency curve, as the model has to generate unseen scene content! (5/n)

6h · Views 233 · Likes 4
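
For readers unfamiliar with the PSNR metric mentioned in (5/n), a per-frame PSNR curve can be computed with the standard formula, 10 · log10(MAX² / MSE). The sketch below assumes frames are flat sequences of pixel values in [0, 1]; the helper names are hypothetical, and the paper's exact evaluation protocol is not described in the thread.

```python
import math
from typing import List, Sequence


def psnr(reference: Sequence[float], generated: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr_curve(reference_frames: Sequence[Sequence[float]],
               generated_frames: Sequence[Sequence[float]],
               max_value: float = 1.0) -> List[float]:
    """PSNR for each frame pair, in rollout order."""
    return [psnr(r, g, max_value)
            for r, g in zip(reference_frames, generated_frames)]


# Tiny made-up example: the second generated frame drifts from the reference.
curve = psnr_curve([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.5, 0.5]])
print(curve)
```

Plotting such a curve against the frame index shows how quickly a rollout drifts away from the ground truth.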
Vincent Sitzmann@vincesitzmann

The FVD curve further shows that our model significantly alleviates exposure bias, resulting in rollouts that are stable over hundreds of frames without any self-forcing, diffusion forcing, or other mitigation strategies! (6/n)

6h · Views 232 · Likes 4
Nan Liu@nanliuuu

Great work! Curious how sensitive the method is to the frequency decomposition being clean. The coarse-to-fine approach works when low-frequency latents capture the consistency-relevant content, which Minecraft basically guarantees. On in-the-wild video, where high-frequency detail carries semantic info, do you expect the coarse levels to remain informative enough to drive long-range recall?

6h · Views 122 · Likes 1
Vincent Sitzmann@vincesitzmann

Also, shoutout to some related / relevant work: Of course, FramePack by Lvming Zhang! Then, inspiring work on flexible tokenization by folks such as @ShivamDuggal4, Roman Bachmann, @JRAllardice, David Mizrahi, @andrew_atanov, @_xwen_, @BingchenZhao, some of it in @zamir_ar's lab!

5h · Views 88
Shivam Duggal@ShivamDuggal4

@vincesitzmann very cool work team!! @DavidCharatan @ishaanpreetam @vincesitzmann. Compression and Tokenization ✨

5h · Views 33
Zhichao Yin@Zhichao_Y

@vincesitzmann Great work! I wonder if predicting many coarse future frames, which is hypothesized to be why it's better than FramePack, will hurt the action rate at which you can interact with the model.

5h · Views 14
Nan Liu@nanliuuu

@vincesitzmann Ah, that's a great clarification! Thanks for the thoughtful reply! Again, great work!

4h · Views 3