/AI · 6h ago

MIT's Vincent Sitzmann releases MilliVid, using hierarchical latents to maintain consistency in video generation up to 744 frames

The method avoids retrieval heuristics and explicit 3D mapping.

#727

Original post

Vincent Sitzmann@vincesitzmann · #727 in AI

Introducing MilliVid, our new method for long-context video generation! MilliVid creates videos that are consistent over long time spans, without using retrieval heuristics or 3D maps! (1/n) https://davidcharatan.com/millivid/#

9:28 AM · Jun 8, 2026 · 11.9K Views

Sentiment

Users praise MilliVid for generating long consistent videos via hierarchical latents, highlighting its success at memorizing 3D scene geometry across hundreds of frames in Minecraft.

Pos

100.0%

Neg

0.0%

5 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 2.2K · Bookmarks: 10 · Retweets: 1

Andrea Tagliasacchi @CVPR@taiyasaki

Sold! 😊🥰

Vincent Sitzmann@vincesitzmann

5h

Likes: 16

Vincent Sitzmann@vincesitzmann

The core idea is to use a hierarchical tokenizer that encodes frames into varying numbers of tokens for variable compression rates. We then build a coarse-to-fine diffusion model that only keeps the most recent frames at full token budget and uses compressed past frames! (2/n)
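The budget idea in this post can be illustrated with a toy schedule. The sketch below is not MilliVid's actual tokenizer or schedule: the function name `token_schedule`, the halving rule, and every default value (256 tokens for a full frame, a two-frame full-budget window, a floor of one token) are assumptions chosen only to show how per-frame token counts could shrink with age.

```python
def token_schedule(num_frames, full_tokens=256, recent=2, min_tokens=1):
    """Return tokens per frame, oldest frame first.

    Hypothetical rule (not from the paper): the `recent` newest frames
    keep the full budget, and each older frame gets half the budget of
    the frame after it, down to `min_tokens`.
    """
    budgets = []
    for age in range(num_frames):  # age 0 is the newest frame
        if age < recent:
            budgets.append(full_tokens)
        else:
            level = age - recent + 1  # number of halvings applied
            budgets.append(max(min_tokens, full_tokens >> level))
    return budgets[::-1]  # reverse so the oldest frame comes first


schedule = token_schedule(100)
total_tokens = sum(schedule)
```

Under these assumed numbers, a 100-frame context costs 857 tokens in total, which is less than the 1,280 tokens that five frames at the full 256-token budget would cost.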

6h

Replies: 1

Vincent Sitzmann@vincesitzmann

Definitely an open question! I think what is interesting about this approach is that the decomposition is *not* frequency-based. It is learned. I.e., if you looked at the images that are auto-encoded at the coarsest level and compared them to the low-pass filtered version of equivalent compression ratio, you'd find that they look entirely different. Now, what information you encode depends on the objective of the hierarchical auto-encoder! We simply use least-squares, but there are definitely other things to explore here!

5h

Vincent Sitzmann@vincesitzmann

Unlike existing variable-compression methods (e.g., FramePack), we not only variably compress the past, but also generate the future from coarse to fine. This is what rollout with our model looks like: we first denoise a large chunk of the most compressed latents, then superresolve them with a lossless recent context and progressively more compressed past frames. Every rollout step uses the same diffusion transformer weights! (3/n)
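The rollout order described in this post can be laid out as a schematic plan. This is a sketch under stated assumptions, not the released code: `Step`, `coarse_to_fine_plan`, the chunk size, the number of levels, and the rule that each finer level doubles tokens per frame while halving the number of frames refined are all hypothetical. The post only says that a large coarse chunk is denoised first and then superresolved with the same weights at every step.

```python
from dataclasses import dataclass


@dataclass
class Step:
    level: int             # 0 is the coarsest (most compressed) level
    tokens_per_frame: int  # tokens used for each generated frame
    frames: int            # number of future frames handled at this level


def coarse_to_fine_plan(chunk=16, coarse_tokens=4, levels=4):
    """Build a hypothetical coarse-to-fine rollout plan.

    Assumption: every finer level doubles the tokens per frame and
    halves the number of frames it refines. A real system would call
    one shared denoiser once per step; that call is omitted here.
    """
    plan = []
    for level in range(levels):
        plan.append(
            Step(
                level=level,
                tokens_per_frame=coarse_tokens << level,
                frames=max(1, chunk >> level),
            )
        )
    return plan


plan = coarse_to_fine_plan()
tokens_per_step = [s.tokens_per_frame * s.frames for s in plan]
```

With these assumed defaults, every step generates the same number of tokens (64), which is one way a single set of transformer weights could see a similar workload at each level. Whether MilliVid balances its steps this way is not stated in the thread.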

6h

Vincent Sitzmann@vincesitzmann

This paper was led by my students @DavidCharatan and @ishaanpreetam, in collaboration with @phillip_isola (also co-advising Ishaan!) and our friends from Toyota Research, @ZakharovSergeyN, @vitorguizilini and @basilevanh as part of the University 3.0 TRI collaboration :) (7/n)

6h

Vincent Sitzmann@vincesitzmann

We are very excited about the results: we show that on Minecraft, our model can memorize 3D scene geometry for *hundreds* of frames, without retrieval or expert-crafted 3D map heuristics, at the same token budget as a conventional five-frame-context diffusion model! (4/n)

6h

Vincent Sitzmann@vincesitzmann

See all the details, videos, and the paper under: https://davidcharatan.com/millivid/#

Code will be released in the next few days! arXiv is currently pending, for now, we're self-hosting the pdf: https://davidcharatan.com/millivid/millivid.pdf

6h

Vincent Sitzmann@vincesitzmann

Quantitatively, this shows up as a dramatic improvement of PSNR and FVD over rollouts of hundreds of frames. Note that even a diffusion model with perfect memory can't achieve a flat consistency curve, as the model has to generate unseen scene content! (5/n)
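For readers unfamiliar with the metric, a per-frame PSNR curve can be computed as below. This uses the standard definition, PSNR = 10 · log10(MAX² / MSE). The thread does not say how the authors compute or aggregate it, so the function names `psnr` and `psnr_curve`, the flat-list frame representation, and the default pixel range of [0, 1] are assumptions for illustration.

```python
import math


def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    if not reference or len(reference) != len(generated):
        raise ValueError("frames must be non-empty and the same length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr_curve(ref_frames, gen_frames, max_value=1.0):
    """PSNR for each frame pair, in rollout order."""
    return [psnr(r, g, max_value) for r, g in zip(ref_frames, gen_frames)]
```

Plotting `psnr_curve` against the frame index gives the kind of consistency curve the post refers to. A higher curve over long rollouts means generated frames stay closer to the ground truth.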

6h

Vincent Sitzmann@vincesitzmann

The FVD curve further shows that our model significantly alleviates exposure bias, resulting in rollouts that are stable over hundreds of frames without any self-forcing, diffusion forcing, or other mitigation strategies! (6/n)

6h

Nan Liu@nanliuuu

Great work! Curious how sensitive the method is to the frequency decomposition being clean. The coarse-to-fine works when low-frequency latents capture the consistency relevant content, which Minecraft basically guarantees. On in-the-wild video, where high-frequency detail carries semantic info, do you expect the coarse levels to remain informative enough to drive long-range recall?

6h

Vincent Sitzmann@vincesitzmann

Also, shoutout to some related / relevant work: Of course, FramePack by Lvming Zhang! Then, inspiring work on flexible tokenization by folks such as @ShivamDuggal4, Roman Bachmann, @JRAllardice, David Mizrahi, @andrew_atanov, @_xwen_, @BingchenZhao, some of it in @zamir_ar lab!

5h

Shivam Duggal@ShivamDuggal4

@vincesitzmann very cool work team!! @DavidCharatan @ishaanpreetam @vincesitzmann. Compression and Tokenization ✨

5h

Zhichao Yin@Zhichao_Y

@vincesitzmann Great work! I wonder if predicting many coarse future frames - which is hypothesized as why it's better than FramePack - will hurt the action rate you can interact with the model.

5h

Nan Liu@nanliuuu

@vincesitzmann Ah that’s great clarification! Thanks for the thoughtful reply! Again great work!

4h