MilliVid Uses Hierarchical Tokens For Consistent Long-Context Video Generation

Original post
Vincent Sitzmann @vincesitzmann · #727 in AI

The core idea is to use a hierarchical tokenizer that encodes frames into varying numbers of tokens for variable compression rates. We then build a coarse-to-fine diffusion model that only keeps the most recent frames at full token budget and uses compressed past frames! (2/n)

Vincent Sitzmann @vincesitzmann

Introducing MilliVid, our new method for long-context video generation! MilliVid creates videos that are consistent over long time spans, without using retrieval heuristics or 3D maps! (1/n) https://davidcharatan.com/millivid/#

9:28 AM · Jun 8, 2026 · 2.1K Views
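
To make the idea in (2/n) concrete, here is a minimal sketch of encoding one frame at several token counts. It is not MilliVid's tokenizer: plain average pooling stands in for the learned hierarchical tokenizer, and the 16x16 grid, the pooling factors, and the names `pool_grid` and `encode_hierarchy` are all assumptions made for illustration.

```python
def pool_grid(grid, factor):
    """Average-pool a square 2D grid (list of lists of floats) by `factor`."""
    size = len(grid)
    if size % factor != 0:
        raise ValueError("grid size must be divisible by the pooling factor")
    pooled = []
    for r in range(0, size, factor):
        row = []
        for c in range(0, size, factor):
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def encode_hierarchy(grid, factors=(1, 2, 4, 8)):
    """Return {factor: flat token list}; a larger factor means fewer tokens.

    Hypothetical stand-in for a learned hierarchical tokenizer.
    """
    return {f: [v for row in pool_grid(grid, f) for v in row] for f in factors}


# Toy "frame": a 16x16 grid of scalar latents (assumed size).
frame_grid = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
levels = encode_hierarchy(frame_grid)
token_counts = {f: len(tokens) for f, tokens in levels.items()}
print(token_counts)  # {1: 256, 2: 64, 4: 16, 8: 4}
```

In this toy setup the same frame costs 256 tokens at the finest level and 4 at the coarsest, which is the kind of variable compression rate the post refers to.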
Sentiment

Users are excited about MilliVid's hierarchical tokens for consistent long-context video generation because the approach lets models memorize 3D scene geometry across hundreds of Minecraft frames without retrieval.

Pos 100.0% · Neg 0.0%
1 comment with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 1.5K · Bookmarks 2 · Likes 17 · Retweets 1 · Replies 1
Vincent Sitzmann @vincesitzmann

Unlike existing variable-compression methods (e.g., FramePack), we not only variably compress the past, but also generate the future from coarse to fine. This is what rollout with our model looks like: we first denoise a large chunk of the most compressed latents, then superresolve them with a lossless recent context and progressively more compressed past frames. Every rollout step uses the same diffusion transformer weights! (3/n)

6h · Views 1.5K · Likes 17 · Bookmarks 2
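
The coarse-to-fine rollout described in (3/n) can be sketched as a schedule of steps. This is a toy illustration under stated assumptions, not the authors' procedure: the number of levels, the chunk size, and the rule that each finer level covers half as many frames are all invented here, and `model` is a placeholder for the single shared diffusion transformer.

```python
from dataclasses import dataclass


@dataclass
class RolloutStep:
    level: int      # 0 = finest (full token budget); larger = more compressed
    frames: range   # future frame indices generated at this step


def coarse_to_fine_schedule(start, chunk=16, num_levels=3):
    """List rollout steps from the coarsest level down to the finest.

    Assumption: the coarsest level covers `chunk` future frames and each
    finer level covers half as many, always beginning at `start`.
    """
    steps = []
    for level in range(num_levels - 1, -1, -1):
        span = chunk >> (num_levels - 1 - level)
        steps.append(RolloutStep(level=level, frames=range(start, start + span)))
    return steps


def rollout(model, start, **schedule_kwargs):
    """Call the same `model` (hypothetical callable) once per step."""
    return [model(step) for step in coarse_to_fine_schedule(start, **schedule_kwargs)]


steps = coarse_to_fine_schedule(start=100)
for step in steps:
    print(step.level, step.frames.start, step.frames.stop)
```

Passing one `model` callable to every step mirrors the post's point that every rollout step uses the same weights; how the context is assembled for each step is left out of this sketch.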
Vincent Sitzmann @vincesitzmann

Also, a shoutout to some related work: FramePack by Lvming Zhang, of course! Then, inspiring work on flexible tokenization by folks such as @ShivamDuggal4, Roman Bachmann, @JRAllardice, David Mizrahi, @andrew_atanov, @_xwen_, @BingchenZhao, some of it in @zamir_ar's lab!

Vincent Sitzmann @vincesitzmann

See all the details, videos, and the paper at: https://davidcharatan.com/millivid/#

Code will be released in the next few days! The arXiv submission is currently pending; for now, we're self-hosting the PDF: https://davidcharatan.com/millivid/millivid.pdf

4h · Views 520 · Likes 4 · Bookmarks 2
Vincent Sitzmann @vincesitzmann

This paper was led by my students @DavidCharatan and @ishaanpreetam, in collaboration with @phillip_isola (also co-advising Ishaan!) and our friends from Toyota Research, @ZakharovSergeyN, @vitorguizilini and @basilevanh as part of the University 3.0 TRI collaboration :) (7/n)

Vincent Sitzmann @vincesitzmann

The FVD curve further shows that our model significantly alleviates exposure bias, resulting in rollouts that are stable over hundreds of frames without any self-forcing, diffusion forcing, or other mitigation strategies! (6/n)

6h · Views 747 · Likes 8 · Bookmarks 0
Vincent Sitzmann @vincesitzmann

We are very excited about the results: we show that on Minecraft, our model can memorize 3D scene geometry for *hundreds* of frames, without retrieval or expert-crafted 3D map heuristics, at the same token budget as a conventional five-frame-context diffusion model! (4/n)

6h · Views 600 · Likes 8 · Bookmarks 0
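
The "same token budget as a five-frame context" claim in (4/n) comes down to simple arithmetic, sketched below. Every number is an assumption chosen for illustration (256 tokens per full frame, the per-level token counts, and the frame caps); none of them come from the paper.

```python
def fill_context(budget, plan):
    """Greedily fill a token budget, newest frames first.

    `plan` is a list of (tokens_per_frame, max_frames) pairs ordered from
    the most recent, least compressed level to the oldest, most compressed
    one; max_frames=None means "as many as still fit".
    Returns the number of frames kept at each level.
    """
    counts = []
    remaining = budget
    for tokens_per_frame, max_frames in plan:
        fit = remaining // tokens_per_frame
        kept = fit if max_frames is None else min(fit, max_frames)
        counts.append(kept)
        remaining -= kept * tokens_per_frame
    return counts


FULL_TOKENS = 256                 # assumed tokens for an uncompressed frame
budget = 5 * FULL_TOKENS          # conventional five-frame context: 1280 tokens

# Hypothetical plan: 2 lossless frames, then progressively coarser history.
plan = [(256, 2), (64, 4), (16, 16), (4, None)]
counts = fill_context(budget, plan)
total_frames = sum(counts)
print(counts, total_frames)  # [2, 4, 16, 64] 86
```

Under these made-up numbers, the budget that holds 5 full frames holds 86 frames of mixed resolution; a coarser oldest level or a smaller lossless window would stretch the history further.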
Vincent Sitzmann @vincesitzmann

Quantitatively, this shows up as a dramatic improvement of PSNR and FVD over rollouts of hundreds of frames. Note that even a diffusion model with perfect memory can't achieve a flat consistency curve, as the model has to generate unseen scene content! (5/n)

6h · Views 508 · Likes 6 · Bookmarks 0
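
For readers unfamiliar with the PSNR curve mentioned in (5/n), here is a minimal sketch of the standard per-frame PSNR computation over a rollout. It uses the textbook definition, 10 * log10(MAX^2 / MSE); the flat-list frame representation, the [0, 1] pixel range, and the function names are assumptions, and the thread does not say how the paper's evaluation is implemented.

```python
import math


def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are flat lists of pixel values in [0, max_value].
    """
    if not reference or len(reference) != len(generated):
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr_curve(reference_frames, generated_frames, max_value=1.0):
    """PSNR for each frame index of a rollout, in order."""
    return [psnr(r, g, max_value) for r, g in zip(reference_frames, generated_frames)]


# Toy rollout: the second generated frame drifts further from the reference.
reference_frames = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]
generated_frames = [[0.1, 0.5, 0.9], [0.3, 0.5, 0.7]]
curve = psnr_curve(reference_frames, generated_frames)
print([round(v, 2) for v in curve])
```

A curve that falls with frame index, as in this toy example, is what a loss of consistency looks like; as the post notes, some decline is unavoidable because newly revealed scene content cannot match the reference exactly.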