Tech, 6h ago

Vincent Sitzmann of MIT releases MilliVid, a long-context video generation method that maintains consistency without explicit 3D maps

The method outperformed existing baselines in Minecraft environment tests.



Sentiment

Users are excited about MilliVid's hierarchical latents enabling consistent long video generation, praising the compression and tokenization work and the team's thoughtful replies.

Positive: 100.0%

Negative: 0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 1.2K, Bookmarks: 3, Likes: 16

Vincent Sitzmann@vincesitzmann

The core idea is to use a hierarchical tokenizer that encodes frames into varying numbers of tokens for variable compression rates. We then build a coarse-to-fine diffusion model that only keeps the most recent frames at full token budget and uses compressed past frames! (2/n)
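The variable-compression idea in this post can be illustrated with a toy token-budget calculation. This is a minimal sketch, not the paper's schedule: the halving rule and every name and default here (`tokens_per_frame`, `context_budget`, `full_tokens`, `recent`, `frames_per_level`, `min_tokens`) are assumptions made up for illustration.

```python
def tokens_per_frame(age, full_tokens=256, recent=2, frames_per_level=4, min_tokens=4):
    """Tokens kept for a context frame that lies `age` frames in the past.

    Hypothetical schedule: the `recent` newest frames keep the full budget,
    and every further block of `frames_per_level` frames halves it.
    """
    if age < recent:
        return full_tokens
    level = 1 + (age - recent) // frames_per_level
    return max(min_tokens, full_tokens >> level)


def context_budget(num_frames, **schedule):
    """Total tokens spent on a context of `num_frames` past frames."""
    return sum(tokens_per_frame(age, **schedule) for age in range(num_frames))


if __name__ == "__main__":
    for n in (5, 25, 100):
        print(f"{n:3d} context frames -> {context_budget(n)} tokens")
    print(f"  5 full-resolution frames -> {5 * 256} tokens")
```

Under a geometric schedule like this, the cost of older frames shrinks quickly, so a long context stays within a small multiple of a short full-resolution one.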


Replies: 1

Vincent Sitzmann@vincesitzmann

Definitely an open question! I think what is interesting about this approach is that the decomposition is *not* frequency-based. It is learned. I.e., if you looked at the images that are auto-encoded at the coarsest level and compared them to the low-pass filtered version of equivalent compression ratio, you'd find that they look entirely different. Now, what information you encode depends on the objective of the hierarchical auto-encoder! We simply use least-squares, but there are definitely other things to explore here!
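One way to read "least-squares" for a tokenizer with varying token counts is a reconstruction loss averaged over several truncated token budgets. The sketch below assumes that reading; it is not the paper's training code, and `nested_least_squares_loss`, `decode`, and `budgets` are hypothetical names.

```python
def nested_least_squares_loss(x, tokens, decode, budgets):
    """Mean squared reconstruction error, averaged over token budgets.

    x       : flat list of target values (a toy stand-in for a frame)
    tokens  : ordered list of tokens, coarsest information first
    decode  : hypothetical callable mapping a token prefix to a reconstruction
    budgets : token counts at which the prefix must reconstruct x
    """
    total = 0.0
    for k in budgets:
        recon = decode(tokens[:k])
        total += sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    return total / len(budgets)


if __name__ == "__main__":
    x = [3.0, 0.0, 4.0, 0.0]
    tokens = list(x)  # toy identity "encoder"
    zero_pad = lambda prefix: prefix + [0.0] * (len(x) - len(prefix))
    print(nested_least_squares_loss(x, tokens, zero_pad, budgets=[2, 4]))
```

Because the loss only asks each prefix to reconstruct well, nothing forces the coarsest tokens to look like a low-pass filtered image, which is consistent with the point made in the reply.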


Vincent Sitzmann@vincesitzmann

Unlike existing variable-compression methods (e.g., FramePack), we not only variably compress the past, but also generate the future from coarse to fine. This is what rollout with our model looks like: we first denoise a large chunk of the most compressed latents, then superresolve them with a lossless recent context and progressively more compressed past frames. Every rollout step uses the same diffusion transformer weights! (3/n)
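The coarse-to-fine rollout described in this post can be sketched as a plan of steps. The post does not give chunk sizes or token counts, so everything numeric below is an assumption: this toy halves the number of frames and doubles the tokens per frame at each level, and `RolloutStep` and `coarse_to_fine_plan` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutStep:
    level: int             # 0 is the coarsest (most compressed) level
    tokens_per_frame: int  # tokens generated for each frame at this level
    num_frames: int        # how many future frames this step covers


def coarse_to_fine_plan(coarse_frames=16, coarse_tokens=16, num_levels=3):
    """Build a toy schedule: a large coarse chunk first, then finer passes.

    Assumption: each finer level covers half as many frames at twice the
    tokens per frame, so every step generates about the same token count.
    """
    plan = []
    frames, tokens = coarse_frames, coarse_tokens
    for level in range(num_levels):
        plan.append(RolloutStep(level, tokens, frames))
        frames = max(1, frames // 2)
        tokens *= 2
    return plan


if __name__ == "__main__":
    for step in coarse_to_fine_plan():
        print(step, "->", step.tokens_per_frame * step.num_frames, "tokens")
```

Keeping the per-step token count constant is a design choice of this sketch; it fits the remark that every rollout step uses the same diffusion transformer weights, but the post does not confirm it.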


Vincent Sitzmann@vincesitzmann

This paper was led by my students @DavidCharatan and @ishaanpreetam, in collaboration with @phillip_isola (also co-advising Ishaan!) and our friends from Toyota Research, @ZakharovSergeyN, @vitorguizilini and @basilevanh as part of the University 3.0 TRI collaboration :) (7/n)


Vincent Sitzmann@vincesitzmann

We are very excited about the results: we show that on Minecraft, our model can memorize 3D scene geometry for *hundreds* of frames, without retrieval or expert-crafted 3D map heuristics, at the same token budget as a conventional five-frame-context diffusion model! (4/n)


Vincent Sitzmann@vincesitzmann

See all the details, videos, and the paper under: https://davidcharatan.com/millivid/#

Code will be released in the next few days! arXiv is currently pending; for now, we're self-hosting the PDF: https://davidcharatan.com/millivid/millivid.pdf


Vincent Sitzmann@vincesitzmann

Quantitatively, this shows up as a dramatic improvement in PSNR and FVD over rollouts of hundreds of frames. Note that even a diffusion model with perfect memory can't achieve a flat consistency curve, as the model has to generate unseen scene content! (5/n)
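For readers unfamiliar with the metric, PSNR is computed from the mean squared error between a reference frame and a generated one: PSNR = 10 * log10(MAX^2 / MSE). The sketch below applies the standard formula to flat lists of pixel values; the function name and the `[0, 1]` value range are assumptions, not the paper's evaluation code.

```python
import math


def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are flat lists of pixel values in [0, max_value].
    """
    if len(reference) != len(generated):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    reference = [0.0, 0.5, 1.0, 0.25]
    generated = [0.1, 0.5, 0.9, 0.25]
    print(f"PSNR: {psnr(reference, generated):.2f} dB")
```

A consistency curve like the one mentioned in the post would plot this value per frame along a rollout against ground-truth frames.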


Vincent Sitzmann@vincesitzmann

The FVD curve further shows that our model significantly alleviates exposure bias, resulting in rollouts that are stable over hundreds of frames without any self-forcing, diffusion forcing, or other mitigation strategies! (6/n)


Nan Liu@nanliuuu

Great work! Curious how sensitive the method is to the frequency decomposition being clean. The coarse-to-fine approach works when low-frequency latents capture the consistency-relevant content, which Minecraft basically guarantees. On in-the-wild video, where high-frequency detail carries semantic info, do you expect the coarse levels to remain informative enough to drive long-range recall?


Vincent Sitzmann@vincesitzmann

Also, shoutout to some related / relevant work: Of course, FramePack by Lvmin Zhang! Then, inspiring work on flexible tokenization by folks such as @ShivamDuggal4, Roman Bachmann, @JRAllardice, David Mizrahi, @andrew_atanov, @_xwen_, @BingchenZhao, some of it in @zamir_ar's lab!


Shivam Duggal@ShivamDuggal4

@vincesitzmann very cool work team!! @DavidCharatan @ishaanpreetam @vincesitzmann. Compression and Tokenization ✨


Zhichao Yin@Zhichao_Y

@vincesitzmann Great work! I wonder if predicting many coarse future frames, which is hypothesized to be why it's better than FramePack, will limit the action rate at which you can interact with the model.


Nan Liu@nanliuuu

@vincesitzmann Ah, that's a great clarification! Thanks for the thoughtful reply! Again, great work!
