/Tech6h ago

MilliVid Uses Hierarchical Tokens For Consistent Long-Context Video Generation

779287.5K

Original post unavailable.

/Tech6h ago

MilliVid Uses Hierarchical Tokens For Consistent Long-Context Video Generation

779287.5K

Original post unavailable.

Sentiment

Users are excited about MilliVid's hierarchical tokens for consistent long-context video generation because the model memorizes 3D scene geometry for hundreds of frames in Minecraft without retrieval.

Pos

100.0%

Neg

0.0%

1 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS794BOOKMARKS2LIKES15

Vincent Sitzmann@vincesitzmann

Unlike existing variable-compression methods (e.g., FramePack), we not only variably compress the past, but also generate the future from coarse to fine. This is what rollout with our model looks like: we first denoise a large chunk of the most compressed latents, then superresolve them with a lossless recent context and progressively more compressed past frames. Every rollout step uses the same diffusion transformer weights! (3/n)

6h794152

REPLIES1

Vincent Sitzmann@vincesitzmann

This paper was led by my students @DavidCharatan and @ishaanpreetam, in collaboration with @phillip_isola (also co-advising Ishaan!) and our friends from Toyota Research, @ZakharovSergeyN, @vitorguizilini and @basilevanh as part of the University 3.0 TRI collaboration :) (7/n)

6h3956

Vincent Sitzmann@vincesitzmann

We are very excited about the results: we show that on Minecraft, our model can memorize 3D scene geometry for *hundreds* of frames, without retrieval or expert-crafted 3D map heuristics, at the same token budget as a conventional five-frame-context diffusion model! (4/n)

6h2796

Vincent Sitzmann@vincesitzmann

See all the details, videos, and the paper under: https://davidcharatan.com/millivid/#

Code will be released in the next few days! arXiv is currently pending, for now, we're self-hosting the pdf: https://davidcharatan.com/millivid/millivid.pdf

6h4084

Vincent Sitzmann@vincesitzmann

Quantitatively, this shows up as a dramatic improvement of PSNR and FVD over rollouts of hundreds of frames. Note that even a diffusion model with perfect memory can't achieve a flat consistency curve, as the model has to generate unseen scene content! (5/n)

6h2334

Vincent Sitzmann@vincesitzmann

The FVD curve further shows that our model significantly alleviates exposure bias, resulting in rollouts that are stable over hundreds of frames without any self-forcing, diffusion forcing, or other mitigation strategies! (6/n)

6h2324