
The core idea is to use a hierarchical tokenizer that encodes frames into varying numbers of tokens for variable compression rates. We then build a coarse-to-fine diffusion model that only keeps the most recent frames at full token budget and uses compressed past frames! (2/n)


