
(7/7)
Models and code are open-source!
Project: https://videogmae.org/ Code: https://github.com/tekotan/video-gmae Paper: https://arxiv.org/abs/2512.22489
Big thanks to my co-authors @Cinnabar233, @jathushan, @JitendraMalikCV, @berkeley_ai #CVPR2026

(6/7)
There are still clear limitations: static-camera pretraining, a 256-Gaussian budget, and difficulty with fine details under large motion. Fixing these might lead to video-SSL objectives with other emergent capabilities.

(3/7)
Video-GMAE makes the decoder represent a clip as Gaussians that persist over time.
Frame 1: predict 256 3D Gaussian primitives. Later frames: predict residual motion and color deltas for the same primitives. Render everything differentiably and train from raw video.
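
A minimal sketch of the "same primitives, per-frame deltas" idea, not the released code. The `Gaussian` class, `apply_deltas`, and the choice to accumulate each delta on top of the previous frame are illustrative assumptions; real primitives also carry scale, rotation, and opacity, and the renderer is omitted here.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    mean: tuple   # 3D center (x, y, z)
    color: tuple  # RGB


def apply_deltas(first_frame, deltas_per_frame):
    """Return one list of Gaussians per frame.

    Index i refers to the same primitive in every frame, so identity
    is preserved by construction. Assumption: each delta is applied on
    top of the previous frame's state.
    """
    frames = [list(first_frame)]
    for deltas in deltas_per_frame:
        prev = frames[-1]
        cur = []
        for g, (d_mean, d_color) in zip(prev, deltas):
            cur.append(Gaussian(
                mean=tuple(m + d for m, d in zip(g.mean, d_mean)),
                color=tuple(c + d for c, d in zip(g.color, d_color)),
            ))
        frames.append(cur)
    return frames
```

With 256 primitives in frame 1 and one delta list per later frame, `apply_deltas` would return a clip-length list where `frames[t][i]` is always primitive `i` at time `t`.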

(2/7)
Most video MAEs predict masked patch tokens. They can reconstruct pixels without really preserving object identity across frames.
Video-GMAE makes correspondence part of the pretraining problem itself.

(4/7)
Since each Gaussian keeps its identity, we can project its 3D motion into the image plane, splat displacements into a flow field, and follow the flow to track any query point.
No tracks, flow, masks, or boxes in pretraining.
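
A rough sketch of the project, splat, and follow steps, under simplifying assumptions that are mine rather than the paper's: a pinhole camera with a hypothetical `focal` parameter, an isotropic 2D kernel of width `sigma` standing in for the real rendered Gaussian footprints, and flow evaluated only at the query point instead of as a dense field.

```python
import math


def project(p, focal=1.0):
    """Pinhole projection of a 3D point onto the image plane (assumed camera model)."""
    x, y, z = p
    return (focal * x / z, focal * y / z)


def flow_at(query, means_t, means_t1, sigma=0.1):
    """Weighted average of projected Gaussian displacements at a 2D query point."""
    num_x = num_y = den = 0.0
    for p0, p1 in zip(means_t, means_t1):
        u0, v0 = project(p0)
        u1, v1 = project(p1)
        dist2 = (query[0] - u0) ** 2 + (query[1] - v0) ** 2
        w = math.exp(-dist2 / (2 * sigma ** 2))  # nearby Gaussians dominate
        num_x += w * (u1 - u0)
        num_y += w * (v1 - v0)
        den += w
    if den < 1e-12:  # no Gaussian close enough to the query: assume no motion
        return (0.0, 0.0)
    return (num_x / den, num_y / den)


def track(query, means_per_frame):
    """Follow the flow frame by frame, starting from a 2D query point."""
    path = [query]
    for a, b in zip(means_per_frame, means_per_frame[1:]):
        dx, dy = flow_at(path[-1], a, b)
        path.append((path[-1][0] + dx, path[-1][1] + dy))
    return path
```

Here `means_per_frame[t][i]` is the 3D center of primitive `i` at frame `t`; because the index is stable across frames, the displacement of each primitive is available without any matching step.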

(5/7)
Zero-shot, Video-GMAE is competitive with, and in most cases outperforms, the best self-supervised trackers.