
Video-GMAE Enables Zero-Shot Point Tracking From Raw Video

Tanish Baranwal @ CVPR (@TanishBaranwal)

Can a video model learn correspondence from raw video, without track labels?

Our CVPR Highlight introduces Video-GMAE, which represents a video as 3D Gaussian splats moving over time, and leads to zero-shot point tracking. Visit our poster 3:30-5:30 on Sunday!

More in thread 🧵

4:40 PM · Jun 5, 2026 · 5.1K Views

(7/7)

Models and code are open-source!

Project: https://videogmae.org/
Code: https://github.com/tekotan/video-gmae
Paper: https://arxiv.org/abs/2512.22489

Big thanks to my co-authors @Cinnabar233, @jathushan, @JitendraMalikCV, @berkeley_ai #CVPR2026

(6/7)

There are still clear limitations: static-camera pretraining, a 256-Gaussian budget, and difficulty with fine details under large motion. Fixing these might lead to video-SSL objectives with other emergent capabilities.

(3/7)

Video-GMAE makes the decoder represent a clip as Gaussians that persist over time.

Frame 1: predict 256 3D Gaussian primitives. Later frames: predict residual motion and color deltas for the same primitives. Render everything differentiably and train from raw video.
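
As a rough illustration of the "persistent primitives plus per-frame deltas" idea, here is a minimal sketch in plain Python. It is not the released implementation. The `Gaussian` class, `apply_deltas`, and `unroll` are hypothetical names, and the sketch assumes each frame's deltas are added to the previous frame's state (the thread does not say whether deltas are relative to the previous frame or to frame 1). Scale, rotation, opacity, and the differentiable renderer are left out.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    mean: tuple   # 3D center (x, y, z)
    color: tuple  # RGB color (r, g, b)


def apply_deltas(gaussians, mean_deltas, color_deltas):
    """Return the next frame's Gaussians.

    Index i refers to the same primitive in every frame, which is what
    gives each Gaussian a persistent identity over time.
    """
    return [
        Gaussian(
            mean=tuple(m + d for m, d in zip(g.mean, dm)),
            color=tuple(c + d for c, d in zip(g.color, dc)),
        )
        for g, dm, dc in zip(gaussians, mean_deltas, color_deltas)
    ]


def unroll(frame1, per_frame_deltas):
    """Build the Gaussian set for every frame of a clip.

    frame1: list of Gaussians predicted for the first frame.
    per_frame_deltas: one (mean_deltas, color_deltas) pair per later frame.
    ASSUMPTION: deltas accumulate on top of the previous frame.
    """
    frames = [frame1]
    for mean_deltas, color_deltas in per_frame_deltas:
        frames.append(apply_deltas(frames[-1], mean_deltas, color_deltas))
    return frames
```

With 256 primitives in `frame1`, `unroll(frame1, deltas)[t][i]` would be primitive `i` at frame `t`. In training, each frame's set would be rendered and compared against the raw video frame.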

(2/7)

Most video MAEs predict masked patch tokens. They can reconstruct pixels without really preserving object identity across frames.

Video-GMAE makes correspondence part of the pretraining problem itself.

(4/7)

Since each Gaussian keeps its identity, we can project its 3D motion into the image plane, splat displacements into a flow field, and follow the flow to track any query point.

No tracks, flow, masks, or boxes in pretraining.
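
The project, splat, and follow steps can be sketched roughly as below. This is an illustrative approximation, not the paper's code. It assumes a pinhole camera with a hypothetical `focal` parameter, and it replaces true Gaussian splatting (projected covariances and opacities) with isotropic weights of a fixed, made-up width `sigma`. The flow is evaluated only at the query point rather than rendered as a dense field.

```python
import math


def project(point, focal=1.0):
    """Pinhole projection of a 3D point (x, y, z) onto the image plane."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def flow_at(query, means_t, means_t1, sigma=0.5):
    """Flow at one image location between two consecutive frames.

    Each Gaussian contributes its projected 2D displacement, weighted by
    how close its projected center is to the query point.
    ASSUMPTION: isotropic weights with fixed sigma stand in for splatting.
    """
    num_x = num_y = total = 0.0
    for p_t, p_t1 in zip(means_t, means_t1):
        u0, v0 = project(p_t)
        u1, v1 = project(p_t1)
        dist2 = (query[0] - u0) ** 2 + (query[1] - v0) ** 2
        w = math.exp(-dist2 / (2.0 * sigma ** 2))
        num_x += w * (u1 - u0)
        num_y += w * (v1 - v0)
        total += w
    if total < 1e-12:
        return (0.0, 0.0)  # no Gaussian nearby: treat the point as static
    return (num_x / total, num_y / total)


def track(query, means_per_frame, sigma=0.5):
    """Follow the flow frame by frame, starting from a query point."""
    path = [query]
    for means_t, means_t1 in zip(means_per_frame, means_per_frame[1:]):
        dx, dy = flow_at(path[-1], means_t, means_t1, sigma)
        path.append((path[-1][0] + dx, path[-1][1] + dy))
    return path
```

Here `means_per_frame[t][i]` is the 3D center of Gaussian `i` at frame `t`, so the displacement of each primitive is well defined only because index `i` means the same primitive in every frame.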

(5/7)

Zero-shot Video-GMAE is competitive with, and in most cases outperforms, the best self-supervised trackers.
