Video-GMAE Enables Zero-Shot Point Tracking From Raw Video

Original post

Kosta Derpanis (sabbatical in Zurich)

Tanish Baranwal @ CVPR (@TanishBaranwal)

Can a video model learn correspondence from raw video, without track labels?

Our CVPR Highlight introduces Video-GMAE, which represents a video as 3D Gaussian splats moving over time, and leads to zero-shot point tracking. Visit our poster 3:30-5:30 on Sunday!

More in thread 🧵

4:40 PM · Jun 5, 2026 · 5K Views

Posts from X

Most Activity

Tanish Baranwal @ CVPR (@TanishBaranwal)

(7/7)

Models and code are open-source!

Project: https://videogmae.org/ Code: https://github.com/tekotan/video-gmae Paper: https://arxiv.org/abs/2512.22489

Big thanks to my co-authors @Cinnabar233, @jathushan, @JitendraMalikCV, @berkeley_ai #CVPR2026

Tanish Baranwal @ CVPR (@TanishBaranwal)

(6/7)

There are still clear limitations: static-camera pretraining, a 256-Gaussian budget, and difficulty with fine details under large motion. Fixing these might lead to video-SSL objectives with other emergent capabilities.

Tanish Baranwal @ CVPR (@TanishBaranwal)

(3/7)

Video-GMAE makes the decoder represent a clip as Gaussians that persist over time.

Frame 1: predict 256 3D Gaussian primitives. Later frames: predict residual motion and color deltas for the same primitives. Render everything differentiably and train from raw video.
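As a rough illustration of the representation described in (3/7), the sketch below stores one set of primitives for frame 1 and derives every later frame by adding per-primitive deltas. It is not the authors' code: the `Gaussian` class, the `unroll` helper, and the choice to apply each frame's delta to the frame-1 values (rather than to the previous frame) are assumptions made for this example, and scale, rotation, opacity, and the differentiable renderer are left out.

```python
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass(frozen=True)
class Gaussian:
    """Hypothetical minimal primitive: 3D center and RGB color only."""
    mean: Vec3
    color: Vec3


def add(a: Vec3, b: Vec3) -> Vec3:
    """Element-wise sum of two 3-vectors."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def unroll(base: list[Gaussian],
           deltas: list[list[tuple[Vec3, Vec3]]]) -> list[list[Gaussian]]:
    """Build per-frame Gaussians from frame-1 primitives plus deltas.

    deltas[t][i] is (motion_delta, color_delta) for primitive i in later
    frame t. Assumption: deltas are relative to frame 1. Index i refers to
    the same primitive in every frame, which is what keeps identity.
    """
    frames = [list(base)]
    for frame_delta in deltas:
        frames.append([
            Gaussian(add(g.mean, dm), add(g.color, dc))
            for g, (dm, dc) in zip(base, frame_delta)
        ])
    return frames
```

Because primitive `i` in frame 1 is still primitive `i` in every later frame, correspondence falls out of the indexing rather than being estimated afterwards.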

Tanish Baranwal @ CVPR (@TanishBaranwal)

(2/7)

Most video MAEs predict masked patch tokens. They can reconstruct pixels without really preserving object identity across frames.

Video-GMAE makes correspondence part of the pretraining problem itself.

Tanish Baranwal @ CVPR (@TanishBaranwal)

(4/7)

Since each Gaussian keeps its identity, we can project its 3D motion into the image plane, splat displacements into a flow field, and follow the flow to track any query point.

No tracks, flow, masks, or boxes in pretraining.
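The tracking recipe in (4/7) can be sketched in a few lines, under assumptions the thread does not spell out: a pinhole camera with a hypothetical focal length `focal`, an isotropic 2D falloff with a hypothetical width `sigma` standing in for the real splatting weights, and a normalized weighted average of displacements as the flow value. The function names are illustrative, not the project's API.

```python
import math

Vec3 = tuple[float, float, float]
Vec2 = tuple[float, float]


def project(p: Vec3, focal: float = 1.0) -> Vec2:
    """Pinhole projection of a 3D point (assumes z > 0)."""
    x, y, z = p
    return (focal * x / z, focal * y / z)


def flow_at(query: Vec2, means_t: list[Vec3], means_t1: list[Vec3],
            sigma: float = 0.5, focal: float = 1.0) -> Vec2:
    """Flow at one image point: weighted mean of projected displacements.

    means_t[i] and means_t1[i] are the same primitive in consecutive
    frames. Weights fall off with 2D distance from the query point.
    """
    wsum = fx = fy = 0.0
    for p0, p1 in zip(means_t, means_t1):
        u0, v0 = project(p0, focal)
        u1, v1 = project(p1, focal)
        d2 = (query[0] - u0) ** 2 + (query[1] - v0) ** 2
        w = math.exp(-d2 / (2.0 * sigma ** 2))
        wsum += w
        fx += w * (u1 - u0)
        fy += w * (v1 - v0)
    if wsum == 0.0:  # no primitive close enough to contribute
        return (0.0, 0.0)
    return (fx / wsum, fy / wsum)


def track(query: Vec2, means_per_frame: list[list[Vec3]],
          sigma: float = 0.5, focal: float = 1.0) -> list[Vec2]:
    """Follow the flow from frame to frame, starting at the query point."""
    path = [query]
    for a, b in zip(means_per_frame, means_per_frame[1:]):
        dx, dy = flow_at(path[-1], a, b, sigma, focal)
        path.append((path[-1][0] + dx, path[-1][1] + dy))
    return path
```

Nothing here needs track, flow, mask, or box labels: the only inputs are the per-frame positions of primitives that keep their index over time.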

Tanish Baranwal @ CVPR (@TanishBaranwal)

(5/7)

Zero-shot Video-GMAE is competitive with, and in most cases outperforms, the best self-supervised trackers.
