Is cross-attention the bottleneck? Self-attention FTW. Frame-wise attention as a replacement for the index embedding: kind of a new perspective to me.
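To make the frame-wise idea concrete for myself, here is a minimal sketch of what restricting self-attention to tokens of the same frame looks like as a mask. This is my own illustration, not code from the talk: the function name `attention_mask` and its parameters are hypothetical, and it assumes every frame contributes the same number of tokens, laid out frame after frame.

```python
def attention_mask(num_frames, tokens_per_frame, frame_wise):
    """Return an N x N boolean mask; True means query q may attend to key k.

    frame_wise=True  -> tokens only attend within their own frame
    frame_wise=False -> global self-attention over all tokens of all frames
    """
    n = num_frames * tokens_per_frame
    # Frame id of each token, assuming tokens are laid out frame after frame.
    frame_of = [t // tokens_per_frame for t in range(n)]
    return [
        [(not frame_wise) or frame_of[q] == frame_of[k] for k in range(n)]
        for q in range(n)
    ]


def allowed_pairs(mask):
    """Count the query/key pairs the mask lets through."""
    return sum(sum(row) for row in mask)


frame_mask = attention_mask(num_frames=2, tokens_per_frame=2, frame_wise=True)
global_mask = attention_mask(num_frames=2, tokens_per_frame=2, frame_wise=False)
```

The frame-wise mask is block-diagonal, so the grouping of tokens into frames is carried by the attention pattern itself rather than by a per-frame index embedding added to the tokens.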
Pairwise is the bottleneck for deep learning.
VGGT + COLMAP + SP + LG (SuperPoint + LightGlue) -> reconstruction from internet videos. Then on to teacher-student training (18M videos).
"How we did scaling": 1. Registers, a.k.a. scene tokens. 2. Remove all dense heads except one. 3. Replace the last DPT layer with MLP + PixelShuffle. Result: 70% less training memory, so now the model can be scaled.
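For reference, a minimal pure-Python sketch of the pixel-shuffle rearrangement that step 3 relies on. It is not the authors' code: the function name is mine, and the channel layout follows the convention used by `torch.nn.PixelShuffle` (input `C*r*r x H x W`, output `C x H*r x W*r`), which I am assuming is what the head uses. The MLP would produce the `C*r*r` channels per low-resolution location; the shuffle only moves values around, so it has no parameters of its own.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Follows the torch.nn.PixelShuffle layout:
    out[c][h*r + i][w*r + j] = x[c*r*r + i*r + j][h][w]
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r output block
            for j in range(r):      # column offset inside each r x r output block
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 channels become one 2x2 channel.
upsampled = pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2)
```

Here `upsampled` is `[[[0, 1], [2, 3]]]`: the channel dimension is traded for spatial resolution.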
2M sequences to train VGGT-Omega! "Scaling is not easy"
VGGT has been useful in many areas.