
DINOv3 Scaling Closes Gap With Text-Aligned Vision Encoders In VLMs

Oriane Siméoni @CVPR (@oriane_simeoni)

@TimDarcet @iScienceLuvr Here is the summary slide presented yesterday. Using Molmo setup, encoder frozen and FT Qwen2-7B

TL;DR:
* scaling DINOv2 ViT-g to 1M iter. on DINOv3 1.7B images --> boost
* scaling to 7B (+gram&HRFT) --> close gap w/ text-aligned encoders
* distilled ViT-H+ as good as 7B

1:27 PM · Jun 5, 2026 · 3.6K Views

DINOv3, especially when scaled up, closes the gap with text-aligned (CLIP-style) vision encoders when used as the vision encoder in VLMs.


@iScienceLuvr hmm so scaling DINO actually competes with CLIP now? feels like vision-only training isnt dead yet
