DINOv3 Scaling Closes Gap With Text-Aligned Vision Encoders In VLMs

Original post

Tanishq Mathew Abraham, Ph.D.

Oriane Siméoni @CVPR (@oriane_simeoni)

@TimDarcet @iScienceLuvr Here is the summary slide presented yesterday. Using Molmo setup, encoder frozen and FT Qwen2-7B

TL;DR:
* scaling DINOv2 ViT-g to 1M iter. on DINOv3 1.7B images --> boost
* scaling to 7B (+gram&HRFT) --> close gap w/ text-aligned encoders
* distilled ViT-H+ as good as 7B

1:27 PM · Jun 5, 2026 · 3.6K Views

Posts from X

Tanishq Mathew Abraham, Ph.D. (@iScienceLuvr)

Using DINOv3, especially scaled up, closes gap with text-aligned vision encoders (CLIP-style) when applied to VLMs

tsunami_crypto (@ls_brd)

@iScienceLuvr hmm so scaling DINO actually competes with CLIP now? feels like vision-only training isnt dead yet
