@TimDarcet @iScienceLuvr Here is the summary slide presented yesterday. Using the Molmo setup, with the encoder frozen and Qwen2-7B fine-tuned.
TL;DR:
* scaling DINOv2 ViT-g to 1M iter. on DINOv3 1.7B images --> boost
* scaling to 7B (+ Gram & HRFT) --> close gap w/ text-aligned encoders
* distilled ViT-H+ as good as 7B
