This is one of the papers I've been most excited about over the past few weeks. It's a very simple but practical modification to the DINOv3 training framework.
Let me explain how it works.
Tanishq Mathew Abraham highlighted a new approach called FINO that tweaks the DINO self-supervised framework so vision models like DINOv2 and DINOv3 can adapt to specialized domains by leaning on whatever metadata already sits in the dataset.
FINO folds discrete tags such as plate or country and continuous signals such as timestamp into the training loop, guiding the model to keep useful factors while ignoring batch effects.
Tests on fluorescence microscopy, satellite imagery, camera-trap photos, and chest X-rays show FINO matching or beating both unsupervised baselines and fully supervised fine-tuning without using task labels for the backbone.
Users praise the FINO method for adapting DINO vision models with existing metadata because it sensibly bypasses the annotation bottleneck in medical imaging.
What's funny is this paper was released the day I tweeted this. I hope to continue to see more innovation in the SSL space!
I am genuinely frustrated by how poorly self-supervised learning for vision is researched and how underappreciated it is.
Like how has DINOv2 been basically the best model for the past 3 years lol
Consider a satellite image. If you have information about the location of the image, the time of day, etc. would you throw away that information or should you incorporate it into the training somewhere?
Or what about an X-ray scan? If you have information about patient sex, scanner used, etc. perhaps these are things you want to teach the model to be invariant to.
The current DINOv2/3 framework would throw away all of this metadata and treat every image exactly the same.
So in a nutshell, pretrained self-supervised models like DINOv2/3 are often poorly suited for direct application to scientific domains (e.g., microscopy, medical imaging, satellite imaging). The representations should be adapted to the domain.
You can do continual pretraining of DINOv2/3 on your own dataset. However, this often throws away a lot of relevant context!
That's where the FINO framework comes in. The main innovation here is the incorporation of guidance with metadata you already have to better adapt the learned representations.
The authors utilize two types of metadata to guide training: informative factors that should shape the representation (an antibody label in microscopy, geography in satellite imagery) are encouraged, while spurious factors that just reflect how the data was collected (the imaging plate, the sensor resolution) are actively suppressed via gradient reversal.
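To make the gradient reversal idea concrete, here is a minimal pure-Python sketch on a toy scalar model. It is not the paper's code: the names `grl_step`, `lam`, and the one-weight "encoder" and "head" are assumptions made purely for illustration. The point is that the head predicting the spurious factor is trained normally, while the gradient reaching the encoder is flipped, so the encoder is pushed to make that factor harder to predict.

```python
def grad_reverse_backward(upstream_grad, lam):
    """Backward pass of a gradient reversal layer: flip and scale the gradient.

    The forward pass of the layer is the identity, so it is not shown.
    """
    return -lam * upstream_grad


def grl_step(w, v, x, s, lr=0.1, lam=1.0):
    """One manual SGD step on a toy model (hypothetical, for illustration).

    encoder: z = w * x          (w is the "backbone")
    head:    s_hat = v * z      (v predicts the spurious factor s)
    loss:    0.5 * (s_hat - s)^2
    """
    z = w * x                 # encoder forward
    s_hat = v * z             # head forward (reversal layer is identity here)
    err = s_hat - s
    loss = 0.5 * err ** 2

    grad_v = err * z          # the head descends the loss as usual
    grad_z = err * v          # gradient w.r.t. the head's input
    grad_z = grad_reverse_backward(grad_z, lam)  # flipped before the encoder
    grad_w = grad_z * x       # the encoder therefore ascends the loss

    return w - lr * grad_w, v - lr * grad_v, loss


if __name__ == "__main__":
    w, v, loss = grl_step(w=1.0, v=1.0, x=1.0, s=0.0)
    print(w, v, loss)
```

With `lam=0.0` the encoder receives no gradient from the head at all, and larger `lam` values suppress the factor more aggressively. In a real setup this would be an autograd custom function rather than hand-written gradients.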
FINO was tested across four different domains: protein-localization microscopy (HPA), Earth observation (FMoW), wildlife camera traps (iWildCam), and chest X-rays (MIMIC-CXR).
In all cases, FINO beats both unsupervised domain adaptation and fully supervised fine-tuning, and even heavily engineered, domain-specific SOTA models. Just fine-tuning DINOv3, or even continually pretraining DINOv3 on the target dataset, doesn't reliably give performance gains, but FINO does!
To handle discrete metadata with high cardinality (lots of categories), FINO uses a momentum-updated prototype bank for discrete factors. The loss used is a contrastive loss, inspired by supervised contrastive learning.
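Here is a rough sketch of what a momentum-updated prototype bank with a contrastive-style loss could look like. The thread doesn't give the exact formulation, so everything here is an assumption: the `PrototypeBank` class, the EMA update rule, the temperature value, and the softmax-over-prototypes loss are illustrative stand-ins, not the paper's implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class PrototypeBank:
    """One unit-norm prototype per metadata category (hypothetical sketch)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.protos = {}  # category label -> prototype vector

    def update(self, label, embedding):
        """Exponential moving average update of the prototype for `label`."""
        emb = normalize(embedding)
        if label not in self.protos:
            self.protos[label] = emb
            return
        m = self.momentum
        mixed = [m * p + (1 - m) * e for p, e in zip(self.protos[label], emb)]
        self.protos[label] = normalize(mixed)

    def loss(self, label, embedding, temperature=0.1):
        """Cross-entropy over cosine similarities to all stored prototypes.

        Pulls the embedding toward its own category's prototype and away
        from every other prototype. `label` must already be in the bank.
        """
        emb = normalize(embedding)
        logits = {
            key: sum(a * b for a, b in zip(emb, proto)) / temperature
            for key, proto in self.protos.items()
        }
        top = max(logits.values())
        log_z = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        return log_z - logits[label]


if __name__ == "__main__":
    bank = PrototypeBank()
    bank.update("country_A", [1.0, 0.0])
    bank.update("country_B", [0.0, 1.0])
    print(bank.loss("country_A", [0.9, 0.1]))
```

The appeal of this kind of design for high-cardinality metadata is that memory grows with the number of categories rather than the number of images, and a sample can be contrasted against categories that are absent from the current batch.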
For continuous metadata, the loss just regresses a small predictor head against the metadata target.
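As a toy illustration of the continuous case, the sketch below fits a small linear head to a scalar metadata target with a mean-squared-error loss. The head architecture, the learning rate, and the idea of scaling the target (say, a timestamp) to the range 0 to 1 are assumptions for the example. Only the head is updated here; in actual training the gradient would also flow into the backbone.

```python
def predict(weights, bias, embedding):
    """Linear predictor head: dot(weights, embedding) + bias."""
    return sum(w * x for w, x in zip(weights, embedding)) + bias


def mse_step(weights, bias, batch, lr=0.1):
    """One gradient-descent step on mean squared error.

    `batch` is a list of (embedding, target) pairs. Returns the updated
    weights and bias plus the loss measured before the update.
    """
    n = len(batch)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    loss = 0.0
    for embedding, target in batch:
        err = predict(weights, bias, embedding) - target
        loss += err * err / n
        for i, x in enumerate(embedding):
            grad_w[i] += 2.0 * err * x / n
        grad_b += 2.0 * err / n
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - lr * grad_b, loss


if __name__ == "__main__":
    # Hypothetical embeddings paired with timestamps scaled to [0, 1].
    batch = [([1.0, 0.0], 0.2), ([0.0, 1.0], 0.8)]
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(200):
        weights, bias, loss = mse_step(weights, bias, batch)
    print(loss)
```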
Of course, no metadata is needed at inference; it is only used to guide learning.
What's interesting to note is how different metadata features can affect training in different ways.
Let's take high-throughput cell microscopy as an example. In this domain, "plates" are physical sample batches that introduce technical noise and environmental artifacts, which models can exploit instead of learning true biological features, leading to poor generalization. You'd expect plate to be an obvious factor to suppress. However, it turns out that suppressing it is harmful!
Since each plate holds a single cell type and a single set of fluorescently labeled biological structures, suppressing plate features also suppresses learning of cell lines and biological structures! So when factors are entangled, it's hard to cleanly decide whether a metadata label should be encouraged, suppressed, or left alone.
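One way to spot this kind of entanglement before choosing what to suppress is to measure how well one metadata field predicts another. The sketch below is not from the paper: the `purity` function, the field names, and the sample records are all hypothetical. A score of 1.0 means the candidate nuisance factor fully determines the factor you want to keep, so suppressing one will also push out the other.

```python
from collections import Counter, defaultdict


def purity(records, factor, target):
    """Fraction of samples matching the majority `target` value of their `factor` group.

    1.0 means `factor` fully determines `target`. Lower values mean
    the two fields are less entangled.
    """
    groups = defaultdict(Counter)
    for record in records:
        groups[record[factor]][record[target]] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    majority = sum(max(counts.values()) for counts in groups.values())
    return majority / total


if __name__ == "__main__":
    # Hypothetical metadata: every plate contains exactly one cell line.
    records = [
        {"plate": "P1", "cell_line": "line_A"},
        {"plate": "P1", "cell_line": "line_A"},
        {"plate": "P2", "cell_line": "line_B"},
        {"plate": "P2", "cell_line": "line_B"},
    ]
    print(purity(records, "plate", "cell_line"))
```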

Another great aspect of the paper is the variety of practical DINOv3 domain adaptation tricks it mentions, for example the use of SIGReg or a two-stage training pipeline.
@iScienceLuvr Great thread, thank you!
@iScienceLuvr simple mods to DINOv3 are underrated, curious if it touches the register tokens

@iScienceLuvr sounds interesting, can't wait to see the details!

@iScienceLuvr link in case someone wants to read it for themself...tho grok will summarize too: https://www.alphaxiv.org/overview/2606.05107v1

@iScienceLuvr Makes absolute sense - metadata is already there, why not leverage it. The annotation bottleneck in medical imaging is real. Curious about compute overhead?