This is one of the papers I've been most excited about over the past few weeks. It's a very simple but practical modification to the DINOv3 training framework.
Let me explain how it works.
That's where the FINO framework comes in. The main innovation is using metadata you already have to guide how the learned representations adapt to your domain.
The authors split metadata into two types. Informative factors that should shape the representation (an antibody label in microscopy, geography in satellite imagery) are encouraged. Spurious factors that just reflect how the data was collected (the imaging plate, the sensor resolution) are actively suppressed via gradient reversal.
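To make the gradient reversal idea concrete, here's a toy sketch with hand-written backprop on a one-parameter encoder. This is not the paper's code: `GradReverse`, `lam`, and `spurious_step` are names I made up, and the squared-error head is just an assumption to keep the math readable. The point is only that the head learns to predict the spurious factor as usual, while the encoder gets the same gradient with its sign flipped.

```python
# Toy gradient reversal with manual backprop (pure Python, no autograd).
# All names here are hypothetical and only for illustration.

class GradReverse:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, z):
        return list(z)

    def backward(self, grad_out):
        return [-self.lam * g for g in grad_out]


def spurious_step(w, v, x, y, grl, lr=0.1):
    """One SGD step for encoder weight w and spurious-factor head weight v.

    Encoder: z = w * x. Head: pred = v * z. Loss: (pred - y)^2.
    The head descends the loss normally. The encoder receives the
    reversed gradient, so it moves to make the factor harder to predict.
    """
    z = w * x
    (z_rev,) = grl.forward([z])
    pred = v * z_rev
    loss = (pred - y) ** 2

    dloss_dpred = 2.0 * (pred - y)
    grad_v = dloss_dpred * z_rev            # normal gradient for the head
    grad_z_rev = dloss_dpred * v            # gradient arriving at the reversal layer
    (grad_z,) = grl.backward([grad_z_rev])  # sign is flipped here
    grad_w = grad_z * x                     # what the encoder actually sees

    return w - lr * grad_w, v - lr * grad_v, loss, grad_w


# With lam=1.0 the encoder gradient is +4.0; without the reversal it would be -4.0.
new_w, new_v, loss, grad_w = spurious_step(1.0, 0.5, 2.0, 3.0, GradReverse(lam=1.0))
```

In a real setup you'd get the same effect from a custom autograd function sitting between the backbone and the spurious-factor head, with `lam` controlling how hard the factor is suppressed.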
Consider a satellite image. If you know where the image was taken, the time of day, and so on, would you throw that information away, or would you incorporate it into training somewhere?
Or what about an X-ray scan? If you know the patient's sex, the scanner used, and so on, perhaps these are things you want the model to be invariant to.
The current DINOv2/3 framework throws away all of this metadata and treats every image exactly the same.
So in a nutshell, pretrained self-supervised models like DINOv2/3 are often poorly suited for direct application to scientific domains (e.g. microscopy, medical imaging, satellite imaging). The representations should be adapted to the domain.
You can do continual pretraining of DINOv2/3 on your own dataset. However, this often throws away a lot of relevant context!
FINO was tested across four different domains: protein-localization microscopy (HPA), Earth observation (FMoW), wildlife camera traps (iWildCam), and chest X-rays (MIMIC-CXR).
In all cases, FINO beats both unsupervised domain adaptation and fully supervised fine-tuning, and even heavily engineered, domain-specific SOTA models. Just fine-tuning DINOv3, or even continually pretraining DINOv3 on the target dataset, doesn't reliably improve performance, but FINO does!
To handle discrete metadata with high cardinality (lots of categories), FINO uses a momentum-updated prototype bank. The loss is a contrastive loss inspired by supervised contrastive learning.
For continuous metadata, a small predictor head is simply regressed against the metadata target.
Of course, no metadata is needed at inference. It is only used to guide learning.
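Here's roughly how I picture the two kinds of metadata loss, as a pure-Python sketch. Big caveat: the thread only says "momentum-updated prototype bank", "contrastive loss", and "small predictor head", so the EMA update rule, the softmax-over-prototypes form of the loss, the temperature, the linear head, and every name below (`PrototypeBank`, `regression_loss`, etc.) are my assumptions, not the paper's exact formulation.

```python
import math


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class PrototypeBank:
    """One unit-norm prototype per metadata category, updated with momentum."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.protos = {}  # category -> prototype vector

    def update(self, label, emb):
        emb = normalize(emb)
        if label not in self.protos:
            self.protos[label] = emb  # first sighting: copy the embedding
            return
        m = self.momentum
        old = self.protos[label]
        # Exponential moving average, then re-normalize.
        self.protos[label] = normalize([m * o + (1 - m) * e for o, e in zip(old, emb)])

    def contrastive_loss(self, label, emb, temperature=0.1):
        """Cross-entropy over similarities to all prototypes.

        Pulls the embedding toward its own category's prototype and away
        from the others. The label must already be in the bank.
        """
        emb = normalize(emb)
        logits = {k: dot(emb, p) / temperature for k, p in self.protos.items()}
        mx = max(logits.values())
        log_denom = mx + math.log(sum(math.exp(l - mx) for l in logits.values()))
        return log_denom - logits[label]


def regression_loss(emb, weights, bias, target):
    """Squared error of a linear predictor head against a continuous target."""
    pred = dot(weights, emb) + bias
    return (pred - target) ** 2


bank = PrototypeBank(momentum=0.9)
bank.update("antibody_a", [1.0, 0.0])
bank.update("antibody_b", [0.0, 1.0])
loss_match = bank.contrastive_loss("antibody_a", [1.0, 0.0])     # small
loss_mismatch = bank.contrastive_loss("antibody_b", [1.0, 0.0])  # large
```

The nice property of a bank like this is that the number of categories only affects the size of the dictionary, not the batch composition, which is presumably why it helps with high-cardinality metadata.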
What's interesting to note is how different metadata features can affect training in different ways.
Let's take high-throughput cell microscopy as an example. In this domain, "plates" are physical sample batches that introduce technical noise and environmental artifacts. Models can often exploit these artifacts instead of learning true biological features, which leads to poor generalization. However, it turns out that suppressing this factor is harmful!
The reason is that each plate contains a single cell type and a single set of fluorescently labeled biological structures, so suppressing plate features also suppresses learning of cell lines and biological structures! When factors are entangled like this, it's hard to cleanly decide whether a metadata label should be encouraged, suppressed, or left alone.
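One cheap sanity check before you suppress a factor is to measure how strongly it determines a factor you care about. The sketch below is my own illustration, not something from the paper: `group_purity` and the toy labels are hypothetical. It computes the fraction of samples whose informative label is the majority label within their group of the candidate spurious factor.

```python
from collections import Counter, defaultdict


def group_purity(spurious, informative):
    """Fraction of samples carrying the majority informative label of their group.

    A value of 1.0 means the spurious factor fully determines the informative
    one, so suppressing it risks erasing the informative signal as well.
    """
    groups = defaultdict(Counter)
    for s, i in zip(spurious, informative):
        groups[s][i] += 1
    majority = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return majority / len(spurious)


# Each plate holds one cell line: fully entangled.
entangled = group_purity(["p1", "p1", "p2", "p2"], ["A", "A", "B", "B"])  # 1.0

# Each plate holds both cell lines: plate tells you nothing about cell line.
mixed = group_purity(["p1", "p1", "p2", "p2"], ["A", "B", "A", "B"])  # 0.5
```

Keep in mind that purity is also high whenever one informative label dominates the whole dataset, so compare it against that baseline before reading too much into it.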

Another great aspect of the paper is the variety of practical DINOv3 domain adaptation tricks it mentions, such as SIGReg and a two-stage training pipeline.
What's funny is that this paper was released the same day I tweeted this. I hope to keep seeing more innovation in the SSL space!
I am genuinely frustrated by how poorly self-supervised learning for vision is researched and how underappreciated it is.
Like how has DINOv2 been basically the best model for the past 3 years lol