FINO Adapts DINOv2/3 Vision Models Using Metadata Without Labels

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

This is one of the papers I've been quite excited about over the past few weeks. It's a very simple but practical modification to the DINOv3 training framework.

Let me explain how it works.

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

That's where the FINO framework comes in. The main innovation is using metadata you already have to guide training, so the learned representations adapt better to the domain.

The authors utilize two types of metadata to guide training: informative factors that should shape the representation (an antibody label in microscopy, geography in satellite imagery) are encouraged, while spurious factors that just reflect how the data was collected (the imaging plate, the sensor resolution) are actively suppressed via gradient reversal.
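
For anyone who hasn't seen gradient reversal before, here is a minimal sketch on a toy scalar model. This is not the authors' code: the names `grl_forward`, `grl_backward`, `spurious_grads`, and the strength `lam` are hypothetical, and a squared-error head stands in for whatever loss the paper actually uses. The layer is the identity on the forward pass and flips the sign of the gradient on the backward pass, so the head still learns to predict the spurious factor while the encoder is pushed to make that prediction harder.

```python
from dataclasses import dataclass


@dataclass
class Grads:
    head: float     # gradient for the spurious-factor head weight v
    encoder: float  # gradient for the encoder weight w, after reversal


def grl_forward(z: float) -> float:
    """Gradient reversal is the identity on the forward pass."""
    return z


def grl_backward(grad_out: float, lam: float) -> float:
    """On the backward pass, flip the sign and scale by lam."""
    return -lam * grad_out


def spurious_grads(x: float, s: float, w: float, v: float, lam: float) -> Grads:
    """Manual gradients for L = 0.5 * (v * GRL(w * x) - s)^2 (toy, assumed setup)."""
    z = w * x                 # encoder feature
    z_rev = grl_forward(z)    # unchanged going forward
    pred = v * z_rev          # head's prediction of the spurious factor s
    err = pred - s            # dL/dpred
    grad_v = err * z_rev      # head gets the ordinary gradient
    grad_z = err * v          # gradient arriving at the reversal layer
    grad_w = grl_backward(grad_z, lam) * x  # encoder gets the reversed gradient
    return Grads(head=grad_v, encoder=grad_w)
```

With `x=2.0, s=1.0, w=0.5, v=3.0, lam=0.5`, the head gradient is `2.0` and the encoder gradient is `-6.0`; without the reversal the encoder gradient would be `+12.0`. A descent step on the encoder therefore increases the head's loss, which is the point.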

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

Consider a satellite image. If you have information about the location of the image, the time of day, etc., would you throw that information away, or should you incorporate it into the training somewhere?

Or what about an X-ray scan? If you have information about patient sex, scanner used, etc., perhaps these are things you want to teach the model to be invariant to.

The current DINOv2/3 framework would throw away all of this metadata and treat every image exactly the same.

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

So in a nutshell, pretrained self-supervised models like DINOv2/3 are often poorly suited for direct application to scientific domains (e.g., microscopy, medical imaging, satellite imaging). The representations should be adapted to the domain.

You can do continual pretraining of DINOv2/3 on your own dataset. However, this often throws away a lot of relevant context!

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

FINO was tested across four different domains: protein-localization microscopy (HPA), Earth observation (FMoW), wildlife camera traps (iWildCam), and chest X-rays (MIMIC-CXR).

In all cases, FINO beats both unsupervised domain adaptation and fully supervised fine-tuning, and even heavily engineered, domain-specific SOTA models. Just fine-tuning DINOv3, or even continually pretraining DINOv3 on the target dataset, doesn't reliably improve performance, but FINO does!

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

To handle discrete metadata with high cardinality (lots of categories), FINO uses a momentum-updated prototype bank for discrete factors. The loss used is a contrastive loss, inspired by supervised contrastive learning.

For continuous metadata, the loss just regresses a small predictor head against the metadata target.

Of course, no metadata is needed at inference; it is only used to guide learning.
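
To make the two losses concrete, here is a rough pure-Python sketch. It is an assumption-heavy illustration, not the paper's implementation: the class `PrototypeBank`, the exponential-moving-average update, the temperature value, and the linear `regression_loss` head are all hypothetical choices. The discrete loss is a cross-entropy over cosine similarities to the prototypes, with the sample's own category as the positive.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class PrototypeBank:
    """One unit-norm prototype per metadata category, updated by EMA (assumed form)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.protos = {}  # category -> prototype vector

    def update(self, embedding, category):
        z = normalize(embedding)
        if category not in self.protos:
            self.protos[category] = z  # first sighting initializes the prototype
            return
        m = self.momentum
        mixed = [m * p + (1 - m) * e for p, e in zip(self.protos[category], z)]
        self.protos[category] = normalize(mixed)

    def contrastive_loss(self, embedding, category, temperature=0.1):
        """Pull the embedding toward its own prototype, away from the others.

        Assumes `category` already has a prototype in the bank.
        """
        z = normalize(embedding)
        logits = {c: dot(z, p) / temperature for c, p in self.protos.items()}
        top = max(logits.values())  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(l - top) for l in logits.values()))
        return log_norm - logits[category]


def regression_loss(embedding, weights, bias, target):
    """Squared error of a linear predictor head against a continuous metadata target."""
    pred = dot(embedding, weights) + bias
    return (pred - target) ** 2
```

Because the bank holds one vector per category rather than comparing every pair of samples in a batch, the cost of the loss grows with the number of categories seen so far, which is what makes it workable for high-cardinality metadata. At inference you would only run the encoder; neither the bank nor the head is needed.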

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

What's interesting to note is how different metadata features can affect training in different ways.

Let's take high-throughput cell microscopy as an example. In this domain, "plates" refer to physical sample batches that introduce technical noise and environmental artifacts, which AI models can often exploit instead of learning true biological features, leading to poor generalization. However, it turns out that suppressing this feature is harmful!

It turns out that each plate has a single type of cell and a single set of fluorescently labeled biological structures, so suppressing the plate features also suppresses learning of cell lines and biological structures! So when factors are entangled, it's hard to cleanly decide whether a metadata label should be encouraged in training, suppressed, or left alone!
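
One cheap way to spot this kind of entanglement before deciding what to suppress is to check whether one metadata column determines another. The sketch below is an illustration, not something from the paper; the function `determined_fraction` and the toy plate and cell-line labels are hypothetical.

```python
from collections import defaultdict


def determined_fraction(spurious, informative):
    """Fraction of samples whose spurious value maps to exactly one informative value.

    A result near 1.0 means the "spurious" factor almost fully determines the
    informative one, so suppressing it would likely suppress both.
    """
    seen = defaultdict(set)
    for s, i in zip(spurious, informative):
        seen[s].add(i)
    pure = sum(1 for s in spurious if len(seen[s]) == 1)
    return pure / len(spurious)


# Toy labels: every plate holds a single cell line (entangled) ...
plates = ["p1", "p1", "p2", "p2"]
entangled_lines = ["A", "A", "B", "B"]
# ... versus every plate holding a mix of cell lines (separable).
mixed_lines = ["A", "B", "A", "B"]
```

Here `determined_fraction(plates, entangled_lines)` is `1.0`, the situation described above, while `determined_fraction(plates, mixed_lines)` is `0.0`.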

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

Another great aspect of the paper is the variety of practical DINOv3 domain adaptation tricks it mentions, for example the use of SIGReg or a two-stage training pipeline.

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

What's funny is this paper was released the day I tweeted this. I hope to continue to see more innovation in the SSL space!

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

I am genuinely frustrated by how poorly self-supervised learning for vision is researched and how underappreciated it is.

Like how has DINOv2 been basically the best model for the past 3 years lol
