SophontAI's Tanishq Mathew Abraham details FINO, which adapts DINOv2 and DINOv3 models using metadata instead of manual labels

Story Overview

Tanishq Mathew Abraham highlighted a new approach called FINO that tweaks the DINO self-supervised framework so vision models like DINOv2 and DINOv3 can adapt to specialized domains by leaning on whatever metadata already sits in the dataset.

Original post

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

This is one of the papers I'm quite excited about in the past few weeks. It's a very simple but practical modification to the DINOv3 training framework.

Let me explain how it works.

1:39 AM · Jun 29, 2026 · 12K Views

Research Impact

Metadata replaces manual labels

FINO folds discrete tags such as plate or country and continuous signals such as timestamp into the training loop, guiding the model to keep useful factors while ignoring batch effects.

Open Question

Benchmarks span four scientific domains

Tests on fluorescence microscopy, satellite imagery, camera-trap photos, and chest X-rays show FINO matching or beating both unsupervised baselines and fully supervised fine-tuning without using task labels for the backbone.

Sentiment

Users praise the FINO method for adapting DINO vision models with existing metadata because it sensibly bypasses the annotation bottleneck in medical imaging.

Positive: 100.0%, negative: 0.0% (3 comments with sentiment).

Posts from X

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

What's funny is this paper was released the day I tweeted this. I hope to continue to see more innovation in the SSL space!

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

I am genuinely frustrated by how poorly self-supervised learning for vision is researched and how underappreciated it is.

Like how has DINOv2 been basically the best model for the past 3 years lol

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

Consider a satellite image. If you have information about the location of the image, the time of day, etc., would you throw that information away, or would you incorporate it into training somewhere?

Or what about an X-ray scan? If you have information about patient sex, scanner used, etc. perhaps these are things you want to teach the model to be invariant to.

The current DINOv2/3 framework would throw away all of this metadata and treat every image exactly the same.

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

So in a nutshell, pretrained self-supervised models like DINOv2/3 are often poorly suited for direct application to scientific domains (e.g., microscopy, medical imaging, satellite imaging). The representations should be adapted to the domain.

You can do continual pretraining of DINOv2/3 on your own dataset. However, this often throws away a lot of relevant context!

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

That's where the FINO framework comes in. The main innovation here is the incorporation of guidance with metadata you already have to better adapt the learned representations.

The authors utilize two types of metadata to guide training: informative factors that should shape the representation (an antibody label in microscopy, geography in satellite imagery) are encouraged, while spurious factors that just reflect how the data was collected (the imaging plate, the sensor resolution) are actively suppressed via gradient reversal.
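
To make the gradient reversal idea concrete, here is a minimal scalar sketch. It is not the authors' code: the function names, the one-weight "encoder" `w`, the one-weight "head" `v`, and the strength `lam` are all hypothetical. The reversal layer is the identity on the forward pass and flips the sign of the gradient on the backward pass, so the head learns to predict the spurious factor while the encoder is pushed to make that prediction harder.

```python
# Hypothetical scalar illustration of gradient reversal; not the FINO implementation.

def grad_reverse_forward(z):
    """Forward pass of the reversal layer: plain identity."""
    return z


def grad_reverse_backward(grad_z, lam):
    """Backward pass: flip the sign and scale by lam (an assumed hyperparameter)."""
    return -lam * grad_z


def spurious_step(w, v, x, y, lam=1.0, lr=0.1):
    """One SGD step where head v predicts a spurious target y from feature z = w * x."""
    z = w * x                               # encoder feature
    y_hat = v * grad_reverse_forward(z)     # head prediction through the reversal layer
    err = y_hat - y
    loss = err ** 2

    grad_v = 2 * err * z                    # head gradient: ordinary descent
    grad_z = 2 * err * v                    # gradient arriving at the reversal layer
    grad_w = grad_reverse_backward(grad_z, lam) * x  # encoder receives the reversed gradient

    return w - lr * grad_w, v - lr * grad_v, loss


if __name__ == "__main__":
    w, v, loss = spurious_step(w=1.0, v=1.0, x=1.0, y=0.0)
    print(w, v, loss)
```

In this toy step the head weight moves toward a better prediction of the spurious target, while the encoder weight moves in the opposite direction, which is the adversarial tug-of-war that gradient reversal sets up.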

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

FINO was tested across four different domains: protein-localization microscopy (HPA), Earth observation (FMoW), wildlife camera traps (iWildCam), and chest X-rays (MIMIC-CXR).

In all cases, FINO beats both unsupervised domain adaptation and fully supervised fine-tuning, and even heavily engineered, domain-specific SOTA models. Just fine-tuning DINOv3, or even continually pretraining it on the target dataset, doesn't reliably improve performance, but FINO does!

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

To handle discrete metadata with high cardinality (lots of categories), FINO uses a momentum-updated prototype bank for discrete factors. The loss used is a contrastive loss, inspired by supervised contrastive learning.
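
A rough sketch of what a momentum-updated prototype bank could look like, in plain Python. Everything here is an assumption for illustration: the class name `PrototypeBank`, the `momentum` and `temperature` values, and the category names. The loss is also a simplification, a softmax over embedding-to-prototype similarities rather than the paper's exact supervised-contrastive-style loss.

```python
import math


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]


class PrototypeBank:
    """Hypothetical bank holding one unit-norm prototype per metadata category."""

    def __init__(self, momentum=0.9, temperature=0.1):
        self.momentum = momentum
        self.temperature = temperature
        self.protos = {}

    def update(self, category, embedding):
        """Exponential-moving-average update of the prototype for one category."""
        z = normalize(embedding)
        if category not in self.protos:
            self.protos[category] = z
            return
        m = self.momentum
        old = self.protos[category]
        self.protos[category] = normalize(
            [m * p + (1 - m) * e for p, e in zip(old, z)]
        )

    def loss(self, category, embedding):
        """Cross-entropy over prototype similarities: pull toward the matching prototype."""
        z = normalize(embedding)
        logits = {
            c: sum(a * b for a, b in zip(z, p)) / self.temperature
            for c, p in self.protos.items()
        }
        mx = max(logits.values())  # subtract the max for numerical stability
        log_denom = mx + math.log(sum(math.exp(l - mx) for l in logits.values()))
        return log_denom - logits[category]
```

Because the bank stores one vector per category instead of comparing every pair of samples in a batch, the cost of the loss grows with the number of categories seen rather than requiring each category to appear several times per batch, which is one plausible reason a prototype bank helps with high-cardinality metadata.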

For continuous metadata, the loss just regresses a small predictor head against the metadata target.
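
For the continuous case, a minimal sketch under stated assumptions: a linear head with mean-squared-error loss, trained by plain SGD. The function names, the learning rate, and the choice of a linear head are hypothetical, and the embedding is held fixed here, whereas in actual training the gradient would also flow back into the backbone.

```python
def predict(weights, bias, embedding):
    """Linear predictor head: maps an embedding to a scalar metadata estimate."""
    return sum(w * e for w, e in zip(weights, embedding)) + bias


def regression_step(weights, bias, embedding, target, lr=0.1):
    """One SGD step on the squared error between the prediction and the metadata target."""
    err = predict(weights, bias, embedding) - target
    loss = err ** 2
    new_weights = [w - lr * 2 * err * e for w, e in zip(weights, embedding)]
    new_bias = bias - lr * 2 * err
    return new_weights, new_bias, loss


if __name__ == "__main__":
    # Hypothetical example: target is a timestamp rescaled to the range [0, 1].
    w, b = [0.0, 0.0], 0.0
    for _ in range(3):
        w, b, loss = regression_step(w, b, embedding=[1.0, 2.0], target=1.0)
        print(loss)
```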

Of course, no metadata is needed at inference, it is only used to guide learning.

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

What's interesting to note is how different metadata features can affect training in different ways.

Let's take high-throughput cell microscopy as an example. In this domain, "plates" refer to physical sample batches that introduce technical noise and environmental artifacts, which AI models can often exploit instead of learning true biological features, leading to poor generalization. However, it turns out that suppressing this feature is harmful!

It turns out that because each plate holds a single cell type and a single set of fluorescently labeled biological structures, suppressing plate features also suppresses learning of cell lines and biological structures! So when factors are entangled, it's hard to cleanly decide whether a metadata label should be encouraged, suppressed, or left alone during training.

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

Another great aspect of the paper is the variety of practical DINOv3 domain-adaptation tricks it mentions, for example the use of SIGReg or a two-stage training pipeline.

Lucas Beyer (bl16)@giffmana

@iScienceLuvr Great thread, thank you!

Steven Collard@stalmico

@iScienceLuvr simple mods to DINOv3 are underrated, curious if it touches the register tokens

长期收购 LLM-APIKEY 🌸🌸🌸@agustinmussan07

@iScienceLuvr sounds interesting, can't wait to see the details!

John Owen@dreamingElvis

@iScienceLuvr link in case someone wants to read it for themself...tho grok will summarize too: https://www.alphaxiv.org/overview/2606.05107v1

John Silver@JohnGolf_CA

@iScienceLuvr Makes absolute sense - metadata is already there, why not leverage it. The annotation bottleneck in medical imaging is real. Curious about compute overhead?
