Researcher Tests Small VLMs as Judges in Detection Training Pipeline

Original post

merve@mervenoyann · #861 in Tech

day 2 findings on this pipeline 🥹

> it works, got map@50=0.8028 on road sign detection against human annotations, with only 1.3k examples 🙌🏼 see results below

> Liquid rejects way more than Gemma-4 (530 vs 306 in hard document parsing, 1022 vs 116 in easy road sign detection, tbh it's smaller and more prone to hallucination when I vibe check)

> in some cases (see document media parsing examples below) trained RF-DETR outperforms Qwen annotations it was trained on which is super cool, sometimes judges introduce bboxes (and I don't remove them) it's a win? 😄

> multiple VLMs as judges will shrink your dataset depending on the difficulty of the problem, sometimes taking only one "correct" from a judge is enough. since you are training small models it's better to kickoff training for consensus and single correct verdict separately

> use super-specific prompts of what you want and don't want in labelling and judging especially if your labels as words could mean many things
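
The consensus versus single-"correct"-verdict split mentioned in the findings can be sketched roughly as below. This is only an illustration, not the author's pipeline code: the `Sample` class, the judge names, and the `split_by_policy` helper are all hypothetical, and each judge's output is assumed to reduce to a plain accept/reject boolean per sample.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical record: one annotated image plus each judge's verdict."""
    sample_id: str
    verdicts: dict[str, bool]  # judge name -> True if that judge accepted the annotation


def split_by_policy(samples: list[Sample]) -> tuple[list[str], list[str]]:
    """Return (consensus_ids, any_accept_ids).

    consensus_ids: every judge accepted the annotation (the stricter, smaller set).
    any_accept_ids: at least one judge accepted it (the looser, larger set).
    """
    consensus = [s.sample_id for s in samples
                 if s.verdicts and all(s.verdicts.values())]
    any_accept = [s.sample_id for s in samples
                  if any(s.verdicts.values())]
    return consensus, any_accept


# Made-up verdicts from two hypothetical judges.
samples = [
    Sample("img_001", {"judge_a": True, "judge_b": True}),
    Sample("img_002", {"judge_a": True, "judge_b": False}),
    Sample("img_003", {"judge_a": False, "judge_b": False}),
]

consensus_ids, any_accept_ids = split_by_policy(samples)
print(consensus_ids)   # ['img_001']
print(any_accept_ids)  # ['img_001', 'img_002']
```

Keeping the two ID lists separate makes it straightforward to launch one training run per policy and compare them, which is the spirit of the advice above.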

next up: make this library leaner to generalize better to be problem-agnostic, try again on segmentation, actually use Gemma for orchestration
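
For readers unfamiliar with the metric quoted above: the "@50" means a predicted box only counts as a match when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below shows just that threshold test. It assumes boxes in `(x_min, y_min, x_max, y_max)` format and uses hypothetical helper names; full mAP additionally ranks predictions by confidence and averages precision over recall levels and classes, which is not shown here.

```python
Box = tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred: Box, gt: Box) -> bool:
    """True when the prediction overlaps the ground truth with IoU >= 0.5."""
    return iou(pred, gt) >= 0.5


print(is_match_at_50((0, 0, 10, 10), (0, 0, 10, 5)))    # True  (IoU = 0.5)
print(is_match_at_50((0, 0, 10, 10), (5, 5, 15, 15)))   # False (IoU = 25/175)
```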

merve@mervenoyann

I'm testing multiple small VLMs-as-judges, all parts of pipeline are different model families

let me know below if you want me to test any other models, these are very convenient

8:18 AM · Jun 17, 2026 · 3.3K Views

Sentiment

Users appreciate the open sharing of labeled datasets, judged data, and trained models from testing small VLMs as judges in detection training pipelines.

Pos: 100.0%

Neg: 0.0%

1 comment with sentiment.

Related links

Vision Intern - a merve Collection

Via Hugging Face

Posts from X

merve@mervenoyann

@DataScienceHarp @skalskip92 @maximelabonne you might be interested in this ^

merve@mervenoyann

all my artifacts are here: https://huggingface.co/collections/merve/vision-intern (labelled datasets, judged datasets, trained models, parts of pipelines and more)

also shoutout to @huggingface infra 💟 I use Buckets, Jobs, Dataset Viewer and more heavily

V0LYX@0xV0LYX

@mervenoyann liquid rejecting way more is actually kind of a feature not a bug for most road use cases