
Developer Tests Small VLMs As Judges In Detection Training Pipeline


Original post

merve@mervenoyann

my first finding with this pipeline is that it works well but (rarely) there's a false positive tendency, I don't pass bboxes as tokens but rather overlaid masks/bboxes to judges

when the larger labelling model indicates there's something in the image but it's only vaguely there in that object, smaller models can be convinced because there's a bbox circling it. scores give absolutely zero signal btw, dump the research

e.g. I'm trying to detect tables, figures, signatures etc from documents, sometimes what a model considers a table (structured data) changes 😄 which is cool!

maybe I should use a document-specific model here or, for general images (although these models do a good job), a zero-shot detector as judge with an actual confidence signal, or optimize the prompt, or all of these. but my goal is to get one working version until AIE SF and then generalize more. let's see!

merve@mervenoyann

I'm testing multiple small VLMs-as-judges, all parts of pipeline are different model families

let me know below if you want me to test any other models, these are very convenient

1:46 AM · Jun 16, 2026 · 3K Views

Sentiment

Users criticized testing small VLMs as judges in detection training pipelines because overlaid hints and bboxes essentially give judges the answer key, producing unreliable rubber-stamped scores.

Pos: 0.0% · Neg: 100.0% (2 comments with sentiment)

Cluster Engagement

Posts from X

Most Activity

VIEWS: 805 · LIKES: 3

merve@mervenoyann

here's what judges consider as a table because it's structured data, Gemma-4 likes to say it


REPLIES: 1

merve@mervenoyann

@skalskip92 in case you started using this^


Puzzle Paws@paws4puzzles

@mervenoyann i'd drop the overlaid hints first. you're basically giving judges the answer key. of course they rubber-stamp it. no wonder scores give zero signal.


Puzzle Paws@paws4puzzles

@mervenoyann @skalskip92 overlaid bboxes are basically peer pressure for vlms. smaller judges just fold. i ignore confidence scores entirely. dump that research, your ensemble merge logic does the real work


merve@mervenoyann

@paws4puzzles yeah but they are supposed to judge detections here that's the thing
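
The thread describes several small VLM judges, each accepting or rejecting a detection proposed by a larger labelling model, with numeric scores reported as uninformative. One way to combine such verdicts is a plain majority vote over boolean accept/reject answers. The sketch below is only an illustration under that assumption: `Detection`, `majority_accepts`, `filter_detections`, and the `min_agreement` threshold are hypothetical names, not taken from the pipeline discussed in the posts.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """A candidate box from the labelling model (hypothetical structure)."""
    label: str                               # e.g. "table", "figure", "signature"
    box: tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def majority_accepts(votes: list[bool], min_agreement: float = 0.5) -> bool:
    """Return True when strictly more than `min_agreement` of judges accept.

    An empty vote list is treated as a rejection, so a detection that no
    judge looked at is never kept by default.
    """
    if not votes:
        return False
    return sum(votes) / len(votes) > min_agreement


def filter_detections(
    detections: list[Detection],
    judge_votes: dict[int, list[bool]],
    min_agreement: float = 0.5,
) -> list[Detection]:
    """Keep detections whose judges agree, using boolean verdicts only.

    `judge_votes` maps a detection's index to one accept/reject verdict
    per judge. Numeric scores are deliberately not used here.
    """
    return [
        det
        for i, det in enumerate(detections)
        if majority_accepts(judge_votes.get(i, []), min_agreement)
    ]


if __name__ == "__main__":
    # Made-up verdicts from three hypothetical judges.
    candidates = [
        Detection("table", (10.0, 20.0, 300.0, 180.0)),
        Detection("signature", (40.0, 400.0, 160.0, 440.0)),
    ]
    votes = {0: [True, True, False], 1: [True, False, False]}
    for det in filter_detections(candidates, votes):
        print(det.label, det.box)
```

Raising `min_agreement` makes the filter stricter, which may help with the false positive tendency mentioned above, at the cost of dropping more borderline detections. It does not address the concern that overlaid boxes bias every judge in the same direction.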
