my first finding with this pipeline is that it works well, but there's a (rare) tendency toward false positives. I don't pass bboxes as tokens; instead I overlay the masks/bboxes on the image and show that to the judges
when the larger labelling model says there's something in the image but the object is only vaguely there, the smaller models can be convinced just because there's a bbox drawn around it. scores give absolutely zero signal btw, dump the research
e.g. I'm trying to detect tables, figures, signatures etc. in documents, and sometimes what a model considers a table (structured data) changes 😄 which is cool!
maybe I should use a document-specific model here, or (for general images, although these models do a good job) a zero-shot detector as judge with an actual confidence signal, or optimize the prompt, or all of the above. but my goal is to get one working version by AIE SF and then generalize more. let's see!
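a rough sketch of what the "zero-shot detector as judge" idea could look like: keep a labelled box only if some detector box overlaps it enough and carries enough confidence. this is not what the pipeline does today; `Box`, `detector_confirms`, and both thresholds are made-up names and values for illustration, and the detections are assumed to arrive as `(box, confidence)` pairs from whatever detector you plug in.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    # Pixel coordinates, top-left (x1, y1) and bottom-right (x2, y2).
    x1: float
    y1: float
    x2: float
    y2: float


def area(b: Box) -> float:
    return max(0.0, b.x2 - b.x1) * max(0.0, b.y2 - b.y1)


def iou(a: Box, b: Box) -> float:
    # Intersection over union of two axis-aligned boxes.
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def detector_confirms(
    candidate: Box,
    detections: List[Tuple[Box, float]],
    min_iou: float = 0.5,   # assumed threshold, tune on your own data
    min_conf: float = 0.3,  # assumed threshold, tune on your own data
) -> bool:
    # True if at least one confident detection overlaps the labelled box.
    return any(
        conf >= min_conf and iou(candidate, box) >= min_iou
        for box, conf in detections
    )


labelled_table = Box(0, 0, 10, 10)
print(detector_confirms(labelled_table, [(Box(1, 1, 10, 10), 0.8)]))
print(detector_confirms(labelled_table, [(Box(1, 1, 10, 10), 0.1)]))
```

the point of the design is that the detector never sees the overlaid bbox, so it can't be "convinced" by it the way a VLM judge looking at the annotated image can.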
I'm testing multiple small VLMs as judges; every part of the pipeline uses a different model family
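one simple way to combine several judges, sketched under assumptions: since the scores carry no signal, treat each judge's answer as a plain yes/no and keep a box only if enough judges agree. I'm not claiming this is how the votes are actually combined in the pipeline; `majority_verdict` and the judge names are hypothetical.

```python
from typing import Dict, Optional


def majority_verdict(votes: Dict[str, bool], min_agree: Optional[int] = None) -> bool:
    # votes maps a judge name to its yes/no answer for one labelled box.
    # By default a strict majority is required; min_agree overrides that.
    yes = sum(1 for v in votes.values() if v)
    needed = min_agree if min_agree is not None else len(votes) // 2 + 1
    return yes >= needed


# Hypothetical judge names, one per model family.
votes = {"judge_a": True, "judge_b": True, "judge_c": False}
print(majority_verdict(votes))               # strict majority
print(majority_verdict(votes, min_agree=3))  # require all three to agree
```

raising `min_agree` trades recall for fewer false positives, which may be the useful direction if the judges tend to say yes whenever a box is drawn.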
let me know below if you want me to test any other models, these are very convenient

