day 2 findings on this pipeline 🥹
> it works, got mAP@50=0.8028 on road sign detection against human annotations, with only 1.3k examples 🙌🏼 see results below
> Liquid rejects way more than Gemma-4 (530 vs 306 in hard document parsing, 1022 vs 116 in easy road sign detection); tbh it's smaller and more prone to hallucination when I vibe check
> in some cases (see document media parsing examples below) the trained RF-DETR outperforms the Qwen annotations it was trained on, which is super cool. sometimes judges introduce bboxes (and I don't remove them), it's a win? 😄
> multiple VLMs as judges will shrink your dataset depending on the difficulty of the problem, sometimes taking only one "correct" from a judge is enough. since you are training small models, it's better to kick off training for consensus and single-correct verdicts separately
> use super-specific prompts for what you want and don't want in labelling and judging, especially if your labels as words could mean many things
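
rough sketch of the consensus vs single-correct split, in case it helps. this is not the library's actual code: the verdict format (sample id mapped to one boolean per judge) and the `split_by_verdict` name are made up for illustration

```python
from typing import Dict, List, Tuple

# Hypothetical verdict format: sample id -> one boolean per judge (True = "correct").
Verdicts = Dict[str, List[bool]]


def split_by_verdict(verdicts: Verdicts) -> Tuple[List[str], List[str]]:
    """Return (consensus_ids, any_correct_ids) for two separate training runs."""
    # Consensus: every judge accepted the annotation (skip samples with no verdicts).
    consensus = [sid for sid, v in verdicts.items() if v and all(v)]
    # Single correct verdict: at least one judge accepted the annotation.
    any_correct = [sid for sid, v in verdicts.items() if any(v)]
    return consensus, any_correct


# Toy data, not real pipeline output.
verdicts = {
    "img_001": [True, True],
    "img_002": [True, False],
    "img_003": [False, False],
}
consensus_ids, any_correct_ids = split_by_verdict(verdicts)
print(len(consensus_ids), len(any_correct_ids))
```

the consensus set is always a subset of the any-correct set, so comparing the two training runs shows whether the extra (noisier) samples help or hurt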
next up: make this library leaner and problem-agnostic so it generalizes better, try again on segmentation, actually use Gemma for orchestration
I'm testing multiple small VLMs-as-judges, and every part of the pipeline uses a different model family
let me know below if you want me to test any other models, these are very convenient
