1/ We use LLM judges to scale up costly human evaluation. But to trust an LLM judge, you need… human evaluation. 🔄
Our new preprint tackles this circularity: "Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability" 🧵
2/ Can we predict how well an LLM judge agrees with humans from just a few human-labeled samples? It turns out LLMs can tell us which samples are most informative to annotate: we use cheap synthetic labels from other LLMs to choose which samples to send to a human, instead of sampling at random.
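A rough sketch of what "let synthetic labels pick the samples" could look like. This is not the paper's algorithm: the selection rule here (rank samples by how much the proxy LLMs disagree with each other) is a swapped-in heuristic, and the function names, label values, and toy data are all hypothetical.

```python
from collections import Counter

def disagreement(labels):
    """Fraction of proxy-LLM labels that differ from the majority label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return 1.0 - majority_count / len(labels)

def select_for_annotation(synthetic_labels, budget):
    """Return indices of the `budget` samples with the highest proxy disagreement.

    Assumption: disagreement among cheap synthetic labels is used as a stand-in
    for "informative to annotate". Ties are broken by original sample order.
    """
    scores = [disagreement(row) for row in synthetic_labels]
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:budget]

# Toy data (hypothetical): one row per sample, one label per proxy LLM.
synthetic_labels = [
    ["good", "good", "good"],
    ["good", "bad", "good"],
    ["bad", "good", "neutral"],
    ["bad", "bad", "bad"],
]

# Indices of the samples to send to a human annotator.
picked = select_for_annotation(synthetic_labels, budget=2)
print(picked)
```

Samples where the proxies all agree are skipped under this rule; random sampling would spend part of the human budget on them.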
6/ Takeaway: LLMs + synthetic data work as an informative prior for human annotation. You need fewer annotations, which can result in more reliable judges. As auto-raters become standard, fast calibration against real human raters matters more than ever.
📄 https://arxiv.org/abs/2606.15029