LLMs Select Informative Samples For Human Annotation Of LLM Judges

Original post

1/ We use LLM judges to scale up costly human evaluation. But to trust an LLM judge, you need… human evaluation. 🔄

Our new preprint tackles this circularity: "Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability" 🧵

9:05 AM · Jun 23, 2026 · 5.3K Views

Posts from X

Alyssa Unell (@AlyssaUnell)

2/ Can we predict how well an LLM judge agrees with humans from just a few human-labeled samples? Turns out LLMs tell us which samples are most informative to annotate. We use cheap synthetic labels from other LLMs to pick which samples to send to a human, instead of at random.
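To make the idea in 2/ concrete, here is a minimal sketch of choosing which samples to send to a human using only synthetic labels. It is an illustration under stated assumptions, not the paper's algorithm: the matching criterion (pick the subset whose judge-vs-proxy agreement is closest to that of the full pool), the random-search optimizer, and all names (`agreement`, `select_subset`, `judge`, `proxies`) are hypothetical, and the toy labels are randomly generated.

```python
import random
from statistics import mean


def agreement(judge, reference, idx):
    """Fraction of samples in idx where the judge label equals the reference label."""
    return mean(1.0 if judge[i] == reference[i] else 0.0 for i in idx)


def select_subset(judge, proxies, k, trials=2000, seed=0):
    """Pick k indices whose judge-vs-proxy agreement best matches the full pool.

    judge:   list of labels from the LLM judge under evaluation
    proxies: list of label lists, one per cheap "synthetic annotator" LLM
    Assumption: plain random search stands in for whatever optimizer
    a real method would use.
    """
    rng = random.Random(seed)
    full = list(range(len(judge)))
    # Agreement of the judge with each proxy over the whole pool.
    target = [agreement(judge, p, full) for p in proxies]

    best_idx, best_gap = None, float("inf")
    for _ in range(trials):
        cand = rng.sample(full, k)
        # Total absolute mismatch between subset and full-pool agreement.
        gap = sum(abs(agreement(judge, p, cand) - t)
                  for p, t in zip(proxies, target))
        if gap < best_gap:
            best_idx, best_gap = sorted(cand), gap
    return best_idx, best_gap


# Toy data (made up): binary judge labels and two noisy proxy labelers.
rng = random.Random(1)
judge = [rng.randint(0, 1) for _ in range(200)]
proxies = [[lab if rng.random() < acc else 1 - lab for lab in judge]
           for acc in (0.9, 0.7)]

idx, gap = select_subset(judge, proxies, k=20)
print("send these samples to a human:", idx)
print("remaining mismatch on synthetic labels:", gap)
```

In this sketch, a human would label only the samples in `idx`, and judge-human agreement on that subset would serve as the estimate for the full pool. Whether that estimate beats random sampling depends on how well the proxy labels track human labels, which the toy data here does not test.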

Alyssa Unell (@AlyssaUnell)

6/ Takeaway: LLMs + synthetic data work as an informative prior for human annotation. Needing fewer annotations can result in more reliable judges. As auto-raters become standard, fast calibration against real human raters matters more than ever.

📄 https://arxiv.org/abs/2606.15029