Netflix Researcher To Present LLM Judge Techniques At Toronto ML Summit

Original post
Cameron R. Wolfe, Ph.D. (@cwolferesearch) · #1942 in Tech

I’m giving a talk on LLM judges at the Toronto Machine Learning Summit next week. The talk will cover practical techniques like:

- Collecting high-quality expert feedback on subjective tasks.
- Calibrating LLM judges with expert opinions.
- Properly eliciting reasoning within an LLM judge.
- Using multiple agents to decompose complex evaluation tasks.
- Continually improving LLM judges with production monitoring / metrics.

This talk will be full of practical details for building useful evaluation systems. Hope to see you there!

3:34 PM · Jun 11, 2026 · 879 Views
Sentiment

Users are excited about the Netflix researcher's presentation on LLM judge techniques at the Toronto ML Summit, calling the topic "killer" and expressing optimism about it.

Positive: 100.0% · Negative: 0.0% (2 comments with sentiment)
Cluster Engagement
Views: 802 · Likes: 2 · Retweets: 1

Posts from X

The talk is based on this blog: https://netflixtechblog.com/evaluating-netflix-show-synopses-with-llm-as-a-judge-6269251e6f28

Register for the talk here: https://www.eventbrite.ca/e/toronto-machine-learning-summit-tmls-10th-annual-conference-expo-2026-tickets-1976645039523?aff=oddtdtcreator

1h · Views: 802 · Likes: 2 · Bookmarks: 0
REPLIES 1
Strata (@ChainZenit)

@cwolferesearch that sounds like a killer topic for the summit.

1h · Views: 5 · Likes: 1
Rugbist (@rugbist_)

@cwolferesearch the calibration with expert opinions part is what most people skip. curious how you handle disagreement between judges.

1h
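
The calibration question raised in the reply above is often approached by measuring how well a judge's labels agree with expert labels beyond chance. Below is a minimal sketch using Cohen's kappa. The `cohen_kappa` helper and the pass/fail labels are illustrative assumptions, not the method from the talk or the Netflix blog post.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each rater labeled independently
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels, invented for illustration only.
expert = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(f"judge vs. expert kappa: {cohen_kappa(expert, judge):.3f}")
```

The same function can compare two experts or two judges, which gives one simple way to quantify disagreement before deciding how to resolve it.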
Blissy@BlissyOnX

@cwolferesearch The dataset of 290 reviews and 15 system rubrics was interesting.

But 290 feels small for calibration of subjective tasks.

1h
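
A rough way to gauge the sample-size concern above is to look at how precisely an agreement rate can be estimated from 290 items. The sketch below uses the normal-approximation (Wald) interval for a proportion. The figure of 290 comes from the reply; the 0.80 agreement rate, the `agreement_half_width` helper, and the assumption that the reviews are independent are all hypothetical.

```python
import math


def agreement_half_width(rate, n, z=1.96):
    """Half-width of a normal-approximation confidence interval
    for an agreement rate measured on n independent items."""
    if not 0.0 <= rate <= 1.0 or n <= 0:
        raise ValueError("rate must be in [0, 1] and n must be positive")
    return z * math.sqrt(rate * (1.0 - rate) / n)


# 290 is the dataset size mentioned in the reply; 0.80 is an assumed rate.
for n in (290, 1000, 5000):
    hw = agreement_half_width(0.80, n)
    print(f"n={n}: 0.80 +/- {hw:.3f}")
```

Under these assumptions, 290 items pin an 80% agreement rate down to roughly ±4.6 percentage points at the 95% level. Whether that is tight enough depends on how small a difference between judge variants needs to be detected, and splitting the data across 15 rubrics would leave far fewer items per rubric.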