
Talk Shows LLM Eval Scores Depend On Full Measurement Pipeline

Original postFlorian Brand#1117
Vivek (@vivek_2332)

excellent talk on why evals are so hard to judge. core point: an eval score isn't one number, it's the output of a whole stack (harness, sandbox, hardware, prompt, llm, grader, engine/api), and perturbing any single layer can swing the final score dramatically.

which means most reported eval comparisons aren't apples-to-apples. you're told you're comparing models, but you're really comparing entire measurement pipelines that happen to share a model somewhere inside.
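a rough sketch of what "comparing pipelines, not models" could look like in code. this is not from the talk: the `EvalSetup` class, its field values, and both helper functions are hypothetical names made up for illustration. the idea is only that every layer gets recorded next to the score, so you can check which layers differ before reading anything into a gap.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class EvalSetup:
    """One field per layer of the stack named in the post (all values hypothetical)."""
    harness: str
    sandbox: str
    hardware: str
    prompt: str
    llm: str
    grader: str
    engine: str


def differing_layers(a: EvalSetup, b: EvalSetup) -> list[str]:
    """Return the names of the layers whose values differ between two setups."""
    return [f.name for f in fields(EvalSetup)
            if getattr(a, f.name) != getattr(b, f.name)]


def is_model_only_comparison(a: EvalSetup, b: EvalSetup) -> bool:
    """True only if nothing except the llm layer differs."""
    return set(differing_layers(a, b)) <= {"llm"}


lab_a = EvalSetup(
    harness="harness-v1", sandbox="docker", hardware="gpu-a",
    prompt="zero-shot", llm="model-x", grader="exact-match", engine="engine-1",
)
# Lab B swaps the model, but also the prompt and the grader.
lab_b = replace(lab_a, llm="model-y", prompt="few-shot", grader="llm-judge")

print(differing_layers(lab_a, lab_b))
print(is_model_only_comparison(lab_a, lab_b))
```

here the two setups differ in three layers, so a score gap between them can't be attributed to the model alone.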

what this points at, i think, is that the eval harness is becoming a real engineering surface, not a wrapper you bolt on at the end. once you accept that the number is a property of the whole pipeline, "model X scored Y" stops being a fact about the model and becomes a fact about a setup someone built. the leaderboard era assumed the harness was neutral plumbing. it isn't, and the gap between two labs' reported scores is increasingly the gap between their measurement stacks.

so there's no real line between eval infra and environment infra. it's the same machinery (fan-out, sandboxing, adversarial grading), pointed at measurement in one case and training in the other. a good env is already an eval you can also learn from; a good eval is just an environment you've frozen. the teams who build one stack for both will be the ones whose numbers you can actually trust.
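a toy illustration of the "same machinery" claim, under heavy assumptions: `Environment`, `evaluate`, and `training_signal` are hypothetical names, the tasks and the policy are made up, and real stacks add sandboxing and fan-out that this leaves out. the point is just that one rollout path can feed both a frozen score and a per-task reward signal.

```python
from dataclasses import dataclass
from typing import Callable

Policy = Callable[[str], str]  # maps a task prompt to an answer


@dataclass(frozen=True)  # frozen: tasks and grader cannot be swapped after creation
class Environment:
    tasks: tuple[str, ...]
    grade: Callable[[str, str], float]  # (task, answer) -> reward in [0, 1]

    def rollout(self, policy: Policy) -> list[float]:
        """Run the policy on every task and grade each answer."""
        return [self.grade(task, policy(task)) for task in self.tasks]


def evaluate(env: Environment, policy: Policy) -> float:
    """Eval use: collapse the rollout into a single mean score."""
    rewards = env.rollout(policy)
    return sum(rewards) / len(rewards)


def training_signal(env: Environment, policy: Policy) -> list[tuple[str, float]]:
    """Training use: keep per-task rewards so a learner could consume them."""
    return list(zip(env.tasks, env.rollout(policy)))


# Hypothetical toy data: two arithmetic tasks with an exact-match grader.
ANSWERS = {"2+2": "4", "3+3": "6"}
env = Environment(
    tasks=tuple(ANSWERS),
    grade=lambda task, answer: 1.0 if answer == ANSWERS[task] else 0.0,
)


def always_four(task: str) -> str:
    """A deliberately bad policy that answers "4" to everything."""
    return "4"


print(evaluate(env, always_four))
print(training_signal(env, always_four))
```

both functions call the same `rollout`, so any change to the grader or the task set moves the eval score and the training reward together.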

highly recommend watching!!

The talk is now on YouTube!

Link: https://www.youtube.com/watch?v=kmTMc-fVSXw

2:55 AM · Jun 5, 2026 · 2.6K Views