excellent talk on why evals are so hard to judge. core point: an eval score isn't one number, it's the output of a whole stack (harness, sandbox, hardware, prompt, llm, grader, engine/api), and perturbing any single layer can swing the final score dramatically.
which means most reported eval comparisons aren't apples-to-apples. you're told you're comparing models, but you're really comparing entire measurement pipelines that happen to share a model somewhere inside.
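a minimal sketch of what "apples-to-apples" would mean here, assuming you record every layer of the stack next to the score. `EvalSetup`, `differing_layers`, `is_model_comparison` and all the field values are hypothetical names for illustration, not anything from the talk:

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class EvalSetup:
    # One field per layer of the stack listed above; values are made up.
    harness: str
    sandbox: str
    hardware: str
    prompt: str
    llm: str
    grader: str
    engine: str


def differing_layers(a: EvalSetup, b: EvalSetup) -> list:
    """Names of the layers where the two setups disagree, in field order."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def is_model_comparison(a: EvalSetup, b: EvalSetup) -> bool:
    """True only if the model is the single thing that changed."""
    return differing_layers(a, b) == ["llm"]


base = EvalSetup(
    harness="harness-v1", sandbox="docker", hardware="gpu-a",
    prompt="zero-shot", llm="model-a", grader="exact-match", engine="api",
)
other_model = replace(base, llm="model-b")
other_stack = replace(base, llm="model-b", grader="llm-judge")

print(is_model_comparison(base, other_model))  # only the llm layer differs
print(differing_layers(base, other_stack))     # llm and grader both differ
```

if `differing_layers` returns anything beyond `llm`, the score gap can't be attributed to the model alone.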
what this points at, i think, is that the eval harness is becoming a real engineering surface, not a wrapper you bolt on at the end. once you accept that the number is a property of the whole pipeline, "model X scored Y" stops being a fact about the model and becomes a fact about a setup someone built. the leaderboard era assumed the harness was neutral plumbing. it isn't, and the gap between two labs' reported scores is increasingly the gap between their measurement stacks.
so there's no real line between eval infra and environment infra. it's the same machinery (fan-out, sandboxing, adversarial grading), pointed at measurement in one case and training in the other. a good env is already an eval you can also learn from; a good eval is just an environment you've frozen. the teams who build one stack for both will be the ones whose numbers you can actually trust.
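a toy sketch of the "same machinery" idea, under heavy assumptions: one frozen environment (tasks plus a grader), one rollout function, and the only difference between eval and training is what you do with the graded rollouts. `Env`, `rollouts`, `eval_score` and `training_batch` are hypothetical names, and the exact-match grader stands in for whatever real grading a stack would use:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Rollout = Tuple[str, str, float]  # (task, model output, reward)


@dataclass(frozen=True)
class Env:
    # Frozen: the tasks and the grader cannot be reassigned after construction.
    tasks: Tuple[str, ...]
    grader: Callable[[str, str], float]


def rollouts(env: Env, policy: Callable[[str], str]) -> List[Rollout]:
    """Run the policy on every task and grade each output."""
    results = []
    for task in env.tasks:
        output = policy(task)
        results.append((task, output, env.grader(task, output)))
    return results


def eval_score(env: Env, policy: Callable[[str], str]) -> float:
    """Measurement: collapse the graded rollouts into one number."""
    graded = rollouts(env, policy)
    return sum(reward for _, _, reward in graded) / len(graded)


def training_batch(env: Env, policy: Callable[[str], str]) -> List[Rollout]:
    """Training: keep the same graded rollouts as a learning signal."""
    return rollouts(env, policy)


# Toy setup: the "correct" answer is the task upper-cased.
env = Env(tasks=("a", "b"), grader=lambda t, out: 1.0 if out == t.upper() else 0.0)
policy = lambda t: t.upper() if t == "a" else t  # gets "a" right, "b" wrong

print(eval_score(env, policy))
print(training_batch(env, policy))
```

both paths go through the same `rollouts` call, so a change to the sandbox or grader would move the eval number and the training signal together.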
highly recommend watching!!
The talk is now on YouTube!
Link: https://www.youtube.com/watch?v=kmTMc-fVSXw
