The episode - https://www.the-information-bottleneck.com/why-ai-benchmarks-are-lying-to-you-with-wenhu-chen-metauniversity-of-waterloo/
New episode of The Information Bottleneck on how to evaluate your agents! We talk with @WenhuChen, the person behind MMLU-Pro and MMMU. If you've read a frontier model release in the last two years, you've seen his benchmarks. So we asked **the** question: when a model jumps on your eval, how much of the gain is real, and how do you know? We also talk about the right way to evaluate models, ClawBench (agents ordering food, booking tickets, and applying for jobs on 140+ real websites), models vs. harnesses, pre-training evaluation, RL vs. reasoning, and why he thinks security becomes the hardest problem once agents get real permissions in the real world.