Podcast Questions Reliability of AI Benchmarks With Wenhu Chen

Original post

The episode - https://www.the-information-bottleneck.com/why-ai-benchmarks-are-lying-to-you-with-wenhu-chen-metauniversity-of-waterloo/

Ravid Shwartz Ziv@ziv_ravid

New episode of The Information Bottleneck on how to evaluate your agents! We talk with @WenhuChen - the person behind MMLU-Pro and MMMU. If you've read a frontier model release in the last two years, you've seen his benchmarks. So we asked **the** question today: when a model jumps on your eval, how much is real and how do you know? We also talk about the right way to evaluate models, ClawBench (agents on 140+ real websites ordering food, booking tickets, applying for jobs), models vs. harnesses, pre-training evaluation, RL vs. reasoning, and why he thinks security becomes the hardest problem once agents get real permissions in the real world.

8:53 AM · Jun 15, 2026 · 389 Views
