2h ago

Lun Wang publishes 'Your Evals Will Break and You Won't See It Coming' after leaving Google DeepMind, arguing static benchmarks fail to prepare for self-evolving models entering new capability regimes


Mercor CEO Brendan Foody reposted the evaluation critique.

Original post

#1990 @BRENDANFOODY (OP)

Lun Wang @LUNWANG1996

I’ve left Google DeepMind after an amazing chapter. I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale. As I wrap up this chapter, I wrote down something I’ve been thinking about a lot: evals. We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations. https://wanglun1996.github.io/blog/your-evals-will-break.html

8:57 PM · May 17, 2026

Reposted by

#1990 @BRENDANFOODY

QUOTE POST

#1457 Seán Ó hÉigeartaigh @S_OHEIGEARTAIGH

"We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations." https://wanglun1996.github.io/blog/your-evals-will-break.html

Lun Wang @lunwang1996

3:57 AM · May 18, 2026 · 506.8K Views

10:20 AM · May 19, 2026 · 201 Views

