Lun Wang publishes 'Your Evals Will Break and You Won't See It Coming' after leaving Google DeepMind, arguing static benchmarks fail to prepare for self-evolving models entering new capability regimes

Mercor CEO Brendan Foody reposted the evaluation critique.

Original post

I’ve left Google DeepMind after an amazing chapter. I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale. As I wrap up this chapter, I wrote down something I’ve been thinking about a lot: evals. We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations. https://wanglun1996.github.io/blog/your-evals-will-break.html

8:57 PM · May 17, 2026
Reposted by Brendan Foody

"We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations." https://wanglun1996.github.io/blog/your-evals-will-break.html

Lun Wang @lunwang1996

3:57 AM · May 18, 2026 · 506.8K Views
10:20 AM · May 19, 2026 · 201 Views