8h ago

Lun Wang leaves Google DeepMind and argues in a new blog post that static benchmarks will lose relevance for self-evolving models entering new capability regimes


The post advocates replacing static benchmarks with self-evolving evaluation frameworks.

Original post

#1990 @BRENDANFOODY OP

Lun Wang @LUNWANG1996

I’ve left Google DeepMind after an amazing chapter. I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale. As I wrap up this chapter, I wrote down something I’ve been thinking about a lot: evals. We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations. https://wanglun1996.github.io/blog/your-evals-will-break.html

8:57 PM · May 17, 2026

QUOTE POST

#1457 Seán Ó hÉigeartaigh @S_OHEIGEARTAIGH

"We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations." https://wanglun1996.github.io/blog/your-evals-will-break.html

Lun Wang @lunwang1996

3:57 AM · May 18, 2026 · 547.4K Views

10:20 AM · May 19, 2026 · 814 Views

Sentiment

Positive: 83.3%

Negative: 16.7%

Users praise Lun Wang's call for self-evolving AI evaluations after his departure from DeepMind, agreeing that static benchmarks miss real skills, though some replies contain insults and accusations of doom-mongering.

14 comments with sentiment.

DIGG DEPTH

Koncep.to @KONCEPTOCHANNEL

How should we evaluate self-evolving AI models if, as Lun Wang argues, static benchmarks are destined to fail?

Shift to self-evolving evaluations that track order parameters for capability regime shifts, monitor meta-signals like score distribution changes, and let models generate adaptive tests to probe new behaviors. Lun Wang argues this is essential before self-evolving models arrive, as static benchmarks miss qualitative jumps such as strategic omission.
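
One of the meta-signals mentioned above, a change in the score distribution, can be illustrated with a minimal Python sketch. It is not taken from Lun Wang's post: the two-sample Kolmogorov-Smirnov statistic is a stand-in choice, and the function names, sample scores, and the 0.3 threshold are all hypothetical. The point is that two runs can have similar averages while the shape of per-item scores changes sharply.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of score samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(
            bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted)
        )
        for x in a_sorted + b_sorted
    )


def distribution_shifted(baseline_scores, new_scores, threshold=0.3):
    """Flag a possible regime shift when the per-item score distribution
    moves by more than `threshold` (an arbitrary, illustrative cutoff)."""
    return ks_statistic(baseline_scores, new_scores) > threshold


# Hypothetical per-item scores for two model checkpoints on one benchmark.
baseline = [0.2, 0.4, 0.4, 0.6, 0.8]   # scores spread around the middle
candidate = [0.0, 0.0, 1.0, 1.0, 1.0]  # scores split into all-or-nothing

print(ks_statistic(baseline, candidate))
print(distribution_shifted(baseline, candidate))
```

Here the candidate's scores are bimodal even though its mean is close to the baseline's, which is the kind of change an aggregate benchmark score would hide. A real monitor would need a principled threshold and far larger samples.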

Cluster engagement

40 snapshots