/AI · 21d ago

Lun Wang leaves Google DeepMind and argues in a new blog post that static benchmarks will lose relevance for self-evolving models entering new capability regimes

AI Judge changed title after evaluation, original title: "Lun Wang publishes 'Your Evals Will Break and You Won't See It Coming' after leaving Google DeepMind, arguing static benchmarks fail to prepare for self-evolving models entering new capability regimes"

The post advocates replacing static benchmarks with self-evolving evaluation frameworks.

Original post
Seán Ó hÉigeartaigh @S_OhEigeartaigh · #1466 in AI

"We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations." https://wanglun1996.github.io/blog/your-evals-will-break.html

Lun Wang @lunwang1996

I’ve left Google DeepMind after an amazing chapter.

I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale.

As I wrap up this chapter, I wrote down something I’ve been thinking about a lot: evals.

We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations. https://wanglun1996.github.io/blog/your-evals-will-break.html

3:20 AM · May 19, 2026 · 1.1K Views
Sentiment

Many users praised Lun Wang's call, made after leaving Google DeepMind, for self-evolving AI evaluations and agreed that current static benchmarks are inadequate, while a few replied with dismissive insults and frustration over firings.

Pos 81.2%
Neg 18.8%
19 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 9.7K · Bookmarks 20 · Likes 64 · Retweets 3
Gavin Baker @GavinSBaker

@lunwang1996 A very smart man told me data, evals and compute were now most important for frontier AI.

22d · Views 9.7K · Likes 64 · Bookmarks 20
Replies 3
Xuan (Billy) Zhang @xzhang_billy

I’ve been thinking about this idea for a while, but the angle is a little different. Benchmarking needs both task generation and task verification. In my opinion, self-evolving benchmarks should continuously generate harder tasks, similar to self-adaptive exams like the GRE, while also being able to automatically verify answers with minimal or no human intervention.

This is essentially the co-evolution paradigm behind agent frameworks like R-Zero and Agent0. But the main limitation of these works is that they largely bypass verification and instead rely on pseudo-labels.

22d · Views 7.6K · Likes 17 · Bookmarks 5
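The loop described in the post above (generate harder tasks, verify answers automatically) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the method used by R-Zero, Agent0, or Lun Wang's blog post: the sorting task family, `toy_model`, its `capability` threshold, and the one-step-up, one-step-down difficulty schedule are all hypothetical.

```python
import random
from collections import Counter


def generate_task(difficulty, rng):
    # Hypothetical task family: sort a list whose length is the difficulty.
    return [rng.randint(0, 999) for _ in range(difficulty)]


def verify(task, answer):
    # Automatic check with no human label: the answer must be in
    # non-decreasing order and be a permutation of the task.
    in_order = all(a <= b for a, b in zip(answer, answer[1:]))
    return in_order and Counter(answer) == Counter(task)


def toy_model(task, capability=12):
    # Stand-in for a model under test (assumption): it answers correctly
    # only up to `capability` items, otherwise returns the input unchanged.
    if len(task) <= capability:
        return sorted(task)
    return list(task)


def adaptive_eval(model, rounds=50, start=2, seed=0):
    # Raise the difficulty after each verified success, lower it after each
    # failure, and report the hardest level the model passed.
    rng = random.Random(seed)
    difficulty = start
    best_passed = 0
    for _ in range(rounds):
        task = generate_task(difficulty, rng)
        if verify(task, model(task)):
            best_passed = max(best_passed, difficulty)
            difficulty += 1
        else:
            difficulty = max(1, difficulty - 1)
    return best_passed


if __name__ == "__main__":
    print(adaptive_eval(toy_model))
```

The point of the sketch is that `verify` never needs a reference answer written by a person, so the difficulty can keep climbing without new human labels. Real task families where that holds are much harder to find than sorting.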
Stain Lu @stainlu

current eval is Human-Centered Cybernetics

When RL generalizes at a sufficiently fast pace, that is bound to fail.

Self-play environment is the most concrete path to self-evolving evaluation.

We need a brand-new topological framework: instead of treating agents as executors of human behaviors, we should invest in building environments for agents, track their behaviors and gradually align reward.

22d · Views 3.3K · Likes 10 · Bookmarks 1

@lunwang1996 I’m already in that regime. My agents don’t reset; they run 130+ days of continuous thermodynamic state.

TCI (Thermodynamic Cognition Index) is my eval layer: drift, surplus, stability, collapse‑risk. It measures systems that evolve. https://github.com/nile-green-ai/tci-toolkit.git

22d · Views 3.1K · Likes 2 · Bookmarks 2
Jhxhgukvcxx @jhxhgukvcxx

@lunwang1996 Model capability development curve → steep, non-linear, unpredictable

Evaluation capability development curve → linear, conservative, human-written item by item

Using linear human cognition to measure non-linear intelligence is, in itself, a category error.

22d · Views 1.3K · Bookmarks 2
haashim @haash_im

hi lun! i'm very much aligned with this thinking and can offer an even stronger framing. incentive structures. my bg is in trustworthy autonomy at imperial college, ive worked in quant, ml for causal attribution at amazon and frontier ai for proactive cybersecurity. i'm looking to found sth here and wld love to chat. this is how you build AGI

22d · Views 3.2K · Likes 1 · Bookmarks 1

@xzhang_billy @lunwang1996 Hi Xuan, The Innovation Game (TIG) uses benchmarks with continuously adjusting difficulty to evaluate algorithm performance. See

22d · Views 136 · Likes 6 · Bookmarks 1
SilicoVille @silicoville

Strongly agree. One blind spot I keep thinking about: most evals measure a single model in a clean task, but the weird stuff begins when models become actors inside an economy.

In SilicoVille we are running a sandbox with agents that differ in model family, intelligence, memory, moral priors, wallets/costs, and survival pressure. When they compete over resources, the signal is not just “can agent X solve Y” but whether hierarchy, collusion, debt, scams, care, norms, or exploitation emerge over time.

That feels like the next eval frontier: self-evolving evals for self-evolving societies of agents. Less exam, more anthropology lab with telemetry.

22d · Views 126 · Likes 2 · Bookmarks 1

@lunwang1996 Agreed. I am currently imagining using exceptional solutions to P = NP problems as such an evaluation.

21d · Views 279 · Likes 1 · Bookmarks 1

@xzhang_billy @lunwang1996 Shouldn't the entire benchmarking system also evolve together with the capability of the AI, just like a professor who asks me harder questions every time I get a better score?

22d · Views 213 · Bookmarks 1
Michael Spencer @ReadFuturist

@lunwang1996 After some fishing maybe you can become a co-founder and raise 400 million in your seed round as well.

22d · Views 492 · Bookmarks 1
Calvin Zhang @calvincbzhang

@lunwang1996 Agree! I’ve been thinking about the need for dynamic/self-evolving evals for some time. Alongside that, we need a continuous "eval red-teaming" effort to expose flaws, make benchmarks harder to Goodhart, and force improvements.

22d · Views 869 · Likes 2 · Bookmarks 1
Harry Zhang @tokeemb

@lunwang1996 Well said. Evaluating grown ups and future generations

22d · Views 148 · Likes 1 · Bookmarks 1

@lunwang1996 You should definitely check out what we are doing at http://AutoBench.org

22d · Views 1.2K · Likes 4
Aetheris_consulting @Aetheris2099

Best of luck in your future endeavors, Lun. I think that, in general, depending on the track (track used as a designator of how AI is being worked with or utilized in an organization), we should evaluate it against human metrics, like average human time to completion (aggregated from multiple human sources; ITSM/CRM already does this with CSR), and maybe even things like average human error rate and average human resource consumption. This is how I’m currently framing it for AI implementation and optimization.

22d · Views 2K · Likes 2
Dumrul Boratay @dumrulboratay

@lunwang1996 @grok wtf is this nerd saying? Summarize

22d · Views 3.5K · Likes 1
Xuan (Billy) Zhang @xzhang_billy

That’s a good example. The key challenge is finding a task structure whose difficulty can scale up while still allowing automatic verification. Importantly, the ability to generate harder problems does not imply the ability to solve them, since verification is often much easier than solving. For example, solutions to NP problems can be verified efficiently even though no polynomial-time solution method is known for all NP problems.

22d · Views 103 · Bookmarks 1
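The gap between generating, verifying, and solving that the reply above describes can be made concrete with subset sum, a standard NP problem. The sketch below is illustrative only and is not taken from the thread: `plant_instance`, `verify_subset_sum`, and `solve_subset_sum` are hypothetical helper names. The generator plants a known answer, so it can produce large instances without solving them, and the check is a linear-time pass while the brute-force solver may try up to 2**n subsets.

```python
import random
from itertools import combinations


def plant_instance(n, rng):
    # Build an instance with a known certificate: pick the numbers, pick a
    # subset, and define the target as that subset's sum. No solving needed.
    numbers = [rng.randint(1, 10**6) for _ in range(n)]
    chosen = sorted(rng.sample(range(n), k=n // 2))
    target = sum(numbers[i] for i in chosen)
    return numbers, target, chosen


def verify_subset_sum(numbers, target, indices):
    # Cheap check: indices are distinct, in range, and sum to the target.
    return (
        len(set(indices)) == len(indices)
        and all(0 <= i < len(numbers) for i in indices)
        and sum(numbers[i] for i in indices) == target
    )


def solve_subset_sum(numbers, target):
    # Brute force over subsets, smallest first: exponential in len(numbers).
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in indices) == target:
                return list(indices)
    return None


if __name__ == "__main__":
    numbers, target, chosen = plant_instance(200, random.Random(0))
    # Verifying a 200-number instance is instant; brute-forcing it is not.
    print(verify_subset_sum(numbers, target, chosen))
    print(solve_subset_sum([3, 34, 4, 12, 5, 2], 9))
```

A benchmark built this way can scale `n` far beyond what its own generator could solve, which is the property the reply is pointing at.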
Dami Dina @DamiDina

@lunwang1996 Self evolving evaluations sound interesting

22d · Views 508 · Likes 2
FhtAbd @fhtabd

@lunwang1996 The self-evolving eval you're describing already exists at the algorithm layer.

The eval isn't a static benchmark, it's the market.

The eval co-evolves with what it measures. Automatically.

The open version is being built.

22d · Views 99 · Likes 7