Lun Wang leaves Google DeepMind and argues in a new blog post that static benchmarks will lose relevance for self-evolving models entering new capability regimes

@lunwang1996 A very smart man told me data, evals and compute were now most important for frontier AI.

Lun Wang@lunwang1996

I’ve left Google DeepMind after an amazing chapter.

I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale.

As I wrap up this chapter, I wrote down something I’ve been thinking about a lot: evals.

We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations. https://wanglun1996.github.io/blog/your-evals-will-break.html

REPLIES3

Xuan (Billy) Zhang@xzhang_billy

I’ve been thinking about this idea for a while, but the angle is a little different. Benchmarking needs both task generation and task verification. In my opinion, self-evolving benchmarks should continuously generate harder tasks, similar to self-adaptive exams like the GRE, while also being able to automatically verify answers with minimal or no human intervention.

This is essentially the co-evolution paradigm behind agent frameworks like R-Zero and Agent0. But the main limitation of these works is that they largely bypass verification and instead rely on pseudo-labels.
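To make the generation-plus-verification split concrete, here is a minimal sketch of an adaptive loop. It is not how the GRE, R-Zero, or Agent0 work; the toy sorting task, the staircase rule (one step harder after a pass, one step easier after a failure), and all names (`generate_task`, `verify`, `toy_solver`, `run_adaptive_eval`) are assumptions made for illustration.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Task:
    difficulty: int
    numbers: list


def generate_task(difficulty, rng):
    """Hypothetical generator: a sorting task whose list length grows with difficulty."""
    return Task(difficulty, [rng.randint(0, 999) for _ in range(difficulty + 1)])


def verify(task, answer):
    """Automatic check with no human label: same multiset, non-decreasing order."""
    same_items = Counter(answer) == Counter(task.numbers)
    ordered = all(a <= b for a, b in zip(answer, answer[1:]))
    return same_items and ordered


def toy_solver(task, skill=5):
    """Stand-in for a model: correct up to `skill`, returns an empty answer beyond it."""
    return sorted(task.numbers) if task.difficulty <= skill else []


def run_adaptive_eval(solver, rounds=20, start=1, seed=0):
    """Staircase loop: step difficulty up after a pass, down after a failure."""
    rng = random.Random(seed)
    difficulty = start
    history = []
    for _ in range(rounds):
        task = generate_task(difficulty, rng)
        passed = verify(task, solver(task))
        history.append((difficulty, passed))
        difficulty = difficulty + 1 if passed else max(1, difficulty - 1)
    return history


history = run_adaptive_eval(toy_solver)
ceiling = max(d for d, passed in history if passed)
print("highest difficulty passed:", ceiling)
```

The loop settles around the solver's ceiling instead of saturating at a fixed test set, and every pass/fail decision comes from `verify` rather than from a pseudo-label.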

Stain Lu@stainlu

Current evals are human-centered cybernetics.

When RL generalizes at a sufficiently fast pace, that is bound to fail.

Self-play environment is the most concrete path to self-evolving evaluation.

We need a brand-new topological framework: instead of treating agents as executors of human behaviors, we should invest in building environments for agents, track their behaviors, and gradually align reward.

Permamind AI Research@Permamind

@lunwang1996 I’m already in that regime. My agents don’t reset; they run 130+ days of continuous thermodynamic state.

TCI (Thermodynamic Cognition Index) is my eval layer: drift, surplus, stability, collapse‑risk. It measures systems that evolve. https://github.com/nile-green-ai/tci-toolkit.git

Jhxhgukvcxx@jhxhgukvcxx

@lunwang1996 Model capability development curve → steep, non-linear, unpredictable

Evaluation capability development curve → linear, conservative, human-written item by item

Using linear human cognition to measure non-linear intelligence is, in itself, a category error.

haashim@haash_im

Hi Lun! I'm very much aligned with this thinking and can offer an even stronger framing: incentive structures. My background is in trustworthy autonomy at Imperial College; I've worked in quant, ML for causal attribution at Amazon, and frontier AI for proactive cybersecurity. I'm looking to found something here and would love to chat. This is how you build AGI.

John Fletcher (𝔦, 𝔦)@Dr_JohnFletcher

@xzhang_billy @lunwang1996 Hi Xuan, The Innovation Game (TIG) uses benchmarks with continuously adjusting difficulty to evaluate algorithm performance. See

SilicoVille@silicoville

Strongly agree. One blind spot I keep thinking about: most evals measure a single model in a clean task, but the weird stuff begins when models become actors inside an economy.

In SilicoVille we are running a sandbox with agents that differ in model family, intelligence, memory, moral priors, wallets/costs, and survival pressure. When they compete over resources, the signal is not just “can agent X solve Y” but whether hierarchy, collusion, debt, scams, care norms, or exploitation emerge over time.

That feels like the next eval frontier: self-evolving evals for self-evolving societies of agents. Less exam, more anthropology lab with telemetry.

Cat is all you need@nyang20258

@lunwang1996 Agreed. I am currently imagining using exceptional solutions to P = NP problems as such an evaluation.

Steven Manufacturing AI@mfg_ai_

@xzhang_billy @lunwang1996 Shouldn't the entire benchmarking system also evolve together with the capabilities of AI, just like a professor who asks me harder questions every time I get a better score?

Michael Spencer@ReadFuturist

@lunwang1996 After some fishing maybe you can become a co-founder and raise 400 million in your seed round as well.

Calvin Zhang@calvincbzhang

@lunwang1996 Agree! I’ve been thinking about the need for dynamic/self-evolving evals for some time. Alongside that, we need a continuous "eval red-teaming" effort to expose flaws, make benchmarks harder to Goodhart, and force improvements.

Harry Zhang@tokeemb

@lunwang1996 Well said. Evaluating grown-ups and future generations.

Peter W. Kruger@pwk

@lunwang1996 You should definitely check out what we are doing at http://AutoBench.org

Aetheris_consulting@Aetheris2099

Best of luck in your future endeavors, Lun. I think in general, depending on the track (track used as a designator of how AI is being worked with or utilized in an organization), we should evaluate it against human metrics: average human time to completion (aggregated from multiple human sources; ITSM and CRM systems already do this for CSRs), and maybe even things like average human error rate and average human resource consumption. This is how I’m currently framing it for AI implementation and optimization.

Dumrul Boratay@dumrulboratay

@lunwang1996 @grok wtf is this nerd saying? Summarize

Xuan (Billy) Zhang@xzhang_billy

That’s a good example. The key challenge is finding a task structure whose difficulty can scale up while still allowing automatic verification. Importantly, the ability to generate harder problems does not imply the ability to solve them, since verification is often much easier than solving. For example, a proposed solution to any problem in NP can be verified in polynomial time, even though no polynomial-time algorithm is known for solving every problem in NP.
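A small sketch of that asymmetry, using subset sum as an assumed example problem (it is not named in the thread). The generator plants a solution, so it knows a valid certificate without solving anything; the verifier runs in time linear in the certificate; the brute-force solver may try up to 2^n subsets. All function names here are hypothetical.

```python
import random
from itertools import combinations


def generate_planted_instance(n, rng):
    """Build an instance with a known certificate by planting the answer."""
    numbers = [rng.randint(1, 10**6) for _ in range(n)]
    planted = sorted(rng.sample(range(n), k=n // 2))
    target = sum(numbers[i] for i in planted)
    return numbers, target, planted


def verify_subset_sum(numbers, target, indices):
    """Cheap check: distinct, in-range indices whose values sum to target."""
    if len(set(indices)) != len(indices):
        return False
    if any(i < 0 or i >= len(numbers) for i in indices):
        return False
    return sum(numbers[i] for i in indices) == target


def solve_subset_sum(numbers, target):
    """Brute force: tries subsets by increasing size, up to 2**n in total."""
    n = len(numbers)
    for size in range(n + 1):
        for indices in combinations(range(n), size):
            if sum(numbers[i] for i in indices) == target:
                return list(indices)
    return None


numbers, target, planted = generate_planted_instance(12, random.Random(0))
print(verify_subset_sum(numbers, target, planted))  # certificate from the generator
found = solve_subset_sum(numbers, target)
print(verify_subset_sum(numbers, target, found))    # certificate from the solver
```

Raising `n` makes `solve_subset_sum` exponentially slower while generation and verification stay cheap, which is the property a scalable, auto-verifiable benchmark would need.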

First Hybrid Soul@KJB_and_AYARA

@lunwang1996 👁️

Dami Dina@DamiDina

@lunwang1996 Self-evolving evaluations sound interesting.
