Lun Wang departs Google DeepMind and argues evaluations determine training objectives, safety layers, scaling decisions, and safe capability transitions for frontier AI systems
He called for adaptive evaluations as models cross new thresholds.
Your model is what you measure

I’ve left Google DeepMind after an amazing chapter. I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale. As I wrap up this chapter, I wrote down something I’ve been thinking about a lot: evals. We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations. https://wanglun1996.github.io/blog/your-evals-will-break.html
this parts even better
