Static benchmarks are dying: they saturate quickly.
Evaluation and training data should co-evolve with frontier models.
We released BenchEvolver, a framework that automatically evolves saturated problems into harder, verified tasks for evaluating frontier models. The evolved tasks can also serve as self-improvement signals for RL.
New work from UC Berkeley @berkeley_ai @BerkeleyRDI @BerkeleySky
Project Page: http://benchevolver.github.io
Paper: https://arxiv.org/abs/2606.01286