Zhengyao Jiang introduces FML-Bench, arguing that recent gains by ML research agents on MLE-Bench are driven mostly by stronger base models, not algorithmic progress.
The two-year-old AIDE algorithm matched modern agent performance.
Research agents need better search control. : )
MLE-Bench scores have jumped from 30% to 80% over the last two years. But how much of that is real algorithmic progress vs. better base models + shifts in problem definition + overfitting? Turns out: not much. Once you control for step budget and base model, and then test on a different set of tasks, the two-year-old AIDE algorithm matches modern agent/evolutionary search systems. The figure is from FML-Bench, a new automated ML research benchmark that unifies the code-editing agent, the step definition, and the val/test split, so that what gets benchmarked is the algorithmic efficiency (search/memory) of the agents. Paper link: https://arxiv.org/pdf/2605.17373
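To make "same step budget" and "val/test split" concrete, here is a minimal sketch of a controlled comparison. It is not FML-Bench's code. The toy task, the two proposers, and every name in it (`toy_task`, `random_search`, `greedy_search`, `run_agent`) are hypothetical, made up for illustration. Each search strategy gets exactly the same number of evaluation steps, picks its final solution using the validation score only, and is then judged on the test score.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    solution: float
    val_score: float
    test_score: float


def toy_task(solution: float, rng: random.Random) -> Step:
    # Hypothetical task: the true optimum is at 1.0, and the validation
    # score is a noisy view of the test score.
    test = -(solution - 1.0) ** 2
    val = test + rng.gauss(0.0, 0.05)
    return Step(solution, val, test)


# A proposer sees the history so far and returns the next solution to try.
Proposer = Callable[[List[Step], random.Random], float]


def random_search(history: List[Step], rng: random.Random) -> float:
    # Ignores history entirely.
    return rng.uniform(-3.0, 3.0)


def greedy_search(history: List[Step], rng: random.Random) -> float:
    # Perturbs the best-on-validation solution found so far.
    if not history:
        return rng.uniform(-3.0, 3.0)
    best = max(history, key=lambda s: s.val_score)
    return best.solution + rng.gauss(0.0, 0.3)


def run_agent(propose: Proposer, step_budget: int, seed: int) -> Tuple[Step, List[Step]]:
    # One step = one proposed solution that gets evaluated.
    rng = random.Random(seed)
    history: List[Step] = []
    for _ in range(step_budget):
        history.append(toy_task(propose(history, rng), rng))
    # Select on validation only; the test score is read after selection.
    selected = max(history, key=lambda s: s.val_score)
    return selected, history


if __name__ == "__main__":
    for name, proposer in [("random", random_search), ("greedy", greedy_search)]:
        selected, history = run_agent(proposer, step_budget=20, seed=0)
        print(name, "steps:", len(history), "test:", round(selected.test_score, 3))
```

The point of the design is that the only thing varying between runs is the proposer, so any difference in test score can be attributed to the search strategy rather than to extra steps or to selecting on the test set.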