
Zhengyao Jiang introduces FML-Bench, arguing that recent ML research agent gains on MLE-Bench are driven by stronger base models, not algorithmic progress

The two-year-old AIDE algorithm matched modern agent performance.

Original post

MLE-Bench scores have jumped from 30% to 80% over the last two years. But how much of that is real algorithmic progress vs. better base models + problem definition shifts + overfitting? Turns out: not much. Once you control for the same step budget and models, and then test on a different set of tasks, the two-year-old AIDE algorithm matches modern agent/evolutionary search systems. Figure from FML-Bench, a new automated ML research benchmark, which unifies the code editing agent, step definition, and val/test split, and tries to benchmark the algorithmic efficiency (search/memory) of the agents. Paper link: https://arxiv.org/pdf/2605.17373

11:28 AM · May 29, 2026

Research agents need better search control. : )

Zhengyao Jiang @zhengyaojiang

6:28 PM · May 29, 2026 · 8.2K Views
7:15 PM · May 29, 2026 · 27 Views