
Zhengyao Jiang introduces FML-Bench, arguing that recent ML research agent gains on MLE-Bench are driven by stronger base models, not algorithmic progress

With step budget and base model held fixed, the two-year-old AIDE algorithm matched modern agent and evolutionary search systems on a different set of tasks.


Original post


Zhengyao Jiang@ZHENGYAOJIANG

MLE-Bench scores have jumped from 30% to 80% over the last two years. But how much of that is real algorithmic progress vs. better base models + problem definition shifts + overfitting? Turns out: not much. Once you control for the same step budget and models, and then test on a different set of tasks, the two-year-old AIDE algorithm matches modern agent/evolutionary search systems. Figure from FML-Bench, a new automated ML research benchmark, which unifies the code editing agent, step definition, and val/test split, and tries to benchmark the algorithmic efficiency (search/memory) of the agents. paper link: https://arxiv.org/pdf/2605.17373

11:28 AM · May 29, 2026

QUOTE POST

Anirudh Goyal@ANIRUDHG9119

Research agents need better search control. : )

Zhengyao Jiang@zhengyaojiang

6:28 PM · May 29, 2026 · 8.2K Views

7:15 PM · May 29, 2026 · 27 Views

