Maksym Andriushchenko and researchers release InferenceBench, a benchmark evaluating AI agents on open-ended optimization of OpenAI-compatible LLM servers using latency and throughput metrics on H100 GPUs

Charles 🎉 Frye @ AIEng World's Fair@charles_irl

"'boring' tasks like inference speed optimization"

:<

Maksym Andriushchenko@maksym_andr

💥Today we release InferenceBench, our next benchmark after PostTrainBench that measures progress on AI R&D automation.

AI R&D automation will very likely unfold gradually, starting from “boring” tasks like inference speed optimization that are very easily verifiable (accuracy + inference time). We show a rather negative result for current frontier agents. They are not good at system-level engineering and managing complex dependencies. They do show non-trivial performance, but they fall short of a simple baseline: tuning the hyperparameters of vLLM/SGLang.

Importantly, InferenceBench tests *open-ended* inference optimization capabilities. This is different from more narrow benchmarks like KernelBench that only let agents optimize kernels (which is a very valuable task, too!). The benchmark is intentionally open-ended, so the poor performance of the agents is not an underelicitation issue. The agents have everything needed to succeed, but they still fail because they are not yet reliable enough for this task.

Our results suggest an inverse scaling phenomenon: Claude Sonnet 4.6 and GLM-5 rank highly because they more often preserve simple, valid, high-performing final servers, while several larger models show stronger peak runs but lose utility through brittle final-state choices. This contrasts with benchmarks where rankings track raw capability (e.g., SWE-Bench, Terminal-Bench, PostTrainBench, FrontierSWE).

One of the primary bottlenecks we have clearly observed is the lack of diversity of strategies: nearly all agents just use vLLM, without exploring alternatives. Overall, proper exploration is lacking: the current agents are not ready to tackle broad enough goals and get stuck on the first solution they find (such as vLLM). I’m sure future agents will do much better, but here is where we are now.

This benchmark is our 2nd one in a suite of benchmarks that will track the progress on AI R&D automation. We will develop many more benchmarks that will cover different aspects of AI R&D automation, culminating in recursive self-improvement. Stay tuned!
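
For readers who want a concrete picture of the "simple baseline" and the "accuracy + inference time" check mentioned above, here is a minimal sketch in Python. It is not the InferenceBench harness or the authors' method. The knob names mirror common vLLM engine arguments, but the search space, the `toy_measure` function, and all of its numbers are hypothetical stand-ins for actually launching a server and benchmarking it.

```python
from itertools import product

# Hypothetical search space. The knob names mirror common vLLM engine
# arguments, but the values chosen here are illustrative assumptions.
SEARCH_SPACE = {
    "max_num_seqs": [64, 128, 256],
    "gpu_memory_utilization": [0.80, 0.90],
    "enable_prefix_caching": [False, True],
}


def toy_measure(config):
    """Stand-in for starting a server with `config` and benchmarking it.

    Returns (accuracy, seconds). The numbers are synthetic, not measurements.
    """
    seconds = 100.0 - config["max_num_seqs"] / 8
    if config["gpu_memory_utilization"] >= 0.90:
        seconds -= 5.0
    if config["enable_prefix_caching"]:
        seconds -= 10.0
    # Pretend the most aggressive batch setting degrades output quality,
    # so the fastest configuration is not automatically a valid one.
    accuracy = 0.70 if config["max_num_seqs"] == 256 else 0.90
    return accuracy, seconds


def grid_search(space, measure, min_accuracy):
    """Return (config, seconds) for the fastest config that passes the
    accuracy gate, or None if no config passes."""
    keys = list(space)
    best = None
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        accuracy, seconds = measure(config)
        if accuracy < min_accuracy:
            continue  # invalid result: speed does not count without accuracy
        if best is None or seconds < best[1]:
            best = (config, seconds)
    return best


if __name__ == "__main__":
    print(grid_search(SEARCH_SPACE, toy_measure, min_accuracy=0.85))
```

The design point the sketch tries to capture is the gate: a configuration only counts if it stays above the accuracy threshold, so the search returns the fastest valid setup rather than the fastest one overall.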

RETWEETS2

Kevin A. Bryan@Afinetheorem

A very interesting benchmark of AI R&D automation because larger models in their current scaffold do not do better. Need more of these!

REPLIES2

finbarr@finbarrtimbers

This is consistent with my experience, where all the frontier models completely fail to reason about inflight updates and continually try to remove it from my codebase. It seems to indicate a lack of ability to reason about the mathematical consequences of the low level details.

L@llllvvuu

Nice, I also experienced that frontier models are pretty bad at inference engineering. I thought I was stupid because everyone else is saying GPT 5.5 xhigh can do anything meanwhile I have a session that’s been running for days and removed 0% of the overhead I asked it to

Maksym Andriushchenko@maksym_andr

💥 Check out this detailed thread from @jehyeoky248 about InferenceBench, our new paper that tracks capabilities relevant to AI R&D automation and recursive self-improvement.

Jehyeok Yeon @ ICML 2026 🇰🇷@jehyeoky248

AI R&D agents look great in demos. They write code, fix bugs, and propose research-shaped ideas. But what do researchers actually spend their time doing? Fighting dependency conflicts, noisy metrics, and configs that I’m pretty sure worked 20 minutes ago. Can agents do that? 🧵

Maksym Andriushchenko@maksym_andr

Website: https://inferencebench.ai/ Paper: https://inferencebench.ai/assets/paper.pdf

This work is led by @jehyeoky248 with support from @full__rank!

Maksym Andriushchenko@maksym_andr

@jehyeoky248 @full__rank A detailed thread from Tommy @jehyeoky248 about this work:

Jehyeok Yeon @ ICML 2026 🇰🇷@jehyeoky248

AI R&D agents look great in demos. They write code, fix bugs, and propose research-shaped ideas. But what do researchers actually spend their time doing? Fighting dependency conflicts, noisy metrics, and configs that I’m pretty sure worked 20 minutes ago. Can agents do that? 🧵

Maksym Andriushchenko@maksym_andr

@jehyeoky248 @full__rank A detailed thread from Ben @full__rank:

Botir Khaltaev@botir33751732

@charles_irl Bro does not know I love micro optimizations

Noema@noemaclips

@maksym_andr @jehyeoky248 @full__rank @xeophon @stalkermustang @spicey_lemonade you might want to check this one!

Florian Brand@xeophon

@maksym_andr @jehyeoky248 Another, very common Tübingen banger

Kris Gulati@krisgulati

@maksym_andr @jehyeoky248 @full__rank Great work! One of the most important questions people could be working on right now IMO!

Vikash Sehwag@VSehwag_

@maksym_andr @jehyeoky248 @full__rank Nice work!! Curious where 3.5 ranks on it.

Utkarsh Singh@Utkarsh51557661

@maksym_andr @jehyeoky248 @full__rank always starts with the small stuff. tiny wins lead to bigger shifts down the road.

Maksym Andriushchenko@maksym_andr

@jehyeoky248 @full__rank Also a detailed thread from Ben @full__rank:

Ben Rank@full__rank

Inference consumes a big share of frontier labs' compute.

What if we could change that, using AI agents themselves?

We built a benchmark for measuring how well AI can accelerate inference speed of LLMs.

And the results are quite surprising! 🧵

Charles 🎉 Frye@charles_irl

@botir33751732 http://modal.jobs

Stephen Fernandes@stephennfern

@maksym_andr @jehyeoky248 @full__rank I've been working on something in stealth in a similar domain. InferenceBench would absolutely complement my work.

would love to talk in DMs if you are up for it

Gregor@bygregorr

@maksym_andr @jehyeoky248 @full__rank benchmarks said the same about MLPerf in 2019. still measuring, still not automating.

Adel Bucetta@adelbucetta

@maksym_andr @jehyeoky248 @full__rank the real unlock isn't even a specific benchmark or automation task it's the collective momentum and feedback loops they create for research teams. that's what drives meaningful progress in ai r&d.

gerred@sloppenheimer

@charles_irl I love me some inference speed optimization :(
