
Maksym Andriushchenko and researchers release InferenceBench, a benchmark evaluating AI agents on open-ended optimization of OpenAI-compatible LLM servers using latency and throughput metrics on H100 GPUs

Frontier models fail to beat simple hyperparameter tuning baselines on full setups.

Original post

💥Today we release InferenceBench, our next benchmark after PostTrainBench that measures progress on AI R&D automation. AI R&D automation will very likely unfold gradually, starting from “boring” tasks like inference speed optimization that are very easily verifiable (accuracy + inference time).

We show a rather negative result for current frontier agents. They are not good at system-level engineering and managing complex dependencies. They do show non-trivial performance, but they fail compared to a simple baseline: hyperparameter tuning of vLLM/SGLang hyperparameters.

Importantly, InferenceBench tests *open-ended* inference optimization capabilities. This is different from more narrow benchmarks like KernelBench that only let agents optimize kernels (which is a very valuable task, too!). The benchmark is intentionally open-ended, so the poor performance of the agents is not an underelicitation issue. The agents have everything needed to succeed, but they still fail because they are not yet reliable enough for this task.

Our results suggest an inverse scaling phenomenon: Claude Sonnet 4.6 and GLM-5 rank highly because they more often preserve simple, valid, high-performing final servers, while several larger models show stronger peak runs but lose utility through brittle final-state choices. This contrasts with benchmarks where rankings track raw capability (e.g., SWE-Bench, Terminal-Bench, PostTrainBench, FrontierSWE).

One of the primary bottlenecks we have clearly observed is the lack of diversity of strategies: nearly all agents just use vLLM, without exploring alternatives. Overall, proper exploration is lacking: the current agents are not ready to tackle broad enough goals and get stuck after the first found solution (such as vLLM). I’m sure future agents will do much better, but here is where we are now.

This benchmark is our 2nd one in a suite of benchmarks that will track the progress on AI R&D automation. We will develop many more benchmarks that will cover different aspects of AI R&D automation, culminating in recursive self-improvement. Stay tuned!

7:29 AM · May 20, 2026 View on X
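
The "simple baseline" in the post, hyperparameter tuning scored on accuracy plus inference time, can be pictured roughly as a grid search with a validity gate. This is only a minimal sketch under stated assumptions: `ServerConfig`, its two knobs, and the synthetic `measure` function are hypothetical stand-ins for illustration, and none of them come from InferenceBench, vLLM, or SGLang.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerConfig:
    # Hypothetical tuning knobs, named for illustration only.
    max_batch_size: int
    gpu_memory_fraction: float


def measure(config):
    """Synthetic stand-in for launching a server and benchmarking it.

    Returns (accuracy, seconds_per_request). A real harness would start the
    server, send requests, and score the responses instead.
    """
    # Assumption: an over-aggressive memory setting yields a broken server,
    # modeled here as zero accuracy.
    accuracy = 0.0 if config.gpu_memory_fraction > 0.95 else 0.9
    # Assumption: larger batches and more memory both reduce latency.
    latency = 2.0 / config.max_batch_size + (1.0 - config.gpu_memory_fraction)
    return accuracy, latency


def grid_search(batch_sizes, memory_fractions, min_accuracy):
    """Return (config, latency) for the fastest config that stays valid."""
    best = None
    for batch, memory in itertools.product(batch_sizes, memory_fractions):
        config = ServerConfig(batch, memory)
        accuracy, latency = measure(config)
        if accuracy < min_accuracy:
            continue  # fast but invalid servers do not count
        if best is None or latency < best[1]:
            best = (config, latency)
    return best


best = grid_search([8, 16, 32, 64], [0.80, 0.90, 0.98], min_accuracy=0.85)
print(best)
```

In this toy setup the fastest raw configuration is rejected because it fails the accuracy gate, which mirrors the post's point that a strong peak run is worth little if the final server is not valid.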

"'boring' tasks like inference speed optimization"

:<

4:09 PM · May 20, 2026 · 200 Views

Website: https://inferencebench.ai/ Paper: https://inferencebench.ai/assets/paper.pdf

This work is led by @jehyeoky248 with support from @full__rank!

2:29 PM · May 20, 2026 · 517 Views

@jehyeoky248 @full__rank A detailed thread from Tommy @jehyeoky248 about this work:

Jehyeok Yeon @ ICML 2026 🇰🇷 @jehyeoky248

AI R&D agents look great in demos. They write code, fix bugs, and propose research-shaped ideas. But what do researchers actually spend their time doing? Fighting dependency conflicts, noisy metrics, and configs that I’m pretty sure worked 20 minutes ago. Can agents do that? 🧵

2:19 PM · May 20, 2026 · 5.7K Views
2:32 PM · May 20, 2026 · 566 Views

💥 Check out this detailed thread from @jehyeoky248 about InferenceBench, our new paper that tracks capabilities relevant to AI R&D automation and recursive self-improvement.

2:31 PM · May 20, 2026 · 1.5K Views

@maksym_andr @jehyeoky248 Another, very common Tübingen banger

2:43 PM · May 20, 2026 · 90 Views