OpenHands co-founder Graham Neubig launches the OpenHands Index to benchmark autonomous AI agents on combined LLM-harness performance

The index tests agent configurations across benchmarks such as SWE-Bench and GAIA.

Original post

Graham Neubig@gneubig

Specifically, we evaluated on the OpenHands Index, our composite benchmark that covers:
- Issue Resolution (SWE-Bench)
- Frontend Development (SWE-Bench Multimodal)
- Greenfield Development (commit0)
- Software Testing (SWT-Bench)
- Information Gathering (GAIA)

Graham Neubig@gneubig

Many coding agent benchmarks evaluate LLM performance (with the same harness) or harness performance (with the same LLM).

But in reality, it is the harness + the LLM that determines the overall performance.

We introduced new, holistic results that measure both.

9:49 AM · Jun 18, 2026 · 415 Views

Related links

OpenHands Index

openhands.dev

Posts from X

Graham Neubig@gneubig

The results are interesting. Overall the trends are:
- OpenHands generally outperforms Claude Code on both accuracy and cost when using Claude Opus
- OpenHands beats Codex on accuracy, but falls behind on cost when using GPT
- Gemini-CLI outperforms OpenHands when using Gemini

Graham Neubig@gneubig

We also have much more fine-grained results, along with the full set of agent trajectories across the 5 benchmarks, here: https://index.openhands.dev/alternative-agents

These results were enabled by our new ACP (Agent Client Protocol) support in the OpenHands SDK, which you can use for benchmarking and many other things. Learn more here:

OpenHands@OpenHandsDev

What if you could run any coding agent through a single interface, locally, remotely, or on the cloud?

Today we introduced Agent Client Protocol support to the OpenHands Agent Canvas, SDK, and Cloud, making this possible.

elvis@omarsar0

@gneubig Looks useful!
