Specifically, we evaluated on the OpenHands Index, our composite benchmark that covers:

- Issue Resolution (SWE-Bench)
- Frontend Development (SWE-Bench Multimodal)
- Greenfield Development (commit0)
- Software Testing (SWT-Bench)
- Information Gathering (GAIA)

Many coding agent benchmarks evaluate LLM performance (with the same harness) or harness performance (with the same LLM).
But in reality, it is the harness + the LLM that determines the overall performance.
We introduced new, holistic results that measure the two together.