19h ago

Paper Shows AI Agent Performance Depends More On Harnesses Than Prompts

0
Original post

This paper shows that agent performance depends less on prompts alone and more on the harness around them. “Agent intelligence” is becoming partly a systems problem. The problem is that many AI agents look like 1 model, but their real behavior comes from surrounding code that controls planning, tools, memory, retries, checking, and stopping. A model may reason well in one step, but long tasks fail in messier places: state disappears, verification drifts, tools return partial evidence, and the agent forgets which intermediate artifact actually matters. Natural-Language Agent Harnesses try to make that control layer visible. Instead of burying the logic in controller code, they express the stages, roles, contracts, state rules, failure modes, and stopping conditions in structured natural language that a shared runtime can execute. The claim is not that natural language should replace code, but that the important design choices around an agent should become inspectable, portable, and testable instead of hiding inside one framework’s habits. On SWE-bench, heavier harnessing changed behavior dramatically, with more calls, tools, delegation, and runtime, but it did not produce a simple win curve; sometimes added structure helped, and sometimes it pushed the agent away from the shortest benchmark-aligned repair. A harness is not magic scaffolding around a model; it is a set of bets about where reliability comes from. ---- Paper Link – arxiv. org/abs/2603.25723 Paper Title: "Natural-Language Agent Harnesses"

6:26 AM · May 23, 2026 View on X
Reposted by