Better AI agent systems scale by remembering useful feedback, not by spending more compute.
The simple mistake is to count tokens, calls, or dollars as if they were all evidence.
The authors say those numbers miss the real issue: two runs can spend the same budget while only one of them gets feedback that is correct, new, relevant, and remembered.
An agent harness is not just a wrapper around a model; it is a feedback machine that decides what to test, what to trust, what to store, and what to ignore.
Their answer is Effective Feedback Compute, or EFC, a score that counts feedback only when it teaches the agent something useful and changes later decisions.
They also divide EFC by task demand, because a small lookup task and a messy software-repair task need different amounts of helpful feedback before the agent has enough to solve them.
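To make the idea concrete, here is a minimal sketch of how such a score could be computed. It is not the paper's formula, which this summary does not give. It assumes that each piece of feedback either passes or fails four yes/no checks (correct, new, relevant, remembered), that only the compute spent on fully passing feedback counts, and that task demand is a single number in the same units. The names `FeedbackEvent`, `raw_compute`, `effective_feedback_compute`, and `normalized_efc` are hypothetical, and the example numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One piece of feedback an agent received (hypothetical structure)."""
    cost: float      # raw compute spent to obtain it (tokens, calls, or dollars)
    correct: bool    # the feedback was accurate
    novel: bool      # it told the agent something it did not already know
    relevant: bool   # it bore on the task at hand
    retained: bool   # it was stored and available to later decisions


def raw_compute(events):
    """Total spend, counting every event whether or not it was useful."""
    return sum(e.cost for e in events)


def effective_feedback_compute(events):
    """Spend on feedback that passes all four checks (assumed gating rule)."""
    return sum(
        e.cost
        for e in events
        if e.correct and e.novel and e.relevant and e.retained
    )


def normalized_efc(events, task_demand):
    """Effective feedback divided by how much the task is assumed to need."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    return effective_feedback_compute(events) / task_demand


# Two illustrative runs with the same raw budget (made-up numbers).
useful = FeedbackEvent(cost=10.0, correct=True, novel=True, relevant=True, retained=True)
forgotten = FeedbackEvent(cost=10.0, correct=True, novel=True, relevant=True, retained=False)

run_a = [useful, useful, useful, useful]
run_b = [useful, forgotten, forgotten, forgotten]

print(raw_compute(run_a), raw_compute(run_b))  # same raw budget for both runs
print(normalized_efc(run_a, task_demand=40.0))
print(normalized_efc(run_b, task_demand=40.0))
```

Both runs spend the same raw budget, but the second run forgets most of what it learned, so its normalized score is a quarter of the first. A real implementation would likely use graded rather than yes/no checks; the all-or-nothing rule here is only the simplest choice.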
They tested this on synthetic tasks, code tasks with executable tests, real benchmark traces, held-out settings, and a new prospective batch, then compared EFC with raw compute and a strong agent-scaling baseline.
The main result is that task-normalized EFC predicted failures much better than raw compute. In one matched-budget test, better feedback raised success from 0.27 to 0.90 while cost and tool calls stayed fixed.
----
Link: arxiv.org/abs/2605.29682
Title: "Scaling Laws for Agent Harnesses via Effective Feedback Compute"