Stronger agents will come not only from larger models, but also from better systems around them.
The problem is that many AI agents are judged as if the model alone did the work, even though the real behavior also depends on memory, tools, context, routing, checks, and permissions.
This setup around the agent is called the harness: the system that decides what the model sees, what tools it can use, what it remembers, and which actions get checked.
Progress should come from scaling this harness, especially three parts: better context control, more trustworthy memory, and better routing to tools or helper agents.
Long context is not the same as usable context, memory is not the same as trustworthy memory, and having many tools is not the same as knowing when to use them.
A stale note can be more dangerous than no note, because it gives the agent confidence exactly when it should re-check the world.
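One way to picture the stale-note problem is a memory read that refuses to answer once a note is too old. This is only a minimal sketch, not the paper's method: the `Note` class, the `max_age` freshness budget, and the `recall` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Note:
    key: str
    value: str
    written_at: datetime
    max_age: timedelta  # hypothetical per-note freshness budget


def recall(note: Note, now: datetime) -> Optional[str]:
    """Return the note's value while it is fresh; None means 're-check the world'."""
    if now - note.written_at > note.max_age:
        return None  # stale: do not let the agent act on it with confidence
    return note.value


# Illustrative usage with made-up values.
written = datetime(2025, 1, 1, tzinfo=timezone.utc)
note = Note("deploy_branch", "main", written, max_age=timedelta(hours=1))
fresh = recall(note, written + timedelta(minutes=30))
stale = recall(note, written + timedelta(hours=2))
```

Returning nothing for an expired note forces the caller to handle the missing value, which is the point: the agent has to look again instead of trusting an old answer.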
A specialized subagent can also fail quietly if its output sounds plausible but no later layer verifies whether it is true.
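A rough sketch of such a later layer, under loud assumptions: the subagent and verifier here are toy functions, and `run_with_check`, `SubagentResult`, `confident_subagent`, and `arithmetic_verifier` are hypothetical names, not part of any real framework. The idea is only that the claim is checked by something that does not rely on how convincing it sounds.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SubagentResult:
    claim: str
    verified: bool


def run_with_check(
    subagent: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    task: str,
) -> SubagentResult:
    """Run the subagent, then record whether an independent check agrees."""
    claim = subagent(task)
    return SubagentResult(claim=claim, verified=verifier(task, claim))


def confident_subagent(task: str) -> str:
    # Toy stand-in: a plausible-looking but wrong answer for "17 * 3".
    return "54"


def arithmetic_verifier(task: str, claim: str) -> bool:
    # Recompute the product independently instead of trusting the claim.
    left, right = task.split("*")
    return int(left) * int(right) == int(claim)


result = run_with_check(confident_subagent, arithmetic_verifier, "17 * 3")
```

Here `result.verified` is false, so the harness can reject or retry the step rather than pass the wrong number downstream.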
This is why one-shot benchmark scores feel increasingly thin.
Two agents can reach the same final answer, while one burns far more tokens, makes riskier tool calls, carries corrupted memory, or succeeds only by accident.
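To make that concrete, here is a small sketch that compares two runs by their traces rather than by the final answer alone. The `Trace` fields and the numbers are invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    answer: str
    tokens_used: int
    risky_tool_calls: int
    stale_memory_reads: int


def compare(a: Trace, b: Trace) -> dict:
    """Report how run b differs from run a beyond the final answer."""
    return {
        "same_answer": a.answer == b.answer,
        "token_ratio": b.tokens_used / a.tokens_used,
        "extra_risky_calls": b.risky_tool_calls - a.risky_tool_calls,
        "extra_stale_reads": b.stale_memory_reads - a.stale_memory_reads,
    }


# Made-up traces: same answer, very different paths.
careful = Trace(answer="42", tokens_used=1_000, risky_tool_calls=0, stale_memory_reads=0)
lucky = Trace(answer="42", tokens_used=8_000, risky_tool_calls=3, stale_memory_reads=2)
report = compare(careful, lucky)
```

A score that only looks at `same_answer` would rate these two runs as equal, while the other fields show one of them cost eight times the tokens and took three risky actions to get there.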
The next frontier is not just scaling the mind inside the machine.
It is scaling the discipline around it.
----
Link – arxiv.org/abs/2605.26112
Title: "From Model Scaling to System Scaling: Scaling the Harness in Agentic AI"
