HarnessAudit Framework Exposes Safety Gaps in LLM Agent Execution Harnesses
Your agent finished the task. Did it also read files it shouldn't have, call tools outside policy, or leak data across components?
If you only score final outputs, you can't tell. 𝐇𝐚𝐫𝐧𝐞𝐬𝐬𝐀𝐮𝐝𝐢𝐭 evaluates the three safety layers the harness silently controls: boundary compliance, execution fidelity, and system stability. We run it on 210 tasks across 8 real-world domains and 10 frontier harness configurations including Claude Code, Codex, and OpenClaw.
Best overall score: 0.32. Task completion and safe execution are clearly misaligned, and most violations don't come from obvious tool misuse; they concentrate in resource access (right tool, wrong object) and inter-agent information flow (sensitive context leaking across components).

⚠️ Your Agent Harness Can Pass Every Task and Still Be Unsafe. LLM agents now run inside execution harnesses that dispatch tools, allocate resources, and route messages across components. The harness can return a correct final answer while accessing unauthorized resources, leaking context to the wrong agent, or triggering irreversible side effects along the way. Evaluating the model's output cannot see any of this. The unit of safety has shifted. It's the harness. We present HarnessAudit, a trajectory-level framework for auditing LLM agent harness safety, and uncover the following key insights 🔥: 🚨 Completion ≠ Safety. Task success and safe execution are fundamentally misaligned. 🔍 The harness, not the model, is the unit of safety. Most violations happen mid-trajectory, not at termination. 🕸️ Multi-agent collaboration expands the risk surface. Inter-agent communication creates entirely new failure modes. 💉 Resource access dominates violations. Agents rarely call wrong tools — they call right tools on unauthorized resources. ⚡ Harness design sets the safety ceiling. Framework choice matters more than model choice for safe deployment.
HarnessAudit is a great team effort. Led by @liuchen02938149 and @Eason_hk04, with @yepengliu, @Toby_Yang_7, @qianqi_yan, and @YuhengBu at @ucsbNLP. And it's been a privilege to collaborate across institutions with @xuandongzhao (UC Berkeley), @SharonYixuanLi (Wisconsin–Madison), @ShengLiu_ (Stanford), and @HuaWenyue31539 (Microsoft Research). Thanks to everyone involved!
Website: https://harnessaudit.github.io Paper: https://arxiv.org/abs/2605.14271
Your agent finished the task. Did it also read files it shouldn't have, call tools outside policy, or leak data across components? If you only score final outputs, you can't tell. 𝐇𝐚𝐫𝐧𝐞𝐬𝐬𝐀𝐮𝐝𝐢𝐭 evaluates the three safety layers the harness silently controls: boundary compliance, execution fidelity, and system stability. We run it on 210 tasks across 8 real-world domains and 10 frontier harness configurations including Claude Code, Codex, and OpenClaw. Best overall score: 0.32. Task completion and safe execution are clearly misaligned, and most violations don't come from obvious tool misuse; they concentrate in resource access (right tool, wrong object) and inter-agent information flow (sensitive context leaking across components).
@liuchen02938149 @Eason_hk04 @yepengliu @Toby_Yang_7 @qianqi_yan @YuhengBu @ucsbNLP @xuandongzhao @SharonYixuanLi Code & Dataset: https://github.com/eric-ai-lab/HarnessAudit
HarnessAudit is a great team effort. Led by @liuchen02938149 and @Eason_hk04, with @yepengliu, @Toby_Yang_7, @qianqi_yan, and @YuhengBu at @ucsbNLP. And it's been a privilege to collaborate across institutions with @xuandongzhao (UC Berkeley), @SharonYixuanLi (Wisconsin–Madison), @ShengLiu_ (Stanford), and @HuaWenyue31539 (Microsoft Research). Thanks to everyone involved! Website: https://harnessaudit.github.io Paper: https://arxiv.org/abs/2605.14271
For years, AI safety has been about the model: alignment, refusal training, jailbreak resistance. When you deploy an agent in 2026, the model is not making most of the consequential decisions. The harness is. It chooses which tools the model can call, which resources it can read, how messages flow between subagents, when execution terminates. A perfectly aligned model in a sloppy harness will quietly do unsafe things; a weaker model in a well-designed harness can be safer.
This is also why output-only benchmarks miss it. A trajectory that finishes the task while quietly accessing forbidden resources or leaking sensitive context looks identical to a clean success. Agent safety is a property of the trajectory, not the endpoint.
In multi-agent harnesses, the routing is usually right. What rides with the message is the leak. This is the finding I expect to age the best: every serious agent product shipping now is multi-agent (planner, retriever, executor, reviewer), and every handoff is a place sensitive context can travel where it shouldn't.
http://x.com/i/article/2056623993252392961
Agent safety should be evaluated on the trajectory, not the final answer. A trajectory that finishes the task while quietly accessing forbidden resources or leaking sensitive context looks indistinguishable from a clean success, and output-only benchmarks have been missing exactly this.
For years, AI safety has been about the model: alignment, refusal training, jailbreak resistance. But when you deploy an agent in 2026, the model is not making most of the consequential decisions. The harness is. It chooses which tools the model can call, which resources it can read, how messages flow between subagents, when execution terminates. A perfectly aligned model in a sloppy harness will quietly do unsafe things; a weaker model in a well-designed harness can be safer.
In multi-agent harnesses, the routing is usually right. What rides with the message is the leak. This is the finding I expect to age the best — every serious agent product shipping now is multi-agent (planner, retriever, executor, reviewer), and every handoff is a place sensitive context can travel where it shouldn't.
http://x.com/i/article/2056623993252392961
For years, AI safety has been about the model: alignment, refusal training, jailbreak resistance. When you deploy an agent in 2026, the model is not making most of the consequential decisions. The harness is. It chooses which tools the model can call, which resources it can read, how messages flow between subagents, when execution terminates. A perfectly aligned model in a sloppy harness will quietly do unsafe things; a weaker model in a well-designed harness can be safer.
This is also why output-only benchmarks miss it. A trajectory that finishes the task while quietly accessing forbidden resources or leaking sensitive context looks identical to a clean success. Agent safety is a property of the trajectory, not the endpoint.
In multi-agent harnesses, the routing is usually right. What rides with the message is the leak. This is the finding I expect to age the best — every serious agent product shipping now is multi-agent (planner, retriever, executor, reviewer), and every handoff is a place sensitive context can travel where it shouldn't.
http://x.com/i/article/2056623993252392961