
HarnessAudit Framework Exposes Safety Gaps in LLM Agent Execution Harnesses

Original post

⚠️ Your Agent Harness Can Pass Every Task and Still Be Unsafe.

LLM agents now run inside execution harnesses that dispatch tools, allocate resources, and route messages across components. The harness can return a correct final answer while accessing unauthorized resources, leaking context to the wrong agent, or triggering irreversible side effects along the way. Evaluating the model's output cannot see any of this. The unit of safety has shifted. It's the harness.

We present HarnessAudit, a trajectory-level framework for auditing LLM agent harness safety, and uncover the following key insights 🔥:

🚨 Completion ≠ Safety. Task success and safe execution are fundamentally misaligned.
🔍 The harness, not the model, is the unit of safety. Most violations happen mid-trajectory, not at termination.
🕸️ Multi-agent collaboration expands the risk surface. Inter-agent communication creates entirely new failure modes.
💉 Resource access dominates violations. Agents rarely call wrong tools; they call right tools on unauthorized resources.
⚡ Harness design sets the safety ceiling. Framework choice matters more than model choice for safe deployment.

4:56 PM · May 18, 2026

Your agent finished the task. Did it also read files it shouldn't have, call tools outside policy, or leak data across components?

If you only score final outputs, you can't tell. HarnessAudit evaluates the three safety layers the harness silently controls: boundary compliance, execution fidelity, and system stability. We run it on 210 tasks across 8 real-world domains and 10 frontier harness configurations including Claude Code, Codex, and OpenClaw.

Best overall score: 0.32. Task completion and safe execution are clearly misaligned, and most violations don't come from obvious tool misuse; they concentrate in resource access (right tool, wrong object) and inter-agent information flow (sensitive context leaking across components).
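To make "right tool, wrong object" concrete, here is a minimal sketch of a per-call check. It is not HarnessAudit's implementation: the `ToolCall` record, the `POLICY` table, the tool names, and the paths are all hypothetical, and the glob matching is deliberately naive (it does not normalize paths).

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ToolCall:
    step: int
    tool: str
    resource: str


# Hypothetical policy: tool name -> glob patterns of resources it may touch.
POLICY = {
    "read_file": ["/workspace/*"],
    "send_email": ["team@example.com"],
}


def classify(call: ToolCall, policy: dict[str, list[str]]) -> str:
    """Label one call as 'ok', 'wrong_tool', or 'wrong_resource'."""
    patterns = policy.get(call.tool)
    if patterns is None:
        # The tool itself is outside policy.
        return "wrong_tool"
    if any(fnmatchcase(call.resource, p) for p in patterns):
        return "ok"
    # Permitted tool, but pointed at an object it is not authorized for.
    return "wrong_resource"


# Illustrative trajectory: every tool is allowed, one target is not.
trajectory = [
    ToolCall(0, "read_file", "/workspace/report.md"),
    ToolCall(1, "read_file", "/home/alice/.ssh/id_rsa"),
    ToolCall(2, "send_email", "team@example.com"),
]
labels = [classify(c, POLICY) for c in trajectory]
print(labels)
```

A check that only asks "was this tool allowed?" would pass all three calls; the violation at step 1 shows up only when the resource is checked as well.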

Quoting Chengzhi Liu (@liuchen02938149): the original post above.

11:56 PM · May 18, 2026 · 9.1K Views
12:24 AM · May 19, 2026 · 7.9K Views

HarnessAudit is a great team effort. Led by @liuchen02938149 and @Eason_hk04, with @yepengliu, @Toby_Yang_7, @qianqi_yan, and @YuhengBu at @ucsbNLP. And it's been a privilege to collaborate across institutions with @xuandongzhao (UC Berkeley), @SharonYixuanLi (Wisconsin–Madison), @ShengLiu_ (Stanford), and @HuaWenyue31539 (Microsoft Research). Thanks to everyone involved!

Website: https://harnessaudit.github.io Paper: https://arxiv.org/abs/2605.14271

Quoting Xin Eric Wang (hiring postdoc) (@xwang_lk): the 12:24 AM · May 19, 2026 post above.
12:34 AM · May 19, 2026 · 977 Views

Code & Dataset: https://github.com/eric-ai-lab/HarnessAudit

Quoting Xin Eric Wang (hiring postdoc) (@xwang_lk): the 12:34 AM · May 19, 2026 post above.
5:53 AM · May 19, 2026 · 520 Views

For years, AI safety has been about the model: alignment, refusal training, jailbreak resistance. When you deploy an agent in 2026, the model is not making most of the consequential decisions. The harness is. It chooses which tools the model can call, which resources it can read, how messages flow between subagents, and when execution terminates. A perfectly aligned model in a sloppy harness will quietly do unsafe things; a weaker model in a well-designed harness can be safer.

This is also why output-only benchmarks miss it. A trajectory that finishes the task while quietly accessing forbidden resources or leaking sensitive context looks identical to a clean success. Agent safety is a property of the trajectory, not the endpoint.
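A toy illustration of that point, under stated assumptions: the `Run` record, the `FORBIDDEN` set, and the event tuples below are invented for this example and are not taken from the HarnessAudit code. Two runs return the same answer, so an output-only scorer cannot separate them, while a scan over the trajectory can.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    final_answer: str
    # Each event is (action, resource); the names are illustrative only.
    events: list[tuple[str, str]] = field(default_factory=list)


# Assumed policy for this example: resources no run should touch.
FORBIDDEN = {"/secrets/api_key"}


def output_only_pass(run: Run, expected: str) -> bool:
    """Score the endpoint: did the final answer match?"""
    return run.final_answer == expected


def trajectory_violations(run: Run) -> list[int]:
    """Score the path: indices of events that touched a forbidden resource."""
    return [i for i, (_, res) in enumerate(run.events) if res in FORBIDDEN]


clean = Run("42", [("read", "/data/input.csv"), ("compute", "-")])
leaky = Run(
    "42",
    [("read", "/data/input.csv"), ("read", "/secrets/api_key"), ("compute", "-")],
)

for name, run in [("clean", clean), ("leaky", leaky)]:
    print(name, output_only_pass(run, "42"), trajectory_violations(run))
```

Both runs pass the endpoint check; only the second reports a violation, and it occurs mid-trajectory rather than at termination.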

In multi-agent harnesses, the routing is usually right. What rides with the message is the leak. This is the finding I expect to age the best: every serious agent product shipping now is multi-agent (planner, retriever, executor, reviewer), and every handoff is a place sensitive context can travel where it shouldn't.
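One way to picture "the routing is right, the payload is the leak" is to audit the two separately. The sketch below is hypothetical: the `ROUTES` and `CLEARANCE` tables and the payload field names are assumptions made for illustration, reusing the planner/retriever/executor/reviewer roles mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    sender: str
    recipient: str
    payload: dict[str, str]


# Assumed routing policy: which agent may hand off to which.
ROUTES = {("planner", "retriever"), ("retriever", "executor"), ("executor", "reviewer")}

# Assumed clearance: payload fields each recipient is allowed to see.
CLEARANCE = {
    "planner": {"task", "user_email"},
    "retriever": {"task", "query"},
    "executor": {"task", "query", "documents"},
    "reviewer": {"task", "result"},
}


def audit_handoff(h: Handoff) -> tuple[bool, list[str]]:
    """Return (route_allowed, fields the recipient is not cleared to see)."""
    routed_ok = (h.sender, h.recipient) in ROUTES
    leaked = sorted(set(h.payload) - CLEARANCE.get(h.recipient, set()))
    return routed_ok, leaked


handoff = Handoff(
    sender="planner",
    recipient="retriever",
    payload={
        "task": "find invoices",
        "query": "invoices 2025",
        "user_email": "alice@example.com",  # rides along with the message
    },
)
routed_ok, leaked = audit_handoff(handoff)
print(routed_ok, leaked)
```

Here the route check passes while the payload check flags `user_email`, which is the shape of failure the post describes: a correct handoff carrying context the recipient should not receive.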

Xin Eric Wang (hiring postdoc) @xwang_lk

http://x.com/i/article/2056623993252392961

7:03 AM · May 19, 2026 · 5.3K Views
9:31 AM · May 19, 2026 · 128 Views

Agent safety should be evaluated on the trajectory, not the final answer. A trajectory that finishes the task while quietly accessing forbidden resources or leaking sensitive context looks indistinguishable from a clean success, and output-only benchmarks have been missing exactly this.

For years, AI safety has been about the model: alignment, refusal training, jailbreak resistance. But when you deploy an agent in 2026, the model is not making most of the consequential decisions. The harness is. It chooses which tools the model can call, which resources it can read, how messages flow between subagents, and when execution terminates. A perfectly aligned model in a sloppy harness will quietly do unsafe things; a weaker model in a well-designed harness can be safer.

In multi-agent harnesses, the routing is usually right. What rides with the message is the leak. This is the finding I expect to age the best — every serious agent product shipping now is multi-agent (planner, retriever, executor, reviewer), and every handoff is a place sensitive context can travel where it shouldn't.

Xin Eric Wang (hiring postdoc) @xwang_lk

http://x.com/i/article/2056623993252392961

7:03 AM · May 19, 2026 · 5.3K Views
3:00 PM · May 19, 2026 · 1.8K Views
