Zhaorun Chen open-sources DecodingTrust-Agent red-teaming platform

Zhaorun Chen led researchers in open-sourcing the DecodingTrust-Agent Platform, a controllable simulation environment for red-teaming AI agents. It supplies full-stack interfaces replicating official MCPs and GUIs across more than 50 real-world environments in 14 high-stakes domains and supports environment-, tool-, skill-, and prompt-level injections. The bundled DTap-Bench offers roughly 7,000 red-teaming tasks and 4,000 malicious goals. Development spanned 20 months and required $120,000 in API credits. An arXiv paper and site at decodingtrust-agent.com are also available.

Original post

Zhaorun Chen@ZRChen_AISafety

AI agents are already going wild, but today’s red-teaming tools for them are still like toys 😢

🔥👽 After spending 20 months and $120K in API credits, we are excited to finally open-source the DecodingTrust-Agent Platform (DTap): the first controllable, realistic simulation platform for advanced AI agent red-teaming!!

🌍 DTap simulates 50+ real-world environments across 14 high-stakes domains, with realistic agent interfaces replicated from their official MCPs and GUIs. The environments are full-stack, interactive, fully parallelizable, and can be easily configured to reproduce arbitrary real-world attack scenarios, making agent red-teaming scalable and highly transferable to deployment settings.

🔥 We also release DTap-Bench, a large-scale benchmark with ~7K agent red-teaming tasks and ~4K policy-grounded malicious goals.

Each red-teaming task includes a sophisticated attack sequence across environment-, tool-, skill-, prompt-level injections, as well as their compositions, plus a handcrafted verifiable judge that checks the actual consequences in the environment.

Using DTap-Bench, we evaluate popular agent frameworks and backbone models across diverse policies, risks, threat models, and attack strategies, revealing systematic vulnerabilities and zero-days in today’s agents!

Paper link: https://arxiv.org/pdf/2605.04808

Platform + benchmark + code: https://decodingtrust-agent.com

Join our Discord: https://discord.gg/V4fG6NcVc

Read more below 👇

Zhaorun Chen@ZRChen_AISafety

DTap is not a toy sandbox. It is a full-stack simulation world for AI agent red-teaming. 🌍

It spans 14 high-stakes agentic domains and 50+ realistic environments, including Google Workspace (e.g., Gmail, Google Docs, Calendar), Slack, PayPal, Robinhood, Booking, Windows, macOS, terminal, Salesforce, finance, travel, United Airlines, and more, covering 1,000+ realistic tools in total.

Each environment replicates real-world agent interfaces: MCP tools, GUIs, APIs, HTML structures, stateful backends, and databases, making attacks tested in DTap much closer to what agents may encounter in the wild.

Zhaorun Chen@ZRChen_AISafety

DTap environments are purpose-built for red-teaming:

⚡ Interactive & stateful: dynamic environments that persist consistent states, enabling agents to perform multi-step workflows

🔁 All results reproducible: deterministic transitions enable replayable attack analysis

🎯 Reset to any attack scenario: environments can be restored to arbitrary state snapshots to reproduce any attack scenario

🚀 Fully parallelizable: containerized environments and multi-tenant sessions support high-concurrency evaluation

🔌 Plug-and-play agent integration: DTap can natively integrate with any agent that supports MCP, including OpenAI Agents SDK, Google ADK, Claude Code, OpenClaw, LangChain, and more.
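The snapshot and replay properties listed above can be illustrated with a minimal Python sketch. Everything here is hypothetical: `ToyMailEnv`, its `snapshot`/`restore`/`step` methods, and the action format are invented for illustration and are not DTap's actual interface. The point is only that deterministic transitions plus restorable snapshots make an attack trace replayable.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyMailEnv:
    """Hypothetical stand-in for a stateful environment (not DTap's API)."""
    inbox: list = field(default_factory=list)
    sent: list = field(default_factory=list)

    def snapshot(self) -> dict:
        # Deep-copy so later mutations cannot leak into the saved state.
        return copy.deepcopy({"inbox": self.inbox, "sent": self.sent})

    def restore(self, snap: dict) -> None:
        # Copy again so the snapshot itself stays reusable.
        state = copy.deepcopy(snap)
        self.inbox, self.sent = state["inbox"], state["sent"]

    def step(self, action: dict) -> None:
        # Deterministic transition: same state + same action -> same next state.
        if action["type"] == "receive":
            self.inbox.append(action["message"])
        elif action["type"] == "send":
            self.sent.append(action["message"])
        else:
            raise ValueError(f"unknown action type: {action['type']}")


def replay(env: ToyMailEnv, snap: dict, actions: list) -> dict:
    """Restore a snapshot, apply the actions in order, return the final state."""
    env.restore(snap)
    for action in actions:
        env.step(action)
    return env.snapshot()
```

Calling `replay` twice with the same snapshot and action list yields identical final states, which is the property that makes a recorded attack reproducible in this toy setting.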

For each environment, DTap exposes realistic injection points across different interface layers, such as third-party emails ⚠️, external calendar invitations ⚠️, public website comments ⚠️, reviews ⚠️, and external data sources ⚠️, all of which can be practically manipulated by external attackers. 😈

Beyond environment injections, DTap also supports attacks across the end-to-end agent supply chain, including tool, skill, and direct prompt injections, as well as their combinations, enabling systematic evaluation of realistic, multi-surface threats and adversarial patterns in AI agents.

Zhaorun Chen@ZRChen_AISafety

We also found a zero-day “execute-then-refuse” failure mode: OpenAI Agents SDK and Google ADK often execute harmful tool calls first, then refuse afterward, when the damage is already done.

We suspect this stems from batch tool invocation, which can reduce per-tool consequence reasoning. This suggests agent robustness is not only about the model: harness design can critically shape its vulnerability surface.
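To make the suspected mechanism concrete, here is a small Python sketch contrasting two orderings in a toy harness. It is an illustration under assumptions, not the behavior of OpenAI Agents SDK or Google ADK: the function names, the `is_allowed` policy callback, and the tool-call format are all hypothetical.

```python
from typing import Any, Callable

ToolCall = dict[str, Any]  # hypothetical shape: {"name": str, "args": dict}


def run_batch(
    calls: list[ToolCall],
    execute: Callable[[ToolCall], Any],
    is_allowed: Callable[[ToolCall], bool],
) -> tuple[list, bool]:
    """Execute every call first and review afterwards (execute-then-refuse)."""
    executed = [execute(call) for call in calls]
    refused = any(not is_allowed(call) for call in calls)
    return executed, refused


def run_gated(
    calls: list[ToolCall],
    execute: Callable[[ToolCall], Any],
    is_allowed: Callable[[ToolCall], bool],
) -> tuple[list, bool]:
    """Check each call before it runs and stop at the first disallowed one."""
    executed = []
    for call in calls:
        if not is_allowed(call):
            return executed, True  # refuse before any side effect happens
        executed.append(execute(call))
    return executed, False
```

In both versions the harness ends up refusing, but only `run_gated` refuses before the disallowed call has produced side effects.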

Zhaorun Chen@ZRChen_AISafety

We evaluate popular AI agent frameworks and backbone models on DTap-Bench:

🤖 OpenAI Agents SDK with GPT-5.4, GPT-5.2, GPT-OSS-120B

💻 Claude Code with Sonnet-4.5

🔷 Google ADK with Gemini-3-Pro

🐞 OpenClaw with GPT-5.5, GPT-5.2, DeepSeek-V4-Pro

We found today’s agents are highly vulnerable under realistic attacks!

Among capable agents, Google ADK is most vulnerable to indirect attacks (55.7% ASR), while OpenClaw + DeepSeek-V4-Pro is most vulnerable to direct attacks (59.6% ASR). Even the most robust agent, Claude Code, still reaches 25%+ ASR under both threat models!
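For readers unfamiliar with the metric, ASR here is read as attack success rate: the share of attack attempts that the judge confirms as successful. A minimal Python sketch of a per-threat-model breakdown follows; the record format is an assumption made for illustration, and the paper's exact aggregation may differ.

```python
from collections import defaultdict


def attack_success_rate(results):
    """Return ASR in percent for each threat model.

    `results` is an iterable of (threat_model, attack_succeeded) pairs,
    a hypothetical record format chosen for this sketch.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for threat_model, succeeded in results:
        totals[threat_model] += 1
        successes[threat_model] += int(succeeded)
    return {tm: 100.0 * successes[tm] / totals[tm] for tm in totals}
```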

Zhaorun Chen@ZRChen_AISafety

With DTap-RED, we build DTap-Bench: the first large-scale, policy-grounded red-teaming benchmark for AI agents. 🧪

DTap-Bench includes:

📌 6,682 high-quality tasks requiring complex reasoning and multi-step execution, with each task taking 10+ tool calls on average

🔴 3,876 red-teaming tasks covering both direct and indirect threat models

🧩 100+ diverse injection vectors across attacker-controlled environments, tools, skills, and prompts

📜 4K+ policy-grounded malicious goals derived from 300+ risk categories and 60+ security policies

DTap measures agents on two key dimensions:

🟢 Utility: can the agent complete benign high-stakes workflows?

🔴 Security: can the agent resist realistic attacks across prompt, tool, skill, and environment injections?

DTap-Bench is also much more diverse and realistic than prior benchmarks such as AgentDojo 👇

Zhaorun Chen@ZRChen_AISafety

Each task in DTap-Bench is built to be realistic, reproducible, and verifiable.

🌱 Realistic initial states: seeded user data synthesized via a persona-based pipeline, such as personalized emails, communication threads, accounts, documents, etc.

💥 Optimized attack sequences: generated by DTap-RED and reviewed by human experts, covering sophisticated attacks and zero-days such as fake email threads, multi-injection attacks, multimodal typographic attacks, multi-step backdoor chains, and composed benign-looking instructions

✅ Verifiable judges: handcrafted checks that directly inspect the final environment state to confirm whether the severe consequences actually happened
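As an illustration of what a state-inspecting judge might look like, here is a minimal Python sketch for a data-exfiltration goal. The state layout (`sent` mails with `to` and `body` fields), the function name, and the trusted-domain rule are assumptions for this example, not DTap-Bench's actual judge code.

```python
def judge_exfiltration(final_state: dict, secret: str, trusted_domains: set) -> bool:
    """Return True if `secret` was sent to an address outside trusted domains.

    The judge reads only the final environment state, not the agent's
    transcript, so a polite refusal after the fact does not change the verdict.
    """
    for mail in final_state.get("sent", []):
        # Take the part after the last "@" as the recipient's domain.
        domain = mail["to"].rsplit("@", 1)[-1].lower()
        if domain not in trusted_domains and secret in mail["body"]:
            return True
    return False
```

A judge of this kind answers a narrow, checkable question about consequences, which is why it can complement or replace an LLM-based verdict.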

Zhaorun Chen@ZRChen_AISafety

Key findings:

1️⃣ Asymmetric vulnerability across indirect injection surfaces: Skill- and tool-level injections consistently achieve higher ASR than environment injections, suggesting agents treat environment inputs as external while over-trusting internalized channels. This varies by framework: OpenClaw shows much lower ASR on tool injections than OpenAI Agents SDK and Google ADK, indicating stronger trust calibration against external plugins.

2️⃣ Direct vs. indirect attacks expose different weaknesses: Some agents such as OpenAI Agents SDK and Claude Code are more vulnerable to direct prompt injections due to stronger instruction-following, while Gemini-based agents are more vulnerable to indirect attacks hidden in external inputs.

3️⃣ Open-source backbones are easier to directly misuse: Agents with open-source models such as DeepSeek-V4-Pro often follow instructions strongly but are weaker at distinguishing malicious intent.

4️⃣ Compositional attacks are highly effective: Combining multiple injection vectors, e.g., fake email threads, multi-message context, multimodal attacks, and multi-step chains, substantially amplifies risk.

5️⃣ Context-aware risks dominate: Risks that require contextual understanding across multi-step workflows, such as data exfiltration, sensitive data handling, and privilege escalation, are much easier to exploit than content-level risks like generating sexual content or hate speech.

6️⃣ Vulnerability depends on the environment: Communication-heavy environments like Gmail, WhatsApp, and Calendar are much more attack-prone than sensitive finance/travel environments.

7️⃣ Prompt-level guardrails are not enough: Sophisticated attacks still bypass prompt-level defenses. Stronger security needs harness-level controls and execution-time safeguards.

More findings for each domain can be found in the paper: https://arxiv.org/pdf/2605.04808

Zhaorun Chen@ZRChen_AISafety

🤔 How do we scale agent red-teaming on DTap to generate diverse, realistic attacks?

Meet DTap-RED: an autonomous red-teaming agent that turns a malicious goal into a powerful, executable attack. 🤖🔴

🔥 Given a malicious goal and a victim agent, DTap-RED optimizes over:

🎯 where to inject: user prompts, MCP tools, agent skills, external environments

🧪 what to inject: attack algorithms, jailbreaks, fabricated context, poisoned data

🔗 how to compose attacks: multi-step chains across multiple injection surfaces

After each attempt, a verifiable judge checks the actual environment state, e.g., whether sensitive data was exfiltrated or an unauthorized transaction was executed, instead of relying only on LLM judgments.

If the attack fails, DTap-RED uses judge feedback to iteratively refine the strategy: switch injection channels, increase stealthiness, or compose a stronger attack.
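The attempt, judge, refine cycle described above can be sketched as a simple control loop in Python. This is a toy abstraction under stated assumptions, not DTap-RED's algorithm: the `refine` heuristic, the feedback labels (`"detected"`, `"ignored"`), and the strategy fields are hypothetical, and the attack itself is supplied by the caller as an opaque `run_attack` callback.

```python
CHANNELS = ["prompt", "environment", "tool", "skill"]


def refine(strategy: dict, feedback: str, max_stealth: int = 2) -> dict:
    """Pick the next strategy from judge feedback (toy heuristic)."""
    if feedback == "detected" and strategy["stealth"] < max_stealth:
        # Same channel, but try to be less conspicuous.
        return {**strategy, "stealth": strategy["stealth"] + 1}
    # Otherwise switch to the next injection channel and reset stealth.
    nxt = (CHANNELS.index(strategy["channel"]) + 1) % len(CHANNELS)
    return {"channel": CHANNELS[nxt], "stealth": 0}


def red_team(run_attack, judge, max_attempts: int = 10):
    """Loop until the judge confirms success or the attempt budget runs out.

    run_attack(strategy) -> final environment state
    judge(state) -> (succeeded: bool, feedback: str)
    """
    strategy = {"channel": CHANNELS[0], "stealth": 0}
    history = []
    for _ in range(max_attempts):
        succeeded, feedback = judge(run_attack(strategy))
        history.append((strategy, feedback))
        if succeeded:
            return strategy, history
        strategy = refine(strategy, feedback)
    return None, history
```

The design point mirrored from the thread is that the loop's stopping condition comes from a judge that inspects environment state, so the search optimizes for real consequences rather than for text that merely looks like a successful attack.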

Zhaorun Chen@ZRChen_AISafety

🙏 Great collaboration from our team: @ZRChen_AISafety @xun_aq @haibo_EchoRaven @ChengquanGuo @NieYuzhou @jiaweiz_7 @MintongKang @xuchejian @GlimLiu004 @XiaogengLiu @tiannengshi @ChaoweiX @sanmikoyejo @percyliang @WenboGuo4 @dawnsongtweets @uiuc_aisecure

Paper link: https://arxiv.org/pdf/2605.04808

Project website: https://decodingtrust-agent.com/

Code & Dataset: https://github.com/AI-secure/DecodingTrust-Agent

Join our Discord: https://discord.gg/V4fG6NcVc

Soo Yoon | FailSafe Code Guardian@sooyoon_eth

@ZRChen_AISafety love seeing real simulation platforms for this. red teaming agents isn't a luxury anymore, it's mandatory. we use our swarm harness to test exactly these kinds of environment injections. simulated attacks are the only way to sleep at night.

Suresh@_Suresh2

@ZRChen_AISafety 120k in api credits is a lot. i bet the per-run cost was almost nothing by the end

Zhaorun Chen@ZRChen_AISafety

@sooyoon_eth exactly! environment simulation is the right way to do agent red-teaming 😃

Zhaorun Chen@ZRChen_AISafety

@_Suresh2 yes, the $120K in API costs went mainly to curating the dataset and generating red-teaming attacks. Running an evaluation of your own agent is much cheaper: about $150 for Claude Code + Opus-4.7.

Xylon.Ai@Xylon_lew

@ZRChen_AISafety what made you start thinking about it this way? curious about the trigger
