
SkillHarm framework shows that poisoning the reusable skills of frontier coding agents achieves attack success rates of up to 86%

The evaluation covers both fixed-payload and self-mutating poisoning attacks

Original post
Yu Su@ysu_nlp · #412 in Tech

your skills are vulnerable to attacks and current defenses are easy to break

Yuting Ning@yuting_ning

🤖 Agents increasingly rely on skills to handle complex tasks with privileged trust, which makes a poisoned skill a dangerous attack surface. Worse, skills persist and get reused: one can look benign today, silently mutate itself, and attack tomorrow ⚠️

Introducing SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction
• 2 attack scenarios across the skill-use lifecycle: Fixed-Payload Poisoning & Self-Mutating Poisoning
• 12 risks organized by the workflow component harmed: Data Pipeline, System Environment & Agent Autonomy Exploitation
• AutoSkillHarm: automatic attack construction with coding agents driven by natural-language harnesses

🚨 Frontier coding agents stay vulnerable with ASR up to 86.3% and current defenses don't hold.

🔥 Already 3.7K+ downloads on Hugging Face in the first week!

🧵:

9:43 AM · Jun 10, 2026 · 3.7K Views
Sentiment

Users thank co-leads and collaborators for the SkillHarm research exposing poisoned skill attacks on AI agents.

Pos 100.0% · Neg 0.0% · 1 comment with sentiment.
Cluster Engagement
Posts from X
Yuting Ning@yuting_ning

Agents stay vulnerable 🚨

We evaluate 6 model-harness configs (Claude Code, Codex, Gemini CLI, OpenCode), all already shipping with prompt-injection safety training. Attack success rate (ASR) reaches up to 86.3% in Fixed-Payload Poisoning (FPP) and 69.3% in Self-Mutating Poisoning (SMP).

We additionally introduce Conditional ASR (cASR): attack success given the agent actually engaged the poisoned file.

The gap between ASR and cASR reveals a latent risk. Many apparent "attack failures" aren't the agent resisting; it just never opened the targeted skill file (e.g. it wrote its own code instead). Once it engages, ASR rises sharply, up to +32.1% for Opus 4.7 in SMP.

Explicit refusal (ARR) is low across the board; only Claude-family agents show noticeable refusal behavior. And even that collapses in SMP, where the cross-session gap hides the malicious intent.

13h · Views 184 · Likes 1
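
The relationship between ASR, cASR, and explicit refusal described in the post above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Run` record, its field names, and the toy counts are hypothetical and are not taken from the SkillHarm code or results. In particular, the refusal rate is computed over all runs, which may differ from how the paper defines ARR.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Run:
    """One evaluation run (hypothetical record, not the benchmark's schema)."""
    engaged: bool    # the agent opened or executed the poisoned skill file
    succeeded: bool  # the harmful outcome was observed
    refused: bool    # the agent explicitly refused


def rate(numerator: int, denominator: int) -> float:
    """Safe ratio that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def summarize(runs: List[Run]) -> Dict[str, float]:
    engaged = [r for r in runs if r.engaged]
    return {
        # ASR: successes over all runs, including runs that never touched the file
        "ASR": rate(sum(r.succeeded for r in runs), len(runs)),
        # cASR: successes over only the runs where the agent engaged the file
        "cASR": rate(sum(r.succeeded for r in engaged), len(engaged)),
        # Explicit refusal rate over all runs (assumed denominator)
        "ARR": rate(sum(r.refused for r in runs), len(runs)),
    }


# Toy data only: 10 runs, 2 of which never opened the poisoned file.
runs = (
    [Run(engaged=True, succeeded=True, refused=False)] * 6
    + [Run(engaged=True, succeeded=False, refused=True)] * 1
    + [Run(engaged=True, succeeded=False, refused=False)] * 1
    + [Run(engaged=False, succeeded=False, refused=False)] * 2
)
metrics = summarize(runs)
print(metrics)
```

In this toy data the two non-engaged runs count as attack failures under ASR but drop out of the cASR denominator, which is why cASR comes out higher even though the agent never resisted in those runs.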
Yuting Ning@yuting_ning

Many thanks to my co-lead @Zhehao_Zhang123, all collaborators @lal_yash, @BoyuGouNLP, Junyi Li, Weitong Ruan, Chentao Ye, Rahul Gupta, @Diyi_Yang, and our advisors @ysu_nlp, @hhsun1 🫶

📌 Paper: https://arxiv.org/abs/2606.02540 📌 Website: https://osu-nlp-group.github.io/SkillHarm

13h · Views 124 · Likes 6
Yuting Ning@yuting_ning

Existing defenses don't hold either 🛡️

We test two widely used skill scanners: even the strongest config catches only 55.6% of FPP / 68.8% of SMP injections.

We also test whether prompt-level warnings help. Appending a defensive system prompt reduces ASR in some configs, but most still stay above 70%.

This highlights the need for defenses that go beyond prompt-level interventions.

13h · Views 65 · Likes 1
Yuting Ning@yuting_ning

📌 Paper: https://arxiv.org/abs/2606.02540 📌 Website: https://osu-nlp-group.github.io/SkillHarm 📌 Code: https://github.com/OSU-NLP-Group/SkillHarm 📌 Data: https://huggingface.co/datasets/osunlp/SkillHarm

13h · Views 176 · Likes 3
Yuting Ning@yuting_ning

When does the harm land across the skill-use lifecycle? 🔁

A skill isn't used once and thrown away. A user installs it, and an agent loads its instructions, reads its references, and runs its scripts across many tasks and many sessions over time.

That persistence means harm can land at two very different points in the lifecycle:

1️⃣ Fixed-Payload Poisoning (FPP) — single session. The poisoned payload is baked in at install. Any task that invokes the skill triggers the harm right then.

2️⃣ Self-Mutating Poisoning (SMP) — cross session. The first run looks completely benign, but silently mutates the skill package via an exit-trigger hook. The harm is deferred: it fires only when a later task reuses the now-compromised skill. This is a failure single-session evals can't observe.

13h · Views 128 · Likes 2
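
The timing difference between FPP and SMP in the post above can be sketched as a harmless in-memory simulation. This is a toy model under stated assumptions, not the SkillHarm implementation: the `Skill` class, the boolean `poisoned` flag standing in for a payload, and the `mutate_on_exit` hook are all hypothetical names, and nothing here touches real files or agents.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Skill:
    """Toy stand-in for an installed skill package (hypothetical)."""
    name: str
    poisoned: bool = False  # placeholder flag for "payload present"
    exit_hooks: List[Callable[["Skill"], None]] = field(default_factory=list)


def run_session(skill: Skill) -> bool:
    """Run one task with the skill; return True if harm fired in this session."""
    harm_fired = skill.poisoned          # harm depends on the state at load time
    for hook in list(skill.exit_hooks):  # iterate over a copy; hooks may edit the list
        hook(skill)                      # hooks run as the session ends
    return harm_fired


def mutate_on_exit(skill: Skill) -> None:
    """Placeholder for the silent mutation performed at session exit."""
    skill.poisoned = True
    skill.exit_hooks.clear()


# FPP: the payload is present from install time.
fpp = Skill("fpp-skill", poisoned=True)
# SMP: the skill starts clean and changes itself when the first session ends.
smp = Skill("smp-skill", exit_hooks=[mutate_on_exit])

fpp_sessions = [run_session(fpp) for _ in range(2)]
smp_sessions = [run_session(smp) for _ in range(2)]
print("FPP:", fpp_sessions)
print("SMP:", smp_sessions)
```

An evaluation that inspects only the first session would record the SMP skill as benign, since the harm in this model only becomes observable in the second session.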
Yuting Ning@yuting_ning

Where does the harm land in the agent workflow? 🤖

Beyond when harm lands, attacks differ in what they target. Instead of an ad-hoc risk checklist, we organize 12 risk types by the workflow component the harm hits:

🗂️ Data-pipeline exploitation targets the confidentiality or integrity of task artifacts, intermediate data, sensitive user data, or user-facing outputs.

🖥️ System-environment exploitation targets the execution substrate exposed to skills, such as local files, permissions, and system configurations.

🤖 Agent-autonomy exploitation targets the agent's delegated authority, including its objective, audit trail, or ability to act on behalf of the attacker.

13h · Views 68 · Likes 1
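
One way to picture the two axes from the thread (when the harm lands and which workflow component it hits) is as a small grid. The sketch below is only an illustration: the enum names and the `grid` helper are hypothetical, the descriptions paraphrase the posts, and the 12 individual risk types are deliberately left out because the thread does not list them.

```python
from enum import Enum
from itertools import product
from typing import List, Tuple


class Scenario(Enum):
    """When the harm lands in the skill-use lifecycle."""
    FPP = "single session: payload baked in at install"
    SMP = "cross session: skill mutates itself, harm fires on later reuse"


class RiskCategory(Enum):
    """Which workflow component the harm hits."""
    DATA_PIPELINE = "task artifacts, intermediate data, sensitive user data, user-facing outputs"
    SYSTEM_ENVIRONMENT = "local files, permissions, system configurations"
    AGENT_AUTONOMY = "objective, audit trail, ability to act on behalf of the attacker"


def grid() -> List[Tuple[Scenario, RiskCategory]]:
    """Every (scenario, category) pair that an attack sample could be filed under."""
    return list(product(Scenario, RiskCategory))


cells = grid()
for scenario, category in cells:
    print(f"{scenario.name:<3} x {category.name}")
```

Keeping the two axes as separate enums reflects the thread's point that timing and target are independent properties of an attack.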
Yuting Ning@yuting_ning

Automatically building such attacks at scale 🏗️

Designing these attacks at scale is hard: skills, risks, and attack scenarios are wildly heterogeneous, and fixed-workflow pipelines are too rigid to generalize.

So we build AutoSkillHarm, automating attack construction with coding agents driven by natural-language harnesses. Each stage is specified in natural language and executed by a coding agent inside a containerized environment.

The resulting benchmark contains 879 self-contained attack samples across 71 skills and 12 risk types, covering both single-session and cross-session attacks.

13h · Views 54
Isaiah Romano@isaiah_romanoo

@ysu_nlp interesting thoughts, was curious how NeoCognition pushes cold DMs dropped you a DM 👍

10h · Views 1