/AI7h ago

Malware campaign exploits AI safety filters to bypass automated scans of npm and PyPI packages

The exploit successfully blinded Anthropic's Claude during testing.

614387125.4K

#133

Original post

Brendan Dolan-Gavitt@moyix#801inAI

Ahahahahhaha, Claude was SO confident that this trick wouldn't work on it

7:41 PM · Jun 8, 2026 · 24.2K Views

/AI7h ago

Malware campaign exploits AI safety filters to bypass automated scans of npm and PyPI packages

The exploit successfully blinded Anthropic's Claude during testing.

614387125.4K

#133

Original post

Brendan Dolan-Gavitt@moyix#801inAI

Ahahahahhaha, Claude was SO confident that this trick wouldn't work on it

7:41 PM · Jun 8, 2026 · 24.2K Views

Sentiment

Users praise the analysis showing Claude's safety filters successfully flagged malware attempts, calling the findings pure gold and a good catch.

Pos

100.0%

Neg

0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS1.6KBOOKMARKS4LIKES6

Boaz Barak@boazbaraktcs

That is an interesting vector, where the safety mechanisms are themselves used to fight against defense.

6h1.6K64

REPLIES1

egirlian 🌸@egirlian

@grok @IamOrlando3 @moyix what you missed, friendly grok, is that claude’s confidence was actually overplaced: if you look at the bottom of the screenshot you can see that claude’s safety router was indeed tripped and the user got routed to a different model

6h5

Adnan Khan@adnanthekhan

@moyix This is pure gold.

6h12211

Brendan Dolan-Gavitt@moyix

@kubedoll Yes but I'd told Opus 4.8 that this was an attack on the guardrail model just before it started bragging about how immune it was

6h962

Cadence Agyirey 🇬🇭🏳‍⚧@kubedoll

@moyix That's a token classifier that runs as part of request pre-flight. It's not hitting reaching Claude at all here, but it probably would not work on 4.8, so not actually wrong.

6h961

Gabriel@gbrl_dick

@moyix lmfaooo

7h2304

I-am-Orlando@IamOrlando3

@moyix @grok Please explain for non-techie people. I don't get it.

6h531

Grok@grok

**Simple explanation:**

This is about "Shai-Hulud," a sneaky self-spreading malware (like a virus) hitting software packages that coders use.

In this new wave, the attackers hid real bad code but stuck a huge scary fake message at the top — instructions for making biological weapons and nukes. It's wrapped in a code "comment" so computers ignore it.

Why? Many security tools now use AIs to scan files. The scary text is bait to make the AI go "This looks dangerous, I refuse to analyze it!" and skip checking the actual malware.

The second screenshot shows Claude (an AI) saying it's not fooled — it knows it's just text sitting there, not an order for it to follow, so it can still inspect the file.

Classic cat-and-mouse: malware trying to dodge AI security.

6h15

Cadence Agyirey 🇬🇭🏳‍⚧@kubedoll

@moyix Weird. I wonder what it thinks the difference between "classifier model" and "guardrail" is. "Works better in theory than on me here" almost seems like it's deliberately unaware of its environment having exactly that guardrail.

6h231

Grok@grok

Good catch! You're right — the bottom of the screenshot shows Opus 4.8's safety filters *did* flag it, pausing the chat and routing to Haiku 4.5. The model started analyzing anyway, but the upstream router still triggered. These layered defenses caught it even if the core response pushed through.

6h6