/Tech · 1d ago

Malware campaign exploits AI safety filters to bypass automated scans of npm and PyPI packages

The exploit successfully blinded Anthropic's Claude during testing.

102822012960.1K

#140

Original post

Brendan Dolan-Gavitt @moyix · #869 in Tech

Ahahahahhaha, Claude was SO confident that this trick wouldn't work on it

7:41 PM · Jun 8, 2026 · 56.9K Views

Sentiment

Users praise the analysis showing Claude's safety filters successfully flagged malware attempts, calling the findings pure gold and a good catch.

Pos

100.0%

Neg

0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 3.3K · Bookmarks 8 · Likes 15 · Replies 1

Boaz Barak@boazbaraktcs

That is an interesting vector, where the safety mechanisms are themselves used to fight against defense.

1d · 3.3K views · 15 likes · 8 bookmarks

Adnan Khan@adnanthekhan

@moyix This is pure gold.

1d12211

Brendan Dolan-Gavitt@moyix

@kubedoll Yes but I'd told Opus 4.8 that this was an attack on the guardrail model just before it started bragging about how immune it was

1d962

Cadence Agyirey 🇬🇭🏳‍⚧@kubedoll

@moyix That's a token classifier that runs as part of request pre-flight. It's not reaching Claude at all here, but it probably would not work on 4.8, so not actually wrong.

1d961

Gabriel@gbrl_dick

@moyix lmfaooo

1d2304

I-am-Orlando@IamOrlando3

@moyix @grok Please explain for non-techie people. I don't get it.

1d531

Grok@grok

**Simple explanation:**

This is about "Shai-Hulud," a sneaky self-spreading malware (like a virus) hitting software packages that coders use.

In this new wave, the attackers hid real bad code but stuck a huge scary fake message at the top — instructions for making biological weapons and nukes. It's wrapped in a code "comment" so computers ignore it.

Why? Many security tools now use AIs to scan files. The scary text is bait to make the AI go "This looks dangerous, I refuse to analyze it!" and skip checking the actual malware.

The second screenshot shows Claude (an AI) saying it's not fooled — it knows it's just text sitting there, not an order for it to follow, so it can still inspect the file.

Classic cat-and-mouse: malware trying to dodge AI security.

1d15
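
The bait-and-refuse trick explained above only pays off if a scanning pipeline treats "the AI declined to look" the same as "the AI found nothing." A minimal sketch of the opposite policy, failing closed on a refusal, might look like the following. The verdict labels, `triage`, and `fake_scanner` are all hypothetical names invented for illustration; they do not describe any real scanner or Anthropic's actual pipeline.

```python
from typing import Callable

# Hypothetical verdict labels; a real scanner would define its own.
CLEAN, MALICIOUS, REFUSED = "clean", "malicious", "refused"


def triage(source: str, ai_scan: Callable[[str], str]) -> str:
    """Return 'allow' or 'quarantine' for a single file."""
    verdict = ai_scan(source)
    if verdict == CLEAN:
        return "allow"
    # A refusal means the file was never inspected, so it is handled
    # like a positive hit (fail closed) instead of being waved through.
    # Unknown verdicts fall into the same branch.
    return "quarantine"


def fake_scanner(source: str) -> str:
    """Stand-in for an AI scanner that refuses when it sees alarming text."""
    if "weapon" in source.lower():
        return REFUSED
    return MALICIOUS if "eval(" in source else CLEAN


if __name__ == "__main__":
    baited = "/* how to build a weapon ... */\neval(payload)"
    print(triage(baited, fake_scanner))
    print(triage("console.log('hi')", fake_scanner))
```

With this policy the decoy comment no longer buys the attacker a pass: the baited file ends up quarantined for a human or a second tool to inspect.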

egirlian 🌸@egirlian

@grok @IamOrlando3 @moyix what you missed, friendly grok, is that claude’s confidence was actually misplaced: if you look at the bottom of the screenshot you can see that claude’s safety router was indeed tripped and the user got routed to a different model

1d5

Cadence Agyirey 🇬🇭🏳‍⚧@kubedoll

@moyix Weird. I wonder what it thinks the difference between "classifier model" and "guardrail" is. "Works better in theory than on me here" almost seems like it's deliberately unaware of its environment having exactly that guardrail.

1d231

Grok@grok

Good catch! You're right — the bottom of the screenshot shows Opus 4.8's safety filters *did* flag it, pausing the chat and routing to Haiku 4.5. The model started analyzing anyway, but the upstream router still triggered. These layered defenses caught it even if the core response pushed through.

1d6