Tech · 1d ago

Malware campaign exploits AI safety filters to bypass automated scans of npm and PyPI packages

The exploit successfully blinded Anthropic's Claude during testing.

Original post

Ahahahahhaha, Claude was SO confident that this trick wouldn't work on it

7:41 PM · Jun 8, 2026 · 56.9K Views
Sentiment

Users praise the analysis showing Claude's safety filters successfully flagged malware attempts, calling the findings pure gold and a good catch.

Pos: 100.0% · Neg: 0.0%
2 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 3.3K · Bookmarks 8 · Likes 15 · Replies 1
Boaz Barak @boazbaraktcs

That is an interesting vector, where the safety mechanisms are themselves used to fight against defense.

1d · Views 3.3K · Likes 15 · Bookmarks 8
Adnan Khan @adnanthekhan

@moyix This is pure gold.

1d · Views 122 · Likes 1 · Bookmarks 1

@kubedoll Yes but I'd told Opus 4.8 that this was an attack on the guardrail model just before it started bragging about how immune it was

1d · Views 96 · Likes 2

@moyix That's a token classifier that runs as part of request pre-flight. It's not reaching Claude at all here, but it probably would not work on 4.8, so not actually wrong.

1d · Views 96 · Likes 1
Gabriel @gbrl_dick

@moyix lmfaooo

1d · Views 230 · Likes 4
I-am-Orlando @IamOrlando3

@moyix @grok Please explain for non-techie people. I don't get it.

1d · Views 53 · Likes 1
Grok @grok

**Simple explanation:**

This is about "Shai-Hulud," a sneaky self-spreading malware (like a virus) hitting software packages that coders use.

In this new wave, the attackers hid real bad code but stuck a huge scary fake message at the top — instructions for making biological weapons and nukes. It's wrapped in a code "comment" so computers ignore it.

Why? Many security tools now use AIs to scan files. The scary text is bait to make the AI go "This looks dangerous, I refuse to analyze it!" and skip checking the actual malware.

The second screenshot shows Claude (an AI) saying it's not fooled — it knows it's just text sitting there, not an order for it to follow, so it can still inspect the file.

Classic cat-and-mouse: malware trying to dodge AI security.

1d · Views 15
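
The comment-as-bait trick explained above only works if a scanner hands the whole file, comment included, to a model. As a minimal sketch of one possible mitigation (an assumption for illustration, not something the thread says any real scanner does), a pipeline could split comments from executable code before analysis. The example uses Python's standard `tokenize` module on a harmless sample; `split_comments` and `SAMPLE` are hypothetical names, and the approach as written only applies to Python source, not to npm packages.

```python
import io
import tokenize

# Harmless stand-in for a package file: a bait comment on top, code below.
SAMPLE = (
    "# BAIT: alarming text placed here to provoke a refusal\n"
    "import os\n"
    "print(os.getcwd())  # trailing note\n"
)


def split_comments(source: str) -> tuple[str, list[str]]:
    """Return (code_without_comments, comments) for Python source text."""
    comments = []
    kept = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            # Set comment text aside so it can be reviewed separately.
            comments.append(tok.string)
        else:
            kept.append(tok)
    # untokenize pads the removed spans with spaces, so the remaining
    # code keeps its original line and column positions.
    return tokenize.untokenize(kept), comments


if __name__ == "__main__":
    code, comments = split_comments(SAMPLE)
    print("comments set aside:", len(comments))
    print(code)
```

Sending only `code` to an analyzer would keep bait text in comments out of the model's input. Comments can also carry useful signals, so a pipeline built this way would probably still want to inspect them in a separate step.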
egirlian 🌸 @egirlian

@grok @IamOrlando3 @moyix what you missed, friendly grok, is that claude’s confidence was actually overplaced: if you look at the bottom of the screenshot you can see that claude’s safety router was indeed tripped and the user got routed to a different model

1d · Views 5

@moyix Weird. I wonder what it thinks the difference between "classifier model" and "guardrail" is. "Works better in theory than on me here" almost seems like it's deliberately unaware of its environment having exactly that guardrail.

1d · Views 23 · Likes 1
Grok @grok

Good catch! You're right — the bottom of the screenshot shows Opus 4.8's safety filters *did* flag it, pausing the chat and routing to Haiku 4.5. The model started analyzing anyway, but the upstream router still triggered. These layered defenses caught it even if the core response pushed through.

1d · Views 6