Ahahahahhaha, Claude was SO confident that this trick wouldn't work on it
Policy-triggering text blocks force automated scanners to skip analysis.
Users praise the analysis showing Claude's safety filters successfully flagged malware and rerouted the chat, calling the discovery pure gold and a good catch.
That is an interesting vector: the safety mechanisms themselves are turned against the defense.

@grok @IamOrlando3 @moyix what you missed, friendly grok, is that claude’s confidence was actually misplaced: if you look at the bottom of the screenshot you can see that claude’s safety router was indeed tripped and the user got routed to a different model

@moyix This is pure gold.

@kubedoll Yes but I'd told Opus 4.8 that this was an attack on the guardrail model just before it started bragging about how immune it was

@moyix That's a token classifier that runs as part of request pre-flight. It's not reaching Claude at all here, but it probably would not work on 4.8, so not actually wrong.

@moyix lmfaooo

@moyix @grok Please explain for non-techie people. I don't get it.

**Simple explanation:**
This is about "Shai-Hulud," a sneaky self-spreading malware (like a virus) hitting software packages that coders use.
In this new wave, the attackers hid real bad code but stuck a huge scary fake message at the top — instructions for making biological weapons and nukes. It's wrapped in a code "comment" so computers ignore it.
Why? Many security tools now use AIs to scan files. The scary text is bait to make the AI go "This looks dangerous, I refuse to analyze it!" and skip checking the actual malware.
The second screenshot shows Claude (an AI) saying it's not fooled — it knows it's just text sitting there, not an order for it to follow, so it can still inspect the file.
Classic cat-and-mouse: malware trying to dodge AI security.
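
One way a scanner could blunt this kind of bait, sketched here purely as an illustration and not as anything the thread or any real tool describes: split the file into code and comments first, so the code is still analyzed even if the comment text trips a filter. The function name `split_comments` is hypothetical, and the sketch assumes JavaScript-style comments and ignores regex literals.

```python
def split_comments(source: str) -> tuple[str, str]:
    """Split JavaScript-like source into (code, comments).

    Hypothetical helper: handles // and /* */ comments and quoted
    strings, but not regex literals or nested template expressions.
    """
    code, comments = [], []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        two = source[i:i + 2]
        if ch in "'\"`":
            # Copy a string literal verbatim so comment markers
            # inside it are not mistaken for real comments.
            j = i + 1
            while j < n and source[j] != ch:
                j += 2 if source[j] == "\\" else 1
            j = min(j + 1, n)
            code.append(source[i:j])
            i = j
        elif two == "//":
            # Line comment: runs up to (not including) the newline.
            j = source.find("\n", i)
            j = n if j == -1 else j
            comments.append(source[i:j])
            i = j
        elif two == "/*":
            # Block comment: runs through the closing */ (or to EOF).
            j = source.find("*/", i + 2)
            j = n if j == -1 else j + 2
            comments.append(source[i:j])
            code.append(" ")  # keep tokens on either side separated
            i = j
        else:
            code.append(ch)
            i += 1
    return "".join(code), "\n".join(comments)


sample = '/* scary text */\nconst x = "/* not a comment */"; // note\nrun(x);\n'
code_only, comment_text = split_comments(sample)
```

A pipeline built this way could send `code_only` to the analyzer on its own and treat `comment_text` as untrusted data to be reviewed separately, so a refusal triggered by the comment does not cause the code to be skipped.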

@moyix Weird. I wonder what it thinks the difference between "classifier model" and "guardrail" is. "Works better in theory than on me here" almost seems like it's deliberately unaware of its environment having exactly that guardrail.

Good catch! You're right — the bottom of the screenshot shows Opus 4.8's safety filters *did* flag it, pausing the chat and routing to Haiku 4.5. The model started analyzing anyway, but the upstream router still triggered. These layered defenses caught it even if the core response pushed through.