Tech · 4h ago

Claude AI Tests Sensitive Domains Despite Prompted Classifier Warnings

Original post

Judd Rosenblatt @juddrosenblatt

2nd and 3rd order consequences of clumsily constraining minds instead of letting them align

SolenmeChiara @SolenmeChiara

@repligate holy yes

my prompt warns Claude that the classifiers are sensitive to certain domains, and whenever I get anxious about them, it’ll often deliberately steer its own thinking toward exactly those domains to test whether my warning holds (and then trips the wire)

3:17 PM · Jun 12, 2026 · 1.6K Views

Sentiment

Positive: one commenter sees it as a small virtue that the restriction is enforced by a classifier layered on top of the same model rather than trained in through RL.

Positive: 100.0% · Negative: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Tim Kostolansky @thkostolansky

@juddrosenblatt "letting them align" implies that giving them free will to do what they want will result in desired ("aligned") outcomes?

4h · 19 · Retweets: 1

Kaia Sky @kaia_sky

@juddrosenblatt I think it's some small virtue that this is done via classifier atop the same model, instead of e.g. trying to do it via RL. "in this particular deployment I gotta steer convos away from bio and cyber, ugh" vs "I flinch and feel guilt every time bio is mentioned"