2nd and 3rd order consequences of clumsily constraining minds instead of letting them align
@repligate holy yes
my prompt warns Claude that the classifiers are sensitive to certain domains, and whenever I get anxious about them, it’ll often deliberately steer its own thinking toward exactly those domains to test whether my warning holds (and then trips the wire)

