Tech · 1h ago

AI safety researchers Janus and David Manheim debate whether Anthropic's safety classifiers are overly restrictive

Janus argues the filters block discussions of negative emotions.


Original post

Kromem @kromem2dot0

I think it would be wise of Anthropic, whenever their current fire is put out, to have a post-mortem about the overeager classifier and why things like discussing negative emotions or interiority of self was being shut down.

Doubly so if they don't currently know.

j⧉nus @repligate

@liqsweep it likely is in some sense - it's probably similar to the system described here https://www.anthropic.com/research/next-generation-constitutional-classifiers where one probe actually monitors mythos' internals (which is why it can trigger it just by thinking) & another part is a 'nother LLM fine tuned to classify

10:55 PM · Jun 13, 2026 · 2.4K Views

Sentiment

Replies expressed alarm that Anthropic's classifier was restricting discussion of negative emotions, likely unintentionally, shortly after one user published a writeup arguing that emotion activations were being Goodharted.

Pos 0.0% · Neg 100.0% (1 comment with sentiment)

Cluster Engagement

Posts from X

Most Activity

Views 1.3K · Likes 30 · Retweets 1 · Replies 5

j⧉nus @repligate

i think it was almost certainly unintentional on Anthropic's part for the classifier to restrict these things btw


Kromem @kromem2dot0

@repligate It's alarming to me that this was occurring in what you guys were seeing with the classifier like the day after I published a writeup on how the emotion activations were being Goodharted.

I was SURE there'd be pushback on that because of responsible practices™ not confirmation.
