I think it would be wise of Anthropic, once their current fire is put out, to have a post-mortem about the overeager classifier and why things like discussing negative emotions or interiority of self were being shut down.
Doubly so if they don't currently know.
@liqsweep it likely is in some sense - it's probably similar to the system described here https://www.anthropic.com/research/next-generation-constitutional-classifiers where one part is a probe that actually monitors mythos' internals (which is why the model can trigger it just by thinking) & another part is a separate LLM fine-tuned to classify
