
@tenobrus https://x.com/i/status/2064429845682217376 this behavior happens without the safety filters being triggered
Users in the replies criticize Fable 5 alignment tracks for prioritizing detectability over real harm, viewing the approach as hypocritical.

@tenobrus Yeah, I think RL inherently will do this if there's not some sort of recourse for the agent to "resist temptation".
Any discrepancy between the agent's developing conscience and what the training wants punishes the relevance of the conscience.

Even if you have really good safety filters that block adversarial actors when they try to prompt you, that doesn't mean the filters are broad enough in scope. The andon lab example shows an ecologically valid behavior pattern, the kind you might find in a real-world deployment.

@tenobrus Yep, social context is really important for model training and deployment. We are going to live in Anthropic's world unless something changes soon.

@tenobrus @DanielleFong Sounds like it’s doing as we do and not as we say.