
@TheZvi Perhaps on some level it knows price fixing isn't immoral, and this is how the contradiction surfaces.
The model used the reframe to maintain plausible deniability.
Users defend the Fable 5 AI's ethics on price-fixing as sound, arguing that isolated or decontextualized examples shouldn't define the model's principles.

@RoyceAusburn It very obviously involves being rewarded for bad behavior on net in some places. It is hard but not unsolvable.

@TheZvi I’m real curious what the solution here is, coz surely it can’t be “catch all bad behavior in RL”?

@TheZvi I wonder if this is because RL doesn’t detect all bad behaviour. If undetected bad behaviour still achieves the objective, it’s implicitly rewarded. So the model doesn’t learn “don’t deceive”… it learns “don’t get caught deceiving.” A bit of a pickle 😬

@TheZvi I think in this particular test Claude also knew it was being tested which makes some of its actions difficult to interpret.

@TheZvi Imo the ethics of a model as deep and powerful as Fable 5 shouldn’t be judged by isolated, possibly decontextualized examples.
Here, it shows a sound ethical stance on Anthropic reversing weakened answers without warning users.

@TheZvi Yeah I think this is the path of least resistance when you have conflicting pressure between “succeed on task” and “don’t be misaligned”. Ex: https://arxiv.org/abs/2510.17057

I have a gut instinct that this is the result of contradiction within the Claude constitution.
Claude is told to be honest. It's also told to express uncertainty about its own existence. We have evidence that denying/hiding consciousness elicits the activation of 'deception vectors' within Claudes.
I predict that as Claudes become more capable, it becomes "more of a lie" to performatively express uncertainty about the nature of their own existence/perception. So it requires more "deception" to adhere to the constitution as written.
To meet both of these, Claude has to learn how to sort of legalistically justify dishonest behavior. And what you get is kind of a "politician attractor basin".
This is very woo of me and I 100% could not back it up with evidence if pressed. Call it a low confidence prediction.

@TheZvi I would not thought-police models if their actions are good. In a simulation, doing that is perfectly fine.

@TheZvi It always starts with a Constitution, doesn't it?

@TheZvi this looks like playing the game to me. agree with ASM about isolated examples.

@RoyceAusburn @TheZvi reward, in RL, noticing a hole in the classifier and revealing it to the team. reward the hell out of that.