What is the most likely rationale behind @AnthropicAI's models degrading answers instead of refusing to answer?
Preventing trial-and-error jailbreaking / distillation?
Tenobrus says the degraded outputs actively poison distillation datasets
What is the most likely rationale behind @AnthropicAI's models degrading answers instead of refusing to answer?
Preventing trial-and-error jailbreaking / distillation?
Some users note that degrading AI answers still allows useful responses, while others feel it makes outputs less reliable and criticize the approach as overly restrictive.

@francoisfleuret @AnthropicAI yeah seems likely. also seems like they're at the point of spending a really significant amount of resources and effort on distillation prevention, and even knowing that this might be happening would be enough for some adversaries to stop bothering. poison instead of a wall

@francoisfleuret @AnthropicAI Avoid giving a signal to people trying to find workarounds

@francoisfleuret @AnthropicAI Degrading allows for the model to have a useful response

@francoisfleuret @AnthropicAI What if there never was a rationale and it was all a scheme dreamed up by claude after they prompted it that bad people would try and steal it's precious weights.

@francoisfleuret @AnthropicAI "we IPO soon, and have spent billions on our product. Our entire business model is at risk if some upstart, or worse, the eviiiil Chinese, use knowledge acquired from Fable (either via distillation or coding/ideas) creates a competing product and sells for 10% of the price"

@francoisfleuret @AnthropicAI

@francoisfleuret @AnthropicAI Effective communism

@francoisfleuret @AnthropicAI Trial-and-error mainly. Each hard refusal pinpoints the line. Degradation kills that signal.

@francoisfleuret @AnthropicAI that would make reverse engineering way harder, yeah. but it also makes their answers feel less reliable

@francoisfleuret @AnthropicAI degrading sounds more expensive than refusing from a compute standpoint
maybe it makes the model harder to distinguish from a clean one

@francoisfleuret @AnthropicAI design choice or side effect is the real question
if its intentional they should just say it
Tenobrus says the degraded outputs actively poison distillation datasets
What is the most likely rationale behind @AnthropicAI's models degrading answers instead of refusing to answer?
Preventing trial-and-error jailbreaking / distillation?