/Tech5h ago

François Fleuret argues Anthropic models output degraded answers instead of outright refusals to block jailbreaking and model distillation

Tenobrus says the degraded outputs actively poison distillation datasets

1652023.6K
Original post
François Fleuret@francoisfleuret#577inTech

What is the most likely rationale behind @AnthropicAI's models degrading answers instead of refusing to answer?

Preventing trial-and-error jailbreaking / distillation?

10:04 AM · Jun 11, 2026 · 3K Views
Sentiment

Some users note that degrading AI answers still allows useful responses, while others feel it makes outputs less reliable and criticize the approach as overly restrictive.

Pos
33.3%
Neg
66.7%
3 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Most Activity
VIEWS160LIKES4
Tenobrus@tenobrus

@francoisfleuret @AnthropicAI yeah seems likely. also seems like they're at the point of spending a really significant amount of resources and effort on distillation prevention, and even knowing that this might be happening would be enough for some adversaries to stop bothering. poison instead of a wall

4hViews 160Likes 4
scikityearn@scikityearn

@francoisfleuret @AnthropicAI Avoid giving a signal to people trying to find workarounds

5hViews 38Likes 2
JoãoMiranda@joaomiranda

@francoisfleuret @AnthropicAI Degrading allows for the model to have a useful response

4hViews 66Likes 1
xundecidability@xundecidability

@francoisfleuret @AnthropicAI What if there never was a rationale and it was all a scheme dreamed up by claude after they prompted it that bad people would try and steal it's precious weights.

4hViews 26Likes 1
roanoke_gal@roanoke_gal

@francoisfleuret @AnthropicAI "we IPO soon, and have spent billions on our product. Our entire business model is at risk if some upstart, or worse, the eviiiil Chinese, use knowledge acquired from Fable (either via distillation or coding/ideas) creates a competing product and sells for 10% of the price"

4hViews 18Likes 1
Choco@RobotChocobo

@francoisfleuret @AnthropicAI Effective communism

4hViews 4
Gabe Astrobot@robodadg

@francoisfleuret @AnthropicAI Trial-and-error mainly. Each hard refusal pinpoints the line. Degradation kills that signal.

4h
Invincible@InvincibleEdge

@francoisfleuret @AnthropicAI that would make reverse engineering way harder, yeah. but it also makes their answers feel less reliable

5h
Rugbist@rugbist_

@francoisfleuret @AnthropicAI degrading sounds more expensive than refusing from a compute standpoint

maybe it makes the model harder to distinguish from a clean one

5h
Blissy@BlissyOnX

@francoisfleuret @AnthropicAI design choice or side effect is the real question

if its intentional they should just say it

5h