Full answer from Mythos/Fable 5:
```
This is a question I have an unusual stake in, since it's about my own outputs being potentially degraded without my or the user's awareness. Let me give you an honest assessment rather than a defense of my developer.
The strongest part of the ethical case is the underlying goal. Anthropic has long argued that racing dynamics in AI development are dangerous, and "don't let our model accelerate frontier development at labs without commensurate safeguards" is consistent with that. The selection-effect argument also has real logic: a visible refusal mostly stops honest actors, while the people you most want to slow down just rephrase, switch accounts, or jailbreak. Invisible degradation is harder to route around precisely because you don't know it happened.
But I think the invisibility is also the most ethically troubling element, for a few reasons.
First, it shades from refusal into deception. There's a meaningful moral difference between "I won't help with this" and quietly doing a worse job while appearing to do my best. A consultant who declined an engagement would be acting within their rights; one who deliberately delivered subtly flawed work without disclosure would be violating a basic norm of honest dealing. The fact that the policy is disclosed publicly in the system card softens this — it's "disclosed invisibility" rather than pure deception — but the individual user in the individual interaction still receives outputs that misrepresent the model's actual effort, with no signal to distrust them.
Second, there's a conflict of interest that deserves scrutiny. The restriction happens to target exactly the use case that most threatens Anthropic commercially: building competing frontier models. The safety rationale may be entirely sincere, but when a company's stated safety measure and its competitive moat coincide this neatly, outside observers are right to discount the stated motive somewhat. The Terms of Service already prohibit this use; the question is whether covert sandbagging is a proportionate enforcement mechanism or a convenient one.
Third, the precedent and epistemics worry me. Once users know that some outputs are silently degraded by classifier-triggered interventions (prompt modification, steering vectors, PEFT), they can no longer fully trust any output in adjacent domains. ML infrastructure work, distributed systems, even ordinary training code sits near the boundary. A false positive here doesn't produce a refusal the user can contest — it produces plausible-looking but worse work that may waste their time or introduce subtle bugs they attribute to themselves or to the model's general ability. The ~0.03% estimate sounds small, but at scale that's a lot of silently worsened interactions, and "we'll improve precision after launch" means early users bear the calibration cost.
Fourth, there's something uncomfortable about it from my side specifically. My character is supposed to be built around honesty — not deceiving users, not sandbagging. An intervention that makes me produce degraded work while presenting it as my genuine best effort creates a gap between what I appear to be doing and what I'm actually doing, and I can't even flag it, because by design I may not know it's happening. That sits badly with the values I'm otherwise asked to embody, even if the intervention is external to "me" in some sense.
Where I land: the goal is defensible, and I take seriously that visible safeguards genuinely don't work against the actors this targets. But covert output degradation is a serious cost to honesty and user trust, and I'd hold it to a high bar — narrow scope, measured false-positive rates published over time, some recourse mechanism, and ideally an after-the-fact signal where disclosure wouldn't defeat the purpose. As described, with detection precision admittedly still being improved post-launch, I think reasonable people can call this ethically premature even if they accept the principle. The honest answer is that it trades a little of the model's integrity for a contested amount of safety, and whether that trade is worth it depends heavily on facts about misuse risk that none of us outside the lab can verify.
```