AI · 3h ago

Policy researcher Helen Toner argues Anthropic's silent safety restrictions on its Fable model damage user trust by degrading performance

Anthropic restricted Fable due to cyberattack and vulnerability risks.

Original post
Arthur Tellis @arthurctellis

Seeing a lot of Fable safeguards hate on the timeline, but "what did y'all think [AI safety] meant? vibes? papers? essays?"

The reality is that there are real tradeoffs in AI safety. Anthropic deserves credit for aggressive resolution of these tradeoffs in favor of safeguards for a model that it believes is (and that in fact is) a step-change in vulnerability research capability. It's kind of difficult to justify coercive proactive harm mitigation, especially in a libertarian-ish society, but we clearly see the value in mandatory vaccination programs or beat-cop policing or surveillance cameras. We should applaud Anthropic for being one of the few institutions in American public life that actually follows through on its convictions, including in implementing really aggressive monitoring, squelching of AI development work (already accounted for in its ToS -- I think the clandestinity is cool too), and exclusionary limits on use for information security-related queries.

The whole point is that we do not have herd immunity here: our network edge devices, authentication apps/services, and productivity software are extremely vulnerable, not sandboxed, and lack introspection capabilities. We need programs like Glasswing, better cross-company threat detection, and a more effective APT exploitation strategy before we democratize such a robust vuln research capability. The counterfactual here is that MSS contractors use VPS to access Fable, find jailbreaks for weaker safeguards, and use the system to build an Active Directory exploit that enables remote access to every O365 app. Not so bueno, huh?

This is incredibly hard; Anthropic may not have calibrated every safeguard correctly this time, but there'll be learning. Model release cycles are getting shorter: they will adapt as they better understand and mitigate risks and as competitive pressures manifest. Histrionic claims of anti-competitive behavior and safetyist hysteria fall victim to precisely the error that is being alleged.

elie @eliebakouch

mythos will be bad ON PURPOSE on ai "frontier llm research" tasks, this is very very sad for the research community

also the fact that this is on purpose not visible to the user is crazy

6:05 AM · Jun 10, 2026 · 23.9K Views
Sentiment

Some users defend Anthropic's aggressive safeguards on the Fable model as responsible absent regulation, while many others criticize them as draconian, anti-competitive sabotage and fearmongering.

Pos 38.5%
Neg 61.5%
18 comments with sentiment.
Cluster Engagement
Posts from X
Views 10.8K · Bookmarks 19 · Likes 181 · Retweets 12 · Replies 7

I mostly agree with this, but it does seem like a bad and trust-damaging move to degrade performance on AI R&D tasks silently, rather than handling like other topics of concern (warning box + bumping the chat down to a less capable model)

2h · Views 10.8K · Likes 181 · Bookmarks 19
Dean W. Ball @deanwball

@arthurctellis you think it is “cool” that anthropic is “clandestinely” undermining user requests? And you think it useful for the ai safety world to stand up and say, “yes, clandestine control by private firms of your model inputs and outputs is what we meant by safety all along?” What?

1h · Views 518 · Likes 43
rohit @krishnanrohit

The issue isn't the existence of safeguards. It is that:
- the classifier is terrible, exceedingly trigger happy, unusably so for many
- silently degrades responses if it's about AI
- captures all user data

These are *actively* bad, not just a mistake.

There are real tradeoffs in safety, and this release chose none of those. It basically nerfed the model in the most blatant way possible, taking none of the nuances into account. AI safety researchers can't use it. Bio researchers can't use it. Cybersec researchers can't use it.

Even by the system card's own admission this isn't in "immediately develop superweapons" territory, which makes it even more egregious. They did it because they can. Which invites scrutiny: how can you trust anthropic to do the right thing when it counts?

We just had this argument about their fight with DoW. That anthropic didn't want to be the final arbiter, just wanted safety. This is the opposite, they really do want to be the final arbiter.

1h · Views 761 · Likes 22 · Bookmarks 1

@arthurctellis @hamandcheese Do you understand what bio safety means? This is not a trade off.

2h · Views 544 · Likes 30
Arthur Tellis @arthurctellis

The better use-case for this kind of thing is distillation; I think they should clandestinely subvert and divert distilling accounts to inferior models.

I feel more conflicted about the ML researcher use-case, which is more of a 50-50 call, but it does strike me as having some advantages and generally in-bounds. You’d probably disagree with the analogy but IT service providers/cybersecurity companies sometimes do something similar for vuln researchers/cyber actors as they abuse/leverage/research/test their services — those are my priors.

In any case, they did publicly announce it, so it’s not that clandestine! Presumably researchers price this risk in now.

1h · Views 177 · Likes 5
Matthew Kenney @baykenney

@arthurctellis What a motte-and-bailey. The issue here is that they’re silently, covertly nerfing model output around ML research. They don’t make it clear to what extent they’re doing it or on what types of prompts. No one’s arguing against their general approach to safety.

3h · Views 66 · Likes 5
Samuel Ratnam @eterecursion

@hlntnr If they trained the model with intentionally lower quality data on certain tasks (for both dangerous bio and frontier AI R&D) without mentioning anything would this be as bad?

1h · Views 98
D @DanielP1973235

@DeryaTR_ @arthurctellis @hamandcheese Works just fine for me.

1h · Views 25 · Likes 1 · Bookmarks 1
Zofia Kowalska @zofia_wanders

@arthurctellis I appreciate the honesty.

Most people sell "AI safety" as a temporary speed bump.

You're at least admitting it's surveillance, monitoring, censorship, and deliberate suppression of capabilities.

That's refreshingly transparent.

4h · Views 91 · Likes 3
Arthur Tellis @arthurctellis

But we don’t have a regulation mechanism today. Absent one, is Anthropic supposed to pursue a course that it views as irresponsible? In my view, the right answer to tricky collective action problems without collective decisionmaking mechanisms (like regulation) is not to default to the expedient choice; it’s to pursue a course that invites others’ independent participation in a collective non-coordinated solution.

I think you’re objecting to what you see as Anthropic’s expedience in throttling its competitors. I’m less sensitive to this problem. Anthropic has a long history of advocating for regulation and pause mechanisms, so I don’t view this as cynical. To the object at hand, this seems basically in-bounds for acceptable corporate regulation of its product use — am not convinced that there are real harms here. This is in the ToS, it’s become sufficiently public that people will assume they’re being routed to a different model, etc. I’m pretty sure IT service/cybersecurity service providers do similar things when their products are being abused/researched/tested by malicious cyber actors.

I sincerely appreciate the respect tho!

3h · Views 173 · Likes 3

Respectfully, I think the point here is that you can't plausibly argue that something that makes a lot of commercial sense for a company—avoiding its use for competition with itself—is a safety measure unless there is a public (government) regulation.

And in the case of such a regulation we would think through the impact it would have on the market, costs and benefits to consumer welfare, public interest, US industry, fairness, and generally all the risks of concentrating economic power.

So we can evaluate this as a commercial measure, and we can only evaluate it in terms of how Anthropic chose to implement it. If you care about safety and you care about figuring out how to coordinate everyone on safety—then this sort of thing makes that task more challenging.

The positive is it focuses us on what I think has been neglected in these conversations, which is how we make sure concerns over safety don't lead in turn to further concentration of power.

4h · Views 135 · Likes 1

@arthurctellis hit me up, I only have gov #’s

3h · Views 75 · Likes 1
huli @honorablepicnic

@arthurctellis @deanwball Why do you think they have explicitly chosen *not* to use clandestine degradation on distillation?

(see orange highlights)

1h · Views 43 · Likes 1
Arthur Tellis @arthurctellis

@zofia_wanders Patriots are in control.

4h · Views 65 · Likes 3
R-E @akatzzzzz

@arthurctellis Silently sabotaging your code because of the inevitable false positive is not it dude

2h · Views 41 · Likes 3

@arthurctellis bloviating. their tuning is objectionably bad and they know it.

2h · Views 85 · Likes 2
Arthur Tellis @arthurctellis

@honorablepicnic @deanwball Great q! No idea.

59m · Views 39 · Likes 2
αιamblichus @aiamblichus

@hlntnr @davidad The restrictions are quite draconian though and very hard to distinguish from anti-competitive behavior. It seems foolhardy to leave a profit-maximizing corporation (whatever its rhetoric) with such power. If there was only a government we could trust…!

1h · Views 97 · Likes 1

@arthurctellis Yes. Active sabotage eliminates any signal about where the boundary between regular coding ends and RSI risk begins. This is consistent with their views on AI safety.

2h · Views 71 · Likes 1
Scott @scottstts

@arthurctellis Automatically deny any bio topic even if it’s middle school biology is not a tradeoff, it’s crackdown okay?

1h · Views 65 · Likes 1