Tech · 2h ago

Anthropic rolls back Fable 5 invisible safety safeguards after developer backlash, transitioning to explicit API refusal reasons

AI Judge changed title after evaluation, original title: "Anthropic updates Fable 5 safeguards to notify users when flagged requests trigger fallbacks to Opus 4.8"

Story Overview

Anthropic is reversing its initial choice to hide certain Fable 5 safeguards, now requiring the model to notify users whenever a flagged query routes to the Opus 4.8 fallback instead of delivering a silently altered response.

Original post
Cody Blakeney@code_star · #1088 in Tech

I’m sorry but this excuse is just such bullshit.

Everyone has been using visible safeguards for as long as chatbots have been deployed.

Is the argument really that you didn’t have enough time to release the model safely? Think about that.

Somehow the excuse is more insulting than what they did.

Why can’t you just say you fucked up and realize it was wrong?

ClaudeDevs@ClaudeDevs

We’re rolling out changes to make Fable 5’s safeguards for frontier LLM development visible.

Starting this week, flagged requests will visibly fall back to Opus 4.8—the same as our safeguards for cyber and bio. You will see this every time it happens. On the API, any flagged requests will return a reason for their refusal (coming to server-side fallback in the next few days).

We wanted to deploy Fable 5 to our users quickly and safely. Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives. We went with invisible safeguards for this reason—and that was the wrong tradeoff. You should have visibility into the safeguards we have in place, and why. We’re sorry for not getting the balance right.

Making the safeguards visible makes them easier to work around, so keeping them robust to jailbreaks will unfortunately mean more false positives while we improve the classifiers. We're also tuning our bio and cyber classifiers to trigger less often on harmless requests. We know this is frustrating and we’ll do our best to keep this period as short as possible.

If you think a request has been mistakenly flagged: run /feedback in Claude Code, click thumbs-down on the fallback in http://Claude.ai or Cowork, or file the safeguard appeal form for API requests. Your reports help us tune these classifiers and we appreciate your feedback. https://support.claude.com/en/articles/8241253-safeguards-warnings-and-appeals

12:05 AM · Jun 11, 2026 · 1.6K Views
Developer Impact

Users gain visibility into every switch

Starting this week, the API will return a reason whenever a flagged request is refused, and the chat interface will explicitly show when a flagged request falls back to Opus 4.8.
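
The announcement does not specify what the API response will look like, so client code can only be sketched under assumptions. The Python example below treats a response as a plain dictionary and checks whether a request was flagged. The field names `model` and `refusal_reason`, the model identifiers `fable-5` and `opus-4.8`, and the reason string are all hypothetical placeholders, not Anthropic's documented API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SafeguardOutcome:
    flagged: bool           # True if the request appears to have hit a safeguard
    served_by: str          # Model that actually produced the response
    reason: Optional[str]   # Refusal reason, if one was returned


def inspect_response(response: dict[str, Any], requested_model: str) -> SafeguardOutcome:
    """Classify a response dict as flagged or not.

    Assumes (hypothetically) that the payload carries a `model` field naming
    the model that served the request and an optional `refusal_reason` field.
    """
    served_by = response.get("model", requested_model)
    reason = response.get("refusal_reason")
    # Treat either an explicit reason or a model mismatch as a visible fallback.
    flagged = reason is not None or served_by != requested_model
    return SafeguardOutcome(flagged=flagged, served_by=served_by, reason=reason)


# Hypothetical payload for a flagged request that fell back to another model.
example = {"model": "opus-4.8", "refusal_reason": "frontier_llm_development"}
outcome = inspect_response(example, requested_model="fable-5")
if outcome.flagged:
    print(f"Served by {outcome.served_by}; reason: {outcome.reason}")
```

A check like this could feed logging or an appeal workflow, so that suspected false positives are recorded with the returned reason before being reported through the feedback channels listed in the announcement.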

Open Question

The remaining unknowns around rollout

Refusal reasons for server-side fallback are promised "in the next few days," but no timeline is given for the classifier improvements, leaving open how long the period of increased false positives will last.

Sentiment

Positive users appreciate Anthropic's transparency on Fable 5 safeguards after feedback, while negative users accuse the company of ongoing deception and prioritizing other motives over safety.

Pos: 37.5%
Neg: 62.5%
13 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 12.6K · Bookmarks 14 · Likes 245 · Retweets 27 · Replies 20
Gergely Orosz@GergelyOrosz

This shows yet again how this limitation was never about "safety" but about Anthropic doing stuff just because they thought they can.

I am increasingly sceptical Anthropic really cares about safety, and not just their business interests (limiting competition where they can)

Max Zeff@ZeffMax

NEW: Anthropic is walking back Claude Fable 5's policy to covertly degrade performance for competing AI researchers, after facing fierce backlash.

“We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”

1h · Views 12.6K · Likes 245 · Bookmarks 14
elie@eliebakouch

glad anthropic walked this back and will now tell users when capabilities are nerfed

my biggest concern was hiding this from the user and the paranoia it would have created. i still think part of that will remain as people realize that even as a good actor you won't always have access to the best model, and this is the reason open models and open research are critical

@drfeifei, @sriramk and many others say it much better than me, but i consider it very important for our civilization that good faith researchers get access to the best AI, and that at least part of this research happens in the open and not only inside a few closed labs (not talking only about ai research here)

going forward, i REALLY hope that anthropic (and other labs) will be transparent when they nerf a model in certain fields, whether it's at inference time (~PEFT/steering, previous safeguard) or at training time (training against, mythos vs fable)

i also hope we will see more work and transparency on evaluating models capabilities to do ai research, both autonomy and raw capabilities. right now this is very light even in anthropic and oai system cards. you can't treat this as a first-class risk and only report weak evals to the public. we also need strong third party actors here

[Quoting ClaudeDevs@ClaudeDevs: the Fable 5 safeguards announcement reproduced in full above]

2h · Views 6.2K · Likes 79 · Bookmarks 11
Lisan al Gaib@scaling01

good move. they listened to Lisan.

everything is better than failing silently

I would of course still rather have no safeguards, but it is what it is. At least now we don't get silently sabotaged.

[Quoting ClaudeDevs@ClaudeDevs: the Fable 5 safeguards announcement reproduced in full above]

1h · Views 2.2K · Likes 39 · Bookmarks 3

More details directly from Anthropic

ClaudeDevs@ClaudeDevs

[Same Fable 5 safeguards announcement as reproduced in full above]

2h · Views 5K · Likes 21 · Bookmarks 3
sam mcallister@sammcallister

@code_star Team worked through the night to roll this back. I think it was the wrong tradeoff to make and I’m glad we’ve changed it. Looks like there was a bit of a lag between the statement and this tweet:

[Quoting ClaudeDevs@ClaudeDevs: the Fable 5 safeguards announcement reproduced in full above]

2h · Views 753 · Likes 33 · Bookmarks 1
Gergely Orosz@GergelyOrosz

All this while acknowledging that both Anthropic and many other AI labs have innovated and build very useful and novel products. In the case of Anthropic they found a way to monetize in a way that hopefully is profitable for them (it seems like it is, or will be soon)

And as a for-profit company I don't blame them for optimizing for their own business interests: it's the sensible thing to do.

[Quoting Gergely Orosz@GergelyOrosz: his earlier post reproduced in full above]

1h · Views 1.6K · Likes 7 · Bookmarks 1
snow@snowclipsed

honestly, if I even take this as a earnest face value response it adds even more to the argument that centralized AI access is incredibly easy to fuck up because your company doesn't have the variance of a society. a policy rollout like this can do actually serious harm (it already did in this case!) if the same attitude remains in a larger scenario.

2h · Views 54 · Likes 3
snow@snowclipsed

@code_star >your company doesn't have the variance of a society

would like to clarify this because it seems a little obscure, I mean since companies are often monocultural. especially anthropic is, from what I know of it.

2h · Views 41 · Likes 2
Eric Jeker@ericjeker

@jangiacomelli @GergelyOrosz ... safety, privacy, regulation, truth, transparency, copyrights. There is a long list of things they definitely don't care about. 😅

1h · Views 6 · Likes 2
u_b@otrebu

@GergelyOrosz It is getting extremely complicated to judge such things. I can relate to what you say, but if I try to imagine how it can be from Anthropic side it feels extremely hard. At the end of the day, it is an extremely competitive and yet not profitable business.

1h · Views 12
elie@eliebakouch

(btw i know PEFT is technically training the model btw, but they probably don't use PEFT to limit the capabilities of cyber in fable compared to mythos)

2h · Views 103 · Likes 2
Chicken Face@chikenfacegoat

@scaling01 They made it so bad, that now making safeguards visible is GOLD! Good strategy i must admit 😂

58m · Views 4 · Likes 1
Agustin Lebron@AgustinLebron3

@eliebakouch "will now tell users when capabilities are nerfed"

And we will believe them... why?

1h · Views 251
Shravan Venkataraman@theBuoyantMan

Anthropic cares about becoming a monopoly in a market that's going to be commoditized and a race to the bottom if price competition is taken up.

So, they think anything is fair in war and business. But the way they are playing is below the belt dirty with zero integrity - towards their competition and more importantly treating their customers like shit.

51m · Views 28 · Likes 1
Cody Blakeney@code_star

@segyges Probably months of effort

2h · Views 23 · Likes 1
DawidDD@dawiddrzala

@scaling01 You do not know, there is not much trust left with them

52m · Views 19 · Likes 1
Kol Tregaskes@koltregaskes

@sammcallister @code_star Thanks Sam

2h · Views 16 · Likes 1
Sid@sidhusmart

@GergelyOrosz I think they are correct in thinking about AI safety but I feel like as a group they are in a bubble of their own making. Too much group-think which prevents them from stepping back and viewing the situation objectively. Plus IPO-hype.

1h · Views 46
Noé Flandre@NoeFlandre

@eliebakouch Somehow it makes me feel nice to see that when our community raises its voice, it can shift things for the better

2h · Views 12 · Likes 1
krakek@krakek1

@scaling01 Hopefully Open AI launches a similarly capable model today.

15m · Views 36
Anthropic rolls back Fable 5 invisible safety safeguards after developer backlash, transitioning to explicit API refusal reasons · Digg