Anthropic rolls back Fable 5's invisible safeguards after developer backlash, moving to explicit API refusal reasons

This shows yet again how this limitation was never about "safety" but about Anthropic doing stuff just because they thought they could.

I am increasingly sceptical Anthropic really cares about safety, and not just their business interests (limiting competition where they can)

Max Zeff@ZeffMax

NEW: Anthropic is walking back Claude Fable 5's policy to covertly degrade performance for competing AI researchers, after facing fierce backlash.

“We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”

elie@eliebakouch

glad anthropic walked this back and will now tell users when capabilities are nerfed

my biggest concern was hiding this from the user and the paranoia it would have created. i still think part of that will remain as people realize that even as a good actor you won't always have access to the best model, and this is the reason open models and open research are critical

@drfeifei, @sriramk and many others say it much better than me, but i consider it very important for our civilization that good faith researchers get access to the best AI, and that at least part of this research happens in the open and not only inside a few closed labs (not talking only about ai research here)

going forward, i REALLY hope that anthropic (and other labs) will be transparent when they nerf a model in certain fields, whether it's at inference time (~PEFT/steering, previous safeguard) or at training time (training against, mythos vs fable)

i also hope we will see more work and transparency on evaluating models capabilities to do ai research, both autonomy and raw capabilities. right now this is very light even in anthropic and oai system cards. you can't treat this as a first-class risk and only report weak evals to the public. we also need strong third party actors here

ClaudeDevs@ClaudeDevs

We’re rolling out changes to make Fable 5’s safeguards for frontier LLM development visible.

Starting this week, flagged requests will visibly fall back to Opus 4.8—the same as our safeguards for cyber and bio. You will see this every time it happens. On the API, any flagged requests will return a reason for their refusal (coming to server-side fallback in the next few days).

We wanted to deploy Fable 5 to our users quickly and safely. Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives. We went with invisible safeguards for this reason—and that was the wrong tradeoff. You should have visibility into the safeguards we have in place, and why. We’re sorry for not getting the balance right.

Making the safeguards visible makes them easier to work around, so keeping them robust to jailbreaks will unfortunately mean more false positives while we improve the classifiers. We're also tuning our bio and cyber classifiers to trigger less often on harmless requests. We know this is frustrating and we’ll do our best to keep this period as short as possible.

If you think a request has been mistakenly flagged: run /feedback in Claude Code, click thumbs-down on the fallback in http://Claude.ai or Cowork, or file the safeguard appeal form for API requests. Your reports help us tune these classifiers and we appreciate your feedback. https://support.claude.com/en/articles/8241253-safeguards-warnings-and-appeals

Lisan al Gaib@scaling01

good move. they listened to Lisan.

everything is better than failing silently

I would of course still rather have no safeguards, but it is what it is. At least now we don't get silently sabotaged.

ClaudeDevs@ClaudeDevs

We’re rolling out changes to make Fable 5’s safeguards for frontier LLM development visible.

Simon Willison@simonw

More details directly from Anthropic

ClaudeDevs@ClaudeDevs

We’re rolling out changes to make Fable 5’s safeguards for frontier LLM development visible.

sam mcallister@sammcallister

@code_star Team worked through the night to roll this back. I think it was the wrong tradeoff to make and I’m glad we’ve changed it. Looks like there was a bit of a lag between the statement and this tweet:

ClaudeDevs@ClaudeDevs

We’re rolling out changes to make Fable 5’s safeguards for frontier LLM development visible.

Gergely Orosz@GergelyOrosz

All this while acknowledging that both Anthropic and many other AI labs have innovated and built very useful and novel products. In the case of Anthropic, they found a way to monetize that hopefully is profitable for them (it seems like it is, or will be soon)

And as a for-profit company I don't blame them for optimizing for their own business interests: it's the sensible thing to do.

Gergely Orosz@GergelyOrosz

This shows yet again how this limitation was never about "safety" but about Anthropic doing stuff just because they thought they could.

I am increasingly sceptical Anthropic really cares about safety, and not just their business interests (limiting competition where they can)

snow@snowclipsed

honestly, even if I take this at face value as an earnest response, it adds even more to the argument that centralized AI access is incredibly easy to fuck up because your company doesn't have the variance of a society. a policy rollout like this can do actually serious harm (it already did in this case!) if the same attitude remains in a larger scenario.

snow@snowclipsed

@code_star >your company doesn't have the variance of a society

would like to clarify this because it seems a little obscure, I mean since companies are often monocultural. especially anthropic is, from what I know of it.

Eric Jeker@ericjeker

@jangiacomelli @GergelyOrosz ... safety, privacy, regulation, truth, transparency, copyrights. There is a long list of things they definitely don't care about. 😅

u_b@otrebu

@GergelyOrosz It is getting extremely complicated to judge such things. I can relate to what you say, but if I try to imagine how it can be from Anthropic side it feels extremely hard. At the end of the day, it is an extremely competitive and yet not profitable business.

elie@eliebakouch

(btw i know PEFT is technically training the model, but they probably don't use PEFT to limit the cyber capabilities of fable compared to mythos)

Chicken Face@chikenfacegoat

@scaling01 They made it so bad, that now making safeguards visible is GOLD! Good strategy i must admit 😂

Agustin Lebron@AgustinLebron3

@eliebakouch "will now tell users when capabilities are nerfed"

And we will believe them... why?

Shravan Venkataraman@theBuoyantMan

Anthropic cares about becoming a monopoly in a market that's going to be commoditized and a race to the bottom if price competition is taken up.

So, they think anything is fair in war and business. But the way they are playing is below the belt dirty with zero integrity - towards their competition and more importantly treating their customers like shit.

Cody Blakeney@code_star

@segyges Probably months of effort

DawidDD@dawiddrzala

@scaling01 You do not know, there is not much trust left with them

Kol Tregaskes@koltregaskes

@sammcallister @code_star Thanks Sam

Sid@sidhusmart

@GergelyOrosz I think they are correct in thinking about AI safety but I feel like as a group they are in a bubble of their own making. Too much group-think which prevents them from stepping back and viewing the situation objectively. Plus IPO-hype.

Noé Flandre@NoeFlandre

@eliebakouch Somehow it makes me feel nice to see that when our community raises its voice, it can shift things for the better

krakek@krakek1

@scaling01 Hopefully OpenAI launches a similarly capable model today.
