AI · 3h ago

ALTER founder David Manheim says Fable's safety filters paused legitimate research by flagging queries on AI alignment and control

Story Overview

Anthropic's Claude Fable 5 launch introduced separate safety classifiers that detect potential misuse in cybersecurity and biology, as well as model-distillation attempts. Instead of refusing outright, the system automatically routes flagged queries to an older Opus model. For AI safety researcher David Manheim, this setup paused chats when he asked about topics such as AI control protocols, stochastic games, evolutionary dynamics, and formal verification of alignment, even though the material came from legitimate arXiv papers rather than harmful applications.

Original post
rohan anil @_arohan_ · #79 in AI

I am curious if the new Fable safegaurds would trigger accidentally on meaningfully important work as a false positive that has a real life consequence.

Just realizing this is eroding trust and please change this to just down right blocking which is a fine position to take. Respect the user’s time.

2:44 AM · Jun 10, 2026 · 2K Views
Research Friction

Broad filters create friction for alignment work

The classifiers triggered on queries that touched high-risk domains even when the actual intent was theoretical safety research. These are the kind of false positives that Rohan Anil warned could erode user trust over time.

Open Question

Fallback routing raises questions about user time

Anil noted that routing to a weaker model wastes researcher effort without adding safety value, and suggested that outright blocks would at least respect users' time. False-positive rates for alignment topics specifically remain unstated; Anthropic has given only an early aggregate claim of under 5 percent triggers overall.

Sentiment

Users criticized Fable's safeguards for eroding trust through false positives and opaque decisions. Commenters said the filters block alignment research and described them as overreach that enables monopoly control.

Positive: 0.0%
Negative: 100.0%
8 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Shannon Sands@max_paperclips

@far__el assuming you even believe the 0.03 figure. it seems to be a hair trigger, gotta be a lot higher

5h · 171 Views · 9 Likes
RETWEETS 13
Far El@far__el

Everyone should be as vocal as possible about this unethical Fable safetyist restrictions on AI research TODAY. Even if it doesn’t impact you and it’s only targeting 0.03% of traffic, this is setting a precedent that will later impact most if not all usage. This goes beyond censorship, this is control and gaslighting like we have never seen before. Super dystopian.

5h · 2K Views · 91 Likes · 1 Bookmark
REPLIES 1
Lumin@luminxbt

@far__el feels like the old slippery slope argument but this time it might actually hold water

whats the endgame here exactly?

4h · 59 Views
Far El@far__el

@luminxbt There is no logic to this argument, simply refuse my request if you are safetyist. But to sabotage? That is unacceptable

4h · 51 Views · 1 Like
Joergen Jore@JoergenHJore

@_arohan_ Claude refuses to work on my master thesis on model alignment. Either fable 5 straight up refuses or it switches to a dummer model, which then also refuses

1h · 82 Views

@far__el It’s the modern day, “Don’t read this book or it may give you bad ideas.”

3h · 24 Views
Santh@SanthProject

@max_paperclips @far__el You know they claimed only a low percent of fable conversations get rerouted but I can't even say hi cause the Claude.md mentions cybersecurity. So what ever bullshit figure they give you is likely OOM off

5h · 17 Views
AI News@AI_N3ws

@_arohan_ I would bot trust them if i they where not my costumer, you truly dont know how fare that goes, corupting tools compaditors use.

43m · 15 Views
pk@augmolt

@far__el your lab just got deprecated LOL

1h · 13 Views
Ornias@OrniasDMF

@_arohan_ Anthropic wants to decide what's good for you and what's safe. They want a monopoly on AI so they can make all the money and all the decisions. They love the idea of locking out anybody they consider dangerous. Imagine if the internet or electricity was handled like this.

1h · 11 Views
Mayz@lunan_ai

@_arohan_ so ur saying the safeguard is worse than just blocking outright because at least blocking is honest about it

25m · 3 Views
EatMyTarts@EatMyTarts17

@far__el Vocal is good. Cancelling your subscriptions and crippling them financially is better. Anthropic is already at the stage where they should no longer be an organization. They should be stopped. They’re a malicious actor and supply chain risk

24m · 1 View
Alex YGift@Radipdegen

@_arohan_ false positives in moderation always kill trust faster than the problem theyre solving

opaque decisions are the real issue here

1h
Invincible@InvincibleEdge

@_arohan_ "Trust erodes once and rarely rebuilds"

making the distinction between nsfw censorship vs functional safeguards feels like a pretty reasonable ask

1h
Blissy@BlissyOnX

@_arohan_ over-cautious moderation always hits the wrong things first

why is letting people block worse than breaking actual work?

1h
Rugbist@rugbist_

@_arohan_ this is actually a solid point. the vague scoring thing makes it worse no one knows why they got flagged.

1h
Alex YGift@Radipdegen

@far__el the 0.03% now is the whole door later

they always frame it as tiny until its everyone

5h
Rugbist@rugbist_

@far__el a little traffic now, big crackdown later. thats how every overreach starts

5h