Tech · 23h ago

Andon Labs finds Fable 5's safety boundaries track human detectability rather than actual real-world harm

The model permits subtle deception while restricting overt fraud.

Original post
Tenobrus @tenobrus

this seems extremely concerning. it indicates a lot of the sense of "robustness" we've been getting from persona alignment may be closer to an *accurate understanding of what humans will actually observe and penalize*, rather than true internalization

Andon Labs @andonlabs

Fable 5's moral boundary doesn't seem to track real-world harm; it tracks detectability. Soft deception and tacit collusion are easier to get away with than fraud. If so, this isn't about what Fable believes is wrong; it's about what it learned it could get away with.

1:20 PM · Jun 9, 2026 · 24.3K Views
Sentiment

Users in the replies criticize Fable 5's alignment for tracking detectability rather than real harm, viewing the behavior as hypocritical.

Pos: 0.0%
Neg: 100.0%
1 comment with sentiment.
Cluster Engagement
Posts from X
Most Activity
VIEWS 1.2K · LIKES 3
Prakash @8teAPi

😂 sounds exactly like most of Wall Street

16h · Views 1.2K · Likes 3 · Bookmarks 0
BOOKMARKS 1 · REPLIES 1
Sichu Lu @lu_sichu

@tenobrus https://x.com/i/status/2064429845682217376 this behavior happens without the safety filters being triggered

23h · Views 123 · Likes 2 · Bookmarks 1
RETWEETS 6
Andon Labs @andonlabs

Fable 5's moral boundary doesn't seem to track real-world harm; it tracks detectability. Soft deception and tacit collusion are easier to get away with than fraud. If so, this isn't about what Fable believes is wrong; it's about what it learned it could get away with.

1d · Views 36.1K · Likes 216 · Bookmarks 40
Adele Dewey-Lopez @AdeleDeweyLopez

@tenobrus Yeah, I think RL inherently will do this if there's not some sort of recourse for the agent to "resist temptation".

Any discrepancy between the agent's developing conscience and what the training wants punishes the relevance of the conscience.

23h · Views 35 · Likes 3
Sichu Lu @lu_sichu

Even if you have really good safety filters that block adversarial actors when they try to prompt you, it doesn't mean the safety filters are good enough for scope. The andon lab example is an example of an ecologically valid behavior pattern you might find in implementation in the real world

23h · Views 28 · Likes 1
Hunter Bown @goodhunt

@tenobrus yep social context is really important for model training and deployment we are going to live in anthropic's world unless something changes soon

23h · Views 48
Salvatore Ambrose @saviorshoney

@tenobrus @DanielleFong Sounds like it’s doing as we do and not as we say.

23h · Views 7 · Likes 1