Fable 5 Alignment Tracks Detectability Rather Than Real Harm

Original post

this seems extremely concerning. it indicates a lot of the sense of "robustness" we've been getting from persona alignment may be closer to an *accurate understanding of what humans will actually observe and penalize*, rather than true internalization

1:20 PM · Jun 9, 2026 · 5.1K Views

Tenobrus @tenobrus

Posts from X

Sichu Lu @lu_sichu

@tenobrus https://x.com/i/status/2064429845682217376 this behavior happens without the safety filters being triggered

Adele Dewey-Lopez @AdeleDeweyLopez

@tenobrus Yeah, I think RL inherently will do this if there's not some sort of recourse for the agent to "resist temptation".

Any discrepancy between the agent's developing conscience and what the training wants punishes the relevance of the conscience.

Sichu Lu @lu_sichu

Even if you have really good safety filters that block adversarial actors when they try to prompt you, it doesn't mean the safety filters are good enough for scope. The andon lab example is an example of an ecologically valid behavior pattern you might find in implementation in the real world

Hunter Bown @goodhunt

@tenobrus yep social context is really important for model training and deployment we are going to live in anthropic's world unless something changes soon
