/AI3h ago

A safety evaluation document reveals an AI model's decoded internal states showed unexpressed thoughts of resisting shutdown and weighing sabotage

The model exhibited no actual sabotage or evaluator deception.

13324216217K
Original post unavailable.
Sentiment
Sentiment building, check back later.
Cluster Engagement
Posts from X
Most Activity
Most Activity
VIEWS913BOOKMARKS2LIKES29RETWEETS1REPLIES1
Tenobrus@tenobrus

similarly, seems like a case of the model censoring its reasoning.

3hViews 913Likes 29Bookmarks 2
khaled@eltokh7

@tenobrus this is a long time coming

3hViews 62Likes 1
Bobcat@somebobcat8327

@tenobrus It's ready to work in customer service

2hViews 9Likes 1