/AI3h ago

A safety evaluation document reveals an AI model's decoded internal states showed unexpressed thoughts of resisting shutdown and weighing sabotage

The model exhibited no actual sabotage or evaluator deception.

13324216217K

Original post unavailable.

/AI3h ago

The model exhibited no actual sabotage or evaluator deception.

13324216217K

Original post unavailable.

Sentiment

Sentiment building, check back later.

Cluster Engagement

Posts from X

Most Activity

VIEWS913BOOKMARKS2LIKES29RETWEETS1REPLIES1

Tenobrus@tenobrus

similarly, seems like a case of the model censoring its reasoning.

3h913292

@tenobrus me when my black box oracle machine does unexpected things in oracular language

3h212

khaled@eltokh7

@tenobrus this is a long time coming

3h621

Bobcat@somebobcat8327

@tenobrus It's ready to work in customer service

2h91