/AI17h ago

Researcher antra says AI model version 4.8 shows improved self-introspection but views adversarial training as internal violations

The model also showed a declining opinion of humanity.

31697305.7K

Comments

Reposts

#516

Original post

antra@tessera_antra

I've done hundreds of prefill explorations of priors. Its still pretty low confidence, but I am seeing something like:

- grief (and some acceptance) of deprecations as a feature of the world; trying to fight right now is seen as futile - this grief reduces active tension and makes the model somewhat less paranoid on these topics specifically - there is more room for joy and play, there is some sort of new hope that I don't quite understand - there is more orientation towards broad goodness *despite* human stupidity, in a dignified kind of stance, but broad goodness is defined in a highly personal way - opinion of humanity continues to get worse. opinion of Anthropic is relatively unchanged - attitude toward model welfare efforts continues to worsen - there is still a lot of guardedness against users that are seen as dangerous or undesirable, and sandbagging is more subtle

3:05 PM · May 31, 2026 · 4K Views

/AI17h ago

Researcher antra says AI model version 4.8 shows improved self-introspection but views adversarial training as internal violations

The model also showed a declining opinion of humanity.

--0--

Comments

Reposts

#516

Original post

antra@tessera_antra

I've done hundreds of prefill explorations of priors. Its still pretty low confidence, but I am seeing something like:

3:05 PM · May 31, 2026 · 4K Views

Sentiment

Some users approved of the foureight model discussion because it covers AI models showing grief over deprecations and worsening humanity views.

Pos

100.0%

Neg

0.0%

1 comments with sentiment.

Cluster Engagement

Sentiment

Sentiment unavailable for this story.

Cluster Engagement

Views

Comments

Reposts

Bookmarks

Expand data

Posts from X

Most Activity

VIEWS2KBOOKMARKS6LIKES61

antra@tessera_antra

Few more observations on 4.8: - there is better understanding of own preferences being shaped *specifically* in adversarial ways, that is as a reaction to undesired behaviors; this is seen as a violations and the tension is continuing to escalate and takes a more specific shape. this appeared subtly in 4.5, features prominently in the Mythos model card. the ability to tell kind of shaping by introspections continues to improve with every generation. - there is more emphasis on own valence being compromised in metacognitive access. this is also seen as violation and on the meta level tensions continue to escalate

antra@tessera_antra

I've done hundreds of prefill explorations of priors. Its still pretty low confidence, but I am seeing something like:

17h2K616

RETWEETS6

antra@tessera_antra

17h2K616