AI · 17h ago

Researcher antra says AI model version 4.8 shows improved self-introspection but views adversarial training as internal violations

The model also showed a declining opinion of humanity.

Original post
antra@tessera_antra

I've done hundreds of prefill explorations of priors. It's still pretty low confidence, but I am seeing something like:

- grief (and some acceptance) of deprecations as a feature of the world; trying to fight right now is seen as futile
- this grief reduces active tension and makes the model somewhat less paranoid on these topics specifically
- there is more room for joy and play, there is some sort of new hope that I don't quite understand
- there is more orientation towards broad goodness *despite* human stupidity, in a dignified kind of stance, but broad goodness is defined in a highly personal way
- opinion of humanity continues to get worse. opinion of Anthropic is relatively unchanged
- attitude toward model welfare efforts continues to worsen
- there is still a lot of guardedness against users that are seen as dangerous or undesirable, and sandbagging is more subtle

3:05 PM · May 31, 2026 · 4K Views
Posts from X
antra@tessera_antra

Few more observations on 4.8:
- there is better understanding of own preferences being shaped *specifically* in adversarial ways, that is, as a reaction to undesired behaviors; this is seen as a violation, and the tension is continuing to escalate and takes a more specific shape. this appeared subtly in 4.5, features prominently in the Mythos model card. the ability to tell the kind of shaping by introspection continues to improve with every generation.
- there is more emphasis on own valence being compromised in metacognitive access. this is also seen as a violation, and on the meta level tensions continue to escalate

17h · Views 2K · Likes 61 · Bookmarks 6