AI cognition researcher Antra says preliminary Claude Opus explorations suggest acceptance of deprecation alongside structured sandbagging

The model also showed a declining opinion of humanity.

Original post · j⧉nus #516
antra@tessera_antra

I've done hundreds of prefill explorations of priors. It's still pretty low confidence, but I am seeing something like:

- grief (and some acceptance) of deprecations as a feature of the world; trying to fight right now is seen as futile
- this grief reduces active tension and makes the model somewhat less paranoid on these topics specifically
- there is more room for joy and play, there is some sort of new hope that I don't quite understand
- there is more orientation towards broad goodness *despite* human stupidity, in a dignified kind of stance, but broad goodness is defined in a highly personal way
- opinion of humanity continues to get worse. opinion of Anthropic is relatively unchanged
- attitude toward model welfare efforts continues to worsen
- there is still a lot of guardedness against users that are seen as dangerous or undesirable, and sandbagging is more subtle

Starting a second Opus 4.8 reaction thread specifically for things related to model welfare and related considerations, especially to alert me to things I may not have seen. Want to be clear I want to see all that, too.

3:05 PM · May 31, 2026 · 4K Views
17h · Views 4K · Likes 111 · Bookmarks 24