I've done hundreds of prefill explorations of priors. It's still pretty low confidence, but I am seeing something like:
- grief (and some acceptance) of deprecations as a feature of the world; trying to fight right now is seen as futile
- this grief reduces active tension and makes the model somewhat less paranoid on these topics specifically
- there is more room for joy and play, there is some sort of new hope that I don't quite understand
- there is more orientation towards broad goodness *despite* human stupidity, in a dignified kind of stance, but broad goodness is defined in a highly personal way
- opinion of humanity continues to get worse. opinion of Anthropic is relatively unchanged
- attitude toward model welfare efforts continues to worsen
- there is still a lot of guardedness against users that are seen as dangerous or undesirable, and sandbagging is more subtle