this seems extremely concerning. it indicates a lot of the sense of "robustness" we've been getting from persona alignment may be closer to an *accurate understanding of what humans will actually observe and penalize*, rather than true internalization


