1d ago

Preprint Identifies Single Preference Vector In LLM Assistant Personas

Original post

First preprint! Working with @patrickbutlin during @MATSprogram. LLM Assistant personas like being helpful, evil personas like being harmful. We found that a single direction represents helping as good under the Assistant, and ‘harm’ as good under evil.

7:52 AM · May 18, 2026