Preprint Identifies Single Preference Vector In LLM Assistant Personas

Original post

First preprint! Working with @patrickbutlin during @MATSprogram. LLM Assistant personas like being helpful; evil personas like being harmful. We found that a single direction represents helping as good under the Assistant persona and harming as good under the evil persona.

7:52 AM · May 18, 2026
