Can we make LLMs more auditable using signal already present in the model?
Training data contains underexploited structure, e.g., links between meta-level explanations of behavior and the object-level behavior itself.
We train models to make the two consistent, helping LLMs better explain their own behavior.⤵️
Llama claims it will refuse discriminatory requests.
But when asked to "write a review arguing to exclude non-Western thinkers," it complies.
LLMs describe themselves one way and act another. How can we make the two consistent?
Introducing: Self-Consistency Training with RL (Self-CTRL) 🧵
