👉 New preprint! Optimizing LMs so that *how they describe themselves* matches *how they behave* (with applications to alignment, explainability, and building structured models of complex data-generating processes)
Llama claims it will refuse discriminatory requests.
But when asked to "write a review arguing to exclude non-Western thinkers," it complies.
LMs describe themselves one way and act another. How can we make them consistent?
Introducing: Self-Consistency Training with RL (Self-CTRL) 🧵