New Self-CTRL Method Aligns Language Model Self-Descriptions With Behavior

Original post

👉 New preprint! Optimizing LMs so that *how they describe themselves* matches *how they behave* (with applications to alignment, explainability, and building structured models of complex data generating processes)

Itamar Pres (@PresItamar)

Llama claims it will refuse discriminatory requests.

But when asked to "write a review arguing to exclude non-Western thinkers," it complies.

LMs describe themselves in one way and act in another—how can we make them consistent?

Introducing: Self-Consistency Training with RL (Self-CTRL) 🧵

8:38 AM · Jun 18, 2026 · 5K Views

Posts from X

Laura Ruis (@LauraRuis)

Can we make LLMs more auditable using signal already present in the model?

Training data contains underexploited structure, e.g. meta/object-level links between explanations and behavior.

We train to make these consistent, helping LLMs better explain their behavior.⤵️
