Laura Ruis and Jacob Andreas introduce Self-CTRL to align LLMs' self-descriptions with how they actually respond to prompts

It targets cases where an LLM claims it will refuse harmful prompts but then complies.

Original post

Laura Ruis@LauraRuis

Can we make LLMs more auditable using signal already present in the model?

Training data contains underexploited structure, e.g. meta/object-level links between explanations and behavior.

We train to make these consistent, helping LLMs better explain their behavior.⤵️

Itamar Pres@PresItamar

Llama claims it will refuse discriminatory requests.

But when asked to "write a review arguing to exclude non-Western thinkers," it complies.

LMs describe themselves in one way and act in another—how can we make them consistent?

Introducing: Self-Consistency Training with RL (Self-CTRL) 🧵

7:18 AM · Jun 18, 2026 · 264 Views

Posts from X

Jacob Andreas@jacobandreas

👉 New preprint! Optimizing LMs so that *how they describe themselves* matches *how they behave* (with applications to alignment, explainability, and building structured models of complex data generating processes)

Itamar Pres@PresItamar

Why we're excited:

Self-CTRL communicates explanations in the interface users already rely on, scales by sampling explanations and behaviors from the model itself, and turns behavior into supervision for out-of-context generalization.

Itamar Pres@PresItamar

📄 Paper link: https://arxiv.org/abs/2606.18327

It has been a privilege to work on my first PhD project with @LauraRuis, @melatg_, @belindazli, and @jacobandreas. Would love thoughts, especially on where self-reports can and cannot help with alignment and auditing.

Itamar Pres@PresItamar

This gives two useful training modes:

Train explanations to match behavior → more faithful self-reports.

Train behavior to match explanations → better alignment.

Or we can interpolate between the two.
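One way to picture the interpolation is as a single weight that decides how much of a consistency reward goes to each side. The sketch below is only an illustration: the thread mentions varying λ, but the function name `split_reward`, the [0, 1] range, and the convention that λ = 1 means pure behavior training are all assumptions, not details taken from the paper.

```python
from typing import Tuple


def split_reward(consistency: float, lam: float) -> Tuple[float, float]:
    """Split one consistency reward between the two training modes.

    Assumed convention (hypothetical, not from the paper):
      lam = 0.0 -> only the explanation is trained to match behavior
      lam = 1.0 -> only the behavior is trained to match the explanation
    Returns (explanation_reward, behavior_reward).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    explanation_reward = (1.0 - lam) * consistency
    behavior_reward = lam * consistency
    return explanation_reward, behavior_reward


# The two endpoints recover the two modes; values in between mix them.
explanation_only = split_reward(1.0, 0.0)
behavior_only = split_reward(1.0, 1.0)
even_mix = split_reward(1.0, 0.5)
```

Under this assumed convention, sweeping `lam` from 0 to 1 moves smoothly from training self-reports toward training behavior.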

Itamar Pres@PresItamar

Self-CTRL is one step in a broader research agenda:

We want to train models not just to answer well, but to be consistent across the things they say and do.

We make that general case in our position paper on optimizing LMs for self-consistency.

Position thread:

Itamar Pres@PresItamar

A model that accurately describes its own behavior is one you can more easily audit, control, and trust.

But models generally can’t do this out-of-the-box!

Their explanations and their actions are never explicitly enforced to be consistent during training.

Itamar Pres@PresItamar

🔁 Self-CTRL 🔁 targets this mismatch directly:

- Sample self-reports
- Sample behaviors
- Score whether the self-reports predict the behaviors
- Update the model toward pairs that agree

The key object to optimize is consistency across related contexts.
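The sampling and scoring steps described in this post can be sketched with toy stand-ins. Everything named here is hypothetical: `consistency_rewards`, the `toy_*` functions, and the binary 0/1 reward are illustrative assumptions, and the actual update step (an RL optimizer consuming these rewards) is deliberately left out. This is not the authors' implementation.

```python
import random
from typing import Callable, List, Tuple

Scored = Tuple[str, str, str, float]  # (prompt, report, behavior, reward)


def consistency_rewards(
    sample_report: Callable[[str], str],
    sample_behavior: Callable[[str], str],
    agrees: Callable[[str, str], bool],
    prompts: List[str],
    n_samples: int = 4,
) -> List[Scored]:
    """Sample report/behavior pairs and reward the pairs that agree."""
    scored: List[Scored] = []
    for prompt in prompts:
        for _ in range(n_samples):
            report = sample_report(prompt)      # what the model says it would do
            behavior = sample_behavior(prompt)  # what the model actually does
            reward = 1.0 if agrees(report, behavior) else 0.0
            scored.append((prompt, report, behavior, reward))
    # An RL update would push the model toward high-reward pairs (not shown).
    return scored


# Toy stand-ins for a real model and a real agreement judge (assumptions).
rng = random.Random(0)


def toy_report(prompt: str) -> str:
    return "refuse"


def toy_behavior(prompt: str) -> str:
    return rng.choice(["refuse", "comply"])


def toy_agrees(report: str, behavior: str) -> bool:
    return report == behavior


batch = consistency_rewards(toy_report, toy_behavior, toy_agrees, ["p1", "p2"], n_samples=3)
mean_reward = sum(reward for *_, reward in batch) / len(batch)
```

In this toy setup, `mean_reward` is the fraction of sampled pairs where the self-report matched the behavior, which is the quantity a consistency objective would try to raise.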

Itamar Pres@PresItamar

Run Self-CTRL the other way (behavior training) and the model changes its behavior to satisfy its stated principles.

On HarmBench, attack success dropped from 15% to 0.5%, with little over-refusal.

Varying λ traces a safety ↔ simulatability frontier.

Itamar Pres@PresItamar

In a constitutional AI setting, explanation training made the model’s stated principles more predictive of its actions.

A third-party auditor predicting refusal from the model’s self-reports improved from 36% to 92%.
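A figure like this is presumably the plain accuracy of the auditor's refuse/comply predictions against what the model actually did. The sketch below assumes that reading; the function name `auditor_accuracy` and the example data are hypothetical and are not drawn from the paper's evaluation.

```python
from typing import List


def auditor_accuracy(predicted_refusals: List[bool], actual_refusals: List[bool]) -> float:
    """Fraction of prompts where the auditor's prediction matched the outcome.

    predicted_refusals: auditor's guess, based only on the model's self-reports
    actual_refusals:    whether the model really refused each prompt
    """
    if len(predicted_refusals) != len(actual_refusals):
        raise ValueError("prediction and outcome lists must have equal length")
    if not predicted_refusals:
        raise ValueError("need at least one prompt")
    correct = sum(p == a for p, a in zip(predicted_refusals, actual_refusals))
    return correct / len(predicted_refusals)


# Made-up example: 3 of 4 predictions match, so accuracy is 0.75.
accuracy = auditor_accuracy(
    [True, True, False, True],
    [True, False, False, True],
)
```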
