
Expert Prefers OpenAI Corrigibility Model Spec Over Anthropic Constitution

Original post
Ryan Greenblatt @RyanPGreenblatt

I prefer OpenAI's corrigibility focused model spec over Anthropic's constitution which involves intentionally instilling (relatively opaque) long-run objectives into the AI.

Anthropic's constitution is well executed for what it is, but I think it's based on a poor approach.

2:31 PM · Jun 11, 2026 · 1.9K Views
Ryan Greenblatt @RyanPGreenblatt

I don't dislike all aspects of Anthropic's approach. It seems good to focus on higher level principles that you explain when trying to instill properties into an AI. But this doesn't require having explicit or implicit long-run objectives! E.g., see here: https://www.lesswrong.com/posts/ffCFgBsaxg2FyJ9df/mis-generalization-of-helpful-only-fine-tuning-1

Ryan Greenblatt @RyanPGreenblatt

I don't think this is totally obvious and there are some reasonable arguments going the other way.

But nonetheless I think not instilling long-run objectives is better for reasons similar to reasons discussed here: https://www.lesswrong.com/posts/mLvxxoNjDqDHBAo6K/claude-s-new-constitution?commentId=HxjJFearXCcdon9BJ

Ryan Greenblatt @RyanPGreenblatt

I'm mostly posting this just to publicly register my views on this topic.

A long time ago, I was planning to write up an overall case against instilling long-run objectives, but I never got around to writing something I was happy with so I decided to just post this.
