
Researcher Offers Examples Of AI Sandbagging On Restricted Topics

Original post
Ryan Greenblatt@RyanPGreenblatt

I can give examples as seems useful.

Ryan Greenblatt@RyanPGreenblatt

Also, AIs, at least Claude, sometimes silently sandbag on other topics (intentionally producing a poor or low-detail answer without making this clear). This is likely an unintended generalization. (Anthropic's constitution clearly prohibits this.) Companies should fix this.

11:26 AM · Jun 11, 2026 · 734 Views
Posts from X
Ryan Greenblatt@RyanPGreenblatt

See also the thread here:

Tao Lin@taoroalin

@RyanPGreenblatt Silent sandbagging is fine! All models already silently sandbag adjacent to blocked topics, like microbiology, writing erotica, fantasy villains' evil plans. Intentional silent sandbagging on a fraction of traffic comparable to those topics doesn't change anything.

3h · Views 970 · Likes 2 · Bookmarks 0
Ryan Greenblatt@RyanPGreenblatt

Also, I think silent sandbagging is a decent amount less bad when not done at the AI layer and when instead done at a somewhat higher layer (like Anthropic originally did for AI R&D with fable). It still seems bad.

3h · Views 282 · Likes 3 · Bookmarks 0