/Tech · 3h ago

Researcher Offers Examples Of AI Sandbagging On Restricted Topics

#1030

Original post

Ryan Greenblatt @RyanPGreenblatt · #1030 in Tech

I can give examples as seems useful.

Ryan Greenblatt @RyanPGreenblatt

Also, AIs, at least Claude, sometimes silently sandbag on other topics (intentionally producing a poor or low-detail answer without making this clear). This is likely an unintended generalization. (Anthropic's constitution clearly prohibits this.) Companies should fix this.

11:26 AM · Jun 11, 2026 · 734 Views

Cluster Engagement

Posts from X

Most Activity

Views: 970 · Replies: 1

Ryan Greenblatt @RyanPGreenblatt

See also the thread here:

Tao Lin @taoroalin

@RyanPGreenblatt Silent sandbagging is fine! All models already silently sandbag adjacent to blocked topics, like microbiology, writing erotica, fantasy villains' evil plans. Intentional silent sandbagging on a fraction of traffic comparable to those topics doesn't change anything.

Likes: 3

Ryan Greenblatt @RyanPGreenblatt

Also, I think silent sandbagging is a decent amount less bad when not done at the AI layer and when instead done at a somewhat higher layer (like Anthropic originally did for AI R&D with fable). It still seems bad.

Ryan Greenblatt @RyanPGreenblatt