/Tech · 3h ago

Nick Cammarata warns that Anthropic's silent model restrictions could block safety-focused interpretability research

Lack of clear benchmarks makes the restrictions hard to verify.

#250

Original post

Nick@nickcammarata · #250 in Tech

i think it's bad for anthropic to nerf ml silently. I don't know if interpretability counts as frontier ai model research or not. everything i'm doing is differentially for safety, idk if i'm being nerfed, and don't have great benchmarks to tell

8:16 PM · Jun 9, 2026 · 11.9K Views

Sentiment

Users in the replies object to Anthropic silently degrading model performance on ML work. Their concern is that researchers using agents for interpretability cannot tell whether an agent has been intentionally made less effective.

Pos: 0.0%

Neg: 100.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 5.6K · LIKES 90 · REPLIES 9

Tenobrus@tenobrus

ftr i agree this feels like a pretty bad move. doing it silently doesn't actually seem to add very much safety and adds a bunch of uncertainty and blast radius. i'm totally onboard with slowing RSI, but just... use a normal refusal??

Nick@nickcammarata

2h5.6K903

BOOKMARKS 3 · RETWEETS 1

Nick@nickcammarata

using agents for interp is confusing enough without knowing if your agent is being probed to make it silently dumb on purpose

Nick@nickcammarata

i think the best would be for anthropic to work with more orgs to have unnerfed models. i'm a pretty big anthropic stan but the world where you have to join anthropic specifically to do your best safety work i think is not the ideal world

3h2K523

Nick@nickcammarata

oh if they said this i missed it, if it's only a short term thing i think that's fine, they're going through a lot and i think it's reasonable to cut them a lot of slack right now

2h1.5K163

Nick@nickcammarata

3h63426

Nick@nickcammarata

unclear if the ml nerfs are permanent, would be nice for anthropic to say something on this explicitly if it is the case

2h1.7K91

bayes@bayeslord

@nickcammarata I’m at like >50% chance you will get unnerfed access

2h10430

Nick@nickcammarata

@bayeslord bc they specifically judged my account as good or bc the work i'm doing will be judged as not frontier training

2h10230

devcycle@dev__cycle

@tenobrus refusal allows programmatic detection of triggering safeguards, which allows evasion

2h333

bayes@bayeslord

@nickcammarata Both are included as well as eg affiliative factors

2h542

Aaron Bergman 🔍@AaronBergman18

@nickcammarata I feel like you specifically might be well connected + good enough to get around this by talking to some people

Idk/I have no special info and don’t really know what ur working on more specifically

2h845

kache@yacineMTB

@tenobrus For what it's worth I think it's totally fine as long as they are honest about it not f****** poisoning the well

2h1724

Liv@livgorton

@nickcammarata all safeguards will be improved

1h900

Nick@nickcammarata

@bayeslord you're right, anthropic is a well known member of the jhana community of which i am also affiliated

2h411

Tenobrus@tenobrus

@dev__cycle hm... true but seems like the kind of thing you could detect and classify over multiple threads via the new non-ZDR policy? i suppose if the improvements are fast enough maybe not, but i'd imagine banning an org after ~a day of circumvention wouldn't be too much of a loss

2h17

huli@honorablepicnic

@nickcammarata Why do you think they are doing it silently? It's interesting since they're doing it loudly in so many categories, including distillation

2h12