i think it's bad for anthropic to nerf ml silently. I don't know if interpretability counts as frontier ai model research or not. everything i'm doing is differentially for safety, idk if i'm being nerfed, and don't have great benchmarks to tell
Lack of clear benchmarks makes the restrictions hard to verify.
Users in the replies object to Anthropic silently nerfing ML safety research because it raises suspicions that agents for interpretability could be intentionally made less effective.
ftr i agree this feels like a pretty bad move. doing it silently doesn't actually seem to add very much safety and adds a bunch of uncertainty and blast radius. i'm totally onboard with slowing RSI, but just... use a normal refusal??
using agents for interp is confusing enough without knowing if your agent is being probed to make it silently dumb on purpose
i think the best would be for anthropic to work with more orgs to have unnerfed models. i'm a pretty big anthropic stan but the world where you have to join anthropic specifically to do your best safety work i think is not the ideal world
oh if they said this i missed it, if it's only a short term thing i think that's fine, they're going through a lot and i think it's reasonable to cut them a lot of slack right now

unclear if the ml nerfs are permanent, would be nice for anthropic to say something on this explicitly if it is the case
@nickcammarata I’m at like >50% chance you will get unnerfed access
@bayeslord bc they specifically judged my account as good or bc the work i'm doing will be judged as not frontier training

@tenobrus refusal allows programmatic detection of triggering safeguards, which allows evasion

@nickcammarata Both are included as well as eg affiliative factors

@nickcammarata I feel like you specifically might be well connected + good enough to get around this by talking to some people
Idk/I have no special info and don’t really know what ur working on more specifically

@tenobrus For what it's worth I think it's totally fine as long as they are honest about it not f****** poisoning the well
@nickcammarata
all safeguards will be improved

@bayeslord you're right, anthropic is a well known member of the jhana community of which i am also affiliated

@dev__cycle hm... true but seems like the kind of thing you could detect and classify over multiple threads via the new non-ZDR policy? i suppose if the improvements are fast enough maybe not, but i'd imagine banning an org after ~a day of circumvention wouldn't be too much of a loss

@nickcammarata Why do you think they are doing it silently? It's interesting since they're doing it loudly in so many categories, including distillation