MIT CSAIL's Stephen Casper argues GoodfireAI's representation research overclaims compared to prior nonlinear methods

Hm. So I guess I'd say that (1) It's ok to do work not motivated by safety, and maybe some of the authors of the paper had this motivation. But it's not ok to safetywash. I think that GF has crossed a line into doing it, but you probably don't. Safety has been central to a few early blog posts from GF and from mechinterp work in general. But idk maybe I should just treat GF more like a normal tech company. (2) I think we have a crux about whether or not the paper did a good job of making good claims based on the right evidence and prior work. I don't think it did. I think that, as currently written, by making the mistakes I talked about in my first comment, this paper misinterprets results and misinforms readers in some ways.

Naomi Saphra@nsaphra

@StephenLCasper @GoodfireAI I don't think that safety is or should be the primary reason for understanding models and using them to study data structure. But I guess if you think that safety is the sole worthwhile motivation for scientific understanding, then I can see why you might make this comparison.

Naomi Saphra@nsaphra

@StephenLCasper @GoodfireAI Yes, disagree about the validity of results, but more than that: why evaluate this as a safety paper when they don't claim it to be? Even if they consider safety a key application of interp, that doesn't mean they only write papers on Finding the Evil Neuron.

Cas (Stephen Casper)@StephenLCasper

My rationale for comparing to Zou et al. is that both papers focused on doing two general things via representations -- (1) monitoring/diagnostics, and (2) interventions. I think that Zou et al. was more practically useful, though (focusing on safety) and more methodologically rich (using nonlinear methods).

I think I would change my tune if Goodfire just put this out and framed it as a blog post about demonstrating what you can do with UMAP and linear probes in this kind of domain. But that's definitely not what it did. I think GF has a pattern here. It has a really big marketing-to-substance ratio in its research, which it claims is of engineering and safety value. I don't mind people doing demos like this. I do mind the way that I believe Goodfire has a pattern of overclaiming, underdelivering, and safetywashing.

RE: mechanistic validation, my points behind 1 and 3 above were against the idea that the claims actually were mechanistically validated.

Naomi Saphra@nsaphra

@StephenLCasper @GoodfireAI Interp is about description, not clustering and steering. I value new ways of describing the structures composed of features, and the simpler (more established) the clustering/steering used for those component features the better. Less "novelty" in atomic feature testing is good.

Naomi Saphra@nsaphra

@StephenLCasper @GoodfireAI But while novelty in describing atomic features is bad, novelty in describing global relational structures is necessary, because we currently don't have good tools for understanding those. The novelty here is in their description of the larger manifold shapes.

Cas (Stephen Casper)@StephenLCasper

@nsaphra @GoodfireAI I didn't really do this. I brought up safety as a justification for being 'hard' on GF. And I did so alongside a related justification that GF is selling products and using its research as marketing.

But yeah, TBC, agreed. The paper doesn't claim to be about safety.
