a) mech interp (a.k.a. "how the #&@! do these models do what they do") is an incredibly interesting and important topic to study, regardless of "safety implications"
b) speaking as a former Area Chair for interpretability tracks: these are the worst tracks to review. all the work is meh.
wondering why academic mech interp is growing so much faster than every other safety subfield (despite being relatively uncommon on industry AI safety teams).
i'm guessing it's partly due to the low barrier to entry. hope this doesn't lead to too much publication slop farming