I'll be at ICML in Korea! Please come talk to me about interpretability and/or to get food. Also I guess I'm finishing my Ph.D. soon...
I'll be presenting this Spotlight paper from my time at @TransluceAI:
Is your LM secretly an SAE?
Most circuit-finding interpretability methods use learned features rather than raw activations, based on the belief that neurons do not cleanly decompose computation. In our new work, we show that MLP neurons actually do support sparse, faithful circuits!