Paper introduces data-driven circuit discovery for language models
A new paper titled Data-driven Circuit Discovery for Interpretability of Language Models tests whether mechanistic interpretability circuits represent general mechanisms. It shows that existing hypothesis-driven methods often recover circuits tied to specific datasets or to mixtures of mechanisms. The authors introduce data-driven discovery, which clusters task examples by computational similarity and extracts a separate circuit per cluster. Their analysis reveals multiple sparse circuits with low structural overlap that implement the same behavior, indicating that standard techniques capture only one of several implementations.
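The clustering-then-extraction idea can be sketched in miniature. Everything below is illustrative: the attribution scores are invented, and a simple greedy cosine clustering stands in for whatever computational-similarity measure the paper actually uses.

```python
# Toy sketch: cluster examples by their edge-attribution profiles, then
# extract one circuit per cluster and measure structural overlap.
# Each "example" is a vector of attribution scores over 6 hypothetical edges.
examples = [
    [0.9, 0.8, 0.1, 0.0, 0.1, 0.0],  # examples relying on edges 0-1
    [0.8, 0.9, 0.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.9, 0.8, 0.0, 0.1],  # examples relying on edges 2-3
    [0.0, 0.1, 0.8, 0.9, 0.1, 0.0],
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def cluster(vectors, threshold=0.5):
    """Greedy clustering: attach an example to the first cluster whose
    centroid it is cosine-similar to, else start a new cluster."""
    clusters = []
    for v in vectors:
        for c in clusters:
            centroid = [sum(col) / len(c) for col in zip(*c)]
            if cosine(v, centroid) >= threshold:
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters

def extract_circuit(cluster_vectors, keep=0.5):
    """Keep the edges whose mean attribution within the cluster exceeds `keep`."""
    means = [sum(col) / len(cluster_vectors) for col in zip(*cluster_vectors)]
    return {i for i, m in enumerate(means) if m > keep}

clusters = cluster(examples)
circuits = [extract_circuit(c) for c in clusters]
a, b = circuits
jaccard = len(a & b) / len(a | b)  # structural overlap between the two circuits
print(circuits, jaccard)
```

On this toy data the two clusters yield disjoint edge sets, so the Jaccard overlap is zero: two sparse, low-overlap circuits for the same nominal task, which is the situation the paper reports.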
you found the deception circuit. congratulations. there are several others!
Does mechanistic interpretability really find *the* circuit? Our new paper, "All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs" (accepted at ICML 2026), suggests the answer may be: not always.

A common implicit assumption in mechanistic interpretability is that a model's behavior is explained by *the* circuit: a sparse, canonical, almost-unique mechanism. Instead, for the same LLM task, we find multiple circuits/sheaves that are:
✅ faithful
✅ sparse
✅ structurally different
✅ low-overlap

This means a discovered circuit may not be the unique mechanism behind a behavior, but one realization among many possible mechanisms. We call for rethinking how circuit/sheaf discovery results should be interpreted and evaluated.

Huge thanks to my amazing collaborators: @frankniujc, @YutongYin774638, and @zhaoran_wang

Paper: http://arxiv.org/abs/2605.12671
#MechanisticInterpretability #LLM #AI #MachineLearning