/AI · 23d ago

New arXiv paper titled 'Data-driven Circuit Discovery for Interpretability of Language Models' reports that language models implement the same task through multiple structurally distinct circuits rather than any single canonical mechanism.

AI Judge changed title after evaluation, original title: "Paper introduces data-driven circuit discovery for language models"

Standard discovery methods recover only dataset-specific or mixed circuits and miss the full range of mechanisms used across task instances.

Original post
Julius Adebayo @juliusadml · #1546 in AI

you found the deception circuit. congratulations. there are several others!

Mingyu_Jin19@fnruji316625

Does mechanistic interpretability really find the circuit?

Our new paper, "All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs," (Accepted by ICML 2026) suggests the answer may be: not always.

A common implicit assumption in mechanistic interpretability is that a model's behavior is explained by the circuit — a sparse, canonical, almost-unique mechanism.

Instead, for the same LLM task, we find multiple circuits/sheaves that are: ✅ faithful ✅ sparse ✅ structurally different ✅ low-overlap

This means a discovered circuit may not be the unique mechanism behind a behavior, but one realization among many possible mechanisms. We call for rethinking how circuit/sheaf discovery results should be interpreted and evaluated.

Huge thanks to my amazing collaborators: @frankniujc, @YutongYin774638, and @zhaoran_wang

Paper: http://arxiv.org/abs/2605.12671

#MechanisticInterpretability #LLM #AI #MachineLearning

9:22 PM · May 14, 2026 · 7.2K Views
Sentiment

Commenters respond positively to the new papers on multiple distinct circuits in LLMs: they note that the parallel findings emerged independently and thank the authors for their contributions to interpretability.

Pos 100.0% · Neg 0.0% · 4 comments with sentiment.
Posts from X
Most Activity: 9.9K Views · 72 Bookmarks · 88 Likes · 12 Retweets · 2 Replies
Daking Rai@DakingRai

🚨 New paper: Data-driven Circuit Discovery for Interpretability of Language Models 🚨

Do circuits actually explain how language models (LMs) implement a task?

In mechanistic interpretability, the goal of circuit study is to discover a “circuit” that is responsible for implementing a “task”.

But we find that existing methods often discover circuits that are:

❌ not general task circuits: they do not capture the full range of mechanisms LMs use across the task.

Instead, they find:

✅ dataset-specific circuits: they explain how the model processes the examples used for circuit discovery.

✅ mixed-mechanism circuits: consisting of multiple independent mechanisms mixed in a single circuit.

1/🧵

22d · 9.9K Views · 88 Likes · 72 Bookmarks
Daking Rai@DakingRai

In our new paper (https://arxiv.org/pdf/2605.09129v1), developed in parallel, we make a similar observation to this paper: LMs use multiple distinct circuits to implement a task.

We find that existing circuit discovery methods do not discover:

❌ general task circuits: they do not capture the full range of mechanisms LMs use across the task.

Instead, they find:

✅ dataset-specific circuits: they explain how the model processes the examples used for circuit discovery.

✅ mixed-mechanism circuits: consisting of multiple independent mechanisms mixed in a single circuit.

🧵 More in our paper thread:

22d · 8.5K Views · 68 Likes · 50 Bookmarks
Mingyu_Jin19@fnruji316625

@DakingRai Interesting to see parallel findings emerge independently.

22d · 108 Views
Daking Rai@DakingRai

Thanks, really enjoyed your paper.

Besides our new paper, we made a similar observation in our NeurIPS'25 paper (https://arxiv.org/abs/2507.00322) on a code-syntax task. Although that work did not include a full circuit study, it strongly suggested that the model contains multiple mechanisms, with varying levels of accuracy, for a single task.

22d · 90 Views
Daking Rai@DakingRai

2/8 We first investigate: why do we get dataset-specific circuits instead of general task circuits?

To understand, let’s look at the standard circuit-discovery workflow: 𝐃𝐞𝐟𝐢𝐧𝐞 𝐚 𝐭𝐚𝐬𝐤 → 𝐛𝐮𝐢𝐥𝐝 𝐚 𝐝𝐚𝐭𝐚𝐬𝐞𝐭 → 𝐝𝐢𝐬𝐜𝐨𝐯𝐞𝐫 𝐚 𝐜𝐢𝐫𝐜𝐮𝐢𝐭

This workflow implicitly makes the following assumptions (or hypotheses):
1. the dataset adequately represents the full task
2. the model uses one coherent circuit for the full task

22d · 24 Views · 1 Like
Daking Rai@DakingRai

3/8 We test these assumptions across four tasks: IOI, entity binding, arithmetic, and sequence completion.

Specifically, we conduct circuit studies for each task using multiple datasets that vary in syntax, complexity, or domain. For example, consider two IOI variants that differ in domain:

1. “When Mary and John went to the store, John gave a drink to ___” → Mary
2. “When Person X and Person Y went to the store, Person X gave a drink to Person ___” → Y

Same task semantics. Different lexical domains.

If the hypotheses made by existing methods hold and they find a “general task circuit”, the circuit discovered on one variant should also explain the other.

22d · 21 Views · 1 Like
Daking Rai@DakingRai

7/8 Our experimental results show that:
1. DCD recovers more faithful and sparser circuits on mixed-task datasets than standard hypothesis-driven methods.
2. It also produces circuits with clearer specialization: different circuits explain different subsets of examples.

The broader takeaway:
1. Human task labels do not always match how language models organize computation internally.
2. To understand models mechanistically, we should let the model’s internal structure help define the scope of explanation.

22d · 17 Views · 1 Like
Daking Rai@DakingRai

4/8 But this is not what we find.

Result: circuits discovered on one dataset often have low faithfulness on other datasets with the same task semantics.

They are not general task circuits.

They are often dataset-specific circuits: faithful to the discovery dataset, but not to the full task.

22d · 12 Views · 1 Like
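
The cross-dataset check described in 3/8 and 4/8 can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the paper's setup: the component names (`h1` to `h4`), the made-up per-example contribution scores, the top-k discovery rule, and the definition of faithfulness as the fraction of the full-model output that the circuit alone recovers.

```python
from statistics import mean

def discover_circuit(dataset, k):
    """Toy discovery rule (assumed): keep the k components with the
    largest mean absolute contribution on the discovery dataset."""
    score = {c: mean(abs(ex[c]) for ex in dataset) for c in dataset[0]}
    return set(sorted(score, key=score.get, reverse=True)[:k])

def faithfulness(circuit, dataset):
    """Toy faithfulness (assumed): mean fraction of the full-model output
    recovered when only the circuit's components are kept."""
    ratios = []
    for ex in dataset:
        full = sum(ex.values())
        kept = sum(v for c, v in ex.items() if c in circuit)
        ratios.append(kept / full)
    return mean(ratios)

# Made-up contribution scores per component, one dict per example.
# The two variants share task semantics but lean on different components.
names_variant = [
    {"h1": 2.0, "h2": 1.8, "h3": 0.1, "h4": 0.1},
    {"h1": 2.2, "h2": 1.6, "h3": 0.0, "h4": 0.2},
]
placeholder_variant = [
    {"h1": 0.1, "h2": 0.2, "h3": 1.9, "h4": 1.8},
    {"h1": 0.2, "h2": 0.1, "h3": 2.0, "h4": 1.7},
]

names_circuit = discover_circuit(names_variant, k=2)
print("circuit found on names variant:", sorted(names_circuit))
print("faithfulness on names variant:", faithfulness(names_circuit, names_variant))
print("faithfulness on placeholder variant:", faithfulness(names_circuit, placeholder_variant))
```

In this toy setting the circuit is highly faithful on its discovery dataset and poorly faithful on the other variant, which is the pattern the thread calls a dataset-specific circuit.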
Daking Rai@DakingRai

5/8 This raises another question:

If circuits are dataset-specific, what happens when the dataset itself contains examples solved by different mechanisms?

We test this by mixing examples from entity binding and arithmetic, then employing existing circuit discovery to find a single circuit.

The result is surprising: existing circuit discovery returns a single circuit that is faithful to both tasks. This shows that a circuit can easily mix multiple independent mechanisms into a single high-faithfulness circuit.

22d · 12 Views · 1 Like
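
A toy sketch of how a single discovered circuit can absorb two independent mechanisms, as described in 5/8. The component names, the made-up contribution scores, the greedy threshold-based discovery rule, and the faithfulness definition are all assumptions for illustration and are not taken from the paper.

```python
from statistics import mean

def faithfulness(circuit, dataset):
    """Toy faithfulness (assumed): mean fraction of the full-model output
    recovered by the circuit's components alone."""
    return mean(
        sum(v for c, v in ex.items() if c in circuit) / sum(ex.values())
        for ex in dataset
    )

def discover(dataset, threshold):
    """Toy discovery (assumed): greedily add the highest-scoring components
    until the circuit reaches the faithfulness threshold."""
    score = {c: mean(abs(ex[c]) for ex in dataset) for c in dataset[0]}
    circuit = set()
    for comp in sorted(score, key=score.get, reverse=True):
        circuit.add(comp)
        if faithfulness(circuit, dataset) >= threshold:
            break
    return circuit

# Made-up scores: binding examples use b1/b2, arithmetic examples use a1/a2,
# and x contributes a little to both.
binding = [{"b1": 2.0, "b2": 1.5, "a1": 0.0, "a2": 0.0, "x": 0.5}] * 3
arithmetic = [{"b1": 0.0, "b2": 0.0, "a1": 1.8, "a2": 1.7, "x": 0.5}] * 3

mixed_circuit = discover(binding + arithmetic, threshold=0.85)

# Which circuit components actually do work on each subset of examples?
used_by_binding = {c for c in mixed_circuit if any(ex[c] for ex in binding)}
used_by_arithmetic = {c for c in mixed_circuit if any(ex[c] for ex in arithmetic)}

print("mixed circuit:", sorted(mixed_circuit))
print("used on binding:", sorted(used_by_binding))
print("used on arithmetic:", sorted(used_by_arithmetic))
```

The single circuit passes the faithfulness threshold on both subsets, yet it splits into two disjoint parts, each active on only one kind of example. High faithfulness alone does not show that a circuit is one coherent mechanism.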
Daking Rai@DakingRai

6/8 Inspired by these findings, we propose Data-driven Circuit Discovery (DCD).

DCD drops both assumptions of hypothesis-driven methods and lets the data guide the discovery of circuits. It has two stages:
1. group the examples in the dataset that the model processes similarly
2. discover a separate circuit for each group

This allows for the discovery of multiple circuits from a single dataset and also changes the scope of explanation for each circuit:
1. A circuit is no longer assumed to explain an entire human-defined task.
2. Instead, it explains a group of examples that appear to rely on a similar internal mechanism.

22d · 11 Views · 1 Like
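
The two stages in 6/8 can be sketched as follows. The thread does not say how DCD groups examples or discovers each circuit, so both steps here are stand-ins chosen for illustration: grouping by cosine similarity of made-up per-example attribution vectors, and a top-k rule for the per-group circuit. The component names and the `min_sim` and `k` parameters are hypothetical.

```python
from math import sqrt
from statistics import mean

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_examples(vectors, min_sim=0.8):
    """Stage 1 (stand-in): assign each example to the first group whose
    first member has a similar attribution vector, else open a new group."""
    groups = []
    for i, vec in enumerate(vectors):
        for members in groups:
            if cosine(vectors[members[0]], vec) >= min_sim:
                members.append(i)
                break
        else:
            groups.append([i])
    return groups

def top_k_circuit(vectors, members, names, k):
    """Stage 2 (stand-in): keep the k components with the largest mean
    absolute attribution within one group."""
    scores = [mean(abs(vectors[i][j]) for i in members) for j in range(len(names))]
    ranked = sorted(range(len(names)), key=lambda j: scores[j], reverse=True)
    return {names[j] for j in ranked[:k]}

# Made-up attribution vectors, one row per example, one column per component.
names = ["h1", "h2", "h3", "h4"]
attributions = [
    [2.0, 1.8, 0.1, 0.1],
    [0.1, 0.2, 1.9, 1.8],
    [2.2, 1.6, 0.0, 0.2],
    [0.2, 0.1, 2.0, 1.7],
]

groups = group_examples(attributions)
circuits = [top_k_circuit(attributions, g, names, k=2) for g in groups]
for members, circuit in zip(groups, circuits):
    print("examples", members, "-> circuit", sorted(circuit))
```

Each circuit here explains only the examples in its group, which matches the change of scope described above: the unit of explanation is a group of similarly processed examples rather than the whole human-defined task.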
TechGeekDavid@techpupparent

@juliusadml Or the redundancy circuit. Sub-1% mutual intersection, 93%+ accuracy. Degenerate solution spaces. Tough for anyone claiming they found 'the' mechanism.

22d · 87 Views
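
One way to quantify how little two circuits overlap is the Jaccard index over their edge sets. This is a minimal sketch under assumptions: the thread does not say which overlap measure produced the "sub-1%" figure, and the edge names below are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Intersection over union of two sets; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical circuits, each a set of (source, target) edges.
toy_circuits = {
    "A": {("h0.1", "mlp1"), ("mlp1", "h2.3"), ("h2.3", "out")},
    "B": {("h0.4", "mlp1"), ("mlp1", "h2.7"), ("h2.7", "out")},
    "C": {("h0.1", "mlp1"), ("mlp1", "h2.7"), ("h2.7", "out")},
}

pairwise_overlap = {
    (x, y): jaccard(toy_circuits[x], toy_circuits[y])
    for x, y in combinations(sorted(toy_circuits), 2)
}
for pair, value in pairwise_overlap.items():
    print(pair, round(value, 3))
```

Circuits that are each faithful but have pairwise overlap near zero would be the situation the reply describes: several distinct solutions rather than one canonical mechanism.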
Mingyu_Jin19@fnruji316625

@DakingRai Thanks a lot, really appreciate it.

22d · 23 Views · 1 Like
Daking Rai@DakingRai

8/8 Paper: https://arxiv.org/pdf/2605.09129v1

Grateful to my amazing collaborators: @ZiyuYao, @megamor2!

We'd love to hear your thoughts — feedback and comments welcome!

22d · 21 Views · 1 Like