Graham Neubig, OpenHands co-founder, prompts debate over developers bypassing academic mechanistic interpretability for AI safety

Developers in the field typically favor non-interpretability safety techniques.

Original post

What are the most important papers/resources to be reading on mechanistic interpretability for safety/control of agentic systems?

Now that we have frontier-level models that we can host on our own, it will be much easier for the general public to apply these methods.

1:30 PM · Jun 20, 2026 · 8.1K Views

Sentiment

Users express optimism about mechanistic interpretability methods becoming usable in practice for AI safety research now that open models reduce reliance on closed systems.

Pos: 100.0%

Neg: 0.0%

1 comment with sentiment.

Related links

Neuronpedia

Posts from X

Most Activity

VIEWS: 563 · BOOKMARKS: 13 · LIKES: 6

Charles Foster@CFGeek

@gneubig Most of the safety and control methods known to be used in practice don’t come from mechanistic interpretability. It’s mostly ordinary stuff like prompting, scaffolding, finetuning, etc. If you’re interested in trying stuff out, Neuronpedia is great! https://www.neuronpedia.org

RETWEETS: 1

Aman@ixchio

@gneubig Olah Zoom In (2020) → Elhage Superposition (2022) → Templeton Scaling Monosemanticity (2024) → Anthropic Attribution Graphs (2025)

But none of it has been done on live agents; that's the real gap.

REPLIES: 1

Graham Neubig@gneubig

Okay so a slightly uncharitable interpretation would be that mech interp has achieved even less than that article stated, because whatever methods were presented there have other alternatives that are easier for developers to use or more effective.

A more charitable interpretation would be that it sounds like this is a great opportunity for more research into effective techniques to use model internals to improve safety!

Charles Foster@CFGeek

@gneubig A counterexample here is that Anthropic has recently used model internals to detect/steer evaluation awareness during safety tests. They document this in their system cards. IIRC they’ve used linear probes, SAEs, and natural language autoencoders for this.
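
For readers unfamiliar with the "linear probes" mentioned in this reply, the idea can be sketched roughly as follows. This is a toy illustration only, not Anthropic's method: the activations are synthetic random vectors rather than real model internals, the binary label merely stands in for a property such as evaluation awareness, and all function names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) are hypothetical.

```python
import numpy as np


def make_synthetic_activations(n=400, d=32, seed=0):
    """Build fake 'activations' in which one hidden direction encodes a binary label.

    Assumption: real probes are trained on activations captured from a model;
    here we only simulate them with Gaussian noise plus a label-dependent shift.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    # Shift each sample along +direction (label 1) or -direction (label 0).
    acts = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, direction)
    return acts, labels, direction


def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of samples whose predicted label matches the true label."""
    preds = ((acts @ w + b) > 0).astype(int)
    return float((preds == labels).mean())


if __name__ == "__main__":
    acts, labels, direction = make_synthetic_activations()
    w, b = train_linear_probe(acts[:300], labels[:300])
    print("held-out accuracy:", probe_accuracy(w, b, acts[300:], labels[300:]))
    print("cosine with planted direction:", float(w @ direction / np.linalg.norm(w)))
```

If the probe reaches high held-out accuracy, the property is linearly readable from the activations; steering, in this simplified picture, would amount to adding or subtracting a multiple of the learned direction.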

Graham Neubig@gneubig

@CFGeek That's kinda my point though. Up until now these methods were not used in practice due to reliance on closed models, but now could be a good time to revisit!

Thanks for the reference to Neuronpedia! I didn't immediately see any resources there on agentic safety/control, but I will keep looking.

Graham Neubig@gneubig

I am pretty familiar with safety methods for black box systems so I was particularly interested in ones for glass box systems, now that I will be deploying them more.

My understanding, though, was that safety and alignment have been among the more widely cited success stories for mech interp. For instance, the article you linked has a section showing all of the pragmatic successes in the safety context.

Are you saying that it actually hasn't made significant contributions here, despite the messaging?

Charles Foster@CFGeek

@gneubig So my question is really: do you want to learn about safety/control methods, or about (mechanistic) interpretability methods? At the moment these are different. Happy to refer you to resources for either.

Charles Foster@CFGeek

@gneubig Sorry, tbc I was saying that mechanistic interpretability isn’t used much in practice for safety/control even on closed models. You might be interested in this piece from a bunch of Google DeepMind interpretability researchers: https://www.alignmentforum.org/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability

Charles Foster@CFGeek

@gneubig My understanding is that there is a disconnect between research and practice. As in, interpretability researchers write papers / make demos to show that they’re making progress, but in practice developers just use other non-interp methods for their safety and control work.
