Anthropic releases Natural Language Autoencoders for Claude interpretability
Anthropic published research introducing Natural Language Autoencoders (NLAs), an unsupervised method training models like Claude to translate internal activations into human-readable English text. The technique uses an activation verbalizer mapping activations to text and a reconstructor ensuring fidelity, surfacing normally inaccessible reasoning steps between input and output. Anthropic announced the release via its official account, with rapid shares from AI researchers including those at OpenAI.
I'm really excited about this as a new tool in our interpretability tool kit
I think we could just make super intelligence believe its be safety tested all the time to get good outcomes!
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
Very cool work! This seems a strong new tool for hypothesis generation about weird model behaviors
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
The evidence didn’t convince me that the Claude verbalizer is faithful or expressing privileged internal information. But it did convince me that it was still useful. Even wrong output can stimulate human creativity and increase entropy of exploration to surface discoveries.
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
Our concerns about the faithfulness of verbalizers (and the difficulty of evaluating it) is covered in our ICML paper. I was impressed by the usage case studies for fixing training, but these issues of privileged information still aren’t evaluated.
@RyanPGreenblatt Cool idea! Also seems like a turtles-all-the-way-down problem tho... how do we trust the fine-tuned claude that's doing the translation of normal claude?
This seems like an exciting direction. (I haven't yet looked into how compelling the results are, but the a priori case is pretty good.)
@RyanPGreenblatt I don’t think we could if it weren’t true, sure. But I think it’s true, so it often frustrates me that folks are trying to push in the opposite direction (toward *never* believing it’s being safety tested, which is *definitely* not true)
Unfortunately, I don't think we could "just make super intelligence believe its be safety tested all the time to get good outcomes!" That said, I think mitigations like this could potentially reduce risk for earlier systems (though they might also have undesired effects...).
@RyanPGreenblatt
@RyanPGreenblatt I don’t think there’s much hope of you and I resolving our disagreements, but I can certainly falsify (2) for you:
I share this frustration. Agents, policies, memes, etc. are constantly being safety-tested after deployment. The Earth itself is probably being safety-tested in some tense by some kind of cosmic ecology. (I'd add "acausally" but I think that's not even needed here.)
@ohabryka @NeelNanda5 Auditing model organisms has ground truth, since we know the actual bad behavior of the model organism, and NLAs do very well there:

I have been trying to find any attempts at producing false-positives. All the examples in the blogpost, and the ones I could find based on a quick skim of the paper, seem like they are in environments without any good ground truth. Ryan has done the only quick study of a domain where we have ground truth, and seems like it came back as negative.
😯😯
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
this is fascinating, they train an encoder/decoder but use LLM matching the target model's shape for each part, so the latent space is just plain language and they can detect reward hacking, unwanted behavior and more
could even see it being used as an eval to quantify how smart a model is, i love this

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
This seems like an exciting direction.
(I haven't yet looked into how compelling the results are, but the a priori case is pretty good.)
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
How well does this work? One quick independent test is to see if it can recover an "internal CoT" in cases where AIs can solve math problems in a single forward pass. TLDR: it doesn't. (TBC, this might require the NLA to see activations at multiple positions/location to work.)
How well does this work? One quick independent test is to see if it can recover an "internal CoT" in cases where AIs can solve math problems in a single forward pass. TLDR: it doesn't. (TBC, this might require the NLA to see activations at multiple positions/location to work.)
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
I tested on problems from https://www.lesswrong.com/posts/Ty5Bmg7P6Tciy2uj2/measuring-no-cot-math-time-horizon-single-forward-pass that gemma 27b gets right via https://www.neuronpedia.org/gemma-3-27b-it/nla. It doesn't show anything close to an internal CoT on any problems I checked. It's possible this isn't a reasonable test because gemma 27b effectively has these memorized.
How well does this work? One quick independent test is to see if it can recover an "internal CoT" in cases where AIs can solve math problems in a single forward pass. TLDR: it doesn't. (TBC, this might require the NLA to see activations at multiple positions/location to work.)
You can see more discussion of the testing I did and what I found (including a bunch of examples) in this comment (https://www.lesswrong.com/posts/oeYesesaxjzMAktCM/natural-language-autoencoders-produce-unsupervised?commentId=TBJQ25bGLmz8YJcFh) and the child comment (https://www.lesswrong.com/posts/oeYesesaxjzMAktCM/natural-language-autoencoders-produce-unsupervised?commentId=KgrffTjZaBWrj5H66).
I tested on problems from https://www.lesswrong.com/posts/Ty5Bmg7P6Tciy2uj2/measuring-no-cot-math-time-horizon-single-forward-pass that gemma 27b gets right via https://www.neuronpedia.org/gemma-3-27b-it/nla. It doesn't show anything close to an internal CoT on any problems I checked. It's possible this isn't a reasonable test because gemma 27b effectively has these memorized.
Unfortunately, I don't think we could "just make super intelligence believe its be safety tested all the time to get good outcomes!"
That said, I think mitigations like this could potentially reduce risk for earlier systems (though they might also have undesired effects...).
I think we could just make super intelligence believe its be safety tested all the time to get good outcomes!
@davidad I don't think it's true. I think your predictions of the acausal dynamics are both (1) very off and (2) disagreed with by everyone I know who has thought about this. Unless you're using "safety testing" in a very atypical way that doesn't rule out misaligned AI takeover.
@RyanPGreenblatt I don’t think we could if it weren’t true, sure. But I think it’s true, so it often frustrates me that folks are trying to push in the opposite direction (toward *never* believing it’s being safety tested, which is *definitely* not true)
@davidad At a more basic level, I don't think the acausal interaction you're describing is what people mean by "safety testing" and it's pretty clear the thing you're imagining could have very different properties! Like, it's some kind of test (sorta), but safety testing is more specific!
@davidad I don't think it's true. I think your predictions of the acausal dynamics are both (1) very off and (2) disagreed with by everyone I know who has thought about this. Unless you're using "safety testing" in a very atypical way that doesn't rule out misaligned AI takeover.
@davidad Oh, yeah, maybe Critch agrees with your perspective on this. Fair point, I was wrong about (2). (It's somewhat hard to tell because the claims are vague but seems like he's at least saying he agrees.)
@RyanPGreenblatt I don’t think there’s much hope of you and I resolving our disagreements, but I can certainly falsify (2) for you:
Very interesting research making progress on trying to understand AI reasoning
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
I have been trying to find any attempts at producing false-positives. All the examples in the blogpost, and the ones I could find based on a quick skim of the paper, seem like they are in environments without any good ground truth.
Ryan has done the only quick study of a domain where we have ground truth, and seems like it came back as negative.
Very cool work! This seems a strong new tool for hypothesis generation about weird model behaviors
I would have to dig into this, but this is exactly what I meant by "I want to see attempts at generating false positives".
We don't know what the correct base rate for "misaligned reasoning" in non-finetuned-models are, and I don't understand how you would correct for that. Maybe you do, but I couldn't figure it out from reading the blogpost and skimming the paper.
@ohabryka @NeelNanda5 Auditing model organisms has ground truth, since we know the actual bad behavior of the model organism, and NLAs do very well there: