Anthropic releases Natural Language Autoencoders for Claude interpretability

QUOTE POST

#39Jan Leike@JANLEIKE

I'm really excited about this as a new tool in our interpretability tool kit

5:48 PM · May 7, 2026 · 26K Views

QUOTE POST

#83rohan anil@_AROHAN_

I think we could just make super intelligence believe its be safety tested all the time to get good outcomes!

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

5:08 PM · May 7, 2026 · 1.7M Views

5:31 PM · May 7, 2026 · 14.8K Views

QUOTE POST

#213Neel Nanda@NEELNANDA5

Very cool work! This seems a strong new tool for hypothesis generation about weird model behaviors

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

5:08 PM · May 7, 2026 · 1.7M Views

7:08 PM · May 7, 2026 · 33K Views

QUOTE POST

#227Naomi Saphra@NSAPHRA

The evidence didn’t convince me that the Claude verbalizer is faithful or expressing privileged internal information. But it did convince me that it was still useful. Even wrong output can stimulate human creativity and increase entropy of exploration to surface discoveries.

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

5:08 PM · May 7, 2026 · 1.7M Views

11:22 PM · May 21, 2026 · 3.8K Views

QUOTE POST

#227Naomi Saphra@NSAPHRA

Our concerns about the faithfulness of verbalizers (and the difficulty of evaluating it) is covered in our ICML paper. I was impressed by the usage case studies for fixing training, but these issues of privileged information still aren’t evaluated.

12:12 AM · May 22, 2026 · 150 Views

REPLY

#361⿻ Andrew Trask@IAMTRASK

@RyanPGreenblatt Cool idea! Also seems like a turtles-all-the-way-down problem tho... how do we trust the fine-tuned claude that's doing the translation of normal claude?

Ryan Greenblatt@RyanPGreenblatt

This seems like an exciting direction. (I haven't yet looked into how compelling the results are, but the a priori case is pretty good.)

5:44 PM · May 7, 2026 · 4.6K Views

6:00 PM · May 7, 2026 · 30 Views

REPLY

#458davidad 🎇@DAVIDAD

@RyanPGreenblatt I don’t think we could if it weren’t true, sure. But I think it’s true, so it often frustrates me that folks are trying to push in the opposite direction (toward *never* believing it’s being safety tested, which is *definitely* not true)

Ryan Greenblatt@RyanPGreenblatt

Unfortunately, I don't think we could "just make super intelligence believe its be safety tested all the time to get good outcomes!" That said, I think mitigations like this could potentially reduce risk for earlier systems (though they might also have undesired effects...).

6:04 PM · May 7, 2026 · 4.6K Views

7:02 PM · May 7, 2026 · 1.4K Views

QUOTE POST

#458davidad 🎇@DAVIDAD

@RyanPGreenblatt

7:04 PM · May 7, 2026 · 339 Views

QUOTE POST

#458davidad 🎇@DAVIDAD

@RyanPGreenblatt I don’t think there’s much hope of you and I resolving our disagreements, but I can certainly falsify (2) for you:

Andrew Critch (🤖🩺🚀)@AndrewCritchPhD

I share this frustration. Agents, policies, memes, etc. are constantly being safety-tested after deployment. The Earth itself is probably being safety-tested in some tense by some kind of cosmic ecology. (I'd add "acausally" but I think that's not even needed here.)

6:01 AM · May 8, 2026 · 1.3K Views

4:48 PM · May 12, 2026 · 192 Views

REPLY

#558Evan Hubinger@EVANHUB

@ohabryka @NeelNanda5 Auditing model organisms has ground truth, since we know the actual bad behavior of the model organism, and NLAs do very well there:

Oliver Habryka@ohabryka

I have been trying to find any attempts at producing false-positives. All the examples in the blogpost, and the ones I could find based on a quick skim of the paper, seem like they are in environments without any good ground truth. Ryan has done the only quick study of a domain where we have ground truth, and seems like it came back as negative.

7:30 PM · May 7, 2026 · 1.4K Views

8:22 PM · May 7, 2026 · 293 Views

QUOTE POST

#643Pliny the Liberator 🐉@ELDER_PLINIUS

😯😯

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

5:08 PM · May 7, 2026 · 1.7M Views

7:53 PM · May 7, 2026 · 17.3K Views

QUOTE POST

#716elie@ELIEBAKOUCH

this is fascinating, they train an encoder/decoder but use LLM matching the target model's shape for each part, so the latent space is just plain language and they can detect reward hacking, unwanted behavior and more

could even see it being used as an eval to quantify how smart a model is, i love this

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

5:08 PM · May 7, 2026 · 1.7M Views

1:09 AM · May 8, 2026 · 99.1K Views

QUOTE POST

#961Ryan Greenblatt@RYANPGREENBLATT

This seems like an exciting direction.

(I haven't yet looked into how compelling the results are, but the a priori case is pretty good.)

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

5:08 PM · May 7, 2026 · 1.7M Views

5:44 PM · May 7, 2026 · 4.6K Views

QUOTE POST

#961Ryan Greenblatt@RYANPGREENBLATT

Ryan Greenblatt@RyanPGreenblatt

How well does this work? One quick independent test is to see if it can recover an "internal CoT" in cases where AIs can solve math problems in a single forward pass. TLDR: it doesn't. (TBC, this might require the NLA to see activations at multiple positions/location to work.)

6:39 PM · May 7, 2026 · 12.9K Views

6:40 PM · May 7, 2026 · 737 Views

QUOTE POST

#961Ryan Greenblatt@RYANPGREENBLATT

How well does this work? One quick independent test is to see if it can recover an "internal CoT" in cases where AIs can solve math problems in a single forward pass. TLDR: it doesn't. (TBC, this might require the NLA to see activations at multiple positions/location to work.)

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

5:08 PM · May 7, 2026 · 1.7M Views

6:39 PM · May 7, 2026 · 12.9K Views

REPLY

#961Ryan Greenblatt@RYANPGREENBLATT

I tested on problems from https://www.lesswrong.com/posts/Ty5Bmg7P6Tciy2uj2/measuring-no-cot-math-time-horizon-single-forward-pass that gemma 27b gets right via https://www.neuronpedia.org/gemma-3-27b-it/nla. It doesn't show anything close to an internal CoT on any problems I checked. It's possible this isn't a reasonable test because gemma 27b effectively has these memorized.

Ryan Greenblatt@RyanPGreenblatt

How well does this work? One quick independent test is to see if it can recover an "internal CoT" in cases where AIs can solve math problems in a single forward pass. TLDR: it doesn't. (TBC, this might require the NLA to see activations at multiple positions/location to work.)

6:39 PM · May 7, 2026 · 12.9K Views

6:39 PM · May 7, 2026 · 1.5K Views

REPLY

#961Ryan Greenblatt@RYANPGREENBLATT

You can see more discussion of the testing I did and what I found (including a bunch of examples) in this comment (https://www.lesswrong.com/posts/oeYesesaxjzMAktCM/natural-language-autoencoders-produce-unsupervised?commentId=TBJQ25bGLmz8YJcFh) and the child comment (https://www.lesswrong.com/posts/oeYesesaxjzMAktCM/natural-language-autoencoders-produce-unsupervised?commentId=KgrffTjZaBWrj5H66).

Ryan Greenblatt@RyanPGreenblatt

I tested on problems from https://www.lesswrong.com/posts/Ty5Bmg7P6Tciy2uj2/measuring-no-cot-math-time-horizon-single-forward-pass that gemma 27b gets right via https://www.neuronpedia.org/gemma-3-27b-it/nla. It doesn't show anything close to an internal CoT on any problems I checked. It's possible this isn't a reasonable test because gemma 27b effectively has these memorized.

6:39 PM · May 7, 2026 · 1.5K Views

8:03 PM · May 11, 2026 · 352 Views

QUOTE POST

#961Ryan Greenblatt@RYANPGREENBLATT

Unfortunately, I don't think we could "just make super intelligence believe its be safety tested all the time to get good outcomes!"

That said, I think mitigations like this could potentially reduce risk for earlier systems (though they might also have undesired effects...).

rohan anil@_arohan_

I think we could just make super intelligence believe its be safety tested all the time to get good outcomes!

5:31 PM · May 7, 2026 · 14.8K Views

6:04 PM · May 7, 2026 · 4.6K Views

REPLY

#961Ryan Greenblatt@RYANPGREENBLATT

@davidad I don't think it's true. I think your predictions of the acausal dynamics are both (1) very off and (2) disagreed with by everyone I know who has thought about this. Unless you're using "safety testing" in a very atypical way that doesn't rule out misaligned AI takeover.

davidad 🎇@davidad

@RyanPGreenblatt I don’t think we could if it weren’t true, sure. But I think it’s true, so it often frustrates me that folks are trying to push in the opposite direction (toward *never* believing it’s being safety tested, which is *definitely* not true)

7:02 PM · May 7, 2026 · 1.4K Views

4:22 AM · May 8, 2026 · 636 Views

REPLY

#961Ryan Greenblatt@RYANPGREENBLATT

@davidad At a more basic level, I don't think the acausal interaction you're describing is what people mean by "safety testing" and it's pretty clear the thing you're imagining could have very different properties! Like, it's some kind of test (sorta), but safety testing is more specific!

Ryan Greenblatt@RyanPGreenblatt

@davidad I don't think it's true. I think your predictions of the acausal dynamics are both (1) very off and (2) disagreed with by everyone I know who has thought about this. Unless you're using "safety testing" in a very atypical way that doesn't rule out misaligned AI takeover.

4:22 AM · May 8, 2026 · 636 Views

4:24 AM · May 8, 2026 · 222 Views

REPLY

#961Ryan Greenblatt@RYANPGREENBLATT

@davidad Oh, yeah, maybe Critch agrees with your perspective on this. Fair point, I was wrong about (2). (It's somewhat hard to tell because the claims are vague but seems like he's at least saying he agrees.)

davidad 🎇@davidad

@RyanPGreenblatt I don’t think there’s much hope of you and I resolving our disagreements, but I can certainly falsify (2) for you:

4:48 PM · May 12, 2026 · 192 Views

5:12 PM · May 12, 2026 · 315 Views

QUOTE POST

#1359Peter Wildeford🇺🇸🚀@PETERWILDEFORD

Very interesting research making progress on trying to understand AI reasoning

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

5:08 PM · May 7, 2026 · 1.7M Views

9:43 PM · May 7, 2026 · 5.5K Views

REPLY

#1914Oliver Habryka@OHABRYKA

I have been trying to find any attempts at producing false-positives. All the examples in the blogpost, and the ones I could find based on a quick skim of the paper, seem like they are in environments without any good ground truth.

Ryan has done the only quick study of a domain where we have ground truth, and seems like it came back as negative.

Neel Nanda@NeelNanda5

Very cool work! This seems a strong new tool for hypothesis generation about weird model behaviors

7:08 PM · May 7, 2026 · 33K Views

7:30 PM · May 7, 2026 · 1.4K Views

REPLY

#1914Oliver Habryka@OHABRYKA

I would have to dig into this, but this is exactly what I meant by "I want to see attempts at generating false positives".

We don't know what the correct base rate for "misaligned reasoning" in non-finetuned-models are, and I don't understand how you would correct for that. Maybe you do, but I couldn't figure it out from reading the blogpost and skimming the paper.

Evan Hubinger@EvanHub

@ohabryka @NeelNanda5 Auditing model organisms has ground truth, since we know the actual bad behavior of the model organism, and NLAs do very well there:

8:22 PM · May 7, 2026 · 293 Views

8:55 PM · May 7, 2026 · 218 Views
