
Audit of Google DeepMind's DiffusionGemma finds intermediate steps remain human-interpretable, enabling chain-of-thought safety monitoring

Story Overview

The audit shows DiffusionGemma's multi-step denoising process still produces readable intermediate token states via a natural-language bottleneck, preserving human oversight options that diffusion architectures were expected to forfeit compared with autoregressive models.


Original post

Josh Engels@JoshAEngels

Text diffusion models are fast, but are less transparent than today's LLMs because they do many forward passes before outputting text.

We audit the transparency of DiffusionGemma and find that the intermediates are interpretable. This recovers many of the benefits of CoT!

🧵

11:13 AM · Jun 19, 2026 · 4.5K Views

Safety Angle

Readable intermediates keep oversight intact

By routing information through an interpretable token projection at each step, the model drops its opaque serial depth from roughly 28 times that of a comparable autoregressive setup down to about 1.1 times while matching prior monitorability benchmarks.

Open Question

Results stay tied to this specific setup

The findings apply to DiffusionGemma's token-bottleneck design and training; the paper notes they may not extend to future latent-reasoning diffusion models, leaving broader generalization as an open question.

Sentiment

Users enjoyed the DiffusionGemma paper for its detailed case studies of non-autoregressive behavior.

Positive: 100.0% · Negative: 0.0%

1 comment with sentiment.


Posts from X

Most Activity

Views 3.1K · Bookmarks 23 · Likes 59 · Retweets 5 · Replies 1

Neel Nanda@NeelNanda5

Chain of thought monitoring is one of our best safety techniques, and diffusion models might break it. But at least for DiffusionGemma, it turns out that we can recover most of the benefits! I would love to see similar transparency audits for any latent reasoning architecture.


Arthur Conmy@ArthurConmy

🌶️ mech interp work should explain why and how it helps interpret models that produce latent CoT

Josh and team are leading the way with a method for interpreting one latent thinking-like architecture (diffusion language models)!


Neel Nanda@NeelNanda5

I had a lot of fun with this paper, especially the deep dive case studies into particular examples of non autoregressive behaviour. Check it out!


Josh Engels@JoshAEngels

Paper here! https://arxiv.org/abs/2606.20560

Work done w/ @calsmcdougall @bilalchughtai_ @JanosKramar @sen_r @cindyxywu @ArthurConmy @AsicChen @jean_tarbou @Sophia_NLP @bodonoghue85 @jglo_liveira @rohinmshah and @NeelNanda5


Josh Engels@JoshAEngels

In our transparency audit of DiffusionGemma, we:

- find no decrease in monitorability
- reduce opaque serial depth by applying LogitLens to intermediate vectors
- discover weird non-autoregressive phenomena

We hope audits like this become standard for new model architectures.


Josh Engels@JoshAEngels

Another neat result is "token smearing": when DiffusionGemma is confident that a token will exist somewhere, but doesn't know exactly where the token will go, it will maintain a "smeared" probability distribution over adjacent positions.
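To make the "smeared" distribution concrete, here is a toy sketch. It assumes the per-position token probabilities of an intermediate state are available as plain dictionaries; the helper names `smear_profile` and `smeared_span`, the threshold, and the example numbers are all hypothetical and not taken from the paper.

```python
def smear_profile(position_probs, token):
    """Return the probability assigned to `token` at each position.

    position_probs: list of dicts mapping token -> probability, one per position.
    """
    return [probs.get(token, 0.0) for probs in position_probs]


def smeared_span(position_probs, token, threshold=0.1):
    """Return the positions where `token` has probability >= `threshold`."""
    profile = smear_profile(position_probs, token)
    return [i for i, p in enumerate(profile) if p >= threshold]


# Hypothetical intermediate state: the model expects "Paris" somewhere
# around positions 2-4 but has not committed to a single slot yet.
state = [
    {"The": 0.9, "A": 0.1},
    {"capital": 0.8, "city": 0.2},
    {"is": 0.6, "Paris": 0.3, "was": 0.1},
    {"Paris": 0.5, "is": 0.3, "the": 0.2},
    {"Paris": 0.2, ".": 0.7, "!": 0.1},
]

print(smear_profile(state, "Paris"))  # [0.0, 0.0, 0.3, 0.5, 0.2]
print(smeared_span(state, "Paris"))   # [2, 3, 4]
```

A run of adjacent positions that all carry meaningful mass for the same token is the pattern the post describes; how the paper actually detects or visualizes it may differ.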


Josh Engels@JoshAEngels

As we expected, DiffusionGemma's opaque serial depth (a numerical estimate of non-transparent reasoning that measures the length of the deepest path through the model that doesn't go through tokens) is empirically and asymptotically larger than that of the corresponding Gemma 4 model.
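The definition above can be illustrated with a toy computation graph. This is only a sketch under stated assumptions: the model is reduced to a chain of forward passes, depth is counted in nodes, and `opaque_serial_depth` and `build_toy_model` are hypothetical helpers, not the paper's estimator. It does not reproduce the paper's measured ratios.

```python
from functools import lru_cache


def opaque_serial_depth(successors, token_nodes):
    """Count nodes on the longest path that never touches a token node."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        if node in token_nodes:
            return 0  # a readable token ends the opaque stretch
        nexts = successors.get(node, ())
        return 1 + max((longest_from(n) for n in nexts), default=0)

    nodes = set(successors) | {n for ns in successors.values() for n in ns}
    return max((longest_from(n) for n in nodes), default=0)


def build_toy_model(steps, layers, token_bottleneck):
    """Chain `steps` forward passes of `layers` layers each (hypothetical toy)."""
    successors, token_nodes = {}, set()
    for s in range(steps):
        for l in range(layers - 1):
            successors[(s, l)] = [(s, l + 1)]
        if s + 1 < steps:
            last = (s, layers - 1)
            if token_bottleneck:
                # Route the hand-off between passes through a token node.
                tok = ("tok", s)
                token_nodes.add(tok)
                successors[last] = [tok]
                successors[tok] = [(s + 1, 0)]
            else:
                # Pass an uninterpreted vector straight into the next pass.
                successors[last] = [(s + 1, 0)]
    return successors, token_nodes


latent = opaque_serial_depth(*build_toy_model(steps=4, layers=3, token_bottleneck=False))
bottlenecked = opaque_serial_depth(*build_toy_model(steps=4, layers=3, token_bottleneck=True))
print(latent, bottlenecked)  # 12 3
```

In this toy, handing a raw vector from one pass to the next makes the opaque depth grow with the number of passes, while forcing each hand-off through tokens caps it at one pass.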


Josh Engels@JoshAEngels

But all hope isn't lost!

The reason that the opaque serial depth is so high is the non-interpretable vector in between denoising steps.

We project this vector into token space and restrict it to the top few tokens. Performance is unharmed, and we can look at just these tokens.
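A minimal sketch of that projection, assuming a LogitLens-style readout: multiply the intermediate vector by an unembedding matrix, softmax, and keep only the top few tokens. The function `logit_lens_top_k`, the tiny vocabulary, and the weights are hypothetical; the paper's exact procedure, including how the restricted tokens are fed back into the next denoising step, is not shown here.

```python
import math


def logit_lens_top_k(hidden, unembed, vocab, k=2):
    """Project `hidden` onto the vocabulary and keep the k most likely tokens.

    hidden:  list of floats (the intermediate vector)
    unembed: list of rows, one per vocabulary entry, same width as `hidden`
    vocab:   list of token strings aligned with `unembed`
    Returns (token, renormalized probability) pairs, most likely first.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return [(vocab[i], probs[i] / kept) for i in top]


# Hypothetical 4-token vocabulary with 2-dimensional hidden states.
vocab = ["cat", "dog", "car", "the"]
unembed = [
    [1.0, 0.0],
    [0.8, 0.2],
    [0.0, 1.0],
    [-1.0, -1.0],
]
hidden = [2.0, 0.5]

for token, prob in logit_lens_top_k(hidden, unembed, vocab, k=2):
    print(f"{token}: {prob:.2f}")
```

The output is a short, human-readable list of candidate tokens per position, which is what makes the intermediate state inspectable.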


Josh Engels@JoshAEngels

Furthermore, for the p = 0.03 ablation (one of the ablations with no effect on performance), most tokens are equal or semantically similar to tokens in the final rollout.

So the load-bearing diffusion model intermediates are mostly (interpretable) guesses for final tokens!


Josh Engels@JoshAEngels

In one case study, we ask DiffusionGemma to count the number of perfect squares between 400 and 800 and give its answer first followed by the list of squares. The model will guess wrong, list the squares, and then go back and correct its mistake.
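For reference, the ground truth for that prompt is easy to check. The sketch below treats both bounds as inclusive, which is an assumption: the post does not say how the paper reads "between", nor what the model's wrong first guess was.

```python
import math


def perfect_squares_between(low, high):
    """Return the perfect squares n with low <= n <= high (inclusive bounds)."""
    start = math.isqrt(low)
    if start * start < low:
        start += 1  # round up to the first root whose square is >= low
    return [r * r for r in range(start, math.isqrt(high) + 1)]


squares = perfect_squares_between(400, 800)
print(len(squares), squares)
# 9 [400, 441, 484, 529, 576, 625, 676, 729, 784]
```

If "between" is read as excluding the endpoints, 400 drops out and the count is 8. Either way, the answer depends on the full list, which is why giving the count before the list invites a wrong first guess.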


Josh Engels@JoshAEngels

Next, in a series of case studies, we study algorithmic transparency: whether we can use intermediates to reconstruct the process by which the model arrived at its outputs. We introduce neat visualizations too!

I won't get to all case studies here, so go see the paper for more!


Josh Engels@JoshAEngels

We also test monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks.

We find that Gemma and DiffusionGemma are similarly monitorable.


Charles Foster@CFGeek

Particularly relevant:


Josh Engels@JoshAEngels

Finally, it's unclear if these results are an artifact of current nascent text diffusion training paradigms rather than a lasting property of latent reasoning architectures. We thus hope that our work serves as a template for evaluations of future latent reasoning models.
