UC Berkeley's Nika Haghtalab and Pieter Abbeel introduce DiPOD to stabilize post-training for diffusion language models


Paper: https://arxiv.org/abs/2606.13795
Code: https://github.com/Astro-Eric/DiPOD-release
Blog: https://astro-eric.github.io/blogs/dipod/

This is an amazing collaboration with @HavenFeng, @pabbeel, @JiantaoJ, @akanazawa, @nhaghtal

🧵5/5

19h


Haozhe Jiang@erichzjiang

Why aren’t Diffusion Language Models smart yet? The lack of stable post-training is a major bottleneck!

Meet DiPOD: the tripod for diffusion model post-training.

DiPOD boosts accuracy across reasoning tasks, with Sudoku jumping from 22% to 97%, through a one-line code change.

🧵1/5

19h


Samian@ApplyWiseAi

@erichzjiang 22% to 97% on sudoku is a cute demo but reasoning benchmarks are where diffusion models actually need to prove out. anyone tested this on math/code yet

13h

Nika Haghtalab@nhaghtal

1/ Meet DiPOD, the tripod for stabilizing Diffusion Language Model training!

This was a fun collaboration to bridge theory and practice with an awesome group of coauthors: @erichzjiang, @HavenFeng, @pabbeel, @JiantaoJ, and @akanazawa.

2h

Haozhe Jiang@erichzjiang

DLM post-training is hard because the log-likelihood is intractable, so people replace it with proxies. We identify the double drift issue: the proxy drifts from the log-likelihood, and the gradient subsequently drifts from the policy gradient.

🧵2/5

19h
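To build intuition for why a proxy can move away from the true log-likelihood, here is a toy sketch. It assumes the proxy is an ELBO-style lower bound (an assumption on my part; the thread does not say which proxy is meant) and uses a made-up two-state latent model. All names and numbers are hypothetical, and the paper's formal definition of double drift may differ.

```python
import math

def log_marginal(prior, lik):
    """Exact log p(x) for a discrete latent z: log sum_z p(z) p(x|z)."""
    return math.log(sum(p * l for p, l in zip(prior, lik)))

def elbo(q, prior, lik):
    """Lower bound on log p(x): sum_z q(z) [log p(x, z) - log q(z)]."""
    return sum(qz * (math.log(pz * lz) - math.log(qz))
               for qz, pz, lz in zip(q, prior, lik) if qz > 0)

def posterior(prior, lik):
    """Exact posterior p(z|x); using it as q makes the bound tight."""
    joint = [p * l for p, l in zip(prior, lik)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical toy numbers: two latent states, one observed x.
prior = [0.5, 0.5]
lik = [0.9, 0.2]          # p(x | z) before a model update
q_fixed = [0.5, 0.5]      # a variational distribution that is never refreshed

gap_tight = log_marginal(prior, lik) - elbo(posterior(prior, lik), prior, lik)
gap_loose = log_marginal(prior, lik) - elbo(q_fixed, prior, lik)

# After the model changes, the same stale q gives a different (here larger) gap.
new_lik = [0.99, 0.05]
gap_after = log_marginal(prior, new_lik) - elbo(q_fixed, prior, new_lik)

print(gap_tight, gap_loose, gap_after)
```

The gap between the bound and the exact value is zero only when q matches the posterior, and it changes as the model is updated, so a gradient taken through the proxy is not guaranteed to match the gradient of the true log-likelihood.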

Haozhe Jiang@erichzjiang

DiPOD takes a variational inference perspective and provides a theoretical framework for analyzing policy gradient algorithms for generative models. It could produce useful algorithms in other domains, such as robotics, as well.

🧵4/5

19h

Haozhe Jiang@erichzjiang

DiPOD tackles double drift by interleaving the gradient steps with self-distillation steps. In implementation, this amounts to adding a regularization term to the original objective, and it consistently improves results on GSM8K, MATH500, Countdown, and Sudoku.

🧵3/5

19h
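As a rough picture of what "adding a regularization term to the original objective" can look like in code, here is a generic sketch on a toy categorical policy. This is not the DiPOD loss: the KL-to-snapshot penalty, the coefficient `beta`, and every function name are hypothetical stand-ins, and the actual regularizer is defined in the paper and repository.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities for a categorical policy."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def pg_surrogate(logits, actions, advantages):
    """REINFORCE-style surrogate: minus the advantage-weighted mean log-prob."""
    logp = log_softmax(logits)
    return -sum(a * logp[i] for i, a in zip(actions, advantages)) / len(actions)

def distill_penalty(logits, snapshot_logits):
    """KL(snapshot || current); zero when the policy matches its snapshot.

    Hypothetical stand-in for a self-distillation regularizer.
    """
    logp = log_softmax(logits)
    logq = log_softmax(snapshot_logits)
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(logq, logp))

def regularized_loss(logits, snapshot_logits, actions, advantages, beta=0.1):
    base = pg_surrogate(logits, actions, advantages)
    # The single added term (beta is an assumed hyperparameter):
    return base + beta * distill_penalty(logits, snapshot_logits)

# Hypothetical usage with made-up samples and advantages.
snapshot = [0.0, 0.5, -0.5]
current = [0.2, 0.4, -0.6]
actions = [0, 1, 1, 2]
advantages = [1.0, -0.5, 0.3, -0.8]
print(regularized_loss(current, snapshot, actions, advantages))
```

The point of the sketch is only the shape of the change: the base objective is left as is, and one extra weighted term is added to it.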

Haozhe Jiang@erichzjiang

@Valery_12138 manim, amazing library from 3b1b

4h

Nika Haghtalab@nhaghtal

2/ @erichzjiang's detailed blog gives a very accessible overview of the drift phenomena and our approach to fixing them: http://astro-eric.github.io/blogs/dipod/

Paper here: https://arxiv.org/abs/2606.13795.

2h

Haozhe Jiang@erichzjiang

@ApplyWiseAi We include math reasoning benchmarks such as GSM8K and MATH500 here, and DiPOD stabilizes training there as well. There are papers trying DLMs on coding, and we believe there will be exciting follow-ups on DiPOD for coding.

12h

Valery@Valery_12138

@erichzjiang Cool work! Really nice visualization, may I ask how you generated the animation?

8h

Siddharth Ancha@siddancha

Really cool work @erichzjiang! 👏👏👏 I have a question about your visualization.

IIUC, in a "pure" self-distillation step you would sample terminal actions from the current policy and maximize the ELBO under these actions. (For FPO, this is the supervised conditional flow-matching loss.) Perfect self-distillation would produce a new policy that preserves the marginal distribution of terminal actions (and hence the action log-likelihoods). But your visualization seems to suggest that all intermediate marginal distributions for t ∈ [0, 1] are also preserved in the self-distillation step? Is that correct? If so, I don't see why that would necessarily be true.

11h

Samian@ApplyWiseAi

@erichzjiang interesting that gsm8k stabilizes. does DiPOD help with the longer chain-of-thought cases or mainly shorter reasoning?

10h

Haozhe Jiang@erichzjiang

@berkeley_ai

18h

Haozhe Jiang@erichzjiang

@ApplyWiseAi For real applications we expect the architecture to be more complicated than vanilla DLM. DiPOD adopts a variational inference approach and would be adaptable to those instead of just hacking the DLM architecture.

12h

Vik@vkalahas

@erichzjiang this is some cool research! diffusion models for language, more than just images

17h

Haozhe Jiang@erichzjiang

@siddancha @HavenFeng @pabbeel @JiantaoJ @akanazawa @nhaghtal The intermediate distributions are not preserved. What I am showing in the background is just the ground truth intermediate distribution. Thank you for pointing this out and I will make this clearer in the blog.

4h

Haozhe Jiang@erichzjiang

@ApplyWiseAi Yes, we have open-sourced it on GitHub. Besides, if you want to try it on top of a new algorithm, the code change is really small.

4h

Vishnu Teja Kunde@sampleparticle

@erichzjiang Congratulations on this exciting work on RL for diffusion LLMs! Our recent paper (https://arxiv.org/pdf/2603.12554) also explores RL post-training for DLMs. Since we study similar benchmarks, it would be interesting to compare approaches and results. Looking forward to further progress!

15h

Samian@ApplyWiseAi

@erichzjiang makes sense, variational inference feels more robust than just patching the DLM. is DiPOD open source? curious to dig into the implementation

10h