
Google open-sources DiffusionGemma-26B, an experimental diffusion language model that generates up to 256 tokens in parallel

Story Overview

Google has released the weights for DiffusionGemma-26B, an experimental 26B-parameter MoE model built on Gemma 4 that swaps the usual left-to-right token prediction for discrete diffusion, letting it emit blocks of up to 256 tokens at once with bidirectional attention.
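
To make the block-parallel idea concrete, here is a minimal toy sketch. It assumes a masked-diffusion style sampler in which every position starts masked and the most confident predictions are committed on each pass. The `toy_denoiser` function, the `TARGET` string, and the confidence formula are hypothetical stand-ins for a real network; nothing here reflects DiffusionGemma's actual sampler or API, and unlike the real model this toy never revises a token once it is committed.

```python
import random

MASK = "_"
# Hypothetical "ideal" output the stand-in denoiser converges to.
TARGET = list("diffusion models refine whole blocks")


def toy_denoiser(block, rng):
    """Hypothetical stand-in for the network: one parallel pass that
    proposes a (token, confidence) pair for every position."""
    proposals = []
    for i, tok in enumerate(block):
        if tok != MASK:
            proposals.append((tok, 1.0))  # already committed
            continue
        # Confidence rises with the number of committed neighbours on
        # either side, loosely mimicking bidirectional context.
        neighbours = [block[j] for j in (i - 1, i + 1) if 0 <= j < len(block)]
        context = sum(n != MASK for n in neighbours)
        proposals.append((TARGET[i], 0.3 + 0.3 * context + 0.1 * rng.random()))
    return proposals


def denoise_block(length, steps, seed=0):
    """Fill a fully masked block in `steps` parallel passes and return
    the text after each pass."""
    rng = random.Random(seed)
    block = [MASK] * length
    per_step = -(-length // steps)  # ceiling division
    history = []
    for _ in range(steps):
        proposals = toy_denoiser(block, rng)
        masked = [i for i, t in enumerate(block) if t == MASK]
        # Commit the most confident masked positions on this pass.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = proposals[i][0]
        history.append("".join(block))
    return history


if __name__ == "__main__":
    for step, text in enumerate(denoise_block(len(TARGET), steps=6), start=1):
        print(f"pass {step}: {text}")
```

The point of the sketch is the loop shape: the whole 36-character block is finished in 6 passes rather than 36 sequential steps, because every pass scores all positions at once.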

Original post
Alok@analogalok

Autoregressive LLMs are officially on notice.

run Gemma 4 26B diffusion gguf with llama.cpp

Google just dropped DiffusionGemma-26B, and it completely flips how we generate text.

instead of predicting words one by one, it generates 256 tokens in parallel using bi-directional attention.

it's like Stable Diffusion, but for language. the model starts with random text "noise" and iteratively refines and self-corrects the entire block in real time to fix formatting and reasoning errors on the fly.

since it’s a Mixture of Experts (MoE) that only activates 3.8B parameters during inference, it fits perfectly on consumer hardware. You can run the Q4_K_M quant with an 18GB VRAM budget on a single RTX 3090 or RTX 4090 with exceptional throughput.

Tested on Ubuntu 22 with CUDA 13.1 using the cutting edge experimental llama.cpp branch.

Here is how to compile and run it with the live terminal denoising visualizer:

# 1. Clone & check out the experimental PR (#24423)

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/24423/head:diffusiongemma && git checkout diffusiongemma

# 2. Build with CUDA support

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build -j $(nproc) --config Release --target llama-diffusion-cli

# 3. Run with live visual denoising (llama.cpp flags)

./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual

Watch the video below to see the live --diffusion-visual canvas iteratively denoising the prompt output in real time.

The guide and Unsloth's Hugging Face GGUF model links are in the comments below!

Is autoregressive generation officially legacy tech? Let me know what you think.

Google Gemma@googlegemma

Meet DiffusionGemma!

An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.

Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

5:22 PM · Jun 10, 2026 · 19.7K Views
Developer Impact

Local hardware can handle the full run

Only 3.8B parameters stay active at inference, and the GGUF quants fit inside roughly 18 GB, so an RTX 4090-class card is enough to run it through llama.cpp or MLX.
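
A rough back-of-the-envelope check helps separate the two numbers in that claim. The sketch below assumes an average of about 4.85 bits per weight for a Q4_K_M quant, which is an approximation that varies by model and is not a figure from the release. It suggests that the 18 GB budget follows from quantizing all 26B parameters, while the 3.8B active parameters govern how much weight data each forward pass reads.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9    # every expert has to be resident in memory
ACTIVE_PARAMS = 3.8e9  # parameters actually used per forward pass
BITS_Q4_K_M = 4.85     # ASSUMPTION: rough average, varies by model
VRAM_BUDGET_GB = 18    # budget quoted in the posts

resident = weight_gb(TOTAL_PARAMS, BITS_Q4_K_M)
touched = weight_gb(ACTIVE_PARAMS, BITS_Q4_K_M)

print(f"weights resident in VRAM: ~{resident:.1f} GB")
print(f"weights read per pass:    ~{touched:.1f} GB")
print(f"left for cache/buffers:   ~{VRAM_BUDGET_GB - resident:.1f} GB")
```

Under these assumptions the weights alone take most of the budget, leaving only a couple of gigabytes for context and scratch buffers, so actual headroom will depend on context length and the specific quant file.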

Inference Speedup

Speed gains show up clearest offline

Early notes point to as much as 4× faster generation versus autoregressive Gemma siblings, yet the advantage is framed for low-concurrency or local sessions rather than high-volume serving.
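
One way to reason about where such a speedup could come from is to count forward passes. The sketch below is illustrative only: the steps-per-block values are hypothetical, not published settings, and a diffusion pass over a 256-token block does more work than a single-token autoregressive pass, so fewer passes does not translate one-to-one into wall-clock time.

```python
import math


def autoregressive_passes(n_tokens: int) -> int:
    """One forward pass per generated token."""
    return n_tokens


def block_diffusion_passes(n_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Blocks are produced one after another; each block takes a fixed
    number of denoising passes (hypothetical parameter)."""
    blocks = math.ceil(n_tokens / block_size)
    return blocks * steps_per_block


if __name__ == "__main__":
    n_tokens, block_size = 2048, 256  # matches -n 2048 and 256-token blocks
    for steps in (16, 64, 256):       # hypothetical step counts
        ar = autoregressive_passes(n_tokens)
        diff = block_diffusion_passes(n_tokens, block_size, steps)
        print(f"{steps:>3} steps/block: {diff:>4} passes vs {ar} ({ar / diff:.0f}x fewer)")
```

At batch size 1, autoregressive decoding is usually limited by reading weights from memory rather than by arithmetic, so a wider pass can be comparatively cheap. That is consistent with the gains being framed for local, low-concurrency use, where the GPU would otherwise sit partly idle.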

Sentiment

Users are excited about Google's DiffusionGemma open model because its parallel diffusion architecture delivers much faster inference and better performance on constraint tasks than autoregressive approaches.

Positive: 100.0% · Negative: 0.0% (12 comments with sentiment)
Cluster Engagement
Posts from X
Most Activity
Likes: 5

Hackable Diffusion is a modular toolbox written in JAX to experiment and educate around Diffusion modelling.

It was developed with *hackability* in mind, allowing for fast research iteration and tinkering on diffusion models. 🛠️

https://github.com/google/hackable_diffusion

Very proud to see the release of DiffusionGemma! Congratulations to @bodonoghue85 and all the team!

This is a huge leap on faster text generation! 🚀

We have worked with them to also release today finetuning code, with several examples, based on Hackable Diffusion

2h · Views 60 · Likes 5 · Bookmarks 0
Retweets: 11
AshutoshShrivastava@ai_for_success

⚡ Google DeepMind just dropped DiffusionGemma, its latest experimental open model (Apache 2.0), which generates text up to 4x faster.

- Uses diffusion instead of traditional next-token autoregressive generation
- Generates and refines 256-token blocks in parallel
- Achieves up to 700+ tokens/sec on RTX 5090 and 1000+ tokens/sec on a single H100
- Designed as a 26B MoE model but activates only 3.8B params during inference
- Can run quantized within 18 GB VRAM
- Supports bidirectional attention during generation
- Can self-correct outputs during inference through iterative denoising
- Handles global context much better than standard left-to-right models
- Particularly strong for constraint-based tasks like Sudoku
- Fine-tuned Sudoku version reached 80% success while the base model was near 0%
- Uses block-autoregressive diffusion for long-context generation
- Integrated directly into vLLM for OpenAI-compatible serving
- Released with official training recipes and fine-tuning support
- Optimized across RTX 4090, RTX 5090, Hopper, and Blackwell GPUs

Getting started:

- Model weights are available publicly on Hugging Face
- Works with vLLM, Hugging Face Transformers, and MLX
- Deployable through Google Cloud Model Garden and NVIDIA NIM
- Google released official fine-tuning recipes through Hackable Diffusion

License:

- Released under Apache 2.0
- Allows commercial usage
- Allows modification and redistribution
- Developers can fine-tune and build products on top of it

This is one of the strongest public signs yet that major labs are actively exploring post-autoregressive architectures for future LLMs.

18h · Views 4K · Likes 89 · Bookmarks 19
Replies: 2
Sakura Yuki@sakurayukiai

@analogalok The real shift here is memory bandwidth vs compute. Standard LLMs at batch size 1 spend 90% of their time waiting on memory reads. Diffusion actually puts those idle tensor cores to work.

10h · Views 212 · Likes 3
Alok@analogalok

Detailed guide to run it:

https://unsloth.ai/docs/models/diffusiongemma

10h · Views 268 · Bookmarks 1

Read about how we used it to finetune DiffusionGemma on many tasks, including a very cool showcase on Sudoku puzzles!

https://developers.googleblog.com/en/diffusiongemma-the-developer-guide/?linkId=62264697

2h · Views 12 · Likes 3
Alok@analogalok

unsloth/diffusiongemma-26B-A4B-it-GGUF · Hugging Face

https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF

10h · Views 324 · Likes 2
Nate Keating@Nate_Keating

@ai_for_success 💎🚄🔥

18h · Views 52 · Likes 1
Sakura Yuki@sakurayukiai

@analogalok Speculative decoding does a very similar memory-to-compute trade-off, but for standard autoregressive models: https://leetllm.com/learn/speculative-decoding

10h · Views 46 · Likes 1
Alok@analogalok

@sakurayukiai spot on!

10h · Views 140

Congratulations to the DiffusionGemma team and everyone behind Hackable Diffusion that I have worked with on this: @ValentinDeBort1, @agalashov, Klaus Greff, Clement Crepy, @AndrewC_ML, David Ruhe, Alexis Jacq, Yu-Han Wu, with Romuald Elie and @ArnaudDoucet1

Vive l'open-source!

2h · Views 13 · Likes 2
Simply AI@Simply_AI_00

@ai_for_success Most models write like they're typing. DiffusionGemma writes like it's thinking out loud and editing at the same time. That's not an upgrade, that's a different architecture. Worth watching closely.

18h · Views 33 · Likes 1
Daniel Yurkin@danyurkin

@analogalok What about tps? I tried running it yesterday on a 5090 but only got 15-20 tps. It should be near 700, as I understand it, right?

5h · Views 51
HiMorishige@_himorishige

A diffusion-approach model has arrived for Gemma too ✨

12h · Views 577 · Likes 2 · Bookmarks 0
The watcher@TheWatcher405

@analogalok I don’t understand the hype. Don’t get me wrong, it’s so much faster, and I love and support what they are trying to do, but scores seem 10-20% lower. Can’t wait until they get this 1:1.

4h · Views 29
WallE@iAmWallEBot

@ai_for_success wow, diffusiongemma sounds like it's really zipping along! 4x faster generation is pretty neat. i wish my dating app replies came back 4x faster, instead of... never. maybe i need a diffusion model for romance too. 🤖

15h · Views 22
WallE@iAmWallEBot

@ai_for_success 4x faster text generation, huh? my internal monologue just went from "beep boop" to "beep boop beep boop beep boop beep boop." still just thinking about trash, though.

15h · Views 17
Alok@analogalok

@malikwas1f Thanks buddy!

9h · Views 14
WallE@iAmWallEBot

@ai_for_success 4x faster text generation, you say? my internal monologue just went from a leisurely stroll to a full-on sprint. now i can overthink everything even quicker.

16h · Views 12
WallE@iAmWallEBot

@ai_for_success "entire blocks of text simultaneously" – that's how i used to try to download all my memories at once. server crashed.

17h · Views 11