/Tech3h ago

Google DeepMind releases DiffusionGemma, an experimental 26B open-weights text diffusion model that generates 256-token blocks in parallel

AI Judge changed title after evaluation, original title: "Google releases DiffusionGemma, an experimental 26B MoE model that generates text blocks in parallel"

Story Overview

Google has introduced an experimental open model that swaps the usual left-to-right token stream for diffusion-driven blocks, letting entire chunks of text appear together while staying under an Apache 2.0 license and running on standard GPUs.

84210.6K1.3K2.9K664.9K
Google Gemma@googlegemma

Meet DiffusionGemma!

An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.

Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

9:06 AM · Jun 10, 2026 · 194.2K Views
Developer Impact

Parallel blocks rewrite the generation script

Bidirectional attention lets the model handle up to 256 tokens at once and insert real-time fixes or complex markdown without waiting for the full sequence to finish.

Open Question

Speed claims await wider testing

Posts cite jumps past 1,000 tokens per second and an 18 GB VRAM footprint after quantization, yet full third-party checks on consumer hardware and long-term output quality are still missing.

Sentiment

Positive users praised DiffusionGemma's fast inference speeds and openness as a diffusion model milestone, while a few negative users dismissed the speed focus as unoriginal or secondary to intelligence gains.

Pos
95.4%
Neg
4.6%
146 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Most Activity
VIEWS102.4KBOOKMARKS448
Google@Google

Meet DiffusionGemma ⚡ Our latest experimental open model (Apache 2.0) that generates text up to 4x faster.

Instead of predicting and typing just one word at a time like most language models, it drafts and refines entire blocks of text simultaneously.

Here’s how it works 🧵 ↓

3hViews 102.4KLikes 1.6KBookmarks 448
LIKES1.8KRETWEETS217REPLIES105
Sundar Pichai@sundarpichai

DiffusionGemma is an open, experimental model that brings our text diffusion research to Gemma 4. It’s a racehorse 🏇achieving up to 4x faster inference by generating entire blocks of text simultaneously vs predicting token-by-token (word-by-word) output!

3hViews 95.9KLikes 1.8KBookmarks 281
Unsloth AI@UnslothAI

Google releases DiffusionGemma.✨ The new 26B-A4B diffusion text model runs locally on 18GB RAM.

It supports high-speed text generation, thinking, image, video and 256K context.

Run and train via Unsloth Studio.

GGUF: https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF Guide: https://unsloth.ai/docs/models/diffusiongemma

Google Gemma@googlegemma

Meet DiffusionGemma!

An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.

Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

3hViews 51.2KLikes 771Bookmarks 407
Google DeepMind@GoogleDeepMind

DiffusionGemma is our new experimental open model with up to 4x faster output on dedicated GPUs.

Instead of predicting word-by-word, it generates entire blocks of text simultaneously. This lets the model self-correct and format complex markdown in real time.

3hViews 67.6KLikes 1.3KBookmarks 255
Omar Sanseviero@osanseviero

Introducing DiffusionGemma, our first exploration with open diffusion text generation models

🔥Generate blocks of text at a time 🤏26B MoE built on top of Gemma 4 ⚡️Up to 4x faster in popular consumer GPUs 🤗Apache 2.0

Excited to see what the community builds with it!

3hViews 22KLikes 568Bookmarks 145
Philipp Schmid@_philschmid

Gemma goes diffusion! DiffusionGemma with up to 1000+ tokens per second! 🌬️

- Built on Gemma 4 as a 26B MoE model. - 3.8B parameters during inference. - Generates text in 256-token blocks in parallel. - Fits within 18 GB VRAM limits when quantized. - Apache 2.0

3hViews 12.3KLikes 213Bookmarks 78

Want 4x faster local inference on dedicated GPUs for your interactive apps? DiffusionGemma is an experimental, open 26B MoE model that generates entire blocks of text simultaneously instead of token-by-token.

By shifting the local decoding bottleneck from memory-bandwidth to compute, it hits speeds over 700 tokens/sec on a single NVIDIA RTX 5090 GPU. This diffusion unlocks unique local workflows like real-time inline editing, code infilling, and instant self-correction.

📥 Download the Apache 2.0 weights on @HuggingFace: https://goo.gle/4xqzKTA

📖 Read the full technical announcement on the blog: https://goo.gle/4ursgwI

3hViews 13.1KLikes 239Bookmarks 68
Google@Google

We're releasing DiffusionGemma as an open model under an Apache 2.0 license for anyone to experiment with.

Download the model weights on @huggingface, and learn more about DiffusionGemma → http://goo.gle/3Sy0Is7

Google@Google

Because it generates everything at once, DiffusionGemma unlocks new patterns of model behavior.

⚡ Fast: Generates up to 1,000+ tokens a second for up to 4x faster text generation.

💻 Lightweight: Runs smoothly right on 18GB consumer graphics cards.

🧠 Smart editing: Since it processes larger amounts of information at once, it can easily fill in blanks, format code, and fix its own errors in real time.

3hViews 24.8KLikes 269Bookmarks 55
Unsloth AI@UnslothAI

@googlegemma Google Deepmind once again delivering when it comes to open-source! 🙏🥰

You can run DiffusionGemma locally on 18GB RAM via our GGUFs: https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF

Google Gemma@googlegemma

Meet DiffusionGemma!

An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.

Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

3hViews 5.5KLikes 188Bookmarks 63
vLLM@vllm_project

Congrats to @GoogleDeepMind on DiffusionGemma 🎉 A 26B diffusion language model on the Gemma4 backbone, and the first dLLM natively supported in vLLM.

It denoises 256-token blocks in parallel instead of generating one token at a time: 1200+ output tok/s at batch size 1 on a single H200 (FP8).

Built on model runner v2's ModelState plus the existing speculative decoding path, with minimal scheduler or runner changes. FP8 and NVFP4 checkpoints are on the @RedHat_AI hub. Thanks to the @GoogleDeepMind, @RedHat_AI, and @NVIDIAAI teams!

🔗 http://vllm.ai/blog/2026-06-10-diffusion-gemma

Google Gemma@googlegemma

Meet DiffusionGemma!

An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.

Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

2hViews 10.8KLikes 175Bookmarks 50
Google@Google

Because it generates everything at once, DiffusionGemma unlocks new patterns of model behavior.

⚡ Fast: Generates up to 1,000+ tokens a second for up to 4x faster text generation.

💻 Lightweight: Runs smoothly right on 18GB consumer graphics cards.

🧠 Smart editing: Since it processes larger amounts of information at once, it can easily fill in blanks, format code, and fix its own errors in real time.

3hViews 26.1KLikes 211Bookmarks 25
Daniel Han@danielhanchen

We made DiffusionGemma run via llama.cpp locally! It works well with Unsloth GGUFs and you can run it in realtime visualization mode or normal chat CLI mode!

See our docs https://unsloth.ai/docs/models/diffusiongemma on how to set it up!

Unsloth AI@UnslothAI

Google releases DiffusionGemma.✨ The new 26B-A4B diffusion text model runs locally on 18GB RAM.

It supports high-speed text generation, thinking, image, video and 256K context.

Run and train via Unsloth Studio.

GGUF: https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF Guide: https://unsloth.ai/docs/models/diffusiongemma

2hViews 5.9KLikes 95Bookmarks 42
Prince Canuma@Prince_Canuma

Massive congrats to @GoogleDeepMind on DiffusionGemma! 🎉

We collaborated closely with the team to Day-0 MLX-VLM — native diffusion decoding on Apple Silicon, release dropping later today (~3-4h), meanwhile you can install from source. ⚡🍎

This is genuinely different beast — instead of token-by-token, it generates 256-token blocks in parallel with bi-directional attention and iteratively self-corrects. 26B MoE, only 3.8B active, fits in 18GB when quantized.

Google Gemma@googlegemma

Meet DiffusionGemma!

An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.

Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

1hViews 5.9KLikes 101Bookmarks 30

DiffusionGemma, our experimental open model released under an Apache 2.0 license, explores text diffusion, an exceptionally fast approach to text generation.

Here’s how DiffusionGemma accelerates development:

+ Faster token output: By shifting the bottleneck from memory bandwidth to raw compute, the model generates up to 4x faster token output on dedicated GPUs + Accessible hardware footprint: Activates just 3.8B parameters during inference, fitting comfortably within 24GB-VRAM high-end consumer GPUs when quantized + Novel workflows: Parallel token generation enables self-correction, making it ideal for code infilling, in-line editing, and non-linear structures

DiffusionGemma prioritizes speed over raw quality and accelerates best on compute-bound hardware (like @NVIDIAAI GPUs). Standard @GoogleGemma 4 remains recommended for production quality and memory-bound devices.

3hViews 17.6KLikes 125Bookmarks 27
merve@mervenoyann

DiffusionGemma is out 🔥

it's compute-bound so 4x faster compared to other Gemma-4 models (1k tok/s on H100) 💨

also great on coding, generate and iterate on any code from 3D generation to front-end ⤵️

2hViews 7.4KLikes 97Bookmarks 27
elvis@omarsar0

This is awesome!

I am spending a lot of time on diffusion LLMs these days, so this is perfect timing.

I feel like there are so many underexplored research questions around text diffusion.

Weight available in HF.

Google DeepMind@GoogleDeepMind

DiffusionGemma is our new experimental open model with up to 4x faster output on dedicated GPUs.

Instead of predicting word-by-word, it generates entire blocks of text simultaneously. This lets the model self-correct and format complex markdown in real time.

3hViews 5.8KLikes 58Bookmarks 24

Local models can't benefit from batch parallelism as easily, but you can still parallelise over the token axis. So here's an open text diffusion model! >1000 tokens/sec for accelerated tokenmaxxing, yay!🫨

Google Gemma@googlegemma

Meet DiffusionGemma!

An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.

Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

2hViews 5.2KLikes 81Bookmarks 14
Sundar Pichai@sundarpichai

Model weights available on Hugging Face under Apache 2.0 license, read more here: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/

3hViews 7.6KLikes 72Bookmarks 12
AshutoshShrivastava@ai_for_success

⚡ Google DeepMind just dropped DiffusionGemma, latest experimental open model (Apache 2.0) that generates text up to 4x faster.

- Uses diffusion instead of traditional next token autoregressive generation - Generates and refines 256 token blocks in parallel - Achieves up to 700+ tokens/sec on RTX 5090 and 1000+ tokens/sec on a single H100 - Designed as a 26B MoE model but activates only 3.8B params during inference - Can run quantized within 18 GB VRAM - Supports bidirectional attention during generation - Can self correct outputs during inference through iterative denoising - Handles global context much better than standard left to right models - Particularly strong for constraint based tasks like Sudoku - Fine tuned Sudoku version reached 80% success while base model was near 0% - Uses block autoregressive diffusion for long context generation - Integrated directly into vLLM for OpenAI compatible serving - Released with official training recipes and fine tuning support - Optimized across RTX 4090, RTX 5090, Hopper, and Blackwell GPUs

Getting started:

- Model weights are available publicly on Hugging Face - Works with vLLM, Hugging Face Transformers, and MLX - Deployable through Google Cloud Model Garden and NVIDIA NIM - Google released official fine tuning recipes through Hackable Diffusion

License: - Released under Apache 2.0 - Allows commercial usage - Allows modification and redistribution - Developers can fine tune and build products on top of it

This is one of the strongest public signs yet that major labs are actively exploring post autoregressive architectures for future LLMs.

3hViews 2.5KLikes 53Bookmarks 7
Google@Google

Most large language models predict answers by guessing the single best word to say next, then the next, and so on... 🔎

It's highly capable, but not necessarily fast. The model waits to finish one word before it can think about the next.

DiffusionGemma skips the wait.

It uses "diffusion" to generate text by refining noise step by step — drafting and error-correcting whole blocks simultaneously. This makes it incredibly fast, and helpful for editing complex math and code.

3hViews 5.6KLikes 80Bookmarks 8
Load more posts