Autoregressive LLMs are officially on notice.
Run the Gemma 4 26B diffusion GGUF with llama.cpp.
Google just dropped DiffusionGemma-26B, and it completely flips how we generate text.
Instead of predicting tokens one by one, it generates 256 tokens in parallel using bidirectional attention.
It's like Stable Diffusion, but for language. The model starts with random text "noise" and iteratively refines and self-corrects the entire block, fixing formatting and reasoning errors on the fly.
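To make the idea concrete, here's a toy Python sketch of parallel iterative refinement. It is not DiffusionGemma's actual sampler: an oracle that already knows the answer stands in for the model, and the names toy_denoise, steps, and seed are made up for illustration.

```python
import random


def toy_denoise(target, steps=4, seed=0):
    """Refine a whole block in parallel; return the block after every step."""
    rng = random.Random(seed)
    vocab = sorted(set(target))
    # Step 0: pure "noise", a random vocabulary token at every position.
    block = [rng.choice(vocab) for _ in target]
    history = [list(block)]
    for step in range(1, steps + 1):
        # ASSUMPTION: an oracle stands in for the model. A real diffusion LM
        # would predict every position at once from the current noisy block.
        fix_prob = step / steps  # reaches 1.0 on the last step
        block = [
            want if rng.random() < fix_prob else have
            for want, have in zip(target, block)
        ]
        history.append(list(block))
    return history


if __name__ == "__main__":
    words = "diffusion models refine the whole block at once".split()
    for i, block in enumerate(toy_denoise(words)):
        print(i, " ".join(block))
```

Every position gets revisited on every pass, so the number of wrong tokens can only shrink. That is the key contrast with left-to-right decoding, where a token is fixed once it has been emitted.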
Since it’s a Mixture of Experts (MoE) that only activates 3.8B parameters per token during inference, it runs fast on consumer hardware. The Q4_K_M quant fits in an 18GB VRAM budget on a single RTX 3090 or RTX 4090 with high throughput.
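A quick back-of-the-envelope check on that VRAM budget, as a rough Python sketch. With all layers offloaded, every expert still has to sit in memory even though only a few are active per token, so the footprint follows the full 26B while the per-token compute follows the 3.8B. The 4.85 bits per weight for Q4_K_M is my assumption (a rough average, it varies by model), and weights_gb is just an illustrative helper.

```python
def weights_gb(n_params, bits_per_weight):
    """Size of the weight tensors in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9     # from the model name (26B)
ACTIVE_PARAMS = 3.8e9   # active per token, as quoted in the post
Q4_K_M_BPW = 4.85       # ASSUMPTION: rough average bits per weight
VRAM_BUDGET_GB = 18.0   # budget quoted in the post

resident_gb = weights_gb(TOTAL_PARAMS, Q4_K_M_BPW)
headroom_gb = VRAM_BUDGET_GB - resident_gb
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"weights resident in VRAM: {resident_gb:.1f} GB")
print(f"left for KV cache and buffers: {headroom_gb:.1f} GB")
print(f"share of weights used per token: {active_fraction:.0%}")
```

Under these assumptions the weights take most of the 18GB and the remainder goes to context and scratch buffers, so treat the budget as tight rather than roomy.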
Tested on Ubuntu 22 with CUDA 13.1 using the cutting-edge experimental llama.cpp branch.
Here is how to compile and run it with the live terminal denoising visualizer:
# 1. Clone & check out the experimental PR (#24423)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/24423/head:diffusiongemma && git checkout diffusiongemma
# 2. Build with CUDA support
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build -j $(nproc) --config Release --target llama-diffusion-cli
# 3. Run with live visual denoising (llama.cpp flags)
./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual
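If you'd rather launch it from a script, here's a small Python sketch that assembles the same command and checks the paths first. It only uses the flags from the command above. The helper names build_diffusion_cmd and missing_paths are my own, not part of llama.cpp.

```python
import shlex
from pathlib import Path


def build_diffusion_cmd(binary, model_path, n_predict=2048, gpu_layers=99):
    """Assemble the argv list using only the flags shown in this post."""
    return [
        str(binary),
        "-m", str(model_path),
        "-ngl", str(gpu_layers),
        "-cnv",
        "-n", str(n_predict),
        "--diffusion-visual",
    ]


def missing_paths(cmd):
    """Return the binary and/or model path if they do not exist on disk."""
    return [p for p in (cmd[0], cmd[2]) if not Path(p).exists()]


cmd = build_diffusion_cmd(
    "./build/bin/llama-diffusion-cli",
    "/path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf",
)
print(shlex.join(cmd))
print("missing:", missing_paths(cmd))
```

Once missing_paths(cmd) comes back empty, subprocess.run(cmd, check=True) will start the session in the current terminal.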
Watch the video below to see the live --diffusion-visual canvas iteratively denoising the output in real time.
The guide and Unsloth's Hugging Face GGUF model links are in the comments below!
Is autoregressive generation officially legacy tech? Let me know what you think.
Meet DiffusionGemma!
An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.
It moves beyond sequential, token-by-token generation to produce entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇








