⚡ Google DeepMind just dropped DiffusionGemma, latest experimental open model (Apache 2.0) that generates text up to 4x faster.
- Uses diffusion instead of traditional next token autoregressive generation
- Generates and refines 256 token blocks in parallel
- Achieves up to 700+ tokens/sec on RTX 5090 and 1000+ tokens/sec on a single H100
- Designed as a 26B MoE model but activates only 3.8B params during inference
- Can run quantized within 18 GB VRAM
- Supports bidirectional attention during generation
- Can self correct outputs during inference through iterative denoising
- Handles global context much better than standard left to right models
- Particularly strong for constraint based tasks like Sudoku
- Fine tuned Sudoku version reached 80% success while base model was near 0%
- Uses block autoregressive diffusion for long context generation
- Integrated directly into vLLM for OpenAI compatible serving
- Released with official training recipes and fine tuning support
- Optimized across RTX 4090, RTX 5090, Hopper, and Blackwell GPUs
Getting started:
- Model weights are available publicly on Hugging Face
- Works with vLLM, Hugging Face Transformers, and MLX
- Deployable through Google Cloud Model Garden and NVIDIA NIM
- Google released official fine tuning recipes through Hackable Diffusion
License:
- Released under Apache 2.0
- Allows commercial usage
- Allows modification and redistribution
- Developers can fine tune and build products on top of it
This is one of the strongest public signs yet that major labs are actively exploring post autoregressive architectures for future LLMs.