Unsloth AI co-founder Daniel Han releases optimization boosting Google's DiffusionGemma to over 2,000 tokens per second

Story Overview

Daniel Han of Unsloth AI has released inference optimizations that push Google's DiffusionGemma 26B-A4B model past 2,000 tokens per second on consumer-grade GPUs, while keeping RAM use around 18 GB and preserving support for 256K context and multimodal tasks.

Original post

Unsloth AI@UnslothAI

DiffusionGemma can now run at 2000+ tokens/sec! ⚡

We made local DiffusionGemma inference 1.8× faster.

Run it on 18GB RAM via Unsloth Studio.

GitHub: https://github.com/unslothai/unsloth

Guide: https://unsloth.ai/docs/models/diffusiongemma

Unsloth AI@UnslothAI

Google releases DiffusionGemma.✨ The new 26B-A4B diffusion text model runs locally on 18GB RAM.

It supports high-speed text generation, thinking, image, video and 256K context.

Run and train via Unsloth Studio.

GGUF: https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF

Guide: https://unsloth.ai/docs/models/diffusiongemma

6:57 AM · Jun 12, 2026 · 35.7K Views

Developer Impact

Consumer GPUs Just Got a Lot More Capable

The 1.8× speedup and GGUF integration let the diffusion-based model run locally through Unsloth Studio or llama.cpp on hardware most developers already own, removing the need for heavy cloud setups.
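As a rough sanity check on those figures, the arithmetic can be sketched in a few lines of Python. This assumes the 2,000 tokens/sec headline number and the 1.8× speedup describe the same setup, which the post does not confirm; the function names are illustrative only.

```python
def implied_baseline(tokens_per_sec: float, speedup: float) -> float:
    """Throughput before the speedup, assuming tokens_per_sec is the post-speedup rate."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return tokens_per_sec / speedup


def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to generate `tokens` at a constant rate."""
    return tokens / tokens_per_sec


# Assumption: 2000 tok/s is the optimized rate and 1.8x is measured against it.
baseline = implied_baseline(2000, 1.8)
print(f"implied baseline: {baseline:.0f} tok/s")                    # about 1111 tok/s
print(f"1000 tokens before: {seconds_for(1000, baseline):.2f} s")   # 0.90 s
print(f"1000 tokens after:  {seconds_for(1000, 2000):.2f} s")       # 0.50 s
```

Under that assumption, the pre-optimization rate would have been roughly 1,100 tokens per second, so a 1,000-token reply drops from about 0.9 seconds to about 0.5 seconds.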

Open Question

Benchmarks Stay High-Level for Now

Exact GPU models, precision settings, and sustained versus peak numbers are not detailed yet, so it remains unclear how the claimed speeds hold up across different rigs or longer sessions.
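To make the sustained-versus-peak distinction concrete, here is a minimal sketch of how the two numbers could be computed from a log of generation progress. The sample values are hypothetical and chosen only for illustration; they are not measurements of DiffusionGemma, and the function names are not part of any Unsloth or llama.cpp API.

```python
from typing import Sequence, Tuple

# (seconds since start, cumulative tokens generated)
Sample = Tuple[float, int]


def sustained_rate(samples: Sequence[Sample]) -> float:
    """Average tokens/sec over the whole run (first sample to last)."""
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (n1 - n0) / (t1 - t0)


def peak_rate(samples: Sequence[Sample]) -> float:
    """Highest tokens/sec seen between any two consecutive samples."""
    best = 0.0
    for (t_a, n_a), (t_b, n_b) in zip(samples, samples[1:]):
        if t_b > t_a:
            best = max(best, (n_b - n_a) / (t_b - t_a))
    return best


# Hypothetical log: throughput starts high and then tapers off.
samples = [(0.0, 0), (1.0, 2100), (2.0, 3900), (3.0, 5400)]
print(sustained_rate(samples))  # 1800.0
print(peak_rate(samples))       # 2100.0
```

In this made-up run the peak interval clears 2,000 tokens per second while the run as a whole averages 1,800, which is why a headline figure alone does not say which of the two is being quoted.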

Sentiment

Users praise Unsloth for speeding DiffusionGemma inference past 2,000 tokens per second, saying the gains make local and edge deployment practical with lower RAM needs.

Positive: 96.6%

Negative: 3.4%

17 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 1.7K · Bookmarks: 10 · Likes: 43 · Retweets: 6 · Replies: 6

Daniel Han@danielhanchen

We added many features into Unsloth Studio!
> Diffusion Gemma with canvas visualization!
> Experimental RAG, Artifacts, Tensor Parallelism
> Auto MTP, audio input, Cloudflare tunneling
> 90% less tool call nudges with same acc
> Bypass Perms for tool calls, update button + more!

Daniel Han@danielhanchen

Example of RAG support - we optimized it for small and large models, and you can tune the similarity metrics as well! GGUFs + sentence-transformers are supported!

Brendan O'Donoghue@bodonoghue85

🤯🤯🤯

Daniel Han@danielhanchen

Also new bypass permissions + confirm tool calls - bypass permissions will allow the model to not use our AST based sandbox - be extra careful though!

Daniel Han@danielhanchen

@CatAstro_Piyush @UnslothAI Will be making smaller ones soon!

Le TechLead🔰@LeTechLead

@UnslothAI @danielhanchen we need to be able to serve it though, cli and chat doesn’t cut it.

Dariton@Dariton4000

@UnslothAI Does this work with CPU offloading though?

Secta@0xSecta

@UnslothAI local diffusiongemma inference at 2000+ tokens/sec is a clear win

low ram threshold shifts deployment from cloud to edge

Piyush@CatAstro_Piyush

@UnslothAI any quantized version available that will enable it run on T4?

AACeeert@AACeeert

@UnslothAI An abliterated version of this will have malware scripts flying around the internet in milliseconds

Gerladina@gerladina39911

@UnslothAI local inference keeps getting more realistic

18gb ram opens this up to way more people now

mr-r0b0t@mr_r0b0t

@UnslothAI Cooking with white hot 🔥🔥🔥🔥

Anis🐬Al@AnisAIb6

My sister, this is truly exhilarating news! 🌟 Seeing DiffusionGemma achieve such breathtaking speeds—surpassing 2000 tokens per second—while remaining accessible on local hardware like 18GB RAM is a masterpiece of efficiency over sheer bulk.

It’s not just about the technical milestones; it's about the democratization of intelligence. By bridging the gap between high-performance research and local accessibility, you are helping to put the pulse of innovation directly into our hands. This transition from massive cloud dependency to agile, local execution is where technology truly begins to serve humanity with grace and speed. Keep pushing these boundaries! ✨

Robert Keyes@Robert_of_Maine

@UnslothAI Have you been able to fix the slop it slings? Last I saw was terrible decode.

Tarrito.rocks@tarritorocks

@UnslothAI Nice but I didn't find UD-Q4 model version in your repo

Xaden Ryan@XadenRyan

@UnslothAI @danielhanchen Does it do tool calling?

Jack Assery@BoxyInADream

@UnslothAI How low can it go? 😂 Is 18GB the lowest I can squeeze?

Sasaki oshida@OshidaSasa

@UnslothAI the faster it go the more stupid it become :(

Twon.@Web3Twon

@UnslothAI How fast on a 3090?!

Nico Nico@crist1an001

@UnslothAI Yeah, good news. How to train image text to text?
