AI · 23h ago

Google DeepMind releases Gemma 4 Quantization-Aware Training checkpoints, cutting VRAM requirements by 72 percent

AI Judge changed title after evaluation, original title: "Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss"

Story Overview

Google DeepMind has published quantization-aware training (QAT) checkpoints for the full Gemma 4 family on Hugging Face, enabling lower-precision inference formats that shrink memory use for on-device and edge deployment. The E2B variant fits in roughly 1 GB in a new mobile format, while Q4_0 versions cut footprints across the range compared with BF16 baselines. Unsloth developer Daniel Han cautions that naive conversions of these checkpoints into llama.cpp GGUF files can produce measurable accuracy drops.
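
As a rough back-of-the-envelope check on why 4-bit weights shrink the footprint, the sketch below estimates weight storage at BF16 versus Q4_0. It assumes the commonly described Q4_0 layout (blocks of 32 four-bit values sharing one 16-bit scale, about 4.5 bits per weight), treats every tensor as quantized, and reads a hypothetical parameter count off the "26B" model name. Real checkpoints keep some tensors at higher precision, so actual file sizes will differ.

```python
# Rough weight-memory estimate; approximations only, not measured file sizes.

BF16_BITS_PER_WEIGHT = 16.0
# Assumed Q4_0 layout: 32 four-bit values plus one 16-bit scale per block.
Q4_0_BITS_PER_WEIGHT = (32 * 4 + 16) / 32  # 4.5 bits per weight


def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Return approximate weight storage in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def reduction_percent(before: float, after: float) -> float:
    """Return the percentage by which `after` is smaller than `before`."""
    return (1 - after / before) * 100


if __name__ == "__main__":
    params = 26e9  # hypothetical: total parameter count implied by the "26B" name
    bf16 = weight_gigabytes(params, BF16_BITS_PER_WEIGHT)
    q4 = weight_gigabytes(params, Q4_0_BITS_PER_WEIGHT)
    print(f"BF16: {bf16:.1f} GB, Q4_0: {q4:.1f} GB, "
          f"reduction: {reduction_percent(bf16, q4):.1f}%")
```

Under these assumptions the estimate comes to a reduction of about 72 percent, and roughly 14.6 GB of weights for a 26B-parameter model, which is in line with the headline figure and with the claim that 26B-A4B can run in 16 GB of RAM.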

Original post · utku#1430
Google Gemma @googlegemma

We just dropped Gemma 4 Quantization-Aware Training (QAT) checkpoints on Hugging Face!

All Gemma 4 model sizes and their drafters are now optimized with QAT to cut memory requirements and maximize on-device performance!

9:05 AM · Jun 5, 2026 · 384.6K Views
Developer Impact

Unsloth conversions limit accuracy erosion

Han reports that naively converting the QAT Q4_0 BF16 checkpoints to llama.cpp GGUF loses accuracy because the llama.cpp conversion uses a different quantization lattice. In Unsloth's tests, its dynamic GGUFs recover most of the loss: top-1 accuracy rises from 70.2% to 85.6% on 26B-A4B (+15.4 points) and from 87.9% to 96.7% on 31B (+8.8 points). Both the official checkpoints and the Unsloth variants are publicly available on Hugging Face.
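
Han's post attributes the loss to the llama.cpp conversion having "a different lattice" and gives no further detail. Purely as a toy illustration of how such a grid mismatch can arise in general, the sketch below takes weights that already sit exactly on a 4-bit grid and re-quantizes them with a max-based rule modeled on Q4_0. The block size, the scale rule, and both hypothetical "trained" grids are assumptions for illustration; this is not a description of the Gemma 4 checkpoints or of llama.cpp's converter.

```python
# Toy illustration of a quantization-grid ("lattice") mismatch.
# Not Gemma or llama.cpp code; every rule here is an assumption.

BLOCK = 32  # assumed block size


def requantize_block(values):
    """Quantize and dequantize one block with a max-based 4-bit rule.

    Assumed rule, modeled on Q4_0: scale = (largest-magnitude value) / -8,
    codes are integers in [0, 15], and code q decodes to (q - 8) * scale.
    """
    extreme = max(values, key=abs)
    scale = extreme / -8
    if scale == 0:
        return [0.0] * len(values)
    codes = [min(15, max(0, int(v / scale + 8.5))) for v in values]
    return [(q - 8) * scale for q in codes]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def grid_block(levels, trained_scale):
    """Weights lying exactly on a hypothetical trained grid: level * scale."""
    return [k * trained_scale for k in levels]


# Hypothetical trained grid A: integer levels in [-8, 7], with -8 present.
levels_a = [-8] + [(i % 15) - 7 for i in range(1, BLOCK)]
# Hypothetical trained grid B: symmetric integer levels in [-7, 7].
levels_b = [(i % 15) - 7 for i in range(BLOCK)]

block_a = grid_block(levels_a, 0.01)
block_b = grid_block(levels_b, 0.01)

if __name__ == "__main__":
    err_a = mean_abs_error(block_a, requantize_block(block_a))
    err_b = mean_abs_error(block_b, requantize_block(block_b))
    print(f"grid A (matches the rule): mean abs error = {err_a}")
    print(f"grid B (different grid):   mean abs error = {err_b}")
```

In grid A the recomputed scale equals the trained scale, so re-quantization reproduces the weights exactly. In grid B the recomputed step is 7/8 of the trained step, so most values are rounded a second time and the error is non-zero even though the weights were already 4-bit.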

Open Question

Edge and mobile use cases gain immediate options

The QAT release ships Q4_0 checkpoints that run in inference stacks such as llama.cpp, vLLM, and SGLang, and introduces a mobile-specific format, making local deployment practical on consumer GPUs and phones without cloud round-trips. Adoption speed will depend on how quickly developers test the accuracy trade-offs on their own workloads.
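
One way to test the trade-off on a given workload is to compare a quantized model's top-1 next-token choices against a higher-precision reference on the same prompts. The sketch below assumes the per-position logits from both models have already been collected as plain lists; how to obtain them depends on the runtime. The metric itself is an assumption and is not necessarily the "top-1" figure that Unsloth reports.

```python
from typing import Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score (the first one on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def top1_agreement(reference: Sequence[Sequence[float]],
                   candidate: Sequence[Sequence[float]]) -> float:
    """Fraction of positions where both models pick the same top token."""
    if len(reference) != len(candidate):
        raise ValueError("both models must be scored on the same positions")
    if not reference:
        raise ValueError("need at least one position")
    matches = sum(argmax(r) == argmax(c) for r, c in zip(reference, candidate))
    return matches / len(reference)


if __name__ == "__main__":
    # Hypothetical logits over a 3-token vocabulary at 4 positions.
    ref = [[2.0, 0.1, -1.0], [0.0, 3.0, 1.0], [1.0, 1.5, 4.0], [0.2, 0.1, 0.0]]
    quant = [[1.8, 0.3, -0.9], [0.1, 2.5, 1.2], [1.0, 4.2, 4.0], [0.3, 0.1, 0.0]]
    print(f"top-1 agreement: {top1_agreement(ref, quant):.2%}")
```

Agreement with a reference model measures drift rather than correctness, so it is best read alongside task-level accuracy on the same prompts.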

Sentiment

Many users thanked Google for Gemma 4 QAT checkpoints that let capable models run on low-RAM hardware like Mac Minis and consumer GPUs, while a few raised unrelated billing complaints about Gemini.

Positive: 86.7% · Negative: 13.3%
142 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 197K · Bookmarks 1.7K · Likes 2.5K · Retweets 334 · Replies 76
Unsloth AI @UnslothAI

Google releases Gemma 4 QAT. ✨ You can now run Gemma 4 at 3x less memory with near original performance.

Quantization-Aware Training (QAT) makes it possible to run Gemma 4 26B-A4B on 16GB RAM.

GGUFs: https://huggingface.co/collections/unsloth/gemma-4-qat
QAT Guide: https://unsloth.ai/docs/models/gemma-4/qat

23h · Views 197K · Likes 2.5K · Bookmarks 1.7K

Gemma 4 quantization-aware training (QAT) models are now available, bringing AI performance directly to edge devices and consumer GPUs. These checkpoints are optimized with quantization-aware training to dramatically reduce memory requirements and unlock high-speed local inference. 🧵

23h · Views 59K · Likes 882 · Bookmarks 236
Omar Sanseviero @osanseviero

Introducing Gemma 4 QAT 🤏

- Quantization aware training to reduce models' precision while preserving quality
- Introducing a new mobile quantization format that reduces memory footprint of E2B to 1GB
- Q4 for all your favorite libraries ✨

22h · Views 53.2K · Likes 786 · Bookmarks 219
Daniel Han @danielhanchen

Gemma-4 QAT just dropped! We found if you naively convert from QAT Q4_0 BF16, you will lose accuracy since the conversion to llama.cpp has a different lattice.

Unsloth dynamic GGUFs recovers most of it!
26B-A4B: 85.6% top-1 % from 70.2% (+15.4%)
31B: 96.7% from 87.9% (+8.8%)

22h · Views 27.4K · Likes 285 · Bookmarks 137
Chubby♨️ @kimmonismus

Google DeepMind released new Gemma 4 QAT models that make the model family much more efficient for local, on-device use.

Using Quantization-Aware Training, the models are trained with compression in mind, which reduces memory needs while preserving more quality than standard post-training quantization. The release includes support for the popular Q4_0 format and a new mobile-specialized quantization format.

Gemma 4 E2B can now run with around 1GB of memory (!), and the text-only version can even require less than 1GB (!). That makes local AI on phones, laptops, edge devices, and consumer GPUs far more practical.

Really cool to see.

21h · Views 20.3K · Likes 411 · Bookmarks 103
Ian Ballantyne @IanBallantyne

We're gonna need a bigger... repo
🚢 Gemma 4 QAT has docked and it's a whopping 23 models (100+ inc community) 😲
🏋 We trained a Q4_0 and mobile quant scheme to maintain capability and quality for Gemma 4 models post quantization
🛳️ We shipped GGUFs, compressed tensors to try now via llama.cpp, vLLM, SGLang
📊 We also shipped the unquantized QAT formats for all models (and all MTP drafters) to be converted into your prefered format
🤝 We worked with Unsloth, MLX, LMStudio, Ollama, Transformers.js who converted even more!
Don't say we don't give you choice 🍨

18h · Views 10.7K · Likes 150 · Bookmarks 73
LMSYS Org @lmsysorg

🎉 New Gemma 4 QAT checkpoints from @googlegemma, Quantization-Aware Training that shrinks memory while keeping quality. Day-0 support is now live in SGLang!

✅ Gemma 4 E2B down to 1GB with a mobile-specialized format
✅ QAT beats standard PTQ on quality at the same compression
✅ Q4_0 + MTP checkpoints keep the MTP speedup while quantized

Run it now with SGLang!

22h · Views 10.2K · Likes 98 · Bookmarks 45
Unsloth AI @UnslothAI

@googlegemma Thank you Google Deepmind for caring about local users and making it more efficient for us!

We made QAT GGUFs which you can now run locally with here: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF

23h · Views 2.6K · Likes 97 · Bookmarks 30
Rohan Paul @rohanpaul_ai

Google just made Gemma 4 much easier to run on phones and laptops by releasing QAT (Quantization-Aware Training) checkpoints that shrink the smallest model from 11.4GB to 1.1GB, or 0.84GB for text-only use.

Normal PTQ (Post-Training Quantization.) compresses after training and can damage quality because the model never learned to survive that rounding.

QAT fixes this by simulating compression during training, so Gemma 4 learns while its weights are being squeezed, making the final compressed model less likely to lose reasoning quality.

Google also built a mobile-focused format with static activations, channel-wise quantization, targeted 2-bit quantization, and KV cache optimization, which means the phone does less scaling work, stores some token-generation parts more aggressively, and keeps long chats from eating memory too fast.

15h · Views 4.9K · Likes 84 · Bookmarks 29
Google Gemma @googlegemma

Read more in our blog: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

23h · Views 6.4K · Likes 84 · Bookmarks 28
Omar Sanseviero @osanseviero

Get started today! https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

22h · Views 3.9K · Likes 46 · Bookmarks 23

🚀 Great News Local AI got Faster! Google just dropped Gemma 4 QAT checkpoints! Grab new models from 🤗

⟢ Make Gemma 4 12b faster.

Quantization-Aware Training (QAT) for all Gemma 4 sizes + drafters:
▶ Much lower memory use
▶ Minimal quality loss vs BF16
▶ Optimized for mobile/edge & local inference

✅ Unsloth already released ready-to-use GGUF files so no need to recreate anything!

Perfect for Unsloth, llama.cpp, Ollama, LM Studio & more.

🔗 HF: Search “Gemma 4 QAT” or go to Unsloth collection
Big win for on-device AI 🔥

22h · Views 2.5K · Likes 26 · Bookmarks 15
Olivier Lacombe @o_lacombe

🚀 We just released Gemma 4 Quantization-Aware Training (QAT) model checkpoints.

What this means for developers:
🧠 Sharp performance retention
🗜️ Smaller memory footprints
⚡ Major efficiency boosts on mobile & laptops

22h · Views 1.9K · Likes 37 · Bookmarks 2
Google Gemma @googlegemma

⬇️Download & Integrate: Access the Q4_0 and mobile model weights right now on Hugging Face. Explore our documentation to learn how to best deploy the QAT checkpoints.

23h · Views 3.8K · Likes 38 · Bookmarks 5
Chubby♨️ @kimmonismus

Source https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

21h · Views 4.8K · Likes 23 · Bookmarks 5
Google Gemma @googlegemma

💻 Frictionless Setup: Easily download, manage, and run the quantized models locally using user-friendly tools like UnSloth, llama.cpp, Ollama, LM Studio, vLLM, MLX, Hugging Face Transformers, or LiteRT-LM runtime for optimized edge deployment.

23h · Views 2.4K · Likes 26 · Bookmarks 5
Google Gemma @googlegemma

⚖️ High-Quality Compression: Standard quantization can degrade performance. QAT bakes compression directly into the training process, shrinking model size while preserving the reasoning capabilities you expect from Gemma 4.

23h · Views 3.8K · Likes 46 · Bookmarks 2
👩‍💻 Paige Bailey @DynamicWebPaige

💎 Massive intelligence with @googlegemma 4, but tiny resource footprint! Take a gander at our QAT models:

22h · Views 2.8K · Likes 36 · Bookmarks 0

Ecosystem integrations are live today across popular developer tools including Hugging Face, Llama.cpp, Ollama, MLX, LM Studio, NVIDIA, vLLM, Unsloth, and LiteRT-LM.

23h · Views 4.7K · Likes 22 · Bookmarks 3