Introducing Gemma 4 QAT 🤏
- Quantization aware training to reduce models' precision while preserving quality
- Introducing a new mobile quantization format that reduces memory footprint of E2B to 1GB
- Q4 for all your favorite libraries ✨
AI Judge changed title after evaluation, original title: "Daniel Han implements Unsloth dynamic GGUFs to recover accuracy lost when converting Google's new Gemma 4 QAT models to llama.cpp"
Unsloth AI released dynamic GGUFs to restore conversion accuracy.
Many users praised Gemma 4 QAT releases for cutting memory use enough to run capable models on phones and edge devices, while a few dismissed the gains or aired unrelated grievances about Gemini.
Gemma-4 QAT just dropped! We found that if you naively convert the QAT Q4_0 BF16 checkpoint, you lose accuracy, because the conversion to llama.cpp quantizes onto a different lattice.
Unsloth dynamic GGUFs recover most of it!
26B-A4B: 85.6% top-1, up from 70.2% (+15.4 points)
31B: 96.7%, up from 87.9% (+8.8 points)
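The lattice mismatch mentioned above can be illustrated with a small sketch. It assumes the Q4_0 reference rule used by llama.cpp (blocks of 32 weights, scale = signed maximum / -8, codes 0..15) and a purely hypothetical QAT grid (symmetric, scale = amax / 7). The posts do not say which grid Google's checkpoints were trained on, and this is not Unsloth's recovery method; it only shows why weights that sit exactly on one 4-bit grid pick up error when re-rounded onto another.

```python
import random

BLOCK = 32  # llama.cpp groups Q4_0 weights in blocks of 32

def q4_0_roundtrip(block):
    """Quantize then dequantize one block with the Q4_0 reference rule (sketch)."""
    vmax = max(block, key=abs)          # largest-magnitude value, sign kept
    if vmax == 0.0:
        return [0.0] * len(block)
    d = vmax / -8.0                     # block scale (real files store it as fp16)
    # codes live in 0..15 and decode as (code - 8) * d
    return [(min(15, int(x / d + 8.5)) - 8) * d for x in block]

def assumed_qat_roundtrip(block):
    """Hypothetical symmetric 4-bit grid: integers -7..7 times amax / 7."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    s = amax / 7.0
    return [max(-7, min(7, round(x / s))) * s for x in block]

def max_abs_error(a, b):
    """Largest element-wise difference between two equally long lists."""
    return max(abs(x - y) for x, y in zip(a, b))

random.seed(0)
raw = [random.gauss(0.0, 1.0) for _ in range(BLOCK)]
trained = assumed_qat_roundtrip(raw)    # stand-in for weights already on the QAT grid

same_grid_error = max_abs_error(trained, assumed_qat_roundtrip(trained))
other_grid_error = max_abs_error(trained, q4_0_roundtrip(trained))
print(f"re-quantized on the same grid: {same_grid_error:.2e}")
print(f"re-quantized with Q4_0:        {other_grid_error:.2e}")
```

Re-quantizing on the grid the weights were trained for is essentially lossless, while the second grid introduces rounding error the model never saw during training.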
Google releases Gemma 4 QAT. ✨ You can now run Gemma 4 at 3x less memory with near original performance.
Quantization-Aware Training (QAT) makes it possible to run Gemma 4 26B-A4B on 16GB RAM.
GGUFs: https://huggingface.co/collections/unsloth/gemma-4-qat
QAT Guide: https://unsloth.ai/docs/models/gemma-4/qat
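A rough back-of-the-envelope check of the memory claim, as a sketch only. It assumes a nominal 26 billion parameters (taken from the model name, not an official count), BF16 at 2 bytes per weight, and the llama.cpp Q4_0 layout of 18 bytes per block of 32 weights. It ignores the KV cache, activations, and any tensors kept at higher precision, so real files will differ.

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store n_params weights at the given average bit width."""
    return n_params * bits_per_weight / 8

# Q4_0 block: 32 four-bit codes (16 bytes) plus one fp16 scale (2 bytes)
Q4_0_BITS = (16 + 2) * 8 / 32   # 4.5 bits per weight
BF16_BITS = 16

N_PARAMS = 26e9                 # assumption: nominal size implied by "26B"

bf16_gb = weight_bytes(N_PARAMS, BF16_BITS) / 1e9
q4_0_gb = weight_bytes(N_PARAMS, Q4_0_BITS) / 1e9

print(f"BF16 weights: {bf16_gb:.1f} GB")
print(f"Q4_0 weights: {q4_0_gb:.1f} GB")
print(f"ratio:        {bf16_gb / q4_0_gb:.2f}x")
```

Under these assumptions the Q4_0 weights land a little under 16GB, which is in line with the "3x less memory" and "16GB RAM" figures quoted above.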
Google DeepMind released new Gemma 4 QAT models that make the model family much more efficient for local, on-device use.
Using Quantization-Aware Training, the models are trained with compression in mind, which reduces memory needs while preserving more quality than standard post-training quantization. The release includes support for the popular Q4_0 format and a new mobile-specialized quantization format.
Gemma 4 E2B can now run with around 1GB of memory (!), and the text-only version can even require less than 1GB (!). That makes local AI on phones, laptops, edge devices, and consumer GPUs far more practical.
Really cool to see.
🎉 New Gemma 4 QAT checkpoints from @googlegemma, Quantization-Aware Training that shrinks memory while keeping quality. Day-0 support is now live in SGLang!
✅ Gemma 4 E2B down to 1GB with a mobile-specialized format
✅ QAT beats standard PTQ on quality at the same compression
✅ Q4_0 + MTP checkpoints keep the MTP speedup while quantized
Run it now with SGLang!
We just dropped Gemma 4 Quantization-Aware Training (QAT) checkpoints on Hugging Face!
All Gemma 4 model sizes and their drafters are now optimized with QAT to cut memory requirements and maximize on-device performance!

@googlegemma Thank you Google DeepMind for caring about local users and making it more efficient for us!
We made QAT GGUFs, which you can now run locally: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF
Google just made Gemma 4 much easier to run on phones and laptops by releasing QAT (Quantization-Aware Training) checkpoints that shrink the smallest model from 11.4GB to 1.1GB, or 0.84GB for text-only use.
Normal PTQ (Post-Training Quantization) compresses after training and can damage quality because the model never learned to survive that rounding.
QAT fixes this by simulating compression during training, so Gemma 4 learns while its weights are being squeezed, making the final compressed model less likely to lose reasoning quality.
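One common way to "simulate compression during training" is fake quantization with a straight-through estimator. The sketch below shows a single training step for a tiny linear model in plain Python. The 4-bit symmetric grid, the squared-error loss, and the names `fake_quant` and `qat_step` are illustrative assumptions, not Google's recipe, which the posts do not describe in detail.

```python
def fake_quant(weights, bits=4):
    """Round weights to a symmetric integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return list(weights)
    scale = amax / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def qat_step(weights, x, y, lr=0.01):
    """One SGD step on (w_q . x - y)^2, where w_q = fake_quant(weights)."""
    wq = fake_quant(weights)                        # forward pass sees rounded weights
    pred = sum(w * xi for w, xi in zip(wq, x))
    err = pred - y
    # Straight-through estimator: treat d(w_q)/d(w) as 1, so the gradient
    # computed for the rounded weights is applied to the full-precision ones.
    new_weights = [w - lr * 2.0 * err * xi for w, xi in zip(weights, x)]
    return new_weights, err ** 2

latent = [0.7, -0.3, 0.12]                          # full-precision "latent" weights
latent, loss = qat_step(latent, x=[1.0, 0.0, 0.0], y=0.0)
print("updated latent weights:", latent)
print("loss seen through the quantizer:", loss)
```

The loss is always measured through the quantizer, so the optimizer steers the latent weights toward values that still work after rounding. PTQ, by contrast, only applies something like `fake_quant` once, after training has finished.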
Google also built a mobile-focused format with static activations, channel-wise quantization, targeted 2-bit quantization, and KV cache optimization, which means the phone does less scaling work, stores some token-generation parts more aggressively, and keeps long chats from eating memory too fast.
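Of the techniques listed above, channel-wise quantization is the easiest to sketch. The example compares one shared scale for a whole weight matrix with one scale per row, treating each row as an output channel. The 4-bit symmetric grid, the row-as-channel layout, and the function names are assumptions for illustration; the actual mobile format is not specified in these posts.

```python
def quantize_row(row, scale, qmax):
    """Round one row to the integer grid defined by scale, then dequantize."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]

def per_tensor(matrix, bits=4):
    """One scale shared by every row of the matrix."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for row in matrix for v in row)
    scale = amax / qmax if amax > 0 else 1.0
    return [quantize_row(row, scale, qmax) for row in matrix]

def per_channel(matrix, bits=4):
    """A separate scale for each row (channel)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / qmax if amax > 0 else 1.0
        out.append(quantize_row(row, scale, qmax))
    return out

def row_error(a, b):
    """Largest absolute difference between two rows."""
    return max(abs(x - y) for x, y in zip(a, b))

# Two channels whose magnitudes differ by a factor of 100
weights = [[0.01, -0.02, 0.015],
           [1.0, -2.0, 1.5]]

pt = per_tensor(weights)
pc = per_channel(weights)
print("small channel, shared scale:", pt[0], "error", row_error(weights[0], pt[0]))
print("small channel, own scale:   ", pc[0], "error", row_error(weights[0], pc[0]))
```

With a shared scale the small channel is rounded entirely to zero, because the large channel dictates the step size. Giving each channel its own scale keeps the small values representable at the cost of storing one extra scale per channel.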

Read more in our blog: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/
Get started today! https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/
🚀 We just released Gemma 4 Quantization-Aware Training (QAT) model checkpoints.
What this means for developers:
🧠 Sharp performance retention
🗜️ Smaller memory footprints
⚡ Major efficiency boosts on mobile & laptops

⬇️Download & Integrate: Access the Q4_0 and mobile model weights right now on Hugging Face. Explore our documentation to learn how to best deploy the QAT checkpoints.
Source https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

💻 Frictionless Setup: Easily download, manage, and run the quantized models locally using user-friendly tools like Unsloth, llama.cpp, Ollama, LM Studio, vLLM, MLX, Hugging Face Transformers, or the LiteRT-LM runtime for optimized edge deployment.

⚖️ High-Quality Compression: Standard quantization can degrade performance. QAT bakes compression directly into the training process, shrinking model size while preserving the reasoning capabilities you expect from Gemma 4.
💎 Massive intelligence with @googlegemma 4, but tiny resource footprint! Take a gander at our QAT models:
Gemma 4 quantization-aware training (QAT) models are now available, bringing AI performance directly to edge devices and consumer GPUs. These checkpoints are optimized with quantization-aware training to dramatically reduce memory requirements and unlock high-speed local inference. 🧵

Ecosystem integrations are live today across popular developer tools including Hugging Face, llama.cpp, Ollama, MLX, LM Studio, NVIDIA, vLLM, Unsloth, and LiteRT-LM.

For mobile and edge hardware, a novel quantization schema maximizes efficiency through channel-wise quantization, targeted 2-bit decoding layers, and static activations. The text-only Gemma 4 E2B model requires less than 1 GB of memory.

https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

Instead of traditional post-training quantization, these models employ custom loss functions and targeted fine-tuning to minimize precision error. This approach delivers a massive reduction in disk and memory footprint while fully preserving the exceptional quality of Gemma 4.

Mobile-Optimized AI: Standard formats are hard for mobile processors to run efficiently. Our custom mobile-quantization schema lets edge hardware run calculations natively—enabling faster responses and efficient battery use.

Download the model weights on @HuggingFace: https://goo.gle/4foOxra
Read more in the blog: https://goo.gle/4vx8QYd

@AI_Andrew But wait, there's more!