Tech · 23h ago

Google DeepMind releases Gemma 4 QAT checkpoints, cutting VRAM requirements up to threefold with near-original quality

AI Judge changed title after evaluation, original title: "Daniel Han implements Unsloth dynamic GGUFs to recover accuracy lost when converting Google's new Gemma 4 QAT models to llama.cpp"

Unsloth AI released dynamic GGUFs to restore conversion accuracy.


Original post

Omar Sanseviero@osanseviero · #1004 in Tech

Introducing Gemma 4 QAT 🤏

- Quantization-aware training to reduce models' precision while preserving quality
- Introducing a new mobile quantization format that reduces memory footprint of E2B to 1GB
- Q4 for all your favorite libraries ✨

9:22 AM · Jun 5, 2026 · 53.2K Views

Sentiment

Many users praised Gemma 4 QAT releases for cutting memory use enough to run capable models on phones and edge devices, while a few dismissed the gains or aired unrelated grievances about Gemini.

Positive: 85.0%

Negative: 15.0%

153 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 27.4K · Bookmarks: 137

Daniel Han@danielhanchen

Gemma-4 QAT just dropped! We found if you naively convert from QAT Q4_0 BF16, you will lose accuracy since the conversion to llama.cpp has a different lattice.

Unsloth dynamic GGUFs recover most of it!

26B-A4B: 85.6% top-1 from 70.2% (+15.4 points)

31B: 96.7% from 87.9% (+8.8 points)
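The "different lattice" point in the post above can be illustrated with a toy experiment. The sketch below is only an assumption-laden illustration: both grids are hypothetical 4-bit block schemes invented for this example (`snap_grid_a` and `snap_grid_b` are made-up names), and neither is claimed to match Google's QAT recipe or llama.cpp's actual conversion code. It shows that weights already sitting on one grid survive a re-snap to that same grid, but pick up error when snapped to a grid with a different scale rule.

```python
import random

BLOCK = 32  # weights per quantization block (assumed for this sketch)

def snap_grid_a(block):
    """Hypothetical training-time grid: levels -7..7, scale = absmax / 7."""
    amax = max(abs(x) for x in block)
    scale = amax / 7 if amax else 1.0
    return [max(-7, min(7, round(x / scale))) * scale for x in block]

def snap_grid_b(block):
    """Hypothetical runtime grid: levels -8..7, scale = largest-magnitude value / -8."""
    peak = max(block, key=abs)
    scale = peak / -8 if peak else 1.0
    return [max(-8, min(7, round(x / scale))) * scale for x in block]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
raw = [random.gauss(0.0, 1.0) for _ in range(BLOCK)]
checkpoint = snap_grid_a(raw)  # stands in for QAT weights stored at higher precision

same_grid_error = mse(checkpoint, snap_grid_a(checkpoint))
other_grid_error = mse(checkpoint, snap_grid_b(checkpoint))
print(f"re-snap to same grid:  {same_grid_error:.3e}")
print(f"re-snap to other grid: {other_grid_error:.3e}")
```

In this toy setup the first error is zero up to floating-point rounding, while the second is clearly nonzero, which is the kind of loss a naive conversion could introduce.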

Unsloth AI@UnslothAI

Google releases Gemma 4 QAT. ✨ You can now run Gemma 4 at 3x less memory with near original performance.

Quantization-Aware Training (QAT) makes it possible to run Gemma 4 26B-A4B on 16GB RAM.

GGUFs: https://huggingface.co/collections/unsloth/gemma-4-qat

QAT Guide: https://unsloth.ai/docs/models/gemma-4/qat

22h ago · 27.4K views · 137 bookmarks

Likes: 411 · Retweets: 40 · Replies: 14
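As a rough sanity check on the "3x less memory" and "16GB RAM" figures, here is a back-of-the-envelope estimate of weight storage. It rests on assumptions not stated in the posts: a nominal 26 billion parameters read off the model name, every weight stored in the same format, and a Q4_0-style layout costing about 4.5 bits per weight (32 four-bit values plus one 16-bit scale per block). Real files differ because some tensors are kept at higher precision, and the KV cache and activations need memory too.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 26e9    # nominal count taken from the "26B" name (assumption)
BF16_BITS = 16.0   # bits per weight in BF16
Q4_0_BITS = 4.5    # (32 * 4 + 16) / 32 bits per weight for a Q4_0-style block

bf16_gb = weight_gb(N_PARAMS, BF16_BITS)
q4_gb = weight_gb(N_PARAMS, Q4_0_BITS)

print(f"BF16 weights: {bf16_gb:.1f} GB")
print(f"Q4_0 weights: {q4_gb:.1f} GB")
print(f"ratio:        {bf16_gb / q4_gb:.2f}x")
```

Under these assumptions the 4-bit weights come in below 16 GB and the ratio lands a little above 3x, which is at least consistent with the claims in the post.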

Chubby♨️@kimmonismus

Google DeepMind released new Gemma 4 QAT models that make the model family much more efficient for local, on-device use.

Using Quantization-Aware Training, the models are trained with compression in mind, which reduces memory needs while preserving more quality than standard post-training quantization. The release includes support for the popular Q4_0 format and a new mobile-specialized quantization format.

Gemma 4 E2B can now run with around 1GB of memory (!), and the text-only version can even require less than 1GB (!). That makes local AI on phones, laptops, edge devices, and consumer GPUs far more practical.

Really cool to see.

21h ago · 20.3K views

LMSYS Org@lmsysorg

🎉 New Gemma 4 QAT checkpoints from @googlegemma, Quantization-Aware Training that shrinks memory while keeping quality. Day-0 support is now live in SGLang!

✅ Gemma 4 E2B down to 1GB with a mobile-specialized format

✅ QAT beats standard PTQ on quality at the same compression

✅ Q4_0 + MTP checkpoints keep the MTP speedup while quantized

Run it now with SGLang!

Google Gemma@googlegemma

We just dropped Gemma 4 Quantization-Aware Training (QAT) checkpoints on Hugging Face!

All Gemma 4 model sizes and their drafters are now optimized with QAT to cut memory requirements and maximize on-device performance!

22h ago · 10.2K views

Unsloth AI@UnslothAI

@googlegemma Thank you Google DeepMind for caring about local users and making it more efficient for us!

We made QAT GGUFs which you can now run locally here: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF

23h ago · 2.6K views

Rohan Paul@rohanpaul_ai

Google just made Gemma 4 much easier to run on phones and laptops by releasing QAT (Quantization-Aware Training) checkpoints that shrink the smallest model from 11.4GB to 1.1GB, or 0.84GB for text-only use.

Normal PTQ (Post-Training Quantization) compresses after training and can damage quality because the model never learned to survive that rounding.

QAT fixes this by simulating compression during training, so Gemma 4 learns while its weights are being squeezed, making the final compressed model less likely to lose reasoning quality.
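The idea of "simulating compression during training" can be sketched with a single toy training step. This is a minimal illustration under stated assumptions, not Google's method: the posts mention only custom loss functions and targeted fine-tuning, while the sketch swaps in fake quantization with a straight-through estimator, a common QAT technique. The helper names (`fake_quant`, `ste_grad`) and all numbers are hypothetical.

```python
def fake_quant(w, scale, qmin=-8, qmax=7):
    """Forward pass: round w onto a uniform grid, as the quantized model would see it."""
    return max(qmin, min(qmax, round(w / scale))) * scale

def ste_grad(w, scale, upstream, qmin=-8, qmax=7):
    """Backward pass (straight-through estimator): treat rounding as identity,
    passing the gradient through unless w lies outside the clipping range."""
    inside = qmin * scale <= w <= qmax * scale
    return upstream if inside else 0.0

# One gradient step on the toy loss L = (fake_quant(w) * x - y) ** 2
w, x, y = 0.37, 2.0, 1.0   # latent full-precision weight, input, target
scale, lr = 0.1, 0.05      # grid spacing and learning rate (arbitrary)

w_q = fake_quant(w, scale)            # the loss sees the rounded weight
pred = w_q * x
grad_wq = 2 * (pred - y) * x          # dL/d(w_q)
w_new = w - lr * ste_grad(w, scale, grad_wq)

print(f"quantized weight used in forward pass: {w_q:.2f}")
print(f"latent weight after one step:          {w_new:.2f}")
```

The point of the design is that the loss is computed on the rounded weight while the update is applied to the full-precision latent weight, so training can steer weights toward values that still work after rounding.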

Google also built a mobile-focused format with static activations, channel-wise quantization, targeted 2-bit quantization, and KV cache optimization, which means the phone does less scaling work, stores some token-generation parts more aggressively, and keeps long chats from eating memory too fast.
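To make "channel-wise quantization" concrete, the sketch below compares one shared scale for a whole weight matrix against one scale per row. It is an assumption about what the term means in general (symmetric scales, one per output channel) with made-up numbers, and says nothing about the details of Google's mobile format.

```python
def quantize(values, scale, qmin=-7, qmax=7):
    """Round each value onto a uniform grid with the given spacing, then dequantize."""
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]

def sq_error(a, b):
    """Sum of squared differences between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

W = [
    [0.01, -0.02, 0.03],  # a channel with small weights
    [1.0, -2.0, 3.0],     # a channel with large weights
]

# Per-tensor: one scale shared by every channel, set by the global maximum.
tensor_scale = max(abs(v) for row in W for v in row) / 7
per_tensor = [quantize(row, tensor_scale) for row in W]

# Channel-wise: each row gets its own scale, set by that row's maximum.
per_channel = [quantize(row, max(abs(v) for v in row) / 7) for row in W]

small_err_tensor = sq_error(W[0], per_tensor[0])
small_err_channel = sq_error(W[0], per_channel[0])
print(f"small channel, shared scale:      {small_err_tensor:.2e}")
print(f"small channel, per-channel scale: {small_err_channel:.2e}")
```

With a shared scale the small channel collapses to zeros, while a per-channel scale keeps its values distinguishable at the cost of storing one extra scale per row.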

15h ago · 4.9K views

Google Gemma@googlegemma

Read more in our blog: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

23h ago · 6.4K views

Omar Sanseviero@osanseviero

Get started today! https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/


22h ago · 3.9K views

Olivier Lacombe@o_lacombe

🚀 We just released Gemma 4 Quantization-Aware Training (QAT) model checkpoints.

What this means for developers:

🧠 Sharp performance retention

🗜️ Smaller memory footprints

⚡ Major efficiency boosts on mobile & laptops

22h ago · 1.9K views

Google Gemma@googlegemma

⬇️Download & Integrate: Access the Q4_0 and mobile model weights right now on Hugging Face. Explore our documentation to learn how to best deploy the QAT checkpoints.

23h ago · 3.8K views

Chubby♨️@kimmonismus

Source https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/


21h ago · 4.8K views

Google Gemma@googlegemma

💻 Frictionless Setup: Easily download, manage, and run the quantized models locally using user-friendly tools like Unsloth, llama.cpp, Ollama, LM Studio, vLLM, MLX, Hugging Face Transformers, or LiteRT-LM runtime for optimized edge deployment.

23h ago · 2.4K views

Google Gemma@googlegemma

⚖️ High-Quality Compression: Standard quantization can degrade performance. QAT bakes compression directly into the training process, shrinking model size while preserving the reasoning capabilities you expect from Gemma 4.

23h ago · 3.8K views

👩‍💻 Paige Bailey@DynamicWebPaige

💎 Massive intelligence with @googlegemma 4, but tiny resource footprint! Take a gander at our QAT models:

Google for Developers@googledevs

Gemma 4 quantization-aware training (QAT) models are now available, bringing AI performance directly to edge devices and consumer GPUs. These checkpoints are optimized with quantization-aware training to dramatically reduce memory requirements and unlock high-speed local inference. 🧵

22h ago · 2.8K views

Google for Developers@googledevs

Ecosystem integrations are live today across popular developer tools including Hugging Face, llama.cpp, Ollama, MLX, LM Studio, NVIDIA, vLLM, Unsloth, and LiteRT-LM.

23h ago · 4.7K views

Google for Developers@googledevs

For mobile and edge hardware, a novel quantization schema maximizes efficiency through channel-wise quantization, targeted 2-bit decoding layers, and static activations. The text-only Gemma 4 E2B model requires less than 1 GB of memory.

23h ago · 3.2K views

Ian Ballantyne@IanBallantyne

https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

17h ago

Google for Developers@googledevs

Instead of traditional post-training quantization, these models employ custom loss functions and targeted fine-tuning to minimize precision error. This approach delivers a massive reduction in disk and memory footprint while fully preserving the exceptional quality of Gemma 4.

23h ago · 2.9K views

Google Gemma@googlegemma

Mobile-Optimized AI: Standard formats are hard for mobile processors to run efficiently. Our custom mobile-quantization schema lets edge hardware run calculations natively—enabling faster responses and efficient battery use.

23h ago · 3.4K views

Google for Developers@googledevs

Download the model weights on @HuggingFace: https://goo.gle/4foOxra

Read more in the blog: https://goo.gle/4vx8QYd

22h ago · 2.7K views

Omar Sanseviero@osanseviero

@AI_Andrew But wait, there's more!

22h ago