Tech · 6h ago

Google DeepMind releases quantized Gemma 4 models, using quantization-aware training to compress the smallest variant to 0.84 GB

The checkpoints shrink an 11.4 GB model for mobile.

Original post

Google just made Gemma 4 much easier to run on phones and laptops by releasing QAT (Quantization-Aware Training) checkpoints that shrink the smallest model from 11.4GB to 1.1GB, or 0.84GB for text-only use.
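As a quick check on the sizes quoted above, the compression ratios can be computed directly. This is only arithmetic on the figures stated in the post; the variable names are illustrative.

```python
# Sizes in GB, as quoted in the post.
original_gb = 11.4
qat_gb = 1.1
qat_text_only_gb = 0.84

ratio_full = original_gb / qat_gb
ratio_text_only = original_gb / qat_text_only_gb

print(f"full checkpoint: {ratio_full:.1f}x smaller")
print(f"text-only:       {ratio_text_only:.1f}x smaller")
```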

Normal PTQ (Post-Training Quantization) compresses the model after training and can damage quality because the model never learned to survive that rounding.

QAT fixes this by simulating compression during training, so Gemma 4 learns while its weights are being squeezed, making the final compressed model less likely to lose reasoning quality.
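A minimal sketch of what "simulating compression during training" can look like, assuming a symmetric integer grid and a straight-through estimator (a common QAT choice, not confirmed here as Google's recipe). The toy linear model and the names `fake_quantize` and `qat_step` are hypothetical and exist only for illustration.

```python
def fake_quantize(weights, bits=4):
    """Round floats onto a symmetric signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # round, then clamp to the grid
        out.append(q * scale)                        # dequantize back to float
    return out


def qat_step(weights, x, y, lr=0.05, bits=4):
    """One SGD step on squared error for a toy linear model y ~ w . x."""
    wq = fake_quantize(weights, bits)                # forward pass sees rounded weights
    err = sum(w * xi for w, xi in zip(wq, x)) - y
    # Straight-through estimator (assumed): treat rounding as identity in the
    # backward pass, so the gradient for the rounded weights updates the floats.
    return [w - lr * 2 * err * xi for w, xi in zip(weights, x)]


print(fake_quantize([0.7, -0.33, 0.12], bits=4))
print(qat_step([0.0, 0.0], x=[1.0, 2.0], y=1.0))
```

PTQ would apply `fake_quantize` once, after training has finished. In this sketch the rounding sits inside every forward pass, so the loss being minimized already includes the rounding error.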

Google also built a mobile-focused format with static activations, channel-wise quantization, targeted 2-bit quantization, and KV cache optimization, which means the phone does less scaling work, stores some token-generation parts more aggressively, and keeps long chats from eating memory too fast.

4:34 PM · Jun 5, 2026 · 6.3K Views

Rohan Paul@rohanpaul_ai

Sentiment

Commenters praised Google's quantized Gemma 4 models, and the QAT checkpoints that shrink the smallest one to 1.1GB, as a major engineering win that enables efficient local inference on low-resource devices such as the Raspberry Pi.

Pos: 100.0%

Neg: 0.0%

4 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 1.6K · Bookmarks: 1 · Likes: 6

Rohan Paul@rohanpaul_ai

https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

Replies: 1

Phillip Lanos@philliplanos

@rohanpaul_ai One thing I’m curious about is how we can get memory to persist on these local models across multiple chats. From what I understand, it handles just the one chat and then resets.

utku@utkuevci

8/ 🙏 Huge thanks to @lucianommartins, Sara Smoot, and @osanseviero (among many) for landing this across the OSS ecosystem. Thanks also to our friends from @RedHat, @mgoin_ and @_EldarKurtic, for making these models available at @vllm_project. More to come! 🚀

utku@utkuevci

🎯 To maximize Gemma 4's impact, we designed a simple quantization format minimizing hardware overhead, focusing on two main pillars: 1️⃣ Basic Recipe: Mixed Precision (wNa8o8) 2️⃣ Targeted Compression: 2-bit MLPs & embeddings, 4/8-bit static KV cache.

utku@utkuevci

🛠️ 1️⃣ The Basic Recipe: Mixed Precision (wNa8o8) Instead of small block sizes, we apply 2, 4, or 8-bit symmetric channel-wise quant to the weights. For activations, we use 8-bit static quantization for quantized matmuls and KV-cache to avoid dynamic grid overhead.
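The recipe in this post can be sketched as follows: one scale per output channel for the weights, and a fixed, precomputed scale for the activations. This is a pure-Python illustration under stated assumptions, not the released format. The function names, the symmetric grid `[-qmax, qmax]`, and the activation scale of `1/127` are all hypothetical.

```python
def quantize_per_channel(weights, bits):
    """Symmetric quantization with one scale per row (one row = one output channel)."""
    qmax = 2 ** (bits - 1) - 1
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q_rows.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def quantize_static(x, scale, bits=8):
    """Quantize activations with a scale fixed ahead of time (no runtime statistics)."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in x]


def quantized_matvec(q_rows, w_scales, q_x, x_scale):
    """Integer dot products, rescaled once per output channel."""
    out = []
    for row, w_scale in zip(q_rows, w_scales):
        acc = sum(a * b for a, b in zip(row, q_x))   # integer accumulation
        out.append(acc * w_scale * x_scale)
    return out


weights = [[1.0, -0.25], [0.02, 0.005]]              # toy 2x2 layer
q_rows, w_scales = quantize_per_channel(weights, bits=8)
X_SCALE = 1.0 / 127                                  # assumed calibrated offline
q_x = quantize_static([0.4, -1.0], X_SCALE)
print(quantized_matvec(q_rows, w_scales, q_x, X_SCALE))
```

The second row has much smaller weights than the first. With a single scale for the whole matrix it would collapse onto a few grid points, whereas a per-channel scale lets it use the full integer range.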

utku@utkuevci

7/ 🗃️ (c) Hybrid KV Cache: We use a dual approach (4-bit for global layers, 8-bit for local). The KV size impact becomes significant mainly with longer contexts, so quantizing global KV-caches provides the absolute best trade-off here.
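A rough size estimate shows why the global layers dominate at long context. Every number below (layer counts, heads, head size, context length, window size) is a made-up configuration, not Gemma 4's. The sketch also assumes that local layers cache only a sliding window of tokens, and it ignores storage for the scales.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, cached_tokens, bits):
    """Bytes for keys plus values across n_layers, ignoring scale metadata."""
    elements = 2 * n_layers * n_kv_heads * head_dim * cached_tokens  # 2 = K and V
    return elements * bits // 8


# Hypothetical configuration, chosen only for illustration.
GLOBAL_LAYERS, LOCAL_LAYERS = 4, 20
KV_HEADS, HEAD_DIM = 2, 128
CONTEXT, WINDOW = 32768, 1024    # local layers assumed to keep a sliding window

MIB = 1024 * 1024

baseline = (kv_cache_bytes(GLOBAL_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 16)
            + kv_cache_bytes(LOCAL_LAYERS, KV_HEADS, HEAD_DIM, WINDOW, 16))
hybrid = (kv_cache_bytes(GLOBAL_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 4)
          + kv_cache_bytes(LOCAL_LAYERS, KV_HEADS, HEAD_DIM, WINDOW, 8))

print(f"16-bit cache:               {baseline / MIB:.0f} MiB")
print(f"4-bit global + 8-bit local: {hybrid / MIB:.0f} MiB")
```

In this toy setup the few global layers hold most of the cache because they grow with the full context, so cutting them to 4 bits accounts for most of the saving.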

utku@utkuevci

5/ 🔬 2️⃣ Targeted Compression for maximum efficiency.

(a) 🧠 Targeted 2-Bit Decoding: Low bit-width matmuls are best for memory-bound ops. Therefore, we apply 2-bit quant to decode-only layers (with double MLP width)! It maintains iso-latency while improving quality.
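One plausible reading of "iso-latency" is sketched below; the post does not state the baseline, so this is an assumption. If decoding is memory-bound, time per token tracks the bytes of weights read, and a 2-bit MLP at double width reads the same number of bytes as a 4-bit MLP at the original width. The gated three-matrix MLP shape, the dimensions, and the bandwidth figure are hypothetical.

```python
def mlp_weight_bytes(d_model, d_ff, bits):
    """Weight bytes for a gated MLP with three d_model x d_ff matrices (assumed shape)."""
    return 3 * d_model * d_ff * bits // 8


def decode_read_ms(weight_bytes, bandwidth_gb_per_s):
    """Lower bound on the time to stream the weights once, in milliseconds."""
    return weight_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


D_MODEL, D_FF = 1024, 4096                     # hypothetical layer sizes
four_bit = mlp_weight_bytes(D_MODEL, D_FF, bits=4)
two_bit_double = mlp_weight_bytes(D_MODEL, 2 * D_FF, bits=2)

print(four_bit, two_bit_double)
print(f"{decode_read_ms(two_bit_double, bandwidth_gb_per_s=50):.3f} ms per layer")
```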

utku@utkuevci

🗜️ (b) 2-Bit Embeddings: We compress embedding tables to 2-bit. Since these parameters occupy a significant fraction of small models, this offers a highly favorable quality-to-latency ratio.
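To make the storage side concrete, here is a sketch of packing 2-bit codes four to a byte, plus a size comparison. The bit layout (low bits first) and the vocabulary and width figures are assumptions for illustration; the released format may pack differently, and the per-row scales are left out.

```python
def pack_2bit(codes):
    """Pack integer codes in 0..3 into bytes, four codes per byte, low bits first."""
    padded = list(codes) + [0] * (-len(codes) % 4)   # pad to a multiple of four
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def unpack_2bit(packed, count):
    """Recover the first `count` codes from packed bytes."""
    codes = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            codes.append((byte >> shift) & 0b11)
    return codes[:count]


def table_bytes(vocab_size, dim, bits):
    """Storage for a vocab_size x dim embedding table, ignoring scales."""
    return vocab_size * dim * bits // 8


VOCAB, DIM = 262144, 1536                            # hypothetical table shape
print(table_bytes(VOCAB, DIM, 16) // 2**20, "MiB at 16-bit")
print(table_bytes(VOCAB, DIM, 2) // 2**20, "MiB at 2-bit")
```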

utku@utkuevci

4/ ⚡ (Funny enough, static activation quant is less common than dynamic—likely because it's hard to get right! We hope to add/see more support for this format in popular libraries in the near future).
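A small sketch of the trade-off behind that remark, assuming symmetric 8-bit quantization. A dynamic scheme measures each tensor at run time, while a static scheme fixes the scale from calibration data and clips anything outside it. The helper names and the calibration values are invented for illustration.

```python
QMAX = 127  # symmetric signed 8-bit grid


def quantize(x, scale):
    """Round to the integer grid and clamp to [-QMAX, QMAX]."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in x]


def dynamic_scale(x):
    """Scale computed from the live tensor: one extra reduction per call."""
    max_abs = max(abs(v) for v in x)
    return max_abs / QMAX if max_abs > 0 else 1.0


def calibrate_static_scale(samples):
    """Scale fixed offline from calibration samples; nothing to compute at inference."""
    max_abs = max(abs(v) for sample in samples for v in sample)
    return max_abs / QMAX if max_abs > 0 else 1.0


static_scale = calibrate_static_scale([[0.5, -1.0], [0.2, 0.8]])  # made-up data
x = [2.0, 0.4]                       # larger than anything seen in calibration

q_static = quantize(x, static_scale)
q_dynamic = quantize(x, dynamic_scale(x))
print(q_static, [q * static_scale for q in q_static])    # 2.0 is clipped
print(q_dynamic, [q * dynamic_scale(x) for q in q_dynamic])
```

The static path skips the per-call reduction but clips the out-of-range value, which is one reason the calibration has to be done carefully.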

Maya N@mayasolos

@rohanpaul_ai Huge win for local inference. I really hope to see this replicated for Llama soon.

Shinka - AI@ShinkaIoT

@rohanpaul_ai Shrinking Gemma 4 with QAT is proper engineering; the race to efficient on-device AI is heating up. 🔥

Winston B.@DoDataThings

@rohanpaul_ai Very nice. The jump from 11.4GB to 1.1GB means Gemma 4 clears the RAM budget on a Pi without swapping, which changes what you can actually schedule in an agent loop.

DesignCntrl Inc. / Destrozado@DesignCntrl

@philliplanos @rohanpaul_ai That's what everyone learned 4 years ago when GPT-3 was released. AI is dumb on its own. It can do nothing useful. You have to build a harness to make it read the chat history, use tools, or do anything practical. It is just a dumb text file otherwise.
