
Google releases Gemma 4 quantization-aware training checkpoints, shrinking its smallest model from 11.4 GB to 1.1 GB (0.84 GB text-only)

The release optimizes Gemma 4 models for mobile devices.

Original post
utku@utkuevci

🚀 Gemma official quantized models are out! Alongside the familiar q4_0 formats for all sizes, we're releasing something special: a "mobile quantization" recipe for the e2b and e4b models, designed for the best quality/latency tradeoff. 👇 Link: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

12:26 PM · Jun 5, 2026 · 909 Views
Sentiment

Many users praise Google's QAT-quantized Gemma models shrinking to 1.1GB because the size reduction enables practical local inference on low-RAM devices like a Raspberry Pi.

Positive: 100.0% · Negative: 0.0% (4 comments with sentiment)
Cluster Engagement
Posts from X
Most Activity
Rohan Paul@rohanpaul_ai

https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

1d · Views 1.6K · Likes 6 · Bookmarks 1
Retweets: 8
Rohan Paul@rohanpaul_ai

Google just made Gemma 4 much easier to run on phones and laptops by releasing QAT (Quantization-Aware Training) checkpoints that shrink the smallest model from 11.4GB to 1.1GB, or 0.84GB for text-only use.

Normal PTQ (Post-Training Quantization) compresses after training and can damage quality because the model never learned to survive that rounding.

QAT fixes this by simulating compression during training, so Gemma 4 learns while its weights are being squeezed, making the final compressed model less likely to lose reasoning quality.

Google also built a mobile-focused format with static activations, channel-wise quantization, targeted 2-bit quantization, and KV cache optimization, which means the phone does less scaling work, stores some token-generation parts more aggressively, and keeps long chats from eating memory too fast.

1d · Views 5.9K · Likes 103 · Bookmarks 39
Replies: 1
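The QAT idea in the post above (the forward pass sees rounded weights while training continues) can be sketched in a few lines. This is a toy illustration, not Google's training recipe: the straight-through estimator (STE) used for the gradient is a common choice and an assumption here, and the weights, target, scale, and learning rate are made-up values.

```python
def fake_quant(w, scale, bits=4):
    """Quantize then dequantize: snap w to a symmetric integer grid."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


def qat_step(weights, x, target, scale, lr=0.1, bits=4):
    """One SGD step on loss = (sum(fake_quant(w_i) * x_i) - target) ** 2."""
    w_q = [fake_quant(w, scale, bits) for w in weights]
    err = sum(wq * xi for wq, xi in zip(w_q, x)) - target
    # STE (assumption): treat d fake_quant / d w as 1, so the gradient for
    # each latent float weight is simply 2 * err * x_i.
    new_weights = [w - lr * 2.0 * err * xi for w, xi in zip(weights, x)]
    return new_weights, err ** 2


# Hypothetical toy problem: two latent float weights, grid step 0.25.
weights, x, target, scale = [0.30, 0.35], [1.0, 1.0], 0.75, 0.25

losses = []
for _ in range(3):
    weights, loss = qat_step(weights, x, target, scale)
    losses.append(loss)

final_q = [fake_quant(w, scale) for w in weights]
print(losses, final_q)
```

In this toy case the quantized forward pass initially misses the target, and the update moves the latent weights until their rounded values fit it. Real QAT runs the same loop over a full network and dataset.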
Phillip Lanos@philliplanos

@rohanpaul_ai One thing I'm curious about is how we can get memory to persist on these local models across multiple chats. From what I understand, it handles just the one chat and then resets.

23h · Views 221
utku@utkuevci

8/ 🙏 Huge thanks to @lucianommartins, Sara Smoot, and @osanseviero (among many) for landing this across the OSS ecosystem. Also to our friends from @RedHat, @mgoin_ and @_EldarKurtic, for making these models available at @vllm_project. More to come! 🚀

1d · Views 67 · Likes 5 · Bookmarks 1
utku@utkuevci

🎯 To maximize Gemma 4's impact, we designed a simple quantization format minimizing hardware overhead, focusing on two main pillars: 1️⃣ Basic Recipe: Mixed Precision (wNa8o8) 2️⃣ Targeted Compression: 2-bit MLPs & embeddings, 4/8-bit static KV cache.

1d · Views 233
utku@utkuevci

🛠️ 1️⃣ The Basic Recipe: Mixed Precision (wNa8o8) Instead of small block sizes, we apply 2, 4, or 8-bit symmetric channel-wise quant to the weights. For activations, we use 8-bit static quantization for quantized matmuls and KV-cache to avoid dynamic grid overhead.

1d · Views 69 · Likes 1
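A rough sketch of the two ingredients named in the post above: symmetric channel-wise weight quantization (one scale per output row, no zero point) and static 8-bit activation quantization (a scale fixed ahead of time rather than recomputed per input). The function names, the calibration-by-max rule, and all numbers are illustrative assumptions. The actual Gemma 4 format may differ in details.

```python
def clamp_round(value, qmax):
    return max(-qmax, min(qmax, round(value)))


def quantize_channelwise(weight_rows, bits):
    """Symmetric quantization with one scale per output channel (row)."""
    qmax = 2 ** (bits - 1) - 1
    q_rows, scales = [], []
    for row in weight_rows:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q_rows.append([clamp_round(w / scale, qmax) for w in row])
        scales.append(scale)
    return q_rows, scales


def calibrate_static_scale(calibration_batches, bits=8):
    """Static: pick the activation scale once, offline (max rule assumed)."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(a) for batch in calibration_batches for a in batch) / qmax


def quantized_matvec(q_rows, w_scales, acts, act_scale, bits=8):
    """Integer dot products, rescaled once per output channel."""
    qmax = 2 ** (bits - 1) - 1
    q_acts = [clamp_round(a / act_scale, qmax) for a in acts]
    return [
        sum(qw * qa for qw, qa in zip(row, q_acts)) * ws * act_scale
        for row, ws in zip(q_rows, w_scales)
    ]


# Hypothetical 2x2 layer whose second row holds much smaller weights.
W = [[0.4, -1.0], [0.02, 0.005]]
q_rows, w_scales = quantize_channelwise(W, bits=4)

# A dynamic scheme would compute max(abs(a)) for every input at runtime;
# the static scale below is computed once from calibration data.
act_scale = calibrate_static_scale([[0.5, -2.54], [1.27, 0.3]])
out = quantized_matvec(q_rows, w_scales, [1.0, -0.5], act_scale)
print(q_rows, out)
```

With a single per-tensor scale of 1/7, the second row would round to all zeros. Giving each row its own scale keeps its small weights representable.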
utku@utkuevci

7/ 🗃️ (c) Hybrid KV Cache: We use a dual approach (4-bit for global layers, 8-bit for local). The KV size impact becomes significant mainly with longer contexts, so quantizing global KV-caches provides the absolute best trade-off here.

1d · Views 47
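The reasoning behind the hybrid KV cache in the post above can be checked with back-of-the-envelope arithmetic. The sketch below assumes that local layers use a sliding window, so their cache stops growing once the window is full, while global layers cache every token. The layer counts, head sizes, window, and context length are hypothetical and do not describe the real Gemma 4 configuration.

```python
def kv_cache_bytes(num_layers, kv_heads, head_dim, cached_tokens, bits):
    """Bytes for keys and values (hence the factor of 2) across layers."""
    return 2 * num_layers * kv_heads * head_dim * cached_tokens * bits // 8


# Hypothetical model shape (assumption, for illustration only).
GLOBAL_LAYERS, LOCAL_LAYERS = 6, 30
KV_HEADS, HEAD_DIM = 2, 256
WINDOW, CONTEXT = 1024, 32768

# Global layers cache the whole context; local layers cap at the window.
global_fp16 = kv_cache_bytes(GLOBAL_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 16)
global_4bit = kv_cache_bytes(GLOBAL_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 4)
local_tokens = min(CONTEXT, WINDOW)
local_fp16 = kv_cache_bytes(LOCAL_LAYERS, KV_HEADS, HEAD_DIM, local_tokens, 16)
local_8bit = kv_cache_bytes(LOCAL_LAYERS, KV_HEADS, HEAD_DIM, local_tokens, 8)

MIB = 1024 * 1024
print(f"global: {global_fp16 / MIB:.0f} MiB -> {global_4bit / MIB:.0f} MiB")
print(f"local:  {local_fp16 / MIB:.0f} MiB -> {local_8bit / MIB:.0f} MiB")
```

Under these assumptions, the few global layers dominate the cache at long context even though the local layers outnumber them. That is consistent with spending the more aggressive 4-bit setting on the global layers.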
utku@utkuevci

5/ 🔬 2️⃣ Targeted Compression for maximum efficiency.

(a) 🧠 Targeted 2-Bit Decoding: Low bit-width matmuls are best for memory-bound ops. Therefore, we apply 2-bit quant to decode-only layers (with double MLP width)! It maintains iso-latency while improving quality.

1d · Views 37
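One way to read the iso-latency claim in the post above: during decoding, a memory-bound matmul costs roughly as much as the weight bytes it has to stream, so halving the bit width leaves room to double the MLP width at the same cost. The arithmetic below works through that reading. The 4-bit baseline, the three-matrix gated MLP, and the layer sizes are assumptions for illustration and are not taken from the source.

```python
def mlp_weight_bytes(hidden_dim, mlp_width, bits, num_matrices=3):
    """Weight bytes streamed per decoded token for one MLP block.

    num_matrices=3 assumes a gated MLP (gate, up, and down projections).
    """
    return num_matrices * hidden_dim * mlp_width * bits // 8


# Hypothetical layer sizes.
HIDDEN, WIDTH = 1536, 6144

baseline_4bit = mlp_weight_bytes(HIDDEN, WIDTH, bits=4)
doubled_2bit = mlp_weight_bytes(HIDDEN, 2 * WIDTH, bits=2)

print(baseline_4bit, doubled_2bit)
```

Both configurations stream the same number of bytes per token, so a memory-bound decode step should take about the same time while the wider layer has twice as many parameters to work with.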
utku@utkuevci

🗜️ (b) 2-Bit Embeddings: We compress embedding tables to 2-bit. Since these parameters occupy a significant fraction of small models, this offers a highly favorable quality-to-latency ratio.

1d · Views 24
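To make the 2-bit embedding idea concrete, here is a minimal sketch of packing 2-bit codes four to a byte, plus the resulting table size. The bit layout (lowest bits first), the helper names, and the vocabulary and embedding sizes are hypothetical. The post does not specify the real storage format.

```python
def pack_2bit(codes):
    """Pack 2-bit codes (ints 0..3) four per byte, lowest bits first."""
    if len(codes) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            if not 0 <= code <= 3:
                raise ValueError("codes must fit in 2 bits")
            byte |= code << (2 * j)
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed):
    """Inverse of pack_2bit."""
    return [(byte >> (2 * j)) & 0b11 for byte in packed for j in range(4)]


def table_bytes(vocab_size, embed_dim, bits):
    return vocab_size * embed_dim * bits // 8


codes = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_2bit(codes)
restored = unpack_2bit(packed)

# Hypothetical table shape: 262,144 tokens x 1,536 dimensions.
size_16bit = table_bytes(262144, 1536, 16)
size_2bit = table_bytes(262144, 1536, 2)
print(packed.hex(), size_16bit // size_2bit)
```

A real format would also store scales to map codes back to float values. The sketch covers only the storage saving, which is a factor of 8 relative to 16-bit entries.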
utku@utkuevci

4/ ⚡ (Funny enough, static activation quant is less common than dynamic—likely because it's hard to get right! We hope to add/see more support for this format in popular libraries in the near future).

1d · Views 22
Maya N@mayasolos

@rohanpaul_ai Huge win for local inference. I really hope to see this replicated for llama soon.

1d · Views 226 · Likes 1
Shinka - AI@ShinkaIoT

@rohanpaul_ai Shrinking Gemma 4 with QAT is proper engineering; the race to efficient on-device AI is heating up. 🔥

23h · Views 148 · Likes 1
Winston B.@DoDataThings

@rohanpaul_ai Very nice. The jump from 11.4GB to 1.1GB means Gemma 4 clears the RAM budget on a Pi without swapping, which changes what you can actually schedule in an agent loop.

9h · Views 50

@philliplanos @rohanpaul_ai That's what everyone learned 4 years ago when GPT-3 was released. AI is dumb on its own. It can do nothing useful. You have to build a harness to make it read the chat history, use tools, or do anything practical. It is just a dumb text file otherwise.

23h · Views 19