🚀 Gemma official quantized models are out! Alongside the familiar q4_0 formats for all sizes, we're releasing something special: a "mobile quantization" recipe for the e2b and e4b models, designed for the best quality/latency tradeoff. 👇 Link: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/
Google releases Gemma 4 quantization-aware training checkpoints, shrinking its smallest model from 11.4 GB to 1.1 GB (0.84 GB for text-only use)
The release optimizes Gemma 4 models for mobile devices.
Many users praise Google's QAT-quantized Gemma models shrinking to 1.1GB because the size reduction enables practical local inference on low-RAM devices like a Raspberry Pi.

https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/
Google just made Gemma 4 much easier to run on phones and laptops by releasing QAT (Quantization-Aware Training) checkpoints that shrink the smallest model from 11.4GB to 1.1GB, or 0.84GB for text-only use.
Normal PTQ (Post-Training Quantization) compresses after training and can damage quality because the model never learned to survive that rounding.
QAT fixes this by simulating compression during training, so Gemma 4 learns while its weights are being squeezed, making the final compressed model less likely to lose reasoning quality.
Google also built a mobile-focused format with static activations, channel-wise quantization, targeted 2-bit quantization, and KV cache optimization, which means the phone does less scaling work, stores some token-generation parts more aggressively, and keeps long chats from eating memory too fast.
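The PTQ versus QAT distinction described above can be sketched in a few lines. This is a minimal illustration, not Google's training recipe: `fake_quantize` and `qat_step` are hypothetical names, the single per-tensor scale and the squared-error loss are assumptions, and the gradient uses the common straight-through estimator (rounding is treated as identity in the backward pass).

```python
def fake_quantize(weights, bits=4):
    """Quantize then dequantize with one symmetric scale (bits >= 2)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Snap each weight to the nearest grid point, then map back to float.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def qat_step(latent, x, y_target, bits=4, lr=0.1):
    """One SGD step where the forward pass sees quantized weights.

    The float "latent" weights are what gets updated; the gradient with
    respect to the quantized weights is passed straight through to them.
    """
    wq = fake_quantize(latent, bits)
    y = sum(w * xi for w, xi in zip(wq, x))
    err = y - y_target
    # d(err^2)/d(wq_i) = 2 * err * x_i, applied directly to the latent weights.
    return [w - lr * 2 * err * xi for w, xi in zip(latent, x)]


latent = [0.9, -0.31, 0.05, 0.47]

# PTQ: round once after training. The small weight 0.05 collapses to 0.
ptq = fake_quantize(latent)

# QAT: the loss already sees that collapse, so the update pushes back on it.
updated = qat_step(latent, x=[0.0, 0.0, 1.0, 0.0], y_target=0.05)
```

With PTQ the rounding error is simply accepted; with QAT the error shows up in the loss during training, so the latent weights can move to where rounding hurts less.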

@rohanpaul_ai One thing I’m curious about is how we can get memory to persist on these local models across multiple chats. From what I understand, it just does the one chat and then resets.

8/ 🙏 Huge thanks to @lucianommartins , Sara Smoot, @osanseviero (among many) for landing this across the OSS ecosystem. Also our friends from @RedHat, @mgoin_ and @_EldarKurtic , for making these models available at @vllm_project. More to come! 🚀

🎯 To maximize Gemma 4's impact, we designed a simple quantization format minimizing hardware overhead, focusing on two main pillars: 1️⃣ Basic Recipe: Mixed Precision (wNa8o8) 2️⃣ Targeted Compression: 2-bit MLPs & embeddings, 4/8-bit static KV cache.

🛠️ 1️⃣ The Basic Recipe: Mixed Precision (wNa8o8) Instead of small block sizes, we apply 2, 4, or 8-bit symmetric channel-wise quant to the weights. For activations, we use 8-bit static quantization for quantized matmuls and KV-cache to avoid dynamic grid overhead.
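A rough sketch of what "symmetric channel-wise weights plus static 8-bit activations" could look like in practice. Everything here is illustrative: the function names are hypothetical, the activation scale `0.02` stands in for a value that would come from offline calibration, and the 8-bit output step implied by "o8" is left out.

```python
def quantize_per_channel(weight_rows, bits):
    """Symmetric quantization with one scale per output channel (row)."""
    qmax = 2 ** (bits - 1) - 1
    q_rows, scales = [], []
    for row in weight_rows:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q_rows.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def quantize_static(values, scale, bits=8):
    """Quantize activations with a scale fixed ahead of time.

    Nothing is measured at inference time, which is the point of "static".
    """
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def quantized_matvec(q_rows, w_scales, q_x, x_scale):
    """Integer dot products, then a single rescale per output channel."""
    return [
        sum(qw * qx for qw, qx in zip(row, q_x)) * w_scale * x_scale
        for row, w_scale in zip(q_rows, w_scales)
    ]


weights = [[0.4, -1.0], [0.02, 0.012]]   # second row is much smaller
q_rows, w_scales = quantize_per_channel(weights, bits=4)

X_SCALE = 0.02                            # assumed calibration result
q_x = quantize_static([1.0, 2.0], X_SCALE)
y = quantized_matvec(q_rows, w_scales, q_x, X_SCALE)
```

Because each row gets its own scale, the small second row still uses the full 4-bit range; a single tensor-wide scale of 1/7 would have rounded both of its weights to zero. The rescale is one multiply per output channel rather than one per small block.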

7/ 🗃️ (c) Hybrid KV Cache: We use a dual approach (4-bit for global layers, 8-bit for local). The KV size impact becomes significant mainly with longer contexts, so quantizing global KV-caches provides the absolute best trade-off here.
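Some back-of-the-envelope arithmetic for why the global layers are the ones worth squeezing hardest. The layer counts, head sizes, and the 512-token sliding window below are made-up numbers, and the sketch assumes local layers only cache a fixed window while global layers cache the whole context.

```python
def kv_cache_bytes(num_layers, cached_tokens, kv_heads, head_dim, bits):
    """Bytes for keys and values across layers (the factor 2 is K plus V)."""
    elements = 2 * num_layers * cached_tokens * kv_heads * head_dim
    return elements * bits // 8


# Hypothetical model shape, chosen only to make the arithmetic concrete.
CONTEXT = 32768
WINDOW = 512
KV_HEADS, HEAD_DIM = 2, 256
GLOBAL_LAYERS, LOCAL_LAYERS = 5, 25

# Global layers grow with the full context length.
global_16bit = kv_cache_bytes(GLOBAL_LAYERS, CONTEXT, KV_HEADS, HEAD_DIM, 16)
global_4bit = kv_cache_bytes(GLOBAL_LAYERS, CONTEXT, KV_HEADS, HEAD_DIM, 4)

# Local layers stop growing once the window is full.
local_8bit = kv_cache_bytes(
    LOCAL_LAYERS, min(CONTEXT, WINDOW), KV_HEADS, HEAD_DIM, 8
)

MIB = 1024 * 1024
print(f"global @16-bit: {global_16bit / MIB:.1f} MiB")
print(f"global @4-bit:  {global_4bit / MIB:.1f} MiB")
print(f"local  @8-bit:  {local_8bit / MIB:.1f} MiB")
```

Under these assumptions the five global layers dominate the cache at long context even though they are outnumbered five to one, so dropping them to 4-bit saves far more than squeezing the local layers further would.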

5/ 🔬 2️⃣ Targeted Compression for maximum efficiency.
(a) 🧠 Targeted 2-Bit Decoding: Low bit-width matmuls are best for memory-bound ops. Therefore, we apply 2-bit quant to decode-only layers (with double MLP width)! It maintains iso-latency while improving quality.
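One plausible reading of the "iso-latency" claim, sketched as arithmetic. The assumptions are loud here: decode time is taken to be dominated by how many weight bytes are read per token, the comparison baseline is taken to be a 4-bit MLP (the thread does not say), the MLP is taken to have three weight matrices, and the dimensions are invented.

```python
def mlp_weight_bytes(d_model, d_ff, bits, num_matrices=3):
    """Bytes of MLP weights read per decoded token (memory-bound view)."""
    return num_matrices * d_model * d_ff * bits // 8


# Hypothetical dimensions.
D_MODEL, D_FF = 1536, 6144

baseline_4bit = mlp_weight_bytes(D_MODEL, D_FF, bits=4)
wide_2bit = mlp_weight_bytes(D_MODEL, 2 * D_FF, bits=2)

print(baseline_4bit, wide_2bit)
```

Halving the bits while doubling the width leaves the bytes moved per token unchanged, so under this memory-bound assumption latency stays flat while the layer gets twice as many (coarser) parameters to work with.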

🗜️ (b) 2-Bit Embeddings: We compress embedding tables to 2-bit. Since these parameters occupy a significant fraction of small models, this offers a highly favorable quality-to-latency ratio.
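To make the 2-bit embedding idea concrete, here is a small sketch of the storage side only: packing four 2-bit codes into each byte, plus the resulting table size. The vocabulary size and embedding width are hypothetical, the bit layout is an arbitrary choice, and the per-channel scales needed to turn codes back into floats are not shown.

```python
def pack_2bit(codes):
    """Pack codes in the range 0..3 four per byte, lowest bits first."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= (code & 0b11) << (2 * j)
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed, count):
    """Recover the first `count` 2-bit codes from packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


def embedding_bytes(vocab_size, dim, bits):
    """Storage for the embedding table, ignoring scale metadata."""
    return vocab_size * dim * bits // 8


codes = [0, 1, 2, 3, 3, 0]
packed = pack_2bit(codes)
restored = unpack_2bit(packed, len(codes))

# Hypothetical table shape.
VOCAB, DIM = 262144, 1536
size_16bit = embedding_bytes(VOCAB, DIM, 16)
size_2bit = embedding_bytes(VOCAB, DIM, 2)
```

With these example numbers the table drops from 768 MiB at 16-bit to 96 MiB at 2-bit, which is why the embedding table is such an attractive target in a small model.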

4/ ⚡ (Funny enough, static activation quant is less common than dynamic—likely because it's hard to get right! We hope to add/see more support for this format in popular libraries in the near future).

@rohanpaul_ai Huge win for local inference. I really hope to see this replicated for llama soon.

@rohanpaul_ai Shrinking Gemma 4 with QAT is proper engineering; the race to efficient on-device AI is heating up. 🔥

@rohanpaul_ai Very nice. The jump from 11.4GB to 1.1GB means Gemma 4 clears the RAM budget on a Pi without swapping, which changes what you can actually schedule in an agent loop.

@philliplanos @rohanpaul_ai That's what everyone learned 4 years ago when GPT3 was released. AI is dumb on its own. It can do nothing useful. You have to build a harness to make it read the chat history, use tools or anything practical. It is just a dumb text file otherwise.