Google just made Gemma 4 much easier to run on phones and laptops by releasing QAT (Quantization-Aware Training) checkpoints that shrink the smallest model from 11.4 GB to 1.1 GB, or 0.84 GB for text-only use.
Standard PTQ (Post-Training Quantization) compresses the weights only after training is finished, which can damage quality because the model never learned to tolerate that rounding.
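To make the rounding concrete, here is a minimal sketch of post-training quantization on a handful of made-up weights. It assumes symmetric round-to-nearest 4-bit quantization with a single scale. The function names and values are illustrative only and are not taken from Google's tooling.

```python
def quantize_symmetric(weights, bits=4):
    """Round-to-nearest symmetric quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each weight to the nearest integer code, clamped to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Turn integer codes back into approximate floating-point weights."""
    return [c * scale for c in codes]


# Hypothetical weights from an already-trained layer.
weights = [0.82, -0.41, 0.05, -0.93, 0.27, 0.61]
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)

# The gap between original and restored weights is the rounding error
# that the trained model never saw during training.
errors = [abs(w - r) for w, r in zip(weights, restored)]
max_error = max(errors)
print("codes:", codes)
print("max rounding error:", max_error)
```

Each weight can be off by up to half a quantization step. A model trained only in full precision has had no chance to adapt to that error.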
QAT addresses this by simulating the low-precision rounding during training, so Gemma 4 learns weights that still work after being squeezed, and the final compressed model is less likely to lose reasoning quality.
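A common way to simulate compression during training is "fake quantization" with a straight-through estimator. The forward pass uses the rounded weight, while the gradient updates a full-precision copy as if the rounding were not there. The sketch below shows that mechanic on a deliberately tiny one-weight problem. The step size, target, and learning rate are assumptions chosen for illustration, and this is not Google's actual training recipe.

```python
def fake_quant(w, step=0.25, bits=4):
    """Snap a full-precision weight onto the quantization grid."""
    qmax = 2 ** (bits - 1) - 1
    code = max(-qmax, min(qmax, round(w / step)))
    return code * step


def train_qat(target=0.6, lr=0.1, steps=200):
    """Fit one weight so that its *quantized* value is close to `target`."""
    w = 0.0                                  # latent full-precision weight
    for _ in range(steps):
        w_q = fake_quant(w)                  # forward pass sees the rounded weight
        grad = 2.0 * (w_q - target)          # gradient of (w_q - target)^2 w.r.t. w_q
        w -= lr * grad                       # straight-through: treat d(w_q)/d(w) as 1
    return w


latent = train_qat()
deployed = fake_quant(latent)                # what would actually ship

initial_loss = (fake_quant(0.0) - 0.6) ** 2
final_loss = (deployed - 0.6) ** 2
print("deployed weight:", deployed)
print("loss before / after:", initial_loss, final_loss)
```

The key point is that the loss is always computed on the rounded weight, so training pressure is applied to the model that will be deployed rather than to an idealized full-precision one.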
Google also built a mobile-focused format that combines static activations, channel-wise quantization, targeted 2-bit quantization, and KV cache optimization. In practice, the phone does less scaling work at runtime, some token-generation parts are stored more aggressively, and long chats do not eat memory as fast.
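Of the techniques listed, channel-wise quantization is the easiest to illustrate. The sketch below compares one shared scale for a whole weight matrix against one scale per row (channel). The matrix values and helper names are hypothetical, and the 4-bit symmetric scheme is an assumption rather than a description of Google's format.

```python
def quantize_row(row, scale, qmax):
    """Quantize and immediately dequantize one row with the given scale."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]


def per_tensor(matrix, bits=4):
    """One scale shared by every value in the matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for row in matrix for v in row) / qmax or 1.0
    return [quantize_row(row, scale, qmax) for row in matrix]


def per_channel(matrix, bits=4):
    """A separate scale for each row (output channel)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in matrix:
        scale = max(abs(v) for v in row) / qmax or 1.0
        out.append(quantize_row(row, scale, qmax))
    return out


def total_abs_error(matrix, restored):
    """Sum of absolute differences between original and restored values."""
    return sum(abs(a - b)
               for row_a, row_b in zip(matrix, restored)
               for a, b in zip(row_a, row_b))


# Hypothetical layer: one channel with tiny weights, one with large weights.
matrix = [
    [0.01, -0.02, 0.015],
    [1.0, -0.8, 0.5],
]
per_tensor_err = total_abs_error(matrix, per_tensor(matrix))
per_channel_err = total_abs_error(matrix, per_channel(matrix))
print("per-tensor error: ", per_tensor_err)
print("per-channel error:", per_channel_err)
```

With a single shared scale, the small-magnitude channel collapses to zero because the large channel dictates the step size. Giving each channel its own scale preserves it at the cost of storing one extra number per channel.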






