Gemma 4 12B shipped today under the label "encoder-free."
A local 12b model that shows really good results. I'm a big fan of Gemma Gemma 4 12B is out: a dense, fully open model (Apache 2.0) that runs on a 16GB laptop and does agentic reasoning, vision and audio at a quality Google puts near its 26B model.
The reason a 12B can pull this off: Google removed the separate vision and audio encoders and feeds both straight into the model, which keeps the memory footprint small enough for consumer GPUs.
For on-device assistants and private coding agents, that lowers the bar a lot. always look forward to the updates. 12b is a good sweet spot in terms of size.
a few facts:
Vision: the 550M encoder (27 transformer layers) is now a 35M embedder, one matmul on 48x48 pixel patches. Roughly 15x smaller.
Audio: the 300M encoder (12 conformer layers) is gone. Raw 16kHz audio cut into 40ms frames, projected straight into the LLM. So encoding didn't vanish, it collapsed into the backbone.
The payoff is real: one shared set of weights, so you LoRA-tune vision, audio and text in a single pass.