/Tech9d ago

Google releases Gemma 4 12B, an encoder-free multimodal model under Apache 2.0 that beats the larger Gemma 3 27B

AI Judge changed title after evaluation, original title: "Google releases Gemma 4 12B, a unified open multimodal model that outperforms the larger Gemma 3 27B"

The model runs locally on laptops with 16GB VRAM.

--0--

Original post

Armand Joulin#874

Google Gemma@googlegemma

Meet Gemma 4 12B!

A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.

Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

9:00 AM · Jun 3, 2026 · 2.6M Views

Sentiment

Many users praised Gemma 4 12B's unified multimodal architecture and local laptop performance for delivering strong open capabilities, while others called it basic or unsuitable for tasks like ASR.

Pos

80.1%

Neg

19.9%

580 comments with sentiment.

Cluster Engagement

Views

Comments

Reposts

Bookmarks

Expand data

Posts from X

Most Activity

VIEWS679.3KBOOKMARKS2.6KLIKES8.2KRETWEETS1.1KREPLIES217

Google@Google

Today we’re introducing Gemma 4 12B — our latest open model that brings advanced agentic reasoning, vision and audio directly to your laptop.

It delivers performance nearing our larger Gemma models with a much smaller total memory footprint, while being small enough to run locally with just 16GB of VRAM. It’s open and accessible for everyone to use under a permissive Apache 2.0 license.

This is all made possible by our new, unified architecture that removes separate multimodal encoders. Here’s how we did it 🧵

9d679.3K8.2K2.6K

Unsloth AI@UnslothAI

Gemma 4 12B can now run locally on just 8GB RAM via Dynamic GGUFs.

Google's new model, Gemma 4 12B Unified supports image, audio and 256K context.

You can run and train the model via Unsloth Studio.

GGUF: https://huggingface.co/unsloth/gemma-4-12b-it-GGUF Guide: https://unsloth.ai/docs/models/gemma-4

Google Gemma@googlegemma

Meet Gemma 4 12B!

A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.

Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

9d267.2K2.6K1.4K

Sundar Pichai@sundarpichai

Our new Gemma 4 12B model hits a sweet spot between size + performance: it can run locally on a laptop, while enabling powerful multi-step reasoning and agentic workflows. Can’t wait to see what the community does with this one!

Demis Hassabis@demishassabis

Celebrating the milestone of a massive 150+ million downloads of Gemma 4 with the release of the new Gemma 4 12B model! It's incredibly powerful for such a small model and it’s tiny enough to run locally on a laptop with just 16GB VRAM. Apache 2.0 license - happy building!

9d393.9K4.9K765

Google for Developers@googledevs

Unlock local, agentic workflows with Gemma 4 12B and Google AI Edge, directly on your laptop. Experience 100% on-device AI:

• Generate code in AI Edge Gallery (new to Mac) • Dictate and edit text via AI Edge Eloquent (new to Mac) • Serve Gemma 4 12B locally with LiteRT-LM

Dive in: http://goo.gle/4uQlVfq

9d116.2K1.5K1.1K

Unsloth AI@UnslothAI

2-bit Gemma 4 12B GGUF, only 4.66 GB on disk, managed to cite 15 sites from a single prompt.

Try this locally on >6GB RAM via Unsloth Studio.

GitHub: https://github.com/unslothai/unsloth

Unsloth AI@UnslothAI

Gemma 4 12B can now run locally on just 8GB RAM via Dynamic GGUFs.

Google's new model, Gemma 4 12B Unified supports image, audio and 256K context.

You can run and train the model via Unsloth Studio.

GGUF: https://huggingface.co/unsloth/gemma-4-12b-it-GGUF Guide: https://unsloth.ai/docs/models/gemma-4

9d125.9K1.5K1.1K

Omar Sanseviero@osanseviero

Super excited to introduce Gemma 4 12B! 💎

- Multimodal: audio, image, video, and text input - Novel architecture: we removed the multimodal encoders for a unified, streamlined arch - New MacOS desktop app powered by LiteRT - MTP support

Excited to see what you build with it!

9d118.6K1.9K914

Prince Canuma@Prince_Canuma

🚀 Gemma 4 12B is here!

We partnered with @GoogleDeepMind to bring and optimize their new dense and unifed multimodal model for Apple Silicon.

◈ 12B dense · 256K context ◈ Thinking mode (built-in reasoning) ◈ Vision: dynamic res, OCR, UI + charts ◈ Native audio: ASR + speech translation ◈ Function calling for agents ◈ Text + image + audio, interleaved

Runs local. Get started now ⚡

> uv pip install -U mlx-vlm

https://github.com/Blaizzy/mlx-vlm

Google Gemma@googlegemma

Meet Gemma 4 12B!

A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.

Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

9d143.4K1.4K912

Demis Hassabis@demishassabis

Google Gemma@googlegemma

Meet Gemma 4 12B!

A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.

Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

9d619.3K3.1K521

Michael Tschannen@mtschannen

For the past years my research focus was on unifying models and training paradigms across modalities. Today I'm excited that we're releasing our latest model aligned with this theme:

Gemma 4 12B, a dense encoder-free model which processes raw text, image, and audio inputs!

9d105.3K1.1K537

LM Studio@lmstudio

Gemma 4 12B is here!

Dense, mid-sized Gemma that fits right on your laptop - released by @google under Apache 2.0

Available now in LM Studio https://lmstudio.ai/models/google/gemma-4-12b

9d119.9K1.4K353

Google AI Developers@googleaidevs

We’re launching Gemma 4 12B: Our unified, encoder-free model that brings powerful multimodal intelligence straight to your laptop 🚀

The model bridges the gap between our mobile E4B model and larger 26B MoE models, packaging frontier-class reasoning and native audio into a highly optimized footprint, all under a permissive Apache 2.0 license.

Here’s what makes it unique:

+ Encoder-Less Architecture: We removed the multimodal encoders. The vision and audio inputs flow directly into the LLM backbone. + Agentic Performance (16GB VRAM): Run complex, multi-step workflows locally, with performance nearing our 26B model.

9d65.5K1.1K198

Taelin@VictorTaelin

I have 256 applications for this

Google Gemma@googlegemma

Meet Gemma 4 12B!

A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.

Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

9d102.1K653184

Jeff Dean@JeffDean

Check out our Gemma 4 12B model: it's a super capable open weights model that can run directly on your laptop.

Google Gemma@googlegemma

Meet Gemma 4 12B!

A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.

Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

9d50.2K57364

vLLM@vllm_project

Congrats to the @googlegemma team on the Gemma 4 12B launch 🎉 Day-0 support on vLLM is ready to go.

It's an encoder-free unified multimodal model — text, image, audio, and video all project straight into the LLM's embedding space, no separate vision or audio towers. 256K context, built-in thinking, native tool calling.

Reasoning + tool parsers (`gemma4`), vision, and audio all served through the OpenAI-compatible API.

🔗 Recipe: http://recipes.vllm.ai/Google/gemma-4-12B-it

Google Gemma@googlegemma

Meet Gemma 4 12B!

A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.

Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

9d18.5K38891

Philipp Schmid@_philschmid

We just launched a Gemma 4 12B! Our first mid-sized model with native audio inputs. Gemma 4 12 B is a unified, encoder-free multimodal model.

🧠 vision and audio directly into the LLM. 💻 Just need 16GB of memory. 📊 Benchmark nearing 26B. 📄 Apache 2.0.

9d13.4K36159

Google for Developers@googledevs

✨ Introducing @GoogleGemma 4 12B, a unified open model bringing high-performance agentic multimodal intelligence directly to your laptop.

Bridging the gap between edge efficiency and advanced reasoning, nearing 26B MoE at <50% the memory footprint.

9d13.7K39348

Lotto@LottoLabs

Woah the mad men did it and keep dropping

Absolute banger at 12b

That’s probably the sweet spot for small local GPUs

Google Gemma@googlegemma

Meet Gemma 4 12B!

A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.

Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

9d28.7K32563

Chubby♨️@kimmonismus

Gemma 4 12B shipped today under the label "encoder-free."

A local 12b model that shows really good results. I'm a big fan of Gemma Gemma 4 12B is out: a dense, fully open model (Apache 2.0) that runs on a 16GB laptop and does agentic reasoning, vision and audio at a quality Google puts near its 26B model.

The reason a 12B can pull this off: Google removed the separate vision and audio encoders and feeds both straight into the model, which keeps the memory footprint small enough for consumer GPUs.

For on-device assistants and private coding agents, that lowers the bar a lot. always look forward to the updates. 12b is a good sweet spot in terms of size.

a few facts:

Vision: the 550M encoder (27 transformer layers) is now a 35M embedder, one matmul on 48x48 pixel patches. Roughly 15x smaller. Audio: the 300M encoder (12 conformer layers) is gone. Raw 16kHz audio cut into 40ms frames, projected straight into the LLM. So encoding didn't vanish, it collapsed into the backbone. The payoff is real: one shared set of weights, so you LoRA-tune vision, audio and text in a single pass.

Google@Google

Today we’re introducing Gemma 4 12B — our latest open model that brings advanced agentic reasoning, vision and audio directly to your laptop.

This is all made possible by our new, unified architecture that removes separate multimodal encoders. Here’s how we did it 🧵

9d21.1K31251

merve@mervenoyann

Google dropped Gemma-4 12B, it's a beast 🔥

> unified: audio + image go straight into model, no encoder > multimodal + tool calling > dense 12B with 256K context, comes with assistants for MTP (faster!⚡️) > day-0 in transformers, llama.cpp & MLX > A2.0 🤗

9d17.3K18569

wh@nrehiew_

Cool multimodal architecture. Raw images and audio are patchified and slices respectively with a single linear projection into the same space as text tokens.

Would be cool if anyone can turn it into a psuedo-realtime interaction model

Google Gemma@googlegemma

Meet Gemma 4 12B!

A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.

Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

9d15.3K17159

Digg Deeper

Anthony Dikéasked

give me example for how i can start using this today

Download the free Google AI Edge Eloquent macOS app from ai.google.dev/edge/eloquent and run the voice-to-text editing demo locally with Gemma 4 12B on a 16 GB RAM laptop. ^¹

Alternatively, install Ollama, then run `ollama pull gemma4:12b` (or the HF `google/gemma-4-12b-it` variant) for immediate local multimodal chat and inference. ^²

For advanced setups, follow the SGLang cookbook or Hugging Face collection for the encoder-free model.

Jonny Dasked

How does Gemma 4 12b compare to other open source models?

Gemma 4 12B beats Gemma 3 27B on key benchmarks (e.g., MMLU Pro 77.2% vs. 67.6%) while running natively multimodal (text/image/audio/video) on laptops with just 16GB VRAM via its encoder-free design. ^¹

It stands out among open models for on-device efficiency and Apache 2.0 licensing, though larger variants like Qwen 3.5 or DeepSeek often lead raw scores at higher hardware cost.