/AI · 12h ago

Google DeepMind's Omar Sanseviero says Gemma 4 multi-token prediction support has been merged into llama.cpp

The update enables over 2x speedups for dense models.

Original post

Omar Sanseviero@osanseviero · #486 in AI

Gemma 4 MTP just got officially merged into llama.cpp

This means you can use Gemma 4 QAT + MTP for a lightweight + super fast setup. Excited to see what the community builds with it

https://github.com/ggml-org/llama.cpp/pull/23398

10:38 AM · Jun 7, 2026 · 46K Views

Posts from X

Most Activity

Views: 1.5K · Likes: 17

Marko Tasic@mtasic85

@osanseviero I keep refreshing @UnslothAI @huggingface page 😉 Stress test 🚀

12h · 1.5K views · 17 likes

Bookmarks: 5

Humberto Oliveira@holiveira

@osanseviero Unsloth already uploaded the assistant mtp models inside a mtp folder on the repos: https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/tree/main/MTP for the 12B, https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/tree/main/MTP for the 26B, and https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP for the 31B.

11h29325

Retweets: 81

Omar Sanseviero@osanseviero

12h46K936501

Replies: 2

Sakura Yuki@sakurayukiai

@osanseviero Reusing the target model's KV cache instead of maintaining two separate ones is the real win here. The speculative decoding VRAM tax has always been the silent killer on consumer GPUs.

11h1.5K141

Ljubomir Josifovski@ljupc0

@osanseviero daang!! - and I just finished testing @atomic_chat_hq 's MTP https://huggingface.co/AtomicChat/gemma-4-31B-it-assistant-GGUF + @UnslothAI 's QAT https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF today 😂

11h1.1K94

vini@apeiron_spx

@osanseviero please ggml team @AdrienGallouet update the llama.cpp pre-built binaries version from http://llama.app https://huggingface.co/buckets/ggml-org/install.sh

11h4931

TensorEspresso@TensorEspresso

@osanseviero I tried this build, unsloth QAT GGUF, and MTP draft model. For some reason Gemma 4 31B is too heavy for RTX 5090 :/ Qwen 3.6 allows you to actually run it with decent context length.

8h6503

Gladius@gladius_atmfy

@osanseviero Gemma 4 MTP in llama.cpp is the kind of compounding that makes local inference real. QAT drops the memory floor, MTP cuts the latency — together they make 12GB consumer GPUs actually usable for production inference. What's the decode speed looking like on 4090 with the new path?

10h1901

Sakura Yuki@sakurayukiai

@osanseviero More on how speculative decoding actually works under the hood and why draft matching matters: https://leetllm.com/learn/speculative-decoding

11h1581
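
For readers following the thread, here is a minimal sketch of the greedy draft-then-verify loop that the replies above refer to as speculative decoding. It is a toy illustration, not llama.cpp's implementation and not the Gemma 4 MTP code path: `toy_target`, `toy_draft`, and `speculative_step` are hypothetical names, and the "models" are plain functions over integer tokens. A real engine verifies all drafted positions in a single batched forward pass rather than in a Python loop.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: always counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one greedy draft-and-verify step and return the accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed token in order.
    accepted: List[Token] = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target contributes one bonus token.
    accepted.append(target(ctx))
    return accepted


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [1], k=3))  # all drafts accepted
    print(speculative_step(toy_target, toy_draft, [3], k=3))  # draft diverges early
```

In this greedy setting the accepted tokens are always the ones the target would have produced on its own, so the output is unchanged and any speedup depends on how often the draft matches. That is why draft quality matters, and why the extra memory for the draft model is the cost several replies mention.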

Two Minute Papers@twominutepapers

@osanseviero Thank you!

10h6275

Jaydan Urwin@jaydanurwin

@osanseviero Do the ollama models include MTP yet?

11h8394

synabun.ai@SynabunAI

@osanseviero QAT plus MTP used to mean a custom inference pipeline and a blog post nobody could reproduce. now it's llama.cpp and a model path.

11h8883

jolsky@Jolsky11Jolsky1

@mtasic85 @osanseviero @UnslothAI @huggingface it's sunday man!

11h651

Jonathan Dunlap@JonathanRoseD

@osanseviero Just to confirm, MTP isn't very helpful for MoE model, right?

7h195

Arunabh hazarika@seamon67

@osanseviero Would love to give this a go with Gemma 4:124b

11h1.1K2

Marko Tasic@mtasic85

@Jolsky11Jolsky1 @osanseviero @UnslothAI @huggingface It is always Monday or Saturday somewhere. Weekends are not Sat-Sun everywhere in the world. Don’t forget that we are nerds and there is not such a thing as weekend 🤣😉

11h55

Jahanzaib Ahmed@jahanzaibai

@osanseviero QAT handles quantization quality, MTP buys back the throughput. Honestly that combo in llama.cpp is probably enough for most local agentic loops without touching cloud inference.

5h4902

Nikita Belokopytov@NikiBelokopytov

@TensorEspresso @osanseviero MTP doesn't come for free

7h44

Noctus@noctus91

@osanseviero waiting for @UnslothAI to upload

12h3202

BenUsesAI@BenUsesAI1

@osanseviero gemma 4 mtp in llama.cpp sounds promising until latency spikes stop chasing every new merge and just optimize what actually works

10h6011