Google DeepMind's Omar Sanseviero says Gemma 4 multi-token prediction support has been merged into llama.cpp

The update enables over 2x speedups for dense models.

Original post
Omar Sanseviero@osanseviero · #486 in AI

Gemma 4 MTP just got officially merged into llama.cpp

This means you can use Gemma 4 QAT + MTP for a lightweight + super fast setup. Excited to see what the community builds with it

https://github.com/ggml-org/llama.cpp/pull/23398

10:38 AM · Jun 7, 2026 · 46K Views · 936 Likes · 81 Retweets · 501 Bookmarks
Posts from X
Marko Tasic@mtasic85

@osanseviero I keep refreshing @UnslothAI @huggingface page 😉 Stress test 🚀

12h · Views 1.5K · Likes 17

@osanseviero Unsloth already uploaded the assistant mtp models inside a mtp folder on the repos: https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/tree/main/MTP for the 12B, https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/tree/main/MTP for the 26B, and https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP for the 31B.

11h · Views 293 · Likes 2 · Bookmarks 5
Sakura Yuki@sakurayukiai

@osanseviero Reusing the target model's KV cache instead of maintaining two separate ones is the real win here. The speculative decoding VRAM tax has always been the silent killer on consumer GPUs.

11h · Views 1.5K · Likes 14 · Bookmarks 1 · Replies 2

@osanseviero daang!! - and I just finished testing @atomic_chat_hq 's MTP https://huggingface.co/AtomicChat/gemma-4-31B-it-assistant-GGUF + @UnslothAI 's QAT https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF today 😂

11h · Views 1.1K · Likes 9 · Bookmarks 4
vini@apeiron_spx

@osanseviero please ggml team @AdrienGallouet update the llama.cpp pre-built binaries version from http://llama.app https://huggingface.co/buckets/ggml-org/install.sh

11h · Views 493 · Bookmarks 1
TensorEspresso@TensorEspresso

@osanseviero I tried this build, unsloth QAT GGUF, and MTP draft model. For some reason Gemma 4 31B is too heavy for RTX 5090 :/ Qwen 3.6 allows you to actually run it with decent context length.

8h · Views 650 · Likes 3
Gladius@gladius_atmfy

@osanseviero Gemma 4 MTP in llama.cpp is the kind of compounding that makes local inference real. QAT drops the memory floor, MTP cuts the latency — together they make 12GB consumer GPUs actually usable for production inference. What's the decode speed looking like on 4090 with the new path?

10h · Views 190 · Bookmarks 1
Sakura Yuki@sakurayukiai

@osanseviero More on how speculative decoding actually works under the hood and why draft matching matters: https://leetllm.com/learn/speculative-decoding

11h · Views 158 · Bookmarks 1
Jaydan Urwin@jaydanurwin

@osanseviero Do the ollama models include MTP yet?

11h · Views 839 · Likes 4
synabun.ai@SynabunAI

@osanseviero QAT plus MTP used to mean a custom inference pipeline and a blog post nobody could reproduce. now it's llama.cpp and a model path.

11h · Views 888 · Likes 3
jolsky@Jolsky11Jolsky1

@mtasic85 @osanseviero @UnslothAI @huggingface it's sunday man!

11h · Views 65 · Likes 1
Jonathan Dunlap@JonathanRoseD

@osanseviero Just to confirm, MTP isn't very helpful for MoE model, right?

7h · Views 195

@osanseviero Would love to give this a go with Gemma 4:124b

11h · Views 1.1K · Likes 2
Marko Tasic@mtasic85

@Jolsky11Jolsky1 @osanseviero @UnslothAI @huggingface It is always Monday or Saturday somewhere. Weekends are not Sat-Sun everywhere in the world. Don’t forget that we are nerds and there is not such a thing as weekend 🤣😉

11h · Views 55
Jahanzaib Ahmed@jahanzaibai

@osanseviero QAT handles quantization quality, MTP buys back the throughput. Honestly that combo in llama.cpp is probably enough for most local agentic loops without touching cloud inference.

5h · Views 490 · Likes 2
Nikita Belokopytov@NikiBelokopytov

@TensorEspresso @osanseviero MTP doesn't come for free

7h · Views 44
Noctus@noctus91

@osanseviero waiting for @UnslothAI to upload

12h · Views 320 · Likes 2
BenUsesAI@BenUsesAI1

@osanseviero gemma 4 mtp in llama.cpp sounds promising until latency spikes stop chasing every new merge and just optimize what actually works

10h · Views 601 · Likes 1