/Tech · 22h ago

Gemma 4 Multi-Token Prediction support merges into llama.cpp, offering a 2x speedup for local dense models

Testing showed no performance speedup for MoE variants.

#1004

Original post

Omar Sanseviero (@osanseviero) · #1004 in Tech

Gemma 4 MTP just got officially merged into llama.cpp

This means you can use Gemma 4 QAT + MTP for a lightweight + super fast setup. Excited to see what the community builds with it

https://github.com/ggml-org/llama.cpp/pull/23398

10:38 AM · Jun 7, 2026 · 75.2K Views

Cluster Engagement

Posts from X

Most Activity

VIEWS 1.5K · LIKES 17

Marko Tasic@mtasic85

@osanseviero I keep refreshing @UnslothAI @huggingface page 😉 Stress test 🚀

21h1.5K17

BOOKMARKS 5

Humberto Oliveira@holiveira

@osanseviero Unsloth already uploaded the assistant mtp models inside a mtp folder on the repos: https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/tree/main/MTP for the 12B, https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/tree/main/MTP for the 26B, and https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP for the 31B.

21h29325

RETWEETS 1

Ljubomir Josifovski@ljupc0

@osanseviero daang!! - and I just finished testing @atomic_chat_hq 's MTP https://huggingface.co/AtomicChat/gemma-4-31B-it-assistant-GGUF + @UnslothAI 's QAT https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF today 😂

21h1.1K94

REPLIES 2

Sakura Yuki@sakurayukiai

@osanseviero Reusing the target model's KV cache instead of maintaining two separate ones is the real win here. The speculative decoding VRAM tax has always been the silent killer on consumer GPUs.

21h1.5K141

vini@apeiron_spx

@osanseviero please ggml team @AdrienGallouet update the llama.cpp pre-built binaries version from http://llama.app https://huggingface.co/buckets/ggml-org/install.sh

21h4931

TensorEspresso@TensorEspresso

@osanseviero I tried this build, unsloth QAT GGUF, and MTP draft model. For some reason Gemma 4 31B is too heavy for RTX 5090 :/ Qwen 3.6 allows you to actually run it with decent context length.

18h6503

Gladius@gladius_atmfy

@osanseviero Gemma 4 MTP in llama.cpp is the kind of compounding that makes local inference real. QAT drops the memory floor, MTP cuts the latency — together they make 12GB consumer GPUs actually usable for production inference. What's the decode speed looking like on 4090 with the new path?

19h1901

Sakura Yuki@sakurayukiai

@osanseviero More on how speculative decoding actually works under the hood and why draft matching matters: https://leetllm.com/learn/speculative-decoding

21h1581
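
For readers following the speculative decoding replies above, here is a minimal toy sketch of the greedy draft-and-verify loop and why draft matching matters. It is not llama.cpp's implementation and does not model Gemma 4's MTP path or KV-cache sharing; `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names invented for illustration, and the "models" are plain Python functions over integer tokens.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except after a 7."""
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


def speculative_step(target: NextToken, draft: NextToken,
                     ctx: List[Token], k: int) -> List[Token]:
    """Run one draft-and-verify round; return the tokens accepted this round."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(ctx + proposed))

    # 2. The target checks each proposal. A real engine would score all k
    #    positions in one batched forward pass; this toy calls it per position.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(ctx + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target(ctx + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             n_tokens: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_tokens; return (sequence, number of verify rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


def plain_greedy(target: NextToken, prompt: List[Token],
                 n_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


if __name__ == "__main__":
    seq, rounds = generate(toy_target, toy_draft, [0], n_tokens=12)
    print(seq, rounds)
```

In this sketch the output always equals plain greedy decoding with the target alone, and the saving comes from needing fewer verify rounds than generated tokens. The more often the draft agrees with the target, the more tokens each round accepts, which is the sense in which draft matching matters.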

Two Minute Papers@twominutepapers

@osanseviero Thank you!

20h6275

Jaydan Urwin@jaydanurwin

@osanseviero Do the ollama models include MTP yet?

21h8394

synabun.ai@SynabunAI

@osanseviero QAT plus MTP used to mean a custom inference pipeline and a blog post nobody could reproduce. now it's llama.cpp and a model path.

21h8883

jolsky@Jolsky11Jolsky1

@mtasic85 @osanseviero @UnslothAI @huggingface it's sunday man!

21h651

Jonathan Dunlap@JonathanRoseD

@osanseviero Just to confirm, MTP isn't very helpful for MoE model, right?

17h195

Arunabh hazarika@seamon67

@osanseviero Would love to give this a go with Gemma 4:124b

21h1.1K2

Marko Tasic@mtasic85

@Jolsky11Jolsky1 @osanseviero @UnslothAI @huggingface It is always Monday or Saturday somewhere. Weekends are not Sat-Sun everywhere in the world. Don’t forget that we are nerds and there is not such a thing as weekend 🤣😉

21h55

Jahanzaib Ahmed@jahanzaibai

@osanseviero QAT handles quantization quality, MTP buys back the throughput. Honestly that combo in llama.cpp is probably enough for most local agentic loops without touching cloud inference.

15h4902

Nikita Belokopytov@NikiBelokopytov

@TensorEspresso @osanseviero MTP doesn't come for free

17h44

Noctus@noctus91

@osanseviero waiting for @UnslothAI to upload

21h3202

BenUsesAI@BenUsesAI1

@osanseviero gemma 4 mtp in llama.cpp sounds promising until latency spikes stop chasing every new merge and just optimize what actually works

20h6011

3rdEyeVisuals@3rdEyeVisuals

@osanseviero I have gotten to the point where any cool new llama.cpp updates that come in need to be hand-rolled by me into my build.. I fear that I have diverged too far from the main branch XD

20h5481