Gemma 4 MTP just got officially merged into llama.cpp
This means you can use Gemma 4 QAT + MTP for a lightweight + super fast setup. Excited to see what the community builds with it
https://github.com/ggml-org/llama.cpp/pull/23398
Testing showed no performance speedup for MoE variants.

@osanseviero I keep refreshing @UnslothAI @huggingface page 😉 Stress test 🚀

@osanseviero Unsloth already uploaded the assistant mtp models inside a mtp folder on the repos: https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/tree/main/MTP for the 12B, https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/tree/main/MTP for the 26B, and https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP for the 31B.

@osanseviero daang!! - and I just finished testing @atomic_chat_hq 's MTP https://huggingface.co/AtomicChat/gemma-4-31B-it-assistant-GGUF + @UnslothAI 's QAT https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF today 😂

@osanseviero Reusing the target model's KV cache instead of maintaining two separate ones is the real win here. The speculative decoding VRAM tax has always been the silent killer on consumer GPUs.

@osanseviero please ggml team @AdrienGallouet update the llama.cpp pre-built binaries version from http://llama.app https://huggingface.co/buckets/ggml-org/install.sh

@osanseviero I tried this build, unsloth QAT GGUF, and MTP draft model. For some reason Gemma 4 31B is too heavy for RTX 5090 :/ Qwen 3.6 allows you to actually run it with decent context length.

@osanseviero Gemma 4 MTP in llama.cpp is the kind of compounding that makes local inference real. QAT drops the memory floor, MTP cuts the latency — together they make 12GB consumer GPUs actually usable for production inference. What's the decode speed looking like on 4090 with the new path?

@osanseviero More on how speculative decoding actually works under the hood and why draft matching matters: https://leetllm.com/learn/speculative-decoding
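For readers who want the gist without following the link, here is a minimal sketch of greedy draft verification, the core step of speculative decoding. It is a toy illustration and not llama.cpp's implementation: `accept_draft`, `target_next`, and `toy_target` are hypothetical names, and a real engine checks all draft tokens in one batched forward pass rather than calling the target once per token.

```python
from typing import Callable, List


def accept_draft(
    target_next: Callable[[List[int]], int],
    context: List[int],
    draft: List[int],
) -> List[int]:
    """Greedy verification of draft tokens (hypothetical helper).

    Keeps draft tokens while they match what the target model would pick.
    On the first mismatch, emits the target's token and stops.
    If every draft token matches, appends one extra token from the target.
    """
    accepted: List[int] = []
    ctx = list(context)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the bad draft token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when the whole draft is accepted
    return accepted


def toy_target(ctx: List[int]) -> int:
    """Stand-in for the target model: next token is (last + 1) mod 10."""
    return (ctx[-1] + 1) % 10


good = accept_draft(toy_target, [1, 2, 3], [4, 5, 6])
bad = accept_draft(toy_target, [1, 2, 3], [4, 5, 9])
print(good, bad)
```

The better the draft matches the target, the more tokens are accepted per verification step, which is why draft matching matters for the speedup.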

@osanseviero Thank you!

@osanseviero Do the ollama models include MTP yet?

@osanseviero QAT plus MTP used to mean a custom inference pipeline and a blog post nobody could reproduce. now it's llama.cpp and a model path.

@mtasic85 @osanseviero @UnslothAI @huggingface it's sunday man!

@osanseviero Just to confirm, MTP isn't very helpful for MoE models, right?

@osanseviero Would love to give this a go with Gemma 4:124b

@Jolsky11Jolsky1 @osanseviero @UnslothAI @huggingface It is always Monday or Saturday somewhere. Weekends are not Sat-Sun everywhere in the world. Don't forget that we are nerds and there is no such thing as a weekend 🤣😉

@osanseviero QAT handles quantization quality, MTP buys back the throughput. Honestly that combo in llama.cpp is probably enough for most local agentic loops without touching cloud inference.

@TensorEspresso @osanseviero MTP doesn't come for free

@osanseviero waiting for @UnslothAI to upload

@osanseviero gemma 4 mtp in llama.cpp sounds promising until latency spikes. stop chasing every new merge and just optimize what actually works

@osanseviero I have gotten to the point where any cool new llama.cpp updates that come in need to be hand-rolled by me into my build.. I fear that I have diverged too far from the main branch XD