I've seen some confusion online on how to run llama.cpp with MTP (Multi-token prediction) in the simplest way possible.
ICYMI, MTP is a new flavor of speculative decoding built-in to the model itself, that ~2x your tokens per sec for most use cases.
2x generation speed = Truly a game changer. 🔥
How to run it?
brew upgrade llama.cpp
# or you might need to install from source until build 9200 is in your package manager:
brew install llama.cpp --HEAD
Then pick either the Dense 27B or the 35B A3B MoE.
Personally I tend to stick to the Dense model where I achieve ~30 tok/sec on my machine. The MoE is of course way faster at an impressive ~100 tok/sec on my machine. Truly rapid. ⚡️
In both cases you probably want 48GB or better 64GB RAM or VRAM, though 36GB might work with more strongly-quantized versions.
# Dense:
llama-server -hf ggml-org/Qwen3.6-27B-MTP-GGUF --spec-type draft-mtp --spec-draft-n-max 2
# MoE:
llama-server -hf ggml-org/Qwen3.6-35B-A3B-MTP-GGUF --spec-type draft-mtp --spec-draft-n-max 3
Enjoy!