Unsloth releases MTP GGUFs for Qwen3.6 models
Unsloth released experimental MTP GGUFs for the Qwen3.6-27B and Qwen3.6-35B-A3B models. The files support native speculative decoding via multi-token prediction (MTP), enabled by a recent llama.cpp update, and deliver up to 1.8x faster inference. On a single GPU the 27B model reaches 140 tokens per second, while the 35B-A3B reaches 220 tokens per second. The quantized releases require 18 GB and 22 GB of RAM respectively.
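To see where a speedup like 1.8x comes from, the gain from speculative decoding can be approximated with a simplified model (this is an illustrative sketch, not Unsloth's benchmark methodology): if the MTP head drafts n tokens per step and each draft token is accepted independently with probability p, the target model emits several tokens per forward pass instead of one.

```python
def expected_tokens_per_step(p: float, n: int) -> float:
    """Expected tokens emitted per target-model forward pass when a
    draft of n tokens is verified, assuming each draft token is
    accepted independently with probability p (simplified model).

    The accepted prefix length is geometric-like and truncated at n;
    the target model always contributes at least one token itself.
    """
    # sum_{k=0}^{n} p^k  ==  (1 - p^(n+1)) / (1 - p) for p != 1
    return sum(p**k for k in range(n + 1))

# With a draft length of 6 (cf. --spec-draft-n-max 6) and a
# hypothetical ~65% acceptance rate, each target pass yields
# roughly 2.7 tokens on average.
print(expected_tokens_per_step(0.65, 6))
```

The real-world speedup is lower than this ratio because verification batches are slightly more expensive than single-token decoding and the draft head is not entirely free, which is why tuning flags like --spec-draft-p-min matters.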
More info and follow-ups:
Qwen3.6 MTP Unsloth GGUFs now run 1.8x faster, increased from 1.4x just two days ago!
This is due to llama.cpp adding --spec-draft-p-min 0.75!
Args have also changed: --spec-type mtp is now --spec-type draft-mtp. Also increase --spec-draft-n-max from 2 to 6.
We also released Qwen3.6-0.8B, 2B, 4B, 9B MTP GGUFs! We'll be providing more soon!
For folks who find the updated branch has a perf regression, set --spec-draft-p-min to 0.0 to restore the old behavior - we also provided a plot of the old branch (red) vs the new branch (blue/green).
You can also combine two speculative decoding algos - add ngram via --spec-type ngram-mod,draft-mtp. The perf isn't optimized yet, so I'll run more benchmarks to find better numbers - see https://github.com/ggml-org/llama.cpp/pull/22673
Guide for MTP: https://unsloth.ai/docs/models/qwen3.6#mtp-guide
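Putting the flags above together, an invocation might look like the following. This is a hedged sketch: the flags (--spec-type draft-mtp, --spec-draft-p-min 0.75, --spec-draft-n-max 6) are those named in the post, but the binary name, model filename, and the remaining arguments are placeholder assumptions - consult the linked MTP guide for the exact command.

```shell
# Hypothetical example - model path and non-speculative flags are placeholders.
./llama-server \
    --model Qwen3.6-27B-MTP-Q4_K_M.gguf \
    --spec-type draft-mtp \
    --spec-draft-p-min 0.75 \
    --spec-draft-n-max 6

# To reproduce the pre-update behavior if you see a regression:
#   --spec-draft-p-min 0.0
# To stack the ngram drafter on top of MTP (not yet perf-optimized):
#   --spec-type ngram-mod,draft-mtp
```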

@UnslothAI Qwen3.5 0.8B, 2B, 4B, 9B updated MTP GGUFs are at https://huggingface.co/unsloth
Misspoke earlier as well - the small GGUFs are Qwen3.5, not Qwen3.6 - but we're hoping there will be smaller quants from Qwen in the future!
We're also doing Qwen3.5-122B and Qwen3.5-397B MTP!