Unsloth releases MTP GGUFs for Qwen3.6 models

Unsloth released experimental MTP GGUFs for the Qwen3.6-27B and Qwen3.6-35B-A3B models. The files use the native speculative decoding (multi-token prediction) support added in a recent llama.cpp update and deliver up to 1.8x faster inference. On a single GPU, the 27B model reaches 140 tokens per second and the 35B-A3B reaches 220 tokens per second. The quantized releases require 18 GB and 22 GB of RAM, respectively.

Original post

Qwen3.6 MTP Unsloth GGUFs now run 1.8x faster, up from 1.4x just two days ago!

This is due to llama.cpp adding --spec-draft-p-min 0.75!

Args have also changed from --spec-type mtp to --spec-type draft-mtp. Also increase --spec-draft-n-max from 2 to 6.

We also released Qwen3.6-0.8B, 2B, 4B, 9B MTP GGUFs! We'll be providing more soon!

For folks who find the updated branch has a perf regression, set --spec-draft-p-min to 0.0 to get the old behavior. We provided a plot of the old branch (red) vs the new branch (blue/green) as well.

You can also use two speculative decoding algorithms at once: add ngram via --spec-type ngram-mod,draft-mtp. The perf isn't yet optimized, so I'll do more benchmarks to find better numbers; see https://github.com/ggml-org/llama.cpp/pull/22673

Guide for MTP: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

6:10 AM · May 15, 2026
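
For reference, here is a minimal sketch of how the updated flags might fit together in a llama.cpp server launch. The --spec-* flags and their values are quoted verbatim from the post (and the linked llama.cpp PR); the binary name, model filename, and port are assumptions, so check the MTP guide for the exact invocation.

```sh
# Minimal sketch, not a verified invocation: the --spec-* flags and values are
# quoted from the post; the binary name, model filename, and port are assumptions.
./llama-server \
  -m models/Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --spec-type draft-mtp \
  --spec-draft-p-min 0.75 \
  --spec-draft-n-max 6 \
  --port 8080

# Per the post: if the updated branch regresses for you, set
#   --spec-draft-p-min 0.0
# to restore the old behavior, and the n-gram drafter can be stacked on top of
# MTP (not yet tuned) with
#   --spec-type ngram-mod,draft-mtp
```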

More info and follow ups:

@UnslothAI Qwen3.5 0.8B, 2B, 4B, 9B updated MTP GGUFs are at https://huggingface.co/unsloth

Misspoke earlier as well: the small GGUFs are Qwen3.5, not Qwen3.6, but we're hoping there will be smaller quants in the future from Qwen!

We're also doing Qwen3.5-122B and Qwen3.5-397B MTP!

1:59 PM · May 15, 2026
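
For anyone pulling the updated small GGUFs, here is a hedged sketch of a Hugging Face CLI download. The repository id and quant filename pattern are assumptions, so browse https://huggingface.co/unsloth for the actual listings.

```sh
# Hedged sketch: the repo id and quant pattern below are assumptions; check
# https://huggingface.co/unsloth for the real repository names.
huggingface-cli download unsloth/Qwen3.5-4B-MTP-GGUF \
  --include "*Q4_K_M*" \
  --local-dir models
```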