1d ago

Unsloth releases MTP GGUFs for Qwen3.6 models


Unsloth released experimental MTP GGUFs for the Qwen3.6-27B and Qwen3.6-35B-A3B models. The files use multi-token prediction (MTP) for native speculative decoding, which a recent llama.cpp update supports, and deliver up to 1.8x faster inference. On a single GPU, the 27B model reaches 140 tokens per second and the 35B-A3B reaches 220 tokens per second. The quantized releases require 18 GB and 22 GB of RAM, respectively.
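To give a rough feel for why a longer draft can raise throughput, here is a minimal sketch of the standard expected-tokens estimate for speculative decoding. It assumes every drafted token is accepted independently with the same probability, which real workloads only approximate. The function name and the acceptance rates are illustrative and are not measurements from these models.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumption: each drafted token is accepted independently with
    probability `acceptance_rate`. The pass always yields at least one
    token, so the value is 1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus the token from the verifier.
        return float(draft_len + 1)
    # Closed form of the geometric sum above.
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Illustrative acceptance rates only; compare draft lengths 2 and 6.
    for rate in (0.5, 0.8):
        for n in (2, 6):
            print(rate, n, round(expected_tokens_per_step(rate, n), 3))
```

This counts tokens per verification pass, not wall-clock speed-up: drafting and verifying extra tokens also cost time, so the measured gain is smaller than the ratio this estimate suggests.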

Original post

Daniel Han #772 @DANIELHANCHEN

Qwen3.6 MTP Unsloth GGUFs now run 1.8x faster, increased from 1.4x just two days ago!

This is due to llama.cpp adding --spec-draft-p-min 0.75!

Args have also changed from --spec-type mtp to --spec-type draft-mtp
Also increase --spec-draft-n-max 2 to 6

We also released Qwen3.6-0.8B, 2B, 4B, 9B MTP GGUFs! We'll be providing more soon!

For folks who find the new updated branch to have some perf regression, set --spec-draft-p-min to 0.0 to get the old behavior - we provided a plot of the old branch (red) vs the new branch (blue / green) as well.

Also you can use 2 speculative decoding algos - you can add ngram via --spec-type ngram-mod,draft-mtp - the perf isn't yet optimized so I'll do more benchmarks to find better numbers - see https://github.com/ggml-org/llama.cpp/pull/22673

Guide for MTP: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

6:10 AM · May 15, 2026
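The flags named in the post can be assembled into a launch command as in the sketch below. The `--spec-type`, `--spec-draft-n-max`, and `--spec-draft-p-min` flags and their values come from the post. The `llama-server` binary, the `-m` model flag, the helper name `build_mtp_command`, and the GGUF file name are assumptions for illustration. Whether a given llama.cpp build accepts these flags depends on the branch, so check the linked guide before relying on it.

```python
import shlex


def build_mtp_command(
    model_path: str,
    draft_n_max: int = 6,          # the post suggests raising this from 2 to 6
    draft_p_min: float = 0.75,     # the post: 0.0 restores the old behavior
    spec_type: str = "draft-mtp",  # previously "mtp", per the post
    binary: str = "llama-server",  # assumption: any llama.cpp binary with these flags
) -> list[str]:
    """Return an argv list for launching llama.cpp with MTP speculative decoding."""
    if draft_n_max < 0:
        raise ValueError("draft_n_max must be non-negative")
    if not 0.0 <= draft_p_min <= 1.0:
        raise ValueError("draft_p_min must be between 0 and 1")
    return [
        binary,
        "-m", model_path,
        "--spec-type", spec_type,
        "--spec-draft-n-max", str(draft_n_max),
        "--spec-draft-p-min", str(draft_p_min),
    ]


if __name__ == "__main__":
    # Hypothetical file name; substitute the GGUF you actually downloaded.
    cmd = build_mtp_command("Qwen3.6-27B-MTP.gguf")
    print(shlex.join(cmd))
```

To try the combined mode mentioned in the post, pass `spec_type="ngram-mod,draft-mtp"`. If the new branch regresses on your hardware, pass `draft_p_min=0.0`.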


Reposted by

#1482@PETESKOMOROCH

QUOTE POST

#772 Daniel Han @DANIELHANCHEN

More info and follow ups:

Quoting Daniel Han @danielhanchen, 1:10 PM · May 15, 2026 · 31.4K Views

1:11 PM · May 15, 2026 · 823 Views

#772 Daniel Han @DANIELHANCHEN

@UnslothAI Qwen3.5 0.8B, 2B, 4B, 9B updated MTP GGUFs are at https://huggingface.co/unsloth

Misspoke earlier as well - those are not Qwen3.6 but Qwen3.5 small GGUFs - but we're hoping there will be smaller quants in the future from Qwen!

We're also doing Qwen3.5-122B and Qwen3.5-397B MTP!

1:59 PM · May 15, 2026 · 2.8K Views