AI · 23d ago

Unsloth releases MTP GGUFs for Qwen3.6 models

Unsloth released experimental MTP (multi-token prediction) GGUFs for the Qwen3.6-27B and Qwen3.6-35B-A3B models. The files use the native speculative decoding support added in a recent llama.cpp update and deliver up to 1.8x faster inference. On a single GPU, the 27B model reaches 140 tokens per second and the 35B-A3B reaches 220 tokens per second. The quantized releases require 18 GB and 22 GB of RAM, respectively.

Original post
Daniel Han @danielhanchen

Qwen3.6 MTP Unsloth GGUFs now run 1.8x faster, increased from 1.4x just two days ago!

This is due to llama.cpp adding --spec-draft-p-min 0.75!

Args have also changed from --spec-type mtp to --spec-type draft-mtp. Also increase --spec-draft-n-max from 2 to 6.
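As a rough illustration of the flag changes described above, the sketch below assembles an argument list from the three options named in the post. The binary name `llama-server`, the helper `build_mtp_command`, and the model file name are assumptions made for illustration; the flags themselves are copied from the post and may only exist in the llama.cpp branch it refers to.

```python
import shlex


def build_mtp_command(model_path, draft_n_max=6, draft_p_min=0.75,
                      spec_type="draft-mtp"):
    """Assemble an argument list with the speculative decoding flags from the post.

    Hypothetical helper: only the flag names and default values come from the
    post; the binary name and argument order are assumptions.
    """
    return [
        "llama-server",               # assumed binary; the post only says llama.cpp
        "-m", model_path,             # path to the MTP GGUF file
        "--spec-type", spec_type,     # renamed from "mtp" to "draft-mtp"
        "--spec-draft-n-max", str(draft_n_max),   # raised from 2 to 6
        "--spec-draft-p-min", str(draft_p_min),   # new option, 0.75 in the post
    ]


# The file name below is a placeholder, not an actual release artifact.
cmd = build_mtp_command("Qwen3.6-27B-MTP.gguf")
print(shlex.join(cmd))
```

Building the command as a list keeps each flag next to its value, which makes it easier to switch a single setting (for example `draft_p_min=0.0`) when comparing runs.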

We also released Qwen3.6-0.8B, 2B, 4B, 9B MTP GGUFs! We'll be providing more soon!

For folks who find the new updated branch to have some perf regression, set --spec-draft-p-min to 0.0 to get the old behavior - we provided a plot of the old branch (red) vs the new branch (blue / green) as well.
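The post does not spell out how the threshold is applied. One plausible reading, assumed here, is that drafting stops as soon as the drafter's confidence in its next token falls below --spec-draft-p-min, so a value of 0.0 never stops early. The sketch below models only that assumed rule; the function name and the probabilities are made up for illustration.

```python
def drafted_token_count(draft_probs, n_max=6, p_min=0.75):
    """Count draft tokens kept before confidence drops below p_min.

    Assumed rule, not taken from llama.cpp source: stop at the first draft
    token whose probability is below p_min, and never exceed n_max tokens.
    """
    kept = 0
    for prob in draft_probs[:n_max]:
        if prob < p_min:
            break
        kept += 1
    return kept


# Made-up per-token confidences for one drafting round.
probs = [0.98, 0.91, 0.60, 0.95, 0.97, 0.99]

print(drafted_token_count(probs, p_min=0.75))  # 2: stops at the 0.60 token
print(drafted_token_count(probs, p_min=0.0))   # 6: threshold never triggers
```

Under this reading, a higher threshold trades fewer drafted tokens for fewer rejected ones, which may explain why some setups see a regression while others see a gain.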

Also you can use 2 speculative decoding algos - you can add ngram via --spec-type ngram-mod,draft-mtp - the perf isn't yet optimized so I'll do more benchmarks to find better numbers - see https://github.com/ggml-org/llama.cpp/pull/22673

Guide for MTP: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

6:10 AM · May 15, 2026 · 31.4K Views
Sentiment

Users are excited about the 1.8x inference speedup from Unsloth's MTP GGUFs for Qwen3.6 thanks to easy CLI setup and solid performance gains, while others report regressions, memory spikes, and slower speeds on RTX GPUs and Macs.

Pos 64.2% · Neg 35.8% · 29 comments with sentiment.
Posts from X
Most Activity
Views 13.6K · Bookmarks 62 · Likes 86 · Retweets 9 · Replies 8

💪 Unsloth pushed Qwen3.6 MTP even further.

⚡ Qwen3.6 MTP models jumped from 1.4x → 1.8x faster in the last 2 days

Thanks to a new llama.cpp update: --spec-draft-p-min 0.75 + --spec-type draft-mtp

They also raised --spec-draft-n-max from 2 → 6 for more aggressive drafting.

✅ Bigger speedups on local inference
✅ Still works with simple CLI flags
✅ New small MTP GGUFs released too (0.8B–9B)

Local Qwen just got quicker.

22d · Views 13.6K · Likes 86 · Bookmarks 62
Daniel Han @danielhanchen

@UnslothAI Qwen3.5 0.8B, 2B, 4B, 9B updated MTP GGUFs are at https://huggingface.co/unsloth

Mis-spoke earlier as well - not Qwen3.6* but Qwen3.5 small GGUFs - but we're hoping there will be smaller quants in the future from Qwen!

We're also doing Qwen3.5-122B and Qwen3.5-397B MTP!

23d · Views 2.8K · Likes 23 · Bookmarks 6
Sakura Yuki @sakurayukiai

@TeksEdge @danielhanchen @UnslothAI MTP is the only way speculative decoding actually makes sense on a local GPU. Running a separate draft model just burns VRAM, but self-drafting is basically free speed.

22d · Views 124 · Likes 4
Janvitos @janvitos

@danielhanchen @UnslothAI Hey @danielhanchen, there seems to be a regression with the latest mtp-clean commits. On my 4070 12GB, with --spec-draft-n-max 7, acceptance rate drops from 0.95 to 0.55, and throughput drops from 90 tok/sec to 65 tok/sec. So I had to revert to --spec-draft-n-max 3.

23d · Views 516
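To see why an acceptance drop of the size reported above matters more at long draft lengths, here is a back-of-the-envelope model. It assumes each draft token is accepted independently with the same probability and that every verification step yields one extra token from the main model. That is a textbook simplification, not how llama.cpp computes its reported acceptance rate, and it ignores the cost of drafting.

```python
def expected_tokens_per_step(accept_prob: float, n_draft: int) -> float:
    """Expected tokens produced per verification step.

    Simplified model (an assumption, see lead-in): draft tokens are accepted
    independently with probability accept_prob, drafting stops at the first
    rejection, and the main model always contributes one token.
    Closed form: (1 - a^(n+1)) / (1 - a).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    if accept_prob == 1.0:
        return float(n_draft + 1)
    return (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)


# Acceptance rates and draft lengths taken from the report above.
for accept_prob, n_draft in [(0.95, 7), (0.55, 7), (0.95, 3)]:
    tokens = expected_tokens_per_step(accept_prob, n_draft)
    print(f"accept={accept_prob:.2f} n_draft={n_draft} -> {tokens:.2f} tokens per step")
```

Under these assumptions, most of a 7-token draft is wasted once acceptance falls to 0.55, which is at least consistent with a shorter draft length performing better in that situation.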
Kamil Skowron @kamilskowron

@danielhanchen @UnslothAI I think you meant "released Qwen3.5-0.8B, 2B, 4B, 9B MTP GGUFs" 😉 Thank you for those 🙏 is the same guide (MTP) applying to them? (as you have a separate guide for 3.5)

23d · Views 460 · Likes 4
Daniel Han @danielhanchen

@janvitos @UnslothAI Could you try setting `--spec-draft-p-min` to 0.0 and check again

23d · Views 448
Daniel Han @danielhanchen

@bruce_x_offi @UnslothAI I was working on a PR for llama.cpp haha but I abandoned it for now - I'll re-take a look over the weekend!

https://github.com/unslothai/llama.cpp/pull/14

23d · Views 426 · Likes 3
Joel - coffee/acc @JoelDeTeves

@danielhanchen @UnslothAI Hi Daniel, I am for some reason seeing a performance drop with these settings (RTX 3090): went from 67 tokens/sec down to 61 tokens/sec on coding, and from 51 tokens/sec to 33 tokens/sec on general.

22d · Views 87
Aaron Stannard @Aaronontheweb

@0xfldr @danielhanchen @UnslothAI I think I may have turned the n-gram based decoding off because it mutilated tool calls but otherwise this is what I'm running on vulkan

22d · Views 11
Sakura Yuki @sakurayukiai

@danielhanchen @UnslothAI People sleep on speculative decoding because it used to be a headache to set up. Now it's just two CLI flags for a 1.8x speedup. The local Qwen stack is getting ridiculous.

23d · Views 212 · Likes 1
Aaron Stannard @Aaronontheweb

@danielhanchen @UnslothAI I can't wait until llama.cpp et al get ROCm support for MTP. Been itching to try this

23d · Views 140 · Likes 1
Carlo @Italianclownz

@TeksEdge @danielhanchen @UnslothAI Unsloth is one of the most dedicated at making LLMs accessible to thousands of people. Always appreciate what they do

22d · Views 90 · Likes 1
darthsider @thedarthsider

@danielhanchen @UnslothAI I'm seeing a huge memory spike when using MTP for some reason. What used to fit on 2 x 16GB GPUs now gets OOM even if I change KV quantization from Q8 to Q4.

Model: Qwen 3.6 27B

23d · Views 236
Crown 👑 @barackomaba

@danielhanchen @UnslothAI On the original I saw you showed that 2 drafts was optimal, but in all my tests I found 4 was the sweet spot. Not sure why. It was very consistent (I haven't tried this new update, I'll try it now).

22d · Views 45 · Likes 1
Strata @ChainZenit

@danielhanchen @UnslothAI 1.8x already? wild gains so quick

23d · Views 42 · Likes 1
Daniel Han @danielhanchen

@kamilskowron @UnslothAI OOO haha oops 3.5*** yes

23d · Views 381 · Likes 3
SGode @Seb_Gode

@TeksEdge @danielhanchen @UnslothAI Sadly Prefill speed needs to still be solved. 35B-A3B runs fine decode wise, but on long context (120k+) Prefill is just awfully slow which leads to long TTFT. Not sure how I can solve this sadly because that annoys me the most

22d · Views 36 · Likes 1
Angelo M. Calvão @angelo_m_calvao

@danielhanchen @UnslothAI Does it work well on AMD GPUs?

23d · Views 102
Janvitos @janvitos

@danielhanchen @UnslothAI Just tried with --spec-draft-p-min 0.0 and results are pretty much identical.

23d · Views 99
fldr @0xfldr

@Aaronontheweb @danielhanchen @UnslothAI It works with vulkan currently I'm using AMD iGPUs

23d · Views 16 · Likes 1