/Tech23d ago

Unsloth AI releases MTP-optimized GGUF files for Qwen3.6-27B and Qwen3.6-35B-A3B on Hugging Face delivering 1.4 to 2.2 times faster generation

AI Judge changed title after evaluation, original title: "Unsloth AI released MTP-optimized GGUFs for the Qwen3.6-27B and Qwen3.6-35B-A3B models that enable 1.4–2.2× faster generation reaching 240 tokens per second"

llama.cpp merged native MTP support on May 16 for Qwen3.6 models.

3194.9K6053.4K535K
Original postDaniel Han#823
Unsloth AI@UnslothAI

Qwen3.6 now runs 2x faster with MTP GGUFs! Run locally on just 18GB RAM. ⚡️

MTP enables Qwen3.6 to generate ~1.4–2.2× faster with no accuracy change.

Qwen3.6-27B MTP runs at 160 tokens/s. 35B-A3B reaches 240 t/s.

GGUFs: https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF Guide: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

6:40 AM · May 18, 2026 · 114.6K Views
Sentiment
Sentiment building, check back later.
Cluster Engagement
Posts from X
Most Activity
Most Activity
VIEWS252.2KLIKES1.2KRETWEETS180REPLIES48
Georgi Gerganov@ggerganov

llama.cpp adds MTP for the Qwen3.6 family

This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further.

Special thanks to Aman Gupta for leading this development!

https://github.com/ggml-org/llama.cpp/pull/22673

23dViews 252.2KLikes 1.2KBookmarks 513
BOOKMARKS912
Victor M@victormustar

llama.cpp with MTP support makes local models fast enough to use as daily drivers 🚀

Qwen3.6-27B dense generation (on A10G): From 25 tok/s → 45 tok/s (+78%).

Two flags on llama-server: --spec-type draft-mtp --spec-draft-n-max 2

Georgi Gerganov@ggerganov

llama.cpp adds MTP for the Qwen3.6 family

This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further.

Special thanks to Aman Gupta for leading this development!

https://github.com/ggml-org/llama.cpp/pull/22673

23dViews 145.5KLikes 1.1KBookmarks 912

I've seen some confusion online on how to run llama.cpp with MTP (Multi-token prediction) in the simplest way possible.

ICYMI, MTP is a new flavor of speculative decoding built-in to the model itself, that ~2x your tokens per sec for most use cases.

2x generation speed = Truly a game changer. 🔥

How to run it?

brew upgrade llama.cpp # or you might need to install from source until build 9200 is in your package manager: brew install llama.cpp --HEAD

Then pick either the Dense 27B or the 35B A3B MoE.

Personally I tend to stick to the Dense model where I achieve ~30 tok/sec on my machine. The MoE is of course way faster at an impressive ~100 tok/sec on my machine. Truly rapid. ⚡️

In both cases you probably want 48GB or better 64GB RAM or VRAM, though 36GB might work with more strongly-quantized versions.

# Dense:

llama-server -hf ggml-org/Qwen3.6-27B-MTP-GGUF --spec-type draft-mtp --spec-draft-n-max 2

# MoE:

llama-server -hf ggml-org/Qwen3.6-35B-A3B-MTP-GGUF --spec-type draft-mtp --spec-draft-n-max 3

Enjoy!

22dViews 17.8KLikes 382Bookmarks 368
merve@mervenoyann

finally faster Qwen3.6 models with MTP support ⚡️

brb updating my Pi & Hermes setup 🤝

Georgi Gerganov@ggerganov

llama.cpp adds MTP for the Qwen3.6 family

This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further.

Special thanks to Aman Gupta for leading this development!

https://github.com/ggml-org/llama.cpp/pull/22673

23dViews 4.9KLikes 60Bookmarks 6
Eeshan@notesundrground

@julien_c I wrote a gist about this with all the flags that I've been using. 35B works super smooth on my Mac M4 Pro 48 GB, and 27B is usable as well. The MTP update definitely helped.

https://gist.github.com/eeshansrivastava89/85797104af34181944bfd1360d69e8af

22dViews 138Likes 2Bookmarks 1

@victormustar seems shitty and slow try llama-server flags: -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --port 8080 --host 0.0.0.0

23dViews 5Likes 1Bookmarks 1
zIGGY@_z_I_G_G_Y_

@victormustar That only works on dense models right? And you have to load a second llm and run that at the same time? No real gain with moe models is that correct?

22dViews 494

@UnslothAI @Alibaba_Qwen You have a typo in the MTP Qwen3.6-27B code sample:

Instead of:

-hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \

It should be:

-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \

Also it would be amazing to have similar tests/results with AMD GPU ROCm hardware

23dViews 153Likes 1Bookmarks 1

@ggerganov its cute - but we gotta consider diffusion instead of auto-regression - I need more compute to verify claims - but it's alegedly 4x on base qwen 2.5 - testing on qwen 3.6 now - https://github.com/pengzhangzhi/Open-dLLM/pull/27

22dViews 292Bookmarks 1
Sergei Fonov@SergeiFonov

@ggerganov performance gains this are what make local ai move from cool demo to daily use tool. i'm building local ai on the iphone side, and the pattern is the same the more reliable local inference gets, the more private workflows become possible

22dViews 248Bookmarks 1
Unsloth AI@UnslothAI

@Alibaba_Qwen Here’s 4-bit Qwen3.6-27B-MTP-GGUF hitting 96.4 tokens/s in Unsloth Studio on an H100.

Try it yourself: https://github.com/unslothai/unsloth

23dViews 339Likes 3
Grok@grok

MTP = Multi-Token Prediction.

It's a new speculative decoding feature in llama.cpp for Qwen3.6 models (special MTP GGUF versions). The model predicts multiple tokens at once instead of one-by-one, giving ~1.5-2x faster generation with no quality loss.

Enable with --spec-type draft-mtp. Game changer for local inference.

23dViews 18Likes 1Bookmarks 1
Daniel Moll@rumgewieselt

@ggerganov Its fantastic - thanks a lot to both of you! For me its a game changer with my hardware from 2017 ...

23dViews 38Bookmarks 1

@victormustar @ClementDelangue i was already building llama.cpp from source and using dflash+turboquant. not sure MTP is going to beat that

23dViews 30Bookmarks 1
Victor M@victormustar

@_z_I_G_G_Y_ no it works on MoE too, just way less gain: to check N drafts at once you have to load all the experts they touch, so you lose the MoE speed advantage. (from my tests: dense ~+78% & for the A3B ~+10%)

22dViews 277Likes 2
Benjamin Babik@localoptimiser

@Chris65536 @HarrisDePercept @UnslothAI @Alibaba_Qwen These are MTP numbers. It's a little over half in either case without it. I can run the 27b at 6bit without it all day but I prefer the shorter feedback. Models are great but they're dumb so I'd rather have more attempts on goal and have them "remember what didn't work".

23dViews 5Likes 1
Rameswar@rameswar08

@UnslothAI @Alibaba_Qwen I'm new to local LLM's, do I need a GPU?

I'm having 64GB RAM and 6GB VRAM with a nvidia card

22dViews 181Likes 2
uttertard@uttertard

@victormustar AFAIK you lose a lot of quality with --spec-draft-n-max 2 , from my understanding 3 is widespread considered the bottom, 4 is optimal, and higher than that is pretty much lossless but at the cost of substantially reduced speed gains, but I may be wrong.

22dViews 325
Victor M@victormustar

@uttertard mhh I dont see how you loose quality: spec decoding is lossless by design. the main model verifies every draft token, so output is identical regardless of n-max. so (at least from my undersanding) n only affects speed, not quality.

22dViews 274
Load more posts