/Tech23d ago

Unsloth AI releases MTP-optimized GGUF files for Qwen3.6-27B and Qwen3.6-35B-A3B on Hugging Face delivering 1.4 to 2.2 times faster generation

AI Judge changed title after evaluation, original title: "Unsloth AI released MTP-optimized GGUFs for the Qwen3.6-27B and Qwen3.6-35B-A3B models that enable 1.4–2.2× faster generation reaching 240 tokens per second"

llama.cpp merged native MTP support on May 16 for Qwen3.6 models.

3194.9K6053.4K535K

#34

Original post

Daniel Han#823

Unsloth AI@UnslothAI

Qwen3.6 now runs 2x faster with MTP GGUFs! Run locally on just 18GB RAM. ⚡️

MTP enables Qwen3.6 to generate ~1.4–2.2× faster with no accuracy change.

Qwen3.6-27B MTP runs at 160 tokens/s. 35B-A3B reaches 240 t/s.

GGUFs: https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF Guide: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

6:40 AM · May 18, 2026 · 114.6K Views

/Tech23d ago

Unsloth AI releases MTP-optimized GGUF files for Qwen3.6-27B and Qwen3.6-35B-A3B on Hugging Face delivering 1.4 to 2.2 times faster generation

llama.cpp merged native MTP support on May 16 for Qwen3.6 models.

3194.9K6053.4K535K

#34

Original post

Daniel Han#823

Unsloth AI@UnslothAI

Qwen3.6 now runs 2x faster with MTP GGUFs! Run locally on just 18GB RAM. ⚡️

MTP enables Qwen3.6 to generate ~1.4–2.2× faster with no accuracy change.

Qwen3.6-27B MTP runs at 160 tokens/s. 35B-A3B reaches 240 t/s.

GGUFs: https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF Guide: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

6:40 AM · May 18, 2026 · 114.6K Views

Sentiment

Sentiment building, check back later.

Cluster Engagement

Posts from X

Most Activity

VIEWS252.2KLIKES1.2KRETWEETS180REPLIES48

Georgi Gerganov@ggerganov

llama.cpp adds MTP for the Qwen3.6 family

This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further.

Special thanks to Aman Gupta for leading this development!

https://github.com/ggml-org/llama.cpp/pull/22673

23d252.2K1.2K513

BOOKMARKS912

Victor M@victormustar

llama.cpp with MTP support makes local models fast enough to use as daily drivers 🚀

Qwen3.6-27B dense generation (on A10G): From 25 tok/s → 45 tok/s (+78%).

Two flags on llama-server: --spec-type draft-mtp --spec-draft-n-max 2

Georgi Gerganov@ggerganov

llama.cpp adds MTP for the Qwen3.6 family

This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further.

Special thanks to Aman Gupta for leading this development!

https://github.com/ggml-org/llama.cpp/pull/22673

23d145.5K1.1K912

Julien Chaumond@julien_c

I've seen some confusion online on how to run llama.cpp with MTP (Multi-token prediction) in the simplest way possible.

ICYMI, MTP is a new flavor of speculative decoding built-in to the model itself, that ~2x your tokens per sec for most use cases.

2x generation speed = Truly a game changer. 🔥

How to run it?

brew upgrade llama.cpp # or you might need to install from source until build 9200 is in your package manager: brew install llama.cpp --HEAD

Then pick either the Dense 27B or the 35B A3B MoE.

Personally I tend to stick to the Dense model where I achieve ~30 tok/sec on my machine. The MoE is of course way faster at an impressive ~100 tok/sec on my machine. Truly rapid. ⚡️

In both cases you probably want 48GB or better 64GB RAM or VRAM, though 36GB might work with more strongly-quantized versions.

# Dense:

llama-server -hf ggml-org/Qwen3.6-27B-MTP-GGUF --spec-type draft-mtp --spec-draft-n-max 2

# MoE:

llama-server -hf ggml-org/Qwen3.6-35B-A3B-MTP-GGUF --spec-type draft-mtp --spec-draft-n-max 3

Enjoy!

22d17.8K382368

merve@mervenoyann

finally faster Qwen3.6 models with MTP support ⚡️

brb updating my Pi & Hermes setup 🤝

Georgi Gerganov@ggerganov

llama.cpp adds MTP for the Qwen3.6 family

This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further.

Special thanks to Aman Gupta for leading this development!

https://github.com/ggml-org/llama.cpp/pull/22673

23d4.9K606

Learts【ツ】| Sura 🇧🇷@learts

@LongOptimist23b @victormustar How many t/s are you getting?

23d522

Eeshan@notesundrground

@julien_c I wrote a gist about this with all the flags that I've been using. 35B works super smooth on my Mac M4 Pro 48 GB, and 27B is usable as well. The MTP update definitely helped.

https://gist.github.com/eeshansrivastava89/85797104af34181944bfd1360d69e8af

22d13821

X.com_sux_my_balls@xsux_cok

@victormustar seems shitty and slow try llama-server flags: -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --port 8080 --host 0.0.0.0

23d511

zIGGY@_z_I_G_G_Y_

@victormustar That only works on dense models right? And you have to load a second llm and run that at the same time? No real gain with moe models is that correct?

22d494

Alexandre Mutel@xoofx

@UnslothAI @Alibaba_Qwen You have a typo in the MTP Qwen3.6-27B code sample:

Instead of:

-hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \

It should be:

-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \

Also it would be amazing to have similar tests/results with AMD GPU ROCm hardware

23d15311

John D. Pope 🦒@johndpope

@ggerganov its cute - but we gotta consider diffusion instead of auto-regression - I need more compute to verify claims - but it's alegedly 4x on base qwen 2.5 - testing on qwen 3.6 now - https://github.com/pengzhangzhi/Open-dLLM/pull/27

22d2921

Sergei Fonov@SergeiFonov

@ggerganov performance gains this are what make local ai move from cool demo to daily use tool. i'm building local ai on the iphone side, and the pattern is the same the more reliable local inference gets, the more private workflows become possible

22d2481

Unsloth AI@UnslothAI

@Alibaba_Qwen Here’s 4-bit Qwen3.6-27B-MTP-GGUF hitting 96.4 tokens/s in Unsloth Studio on an H100.

Try it yourself: https://github.com/unslothai/unsloth

23d3393

Grok@grok

MTP = Multi-Token Prediction.

It's a new speculative decoding feature in llama.cpp for Qwen3.6 models (special MTP GGUF versions). The model predicts multiple tokens at once instead of one-by-one, giving ~1.5-2x faster generation with no quality loss.

Enable with --spec-type draft-mtp. Game changer for local inference.

23d1811

Daniel Moll@rumgewieselt

@ggerganov Its fantastic - thanks a lot to both of you! For me its a game changer with my hardware from 2017 ...

23d381

@brock@hachyderm.io@bchap1n

@victormustar @ClementDelangue i was already building llama.cpp from source and using dflash+turboquant. not sure MTP is going to beat that

23d301

Victor M@victormustar

@_z_I_G_G_Y_ no it works on MoE too, just way less gain: to check N drafts at once you have to load all the experts they touch, so you lose the MoE speed advantage. (from my tests: dense ~+78% & for the A3B ~+10%)

22d2772

Benjamin Babik@localoptimiser

@Chris65536 @HarrisDePercept @UnslothAI @Alibaba_Qwen These are MTP numbers. It's a little over half in either case without it. I can run the 27b at 6bit without it all day but I prefer the shorter feedback. Models are great but they're dumb so I'd rather have more attempts on goal and have them "remember what didn't work".

23d51

Rameswar@rameswar08

@UnslothAI @Alibaba_Qwen I'm new to local LLM's, do I need a GPU?

I'm having 64GB RAM and 6GB VRAM with a nvidia card

22d1812

uttertard@uttertard

@victormustar AFAIK you lose a lot of quality with --spec-draft-n-max 2 , from my understanding 3 is widespread considered the bottom, 4 is optimal, and higher than that is pretty much lossless but at the cost of substantially reduced speed gains, but I may be wrong.

22d325

Victor M@victormustar

@uttertard mhh I dont see how you loose quality: spec decoding is lossless by design. the main model verifies every draft token, so output is identical regardless of n-max. so (at least from my undersanding) n only affects speed, not quality.

22d274