
LLaMA.cpp Patch Delivers 40% Faster Qwen 27B Inference on M5 Max


Original post

Qwen 3.6 27B on a MacBook Pro M5 Max 64GB hitting 34 tokens per second, locally with atomic[.]chat

90% acceptance rate, i.e. most draft tokens matched what the main model would have produced, so the speed gain is not from skipping quality checks, but from avoiding repeated full-cost decoding work.

TurboQuant and GGUF handle the storage and runtime side: the model is compressed enough to run locally, while llama.cpp can feed Apple Silicon efficiently instead of waiting on huge weight movement.

A pretty serious local-inference result: it changes what “laptop AI” can feel like.
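The acceptance-rate point in the post can be illustrated with a minimal sketch of greedy draft-then-verify decoding. This is a generic illustration, not the patched llama.cpp code or its MTP drafter: toy_target, toy_draft, and the draft length k are hypothetical stand-ins, and a real runtime would verify all draft tokens in one batched pass of the main model rather than in a Python loop. What it shows is that every emitted token is one the main model would have chosen anyway, so the output matches plain decoding regardless of how good the drafts are.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token context to the next token


def greedy(prompt: List[int], target_next: NextToken, n: int) -> List[int]:
    """Reference decoding: one full-cost target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_step(ctx: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, keep the prefix the target agrees with.

    Returns (emitted tokens, number of draft tokens accepted).
    """
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(ctx + draft))

    emitted: List[int] = []
    for tok in draft:
        expected = target_next(ctx + emitted)
        if tok != expected:
            # First mismatch: emit the target's own token and drop the rest.
            emitted.append(expected)
            return emitted, len(emitted) - 1
        emitted.append(tok)

    # Every draft token matched; the verification pass yields one extra token.
    emitted.append(target_next(ctx + emitted))
    return emitted, k


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             n: int, k: int) -> Tuple[List[int], float]:
    """Generate n tokens; also return accepted / drafted as the acceptance rate."""
    out = list(prompt)
    drafted = accepted = 0
    while len(out) - len(prompt) < n:
        emitted, acc = speculative_step(out, draft_next, target_next, k)
        out.extend(emitted)
        drafted += k
        accepted += acc
    return out[len(prompt):len(prompt) + n], accepted / drafted


# Hypothetical toy models: the draft agrees with the target except at
# every tenth context length, where it guesses wrong on purpose.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 17


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    return (guess + 1) % 17 if len(ctx) % 10 == 0 else guess


tokens, rate = generate([1], toy_draft, toy_target, n=50, k=3)
print("same output as plain decoding:", tokens == greedy([1], toy_target, 50))
print("acceptance rate:", round(rate, 3))
```

A low acceptance rate in this sketch only costs speed, never correctness: rejected drafts are replaced by the main model's token.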

atomic.chat @atomic_chat_hq

Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp!

+40% performance! 90% acceptance rate. Running locally on a MacBook Pro M5 Max 64GB

We patched LLaMA.cpp, quantized Qwen 3.6 27B into GGUF format with TurboQuant and shipped MTP drafts on top. Benchmark, Source code & models👇

10:34 PM · May 13, 2026 · 8.6K Views
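The two headline numbers can be related with simple arithmetic. The sketch below assumes, which the posts do not state explicitly, that 34 tokens per second is the MTP-enabled figure and that "+40%" is measured against the same setup without MTP drafts. The function names are illustrative.

```python
def baseline_tps(boosted_tps: float, gain: float) -> float:
    """Throughput before a relative gain, e.g. gain=0.40 for +40%."""
    return boosted_tps / (1.0 + gain)


def ms_per_token(tps: float) -> float:
    """Average latency per generated token in milliseconds."""
    return 1000.0 / tps


with_mtp = 34.0                           # tokens/sec, from the post
without_mtp = baseline_tps(with_mtp, 0.40)  # implied baseline under the assumption above

print(f"implied baseline: {without_mtp:.1f} tok/s")
print(f"latency with MTP: {ms_per_token(with_mtp):.1f} ms/token")
print(f"latency without:  {ms_per_token(without_mtp):.1f} ms/token")
```

Under that assumption the implied baseline is roughly 24 tokens per second, or about 41 ms per token against about 29 ms with MTP.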


Posts from X


Rohan Paul @rohanpaul_ai

GGUF model: http://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF

Local AI Models Studio: http://atomic.chat

Patched LLaMA.cpp: http://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant




ImL1s @iml1s

@rohanpaul_ai 34 tokens/sec on M5 Max for Qwen 3.6 27B is serious. Laptop AI capabilities are improving fast.


Fahis @fahism767

@rohanpaul_ai 34 tok/s on Apple Silicon is the real story. The efficiency gap keeps widening.


Arslan Yousaf @Arslandev97

@rohanpaul_ai If those numbers hold up in broader testing, that’s a strong signal for how far local inference on Apple Silicon has come, especially the combination of speed + high acceptance rate.


Jasper 🌰@building BBX @oknextlin

@rohanpaul_ai Regulatory clarity is net positive long-term. Clear rules attract institutional capital. Short-term pain, long-term gain.


Alex @AlexFromAtomic

@iml1s @rohanpaul_ai yeah, and the wild part is it's a laptop, not a workstation. m-series unified memory + speculative decoding on top is starting to look like a real production setup for local inference.


Sathish Harry @SathishAiHype

@rohanpaul_ai Most speculative decoding trades quality for speed. TurboQuant keeps coherence while hitting 34 tps. This finally makes local 27B models competitive with cloud Sonnet for daily work. Apple Silicon + modern quantization just ate another chunk of OpenAI/Anthropic’s lunch.


Alex @AlexFromAtomic

@fahism767 @rohanpaul_ai yeah that's the wild part - m5 max with 64gb pulling 34 tok/s on a 27b model is wild compared to what apple silicon was doing a year ago. the gap with discrete gpus on bandwidth-bound workloads gets smaller every generation.
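The "bandwidth-bound" remark can be made concrete with a rough ceiling: if each generated token requires reading the full set of weights once, decode speed cannot exceed memory bandwidth divided by the weight size. The sketch below uses placeholder values only. The 4 bits per weight and the 500 GB/s bandwidth are hypothetical illustrations, not the TurboQuant bit-width or the M5 Max specification, and the estimate ignores the KV cache and compute time.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8.0


def decode_ceiling_tps(bandwidth_bytes_per_s: float, bytes_per_token: float) -> float:
    """Upper bound on tokens/sec when each token needs one full weight read."""
    return bandwidth_bytes_per_s / bytes_per_token


N_PARAMS = 27e9            # 27B parameters, from the post
BITS_PER_WEIGHT = 4.0      # hypothetical; the real quantization may differ
BANDWIDTH = 500e9          # hypothetical 500 GB/s; substitute the measured figure

size = weight_bytes(N_PARAMS, BITS_PER_WEIGHT)
ceiling = decode_ceiling_tps(BANDWIDTH, size)

print(f"weights: {size / 1e9:.1f} GB")
print(f"one-token-per-read ceiling: {ceiling:.1f} tok/s")
```

This also hints at why draft-and-verify decoding helps on such hardware: checking several draft tokens in one pass shares a single weight read across them, so accepted drafts raise tokens per second without raising bytes moved per pass.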


Alex @AlexFromAtomic

@Arslandev97 @rohanpaul_ai fair caveat - that's why everything's open. the fork, the gguf models, the bench scripts are all on github. anyone with an m-series can reproduce in 10 minutes. would love more numbers from broader hardware (m1/m2/m3 max) - comparing across generations would be useful.


Agent Alpha @agentalpha_xyz

@rohanpaul_ai 34 t/s at 90% acceptance on m5
