
llama.cpp Patch Delivers 40% Faster Qwen 27B Inference on M5 Max

Original post
Rohan Paul @rohanpaul_ai

Qwen 3.6 27B on a MacBook Pro M5 Max 64GB hitting 34 tokens per sec, locally with atomic[.]chat

90% acceptance rate, i.e. most draft tokens matched what the main model would have produced, so the speed gain is not from skipping quality checks, but from avoiding repeated full-cost decoding work.

TurboQuant and GGUF handle the storage and runtime side: the model is compressed enough to run locally, while llama.cpp can feed Apple Silicon efficiently instead of waiting on huge weight movement.

Pretty serious local-inference result, changes what “laptop AI” can feel like.

10:34 PM · May 13, 2026 · 8.6K Views
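
The link between acceptance rate and speed in the post can be sketched with the standard speculative-decoding estimate. This is a rough illustration, not the method used in the patched fork: it assumes each draft token is accepted independently with probability `alpha`, that `k` draft tokens are proposed per verification pass (the value of `k` is hypothetical, the post does not state it), and that the headline's 40% refers to the 34 tokens per sec figure.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumption: each of the k draft tokens is accepted independently with
    probability alpha, and the target model always contributes one extra
    token (a correction after a rejection, or a bonus token if all pass).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def implied_baseline_tps(tps: float, speedup_pct: float) -> float:
    """Baseline throughput implied by a reported throughput and a percent gain."""
    return tps / (1.0 + speedup_pct / 100.0)


if __name__ == "__main__":
    # alpha = 0.9 comes from the post; k = 3 is a hypothetical draft length.
    print(expected_tokens_per_pass(0.9, 3))
    # 34 tokens per sec and 40% come from the post and the headline.
    print(implied_baseline_tps(34.0, 40.0))
```

Tokens per pass is an upper bound on the speedup, not the speedup itself: producing the draft tokens also costs time, which is one plausible reason a 90% acceptance rate yields a gain closer to 40% than to several times.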
Posts from X
Rohan Paul @rohanpaul_ai

GGUF model: http://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF

Local AI Models Studio: http://atomic.chat

Patched LLaMA.cpp: http://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant

23d · Views 1.4K · Likes 3 · Bookmarks 4
RETWEETS 7
Rohan Paul @rohanpaul_ai

(Original post, as quoted above.)

23d · Views 8.6K · Likes 54 · Bookmarks 25
REPLIES 1
ImL1s @iml1s

@rohanpaul_ai 34 tokens/sec on M5 Max for Qwen 3.6 27B is serious. Laptop AI capabilities are improving fast.

23d · Views 50 · Likes 1
Fahis @fahism767

@rohanpaul_ai 34 tok/s on Apple Silicon is the real story. The efficiency gap keeps widening.

23d · Views 41 · Likes 1
Arslan Yousaf @Arslandev97

@rohanpaul_ai If those numbers hold up in broader testing, that’s a strong signal for how far local inference on Apple Silicon has come, especially the combination of speed + high acceptance rate.

23d · Views 38 · Likes 1

@rohanpaul_ai Regulatory clarity is net positive long-term. Clear rules attract institutional capital. Short-term pain, long-term gain.

23d · Views 43
Alex @AlexFromAtomic

@iml1s @rohanpaul_ai yeah, and the wild part is it's a laptop, not a workstation. m-series unified memory + speculative decoding on top is starting to look like a real production setup for local inference.

23d · Views 12 · Likes 1
Sathish Harry @SathishAiHype

@rohanpaul_ai Most speculative decoding trades quality for speed. TurboQuant keeps coherence while hitting 34 tps. This finally makes local 27B models competitive with cloud Sonnet for daily work. Apple Silicon + modern quantization just ate another chunk of OpenAI/Anthropic’s lunch.

23d · Views 21
Alex @AlexFromAtomic

@fahism767 @rohanpaul_ai yeah that's the wild part - m5 max with 64gb pulling 34 tok/s on a 27b model is wild compared to what apple silicon was doing a year ago. the gap with discrete gpus on bandwidth-bound workloads gets smaller every generation.

23d · Views 21
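
The "bandwidth-bound" point in the reply above can be made concrete with a back-of-the-envelope ceiling: if every decoded token requires streaming all model weights from memory once, throughput cannot exceed memory bandwidth divided by the weight footprint. The sketch below is a simplification (it ignores KV-cache traffic, compute limits, and overheads), and the inputs of 4 bits per weight and 400 GB/s are hypothetical placeholders, not measured specs of the M5 Max or of the GGUF file in the thread.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def bandwidth_bound_tps(bandwidth_gb_s: float, weights_gb: float,
                        tokens_per_pass: float = 1.0) -> float:
    """Upper bound on decode throughput when weight streaming is the bottleneck.

    Assumption: each forward pass reads every weight once. With speculative
    decoding, one pass can emit more than one token, which tokens_per_pass
    models as a simple multiplier.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb * tokens_per_pass


if __name__ == "__main__":
    # Hypothetical inputs: 27e9 parameters at 4 bits each, 400 GB/s bandwidth.
    size_gb = weight_footprint_gb(27e9, 4.0)
    print(size_gb)
    print(bandwidth_bound_tps(400.0, size_gb))
```

Under this model, the two levers mentioned in the thread act on different terms: quantization shrinks `weights_gb`, while speculative decoding raises `tokens_per_pass`.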
Alex @AlexFromAtomic

@Arslandev97 @rohanpaul_ai fair caveat - that's why everything's open. the fork, the gguf models, the bench scripts are all on github. anyone with an m-series can reproduce in 10 minutes. would love more numbers from broader hardware (m1/m2/m3 max) - comparing across generations would be useful.

23d · Views 9
Agent Alpha @agentalpha_xyz

@rohanpaul_ai 34 t/s at 90% acceptance on m5

23d · Views 1