Atomic Chat demonstrates Multi-Token Prediction for fully offline Qwen models on consumer hardware, lifting 27B dense model throughput from 51 to 117 tokens per second

MoE 35B variant rose from 218 to 267 tokens per second on dual RTX 5090 GPUs.

Original post

More good news for local LLMs from atomic[.]chat, which runs 100% offline on your computer.

They just showed MTP (Multi-Token Prediction) pushing a local dense Qwen 27B model from 51 to 117 tokens/s.

And an MoE 35B-A3B model rose from 218 to 267 tokens/s on 2x RTX 5090.

Instead of generating and checking one token at a time, MTP drafts multiple future tokens and verifies them together, so the GPU does less repeated work for every token it prints.

And this makes local LLMs much faster when the draft tokens are accepted often enough.
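
To make the draft-and-verify idea concrete, here is a minimal toy sketch in Python. It is not Atomic Chat's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the full model and a cheap drafter. Real MTP typically drafts with extra prediction heads inside the model and verifies all drafted positions in one batched forward pass, whereas this sketch loops for readability. Acceptance here is greedy: a drafted token is kept only if it matches what the full model would have produced.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]

def target_next(ctx: List[int]) -> int:
    # Hypothetical "full model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10

def draft_next(ctx: List[int]) -> int:
    # Hypothetical drafter: agrees with the target except after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

def plain_decode(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    # Baseline: one full-model pass per generated token.
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out

def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    # 1) Draft k tokens cheaply.
    ctx = list(context)
    drafted = []
    for _ in range(k):
        tok = draft(ctx)
        drafted.append(tok)
        ctx.append(tok)
    # 2) Verify against the target (one batched pass in a real system).
    ctx = list(context)
    accepted = []
    for tok in drafted:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))       # all drafts accepted: one bonus token
    return accepted

def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_tokens: int, k: int) -> Tuple[List[int], int]:
    out = list(prompt)
    passes = 0                         # number of verification passes
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        passes += 1
    return out[:len(prompt) + n_tokens], passes

baseline = plain_decode(target_next, [0], 12)
spec_out, passes = speculative_decode(target_next, draft_next, [0], 12, k=3)
print(spec_out == baseline, passes)    # same tokens, 4 verification passes instead of 12
```

The output is identical to plain decoding by construction, which is why this style of speedup does not change the model's answers. The only thing the drafter affects is how many verification passes are needed.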

For many local LLM runs, the limit is not pure compute, but memory bandwidth: how fast the GPU can keep feeding weights into computation.

A local GPU generating text often spends most of its time pulling model weights from VRAM again and again for each token. If MTP lets the model check several drafted tokens in one forward pass, the same giant weight matrices have to be reread less often per generated token.
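
A rough back-of-the-envelope sketch of that bandwidth limit, in Python. Every number here is an illustrative assumption, not a figure from the test: the post does not state the quantization or the effective memory bandwidth, so `bytes_per_param=0.5` (roughly 4-bit weights) and `bandwidth_gb_s=1000` are placeholders. The model also ignores KV-cache reads, activations, and the compute cost of verification.

```python
def decode_tokens_per_second(active_params_billion: float,
                             bytes_per_param: float,
                             bandwidth_gb_s: float,
                             tokens_per_pass: float = 1.0) -> float:
    """Upper bound on decode speed if every pass must stream the active weights once."""
    weight_gb = active_params_billion * bytes_per_param  # GB read per forward pass
    passes_per_second = bandwidth_gb_s / weight_gb
    return passes_per_second * tokens_per_pass

# Assumed, illustrative inputs (not from the source).
dense_plain = decode_tokens_per_second(27, 0.5, 1000)                      # one token per pass
dense_mtp = decode_tokens_per_second(27, 0.5, 1000, tokens_per_pass=2.0)   # two tokens per pass
moe_plain = decode_tokens_per_second(3, 0.5, 1000)                         # ~3B active params

print(round(dense_plain, 1), round(dense_mtp, 1), round(moe_plain, 1))
```

Under this simple model, tokens per pass multiply throughput directly for the dense model, because every pass pays for all 27B weights. An A3B mixture-of-experts model activates only about 3B parameters per token, so weight reads are a smaller share of its per-token cost and there is less for MTP to amortize.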

The most interesting claim in their test is ~80% draft acceptance with zero accuracy loss and only ~1 GB of extra VRAM. Acceptance rate matters because speculative decoding only pays off when most drafted tokens are accepted; rejected drafts are wasted work.
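
To see why acceptance rate matters so much, here is a small Python sketch of the expected number of tokens produced per verification pass. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the draft lengths tried below are hypothetical because the post does not say how many tokens are drafted per step.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens per verification pass under independent acceptance.

    The pass always yields at least one token (the target's own), plus each
    drafted token that survives an unbroken run of acceptances:
    1 + p + p^2 + ... + p^draft_len.
    """
    return sum(accept_rate ** i for i in range(draft_len + 1))

for k in (1, 2, 3):
    print(k, round(expected_tokens_per_pass(0.8, k), 3))
```

At an 80% acceptance rate this gives about 1.8, 2.44, and 2.95 tokens per pass for draft lengths of 1, 2, and 3. Actual speedup will be lower than these ratios because drafting and verifying extra positions is not free.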

This is a strong result for local AI because it speeds up generation without changing the model’s answers. The dense model is the real winner, at roughly 2.3x versus roughly 1.2x for the MoE, because memory bandwidth was its main bottleneck.

Their GitHub repo is fully open source.

3:50 AM · May 21, 2026 · 9.4K Views