Smola Outlines LLM Inference Efficiency Across Hardware and Models

Wrap-up and resources. Where to actually get all of this, and the papers behind each trick.

https://alex.smola.org/posts/45-mlss-efficiency/ https://alex.smola.org/posts/45-mlss-efficiency/main.pdf

Serving. Continuous batching, PagedAttention, prefix caching, RadixAttention, and splitting prefill off from decode onto separate hardware. The punchline: your API bill is really a cache-hit-rate report. Tools: vLLM and SGLang.
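To make the cache-hit-rate point concrete, here is a toy sketch of block-level prefix caching. It is an illustration only, not how vLLM or SGLang implement it: the class `ToyPrefixCache`, the 4-token block size, and the token ids are all hypothetical, and a real server caches KV tensors rather than tuples of token ids.

```python
class ToyPrefixCache:
    """Remembers which block-aligned prompt prefixes have been seen before."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size      # hypothetical; real block sizes are larger
        self.seen: set[tuple[int, ...]] = set()

    def lookup_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading tokens hit the cache, then cache this prompt."""
        hits = 0
        # Only full blocks are cacheable; the ragged tail is always recomputed.
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            prefix = tuple(tokens[:end])
            if prefix in self.seen:
                hits = end                # the whole prefix up to `end` is reusable
            else:
                self.seen.add(prefix)
        return hits


cache = ToyPrefixCache()
system_prompt = list(range(8))            # shared across requests
request_1 = system_prompt + [100, 101, 102]
request_2 = system_prompt + [200, 201, 202, 203, 204]

hits_1 = cache.lookup_and_insert(request_1)   # cold cache
hits_2 = cache.lookup_and_insert(request_2)   # reuses the shared system prompt
hit_rate = (hits_1 + hits_2) / (len(request_1) + len(request_2))
print(f"cached tokens: {hits_1}, {hits_2}; hit rate {hit_rate:.2f}")
```

Every cached token is prefill work the server skips, which is why the share of prompt tokens that hit the cache shows up so directly in cost.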

Weight compression. Mixture of experts as “do not touch most of the weights,” then quantization down to four bits, and the rather pretty fact that the exponents of trained weights are almost losslessly compressible to about 4.7 bits each.
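As a rough illustration of "quantization down to four bits", the sketch below does symmetric absmax quantization per group of weights. The function names, the group size, and the sample weights are assumptions for this example, not the scheme from the slides, and the codes are kept as plain Python ints rather than packed two per byte.

```python
def quantize_int4(weights: list[float], group_size: int = 4):
    """Map each group of weights to one float scale plus integer codes in [-7, 7]."""
    groups = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / 7 if amax > 0 else 1.0   # 7 is the largest symmetric 4-bit code
        codes = [max(-7, min(7, round(w / scale))) for w in group]
        groups.append((scale, codes))
    return groups


def dequantize_int4(groups) -> list[float]:
    """Reconstruct approximate weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in groups for code in codes]


weights = [0.7, -0.35, 0.1, 0.0, 0.02, -0.01, 0.005, 0.0]
groups = quantize_int4(weights)
restored = dequantize_int4(groups)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"worst-case reconstruction error: {worst:.4f}")
```

Smaller groups track local weight magnitudes better but spend more bits on scales, so the effective cost is somewhat above four bits per weight.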

KV compression. Shrinking the other thing that saturates your memory bus once the context gets long.
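A back-of-the-envelope sketch of why the KV cache matters at long context, using the thread's running example of a dense 8B model at 40k tokens. The layer count, KV-head count, and head dimension below are assumed values for Qwen3-8B; check them against the model's config before relying on the numbers.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: float) -> float:
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# ASSUMED model shape (not stated in the thread): 36 layers, 8 KV heads, head_dim 128.
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 36, 8, 128
CONTEXT_TOKENS = 40_000

per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)  # 16-bit values
total_gib = per_token * CONTEXT_TOKENS / 2**30

for label, width in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    size = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, width)
    print(f"{label}: {size * CONTEXT_TOKENS / 2**30:.2f} GiB per 40k-token sequence")
```

During decode this whole cache is read once per generated token, per sequence, so halving the bytes per value halves that traffic.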

Alex Smola@smolix

Overview. Prefill versus decode, arithmetic intensity, and the roofline plot that tells you which of the two you are fighting.

Hardware. The physics of the box: bandwidth ladders, the FP8 and FP4 number formats, why a single DRAM access costs about 500 multiplies, and why compute grows roughly 4x per hardware generation while memory bandwidth only doubles. There is also the at-home angle, where a DGX Spark, a Strix Halo box, or Apple Silicon with plenty of unified memory turns out to be a surprisingly good decode machine.
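The roofline argument can be sketched in a few lines. The peak compute and bandwidth figures below are hypothetical round numbers, not specs of any real chip, and the intensity estimate ignores activations and KV-cache traffic, counting only one read of each weight.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline: throughput is capped by compute or by bytes moved, whichever is lower."""
    return min(peak_flops, intensity * bandwidth)


def matmul_intensity(tokens_in_flight: int, bytes_per_weight: float) -> float:
    """FLOPs per byte when each weight is read once and used for every token in flight."""
    return 2.0 * tokens_in_flight / bytes_per_weight   # one multiply plus one add per weight


PEAK_FLOPS = 1.0e15    # hypothetical accelerator: 1 PFLOP/s
BANDWIDTH = 2.0e12     # hypothetical memory bandwidth: 2 TB/s
ridge = PEAK_FLOPS / BANDWIDTH   # intensity needed to become compute-bound

decode = matmul_intensity(tokens_in_flight=1, bytes_per_weight=2)      # one token per step
prefill = matmul_intensity(tokens_in_flight=4096, bytes_per_weight=2)  # whole prompt at once

for name, intensity in [("decode", decode), ("prefill", prefill)]:
    rate = attainable_flops(intensity, PEAK_FLOPS, BANDWIDTH)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{name}: {intensity:.0f} FLOP/byte, {rate / 1e12:.0f} TFLOP/s, {bound}")
```

Under these assumptions single-stream decode sits far left of the ridge point, which is why batching, quantization, and cache compression all help: they either raise the FLOPs done per byte or cut the bytes moved.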

Alex Smola@smolix

It’s a good opportunity to review how we have this exciting convergent evolution of models, hardware, and algorithms for serving efficiency. Be prepared for a deep dive into chips and bandwidth, but also randomized algorithms and architectures. My goal was to write a practitioner’s guide in six parts.

The running example throughout is Qwen3, both the dense 8B and the 30B-A3B mixture of experts, at a 40k token context.

Alex Smola@smolix

Next week I am teaching a tutorial on efficient LLM inference at the Machine Learning Summer School 2026 in NYC, hosted this year at Columbia University. The slides are below. There are about 150 of them, which sounds small, given how far the field has come.

Raphael cohen@cohenrap

@smolix Will the talk be uploaded online later?

Raphael cohen@cohenrap

@smolix You had these massive scale algorithms talks 15 years ago, made a big impact on me
