Great freaking read.
DROP EVERYTHING
The bible for running LLMs locally is now free to read online.
Covers what to use on:
- Laptop / edge / odd hardware
- Mac-first workflows
- Single RTX GPUs
- 2-4+ NVIDIA / CUDA GPUs
- General production serving
- Long-context / MoE / routing
- NVIDIA max performance
- Cluster orchestration
Software:
- llama.cpp
- MLX / MLX-LM
- ExLlamaV2
- ExLlamaV3
- vLLM
- SGLang
- TensorRT-LLM
- NVIDIA Dynamo
You should read this, and if you can't right now, you definitely wanna bookmark it for later.
Local AI FTW