PreFT accelerates multi-adapter LLM inference by restricting LoRA adapters to the prefill stage

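Because the adapters are skipped during decode, which is memory bound when many adapters are served at once, multi-adapter serving gets faster with limited loss in performance.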

Original post

New paper!! Prefill and decode represent very different inference workloads; when we try to serve many LoRA adapters at once, inference slows down a ton during decode because we are memory bound :( What if we didn’t need those adapters at decode? We introduce Prefill-Only Fine Tuning (PreFT), adapters that are only trained and applied at prefill. We show that this speeds up multi-adapter serving with limited loss in performance!

9:52 AM · May 28, 2026
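
To make the prefill/decode split concrete, here is a minimal sketch of a single linear layer whose LoRA update is added only when the layer runs in the prefill phase. It is an illustration under stated assumptions, not the paper's implementation: the post only says the adapters are trained and applied at prefill, so the `phase` flag, the function names, and the toy weights are all hypothetical.

```python
# Toy sketch: LoRA applied at prefill only (all names and weights are hypothetical).

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def linear(x, W, lora=None, phase="decode"):
    """Base projection W @ x; the low-rank update B @ (A @ x) is added only at prefill."""
    y = matvec(W, x)
    if lora is not None and phase == "prefill":
        A, B = lora                      # A: r x d_in, B: d_out x r
        delta = matvec(B, matvec(A, x))  # low-rank adapter contribution
        y = [yi + di for yi, di in zip(y, delta)]
    return y                             # decode path never reads A or B

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity, for readability)
A = [[1.0, 1.0]]              # rank-1 down-projection
B = [[0.5], [0.0]]            # rank-1 up-projection
x = [2.0, 3.0]

prefill_out = linear(x, W, (A, B), phase="prefill")
decode_out = linear(x, W, (A, B), phase="decode")
print(prefill_out, decode_out)
```

In this sketch the decode call does the same work as the base model, so requests that use different adapters would not need per-request adapter weights loaded during decode. How the adapter's effect carries over from prefill into generation is not described in the post and is left out here.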