Prefill-Only Fine Tuning speeds up multi-adapter LLM inference by up to 2.21x by skipping decode-phase adapters
The technique stays near parity in accuracy with standard LoRA serving.
In honor of this paper's acknowledgments section:

New paper!! Prefill and decode represent very different inference workloads; when we try to serve many LoRA adapters at once, inference slows down a ton during decode because we are memory bound :( What if we didn’t need those adapters at decode? We introduce Prefill-Only Fine Tuning (PreFT), adapters that are only trained and applied at prefill. We show that this speeds up multi-adapter serving with limited loss in performance!
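A minimal sketch of the idea, not the paper's implementation: a single toy linear layer whose low-rank adapter is added only during prefill, so decode steps run the base weights alone. The LoRA-style update, the `adapted_linear` helper, the `phase` flag, and the toy shapes are all assumptions made for illustration. In a real transformer the adapted prefill activations would presumably still influence decode through the cached keys and values.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix, used here only as toy data."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def adapted_linear(x, W, A, B, phase):
    """Hypothetical layer: base projection plus a low-rank delta at prefill only.

    x: (tokens, d_in), W: (d_in, d_out), A: (d_in, r), B: (r, d_out).
    """
    y = matmul(x, W)
    if phase == "prefill":
        # The adapter touches prompt tokens only (assumed LoRA-style update).
        y = add(y, matmul(matmul(x, A), B))
    # At decode the adapter is skipped, so no per-adapter weights are read.
    return y


rng = random.Random(0)
d_in, d_out, r = 8, 8, 2
W = rand_matrix(d_in, d_out, rng)
A = rand_matrix(d_in, r, rng)
B = rand_matrix(r, d_out, rng)

prompt = rand_matrix(5, d_in, rng)      # five prompt tokens, processed together
new_token = rand_matrix(1, d_in, rng)   # one token per decode step

prefill_out = adapted_linear(prompt, W, A, B, "prefill")
decode_out = adapted_linear(new_token, W, A, B, "decode")
```

In this toy, `decode_out` is exactly the base model's output, which is why a decode step never needs to load `A` or `B`.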
new paper 🫡 we made serving many different finetunes surprisingly efficient by just… not intervening at decode steps!
thread of random thoughts on prefill-only finetuning (PreFT), our new work
on efficiency: earlier in my Ph.D., I worked on ReFT (a new PEFT) with @ZhengxuanZenWu. ReFT works by only editing representations in prefill; however, in the original paper, we were too unsure about making claims about efficiency gains from this, given neither of us did systems.
PreFT finally closes the loop on that; we have shown that prefill-only finetuning does have clear efficiency benefits in a set of workloads that actually exist in the real world. this would not be possible w/o @aslanpouthakoun's hard work on forking vLLM!
interp x arch: as soon as an idea becomes too useful downstream, people refuse to call it interpretability. PreFT is quite far removed from standard interp research I've done before, but it draws a clear lineage from interp: distributed interchange intervention -> ReFT -> PreFTs
i'm very excited about continuing to do such unconventional interp-ish work that actually cares about tackling modern problems. while purely scientific interp is extremely important, it can become ungrounded from reality b/c it is hard to eval...
when we instead take ideas we got from interp (low-rank subspaces encode useful info + intervening from prompts cascades to generations), it can be highly generative for progress on non-interp problems!
my favourite story on this topic is induction heads leading to better SSMs.
the role of AI in research: i've become a serious user of Claude Code since January. this was the first project where a coding agent really felt like magic for me: for the whole time, me and Andrew had persistent agents running on the cluster @tilderesearch generously provided us
the agents were indispensable for managing the experiment sweeps we ran once the right scaffold was in place. we had them write Python scripts for generating all the scripts and appendix tables in the paper, which avoided mistakes when updating and helped us prioritise exps
i'm also pretty intrigued by autoresearch; i ran my first 24-hour autonomous claude loop on some exps related to this paper, whose results will hopefully be used in some upcoming work :)
i felt pretty cool when @jurafsky gave me feedback on a figure and i asked claude to update it in the overleaf from my phone, so we could live-iterate on it!
continual learning and personalisation: the obvious application of PreFTs is to enable serving many different finetunes of a model highly efficiently, keeping in mind prefill-decode mismatch (e.g. now, if disaggregated, your decode machines don't need to store the finetunes!)
why don't we have a world where each user gets their own continually-updated finetune of a frontier model? one bottleneck is definitely systems; we hope PreFTs make a step on fixing this.
but another is data; given a set of interactions w/ a user, how do you create useful finetuning data that improves their experience? right now, models just write down their own memory prompts. surely something more sophisticated is possible even with only a few interactions?
perhaps ameliorating the systems/serving issues as we did with PreFT will make it more viable to try working on this problem. at a broader level, I now have a much better appreciation of how systems enable cool new research!
closing thoughts (for now): i never thought, as an interp person, that i would write a paper with the word "efficiency" in the title :) this would not have been possible w/o having a great collaborator in @aslanpouthakoun!
i'm excited about doing more of this kind of different (for me) research -- it's what a phd is meant for!