Prefill-Only Fine Tuning speeds up multi-adapter LLM inference by up to 2.21x by skipping decode-phase adapters

The technique maintains near parity in accuracy with standard LoRA serving.

Original post

#678@ARYAMAN2020OP

Andrew Lanpouthakoun@ASLANPOUTHAKOUN

New paper!! Prefill and decode represent very different inference workloads; when we try to serve many LoRA adapters at once, inference slows down a ton during decode because we are memory bound :( What if we didn’t need those adapters at decode? We introduce Prefill-Only Fine Tuning (PreFT), adapters that are only trained and applied at prefill. We show that this speeds up multi-adapter serving with limited loss in performance!

9:52 AM · May 28, 2026
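
To make the idea in the post above concrete, here is a toy sketch of "apply the adapter at prefill only" for a single linear projection. It is not the paper's implementation or the vLLM fork: the `LoRAAdapter` class, the `project` function, and the `phase` argument are all hypothetical names chosen for illustration, and the weights are made-up numbers.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class LoRAAdapter:
    """Hypothetical low-rank adapter: delta(x) = B @ (A @ x)."""
    A: Matrix  # shape (rank, d_in)
    B: Matrix  # shape (d_out, rank)

    def delta(self, x: Vector) -> Vector:
        return matvec(self.B, matvec(self.A, x))


def project(W: Matrix, x: Vector, adapter: Optional[LoRAAdapter], phase: str) -> Vector:
    """Base projection W @ x, plus the adapter delta only when phase == 'prefill'."""
    out = matvec(W, x)
    if adapter is not None and phase == "prefill":
        out = [o + d for o, d in zip(out, adapter.delta(x))]
    return out


# Illustrative values only (assumptions, not taken from the paper).
W = [[1.0, 0.0], [0.0, 1.0]]
adapter = LoRAAdapter(A=[[1.0, 1.0]], B=[[0.5], [-0.5]])
x = [2.0, 4.0]

prefill_out = project(W, x, adapter, phase="prefill")  # base output + adapter delta
decode_out = project(W, x, adapter, phase="decode")    # base output only
```

In this toy, the decode path never reads `adapter.A` or `adapter.B`, which is the property the post points to: the per-request adapter weights drop out of the memory-bound decode steps.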

QUOTE POST

#222Christopher Potts@CHRISGPOTTS

In honor of this paper's acknowledgments section:

A picture of Paul Erdős with laser eyes, and the quotation below has been haphazardly changed to say "A AI researcher is a machine for turning Molly Tea into hyperparam sweeps."

Andrew Lanpouthakoun@aslanpouthakoun

New paper!! Prefill and decode represent very different inference workloads; when we try to serve many LoRA adapters at once, inference slows down a ton during decode because we are memory bound :( What if we didn’t need those adapters at decode? We introduce Prefill-Only Fine Tuning (PreFT), adapters that are only trained and applied at prefill. We show that this speeds up multi-adapter serving with limited loss in performance!

4:52 PM · May 28, 2026 · 2.7K Views

6:01 PM · May 28, 2026 · 347 Views

QUOTE POST

#678Aryaman Arora@ARYAMAN2020

new paper 🫡 we made serving many different finetunes surprisingly efficient by just… not intervening at decode steps!

5:28 PM · May 28, 2026 · 723 Views

QUOTE POST

#678Aryaman Arora@ARYAMAN2020

thread of random thoughts on prefill-only finetuning (PreFT), our new work

7:01 PM · May 28, 2026 · 792 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

on efficiency

earlier in my Ph.D., I worked on ReFT (a new PEFT) with @ZhengxuanZenWu. ReFT works by only editing representations in prefill; however, in the original paper, we were too unsure about making claims about efficiency gains from this, given neither of us did systems.

7:01 PM · May 28, 2026 · 240 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

PreFT finally closes the loop on that; we have shown that prefill-only finetuning does have clear efficiency benefits in a set of workloads that actually exist in the real world. this would not be possible w/o @aslanpouthakoun's hard work on forking vLLM!

7:01 PM · May 28, 2026 · 114 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

interp x arch

as soon as an idea becomes too useful downstream, people refuse to call it interpretability. PreFT is quite far removed from standard interp research I've done before, but it draws a clear lineage from interp: distributed interchange intervention -> ReFT -> PreFTs

7:01 PM · May 28, 2026 · 17 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

i'm very excited about continuing to do such unconventional interp-ish work that actually cares about tackling modern problems. while purely scientific interp is extremely important, it can become ungrounded from reality b/c it is hard to eval...

7:01 PM · May 28, 2026 · 15 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

when we instead take ideas we got from interp (low-rank subspaces encode useful info + intervening from prompts cascades to generations), it can be highly generative for progress on non-interp problems!

my favourite story on this topic is induction heads leading to better SSMs.

7:01 PM · May 28, 2026 · 16 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

the agents were indispensable for managing the experiment sweeps we ran once the right scaffold was in place. we had them write Python scripts for generating all the scripts and appendix tables in the paper which avoided mistakes when updating and helped us prioritise exps

7:01 PM · May 28, 2026 · 17 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

the role of AI in research

i've become a serious user of Claude Code since January. this was the first project where a coding agent really felt like magic for me: for the whole time, me and Andrew had persistent agents running on the cluster @tilderesearch generously provided us

7:01 PM · May 28, 2026 · 23 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

continual learning and personalisation

the obvious application of PreFTs is to enable serving many different finetunes of a model highly efficiently, keeping in mind prefill-decode mismatch (e.g. now, if disaggregated, your decode machines don't need to store the finetunes!)

7:01 PM · May 28, 2026 · 17 Views
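
As a rough back-of-envelope for the disaggregated-serving point above, the sketch below estimates how much adapter weight a decode machine would have to hold under ordinary LoRA serving versus a prefill-only scheme. Every number here (layer count, hidden size, rank, adapted matrices per layer, adapter count, 2-byte parameters) is a hypothetical assumption for illustration, not a figure from the paper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_bytes(num_layers: int, matrices_per_layer: int, d_model: int,
                  rank: int, bytes_per_param: int = 2) -> int:
    """Total bytes for one adapter, assuming square d_model x d_model targets."""
    per_matrix = lora_param_count(d_model, d_model, rank)
    return num_layers * matrices_per_layer * per_matrix * bytes_per_param


# Hypothetical model and adapter shape (assumptions, not from the paper).
per_adapter = adapter_bytes(num_layers=32, matrices_per_layer=4, d_model=4096, rank=16)
num_adapters = 1000

# Standard LoRA serving: decode machines need every adapter they may be asked to apply.
decode_bytes_lora = num_adapters * per_adapter

# Prefill-only adapters: decode runs the base model, so no adapter weights are stored there.
decode_bytes_prefill_only = 0

print(f"per adapter: {per_adapter / 2**20:.0f} MiB")
print(f"decode node, standard LoRA: {decode_bytes_lora / 2**30:.2f} GiB")
print(f"decode node, prefill-only: {decode_bytes_prefill_only} bytes")
```

Under these assumed shapes each adapter is 32 MiB, so a thousand of them come to roughly 31 GiB that would no longer need to live on decode machines.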

REPLY

#678Aryaman Arora@ARYAMAN2020

i felt pretty cool when @jurafsky gave me feedback on a figure and i asked claude to update it in the overleaf from my phone, so we could live-iterate on it!

7:01 PM · May 28, 2026 · 16 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

i'm also pretty intrigued by autoresearch; i ran my first 24-hour autonomous claude loop on some exps related to this paper, whose results will hopefully be used in some upcoming work :)

7:01 PM · May 28, 2026 · 17 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

why don't we have a world where each user gets their own continually-updated finetune of a frontier model? one bottleneck is definitely systems; we hope PreFTs make a step on fixing this.

7:01 PM · May 28, 2026 · 12 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

but another is data; given a set of interactions w/ a user, how do you create useful finetuning data that improves their experience? right now, models just write down their own memory prompts. surely something more sophisticated is possible even with only a few interactions?

7:01 PM · May 28, 2026 · 12 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

perhaps ameliorating the systems/serving issues as we did with PreFT will make it more viable to try working on this problem. at a broader level, I now have a much better appreciation of how systems enable cool new research!

7:01 PM · May 28, 2026 · 12 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

closing thoughts (for now)

i never thought, as an interp person, that i would write a paper with the word "efficiency" in the title :) this would not have been possible w/o having a great collaborator in @aslanpouthakoun!

7:01 PM · May 28, 2026 · 52 Views

REPLY

#678Aryaman Arora@ARYAMAN2020

i'm excited about doing more of this kind of different (for me) research -- it's what a phd is meant for!

7:01 PM · May 28, 2026 · 52 Views
