Speculative decoding is the closest you can get to a free lunch method. It is beautiful and astounding. I am surprised that it does not play a greater role in "AI".
Meta FAIR's François Fleuret argues speculative decoding is a nearly free lunch, prompting debate over token volume versus deeper reasoning
Story Overview
Meta FAIR researcher François Fleuret called speculative decoding the closest thing to a free lunch in AI inference: a smaller draft model proposes several tokens, and the larger target model verifies them in a single forward pass, preserving the target model's output distribution while reducing the number of sequential decoding steps. His post drew a reply framing the technique as a chance to generate fewer tokens overall and prioritize deeper reasoning over volume.
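The propose-and-verify loop described above can be illustrated at the level of a single token. This is a minimal sketch of the standard accept/reject rule for speculative sampling, not code from the thread: the toy distributions `p` (target) and `q` (draft) and the helper names `speculative_step` and `empirical` are assumptions made for illustration. It shows why the method counts as lossless: tokens drawn this way follow the target distribution, whatever the draft proposes.

```python
import random

def speculative_step(p, q, rng):
    """Draw one token over a toy vocabulary using a draft proposal.

    p: target-model probabilities, q: draft-model probabilities
    (both hypothetical lists over the same vocabulary).
    Returns (token, accepted_flag).
    """
    vocab = list(range(len(p)))
    x = rng.choices(vocab, weights=q)[0]           # draft model proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):       # target model accepts it
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False

def empirical(p, q, n=100_000, seed=0):
    """Return (token frequencies, acceptance rate) over n draws."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(n):
        token, ok = speculative_step(p, q, rng)
        counts[token] += 1
        accepted += ok
    return [c / n for c in counts], accepted / n

if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]   # assumed target distribution
    q = [0.2, 0.5, 0.3]   # assumed (poorly matched) draft distribution
    freqs, rate = empirical(p, q)
    print("token frequencies:", freqs)
    print("acceptance rate:", rate)
```

The frequencies should track `p` even though every proposal comes from `q`; the acceptance rate depends on how much the two distributions overlap, which is what determines the speedup in practice.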
Why the method still sits on the sidelines remains unclear
Fleuret noted surprise that the approach does not play a larger role, yet the thread supplies no adoption metrics or concrete barriers, leaving the exact reasons for limited prominence as an open gap.
Inference speed gains could shift focus from token count
A reply highlighted trading raw token volume for higher-quality thought, which aligns with the lossless speedup property but offers no new benchmarks to quantify the practical difference in current systems.
Supporters praise speculative decoding as a simple, efficient, nearly free technique, while critics dispute the free-lunch claim, calling it an unfavorable trade of FLOPs for interactivity.
Most Activity
@francoisfleuret why grug make token. grug think. less token. more thought
@francoisfleuret It's rejection sampling.

@francoisfleuret if you like speculative decoding, you are gonna love speculative speculative decoding https://arxiv.org/abs/2603.03251

@DaniloJRezende I know!
But it is such a simple and beautiful "trick" that I'd expect it cannot be so easily used for real with huge models.

@francoisfleuret Here is another equivalent free lunch but for videos

@DaniloJRezende @francoisfleuret Rejection sampling does not include the genius efficiency of it, imo.

@giffmana @francoisfleuret True it is particularly favourable with AR transformers because computing log probs is much cheaper than sampling with larger model. But that is orthogonal to the algorithm itself.

@francoisfleuret Interactivity means tokens per second per query. Throughput is total tokens per second across all queries in the batch

@francoisfleuret Can you say more about what you mean by greater role in AI?
I thought spec decoding was already a pretty popular inference technique? Is there more that can be done?

@noisybytes @francoisfleuret lol

@francoisfleuret What’s free about it? It is a way to trade off flops for interactivity at a very unfavourable exchange rate

@francoisfleuret with a big enough decoding batch size the bottleneck switches from mem bandwidth to FLOPS, so the benefit of spec dec drops to 0. so when trying to maximize throughput, you should turn off spec dec
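The bottleneck switch described in this reply can be sketched with a simple roofline-style estimate. This is a rough back-of-the-envelope model, not a benchmark: the hardware figures, the model size, and the assumed 3 tokens gained per verification pass are all hypothetical, and the function names `step_time` and `spec_speedup` are invented for illustration. It ignores the draft model's own cost and KV cache traffic.

```python
def step_time(params, batch, tokens_per_seq, bandwidth, flops, bytes_per_param=2):
    """Estimated seconds for one forward pass of the target model.

    The pass is limited either by streaming the weights from memory once
    or by the matmul work (about 2 FLOPs per parameter per token).
    """
    memory_time = params * bytes_per_param / bandwidth
    compute_time = 2 * params * batch * tokens_per_seq / flops
    return max(memory_time, compute_time)

def spec_speedup(batch, draft_len=4, tokens_per_pass=3.0,
                 params=70e9, bandwidth=3e12, flops=1e15):
    """Estimated speedup of speculative decoding over plain decoding.

    All defaults are assumed values: a 70B-parameter model in 16-bit
    weights, 3 TB/s memory bandwidth, 1 PFLOP/s compute, 4 drafted tokens
    per step, and 3 tokens produced per verification pass on average.
    """
    plain = step_time(params, batch, 1, bandwidth, flops)
    spec = step_time(params, batch, draft_len + 1, bandwidth, flops)
    return tokens_per_pass * plain / spec

if __name__ == "__main__":
    for batch in (1, 100, 512):
        print(batch, round(spec_speedup(batch), 2))
```

Under these assumed numbers the estimated speedup is about 3x at batch size 1, where the pass is memory-bound and verifying extra tokens costs almost nothing, and falls below 1x at batch size 512, where every verified token costs real compute.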

@francoisfleuret oh, but it does! https://modal.com/blog/spec-is-all-u-need

@giffmana @francoisfleuret There are many other places where rejection sampling is used for similar reason: sampling from target is expensive but cheap to eval log-prob, so use a cheap surrogate to sample and accept/reject.

@francoisfleuret @norpadon I believe he just means since it’s faster it’s more interactive

@francoisfleuret I think it’s standard everywhere?

@francoisfleuret it’s nice but i wouldn’t call it beautiful or astounding

@francoisfleuret If you want to maximise throughput the better use of flops is to just increase the batch size (until KV cache traffic becomes the limiting factor)
For long sequences speculative decoding can amortize the cost of attention, but this is less relevant for sparse/hybrid models

@francoisfleuret Why can't we just build a native multi token prediction architecture directly?

@francoisfleuret it plays a pretty big role, doesn't it?