Meta FAIR's François Fleuret argues speculative decoding is a nearly free lunch, prompting debate over token volume versus deeper reasoning

Story Overview

Meta FAIR researcher François Fleuret called speculative decoding the closest thing to a free lunch. In the technique, a smaller draft model proposes several tokens that the larger target model verifies in a single forward pass, preserving the target model's output distribution while cutting the number of sequential decoding steps. One reply framed the technique as a chance to generate fewer tokens overall and prioritize deeper reasoning over volume.
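The propose-then-verify loop described above can be sketched as follows. This is a minimal illustration, not code from the thread: `speculative_round`, `toy_draft`, and `toy_target` are hypothetical names, the "models" are plain functions returning next-token probabilities, and the target's verification is written as a loop where a real transformer would score all positions in one batched forward pass.

```python
import random

def sample(weights, rng):
    """Draw one token id from a list of (possibly unnormalized) weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def speculative_round(prefix, draft_probs, target_probs, gamma, rng):
    """Run one draft-and-verify round; returns 1 to gamma + 1 new tokens."""
    # 1. The draft model proposes gamma tokens autoregressively.
    proposed, q_dists = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        proposed.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. The target model scores every proposed position
    #    (one batched pass in practice; a loop in this toy version).
    p_dists = [target_probs(list(prefix) + proposed[:i])
               for i in range(gamma + 1)]

    # 3. Accept each draft token with probability min(1, p/q). On the
    #    first rejection, resample from the residual max(0, p - q) and stop.
    out = []
    for i, tok in enumerate(proposed):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # 4. Every draft token was accepted: take one extra token from the target.
    out.append(sample(p_dists[gamma], rng))
    return out

# Hypothetical toy models over a 3-token vocabulary; both ignore the context.
def toy_draft(ctx):
    return [0.3, 0.5, 0.2]

def toy_target(ctx):
    return [0.6, 0.3, 0.1]

rng = random.Random(0)
print(speculative_round([], toy_draft, toy_target, gamma=4, rng=rng))
```

Each round costs one target-model pass but can emit several tokens, which is where the reduction in sequential steps comes from.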

Original post

François Fleuret @francoisfleuret · #577 in Tech

Speculative decoding is the closest you can get to a free lunch method. It is beautiful and astounding. I am surprised that it does not play a greater role in "AI".

5:47 AM · Jun 20, 2026 · 19.9K Views

Open Question

Why the method still sits on the sidelines remains unclear

Fleuret expressed surprise that the approach does not play a larger role, but the thread supplies no adoption metrics or concrete barriers, so the reasons for its perceived limited prominence remain open. Several replies counter that the technique is already widely used.

Developer Impact

Inference speed gains could shift focus from token count

One reply argued for trading raw token volume for higher-quality thought. That fits the technique's lossless speedup, but the thread offers no new benchmarks to quantify the practical difference in current systems.
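The "lossless" property can be checked in a few lines. The notation here is an assumption introduced for illustration, not taken from the thread: $q$ is the draft model's next-token distribution, $p$ is the target model's, a draft token $x \sim q$ is accepted with probability $\min(1, p(x)/q(x))$, and on rejection a token is drawn from the normalized residual $\max(0, p - q)$.

```latex
\begin{aligned}
\Pr[\text{output} = x]
  &= \underbrace{q(x)\,\min\!\Big(1, \tfrac{p(x)}{q(x)}\Big)}_{\text{drafted and accepted}}
   + \underbrace{\Big(1 - \sum_y \min\big(p(y), q(y)\big)\Big)}_{\text{probability of rejection}}
     \cdot \frac{\max\big(0,\, p(x) - q(x)\big)}{\sum_y \max\big(0,\, p(y) - q(y)\big)} \\
  &= \min\big(p(x), q(x)\big) + \max\big(0,\, p(x) - q(x)\big) \\
  &= p(x),
\end{aligned}
```

The second line uses $\sum_y \max(0, p(y) - q(y)) = 1 - \sum_y \min(p(y), q(y))$, which holds because $p$ sums to 1. Under these assumptions the emitted token is distributed exactly as if it had been sampled from the target model alone.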

Sentiment

Positive commenters praise speculative decoding as a simple, efficient, nearly free technique. Negative commenters dispute the free-lunch claim, calling it an unfavorable trade of FLOPs for interactivity.

Pos: 80.0%

Neg: 20.0%

5 comments with sentiment.

Posts from X

kache @yacineMTB

@francoisfleuret why grug make token. grug think. less token. more thought

Danilo J. Rezende @DaniloJRezende

@francoisfleuret It's rejection sampling.

@noisybytes @noisybytes

@francoisfleuret if you like speculative decoding, you are gonna love speculative speculative decoding https://arxiv.org/abs/2603.03251

François Fleuret @francoisfleuret

@DaniloJRezende I know!

But it is such a simple and beautiful "trick" that I'd expect it cannot be so easily used for real with huge models.

Mohamed @mohammad2012191

@francoisfleuret Here is another equivalent free lunch but for videos

Lucas Beyer (bl16) @giffmana

@DaniloJRezende @francoisfleuret Rejection sampling does not include the genius efficiency of it, imo.

Danilo J. Rezende @DaniloJRezende

@giffmana @francoisfleuret True, it is particularly favourable with AR transformers because computing log probs is much cheaper than sampling with the larger model. But that is orthogonal to the algorithm itself.

Artur Chakhvadze @norpadon

@francoisfleuret Interactivity means tokens per second per query. Throughput is total tokens per second across all queries in the batch
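The two definitions in this reply can be made concrete with a small calculation. The function names and all numbers below are hypothetical, chosen only to show that the two metrics can move in opposite directions as the batch grows.

```python
def interactivity(tokens_per_step, step_seconds):
    """Tokens per second delivered to a single query."""
    return tokens_per_step / step_seconds

def throughput(batch_size, tokens_per_step, step_seconds):
    """Total tokens per second across every query in the batch."""
    return batch_size * interactivity(tokens_per_step, step_seconds)

# Hypothetical configurations: (batch size, tokens per step, seconds per step).
# Assumption: a larger batch makes each decoding step slightly slower.
configs = {"small batch": (1, 1, 0.020), "large batch": (32, 1, 0.025)}

for name, (batch, tokens, seconds) in configs.items():
    print(name,
          "interactivity:", round(interactivity(tokens, seconds), 1),
          "throughput:", round(throughput(batch, tokens, seconds), 1))
```

With these made-up numbers the large batch serves each user more slowly while producing far more tokens per second overall, which is the distinction the reply draws.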

Trash Panda 🦝 @trashpandaemoji

@francoisfleuret Can you say more about what you mean by greater role in AI?

I thought spec decoding was already a pretty popular inference technique? Is there more that can be done?

@dingchilling🫪 @dingchilling

@noisybytes @francoisfleuret lol

Artur Chakhvadze @norpadon

@francoisfleuret What’s free about it? It is a way to trade off flops for interactivity at a very unfavourable exchange rate
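The "exchange rate" in this reply can be estimated with a simplified model. Assumptions, none of which come from the thread: each draft token is accepted independently with a fixed probability `alpha`, the draft proposes `gamma` tokens per round, and the draft model's own cost is ignored. Under those assumptions the expected number of tokens per round is a geometric series, and the target model scores `gamma + 1` positions per round instead of one.

```python
def expected_tokens_per_round(alpha, gamma):
    """Expected tokens emitted per round: 1 + alpha + ... + alpha**gamma."""
    if alpha == 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def target_flops_multiplier(alpha, gamma):
    """Target-model FLOPs per emitted token, relative to plain decoding.

    The target scores gamma + 1 positions per round (draft cost ignored).
    """
    return (gamma + 1) / expected_tokens_per_round(alpha, gamma)

# Hypothetical acceptance rates, all with gamma = 4 draft tokens per round.
for alpha in (0.5, 0.8, 0.95):
    print(alpha,
          round(expected_tokens_per_round(alpha, 4), 3),
          round(target_flops_multiplier(alpha, 4), 3))
```

The multiplier is always at least 1: the technique spends extra target-model FLOPs to take fewer sequential steps, and how favourable that trade is depends on the acceptance rate and on whether those FLOPs were otherwise idle.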

Norman Mu @TheNormanMu

@francoisfleuret with a big enough decoding batch size the bottleneck switches from mem bandwidth to FLOPS, so the benefit of spec dec drops to 0. so when trying to maximize throughput, you should turn off spec dec
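The bottleneck switch described in this reply can be illustrated with a rough roofline-style estimate. Everything here is an assumption made for illustration: the hardware figures are hypothetical, a dense forward pass is approximated as 2 FLOPs per parameter per token, and KV cache traffic is ignored.

```python
PARAMS = 70e9          # hypothetical model size (parameters)
BYTES_PER_PARAM = 2    # assumed 16-bit weights
BANDWIDTH = 3e12       # hypothetical memory bandwidth, bytes per second
PEAK_FLOPS = 1e15      # hypothetical peak compute, FLOPs per second

def step_time(batch, positions):
    """Estimated time of one target-model pass.

    `positions` is how many token positions each sequence scores in the pass:
    1 for plain decoding, gamma + 1 when verifying gamma draft tokens.
    """
    memory_time = PARAMS * BYTES_PER_PARAM / BANDWIDTH      # read weights once
    compute_time = 2 * PARAMS * batch * positions / PEAK_FLOPS
    return max(memory_time, compute_time)

for batch in (1, 64, 512):
    plain = step_time(batch, 1)
    verify = step_time(batch, 5)
    print(batch, "verify/plain time ratio:", round(verify / plain, 2))
```

In this toy model, verifying five positions costs the same as decoding one while the pass is memory-bound, and costs five times as much once the batch is large enough to be compute-bound, which is the regime where the reply suggests turning speculative decoding off.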

jason @jvmncs

@francoisfleuret oh, but it does! https://modal.com/blog/spec-is-all-u-need

Danilo J. Rezende @DaniloJRezende

@giffmana @francoisfleuret There are many other places where rejection sampling is used for similar reason: sampling from target is expensive but cheap to eval log-prob, so use a cheap surrogate to sample and accept/reject.
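The general pattern this reply describes, classic rejection sampling with a cheap surrogate, can be sketched as below. The function name and the toy distributions are hypothetical; the only requirement assumed is that the proposal `q` is nonzero wherever the target `p` is.

```python
import random

def rejection_sample(p, q, rng):
    """Draw one index distributed as p, using only samples from q.

    p is only ever evaluated, never sampled from directly.
    Requires q[i] > 0 wherever p[i] > 0.
    """
    # Envelope constant: the smallest m with p[i] <= m * q[i] for all i.
    m = max(pi / qi for pi, qi in zip(p, q) if pi > 0)
    while True:
        x = rng.choices(range(len(q)), weights=q, k=1)[0]  # cheap proposal
        if rng.random() < p[x] / (m * q[x]):                # accept or retry
            return x

# Hypothetical target and a uniform surrogate over three outcomes.
target = [0.7, 0.2, 0.1]
surrogate = [1 / 3, 1 / 3, 1 / 3]

rng = random.Random(0)
print([rejection_sample(target, surrogate, rng) for _ in range(10)])
```

Classic rejection sampling retries until a proposal is accepted. The speculative-decoding variant instead resamples once from a residual distribution on rejection, so each verification always yields a token.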

Lucien @Ljt019117161

@francoisfleuret @norpadon I believe he just means since it’s faster it’s more interactive

Mark @yieldthought

@francoisfleuret I think it’s standard everywhere?

lineardiff @lineardiff

@francoisfleuret it’s nice but i wouldn’t call it beautiful or astounding

Artur Chakhvadze @norpadon

@francoisfleuret If you want to maximise throughput the better use of flops is to just increase the batch size (until KV cache traffic becomes the limiting factor)

For long sequences speculative decoding can amortize the cost of attention, but this is less relevant for sparse/hybrid models

Shikhar @xikhar

@francoisfleuret Why can't we just build a native multi token prediction architecture directly?

Max Zimmer @maxzimmerberlin

@francoisfleuret it plays a pretty big role, doesn't it?
