The problem with autoregressive decoding: one token of GPU work is small, but CPU bookkeeping is paid every token.
In a blocking decode loop, the GPU finishes a forward and then waits while the CPU commits results and plans the next step.
That idle gap is called a GPU bubble.
https://moondream.ai/blog/popping-the-gpu-bubble
The fix is pipelined decoding: launch the next forward before the current token has fully returned to the CPU.
The next token can stay on the GPU. The CPU copy for streaming and done-state can happen in the background.
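A back-of-the-envelope model helps show what the bubble costs. The timings below are made-up numbers, and the assumption that pipelining fully hides the shorter of the two phases is an idealization, not a measurement from Photon.

```python
def blocking_step_ms(gpu_ms: float, cpu_ms: float) -> float:
    # Blocking loop: the GPU forward and the CPU bookkeeping run back to back,
    # so every token pays for both.
    return gpu_ms + cpu_ms


def pipelined_step_ms(gpu_ms: float, cpu_ms: float) -> float:
    # Idealized pipeline: CPU work for step t overlaps the forward for step t+1,
    # so the steady-state step time is set by whichever phase is slower.
    return max(gpu_ms, cpu_ms)


def speedup(gpu_ms: float, cpu_ms: float) -> float:
    return blocking_step_ms(gpu_ms, cpu_ms) / pipelined_step_ms(gpu_ms, cpu_ms)


# Hypothetical numbers: a 4 ms forward and 1 ms of CPU bookkeeping per token.
print(speedup(4.0, 1.0))  # 1.25
```

Under this model the best case is a 2x speedup, reached when CPU and GPU time per token are equal. Any leftover synchronization pushes the real number below the prediction.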
Running ahead means a request is already included in t+1 by the time we get the EOS from step t.
Photon handles this with zombies: finalize early, release late. The extra row rides one more forward, then its KV pages and slots are freed once refs hit zero.
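One way to picture "finalize early, release late" is a reference count per request. This is a minimal sketch, not Photon's code: `Request`, `Scheduler`, and every field name here are hypothetical, and a plain list of page ids stands in for real KV pages.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    kv_pages: list          # page ids owned by this request (hypothetical)
    refs: int = 0           # in-flight forwards that still include this row
    finalized: bool = False  # result has already been handed to the caller


class Scheduler:
    def __init__(self):
        self.free_pages = []

    def launch(self, batch):
        # Every forward that includes a row takes a reference on it.
        for req in batch:
            req.refs += 1

    def retire(self, batch):
        # A forward has completed: drop its references.
        for req in batch:
            req.refs -= 1
            self._maybe_release(req)

    def finalize(self, req):
        # Finalize early: the caller sees the finished result right away.
        req.finalized = True
        self._maybe_release(req)

    def _maybe_release(self, req):
        # Release late: pages are freed only when no forward still uses them.
        if req.finalized and req.refs == 0 and req.kv_pages:
            self.free_pages.extend(req.kv_pages)
            req.kv_pages = []


sched = Scheduler()
req = Request(rid=0, kv_pages=[7, 8])
sched.launch([req])   # forward t
sched.launch([req])   # forward t+1, launched before the EOS from t is known
sched.retire([req])   # forward t completes and its token is EOS
sched.finalize(req)   # request is done, but it is still a row in t+1 (a zombie)
pages_held_as_zombie = list(req.kv_pages)
sched.retire([req])   # forward t+1 completes, refs hit zero, pages are freed
print(pages_held_as_zombie, sched.free_pages)  # [7, 8] [7, 8]
```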
Constrained decoding adds a dependency: the mask for step t+1 depends on the token sampled at step t.
Fortunately, the forward does not need that mask.
So Photon launches forward t+1 first, commits t, then finalizes sampling for t+1.
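The ordering can be sketched with a toy model. Everything here is a stand-in: a worker thread plays the GPU, `forward` returns fixed logits, and `grammar_mask` is a made-up three-token grammar. The point is only that the forward starts before the mask exists, and the mask is applied at sampling time.

```python
from concurrent.futures import ThreadPoolExecutor

VOCAB = ["{", "}", "a"]


def forward(token_id):
    # Stand-in for the model forward. It consumes the previous token but
    # never looks at the grammar mask. These toy logits always favor "{".
    return [3.0, 1.0, 2.0]


def grammar_mask(prev_token_id):
    # Toy grammar (hypothetical): after "}" only "{" may follow;
    # otherwise only "}" or "a" may follow.
    if VOCAB[prev_token_id] == "}":
        return [True, False, False]
    return [False, True, True]


def sample(logits, mask):
    # Greedy sampling restricted to the tokens the mask allows.
    allowed = [i for i, ok in enumerate(mask) if ok]
    return max(allowed, key=lambda i: logits[i])


def pipelined_step(pool, token_t):
    pending = pool.submit(forward, token_t)  # 1. launch forward t+1
    mask = grammar_mask(token_t)             # 2. commit t: build the mask for t+1
    return sample(pending.result(), mask)    # 3. finalize sampling for t+1


with ThreadPoolExecutor(max_workers=1) as pool:
    next_token = pipelined_step(pool, VOCAB.index("{"))
print(VOCAB[next_token])  # a
```

Without the mask, greedy sampling would pick "{" again. With it, the step yields "a", even though the mask was built while the forward was already in flight.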
Fun fact: writing this post led me to discover an accidental sync in the mask H2D copy that meant we were only doing partial overlap. 🤦‍♂️
Went digging because the cost model said we were only observing half the speedup that was theoretically predicted.
Prefill rides the same pipeline too.
A prefill is another launch in the two-slot loop, so decode commits can overlap prefill forwards, and decode forwards can overlap request admission.
This matters for short outputs, where bubbles between decode and prefill can be brutal.
To make that safe, Photon uses ping-pong decode slots.
Each step has a working set: inputs, logits, sampled token, KV metadata, pinned host buffers.
Two slots let one step’s results be read while the next step is already running.
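The slot rotation itself is simple to sketch. This is an illustration under loud assumptions: `DecodeSlot` and `fake_forward` are hypothetical, the "forward" runs synchronously on the CPU, and the working set is reduced to two integers. In a real engine the launch would be asynchronous, and the slot would hold device tensors and pinned host buffers.

```python
from dataclasses import dataclass


@dataclass
class DecodeSlot:
    # Hypothetical, stripped-down working set for one decode step.
    step: int = -1
    sampled_token: int = -1


def fake_forward(slot, step, prev_token):
    # Stand-in for an async GPU launch that writes its results into `slot`.
    slot.step = step
    slot.sampled_token = prev_token + 1


def run(num_steps, first_token=0):
    slots = [DecodeSlot(), DecodeSlot()]
    committed = []
    fake_forward(slots[0], 0, first_token)  # prime the pipeline with step 0
    for step in range(1, num_steps + 1):
        prev = slots[(step - 1) % 2]  # results of the previous step
        cur = slots[step % 2]         # the other slot, free to overwrite
        if step < num_steps:
            # Launch the next step into the other slot first...
            fake_forward(cur, step, prev.sampled_token)
        # ...then commit the previous step. Its slot was not touched by the
        # launch above, so reading it here is safe.
        committed.append((prev.step, prev.sampled_token))
    return committed


print(run(4))  # [(0, 1), (1, 2), (2, 3), (3, 4)]
```

With a single slot, the launch for step t+1 would overwrite the buffers that the commit for step t still needs to read.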
The forwards share one compute stream.
The token copy back to CPU goes on a separate copy stream, anchored to the exact event that produced it.
So the CPU no longer blocks the GPU just to retrieve bookkeeping data.


@vikhyatk TBH, this works until you need to do MTP drafts. In that case, even with early abort / restart tricks etc., it's just not as good as syncing on the CPU, making the decision there, and making sure you have minimal CPU overhead.

@liuliu Interesting. Need to think about this -- planning to work on speculative decoding this month.

@vikhyatk If you control your own kernel, you can make the verification decision (move 1, move 2, or move 3 for your KV cache) right in the GPU kernel, but that is too complicated for me to try.