Moondream Photon Engine Speeds Inference 35% by Hiding GPU Bubbles
The problem with autoregressive decoding: each token is only a small amount of GPU work, but the CPU bookkeeping cost is paid on every token.
In a blocking decode loop, the GPU finishes a forward and then waits while the CPU commits results and plans the next step.
That idle gap is called a GPU bubble.
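A rough way to see what the bubble costs is a back-of-the-envelope model. The sketch below assumes a fixed GPU time and a fixed CPU time per decode step. The names `StepCost`, `blocking_step_ms`, `overlapped_step_ms`, and `bubble_fraction` are made up for illustration, and the numbers in the usage note are not measurements from Photon.

```python
from dataclasses import dataclass


@dataclass
class StepCost:
    """Hypothetical per-step costs; illustrative, not measured."""
    gpu_ms: float  # time for one forward pass
    cpu_ms: float  # time to commit results and plan the next step


def blocking_step_ms(cost: StepCost) -> float:
    # Blocking loop: the GPU sits idle while the CPU works, so costs add up.
    return cost.gpu_ms + cost.cpu_ms


def overlapped_step_ms(cost: StepCost) -> float:
    # Idealized overlap: CPU work hides behind the forward pass,
    # so the slower of the two sets the pace.
    return max(cost.gpu_ms, cost.cpu_ms)


def bubble_fraction(cost: StepCost) -> float:
    # Share of each blocking step during which the GPU is idle.
    return cost.cpu_ms / blocking_step_ms(cost)
```

With made-up values of `StepCost(gpu_ms=8.0, cpu_ms=2.0)`, the blocking loop takes 10 ms per token, the idealized overlapped loop takes 8 ms, and the GPU is idle for 20% of every blocking step.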
https://moondream.ai/blog/popping-the-gpu-bubble
Running ahead means a request is already included in the batch for step t+1 by the time we see its EOS from step t.
Photon handles this with zombies: finalize early, release late. The extra row rides one more forward, then its KV pages and slots are freed once refs hit zero.
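The "finalize early, release late" idea can be sketched with plain reference counting. This is a toy model, not Photon's code: `Row`, `PagePool`, `launch_forward`, `finalize`, and `retire_forward` are hypothetical names, and the assumption is that every in-flight forward holds one reference on each row it contains.

```python
class PagePool:
    """Toy KV page allocator (hypothetical)."""

    def __init__(self, num_pages):
        self.free = set(range(num_pages))

    def alloc(self, n):
        return [self.free.pop() for _ in range(n)]

    def release(self, pages):
        self.free.update(pages)


class Row:
    """One request's slot in the decode batch."""

    def __init__(self, request_id, kv_pages):
        self.request_id = request_id
        self.kv_pages = list(kv_pages)
        self.refs = 0           # number of in-flight forwards using this row
        self.finalized = False  # True once EOS has been seen


def launch_forward(rows):
    # Each launched forward pins the rows it includes.
    for row in rows:
        row.refs += 1
    return list(rows)


def finalize(row):
    # Finalize early: the request is done, but its memory may still be in use.
    row.finalized = True


def retire_forward(in_flight, pool):
    # Release late: free pages only when a finalized row has no users left.
    for row in in_flight:
        row.refs -= 1
        if row.finalized and row.refs == 0:
            pool.release(row.kv_pages)
            row.kv_pages = []
```

In this model, a row that is launched in steps t and t+1, finalized when step t reports EOS, and retired from step t still holds its pages as a zombie. Only when step t+1 is retired do its refs reach zero and its pages return to the pool.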
Constrained decoding adds a dependency: the mask for step t+1 depends on the token sampled at step t.
Fortunately, the forward does not need that mask.
So Photon launches forward t+1 first, commits t, then finalizes sampling for t+1.
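The ordering can be illustrated with a toy decoder in which a single worker thread stands in for the GPU. Everything here is an assumption made for illustration: the four-token vocabulary, the alternating digit/comma grammar, the constant logits, and the names `forward`, `allowed_after`, `masked_argmax`, and `decode`. For simplicity the toy passes the previous token to `forward` directly. How Photon feeds tokens to the next forward is not described here.

```python
from concurrent.futures import ThreadPoolExecutor

VOCAB = ["0", "1", ",", "<eos>"]


def forward(prev_token):
    # Stand-in for a forward pass. It returns unmasked logits and takes
    # no mask argument, which is what lets it start before the mask exists.
    return [0.1, 0.9, 0.5, 0.0]


def allowed_after(prev_token):
    # Toy grammar: a digit must be followed by "," or "<eos>",
    # and a comma must be followed by a digit.
    if prev_token in ("0", "1"):
        return {",", "<eos>"}
    return {"0", "1"}


def masked_argmax(logits, allowed):
    # Greedy sampling restricted to the tokens the grammar allows.
    candidates = [i for i, tok in enumerate(VOCAB) if tok in allowed]
    return VOCAB[max(candidates, key=lambda i: logits[i])]


def decode(first_token, steps):
    out = [first_token]
    with ThreadPoolExecutor(max_workers=1) as gpu:
        for _ in range(steps):
            pending = gpu.submit(forward, out[-1])  # 1. launch forward t+1
            mask = allowed_after(out[-1])           # 2. commit t, build the mask
            logits = pending.result()               # wait for the forward
            out.append(masked_argmax(logits, mask)) # 3. finalize sampling t+1
    return out
```

Calling `decode("0", 4)` returns `["0", ",", "1", ",", "1"]`: the unmasked logits always favor `"1"`, but the mask computed while the forward is in flight forces a comma after each digit.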