
Moondream Photon Engine Speeds Inference 35% by Hiding GPU Bubbles

vik@vikhyatk

Wrote a post about how Photon (Moondream's inference engine) hides GPU bubbles using pipelined decoding. Speeding up inference by up to 35%.

11:39 AM · Jun 4, 2026 · 5.3K Views
vik@vikhyatk

The problem with autoregressive decoding: one token of GPU work is small, but CPU bookkeeping is paid every token.

In a blocking decode loop, the GPU finishes a forward and then waits while the CPU commits results and plans the next step.

That idle gap is called a GPU bubble.
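As a back-of-the-envelope illustration of why the bubble matters, here is a minimal timing model. It assumes the CPU bookkeeping can be fully overlapped with the next forward, which is an idealization. The function names and the millisecond figures in the usage note are hypothetical and are not measurements from the post.

```python
def blocking_step_ms(gpu_ms, cpu_ms):
    """Blocking loop: CPU bookkeeping runs while the GPU sits idle."""
    return gpu_ms + cpu_ms


def pipelined_step_ms(gpu_ms, cpu_ms):
    """Idealized overlap: CPU work for step t runs during forward t+1."""
    return max(gpu_ms, cpu_ms)


def bubble_fraction(gpu_ms, cpu_ms):
    """Share of each blocking step during which the GPU is idle."""
    return cpu_ms / blocking_step_ms(gpu_ms, cpu_ms)


def throughput_gain(gpu_ms, cpu_ms):
    """Relative tokens-per-second gain from hiding the bubble, under this model."""
    return blocking_step_ms(gpu_ms, cpu_ms) / pipelined_step_ms(gpu_ms, cpu_ms) - 1.0
```

With a hypothetical 10 ms forward and 3 ms of bookkeeping, the blocking loop spends 13 ms per token against 10 ms when the bookkeeping is hidden, so the model predicts roughly 30% more tokens per second. The gain shrinks as the forward gets longer relative to the CPU work.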

vik@vikhyatk

https://moondream.ai/blog/popping-the-gpu-bubble

2h · Views 544 · Likes 6 · Bookmarks 1
Replies: 1
vik@vikhyatk

Running ahead means a request is already included in t+1 by the time we get the EOS from step t.

Photon handles this with zombies: finalize early, release late. The extra row rides one more forward, then its KV pages and slots are freed once refs hit zero.
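The "finalize early, release late" idea can be sketched with plain reference counting. This is a toy model, not Photon's code: `Row`, `PagePool`, and the helper functions are hypothetical names, and the reference count here simply tracks how many launched forwards still include the row.

```python
from dataclasses import dataclass


@dataclass
class Row:
    request_id: str
    kv_pages: list
    refs: int = 0            # launched forwards that still include this row
    finalized: bool = False  # output already handed back to the caller
    released: bool = False   # KV pages returned to the pool


class PagePool:
    def __init__(self, num_pages):
        self.free = set(range(num_pages))

    def alloc(self, count):
        return [self.free.pop() for _ in range(count)]

    def release(self, pages):
        self.free.update(pages)


def maybe_release(row, pool):
    # Free resources only once the row is finalized AND nothing is in flight.
    if row.finalized and row.refs == 0 and not row.released:
        pool.release(row.kv_pages)
        row.released = True


def launch_forward(rows):
    for row in rows:
        row.refs += 1


def retire_forward(rows, pool):
    for row in rows:
        row.refs -= 1
        maybe_release(row, pool)


def finalize(row, pool):
    row.finalized = True  # finalize early: the request is done for the caller
    maybe_release(row, pool)


def demo():
    pool = PagePool(8)
    row = Row("req-0", pool.alloc(2))
    events = []
    launch_forward([row])        # step t
    launch_forward([row])        # step t+1, launched before t's token is known
    retire_forward([row], pool)  # step t completes
    finalize(row, pool)          # EOS observed from step t
    events.append((row.finalized, row.released, len(pool.free)))
    retire_forward([row], pool)  # step t+1 completes, refs hit zero
    events.append((row.finalized, row.released, len(pool.free)))
    return events
```

In `demo()`, the row is a zombie between the two recorded events: it is already finalized, but its two pages stay allocated until the forward for step t+1 retires.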

vik@vikhyatk

Constrained decoding adds a dependency: the mask for step t+1 depends on the token sampled at step t.

Fortunately, the forward does not need that mask.

So Photon launches forward t+1 first, commits t, then finalizes sampling for t+1.
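A minimal sketch of that ordering, under loud assumptions: a single-worker thread pool stands in for an in-order GPU stream, `forward` is a toy model whose logits depend only on the previous token, and the "grammar" is a made-up rule that forbids repeating a token. None of these names come from Photon.

```python
from concurrent.futures import ThreadPoolExecutor

VOCAB = 4


def forward(prev_token):
    """Toy model: logits depend on the previous token, never on a mask."""
    logits = [0.0] * VOCAB
    logits[prev_token] = 2.0                # unconstrained favourite: repeat
    logits[(prev_token + 1) % VOCAB] = 1.0  # runner-up
    return logits


def grammar_mask(prev_token):
    """Toy constraint: a token may not repeat immediately."""
    return [tok != prev_token for tok in range(VOCAB)]


def masked_argmax(logits, mask):
    allowed = [tok for tok in range(VOCAB) if mask[tok]]
    return max(allowed, key=lambda tok: logits[tok])


def decode(first_token, steps):
    committed = [first_token]
    # One worker means jobs run in submission order, like a single stream.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        logits_f = gpu.submit(forward, first_token)
        mask = grammar_mask(first_token)
        token_f = gpu.submit(lambda lf=logits_f, m=mask: masked_argmax(lf.result(), m))
        for _ in range(steps - 1):
            # 1. Launch forward t+1; it reads token t on the "device" side.
            next_logits_f = gpu.submit(lambda tf=token_f: forward(tf.result()))
            # 2. Commit step t on the CPU while that forward is queued or running.
            token = token_f.result()
            committed.append(token)
            # 3. The mask for t+1 is known only now; finalize sampling for t+1.
            mask = grammar_mask(token)
            token_f = gpu.submit(
                lambda lf=next_logits_f, m=mask: masked_argmax(lf.result(), m)
            )
        committed.append(token_f.result())
    return committed[1:]
```

Starting from token 0, `decode(0, 4)` never emits a repeat even though the unconstrained logits always favour one, and the mask for each step is computed only after the following forward has been queued.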

2h · Views 125 · Likes 3 · Bookmarks 0