Moondream Photon Engine Speeds Inference 35% by Hiding GPU Bubbles

Original post

vik@vikhyatk

Wrote a post about how Photon (Moondream's inference engine) hides GPU bubbles using pipelined decoding, speeding up inference by up to 35%.

11:39 AM · Jun 4, 2026 · 20.6K Views

Related links

Popping the GPU Bubble | Moondream

Posts from X

vik@vikhyatk

The problem with autoregressive decoding: one token of GPU work is small, but CPU bookkeeping is paid every token.

In a blocking decode loop, the GPU finishes a forward and then waits while the CPU commits results and plans the next step.

That idle gap is called a GPU bubble.
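
The size of the bubble can be estimated with a toy timing model. This is a minimal sketch: the function names and the 6 ms / 2 ms figures are hypothetical and chosen only for illustration, not taken from the post.

```python
def blocking_step_ms(gpu_ms: float, cpu_ms: float) -> float:
    """One step of a blocking loop: GPU forward, then CPU bookkeeping, back to back."""
    return gpu_ms + cpu_ms


def bubble_fraction(gpu_ms: float, cpu_ms: float) -> float:
    """Share of each blocking step during which the GPU sits idle."""
    return cpu_ms / blocking_step_ms(gpu_ms, cpu_ms)


# Hypothetical numbers: 6 ms of GPU work and 2 ms of CPU bookkeeping per token.
print(blocking_step_ms(6.0, 2.0))  # 8.0
print(bubble_fraction(6.0, 2.0))   # 0.25
```

Because the CPU cost is paid on every token, the idle share stays constant no matter how long the generation runs.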

vik@vikhyatk

https://moondream.ai/blog/popping-the-gpu-bubble

vik@vikhyatk

The fix is pipelined decoding: launch the next forward before the current token has fully returned to the CPU.

The next token can stay on the GPU. The CPU copy for streaming and done-state can happen in the background.
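
The launch-before-commit ordering can be sketched with plain Python threads. This is not Photon's code: a single worker thread stands in for the GPU's in-order compute stream, a `Future` stands in for a token that stays on the device, and `forward` and `pipelined_decode` are hypothetical names.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def forward(prev_token: "Future[int] | int") -> int:
    """Stand-in for one GPU forward plus sampling; the 'model' just adds 1."""
    # The input token is read on the worker ("device") side, so the main
    # thread never has to wait for it before launching the next step.
    token = prev_token.result() if isinstance(prev_token, Future) else prev_token
    return token + 1


def pipelined_decode(prompt_token: int, steps: int) -> list[int]:
    committed: list[int] = []
    # One worker thread plays the role of the single, in-order compute stream.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        current = gpu.submit(forward, prompt_token)
        for _ in range(steps):
            nxt = gpu.submit(forward, current)   # launch step t+1 first
            committed.append(current.result())   # then commit step t on the CPU
            current = nxt
        # `current` is one step of run-ahead work that is never committed.
    return committed


print(pipelined_decode(10, 3))  # [11, 12, 13]
```

The loop always has one uncommitted forward in flight, which is why a finished request can still appear in the following step.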

vik@vikhyatk

Running ahead means a request is already included in t+1 by the time we get the EOS from step t.

Photon handles this with zombies: finalize early, release late. The extra row rides one more forward, then its KV pages and slots are freed once refs hit zero.
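
"Finalize early, release late" can be illustrated with a small reference-counting sketch. The `Row` and `Pool` classes and their method names are assumptions made for this example; the post does not show Photon's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    kv_pages: list[int]
    refs: int = 0            # in-flight forwards that still include this row
    finalized: bool = False  # result already handed back to the caller


@dataclass
class Pool:
    free_pages: list[int] = field(default_factory=list)

    def launch(self, row: Row) -> None:
        row.refs += 1

    def finalize(self, row: Row) -> None:
        row.finalized = True  # finalize early: the request is done for the user

    def forward_done(self, row: Row) -> None:
        row.refs -= 1
        if row.finalized and row.refs == 0:  # release late
            self.free_pages.extend(row.kv_pages)
            row.kv_pages = []


pool = Pool()
row = Row(kv_pages=[7, 8])
pool.launch(row)        # step t
pool.launch(row)        # step t+1, launched before step t's EOS is seen
pool.finalize(row)      # EOS from step t arrives
pool.forward_done(row)  # step t retires; the zombie row is still in t+1
print(pool.free_pages)  # []
pool.forward_done(row)  # step t+1 retires; refs hit zero
print(pool.free_pages)  # [7, 8]
```

Holding the pages until the count reaches zero keeps the in-flight forward from touching memory that has been handed to another request.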

vik@vikhyatk

Constrained decoding adds a dependency: the mask for step t+1 depends on the token sampled at step t.

Fortunately, the forward does not need that mask.

So Photon launches forward t+1 first, commits t, then finalizes sampling for t+1.
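
The ordering works because the mask is only applied at sampling time, after the logits exist. A minimal sketch under stated assumptions: the four-token vocabulary, the toy `forward`, and the parity "grammar" in `mask_after` are all hypothetical stand-ins.

```python
def forward(token: int) -> list[float]:
    """Hypothetical stand-in for the model: logits over a 4-token vocabulary."""
    return [float((token + i) % 4) for i in range(4)]


def mask_after(token: int) -> list[bool]:
    """Hypothetical grammar: only tokens with the same parity as `token` are allowed."""
    return [(i % 2) == (token % 2) for i in range(4)]


def masked_argmax(logits: list[float], allowed: list[bool]) -> int:
    """Greedy sampling restricted to the allowed token ids."""
    candidates = [i for i in range(len(logits)) if allowed[i]]
    return max(candidates, key=lambda i: logits[i])


def constrained_step(token_t: int) -> int:
    logits = forward(token_t)              # 1. launch forward t+1 (no mask needed)
    allowed = mask_after(token_t)          # 2. commit t and derive the mask for t+1
    return masked_argmax(logits, allowed)  # 3. finalize sampling for t+1


print(constrained_step(1))  # 1
```

Here the unmasked argmax would be token 2, so the mask changes the result even though it was computed after the forward was launched.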

vik@vikhyatk

Fun fact: writing this post led me to discover an accidental sync in the mask H2D that meant we were only doing partial overlap. 🤦‍♂️

Went digging because the cost model said we were only observing half the speedup that was theoretically predicted.
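
A rough version of such a cost model shows how a stray sync surfaces as a gap between predicted and observed speedup. The formula, the `overlap` parameter, and the 6 ms / 2 ms figures are assumptions for illustration only; the post's actual model is not reproduced here.

```python
def pipelined_step_ms(gpu_ms: float, cpu_ms: float, overlap: float) -> float:
    """Step time when a fraction `overlap` (0..1) of the hideable CPU work is hidden."""
    hidden = overlap * min(gpu_ms, cpu_ms)
    return gpu_ms + cpu_ms - hidden


def speedup(gpu_ms: float, cpu_ms: float, overlap: float) -> float:
    """Speedup over a blocking loop, whose step costs gpu_ms + cpu_ms."""
    return (gpu_ms + cpu_ms) / pipelined_step_ms(gpu_ms, cpu_ms, overlap)


# Hypothetical numbers: 6 ms of GPU work and 2 ms of CPU work per token.
print(round(speedup(6.0, 2.0, 1.0), 3))  # 1.333  full overlap
print(round(speedup(6.0, 2.0, 0.5), 3))  # 1.143  partial overlap
```

When measurements land near the partial-overlap figure instead of the full-overlap one, an unintended synchronization point is a natural suspect.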

vik@vikhyatk

Prefill rides the same pipeline too.

A prefill is another launch in the two-slot loop, so decode commits can overlap prefill forwards, and decode forwards can overlap request admission.

This matters for short outputs, where bubbles between decode and prefill can be brutal.
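
The event order of a two-slot loop that mixes prefill and decode launches can be sketched as follows. `two_slot_order` and the work-item labels are hypothetical; the sketch only prints the order in which launches and commits would be issued.

```python
def two_slot_order(work: list[str]) -> list[str]:
    """List launch/commit events for a loop that keeps one launch in flight."""
    events: list[str] = []
    prev = None
    for i, item in enumerate(work):
        events.append(f"launch {item} -> slot {i % 2}")
        if prev is not None:
            # CPU commit of the previous launch overlaps the forward just launched.
            events.append(f"commit {prev}")
        prev = item
    if prev is not None:
        events.append(f"commit {prev}")  # drain the last in-flight launch
    return events


for event in two_slot_order(["decode A", "prefill B", "decode A+B"]):
    print(event)
# launch decode A -> slot 0
# launch prefill B -> slot 1
# commit decode A
# launch decode A+B -> slot 0
# commit prefill B
# commit decode A+B
```

The commit of "decode A" is issued while "prefill B" is already running, which is the overlap the post describes.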

vik@vikhyatk

To make that safe, Photon uses ping-pong decode slots.

Each step has a working set: inputs, logits, sampled token, KV metadata, pinned host buffers.

Two slots let one step’s results be read while the next step is already running.
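
A ping-pong pair can be sketched as two copies of the per-step working set, selected by step parity. The `DecodeSlot` fields mirror the list above, but the class, the `slot_for` helper, and the use of plain Python lists in place of device and pinned host buffers are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DecodeSlot:
    """Per-step working set; plain containers stand in for real buffers."""
    inputs: list = field(default_factory=list)
    logits: list = field(default_factory=list)
    sampled: list = field(default_factory=list)
    kv_metadata: dict = field(default_factory=dict)
    host_buffer: list = field(default_factory=list)


slots = (DecodeSlot(), DecodeSlot())


def slot_for(step: int) -> DecodeSlot:
    """Even steps use slot 0, odd steps use slot 1."""
    return slots[step % 2]


slot_for(4).sampled = [101]  # step t writes into one slot
slot_for(5).sampled = [202]  # step t+1 writes into the other
print(slot_for(4).sampled)   # [101]  step t's result is still intact
```

Step t+2 reuses step t's slot, so step t has to be committed before that launch overwrites it.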

vik@vikhyatk

The forwards share one compute stream.

The token copy back to CPU goes on a separate copy stream, anchored to the exact event that produced it.

So the CPU no longer blocks the GPU just to retrieve bookkeeping data.
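
The two-stream arrangement can be mimicked with two single-worker executors. This is only an analogy in plain Python, not GPU code: one worker plays the compute stream, the other plays the copy stream, and each copy waits on the `Future` of the forward that produced its token, much as a copy would wait on a recorded event. All names are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def forward(step: int) -> int:
    """Stand-in for a forward that leaves a sampled token on the device."""
    return step * 10


def copy_to_host(produced: Future, host: dict, step: int) -> None:
    """Copy task: waits only on the forward that produced this token."""
    host[step] = produced.result()


def run(steps: int) -> dict:
    host: dict = {}
    with ThreadPoolExecutor(max_workers=1) as compute, \
         ThreadPoolExecutor(max_workers=1) as copy:
        copies = []
        for step in range(steps):
            produced = compute.submit(forward, step)  # compute stream
            copies.append(copy.submit(copy_to_host, produced, host, step))  # copy stream
        for c in copies:
            c.result()  # the CPU reads results only when it needs them
    return host


print(run(3))  # {0: 0, 1: 10, 2: 20}
```

The submit loop never waits on a copy, so later forwards are queued while earlier tokens are still on their way to the host.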

Liu Liu@liuliu

@vikhyatk TBH, this works until you need to do MTP draft. In that case, even with early-abort/restart tricks, it is just not as good as syncing on the CPU, making the decision there, and making sure you have minimal CPU overhead.

vik@vikhyatk

@liuliu Interesting. Need to think about this -- planning to work on speculative decoding this month.

Liu Liu@liuliu

@vikhyatk If you control your own kernel, you can make the verification decision (move 1, move 2, or move 3 for your KV cache) right in the GPU kernel, but that is too complicated for me to try.
