Kog AI Hits 3000 Tokens Per Second On 8x AMD MI300X GPUs

I had to test it myself to believe this unreal inference speed: 3,000 tokens/s for 1 user on standard datacenter GPUs. They leveraged a hidden efficiency gap in how GPUs generate tokens.

@Kog__AI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 tokens/s on 8× NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels.

Their trick: they get the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem. For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing.

Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token.

Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture. The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips. They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs.

On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request. Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.

10:14 AM · May 29, 2026

Try their Playground → http://playground.kog.ai

Their main technical report/blog - https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/

Now, why is 3,000 tokens/s so hard?

At batch size 1, LLM decoding is mostly memory movement, not raw compute. Every new token needs the model’s active weights streamed from high-bandwidth memory, and at this speed the whole system gets only ~333 microseconds per token.

That is why small delays suddenly become huge: a few microseconds lost at every layer can destroy the whole speed target.
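To make the budget concrete, here is a rough back-of-the-envelope sketch in Python. The 3,000 tokens/s target, the 8 GPUs, and the ~4 GB of FP16 weights for a 2B model come from the thread; the layer count and the "3 microseconds lost per layer" figure are purely illustrative assumptions, not numbers from Kog.

```python
# Back-of-the-envelope decode budget at batch size 1 (a sketch, not a benchmark).
TARGET_TOKENS_PER_S = 3_000   # from the post
WEIGHT_BYTES = 4e9            # 2B parameters in FP16 = ~4 GB, from the thread
NUM_GPUS = 8                  # from the post
NUM_LAYERS = 24               # ASSUMPTION: illustrative layer count only
LOST_US_PER_LAYER = 3.0       # ASSUMPTION: illustrative per-layer stall


def per_token_budget_us(tokens_per_s: float) -> float:
    """Wall-clock time available for one token, in microseconds."""
    return 1e6 / tokens_per_s


def required_bandwidth_tb_s(weight_bytes: float, tokens_per_s: float, num_gpus: int) -> float:
    """Per-GPU read rate (TB/s) if weights are split evenly and read once per token."""
    return weight_bytes * tokens_per_s / num_gpus / 1e12


def overhead_fraction(lost_us_per_layer: float, num_layers: int, tokens_per_s: float) -> float:
    """Share of the per-token budget consumed by small per-layer stalls."""
    return lost_us_per_layer * num_layers / per_token_budget_us(tokens_per_s)


budget_us = per_token_budget_us(TARGET_TOKENS_PER_S)
bandwidth = required_bandwidth_tb_s(WEIGHT_BYTES, TARGET_TOKENS_PER_S, NUM_GPUS)
stall_share = overhead_fraction(LOST_US_PER_LAYER, NUM_LAYERS, TARGET_TOKENS_PER_S)

print(f"budget per token: {budget_us:.0f} us")
print(f"sustained read rate per GPU: {bandwidth:.2f} TB/s")
print(f"budget lost to stalls: {stall_share:.1%}")
```

Under these assumptions, a stall of only a few microseconds per layer already eats roughly a fifth of the per-token budget, which is the point the post is making.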

Rohan Paul (@rohanpaul_ai)

The monokernel idea is one of their most powerful tricks.

Instead of launching many small GPU programs for normalization, attention, feed-forward layers, sampling, and communication, Kog keeps the whole decode loop inside 1 long-running GPU program.

With a monokernel, weights for the next stage can start loading while the current stage is still finishing, so the GPU behaves more like a pipeline and less like a machine constantly being paused and restarted.

If a Transformer layer is broken into many small GPU programs, the system can burn a scary amount of its budget just stopping, starting, syncing, writing, reloading, and waiting, before doing useful token generation.

The monokernel tries to remove that stop-start behavior.
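A toy cost model helps show why the stop-start pattern hurts so much at this speed. This is only a sketch: the layer count, kernels per layer, per-kernel overhead, and weight-streaming time below are hypothetical values chosen for illustration, and the model ignores everything except fixed per-kernel costs.

```python
from dataclasses import dataclass


@dataclass
class DecodeCostModel:
    """Toy per-token cost model. All field values used below are hypothetical."""
    num_layers: int
    kernels_per_layer: int
    per_kernel_overhead_us: float  # launch + sync + reload cost paid per separate kernel
    stream_time_us: float          # time to stream the active weights once

    def multi_kernel_us(self) -> float:
        # Every small kernel pays the fixed overhead on every token.
        stalls = self.num_layers * self.kernels_per_layer * self.per_kernel_overhead_us
        return self.stream_time_us + stalls

    def monokernel_us(self) -> float:
        # One persistent kernel: the launch is paid once for the whole generation,
        # so it is treated as amortized to zero per token in this sketch.
        return self.stream_time_us


def tokens_per_s(per_token_us: float) -> float:
    return 1e6 / per_token_us


# ASSUMPTION: illustrative numbers only, not measurements from Kog.
model = DecodeCostModel(num_layers=24, kernels_per_layer=8,
                        per_kernel_overhead_us=4.0, stream_time_us=250.0)

print(f"multi-kernel: {model.multi_kernel_us():.0f} us/token "
      f"({tokens_per_s(model.multi_kernel_us()):.0f} tok/s)")
print(f"monokernel:   {model.monokernel_us():.0f} us/token "
      f"({tokens_per_s(model.monokernel_us()):.0f} tok/s)")
```

With these made-up inputs the fixed overheads are several times larger than the weight streaming itself, which is the regime the post describes.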

Once it begins, it stays resident on the GPU and handles the full sequence, including prefill, decode, sampling, tensor-parallel communication, reductions, and internal state, without going back to the CPU for every little step.

The big gain is that weight streaming stays continuous.

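For batch-size-1 inference, the GPU mostly needs to stream active model weights from high-bandwidth memory into compute units as smoothly as possible. Read more about their “monokernel” implementation here: https://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus/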

kog.ai
Building a single-kernel, latency-optimized LLM inference engine on AMD MI300X GPUs
We implemented the entire LLM decode pass in a single persistent kernel, no kernel launches, no interruptions, achieving 3,000+ tokens/s per request on AMD MI300X.

Their grid sync trick is probably the most underrated part.

A normal barrier makes the whole GPU wait together, while Kog’s version lets each compute unit wait only for the exact value it depends on.

Their diagram shows the idea in the simplest way: each compute unit checks only the small piece of data it needs.

The NaN boxes mean “not ready yet,” and the v0, v1, v2 boxes mean “real data is ready now.”

So the sync signal is built into the data itself, which is why they can reduce broad waiting and cut grid sync from around 7 microseconds to under 1 microsecond.
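The "signal lives in the data" idea can be sketched on the CPU with plain Python threads. This is an analogy under stated assumptions, not Kog's GPU code: the buffer, `producer`, and `consumer` names are hypothetical, threads stand in for compute units, and the scheme assumes a real value is never NaN. On a GPU the writes would also need the right memory-ordering guarantees, which this sketch does not model.

```python
import math
import threading
import time


def run_demo(num_slots: int = 4) -> list:
    """Each consumer waits only on its own slot; NaN means 'not ready yet'."""
    buffer = [math.nan] * num_slots   # shared data, pre-filled with the sentinel
    results = [0.0] * num_slots

    def producer(i: int) -> None:
        # Finish in staggered (reverse) order to show there is no global barrier.
        time.sleep(0.001 * (num_slots - i))
        buffer[i] = float(i * 10)     # writing the real value IS the ready signal

    def consumer(i: int) -> None:
        # Poll only the one value this unit depends on.
        while math.isnan(buffer[i]):
            time.sleep(0)             # yield so other threads can run
        results[i] = buffer[i] + 1.0

    threads = [threading.Thread(target=consumer, args=(i,)) for i in range(num_slots)]
    threads += [threading.Thread(target=producer, args=(i,)) for i in range(num_slots)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_demo())
```

Each consumer proceeds as soon as its own slot holds real data, regardless of whether the other slots are still NaN.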

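Read more about grid sync here - https://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus/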

Delayed Tensor Parallelism changes the model so cross-GPU communication does not sit directly on the critical path.

Standard tensor parallelism often waits for all GPUs to combine their partial results after a module, but DTP delays that combine step so useful work can continue.
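One way to see why moving the combine step off the critical path matters is a simple two-stage pipeline timing model. This is a deliberate simplification and not a description of the actual DTP architecture (see Kog's post for that): it only assumes that each layer's cross-GPU combine can overlap with the next layer's compute, and the per-layer times are hypothetical.

```python
def standard_tp_us(num_layers: int, compute_us: float, comm_us: float) -> float:
    """Blocking tensor parallelism: every layer waits for its combine step."""
    return num_layers * (compute_us + comm_us)


def overlapped_tp_us(num_layers: int, compute_us: float, comm_us: float) -> float:
    """Combine for layer i overlaps with compute for layer i+1 (simplified model).

    Only the first compute and the final combine are fully exposed; every step
    in between costs whichever of the two is slower.
    """
    if num_layers == 0:
        return 0.0
    return compute_us + (num_layers - 1) * max(compute_us, comm_us) + comm_us


# ASSUMPTION: illustrative per-layer times, not measurements.
LAYERS, COMPUTE_US, COMM_US = 24, 10.0, 4.0

blocking = standard_tp_us(LAYERS, COMPUTE_US, COMM_US)
overlapped = overlapped_tp_us(LAYERS, COMPUTE_US, COMM_US)
print(f"blocking:   {blocking:.0f} us per token")
print(f"overlapped: {overlapped:.0f} us per token")
```

In this model communication is almost entirely hidden whenever it is shorter than the compute it overlaps with, so only one combine step remains exposed per token.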

Read more about their “Delayed Tensor Parallelism” here - https://blog.kog.ai/delayed-tensor-parallelism-for-faster-transformer-inference/

kog.ai
Delayed Tensor Parallelism for Faster Transformer Inference
DTP is a new Transformer architecture that hides communication overhead behind computation and weight streaming, enabling significantly faster batch-size-one inference on AMD and NVIDIA GPUs.

Now, what about scaling to other large third-party MoE models?

- At batch size 1, GPT-OSS-120B has 5.1B active parameters. In FP8, that is in the same size ballpark as their 2B model in FP16 (5.1 GB vs 4 GB), on which their preview experiment was run.

- Similarly, DeepSeek V4 Flash has 13B active parameters in mixed FP4/FP8, so ballpark around 3x bigger than 4 GB. Kog could therefore reach >1,000 tok/s on it with MI300X/H200 and up to 4k tok/s on next-generation GPUs.
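The size comparison above can be written out as a short calculation. The sketch assumes decode speed scales inversely with the bytes of active weights streamed per token, which is a simplification that ignores MoE routing, KV-cache reads, and communication. The 3,000 tok/s baseline on ~4 GB and the parameter counts come from the thread; the 12 GB figure for DeepSeek V4 Flash is just the thread's "around 3x" ballpark.

```python
BASELINE_TOK_S = 3_000.0   # from the post: 2B model on 8x MI300X
BASELINE_GB = 4.0          # 2B parameters x 2 bytes (FP16)


def weight_gb(active_params_billion: float, bytes_per_param: float) -> float:
    """Active weight footprint in GB (1B params at 1 byte each = 1 GB)."""
    return active_params_billion * bytes_per_param


def projected_tok_s(active_weight_gb: float) -> float:
    """ASSUMPTION: throughput is inversely proportional to bytes streamed per token."""
    return BASELINE_TOK_S * BASELINE_GB / active_weight_gb


gpt_oss_gb = weight_gb(5.1, 1.0)   # 5.1B active parameters in FP8
deepseek_gb = 3 * BASELINE_GB      # thread's ballpark: ~3x the 4 GB baseline

print(f"GPT-OSS-120B: {gpt_oss_gb:.1f} GB -> ~{projected_tok_s(gpt_oss_gb):.0f} tok/s")
print(f"DeepSeek V4 Flash: {deepseek_gb:.0f} GB -> ~{projected_tok_s(deepseek_gb):.0f} tok/s")
```

Under this assumption the projection lands around the 1,000 tok/s mark for a ~12 GB active footprint, which matches the rough estimate in the thread.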
