QK Norm Raises LLM Decode Costs 30% at Low Batch Sizes

#1630

Original post

vik@vikhyatk · #1630 in Tech

qk norm makes your decode 30% more expensive at low bsz. stop using qk norm

8:22 PM · Jun 26, 2026 · 11.9K Views

Sentiment

Users criticized QK Norm for raising LLM decode costs and latency 30% at low batch sizes, calling it an unnecessary extra parameter that worsens real-time performance.

Pos: 0.0% · Neg: 100.0%

3 comments with sentiment.

Posts from X

Most Activity

Views: 3K · Bookmarks: 12 · Likes: 53 · Retweets: 1 · Replies: 2

vik@vikhyatk

i take every microsecond of sm inactivity personally

vik@vikhyatk

@liuliu kimi style qk clipping works well. but you can also make it differentiable instead of clipping (which is what we do)

7h20634
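
For readers unfamiliar with the "kimi style qk clipping" mentioned above: as described for Kimi K2's MuonClip optimizer, the idea is to rescale a head's query and key projection weights after an optimizer step whenever that head's largest attention logit exceeds a threshold, so nothing extra runs at decode time. Below is a minimal sketch, assuming plain multi-head attention with weight rows grouped by head. The function names, the row layout, and the threshold value are hypothetical, and this is not the differentiable variant vik refers to, which the thread does not describe.

```python
import math


def qk_clip_scale(max_logit, tau):
    """Per-head factor applied to BOTH W_q and W_k (hypothetical helper).

    gamma = min(1, tau / max_logit); each matrix gets sqrt(gamma) so that
    the q.k logit for this head shrinks by gamma overall.
    """
    gamma = min(1.0, tau / max_logit) if max_logit > 0 else 1.0
    return math.sqrt(gamma)


def apply_qk_clip(w_q, w_k, max_logits, tau, head_dim):
    """Rescale each head's rows of w_q and w_k in place.

    w_q, w_k   : lists of rows, with head h owning rows
                 [h * head_dim, (h + 1) * head_dim)  (assumed layout)
    max_logits : largest pre-softmax logit observed per head this step
    tau        : clipping threshold (value below is illustrative only)
    """
    for h, s_max in enumerate(max_logits):
        s = qk_clip_scale(s_max, tau)
        for r in range(h * head_dim, (h + 1) * head_dim):
            w_q[r] = [x * s for x in w_q[r]]
            w_k[r] = [x * s for x in w_k[r]]
    return w_q, w_k


# Head 0 overshoots the threshold by 4x, head 1 is within bounds.
w_q, w_k = apply_qk_clip([[2.0], [4.0]], [[2.0], [4.0]],
                         max_logits=[400.0, 50.0], tau=100.0, head_dim=1)
```

Because the correction is folded into the weights during training, the decode path under this scheme is the same as for a model with no norm at all.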

vik@vikhyatk

@liuliu

7h16622

Louis@Louis9687221579

@vikhyatk can't you just fine-tune your model to not use norm anymore? Also the new stuff https://arxiv.org/abs/2606.16310 claims they can do better than clipping with only 2% latency overhead at 256 ctx

7h8512

Liu Liu@liuliu

@vikhyatk How do you avoid the attention NaNs caused by activation growth, then?

7h2263

vik@vikhyatk

@osieberling i'm trying to do a bs1 megakernel and the per-head reduction is forcing me to do a separate qkv_epi op. which the hazy megakernel for llama 1b didn't have to do because they didn't have to deal with qknorm

and it's causing this big bubble that i can't figure out how to solve

6h281
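
The "per-head reduction" in question can be sketched as follows. QK norm is commonly an RMSNorm applied to each head's slice of q and k after the QKV projection, so every head needs its own sum of squares before any normalized value can be used downstream. This is a plain-Python illustration, not vik's kernel; the function names, the flat q/k layout, and the shared per-head gain vector are assumptions, and where the norm sits relative to RoPE varies by model.

```python
import math


def rms_norm(vec, gain, eps=1e-6):
    """RMSNorm over one head: a single reduction, then an elementwise scale."""
    # The reduction: every element of the head must be read before
    # any output element can be written.
    mean_sq = sum(x * x for x in vec) / len(vec)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * g for x, g in zip(vec, gain)]


def qk_norm(q, k, q_gain, k_gain, n_heads, head_dim):
    """Normalize each head's slice of flat q and k vectors independently."""
    def per_head(flat, gain):
        out = []
        for h in range(n_heads):
            head = flat[h * head_dim:(h + 1) * head_dim]
            out.extend(rms_norm(head, gain))
        return out

    return per_head(q, q_gain), per_head(k, k_gain)


# Two heads of dimension 2; the second q head is the first scaled by 2.
q_out, k_out = qk_norm([3.0, 4.0, 6.0, 8.0], [1.0, 1.0, 2.0, 2.0],
                       q_gain=[1.0, 1.0], k_gain=[1.0, 1.0],
                       n_heads=2, head_dim=2)
```

In a fused kernel the sum inside `rms_norm` is a synchronization point: the projection output for a whole head has to exist before the scale can be applied, which fits the description of needing a separate epilogue step after the QKV matmul.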

vik@vikhyatk

@osieberling it's an extra reduction, makes it hard to saturate mem bw

7h1712

Oliver Sieberling@osieberling

@vikhyatk it’s not even a full reduction, just per head, so how is this so slow? And why does low bs make the overhead bigger? Shouldn't it be the other way around? At bs 1, decoding is tons of matrix-vector products, which are so slow that RMS norm should be like 0 overhead.

6h591

Oliver Sieberling@osieberling

@vikhyatk why?

7h173

Elliot Arledge@elliotarledge

@vikhyatk anything that does entire row reduction is slow on low batch

7h2483

Ferbin@Ferbin08

@vikhyatk What's the training time cost vs hard clipping? In RL this seems essential, but curious what the overhead is.

7h82

Strata@ChainZenit

@vikhyatk wait, does this impact training speed as much as costs?

7h78

Owlfy.ai@Owlfy_ai

@vikhyatk Yeah, the latency gets even worse with continuous decode on a single stream—30% adds up fast across every token. Learned that the hard way tuning for real-time local inference.

7h69

MinotaurOnLucy@minotauronlucy

@vikhyatk Why clip? Another stupid parameter we need tuning.

6h26

Dilreet Raju@DilreetR

@vikhyatk Fuck qk norm

7h25

Craig Gefedhi@CGefedhi

@liuliu @vikhyatk Hello, author. Why is there no LTX2.3 scail2 added to the software?

7h9

Strata@ChainZenit

@vikhyatk this energy is exactly what keeps protocols growing tbh

6h2