Users criticized QK norm for making LLM decode about 30% more expensive at low batch sizes, which hurts real-time performance. Replies debated clipping-based alternatives, with one calling the clip threshold yet another parameter to tune.
i take every microsecond of sm inactivity personally
qk norm makes your decode 30% more expensive at low bsz. stop using qk norm

@liuliu kimi style qk clipping works well. but you can also make it differentiable instead of clipping (which is what we do)
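The reply above doesn't say which differentiable variant is meant. As a rough illustration only, here is a sketch that contrasts a hard clamp on attention logits with tanh soft-capping, one common smooth alternative. The function names, the `cap` value, and the choice of tanh are all assumptions for illustration, not the poster's method.

```python
import numpy as np

def hard_clip(logits, cap):
    # Hard clamp: gradient is exactly zero wherever |logit| > cap.
    return np.clip(logits, -cap, cap)

def soft_cap(logits, cap):
    # Smooth alternative (assumed here): bounded in (-cap, cap),
    # roughly the identity for |logit| << cap.
    return cap * np.tanh(logits / cap)

def soft_cap_grad(logits, cap):
    # d/dx [cap * tanh(x / cap)] = 1 - tanh(x / cap)^2, never exactly zero.
    return 1.0 - np.tanh(logits / cap) ** 2

cap = 50.0  # hypothetical threshold
logits = np.array([-200.0, -10.0, 0.0, 10.0, 200.0])

clipped = hard_clip(logits, cap)
capped = soft_cap(logits, cap)
grads = soft_cap_grad(logits, cap)
```

Both versions keep the logits bounded, and both still need a `cap`, so the "another parameter to tune" complaint applies to either one.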

@vikhyatk can't you just fine-tune your model to not use norm anymore? Also the new stuff https://arxiv.org/abs/2606.16310 claims they can do better than clipping with only 2% latency overhead at 256ctx

@vikhyatk How do you avoid the attention NaNs caused by activation growth then?

@osieberling i'm trying to do a bs1 megakernel and the per-head reduction is forcing me to do a separate qkv_epi op. which the hazy megakernel for llama 1b didn't have to do because they didn't have to deal with qknorm
and it's causing this big bubble that i can't figure out how to solve

@osieberling it's an extra reduction, makes it hard to saturate mem bw
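For anyone following along, here is a minimal NumPy sketch of the per-head reduction being discussed, assuming QK norm means an RMSNorm over `head_dim` applied to q and k separately for each head. The shapes, the `qk_rmsnorm` name, and the shared `gain` vector are illustrative assumptions, not taken from any specific model.

```python
import numpy as np

def qk_rmsnorm(x, weight, eps=1e-6):
    # x: (n_heads, head_dim) for one decoded token (bs = 1).
    # The mean over the last axis is the extra per-head reduction:
    # it has to finish before any element of that head can be scaled.
    mean_sq = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(mean_sq + eps) * weight

n_heads, head_dim = 4, 8  # toy sizes
rng = np.random.default_rng(0)

# Deliberately large activations, to mimic activation growth.
q = rng.standard_normal((n_heads, head_dim)) * 50.0
k = rng.standard_normal((n_heads, head_dim)) * 50.0
gain = np.ones(head_dim)  # learned scale in a real model

q_n = qk_rmsnorm(q, gain)
k_n = qk_rmsnorm(k, gain)

# With unit gain each head has RMS ~1, so |q . k| <= head_dim.
logits = q_n @ k_n.T
```

The arithmetic is tiny. The cost being complained about is the dependency: the scaling can't start until the reduction over the head is done, which is awkward to fuse into the projection that produces q and k.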

@vikhyatk it’s not even a full reduction, just per head, how is this so slow? And why does low bs make the overhead bigger? Shouldn’t it be the other way around? At bs 1, decoding is tons of matrix-vector products, which are so slow that rms norm should be like 0 overhead?

@vikhyatk why?

@vikhyatk anything that does entire row reduction is slow on low batch

@vikhyatk What's the training time cost vs hard clipping? In RL this seems essential, but curious what the overhead is.

@vikhyatk wait, does this impact training speed as much as costs?

@vikhyatk Yeah, the latency gets even worse with continuous decode on a single stream—30% adds up fast across every token. Learned that the hard way tuning for real-time local inference.

@vikhyatk Why clip? Another stupid parameter we need to tune.

@vikhyatk Fuck qk norm

@liuliu @vikhyatk Hello, author. Why is there no LTX2.3 scail2 added to the software?

@vikhyatk this energy is exactly what keeps protocols growing tbh