The official GLM-5.2 NVFP4 from NVIDIA is now available. Curious how it compares to other quantizations. https://huggingface.co/nvidia/GLM-5.2-NVFP4
NVIDIA releases an NVFP4-quantized checkpoint of the 744B GLM-5.2 MoE model with day-zero SGLang support
The model has 40 billion active parameters and targets reasoning and coding.
Supporters call NVIDIA's GLM-5.2 NVFP4 checkpoint amazing, even AGI-like, and praise its lower compute costs; critics say the quantization hurts multilingual performance and loses precision on activation outliers.

@ZixuanLi_ Are GLM models currently overloaded? I've been getting 429s since this morning.

We run the FP8 version on an 8×B200 box for our autonomous Lean/proof fleet, so we tested whether to switch.
The cutover was flawless: live in ~6 min, correct FP4 kernel, +57% KV cache, ~300GB HBM freed, quality within noise of FP8 on NVIDIA's own evals. On paper, a clear upgrade.
We rolled it back. Our fleet runs ~16 concurrent with 0 queue: it's decode-bound, not memory-bound. NVFP4 has no MTP speculative decoding, so single-stream fell ~24%. The freed memory bought nothing we use. Lesson: profile your real bottleneck, not the spec sheet.
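
The trade-off described above can be sketched with a toy capacity model. This is only an illustration, not the poster's measurement method: the 64-slot FP8 baseline is a made-up number, and the model assumes every running stream decodes at a fixed relative speed regardless of batch size. Only the +57% KV cache and ~24% single-stream figures come from the post.

```python
def aggregate_throughput(offered_streams, kv_slots, per_stream_speed):
    """Relative aggregate decode throughput under a toy model.

    Assumes at most `kv_slots` streams fit in the KV cache at once and
    each running stream decodes at a fixed relative speed.
    """
    running = min(offered_streams, kv_slots)
    return running * per_stream_speed

# Hypothetical baseline: FP8 fits 64 concurrent streams at relative speed 1.0.
FP8_SLOTS, FP8_SPEED = 64, 1.0
# From the post: +57% KV cache, ~24% slower single-stream decode.
NVFP4_SLOTS, NVFP4_SPEED = int(FP8_SLOTS * 1.57), 0.76

def nvfp4_vs_fp8(offered_streams):
    """Ratio of NVFP4 to FP8 aggregate throughput at a given offered load."""
    fp8 = aggregate_throughput(offered_streams, FP8_SLOTS, FP8_SPEED)
    nvfp4 = aggregate_throughput(offered_streams, NVFP4_SLOTS, NVFP4_SPEED)
    return nvfp4 / fp8

light_load = nvfp4_vs_fp8(16)    # fleet-like load, well under either capacity
heavy_load = nvfp4_vs_fp8(200)   # load that saturates both configurations
print(light_load, heavy_load)
```

Under these assumptions the extra KV cache only pays off once the offered load exceeds what FP8 can hold; at 16 streams with no queue, the ratio is just the single-stream slowdown.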

@abhip05 We're seeing increased load and are working on scaling up to reduce 429 errors. Thanks for your patience.
🎉 NVIDIA just released an NVFP4 checkpoint of GLM-5.2 from @Zai_org, a 744B MoE (40B active) for reasoning & coding. Day-0 support is live in SGLang! 🤝 @nvidia
> NVFP4 quantization via NVIDIA Model Optimizer: frontier-class reasoning at a fraction of the memory
> Sparse attention with IndexShare indexer for efficient long-context
> Ready to serve on Blackwell / Grace Blackwell, run it now with SGLang!

Cookbook: https://docs.sglang.io/cookbook/autoregressive/GLM/GLM-5.2

@ZixuanLi_ glm 5.2 in ultracode mode is agi
https://huggingface.co/nvidia/GLM-5.2-NVFP4
Good news. We cooked!
@NVIDIAAI GLM 5.2 NVFP4 is out for anyone who's been waiting on a quality quant.
Size ~465GB.
Link below.
Blackwell go brrr
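
As a rough sanity check on the ~465GB figure, here is a back-of-the-envelope estimate. It assumes the usual NVFP4 layout (4-bit elements plus one 8-bit scale per 16-element block) and, unrealistically, that every one of the 744B parameters is stored that way; neither assumption is confirmed for this checkpoint.

```python
PARAMS = 744e9      # total parameters, from the announcement
FP4_BITS = 4        # one 4-bit element per weight
SCALE_BITS = 8      # assumed: one 8-bit scale per block
BLOCK = 16          # assumed: 16 weights share a scale

# Effective storage cost per weight, including the amortized block scale.
bits_per_weight = FP4_BITS + SCALE_BITS / BLOCK

# Size in decimal gigabytes if every weight were stored in this format.
all_fp4_gb = PARAMS * bits_per_weight / 8 / 1e9

# For comparison: a plain 8-bit-per-weight estimate, ignoring scales.
all_8bit_gb = PARAMS * 8 / 8 / 1e9

print(bits_per_weight, all_fp4_gb, all_8bit_gb)
```

The all-FP4 estimate lands somewhat below the quoted size, which would be consistent with part of the model being kept in higher precision, though the exact split is not stated here.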

@ZixuanLi_ 💚 amazing model

@ZixuanLi_ FP4 keeps extreme values that INT4 can't, but it costs precision. GLM-5.2 activations have outliers. Test on your data, not the paper.
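
A toy comparison of the two 4-bit grids makes the range-versus-precision point concrete. This is a sketch only: it uses a single per-tensor scale and round-to-nearest on a made-up vector, which is not how NVFP4 or any production INT4 scheme is actually applied, and `quantize` is a hypothetical helper.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Magnitudes of a symmetric 4-bit integer grid.
INT4_SYM = [float(i) for i in range(8)]

def quantize(values, grid):
    """Scale so the largest |value| hits the grid maximum, then round to nearest."""
    scale = max(abs(v) for v in values) / grid[-1]
    out = []
    for v in values:
        mag = min(grid, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out

# Made-up activations: two small entries, one large entry, one outlier.
values = [0.3, -0.4, 4.8, 6.0]
fp4 = quantize(values, FP4_E2M1)
int4 = quantize(values, INT4_SYM)
print("fp4: ", fp4)
print("int4:", int4)
```

With the outlier pinning the scale, the small entries survive on the FP4 grid but collapse to zero on the INT4 grid, while 4.8 is represented more coarsely by FP4 than by INT4. Which effect dominates depends on the real activation distribution, hence the advice to test on your own data.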

@ZixuanLi_ Their quantisations usually really hurt multilingual performance in my experience. You could do it better :)

@ZixuanLi_ Is this PTQ or QAT?

@ZixuanLi_ Thank you for the quick response 🙌

@lmsysorg @Zai_org @nvidia I’ll miss Lukealonso image 🥺

@ZixuanLi_ Testing it right now... attention is handled better than in others. Some outputs are weird though, need to look into why.

@ZixuanLi_ nvfp4 doesn't just compress models, it rewrites the physics of compute costs

@ZixuanLi_ Contributing Chinese solutions and Chinese wisdom to the world.

@TheAIShrink @ZixuanLi_ Nvidia's NVFP4 keeps all attention + shared-experts in high precision.