Tech · 9h ago

NVIDIA releases an NVFP4-quantized checkpoint of the 744B GLM-5.2 MoE model with day-zero SGLang support

The mixture-of-experts model activates 40 billion of its 744 billion parameters and targets reasoning and coding.


Original post

Zixuan Li (@ZixuanLi_)

The official GLM-5.2 NVFP4 from NVIDIA is now available. Curious how it compares to other quantizations. https://huggingface.co/nvidia/GLM-5.2-NVFP4

11:17 PM · Jun 25, 2026 · 40K Views

Sentiment

Positive users hail NVIDIA's GLM-5.2 NVFP4 quantized model as amazing and AGI-like for slashing compute costs, while negative users criticize it for hurting multilingual performance and precision due to outliers.

Positive: 33.4%

Negative: 66.6%

7 comments with sentiment.


Related links

nvidia/GLM-5.2-NVFP4 · Hugging Face


Posts from X


Abhi Prajapati (@abhip05)

@ZixuanLi_ Are the GLM models currently overloaded? I've been getting 429 errors since this morning.

13h

Jonathan Washburn (@JonWashburn)

We run the FP8 version on an 8×B200 box for our autonomous Lean/proof fleet, so we tested whether to switch.

The cutover was flawless: live in ~6 min, correct FP4 kernel, +57% KV cache, ~300GB HBM freed, quality within noise of FP8 on NVIDIA's own evals. On paper, a clear upgrade.

We rolled it back. Our fleet runs ~16 concurrent with 0 queue: it's decode-bound, not memory-bound. NVFP4 has no MTP speculative decoding, so single-stream fell ~24%. The freed memory bought nothing we use. Lesson: profile your real bottleneck, not the spec sheet.

7h

Zixuan Li (@ZixuanLi_)

@abhip05 We're seeing increased load and are working on scaling up to reduce 429 errors. Thanks for your patience.

13h

LMSYS Org (@lmsysorg)

🎉 NVIDIA just released an NVFP4 checkpoint of GLM-5.2 from @Zai_org, a 744B MoE (40B active) for reasoning & coding. Day-0 support is live in SGLang! 🤝 @nvidia

> NVFP4 quantization via NVIDIA Model Optimizer: frontier-class reasoning at a fraction of the memory
> Sparse attention with IndexShare indexer for efficient long-context
> Ready to serve on Blackwell / Grace Blackwell, run it now with SGLang!
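
For a rough sense of what "a fraction of the memory" could mean at this scale, the following Python sketch estimates weight storage for a 744B-parameter model. It is a back-of-the-envelope calculation, not a measurement. It assumes that every weight is quantized (real checkpoints often keep some layers at higher precision) and that NVFP4 stores 4-bit values plus one 8-bit scale per 16-weight block, which works out to about 4.5 bits per weight. The names `weight_gb` and `NVFP4_BITS` are made up for illustration.

```python
# Back-of-the-envelope weight-memory estimate; not a measurement.
TOTAL_PARAMS = 744e9  # total parameter count quoted in the posts


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Return approximate weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


# FP8: 8 bits per weight (any scale overhead is ignored here).
fp8_gb = weight_gb(TOTAL_PARAMS, 8.0)

# NVFP4 (assumed layout): 4-bit values plus one 8-bit scale per 16 weights.
NVFP4_BITS = 4 + 8 / 16  # 4.5 effective bits per weight
nvfp4_gb = weight_gb(TOTAL_PARAMS, NVFP4_BITS)

saved_gb = fp8_gb - nvfp4_gb
print(f"FP8: {fp8_gb:.1f} GB, NVFP4: {nvfp4_gb:.1f} GB, saved: {saved_gb:.1f} GB")
```

Under these assumptions the estimate is about 744 GB for FP8 and about 418.5 GB for NVFP4, a difference of roughly 325 GB. That is in the same range as the ~300 GB of freed HBM reported in Jonathan Washburn's post, though the actual figure depends on which layers the checkpoint leaves unquantized.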