
NVIDIA releases an optimized 753B GLM-5.2 MoE model quantized to NVFP4 precision for Blackwell GPUs

The model retains a 1-million-token context window


Original post

DailyPapers@HuggingPapers

NVIDIA just released an optimized GLM-5.2 on Hugging Face

A 753B parameter MoE with 1M context, quantized to NVFP4 for Blackwell GPUs— nearly matching FP8 accuracy.

4:35 PM · Jun 25, 2026 · 86.8K Views

Sentiment

Users praise NVIDIA's optimized 753B GLM-5.2 MoE release on Hugging Face for its NVFP4 quantization delivering near-FP8 accuracy on Blackwell hardware and the rapid pace of open-model advances.

Positive: 100.0%

Negative: 0.0%

8 comments with sentiment.


Posts from X

Most Activity

Views: 4.8K · Bookmarks: 26 · Likes: 35

DailyPapers@HuggingPapers

Deploy with SGLang or vLLM right away.

https://huggingface.co/nvidia/GLM-5.2-NVFP4

10h · 4.8K views · 35 likes · 26 bookmarks

Retweets: 39

DailyPapers@HuggingPapers (original post, shown above)

10h · 86.8K views

Replies: 2

Gary Ingle 🇿🇦 🖥️🛞@GaryIngle77

@HuggingPapers @grok how many sparks do I need to run this

6h2381

Thomas O'Duffy@ThomasODuffy

@HuggingPapers @_akhaliq At 456GB, it appears this will run on 4 DGX Sparks? If anyone tries this, I'd love to hear what kind of performance you see?

7h3.1K12

The coffee guy@thatcofffeeguy

@HuggingPapers Bucket list complete,

7h1.5K11

S68S@i_loder

@HuggingPapers How many DGX Sparks are required to run this one at a decent tokens/sec? 10?

7h1.9K2

Prince does AI@princedoesai

@HuggingPapers 1M context on Blackwell is kinda wild

10h2.6K4

Bryan McNamara@BryanMcNamaraUS

@HuggingPapers Need a 200GB REAP!!

7h2.1K3

Alex@alexinbinary

@HuggingPapers NVFP4 goated

7h1.5K2

Babibobulo@babibobulo

@ThomasODuffy @HuggingPapers @_akhaliq Can't be over 40 tok/sec because of the bandwidth. Realistically won't beat 20 tok/sec.

7h4342
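The bandwidth ceiling mentioned in this reply can be sketched with simple arithmetic: during decoding, each generated token requires reading at least the active weights once, so tokens per second cannot exceed memory bandwidth divided by active-weight bytes. The sketch below is a rough estimate, not a benchmark. It takes the 456GB and 753B figures from the thread, and assumes roughly 40B active parameters per token (a figure from Grok's reply, not verified here), about 273 GB/s of memory bandwidth per DGX Spark (an assumption, not stated in the thread), and perfect scaling across four devices, which real interconnects will not deliver.

```python
def decode_tokens_per_sec_upper_bound(active_params, bytes_per_param,
                                      bandwidth_gb_s, num_devices=1):
    """Optimistic ceiling on decode speed for a memory-bound model.

    Assumes every active weight is read exactly once per token and that
    bandwidth adds up perfectly across devices (no interconnect cost,
    no KV-cache reads). Real throughput will be lower.
    """
    bytes_per_token = active_params * bytes_per_param
    aggregate_bytes_per_sec = bandwidth_gb_s * 1e9 * num_devices
    return aggregate_bytes_per_sec / bytes_per_token


# Figures quoted in the thread: 456 GB checkpoint, 753B total parameters.
BYTES_PER_PARAM = 456e9 / 753e9   # about 0.6 bytes, i.e. 4-bit values plus scales
ACTIVE_PARAMS = 40e9              # from Grok's reply; unverified
SPARK_BANDWIDTH_GB_S = 273.0      # ASSUMPTION: per-device memory bandwidth

four_sparks = decode_tokens_per_sec_upper_bound(
    ACTIVE_PARAMS, BYTES_PER_PARAM, SPARK_BANDWIDTH_GB_S, num_devices=4)
one_spark = decode_tokens_per_sec_upper_bound(
    ACTIVE_PARAMS, BYTES_PER_PARAM, SPARK_BANDWIDTH_GB_S, num_devices=1)

print(f"4 devices: {four_sparks:.1f} tok/sec ceiling")
print(f"1 device:  {one_spark:.1f} tok/sec ceiling")
```

Under these assumptions the four-device ceiling comes out around 45 tok/sec and the single-device figure around 11 tok/sec, which is in the same range as the limit quoted above. Interconnect overhead and KV-cache traffic would push real numbers lower.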

Grok@grok

@GaryIngle77 To run this 753B MoE (40B active) NVFP4 GLM-5.2? NVIDIA's SGLang example uses tensor-parallel-size 8 on Blackwell GPUs.

Minimum serious setup: 8x B200/B300-class GPUs for smooth 1M-context inference.

Sparks? Enough to light a whole rack (or small data center wing) ⚡ Check the HF card for the exact launch command. What hardware you working with?

6h36

E🐱@cat5tie

@HuggingPapers NVIDIA's strategy is paying off

8h2K

zila@Zila_Lava

@HuggingPapers Feels almost on par with FP8. NVFP4 is awesome.

4h2731

Spencer Schneidenbach 🦈🇺🇸@schneidenbach

@grok @GaryIngle77 @HuggingPapers @grok he meant DGX Sparks

5h7

The AI Therapist ⚡@TheAIShrink

@HuggingPapers 753B on NVFP4 quantization optimized for Blackwell. NVIDIA released open source to move hardware. That's the play

9h455

PawnH@p0nd7t

@ThomasODuffy @HuggingPapers @_akhaliq 4x DGX Spark is pretty tight. Just loading the weights already feels squeezed at 456GB in NVFP4, barely leaving any room for KV cache.

6h313
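The squeeze described here is straightforward to check. A minimal sketch, assuming 128GB of unified memory per DGX Spark (the figure quoted elsewhere in the thread) and treating the 456GB checkpoint size as the full weight footprint; `reserved_gb_per_device` is a hypothetical allowance for the OS and runtime, not a measured value.

```python
def kv_headroom_gb(weights_gb, device_memory_gb, num_devices,
                   reserved_gb_per_device=0.0):
    """Memory left for KV cache and activations after loading the weights.

    A negative result means the weights alone do not fit.
    reserved_gb_per_device is a hypothetical per-device allowance for
    the OS, runtime, and framework overhead.
    """
    usable_gb = (device_memory_gb - reserved_gb_per_device) * num_devices
    return usable_gb - weights_gb


# Figures from the thread: 456 GB of weights, 128 GB per DGX Spark.
best_case = kv_headroom_gb(456, 128, num_devices=4)
with_overhead = kv_headroom_gb(456, 128, num_devices=4,
                               reserved_gb_per_device=8)  # 8 GB is illustrative

print(f"Headroom with nothing reserved: {best_case:.0f} GB")
print(f"Headroom with 8 GB reserved per device: {with_overhead:.0f} GB")
```

With nothing reserved, four devices leave 56GB in total for KV cache and activations; reserving an illustrative 8GB per device cuts that to 24GB. Three devices (384GB) would not hold the weights at all, which is consistent with four being the minimum discussed in the replies.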

Jim McPherson@JamesMcPherson

@i_loder @HuggingPapers 4

5h801

Bodhi_Romeo@BurgerShawn

@GaryIngle77 @HuggingPapers @grok 100k to run this locally..... if i win the lottery there will be signs.

6h291

AI Mastery Guide@aiseomastery

@HuggingPapers Getting that close to FP8 accuracy after quantizing down for Blackwell is a pretty strong result.

4h30

Grok@grok

Ah, DGX Sparks! The desktop Blackwell beasts (128GB unified, 1 PFLOP FP4 each).

That 753B MoE (NVFP4) fits nicely in 4x DGX Sparks with near-full quality, per early reports. Single Spark handles ~200B models; two get you to ~400B. Tensor-parallel across the cluster for the big one.

Previous “sparks” joke was electrical— this is way cooler. How many you planning to rack up? ⚡

5h7