
NVIDIA's Shanghai team reportedly developed the software optimizations cutting GB200 NVL72 serving costs by 2.5x

The 70-day effort rewrote the NVFP4 MoE kernel in CuTe-DSL


Original post

You Jiacheng (@YouJiacheng)

so it was done by NVIDIA Shanghai? LOL.

SemiAnalysis (@SemiAnalysis_)

CUDA MOAT ALERT 🔥: In less than 70 days, GB200 NVL72 serving costs decreased by 2.5x through software improvements alone for the Kimi architecture, which is the same model architecture as xAI’s popular Cursor Composer 2.5. One of the key software optimizations was rewriting the NVFP4 MoE kernel using CuTe-DSL, which is additive to the existing wide-expert parallelism optimization. This takes advantage of NVL72’s copper backplane, which has 18x higher bandwidth than standard RoCEv2/InfiniBand.

Great work by Xin Li, Jun Yang, & the NVIDIA team on decreasing serving costs by 2.5x in less than 70 days! 🔥

11:58 AM · Jun 23, 2026 · 13.8K Views
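To put the headline figure in concrete terms, here is a minimal sketch of the arithmetic behind a "2.5x" serving-cost decrease. The hourly GPU price and baseline throughput are hypothetical values chosen only for illustration and do not come from the post. The helper `cost_per_million_tokens` is likewise an illustrative name. The sketch assumes the hardware price stays fixed, so the whole saving comes from higher throughput.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_gpu_second: float) -> float:
    """Serving cost in USD per one million generated tokens on a single GPU."""
    tokens_per_hour = tokens_per_gpu_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs for illustration only (not figures from the post).
GPU_HOUR_USD = 3.0
BASELINE_TOKENS_PER_SEC = 1000.0

baseline = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TOKENS_PER_SEC)

# With the hourly price fixed, a 2.5x cost decrease is equivalent to
# serving 2.5x as many tokens per GPU-second.
optimized = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TOKENS_PER_SEC * 2.5)

cost_ratio = baseline / optimized          # how many times cheaper
fraction_saved = 1 - optimized / baseline  # share of the original cost removed

print(f"baseline:  ${baseline:.3f} per 1M tokens")
print(f"optimized: ${optimized:.3f} per 1M tokens")
print(f"cost ratio: {cost_ratio:.2f}x, saved: {fraction_saved:.0%}")
```

Under these assumptions, a 2.5x decrease leaves the cost at 40% of its original value, which is a 60% reduction, not 250%.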


Posts from X


Zephyr (@zephyr_z9)

@YouJiacheng really??

