Elon Musk claims SpaceX's custom C-based AI training framework could be more than 10x faster than JAX on 220,000 GB300 GPUs
Critics questioned the JAX benchmark and choice of C.
@elonmusk Elon, trust me, this makes no sense. Someone is strongly overclaiming or overpromising to you.
SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible. The potential speed improvement vs JAX for large training runs is over an order of magnitude.
j...jax????
@MParakhin I've become extreme in my position: any primitive that isn't directly provided by the hardware provider is hamstringing you too much
The fact that JAX was even mentioned makes me think of two things: 1) xAI still needs better ML people 2) PyTorch is stagnating and probably is not going to recover :-( It is such an unparalleled achievement, but lost key people, exiled to FAIR now...
@MParakhin as soon as you want to do anything out of the fold pytorch will hold you back. it's more difficult but the pressure on talented SWE time has been released enough by LLMs for it to be a good tradeoff to make. Just do it in cuda
Could gain another order of magnitude if they were to switch to decision trees instead of neural nets. Just sayin'.
@yacineMTB What I said :-)
@elonmusk Well that's good
PufferLib trains small task-specific reinforcement learning models at up to 20M steps/second in 5,000 lines of CUDA C on a single GPU. Stop using awful DSLs just because they are pretending to be Python!
mythos is so good it can rewrite jax in c
@eliebakouch selling half their compute in exchange for claude access was so worth it
"/goal rewrite jax in rust"