For the nerds out there:
The way this model was trained is pretty cool, and it's probably the first time this has been done for a video generation model.
Normally, if you want to train a model like this, you rent a huge cluster of interconnected GPUs and train it there.
But this model was trained differently:
The team that built this model trained different "experts" on separate GPU clusters, each with its own data. The GPUs were all different, and the clusters didn't need to communicate with each other during training.
So instead of allocating 100 interconnected GPUs to train the model, they used 3 GPUs here, 5 there, 4 more over there, etc.
After training, they added a smart router on top of all the trained experts. The router's job is to take an inference request and route it to the appropriate experts.
In other words, instead of training a single model, they trained multiple smaller models that work together at inference time.
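If it helps to see the idea in code, here is a toy Python sketch. To be clear, this is not the paper's method: the `Expert` class, the `route` and `infer` functions, and the centroid-distance scoring are all made up for illustration. It only shows the general shape: experts trained on their own data with no communication, then a router that picks which experts answer a request.

```python
import math


class Expert:
    """Hypothetical stand-in for one independently trained model."""

    def __init__(self, name):
        self.name = name
        self.centroid = 0.0  # summary of the inputs this expert was trained on
        self.mean_y = 0.0    # the (trivial) thing this expert "learned"

    def train(self, shard):
        # Each expert only sees its own shard; nothing is shared with other experts.
        xs = [x for x, _ in shard]
        ys = [y for _, y in shard]
        self.centroid = sum(xs) / len(xs)
        self.mean_y = sum(ys) / len(ys)

    def predict(self, x):
        return self.mean_y


def route(experts, x, top_k=2):
    """Assumed routing rule: pick the top_k experts whose data is closest to x."""
    scores = [-abs(x - e.centroid) for e in experts]
    ranked = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the selected scores gives the mixing weights.
    best = max(scores[i] for i in ranked)
    exps = [math.exp(scores[i] - best) for i in ranked]
    total = sum(exps)
    return [(experts[i], w / total) for i, w in zip(ranked, exps)]


def infer(experts, x, top_k=2):
    """Combine the selected experts' outputs using the router's weights."""
    return sum(w * e.predict(x) for e, w in route(experts, x, top_k))


# Three "clusters", each with its own data (made-up numbers).
shards = {
    "a": [(0.0, 1.0), (1.0, 1.0)],
    "b": [(10.0, 5.0), (11.0, 5.0)],
    "c": [(20.0, 9.0), (21.0, 9.0)],
}
experts = []
for name, shard in shards.items():
    expert = Expert(name)
    expert.train(shard)  # could run on a different machine, at a different time
    experts.append(expert)
```

Calling `infer(experts, 0.5, top_k=1)` sends the request only to expert `a`, since its training data sits closest to the input. A real router would be a learned model rather than a distance check, but the division of labor is the same.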
And the results were really good!
Here is the link to their paper: https://arxiv.org/pdf/2605.26064. It's a very good read.