wh (@nrehiew_)
Lastly, a section on inference. Arch-wise, it's already pretty inference-friendly, with LatentMoE allowing for more routed experts, hybrid Mamba2, MTP, etc.
First, Ultra has lower sparsity, which means more FLOPs at prefill, where it trails Qwen because it has ~2x more activated params. But it dominates at decode, mainly due to the SSM layers.
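A back-of-envelope sketch of why the two phases behave differently. It assumes the usual rule of thumb of about 2 FLOPs per active parameter per token, an attention KV cache that grows with context, and an SSM state of fixed size. The function names and every shape in it are hypothetical, not taken from either model:

```python
def prefill_flops(active_params: int, prompt_tokens: int) -> int:
    # Rule of thumb: ~2 FLOPs per active parameter per token (matmuls only).
    return 2 * active_params * prompt_tokens


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    # One K and one V tensor per attention layer; grows linearly with context.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem


def ssm_state_bytes(layers: int, state_elems_per_layer: int,
                    bytes_per_elem: int = 2) -> int:
    # Recurrent state has a fixed size, independent of context length.
    return layers * state_elems_per_layer * bytes_per_elem


if __name__ == "__main__":
    a = 10_000_000_000  # hypothetical active-parameter count
    print(prefill_flops(2 * a, 4096) / prefill_flops(a, 4096))  # ratio of prefill cost
    for ctx in (8_192, 131_072):
        print(ctx, kv_cache_bytes(8, 8, 128, ctx), ssm_state_bytes(40, 1_000_000))
```

Prefill is compute-bound, so doubling active params roughly doubles the cost. Decode is dominated by memory traffic per generated token, and there the fixed-size SSM state does not grow with context the way a KV cache does.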
Optimal MTP length is 6, for almost 3x higher throughput. For the Mamba state, they snapshot at each step to facilitate rollbacks on rejection.
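A minimal sketch of the snapshot-and-rollback idea, using a toy scalar recurrence in place of a real Mamba layer. `step`, `verify_with_snapshots`, and the `accept` callback are hypothetical names for illustration, not the actual implementation:

```python
from typing import Callable, List, Tuple

State = Tuple[float, ...]


def step(state: State, token: int) -> State:
    # Toy recurrent update standing in for an SSM layer: h <- 0.5 * h + token.
    return tuple(0.5 * h + token for h in state)


def verify_with_snapshots(
    state: State,
    draft_tokens: List[int],
    accept: Callable[[int, int], bool],
) -> Tuple[State, int]:
    """Advance the state over all drafted tokens, saving a snapshot per step,
    then return the snapshot that matches the accepted prefix."""
    snapshots = [state]  # snapshots[i] is the state after i drafted tokens
    for tok in draft_tokens:
        state = step(state, tok)
        snapshots.append(state)

    n_accepted = 0
    for i, tok in enumerate(draft_tokens):
        if not accept(i, tok):  # first rejection ends the accepted prefix
            break
        n_accepted += 1

    # Roll back: discard everything computed past the last accepted token.
    return snapshots[n_accepted], n_accepted


if __name__ == "__main__":
    draft = [1, 2, 3, 4, 5, 6]  # draft length 6, as in the post
    final_state, n = verify_with_snapshots((0.0,), draft, lambda i, t: i < 3)
    print(final_state, n)
```

The snapshots are needed because a recurrent state is overwritten in place at every step. With a KV cache, rejected tokens can simply be truncated off the end; an SSM state has no such per-token entries to drop.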