So, we spent some time banging on the details for sparse MoEs, and here is my current understanding. Putting it out here in case anyone is interested, and to check whether it seems right.
I am actually kind of super CODA-pilled right now: https://arxiv.org/abs/2605.19269 ... still need to work out the details for sparse MoEs though
