local models from Apple Siri AI are a 3B dense model and a 20B-total MoE with ~1-4B active params (adaptive compute!)
for the MoE they use an early routing decision, made once per prompt, to pick which experts are activated across the full model depth. it seems the early routing also decides how many active parameters the model allocates, but there's no information on whether it does that via layer selection or expert selection
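to make the "route once per prompt" idea concrete, here's a minimal sketch. everything in it is an assumption for illustration, not Apple's method: the names `route_once` and `run_layers`, the pooled prompt embedding, and especially the budget rule (keep adding experts until a probability mass is covered, clamped to a min/max). it also assumes the expert-selection variant, which the source does not confirm.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_once(prompt_embedding, router_rows, mass=0.6, min_experts=1, max_experts=4):
    """Hypothetical prompt-level router: returns one expert set for the whole prompt."""
    # one score per expert from the pooled prompt embedding
    scores = [sum(w * x for w, x in zip(row, prompt_embedding)) for row in router_rows]
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, covered = [], 0.0
    for i in ranked:
        if len(chosen) >= max_experts:
            break
        # assumed budget rule: stop once enough probability mass is covered,
        # so peaked router outputs activate fewer experts (fewer active params)
        if len(chosen) >= min_experts and covered >= mass:
            break
        chosen.append(i)
        covered += probs[i]
    return sorted(chosen)

def run_layers(hidden, layers, chosen):
    """Reuse the same expert ids at every layer: no per-token, per-layer routing."""
    for experts in layers:
        outs = [experts[e](hidden) for e in chosen]
        hidden = hidden + sum(outs) / len(outs)
    return hidden

# toy setup: 4 experts, 2-dim prompt embedding, 3 layers of scalar "experts"
router_rows = [[1, 0], [0, 1], [-1, 0], [0, -1]]
layers = [[(lambda h, s=s: s * h) for s in (0.1, 0.2, 0.3, 0.4)] for _ in range(3)]

peaked = route_once([5.0, 0.0], router_rows)   # router is confident: 1 expert
flat = route_once([0.1, 0.1], router_rows)     # router is unsure: 3 experts
print(peaked, flat)
print(run_layers(1.0, layers, peaked))
```

the point of the sketch is only the shape of the control flow: the routing call happens once, before the layer loop, and its output fixes both which experts run and how many.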
for the AFM Server they train/serve it with PT (parallel track) parallelism rather than EP (expert parallelism). you can see PT as an extension of TP (tensor parallelism), but instead of syncing multiple times per layer you sync once every few layers (so, unlike TP, it's no longer an exact re-sharding of the same computation)
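a toy sketch of why syncing less often changes the function being computed. this is not the AFM Server implementation: `tp_style` and `pt_style` are made-up names, the "layers" are scalar multiply + ReLU, and the TP toy syncs once per layer for simplicity, where real TP syncs more often than that. the only thing it illustrates is that summing shards before the nonlinearity is not the same as letting each track run several layers alone and merging afterwards.

```python
def relu(x):
    return max(0.0, x)

def tp_style(x, shards, num_layers):
    """Sum the partial outputs before every activation: same result as the unsplit layer."""
    syncs = 0
    for _ in range(num_layers):
        partials = [w * x for w in shards]  # each device holds one weight shard
        x = relu(sum(partials))             # all-reduce, then activation
        syncs += 1
    return x, syncs

def pt_style(x, shards, num_layers, layers_per_block):
    """Each track runs a block of layers on its own; tracks are merged once per block."""
    syncs = 0
    for start in range(0, num_layers, layers_per_block):
        depth = min(layers_per_block, num_layers - start)
        track_outputs = []
        for w in shards:
            h = x
            for _ in range(depth):
                h = relu(w * h)             # no communication inside the block
            track_outputs.append(h)
        x = sum(track_outputs)              # single sync at the block boundary
        syncs += 1
    return x, syncs

shards = [1.5, -0.5]
print(tp_style(1.0, shards, num_layers=4))                       # (1.0, 4)
print(pt_style(1.0, shards, num_layers=4, layers_per_block=2))   # (5.0625, 2)
```

half the sync points, different output. in practice the model is trained with the track structure, so it's better read as a different architecture than as an approximation of the TP one.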
this is overall quite similar to the previous generation, so i'm wondering what the gemini licensing was about, probably data?



