
Apple Deploys 3B Dense And 20B MoE Models For On-Device Siri AI

Original post
elie@eliebakouch

the local models behind Apple's Siri AI are a 3B dense model and a 20B-total MoE with ~1-4B active parameters (adaptive compute!)

for the MoE they use an early routing decision, made once per prompt, to pick which experts are activated for the full model depth. it seems the early routing also decides how many active parameters the model allocates, but there is no information on whether it does layer selection or expert selection
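A minimal sketch of what "route once per prompt" could look like, assuming expert selection (the post says it is unknown whether Apple does layer or expert selection). Every name here (`route_once`, `forward`, `mass`, `min_k`, `max_k`) is hypothetical, and the cumulative-probability rule for choosing the number of experts is an illustrative assumption, not Apple's method.

```python
import math

def mean_pool(token_vecs):
    """Average token vectors into a single prompt-level vector."""
    dim = len(token_vecs[0])
    return [sum(v[i] for v in token_vecs) / len(token_vecs) for i in range(dim)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_once(token_vecs, router_weights, min_k=1, max_k=4, mass=0.6):
    """Pick one expert set for the whole prompt.

    Assumed rule (hypothetical): add the highest-scoring experts until their
    cumulative router probability reaches `mass`, clamped to [min_k, max_k].
    The number of experts, and so the active parameter count, varies by prompt.
    """
    pooled = mean_pool(token_vecs)
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in router_weights]
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
    chosen, covered = [], 0.0
    for e in ranked:
        if len(chosen) >= max_k:
            break
        if len(chosen) >= min_k and covered >= mass:
            break
        chosen.append(e)
        covered += probs[e]
    return sorted(chosen)

def forward(token_vecs, router_weights, num_layers=6):
    """Return the expert set used at each layer: the same set at every depth."""
    experts = route_once(token_vecs, router_weights)
    return [experts for _ in range(num_layers)]
```

Unlike a conventional MoE, which routes every token at every layer, this makes the decision once, so the chosen experts are known before the rest of the forward pass begins.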

for the AFM Server they train and serve it with PT (parallel track) parallelism rather than EP (expert parallelism). you can see PT as an extension of TP (tensor parallelism), but instead of syncing multiple times per layer you sync once every few layers (so this is no longer an exact computation)
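A toy sketch of the sync-frequency difference, under stated assumptions: Megatron-style TP with two all-reduces per layer in the forward pass, and PT with one sync per block of `block_depth` layers. The function names, the scaled-ReLU stand-in layer, and the averaging merge are all hypothetical, chosen only to show tracks running independently between syncs.

```python
def tp_sync_count(num_layers, syncs_per_layer=2):
    """Syncs in one forward pass when every layer synchronizes (assumed 2 per layer)."""
    return num_layers * syncs_per_layer

def pt_sync_count(num_layers, block_depth):
    """One sync at the end of each block of `block_depth` layers (ceiling division)."""
    return -(-num_layers // block_depth)

def toy_layer(x, scale):
    """Stand-in for a transformer layer: a scaled ReLU on each element."""
    return [max(0.0, scale * v) for v in x]

def pt_forward(x, track_scales, num_layers, block_depth):
    """Run each track independently for a block of layers, then merge the tracks."""
    syncs = 0
    for start in range(0, num_layers, block_depth):
        depth = min(block_depth, num_layers - start)
        outputs = []
        for scale in track_scales:  # one track per device, no communication inside
            h = x
            for _ in range(depth):
                h = toy_layer(h, scale)
            outputs.append(h)
        # The only cross-track communication: average the outputs (assumed merge rule).
        x = [sum(vals) / len(vals) for vals in zip(*outputs)]
        syncs += 1
    return x, syncs
```

With 48 layers and a block depth of 4, this counts 96 syncs for TP and 12 for PT. Because each track applies its nonlinearities without seeing the others, the merged result is not what a single unsplit model would compute, which is the "not exact" point in the post.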

this is overall quite similar to the previous generation, so i'm wondering what the Gemini licensing was about, probably data?

1:53 AM · Jun 9, 2026 · 7.8K Views
elie@eliebakouch

https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models

Replies
jojo@jojopirker

@eliebakouch Do they train or just run inference on it?

As I understand it, inference runs on their Private Cloud Compute on NVIDIA hardware, and training on TPUs

elie@eliebakouch

@jojopirker they train the server one on NVIDIA tho

jojo@jojopirker

@eliebakouch Maybe not only data but also the hardware they trained on

Alex YGift@Radipdegen

@eliebakouch Siri running MoE before I even decide what I want. Peak Apple timing.

Strata@ChainZenit

@eliebakouch this level of optimization for local hardware is actually wild.

elie@eliebakouch

@jojopirker oh true, i think you are right
