
Apple Deploys 3B Dense And 20B MoE Models For On-Device Siri AI

Original post
elie@eliebakouch

the local models for Apple's Siri AI are a 3B dense model and an MoE with 20B total and ~1-4B active params (adaptive compute!)

for the MoE they use an early routing decision to pick, once per prompt, which experts are active across the full model depth. it seems the early routing also decides how many active parameters the model allocates, but there's no information on whether it's doing layer selection or expert selection
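To make the "route once per prompt" idea concrete, here is a minimal sketch in plain Python. It is not Apple's implementation: it assumes expert selection (the post says it is unknown whether layers or experts are selected), and the names `route_once` and `run_layers`, the `mass` threshold, and the `max_experts` cap are all hypothetical. The number of active experts varies per prompt because experts are added until their combined router probability reaches `mass`.

```python
import math

def route_once(prompt_summary, router_rows, mass=0.8, max_experts=4):
    """Choose the active expert set a single time, from a prompt-level vector.

    Hypothetical rule: keep adding the highest-probability experts until
    their cumulative probability reaches `mass`, capped at `max_experts`.
    """
    # One score per expert: dot product of the prompt summary with its router row.
    scores = [sum(w * x for w, x in zip(row, prompt_summary)) for row in router_rows]
    # Numerically stable softmax over the expert scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank experts from most to least probable (ties keep index order).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, covered = [], 0.0
    for i in ranked:
        chosen.append(i)
        covered += probs[i]
        if covered >= mass or len(chosen) == max_experts:
            break
    return chosen

def run_layers(x, chosen, layers):
    """Apply every layer with the SAME expert set; no per-layer re-routing."""
    for experts in layers:
        # Residual update: average the outputs of the pre-selected experts.
        x = x + sum(experts[i](x) for i in chosen) / len(chosen)
    return x
```

In this toy, a prompt whose router scores are sharply peaked activates a single expert, while a prompt with flat scores activates up to `max_experts`, which is one way the active parameter count could adapt per prompt.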

for the AFM Server they train/serve it with PT (parallel track) parallelism rather than EP (expert parallelism). you can see PT as an extension of TP (tensor parallelism), but instead of syncing multiple times per layer you sync once every few layers (so it's no longer an exact computation)
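A rough way to see what PT buys is to count synchronization points. The sketch below is only back-of-the-envelope arithmetic: the function names, the default of 2 syncs per layer for TP, and the block depth of 4 for PT are illustrative assumptions, not figures from the post or from Apple.

```python
import math

def tp_sync_count(num_layers, syncs_per_layer=2):
    """Tensor parallelism: assumed to sync a fixed number of times in every layer."""
    return num_layers * syncs_per_layer

def pt_sync_count(num_layers, block_depth=4):
    """Parallel track: assumed to sync once per block of `block_depth` layers."""
    return math.ceil(num_layers / block_depth)
```

Under these assumptions a hypothetical 48-layer model needs `tp_sync_count(48)` = 96 syncs per forward pass with TP versus `pt_sync_count(48, 4)` = 12 with PT. The trade-off is the one the post points out: tracks that run several layers without exchanging state no longer compute the same function as the unsplit model.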

this is overall quite similar to the previous generation, so i'm wondering what the Gemini licensing was about, probably data?

1:53 AM · Jun 9, 2026 · 4.4K Views
Posts from X
elie@eliebakouch

https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models

Replies
jojo@jojopirker

@eliebakouch Do they train or just run inference on it?

As I understand it, inference is done via their Private Cloud Compute on NVIDIA and training on TPUs

4h · 8 views
elie@eliebakouch

@jojopirker they train the server one on NVIDIA tho

4h · 20 views
jojo@jojopirker

@eliebakouch Maybe not only data but also the hardware they trained on

4h · 19 views
Alex YGift@Radipdegen

@eliebakouch Siri running MoE before I even decide what I want. Peak Apple timing.

5h · 24 views
Strata@ChainZenit

@eliebakouch this level of optimization for local hardware is actually wild.

5h · 18 views
elie@eliebakouch

@jojopirker oh true, i think you are right

4h · 6 views