
Apple Deploys 3B Dense And 20B MoE Models For On-Device Siri AI


Original post

elie (@eliebakouch) · #762 in Tech

the local models behind Apple's Siri AI are a 3B dense model and a 20B-total MoE with ~1-4B active parameters (adaptive compute!)

for the MoE they use an early routing decision: once per prompt, it picks which experts will be activated for the full model depth. it seems the early routing also decides how many active parameters the model allocates, but there's no information on whether it's doing layer selection or expert selection
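A minimal sketch of what "route once per prompt" could look like, assuming expert selection (the post says it is unclear whether the model does layer selection or expert selection). Everything here is hypothetical and for illustration only: the function name, the linear router, the sigmoid "budget head", and the 1-to-4 expert range are assumptions, not Apple's method.

```python
import math


def route_once_per_prompt(prompt_embedding, router_weights, budget_weights,
                          min_experts=1, max_experts=4):
    """Pick one expert set from a pooled prompt embedding (hypothetical sketch).

    The returned indices are meant to be reused by every MoE layer, so routing
    happens once per prompt rather than once per token per layer.
    """
    # One linear score per expert (assumed router form).
    scores = [sum(w * x for w, x in zip(row, prompt_embedding))
              for row in router_weights]

    # Assumed budget head: a sigmoid maps a linear score to a fraction,
    # which is then scaled to an expert count in [min_experts, max_experts].
    budget_logit = sum(w * x for w, x in zip(budget_weights, prompt_embedding))
    fraction = 1.0 / (1.0 + math.exp(-budget_logit))
    k = min_experts + round(fraction * (max_experts - min_experts))

    # Keep the k highest-scoring experts.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy inputs: 4 experts, 2-dimensional prompt embedding.
embedding = [1.0, 0.0]
router = [[0.1, 0.0], [0.9, 0.0], [0.5, 0.0], [-0.2, 0.0]]

easy_prompt = route_once_per_prompt(embedding, router, [-10.0, 0.0])
hard_prompt = route_once_per_prompt(embedding, router, [10.0, 0.0])
```

In this toy setup a low budget score activates a single expert and a high one activates all four, which is one way the number of active parameters could vary per prompt.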

for the AFM Server they train/serve it with PT (parallel track) parallelism and not EP (expert parallelism). you can see PT as an extension of TP (tensor parallelism), but instead of syncing multiple times per layer you sync once every few layers (so this is no longer an exact computation)
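A toy simulation of the "sync once every few layers" idea, assuming each track runs its own layers independently and a sync is a simple average across tracks. The function names, the averaging rule, and the scalar "layers" are illustrative assumptions and not the AFM Server implementation.

```python
def make_scale_layer(scale):
    """Hypothetical stand-in for a layer: multiplies every value by `scale`."""
    return lambda state: [scale * v for v in state]


def run_parallel_tracks(x, tracks, sync_every):
    """Run equally deep tracks side by side, averaging states every `sync_every` layers.

    Returns the final merged state and the number of sync points used.
    """
    depth = len(tracks[0])
    states = [list(x) for _ in tracks]
    syncs = 0
    for layer in range(depth):
        # Each track advances on its own, with no communication.
        states = [track[layer](state) for track, state in zip(tracks, states)]
        # Sync at block boundaries, and always after the last layer.
        if (layer + 1) % sync_every == 0 or layer == depth - 1:
            merged = [sum(vals) / len(vals) for vals in zip(*states)]
            states = [list(merged) for _ in tracks]
            syncs += 1
    return states[0], syncs


# Two tracks of four layers each.
tracks = [
    [make_scale_layer(2.0) for _ in range(4)],
    [make_scale_layer(1.0) for _ in range(4)],
]

per_layer_out, per_layer_syncs = run_parallel_tracks([1.0], tracks, sync_every=1)
blocked_out, blocked_syncs = run_parallel_tracks([1.0], tracks, sync_every=2)
```

Syncing every layer uses four sync points here and syncing every two layers uses two, and the two runs produce different outputs. That difference is the sense in which fewer syncs trade communication for a computation that no longer matches the per-layer version.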

this is overall quite similar to the previous generation, so i'm wondering what the gemini licensing was about, probably data?

1:53 AM · Jun 9, 2026 · 7.8K Views

Sentiment

Users express excitement about the impressive optimization of Apple's 3B dense and 20B MoE models for on-device Siri AI.

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)

Cluster Engagement (posts from X)

Views: 1.1K · Bookmarks: 1 · Likes: 10 · Retweets: 2

elie@eliebakouch

https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models

Replies

jojo@jojopirker

@eliebakouch Do they train or just run inference on it?

As I understand it, inference is done via their private cloud compute on NVIDIA and training on TPUs

1d8

elie@eliebakouch

@jojopirker they train the server one on nvidia tho

1d20

jojo@jojopirker

@eliebakouch Maybe not only data but also the hardware they trained on

1d19

Alex YGift@Radipdegen

@eliebakouch Siri running MoE before I even decide what I want. Peak Apple timing.

1d24

Strata@ChainZenit

@eliebakouch this level of optimization for local hardware is actually wild.

1d18

elie@eliebakouch

@jojopirker oh true, i think you are right

1d6