Apple Deploys 3B Dense And 20B MoE Models For On-Device Siri AI

Original post

elie@eliebakouch · #706 in AI

local models from Apple Siri AI are a 3B dense and a 20B-total MoE with ~1-4B active (adaptive compute!)

for the MoE they use an early routing decision to decide once per prompt which experts will be activated for the full model depth. it seems that the early routing also decides the number of active parameters the model is going to allocate, but there's no information on whether it's doing layer selection or expert selection
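
A rough sketch of what "route once per prompt" could look like, to make the idea concrete. Everything here is an assumption for illustration: the mean pooling, the linear router, the `budgets` tuple, and the confidence thresholds are hypothetical, and the sketch picks expert selection even though the post says it is unknown whether Apple selects experts or layers.

```python
import math

def route_once_per_prompt(prompt_vecs, router_weights, budgets=(1, 2, 4)):
    """Choose one expert set for the whole prompt (hypothetical scheme).

    prompt_vecs: list of token vectors for the prompt.
    router_weights: one weight row per expert.
    budgets: allowed numbers of active experts (assumed, not from the source).
    """
    dim = len(prompt_vecs[0])
    # Mean-pool the prompt tokens into a single summary vector.
    pooled = [sum(v[i] for v in prompt_vecs) / len(prompt_vecs) for i in range(dim)]
    # One router logit per expert, then a numerically stable softmax.
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in router_weights]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Assumed adaptive-compute rule: a confident router spends the smallest
    # budget, an unsure router spends the largest.
    top = max(probs)
    if top > 0.5:
        k = budgets[0]
    elif top > 0.25:
        k = budgets[1]
    else:
        k = budgets[2]
    k = min(k, len(probs))
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])

# The returned set is computed once and reused at every layer and every
# decoded token, instead of re-routing per token per layer.
confident = route_once_per_prompt([[1.0, 0.0]], [[5, 0], [0, 0], [0, 0], [0, 0]])
unsure = route_once_per_prompt([[1.0, 0.0]], [[0, 0], [0, 0], [0, 0], [0, 0]])
```

The practical appeal of a per-prompt decision, under these assumptions, is that only the chosen experts need to be resident in memory for the rest of the request.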

for the AFM Server they train/serve it with PT (parallel track) parallelism and not EP. you can see PT as an extension of TP but instead of syncing multiple times per layer you do it once every few layers (so this is not exact computation anymore)
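
To put numbers on "once every few layers", here is a small counting sketch. It assumes two synchronizations per layer for TP in the forward pass (the usual Megatron-style layout, one after attention and one after the MLP) and one synchronization per block for PT; the layer count and block size below are made-up values, not Apple's.

```python
def tp_syncs(num_layers, syncs_per_layer=2):
    """Forward-pass sync points under tensor parallelism (assumed 2 per layer)."""
    return num_layers * syncs_per_layer

def pt_syncs(num_layers, layers_per_block):
    """Forward-pass sync points when tracks only sync at block boundaries.

    A trailing partial block still needs one sync, hence the ceiling division.
    """
    return -(-num_layers // layers_per_block)

# Hypothetical 48-layer model with tracks synced every 4 layers.
layers, block = 48, 4
reduction = tp_syncs(layers) / pt_syncs(layers, block)
```

With these illustrative numbers the sync count drops from 96 to 12, an 8x reduction, which is the communication saving the post is pointing at. The price is that tracks run independently between sync points, so the result is a different computation from the fully synchronized one.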

this is overall quite similar to the previous generation, so i'm wondering what the Gemini licensing was about, probably data?

1:53 AM · Jun 9, 2026 · 4.4K Views

Sentiment

Positive users praise the impressive optimization of Apple's 3B dense and 20B MoE models for on-device Siri, while negative users sarcastically criticize the timing as premature.

Pos 50.0% · Neg 50.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 775 · Bookmarks 1 · Likes 9 · Retweets 2

elie@eliebakouch

https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models

Replies: 1

jojo@jojopirker

@eliebakouch Do they train or just run inference on it?

As I understand it, inference is done via their Private Cloud Compute on NVIDIA and training on TPUs

4h8

elie@eliebakouch

@jojopirker they train the server one on nvidia tho

4h20

jojo@jojopirker

@eliebakouch Maybe not only data but also the hardware they trained on

4h19

Alex YGift@Radipdegen

@eliebakouch Siri running MoE before I even decide what I want. Peak Apple timing.

5h24

Strata@ChainZenit

@eliebakouch this level of optimization for local hardware is actually wild.

5h18

elie@eliebakouch

@jojopirker oh true, i think you are right

4h6