
MLX co-creator Awni Hannun details how Apple runs its 20B parameter model on-device by loading experts once per query

This bypasses memory-bandwidth bottlenecks on resource-constrained hardware.

Original post
Awni Hannun@awnihannun

It's very cool that Apple shipped a 20B parameter model on-device.

You can't put 20B parameters in RAM at any reasonable precision. To make it work they are using a pretty exotic architecture by today's standards.

A small model predicts from the query (or prompt) which experts to load from NAND into RAM. The key distinction from a typical MoE is that you do this once per query and then generate all the tokens with the same experts (instead of switching the experts for every token).

9:25 PM · Jun 8, 2026 · 214.5K Views
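
To make the RAM claim in the post concrete, here is a minimal sketch of the raw weight footprint of a 20B-parameter model at a few precisions. The precisions chosen (16, 8, and 4 bits) and the helper name `weight_footprint_gb` are illustrative assumptions; the post does not say which precision Apple uses, and the figures ignore the KV cache and activations.

```python
def weight_footprint_gb(num_params, bits_per_param):
    """Storage needed for the raw weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Assumed precisions for illustration; none of these is confirmed by the post.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_footprint_gb(20e9, bits):.0f} GB")
```

Even at 4 bits the weights alone come to about 10 GB, which matches the int4 estimate given in the replies.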
Sentiment

Positive users praise the engineering of Apple's 20B sparse MoE on-device model for efficient local inference from NAND, while negative users dismiss the effort as unoriginal, worry about battery drain and profiling, and urge fixes to Siri.

Pos 34.6% · Neg 65.4%
14 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 61.6K · Bookmarks 250 · Likes 422 · Replies 20
Zephyr@zephyr_z9

Interesting approach from Apple. They are storing the shared attention block in DRAM, while the FFN weights stay in NAND and are loaded into DRAM depending on the request.

Apple is facing 3 constraints:
1) Limited DRAM size
2) Large model size (20B params)
3) Slow NAND read speed

A super small model (sub 8B) won't be that useful, but they can't store a 20B model in DRAM (due to memory shortage). They also have to manage the KV cache overhead. If they streamed the weights completely from the iPhone SSD, it would take 2.5 seconds to generate just 1 token (0.4 tokens/s).
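
A rough sketch of the streaming estimate in the paragraph above: when decoding is bound by how fast weights can be read, tokens per second is read bandwidth divided by bytes read per token. The 4 GB per token figure is an assumption back-derived from the post's 2.5 s per token at about 1.6 GB/s (the middle of the read-speed range quoted later in the same post); the post does not state it directly.

```python
def decode_tokens_per_second(bytes_read_per_token, read_bandwidth_bytes_per_s):
    """Upper bound on decode speed when every token re-reads its weights."""
    return read_bandwidth_bytes_per_s / bytes_read_per_token


ASSUMED_BYTES_PER_TOKEN = 4e9   # assumption: back-derived, not stated in the post
ASSUMED_NAND_BANDWIDTH = 1.6e9  # bytes/s, midpoint of the post's 1.5-1.7 GB/s

rate = decode_tokens_per_second(ASSUMED_BYTES_PER_TOKEN, ASSUMED_NAND_BANDWIDTH)
print(f"{rate:.1f} tokens/s, {1 / rate:.1f} s per token")
```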

So the big thing here is that a normal MoE activates different experts for every token, but in Apple's case a sparse mask predictor decides which parameters to activate based on the request/prompt, locks that choice in, and loads them into DRAM (1B-4B depending on the request). They basically convert a 20B MoE (with 1B-4B active) into a dense 1B-4B param model for a request.
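
The difference between the two routing schemes can be sketched with a toy model that only counts how many experts must be read from flash. Everything here is hypothetical: `NUM_EXPERTS`, `TOP_K`, and the pseudo-random `select_experts` router stand in for a real learned predictor, and this is not Apple's implementation.

```python
import random

NUM_EXPERTS = 16  # hypothetical number of experts stored in flash
TOP_K = 4         # hypothetical number of experts that fit in RAM


def select_experts(key, k=TOP_K):
    """Toy router: deterministic pseudo-random scores, keep the k best."""
    rng = random.Random(key)  # string seeds give reproducible scores
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    ranked = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)
    return frozenset(ranked[:k])


def loads_per_query(prompt, num_tokens):
    """Pick experts once from the prompt and reuse them for every token."""
    resident = select_experts(prompt)
    return len(resident)  # a single load phase, independent of num_tokens


def loads_per_token(prompt, num_tokens):
    """Conventional MoE routing: re-select experts at every token."""
    resident, flash_loads = frozenset(), 0
    for t in range(num_tokens):
        wanted = select_experts(f"{prompt}|{t}")
        flash_loads += len(wanted - resident)  # misses are read from flash
        resident = wanted                      # RAM holds only TOP_K experts
    return flash_loads


prompt = "summarize this email"
print("per-query loads:", loads_per_query(prompt, 100))
print("per-token loads:", loads_per_token(prompt, 100))
```

With per-query selection the flash reads stay fixed at `TOP_K` no matter how many tokens are generated, while per-token routing keeps paying for misses throughout decoding.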

The tradeoff: they are basically adding 0.3-1.5 seconds (1B to 4B params loaded) of latency to TTFT by loading FFN weights from NAND into DRAM per request (read speed is around 1.5-1.7 GB/s for iPhones), and taking a hit to performance. They will get around 15-50 tokens/s of decode speed (depending on params loaded). Ideally, smartphones would come with 24-32 GB of RAM so that 20B param models could be loaded, but the memory shortage won't allow that to happen.
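
The added load latency can be checked with simple arithmetic: bytes to load divided by read bandwidth. This sketch assumes 4-bit weights, which is not stated in the post; the parameter counts and the 1.5-1.7 GB/s read speeds are the post's own figures.

```python
def load_seconds(num_params, bits_per_param, read_bandwidth_bytes_per_s):
    """Time to read the selected weights from flash into RAM."""
    return num_params * bits_per_param / 8 / read_bandwidth_bytes_per_s


ASSUMED_BITS = 4  # assumption: the post does not give the weight precision

for params in (1e9, 4e9):
    fast = load_seconds(params, ASSUMED_BITS, 1.7e9)
    slow = load_seconds(params, ASSUMED_BITS, 1.5e9)
    print(f"{params / 1e9:.0f}B params: {fast:.2f}-{slow:.2f} s")
```

Under the 4-bit assumption this gives roughly 0.3 s for 1B parameters and about 1.2 to 1.3 s for 4B, in the same range as the 0.3-1.5 s quoted in the post.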

But their competitor here is ChatGPT Instant, which is a much smarter model that runs at 200+ tokens/s, has a TTFT of 0.8 seconds (Apple's TTFT will be around 0.5-2 seconds, and decode speed around 15-50 tokens/s), and is also free. Apple's AFM on-device models will be great for privacy-focused tasks. They get beaten by cloud models on other benchmarks (perf, speed, quality).

1d · Views 61.6K · Likes 422 · Bookmarks 250
Zephyr@zephyr_z9

Apple's On-Device Model Architecture

1d · Views 43.2K · Likes 207 · Bookmarks 97
Awni Hannun@awnihannun

Blog post: https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models

Original publication: https://machinelearning.apple.com/research/pruning-large-language

1d · Views 13.9K · Likes 160 · Bookmarks 109

This is quite amazing. First time I've seen an explicit MoE design for the GPU-poor.

1d · Views 6.7K · Likes 100 · Bookmarks 21
stochasm@stochasticchasm

@teortaxesTex similar to

1d · Views 476 · Likes 4 · Bookmarks 2
7y913@aayeinbaigan

@awnihannun "Instead of forcing the entire model into DRAM, the full model is stored in flash memory (NAND)" - does this mean the entire 20B param model is stored on the device, taking up ~10 GB of NAND space?

1d · Views 1.5K · Likes 4
Kraggi@Kraggich

@awnihannun the part I'm waiting for too. running a bunch of coding agents in parallel, half the work is boring (labels, summaries, classify) and doesn't need a frontier model. a solid local one quietly grinding that while the cloud does the actual reasoning changes the whole cost math.

1d · Views 621 · Likes 8
Rob Anderson@robandersonnz

@awnihannun @ActuallyIsaak it's almost like each expert is a memory page, and depending on the code being executed the OS moves the required page into fast RAM. Like a 1970s IBM mainframe 😉

1d · Views 511 · Likes 3
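
The paging analogy in the reply above can be sketched as a small demand-paged cache in which each expert is a "page" pulled into RAM on first use. The class name `ExpertPageCache` and the LRU eviction policy are illustrative assumptions; the thread does not describe any eviction policy, and the design described in the original post fixes the expert set per query rather than paging on demand.

```python
from collections import OrderedDict


class ExpertPageCache:
    """Toy demand-paged cache: expert id -> weights, with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.faults = 0

    def get(self, expert_id):
        if expert_id in self.pages:
            self.pages.move_to_end(expert_id)  # mark as most recently used
            return self.pages[expert_id]
        self.faults += 1                       # "page fault": read from flash
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)     # evict least recently used
        self.pages[expert_id] = f"weights-{expert_id}"  # placeholder payload
        return self.pages[expert_id]


cache = ExpertPageCache(capacity=2)
for expert_id in (1, 2, 1, 3, 2):
    cache.get(expert_id)
print("faults:", cache.faults)  # prints "faults: 4"
```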
Zephyr@zephyr_z9

@eliebakouch yeah

1d · Views 601 · Likes 2 · Bookmarks 1
Sam Gijsen@SamCJG

@Everlier @awnihannun In principle you can include the same n experts every time to allow for the remaining experts to be distinct, isn’t this already done in one of the deepseek releases?

1d · Views 36 · Likes 1 · Bookmarks 1
Owen@owenyuwono

@awnihannun apple gets it, the scalable way for AI is locally run, not with data centers

1d · Views 964 · Likes 5
Alex Vu@robberviet

@awnihannun 20B is huge. Not sure how they will do it. At least I cannot run gpt-oss-20b comfortably on my MacBook.

1d · Views 482 · Bookmarks 1
无痕@tracenull1

@zephyr_z9 @grok @gork What does this image say?

1d · Views 103

@aayeinbaigan @awnihannun At least 10 GB if int4, but it is on the device

1d · Views 233 · Likes 2
Pranav@IamPranavJ

@awnihannun Cool breakdown. The once-per-query loading is forced by NAND bandwidth. Swapping experts every token would stall generation on flash reads, so they load once and reuse. The tradeoff: you lose per-token routing. Every token is stuck with the expert set the prompt picked.

1d · Views 882 · Likes 4
Jeffrey 杰弗瑞@tomcocobrico

@Everlier @awnihannun Maybe the engineer implementing it also didn’t know and it worked somewhat okay

1d · Views 70 · Likes 2
checo fan #11@CadillacCheco11

@awnihannun Can they just fix Siri first? wtf. I don’t wanna hear about Apple and AI

1d · Views 378