MLX co-creator Awni Hannun details how Apple runs its 20B parameter model on-device by loading experts once per query

Views 61.6K · Bookmarks 250 · Likes 422 · Replies 20

Zephyr@zephyr_z9

Interesting approach from Apple. They keep the shared attention block in DRAM, while the FFN weights stay in NAND and are loaded into DRAM depending on the request.

Apple is facing 3 constraints:
1) Limited DRAM size
2) Large model size (20B params)
3) Slow NAND read speed

A super small model (sub-8B) won't be that useful, but they can't store a 20B model in DRAM (due to the memory shortage). They also have to manage the KV cache overhead. If they streamed the weights entirely from the iPhone SSD, it would take 2.5 seconds to generate just 1 token (0.4 tokens/s).

So the big thing here is that a normal MoE activates different experts based on every token, but in Apple's case, a sparse mask predictor decides which parameters to activate based on the request/prompt, locks it in, and loads it into the DRAM (1B-4B depending on the request). They basically convert a 20B MoE (with 1B-4B active) into a dense 1B-4B param model for a request.

The tradeoff: they are basically adding 0.3-1.5 seconds (for 1B to 4B params loaded) of latency to TTFT by loading FFN weights from NAND into DRAM per request (read speed is around 1.5-1.7 GB/s for iPhones), and taking a hit to performance. They will get around 15-50 tokens/s of decode speed (depending on params loaded). Ideally, smartphones would come with 24-32 GB of RAM so that 20B param models could be loaded, but the memory shortage won't allow that to happen.

But their competitor here is ChatGPT Instant, which is a much smarter model that runs at 200+ tokens/s, has a TTFT of 0.8 seconds (Apple's TTFT will be around 0.5-2 seconds, and its decode speed around 15-50 tokens/s), and is also free. Apple's AFM on-device models will be great for privacy-focused tasks. They get beaten by cloud models on other benchmarks (perf, speed, quality).
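The load-latency figures in the post above can be sanity-checked with simple arithmetic. The sketch below is not from the post: it assumes 4-bit weights (the post does not state a precision) and uses a hypothetical helper, `load_seconds`. Under that assumption, reading 1B to 4B parameters at 1.5-1.7 GB/s takes roughly 0.3 to 1.3 seconds, which sits inside the quoted 0.3-1.5 second range.

```python
def load_seconds(params_billions: float, bits_per_weight: float,
                 read_gb_per_s: float) -> float:
    """Seconds to read a set of weights from flash at a sustained read rate."""
    # 1e9 params * (bits / 8) bytes each = that many GB
    gigabytes = params_billions * bits_per_weight / 8
    return gigabytes / read_gb_per_s


# Assumption: 4-bit weights. Read speeds are the 1.5-1.7 GB/s quoted in the post.
for params in (1, 4):
    for bandwidth in (1.5, 1.7):
        seconds = load_seconds(params, 4, bandwidth)
        print(f"{params}B params at {bandwidth} GB/s: {seconds:.2f} s")
```

At a higher precision the same loads would take proportionally longer, so the result depends heavily on the quantization Apple actually uses.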

Awni Hannun@awnihannun

It's very cool that Apple shipped a 20B parameter model on-device.

You can't put 20B parameters in RAM at any reasonable precision. To make it work they are using a pretty exotic architecture by today's standards.

A small model predicts from the query (or prompt) which experts to load from NAND into RAM. The key distinction from a typical MoE is that you do this once per query and then generate all the tokens with the same experts (instead of switching the experts for every token).
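A toy sketch of that distinction, counting how many expert loads from flash each scheme needs. Everything here is hypothetical: the expert counts, the `toy_scores` stand-in for a learned predictor or router, and the assumption that RAM holds only `TOP_K` experts at a time. It illustrates the idea and is not Apple's implementation.

```python
import random

NUM_EXPERTS = 16  # hypothetical number of experts stored in flash
TOP_K = 4         # hypothetical number of experts that fit in RAM


def top_k(scores, k):
    """Indices of the k highest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def toy_scores(text, salt):
    """Stand-in for a learned predictor/router: deterministic pseudo-random scores."""
    rng = random.Random(f"{salt}:{text}")
    return [rng.random() for _ in range(NUM_EXPERTS)]


def per_query_loads(prompt):
    """Pick experts once from the prompt; every generated token reuses them."""
    resident = set(top_k(toy_scores(prompt, "query"), TOP_K))
    return resident, len(resident)  # one batch of flash reads before decoding


def per_token_loads(prompt, num_tokens):
    """Conventional MoE routing: each token may select different experts."""
    resident, loads = set(), 0
    for t in range(num_tokens):
        wanted = set(top_k(toy_scores(f"{prompt}#{t}", "token"), TOP_K))
        loads += len(wanted - resident)  # experts not in RAM must be fetched
        resident = wanted                # RAM only holds TOP_K experts
    return loads


experts, once = per_query_loads("summarize this email")
print("per-query loads:", once)
print("per-token loads:", per_token_loads("summarize this email", 50))
```

The per-query count stays fixed at `TOP_K` regardless of how many tokens are generated, while the per-token count grows with the output length whenever the router's choices change between tokens.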

Zephyr@zephyr_z9

Apple's On-Device Model Architecture

Awni Hannun@awnihannun

Blog post: https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models

Original publication: https://machinelearning.apple.com/research/pruning-large-language

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

This is quite amazing. First time I see explicit MoE design for the GPU-poors.

stochasm@stochasticchasm

@teortaxesTex similar to

7y913@aayeinbaigan

@awnihannun "Instead of forcing the entire model into DRAM, the full model is stored in flash memory (NAND)" - does this mean the entire 20B param model is stored on the device, taking up ~10 GB of NAND space?

Kraggi@Kraggich

@awnihannun the part I'm waiting for too. running a bunch of coding agents in parallel, half the work is boring (labels, summaries, classify) and doesn't need a frontier model. a solid local one quietly grinding that while the cloud does the actual reasoning changes the whole cost math.

Rob Anderson@robandersonnz

@awnihannun @ActuallyIsaak it's almost like each expert is a memory page & depending on the code being executed the OS moves the required page into fast RAM. Like a 1970s IBM mainframe 😉

Zephyr@zephyr_z9

@eliebakouch yeah

Sam Gijsen@SamCJG

@Everlier @awnihannun In principle you can include the same n experts every time to allow for the remaining experts to be distinct. Isn't this already done in one of the DeepSeek releases?

Owen@owenyuwono

@awnihannun apple gets it, the scalable way for AI is locally run, not with data centers

Alex Vu@robberviet

@awnihannun 20B is huge. Not sure how they will do it. At least, I cannot run gpt-oss-20b comfortably on my MacBook.

Kautuk | Conscious Engines@Kautukkundan

@awnihannun Compatibility woes for older gen hardware!

无痕@tracenull1

@zephyr_z9 @grok @gork What does this image say?

Rafa Schwinger 🇻🇦@Rafa_Schwinger

@aayeinbaigan @awnihannun At least 10 GB if int4, but yes, it is on the device
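The ~10 GB figure follows from parameter count times bytes per weight. A minimal sketch, assuming uniform quantization and ignoring any metadata or scale factors (the helper name `footprint_gb` is hypothetical):

```python
def footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB, ignoring quantization metadata."""
    return params_billions * bits_per_weight / 8


# 20B parameters at a few common precisions
for bits in (4, 8, 16):
    print(f"{bits}-bit: {footprint_gb(20, bits):.0f} GB")
```

Even the 4-bit case needs about 10 GB for weights alone, which is why the full model stays in flash rather than RAM.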

Pranav@IamPranavJ

@awnihannun Cool breakdown. The once-per-query loading is forced by NAND bandwidth. Swapping experts every token would stall generation on flash reads, so they load once and reuse. The tradeoff: you lose per-token routing. Every token is stuck with the expert set the prompt picked.

Jeffrey 杰弗瑞@tomcocobrico

@Everlier @awnihannun Maybe the engineer implementing it also didn’t know and it worked somewhat okay

checo fan #11@CadillacCheco11

@awnihannun Can they just fix Siri first? wtf. I don’t wanna hear about Apple and AI
