
MLX co-creator Awni Hannun details how Apple runs its 20B-parameter model on-device by loading experts once per query

This bypasses memory-bandwidth bottlenecks on resource-constrained hardware.

Original post
Awni Hannun @awnihannun · #802 in AI

It's very cool that Apple shipped a 20B-parameter model on-device.

You can't put 20B parameters in RAM at any reasonable precision. To make it work they are using a pretty exotic architecture by today's standards.

A small model predicts from the query (or prompt) which experts to load from NAND into RAM. The key distinction from a typical MoE is that you do this once per query and then generate all the tokens with the same experts (instead of switching the experts for every token).
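
To make the once-per-query idea concrete, here is a minimal toy sketch. It is not Apple's implementation: the sizes, the `FlashStore` class, the linear selector, and the averaging "compute" step are all hypothetical stand-ins. It only illustrates the control flow the post describes: score the experts once from the prompt, copy the chosen ones from flash into RAM, then reuse them for every generated token.

```python
import random

NUM_EXPERTS, TOP_K, DIM = 8, 2, 4  # hypothetical sizes, chosen for illustration


class FlashStore:
    """Stand-in for expert weights held in NAND flash; counts reads into RAM."""

    def __init__(self, num_experts, dim, seed=0):
        rng = random.Random(seed)
        # One small weight vector per expert (a real expert is far larger).
        self._experts = [[rng.uniform(-1, 1) for _ in range(dim)]
                         for _ in range(num_experts)]
        self.reads = 0

    def load(self, expert_id):
        self.reads += 1
        return list(self._experts[expert_id])


def select_experts(prompt_features, selector_rows, k):
    """Hypothetical small predictor: one linear score per expert, keep the top k."""
    scores = [sum(f * w for f, w in zip(prompt_features, row))
              for row in selector_rows]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def generate(store, selector_rows, prompt_features, num_tokens, k=TOP_K):
    expert_ids = select_experts(prompt_features, selector_rows, k)
    resident = [store.load(i) for i in expert_ids]  # flash -> RAM, once per query
    state = list(prompt_features)
    outputs = []
    for _ in range(num_tokens):
        # Every decoding step reuses the resident experts; no further flash reads.
        mixed = [sum(w[j] for w in resident) / len(resident)
                 for j in range(len(state))]
        state = [0.5 * s + 0.5 * m for s, m in zip(state, mixed)]
        outputs.append(sum(state))
    return expert_ids, outputs


store = FlashStore(NUM_EXPERTS, DIM)
selector_rng = random.Random(1)
selector_rows = [[selector_rng.uniform(-1, 1) for _ in range(DIM)]
                 for _ in range(NUM_EXPERTS)]
expert_ids, outputs = generate(store, selector_rows, [0.2, -0.1, 0.4, 0.3],
                               num_tokens=32)
print("experts:", expert_ids, "flash reads:", store.reads)
```

However many tokens are generated, `store.reads` stays at `TOP_K`. A conventional per-token MoE router would instead call the selection step inside the loop, which is cheap when all experts already sit in RAM but not when each switch means a flash read.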

9:25 PM · Jun 8, 2026 · 75.5K Views
Sentiment

Many users praise Apple's 20B on-device LLM and its query-based sparse expert architecture as clever engineering that enables efficient local scaling and helps GPU-poor users, while some worry about compatibility issues and hampered performance.

Positive: 81.8%
Negative: 18.2%
15 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views: 6.3K · Bookmarks: 39 · Likes: 71
Awni Hannun @awnihannun

Blog post: https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models

Original publication: https://machinelearning.apple.com/research/pruning-large-language

5h · Views 6.3K · Likes 71 · Bookmarks 39
Retweets: 62
Awni Hannun @awnihannun (original post, shown above)

5h · Views 75.5K · Likes 1.3K · Bookmarks 557
Replies: 7

This is quite amazing. It's the first time I've seen an explicit MoE design for the GPU-poor.

1h · Views 1.2K · Likes 17 · Bookmarks 4
7y913 @aayeinbaigan

@awnihannun "Instead of forcing the entire model into DRAM, the full model is stored in flash memory (NAND)" - does this mean the entire 20B-parameter model is stored on the device, taking up ~10 GB of NAND space?

4h · Views 1.5K · Likes 4
Owen @owenyuwono

@awnihannun Apple gets it: the scalable way to run AI is locally, not in data centers.

5h · Views 964 · Likes 5
Alex Vu @robberviet

@awnihannun 20B is huge. Not sure how they will do it. At least I cannot run gpt-oss-20b comfortably on my MacBook.

2h · Views 482 · Bookmarks 1

@aayeinbaigan @awnihannun At least 10 GB if int4, but it is on the device.
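
The ~10 GB figure in this exchange follows from simple arithmetic, sketched below. The helper `weight_footprint_gb` is a hypothetical name, and the calculation counts raw weights only. It ignores quantization scales, metadata, and any tensors kept at higher precision, so real on-disk sizes can differ.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Raw weight storage in decimal gigabytes (weights only, no overhead)."""
    return num_params * bits_per_weight / 8 / 1e9


# 20B parameters at a few common precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_footprint_gb(20e9, bits):.0f} GB")
```

At 16 bits the weights alone would need about 40 GB, which is why the post says the full model cannot sit in RAM at any reasonable precision. Only the 4-bit case lands near the 10 GB discussed in the replies.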

4h · Views 233 · Likes 2
Pranav @IamPranavJ

@awnihannun Cool breakdown. The once-per-query loading is forced by NAND bandwidth. Swapping experts every token would stall generation on flash reads, so they load once and reuse. The tradeoff: you lose per-token routing. Every token is stuck with the expert set the prompt picked.
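
The bandwidth argument in this reply can be put into rough numbers. The sketch below is a back-of-envelope model only: the expert-set size, flash bandwidth, and token count are hypothetical values picked for illustration, not figures from Apple, and `total_flash_time` is an invented helper.

```python
def flash_read_seconds(num_bytes, bandwidth_bytes_per_s):
    """Time to stream num_bytes from flash at a sustained bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


def total_flash_time(expert_set_bytes, bandwidth_bytes_per_s, num_tokens,
                     swap_fraction):
    """Seconds spent reading experts from flash over one generation.

    swap_fraction is the share of resident expert bytes replaced at every
    token: 0.0 models once-per-query loading, 1.0 a full swap per token.
    """
    initial_load = flash_read_seconds(expert_set_bytes, bandwidth_bytes_per_s)
    per_token = flash_read_seconds(expert_set_bytes * swap_fraction,
                                   bandwidth_bytes_per_s)
    return initial_load + per_token * num_tokens


# Hypothetical numbers: 2 GB of active experts, 2 GB/s flash reads, 200 tokens.
EXPERT_SET_BYTES = 2e9
FLASH_BANDWIDTH = 2e9
TOKENS = 200

once_per_query = total_flash_time(EXPERT_SET_BYTES, FLASH_BANDWIDTH, TOKENS,
                                  swap_fraction=0.0)
half_swap = total_flash_time(EXPERT_SET_BYTES, FLASH_BANDWIDTH, TOKENS,
                             swap_fraction=0.5)
print(f"once per query: {once_per_query:.1f} s")
print(f"half the experts swapped per token: {half_swap:.1f} s")
```

Under these assumed values the one-time load costs about a second, while replacing even half of the experts at every token adds roughly half a second of flash reads per token. That gap is the stall the reply describes, and the price of avoiding it is that routing cannot adapt after the prompt.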

4h · Views 882 · Likes 4
Jeffrey 杰弗瑞 @tomcocobrico

@Everlier @awnihannun Maybe the engineer implementing it also didn’t know and it worked somewhat okay

2h · Views 70 · Likes 2
dns_di @Denis_13_1982

@awnihannun I am sure that shit will be used mainly to profile you, drain battery like crazy and make your phone hot so you will consider buying gloves🥴

3h · Views 283
Ayush @Highestage

@awnihannun However, the system's capability gets seriously hampered by invoking only a single expert. Memory compression is needed to get one effective system, instead of having to stream weights for each individual invocation per prompt.

4h · Views 852 · Likes 3
Satyam @stym06

@Everlier @awnihannun There has been a lot of distilling and refinement using the latest Gemini models to get here

2h · Views 66 · Likes 1
Everlier @Everlier

@awnihannun I find it funny that this is actually exactly how most people think MoE works.

The overlap between experts must be pretty high to support such an architecture. So much so, I'm questioning how much of its performance will be retained by just frankenmerging those.

3h · Views 164
Maziyar PANAHI @MaziyarPanahi

@awnihannun @ivanfioravanti Genuinely amazing work!

1h · Views 311 · Likes 3

@awnihannun per-query expert loading dodges the NAND bandwidth wall.

at 1M context the KV cache geometry matters more than active parameter count.
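
For a sense of scale behind the KV cache remark, the standard size formula for a plain, uncompressed full-attention cache is sketched below. The layer count, KV head count, head dimension, and precision are hypothetical and not taken from any published configuration. Architectures that use sliding windows, cache sharing, or cache quantization would come out smaller.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value):
    """Size of a plain (uncompressed, full-attention) KV cache.

    The leading 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * bytes_per_value)


# Hypothetical configuration: 32 layers, 8 KV heads of dimension 128, 16-bit values.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      context_tokens=1_000_000, bytes_per_value=2)
print(f"KV cache at 1M tokens: {size / 1e9:.1f} GB")
```

With these assumed dimensions the cache alone is roughly 131 GB at one million tokens, far more than any plausible set of resident expert weights, which is the sense in which cache geometry can dominate at very long contexts.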

1h · Views 777 · Likes 1
MobaiLabs @mobailabs

The expert prediction layer is the key insight here. Instead of loading all 20B params, you predict which experts matter per query and only materialize those weights. It's essentially a dynamic routing problem — and it makes on-device inference feasible at a scale that would otherwise be impossible. Curious how the routing overhead compares to the inference savings.

4h · Views 620
Jordan Hamel @jordanjhamel

@awnihannun Oh, interesting use of on-device routing to let the small model select the experts. Cool.

4h · Views 563
egesea @egesea009

@awnihannun What's fascinating is that on-device AI is forcing entirely different architectural trade-offs. Cloud models optimize around compute. On-device models optimize around memory. The constraints are changing, and architecture is evolving with them.

1h · Views 50 · Likes 2

@aayeinbaigan @awnihannun It’s a QAT model (Gemma4 architecture) so it should be a bit less but yes.

3h · Views 98 · Likes 1