/AI · 5h ago

MLX co-creator Awni Hannun details how Apple runs its 20B parameter model on-device by loading experts once per query

This bypasses memory-bandwidth bottlenecks on resource-constrained hardware.

481.4K13958079.6K

#38

Original post

Awni Hannun @awnihannun · #802 in AI

It's very cool that Apple shipped a 20B parameter model on-device.

You can't put 20B parameters in RAM at any reasonable precision. To make it work they are using a pretty exotic architecture by today's standards.

A small model predicts from the query (or prompt) which experts to load from NAND into RAM. The key distinction from a typical MoE is that you do this once per query and then generate all the tokens with the same experts (instead of switching the experts for every token).

9:25 PM · Jun 8, 2026 · 75.5K Views
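The once-per-query selection described in the post can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the expert count, the top-k value, the hidden size, the single linear layer standing in for the "small model", and the names `select_experts` and `generate`. It is not Apple's implementation, only a minimal picture of "pick experts from the prompt, load them once, reuse them for every token".

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # hypothetical expert count
TOP_K = 2         # hypothetical number of experts kept resident in RAM
D_MODEL = 16      # hypothetical hidden size

# Stand-in for expert weights that live in flash storage.
flash_experts = [rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(NUM_EXPERTS)]
# Stand-in for the small predictor model: one linear scoring layer.
predictor = rng.standard_normal((D_MODEL, NUM_EXPERTS))


def select_experts(prompt_embedding, k=TOP_K):
    """Score all experts once from the prompt and return the top-k indices."""
    scores = prompt_embedding @ predictor
    return sorted(np.argsort(scores)[-k:].tolist())


def generate(prompt_embedding, num_tokens):
    """Load the chosen experts once, then reuse them for every token."""
    chosen = select_experts(prompt_embedding)
    resident = {i: flash_experts[i] for i in chosen}  # the only simulated flash read
    loads = len(resident)
    hidden = prompt_embedding
    for _ in range(num_tokens):
        # Every token runs through the same resident experts; no per-token routing.
        mixed = sum(hidden @ w for w in resident.values()) / len(resident)
        hidden = np.tanh(mixed)
    return chosen, loads, hidden
```

In this sketch the number of simulated flash reads stays at `TOP_K` no matter how many tokens are generated, which is the property the post highlights.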

Sentiment

Many users praise Apple's 20B on-device LLM with query-based sparse expert architecture for its clever engineering that enables efficient local scaling and helps GPU-poor users, while some worry about compatibility issues and hampered performance.

Pos

81.8%

Neg

18.2%

15 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 6.3K · BOOKMARKS 39 · LIKES 71

Awni Hannun@awnihannun

Blog post: https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models

Original publication: https://machinelearning.apple.com/research/pruning-large-language

Awni Hannun@awnihannun

It's very cool that Apple shipped a 20B parameter model on-device.

You can't put 20B parameters in RAM at any reasonable precision. To make it work they are using a pretty exotic architecture by today's standards.

5h6.3K7139

RETWEETS62

Awni Hannun@awnihannun

It's very cool that Apple shipped a 20B parameter model on-device.

You can't put 20B parameters in RAM at any reasonable precision. To make it work they are using a pretty exotic architecture by today's standards.

5h75.5K1.3K557

REPLIES7

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

This is quite amazing. First time I've seen an explicit MoE design for the GPU-poor.

Awni Hannun@awnihannun

It's very cool that Apple shipped a 20B parameter model on-device.

You can't put 20B parameters in RAM at any reasonable precision. To make it work they are using a pretty exotic architecture by today's standards.

1h1.2K174

7y913@aayeinbaigan

@awnihannun "Instead of forcing the entire model into DRAM, the full model is stored in flash memory (NAND)" - does this mean the entire 20B-parameter model is stored on the device, taking up ~10 GB of NAND space?

4h1.5K4

Owen@owenyuwono

@awnihannun apple gets it, the scalable way for AI is locally run, not with data centers

5h9645

Alex Vu@robberviet

@awnihannun 20B is huge. Not sure how they will do it. At least I cannot run gpt-oss-20b comfortably on my MacBook.

2h4821

Kautuk | Conscious Engines@Kautukkundan

@awnihannun Compatibility woes for older gen hardware!

4h8973

Rafa Schwinger 🇻🇦@Rafa_Schwinger

@aayeinbaigan @awnihannun At least 10 GB if int4, but it is on the device

4h2332
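The ~10 GB figure in this exchange follows from simple arithmetic on the weights alone. The sketch below assumes decimal gigabytes and ignores quantization scales, other metadata, and any runtime buffers; the helper name `weight_footprint_gb` is made up for illustration.

```python
def weight_footprint_gb(num_params, bits_per_param):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# 20B parameters at a few common precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_footprint_gb(20e9, bits):.0f} GB")
```

At 4 bits per parameter this gives 10 GB, matching the int4 estimate in the reply; 16-bit weights would need four times as much.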

Pranav@IamPranavJ

@awnihannun Cool breakdown. The once-per-query loading is forced by NAND bandwidth. Swapping experts every token would stall generation on flash reads, so they load once and reuse. The tradeoff: you lose per-token routing. Every token is stuck with the expert set the prompt picked.

4h8824
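The bandwidth argument in this reply can be made concrete with a rough back-of-the-envelope model. All numbers below (expert size, experts per load, flash read bandwidth, token count) are hypothetical placeholders, not measurements of any device, and the per-token case assumes a worst case where nothing is cached between tokens.

```python
def flash_bytes_read(num_tokens, experts_per_load, expert_bytes, per_token):
    """Total bytes pulled from flash for one response.

    per_token=True models a worst case where every token reloads its experts;
    per_token=False loads the experts once for the whole query.
    """
    loads = num_tokens if per_token else 1
    return loads * experts_per_load * expert_bytes


EXPERT_BYTES = 100e6   # hypothetical: 100 MB per expert
EXPERTS_PER_LOAD = 4   # hypothetical
FLASH_BW = 2e9         # hypothetical: 2 GB/s sustained read
TOKENS = 200           # hypothetical response length

for label, per_token in (("per-token", True), ("per-query", False)):
    total = flash_bytes_read(TOKENS, EXPERTS_PER_LOAD, EXPERT_BYTES, per_token)
    print(f"{label}: {total / 1e9:.1f} GB read, {total / FLASH_BW:.1f} s spent on flash reads")
```

Under these assumptions the per-token scheme reads as many times more data as there are generated tokens, which is the stall the reply describes.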

Jeffrey 杰弗瑞@tomcocobrico

@Everlier @awnihannun Maybe the engineer implementing it also didn’t know and it worked somewhat okay

2h702

dns_di@Denis_13_1982

@awnihannun I am sure that shit will be used mainly to profile you, drain battery like crazy and make your phone hot so you will consider buying gloves🥴

3h283

Ayush@Highestage

@awnihannun However, the system's capability gets seriously hampered by invoking a single expert. Memory compression is needed to have an effective single system, instead of having to stream weights on every invocation per prompt.

4h8523

Satyam@stym06

@Everlier @awnihannun There has been a lot of distilling and refinement using the latest Gemini models to get here

2h661

Everlier@Everlier

@awnihannun I find it funny that this is actually exactly how most think that MoE works.

The overlap between experts must be pretty high to support such an architecture. So much so, I'm questioning how much of its performance will be retained by just frankenmerging those.

3h164

Maziyar PANAHI@MaziyarPanahi

@awnihannun @ivanfioravanti Genuinely amazing work!

1h3113

Mert · AI Architect@MertLovesAI

@awnihannun per-query expert loading dodges the NAND bandwidth wall.

at 1M context the KV cache geometry matters more than active parameter count.

1h7771
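For a sense of scale on the KV cache point, here is the usual size estimate for a standard attention stack with a per-layer key and value cache. The layer count, KV head count, head dimension, and 2-byte values are hypothetical and do not describe Apple's model; the function name `kv_cache_bytes` is also just illustrative.

```python
def kv_cache_bytes(context_len, num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Size of a plain KV cache; the factor of 2 covers keys and values."""
    return 2 * context_len * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical geometry: 32 layers, 8 KV heads, head_dim 128, 16-bit values.
size = kv_cache_bytes(1_000_000, 32, 8, 128, 2)
print(f"{size / 1e9:.1f} GB at 1M tokens")
```

With this made-up geometry the cache at 1M tokens is far larger than the int4 weight footprint discussed earlier in the thread, which is why the cache layout can dominate at long context.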

MobaiLabs@mobailabs

The expert prediction layer is the key insight here. Instead of loading all 20B params, you predict which experts matter per query and only materialize those weights. It's essentially a dynamic routing problem — and it makes on-device inference feasible at a scale that would otherwise be impossible. Curious how the routing overhead compares to the inference savings.

4h620

Jordan Hamel@jordanjhamel

@awnihannun Oh, interesting use of on-device routing to let the small model select the experts. Cool.

4h563

egesea@egesea009

@awnihannun What's fascinating is that on-device AI is forcing entirely different architectural trade-offs. Cloud models optimize around compute. On-device models optimize around memory. The constraints are changing, and architecture is evolving with them.

1h502

hristoforgeorgiev.eth@HristoforG

@aayeinbaigan @awnihannun It’s a QAT model (Gemma4 architecture) so it should be a bit less but yes.

3h981