Interesting approach from Apple
They keep the shared attention block resident in DRAM, while the FFN weights stay in NAND and are loaded into DRAM depending on the request
Apple is facing 3 constraints:
1) Limited DRAM size
2) Large model size (20B params)
3) Slow NAND read speed
A super small model (sub 8B) won't be that useful, but they can't keep a 20B model in DRAM (due to the memory shortage), and they also have to budget for KV cache overhead. If they streamed all the weights from the iPhone's SSD instead, it would take 2.5 seconds to generate just 1 token (0.4 tokens/s)
So the big thing here is that a normal MoE picks different experts for every token, but in Apple's case, a sparse mask predictor decides which parameters to activate based on the request/prompt, locks that choice in for the whole request, and loads those parameters into DRAM (1B-4B depending on the request). They basically convert a 20B MoE (with 1B-4B active) into a dense 1B-4B param model for the duration of a request.
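A toy sketch of the difference, to make the "lock it in" idea concrete. Everything here is hypothetical: the block counts, the stand-in scoring function, and the names `score_blocks` and `select_blocks` are made up for illustration and are not Apple's predictor. The only point is that per-token routing can touch a different set of FFN blocks at every step, while a per-request mask picks the set once and reuses it.

```python
import random

NUM_BLOCKS = 16   # hypothetical number of FFN blocks sitting in NAND
ACTIVE = 4        # hypothetical number of blocks that fit in DRAM

def score_blocks(text, seed=0):
    # Stand-in for a learned predictor: deterministic pseudo-random scores per input.
    rng = random.Random(f"{seed}:{text}")
    return [rng.random() for _ in range(NUM_BLOCKS)]

def select_blocks(text):
    # Keep the ACTIVE highest-scoring blocks.
    scores = score_blocks(text)
    top = sorted(range(NUM_BLOCKS), key=lambda i: scores[i], reverse=True)[:ACTIVE]
    return frozenset(top)

def per_token_routing(tokens):
    # Normal MoE style: every token may pick a different set of blocks.
    return [select_blocks(tok) for tok in tokens]

def per_request_mask(prompt, tokens):
    # Request-level style: choose once from the prompt, reuse for every token.
    mask = select_blocks(prompt)
    return [mask for _ in tokens]

prompt = "summarize this email thread"
tokens = prompt.split()

token_level = per_token_routing(tokens)
request_level = per_request_mask(prompt, tokens)

# Distinct blocks that would need to be in DRAM over the whole request.
touched_token_level = set().union(*token_level)
touched_request_level = set().union(*request_level)

print(len(touched_token_level), "blocks touched with per-token routing")
print(len(touched_request_level), "blocks touched with a per-request mask")
```

With the per-request mask, the set of blocks in DRAM never grows past `ACTIVE`, which is what makes a one-time load from NAND workable.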
The tradeoff:
They are basically adding 0.3-1.5 seconds (1B to 4B params loaded) of latency to TTFT by loading FFN weights from NAND into DRAM per request (read speed is around 1.5-1.7 GB/s for iPhones), and taking a hit to performance on top of that
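A back-of-the-envelope check on that load time, as a minimal sketch. The 4-bit weight size is my assumption (the quantization isn't stated here), and `load_seconds` is just an illustrative helper. Under that assumption the results come out at roughly 0.3 s for 1B params and 1.25 s for 4B, inside the 0.3-1.5 second range.

```python
def load_seconds(params, bits_per_weight, read_bytes_per_s):
    # Bytes to move from NAND into DRAM, divided by sequential read bandwidth.
    weight_bytes = params * bits_per_weight / 8
    return weight_bytes / read_bytes_per_s

BITS = 4          # assumption: 4-bit quantized weights (not stated in the post)
READ_BW = 1.6e9   # middle of the 1.5-1.7 GB/s range quoted above

for params in (1e9, 4e9):
    print(f"{params / 1e9:.0f}B params: {load_seconds(params, BITS, READ_BW):.2f} s")
```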
They will get around 15-50 tokens/s of decode speed (depending on params loaded)
Ideally, smartphones would come with 24-32 GB of RAM so that 20B param models could be loaded whole, but the memory shortage won't allow that
But their competitor here is ChatGPT Instant, a much smarter model that runs at 200+ tokens/s with a TTFT of 0.8 seconds, and it is also free. Apple's TTFT will be around 0.5-2 seconds, with decode speed around 15-50 tokens/s
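To put those numbers side by side, here is a rough sketch of end-to-end time for one reply. The 200-token reply length is a hypothetical I picked for illustration, and pairing the low TTFT with the fast decode (and the high TTFT with the slow decode) assumes both follow the number of params loaded. The model is simply TTFT plus tokens divided by decode speed.

```python
def response_seconds(ttft_s, tokens_per_s, reply_tokens):
    # Time to first token plus time to decode the reply at a steady rate.
    return ttft_s + reply_tokens / tokens_per_s

REPLY_TOKENS = 200  # hypothetical reply length, chosen only for illustration

cloud = response_seconds(0.8, 200, REPLY_TOKENS)
on_device_best = response_seconds(0.5, 50, REPLY_TOKENS)
on_device_worst = response_seconds(2.0, 15, REPLY_TOKENS)

print(f"cloud: {cloud:.1f} s")
print(f"on-device best case: {on_device_best:.1f} s")
print(f"on-device worst case: {on_device_worst:.1f} s")
```

Under these assumptions the cloud reply finishes in about 1.8 seconds, versus about 4.5 seconds on device in the best case and about 15 seconds in the worst, so decode speed matters far more than TTFT here.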
Apple's AFM on-device models will be great for privacy-focused tasks. On everything else (perf, speed, quality), cloud models beat them