
@Prince_Canuma Super fast delivery 🚀🚀
Replies are mostly positive about Gemma 4 12B's 1.72× speedup on M3 Ultra via MLX-VLM, though some users worry that the model will be too large, and generation too slow, on 24GB devices such as the M5 Air.

@Prince_Canuma Ah... I see there are assistant (draft) models for the quants on Hugging Face now too. Does that mean MTP will work with the quants?

@Prince_Canuma What about the video input support?

@Prince_Canuma Awesome. But you need to start shipping some sort of helper script to download the right models, there’s so many of them now :)

@rcarmo Worry not Rui, I got you!

@Prince_Canuma I’m going to be curious to see how the tok/s with the unified decoder compares vs Qwen-3.5 9B and Gemma4 e4b.

@Prince_Canuma I fear the bf16 needed to use MTP will be too big for my 24GB unified memory (M5 Air), no? Also, I'm only getting around 9-10 tok/s with the 8bit/mxfp8 version. Is that to be expected? Not really that usable :(
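
As a rough check on the memory concern above, weight size scales with parameter count times bits per weight. The sketch below assumes 12 billion parameters (taken from the model name, not a measured figure) and counts weights only. KV cache, activations, the draft model, and quantization scales are not included, so real usage would be higher.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes), weights only."""
    # params_billion * 1e9 params * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


# Assumption: 12B parameters, inferred from the model name.
for name, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_footprint_gb(12, bits):.1f} GB")
```

Under these assumptions the bf16 weights alone come to about 24 GB, which would leave no headroom on a 24GB machine, while an 8-bit variant would need roughly half that.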

@Prince_Canuma Waiting for the oMLX update!

@1littlecoder Thanks bro!

@noctus91 It should work at the same rate as image. I will test later

@Prince_Canuma oMLX i'm alive.

@Prince_Canuma Also I never had any luck fine tuning the other models on text+image inputs. I’m wondering if this unified decoder will make the fine tuning better.

@Prince_Canuma How does mlx-vlm work with batch inference?

@Prince_Canuma Does MTP have this effect on multimodal too? Apple Silicon really handles MLX with ease.

@noctus91 @Prince_Canuma will need to revisit this w/ https://vlmaxxi.ng/

@Prince_Canuma Here’s a thought - just create a JSON manifest for ease of maintenance (that’s what I do for the piclaw add-ons - https://github.com/rcarmo/piclaw-addons/blob/main/.github/workflows/sync-catalog.yml)
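
One way the JSON manifest idea could look in practice is sketched below. Everything in it is hypothetical: the repo names, sizes, field names, and the `pick_model` helper are placeholders for illustration, not real Hugging Face repos or part of mlx-vlm. The helper picks the largest variant whose weights fit within a fraction of the available unified memory.

```python
import json

# Hypothetical manifest: repo names and sizes are placeholders, not real repos.
MANIFEST_JSON = """
[
  {"repo": "example-org/model-bf16", "quant": "bf16", "approx_gb": 24.0},
  {"repo": "example-org/model-8bit", "quant": "8bit", "approx_gb": 13.0},
  {"repo": "example-org/model-4bit", "quant": "4bit", "approx_gb": 7.0}
]
"""


def pick_model(manifest, memory_gb, headroom=0.7):
    """Return the largest entry whose weights fit in memory_gb * headroom, or None."""
    budget = memory_gb * headroom  # leave room for KV cache, OS, and other apps
    fitting = [m for m in manifest if m["approx_gb"] <= budget]
    return max(fitting, key=lambda m: m["approx_gb"]) if fitting else None


manifest = json.loads(MANIFEST_JSON)
choice = pick_model(manifest, memory_gb=24)
print(choice["repo"] if choice else "no variant fits")
```

The selected repo id could then be handed to a downloader such as `huggingface_hub.snapshot_download`. Keeping the catalog in a single JSON file means new quants only require a manifest update rather than a script change.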

@yallgetscared Yes, MTPs work with quants out of the box.
It helps with bandwidth constraints.