Tech · 8d ago

Mlx-vlm Speeds Up Gemma 4 12B With MTP Speculative Decoding

Original post unavailable.

Sentiment

Most commenters are excited about the 1.72× speedup for Gemma 4 12B on an M3 Ultra via MLX-VLM, while some worry that the model will be too large and too slow on 24GB devices such as the M5 Air.

Positive: 66.7%

Negative: 33.3%

4 comments with sentiment.

Posts from X


1LittleCoder💻 @1littlecoder

@Prince_Canuma Super fast delivery 🚀🚀

8d · 2171

LIKES 1

rurounigit @yallgetscared

@Prince_Canuma Ah... I see there are now assistant (draft) models for the quants on Hugging Face too... does that mean MTP will work with the quants?

8d · 51

REPLIES 2

Noctus @noctus91

@Prince_Canuma What about the video input support?

8d · 146

Rui Carmo ☯️ @rcarmo

@Prince_Canuma Awesome. But you need to start shipping some sort of helper script to download the right models, there’s so many of them now :)

8d · 911

Prince Canuma @Prince_Canuma

@rcarmo Worry not Rui, I got you!

8d · 791

Matt Hamilton @HammerToe

@Prince_Canuma I’m going to be curious to see how the tok/s with the unified decoder compares vs Qwen-3.5 9B and Gemma4 e4b.

8d · 192

rurounigit @yallgetscared

@Prince_Canuma I fear the bf16 version needed to use MTP will be too big for my 24GB of unified memory (M5 Air), no? Also, I'm only getting around 9-10 tok/s with the 8-bit/mxfp8 version. Is that to be expected? Not really that usable :(

8d · 45
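The memory worry above can be checked with back-of-the-envelope arithmetic. The sketch below assumes a dense model with roughly 12 billion parameters (inferred from the model name), counts weights only, and ignores the KV cache, activations, the draft model, and quantization scale overhead. The function names and the 120 GB/s bandwidth figure are illustrative assumptions, not measured specs for any device.

```python
# Rough weight-memory estimate. Assumption: a dense model, weights only.
# Ignores KV cache, activations, the draft model, and quantization scales.
BYTES_PER_PARAM = {"bf16": 2.0, "8bit": 1.0, "4bit": 0.5}


def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def decode_tok_s_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound ceiling: each decoded token reads all weights once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(precision, weight_gb(12, precision), "GB")
    # Hypothetical bandwidth value, chosen only to illustrate the formula.
    print(decode_tok_s_ceiling(120.0, weight_gb(12, "8bit")), "tok/s ceiling")
```

Under these assumptions, bf16 weights alone come to about 24 GB, which would leave no headroom in 24GB of unified memory, while 8-bit weights come to about 12 GB. Plain decoding speed is capped by memory bandwidth divided by weight size, so single-digit or low double-digit tok/s on a bandwidth-limited device is plausible.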

✦ @indes_yo

@Prince_Canuma Waiting for the oMLX update!

8d · 401

Prince Canuma @Prince_Canuma

@1littlecoder Thanks bro!

8d · 112

Prince Canuma @Prince_Canuma

@noctus91 It should work at the same rate as image input. I will test later.

8d · 73

father stretch my bandz @MILKANDH3NNY

@Prince_Canuma oMLX i'm alive.

8d · 48

Matt Hamilton @HammerToe

@Prince_Canuma Also I never had any luck fine tuning the other models on text+image inputs. I’m wondering if this unified decoder will make the fine tuning better.

8d · 40

Emi @SrEdm00

@Prince_Canuma How does mlx-vlm work with batch inference?

8d · 33

阿納斯塔西婭 @3___infinix

@Prince_Canuma MTP has this effect on multimodal too? Apple Silicon really handles MLX with ease.

8d · 20

Sam D'Amico @sdamico

@noctus91 @Prince_Canuma will need to revisit this w/ https://vlmaxxi.ng/

8d · 51

Rui Carmo ☯️ @rcarmo

@Prince_Canuma Here’s a thought - just create a JSON manifest for ease of maintenance (that’s what I do for the piclaw add-ons - https://github.com/rcarmo/piclaw-addons/blob/main/.github/workflows/sync-catalog.yml)

8d · 10
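One way the JSON manifest idea could look is sketched below. Everything here is hypothetical: the manifest layout, the `repos_for` and `download_all` helpers, and the `example-org/...` repo ids are placeholders made up for illustration and do not reflect anything mlx-vlm ships. The only real library call is `snapshot_download` from `huggingface_hub`, imported lazily so the lookup logic runs without it installed.

```python
import json

# Hypothetical manifest pairing each target model with its draft model.
# The repo ids are placeholders, not real model names.
MANIFEST_JSON = """
{
  "example-bf16": {"target": "example-org/target-bf16",
                   "draft": "example-org/draft-bf16"},
  "example-8bit": {"target": "example-org/target-8bit",
                   "draft": "example-org/draft-8bit"}
}
"""

manifest = json.loads(MANIFEST_JSON)


def repos_for(name: str, manifest: dict) -> list:
    """Return the [target, draft] repo ids for one manifest entry."""
    entry = manifest[name]
    return [entry["target"], entry["draft"]]


def download_all(name: str, manifest: dict) -> list:
    """Download both repos and return their local paths (needs network)."""
    from huggingface_hub import snapshot_download  # lazy import

    return [snapshot_download(repo_id=repo) for repo in repos_for(name, manifest)]
```

Keeping the pairing in data rather than code means a new quant or draft model only needs a manifest entry, which is the maintenance benefit the suggestion points at.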

Prince Canuma @Prince_Canuma

@yallgetscared Yes, MTPs work with quants out of the box.

It helps with bandwidth constraints.

8d · 1
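To illustrate why drafting helps when decoding is bandwidth-bound, here is a minimal sketch of the standard speculative-decoding estimate. It is a generic model, not mlx-vlm's implementation: it assumes each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per step, and that one draft step costs `draft_cost` times one target-model pass. All three parameter names and the helper functions are assumptions made for this example.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes k drafted tokens, each accepted independently with
    probability alpha; the target pass always yields at least one token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    One step costs k draft passes plus one target pass, measured in
    units of a single target pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + 1.0)
```

The verification pass reads the target weights once but can emit several tokens, so the gain applies whether those weights are bf16 or quantized; quantization shrinks the bytes read per pass, and drafting reduces the number of passes per token.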