Tech · 38d ago

antirez releases 2-bit quantized DeepSeek V4 Pro GGUF model on Hugging Face as a 433 GB file that runs at 13 tokens per second on Mac Studio M3 Ultra

AI Judge changed title after evaluation, original title: "Antirez releases quantized DeepSeek V4 Flash model on Hugging Face"

An 80.8 GiB DeepSeek V4 Flash GGUF variant was also released, sparking discussion around single-GPU inference on RTX Pro 6000 hardware.


#33

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex · #501 in Tech

Already reasonably established that it preserves a lot of general capability, interesting to test this on *knowledge* against gpt-oss-120B, as they're actually close in on-disk size.

witcheer@witcheer

Antirez (the person who built redis) is now publishing quantized versions of deepseek V4 on huggingface. the technique he’s using is worth understanding even if the model is too big for your GPU.

quick background: quantization is how you shrink a model to fit on smaller

7:07 AM · May 16, 2026 · 7.2K Views

Sentiment

Users are excited and grateful that quantized DeepSeek V4 models perform well on high-end setups such as the M3 Ultra Mac Studio with 512GB RAM or a single RTX GPU, highlighting their practical value.

Positive: 92.3%

Negative: 7.7%

15 comments with sentiment.


Related links

antirez/deepseek-v4-gguf · Hugging Face


Posts from X

Most Activity

Views 147.1K · Bookmarks 359 · Likes 998 · Retweets 76 · Replies 50

antirez@antirez

I didn't expect DeepSeek v4 PRO (not Flash) to run well on the Mac Studio M3 Ultra with 512GB of RAM. This is 2 bit quantized with the same DwarfStar recipe used for Flash. 433GB GGUF file. 130 t/s prefill, 13 t/s generation. Prefill in the video is low because small prompt.

37d · 147.1K views · 998 likes · 359 bookmarks
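As a rough sanity check on the "2 bit" label, the file size can be divided by the parameter count. The sketch below assumes the 433GB figure is decimal gigabytes and takes the 1.6T parameter count from elsewhere in this thread; the function name is just illustrative.

```python
def bits_per_weight(file_bytes: float, n_params: float) -> float:
    """Average number of bits stored per parameter, metadata included."""
    return file_bytes * 8 / n_params

# Assumption: "433GB" means 433 * 10^9 bytes (the post does not say GB vs GiB).
FILE_BYTES = 433e9
# Parameter count as quoted in the thread ("1.6T parameters").
N_PARAMS = 1.6e12

bpw = bits_per_weight(FILE_BYTES, N_PARAMS)
print(f"about {bpw:.1f} bits per weight on average")
```

Under these assumptions the average lands a little above 2 bits per weight. The post does not explain the excess; it could be due to some tensors or metadata being stored at higher precision.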

Espen JD@Snixtp

DeepSeek V4 Flash on a single RTX Pro 6000? 👀

https://huggingface.co/antirez/deepseek-v4-gguf

38d26.3K12188

Dan@Relativ3pa1n

@Snixtp Oh, so not llama.cpp? https://github.com/antirez/ds4

38d19123

antirez@antirez

The questions now are: will the 2-bit quantized PRO be as resilient to quantization as the Flash? Or is it just big and no better than Flash when quantized this way? I need to make sure the inference graph is totally correct, to start, and also generate imatrix 2-bit quants for fairness.

37d1.3K9

antirez@antirez

Imagine having a 1.6T-parameter model running at home.

37d2302

antirez@antirez

@filipstrand No way to fit it into 256GB, and I also believe that, 2-bit quantized, it may not perform as well as the Flash. The Flash is the real deal IMHO.

37d2186
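The memory sizes quoted in this thread can be compared against the two GGUF files with a simple fit check. This is only a sketch: it treats the quoted RAM/VRAM figures as decimal gigabytes (an assumption), and the `reserve_bytes` parameter is a hypothetical knob for KV cache and OS overhead, which the posts do not quantify.

```python
GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte

def fits(model_bytes: float, memory_bytes: float, reserve_bytes: float = 0) -> bool:
    """True if the model file plus a reserved margin fits in the given memory."""
    return model_bytes + reserve_bytes <= memory_bytes

pro_file = 433 * GB       # V4 Pro GGUF, as quoted by antirez
flash_file = 80.8 * GIB   # V4 Flash GGUF, as quoted in the story summary

print(fits(pro_file, 256 * GB))   # False: the 256GB Mac Studio is too small
print(fits(pro_file, 512 * GB))   # True: the 512GB Mac Studio has room
print(fits(flash_file, 64 * GB))  # False: 64GB of combined VRAM is not enough
```

With a realistic nonzero reserve the margins shrink, so a `True` here is a necessary condition rather than a guarantee.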

Michel Laclé@micheltamanda

@antirez @iotcoi Good morning from Miami! Solid progress @antirez. I am following you to learn from you. Thank you for sharing your knowledge.

37d29211

Ajeya@_Ajeya

@antirez this is mental, but isn't 13 t/s too slow for coding and agentic tasks?

37d5623

antirez@antirez

@_Ajeya 13 t/s generation could kinda work, but it is slow. 130 t/s prefill is the real blocker for agentic tasks IMHO. Not impossible, but annoying. Much better to use DS4 Flash currently.

37d5404
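To see why prefill speed matters for agentic use, the two quoted rates can be turned into a per-turn latency estimate. The prompt and output sizes below are hypothetical, and the model ignores prompt caching between turns, so treat it as a back-of-the-envelope sketch rather than a benchmark.

```python
def turn_seconds(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float = 130.0, gen_tps: float = 13.0) -> float:
    """Estimated wall-clock seconds for one turn: prefill time plus generation time."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# Hypothetical agentic turn: a 20,000-token context and a 500-token reply.
prefill = 20_000 / 130.0
total = turn_seconds(20_000, 500)
print(f"prefill: {prefill:.0f} s, total: {total:.0f} s")
```

For these assumed sizes, processing the prompt takes several times longer than writing the reply, which matches the point that prefill is the bottleneck.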

antirez@antirez

@SeregonWar Yes, at this point I'll definitely buy at least one when it comes out. But I think it's scheduled for October.

37d4863

antirez@antirez

@SeregonWar @WeakConqueror But with two RTX 5090s you're at a total of 64GB of VRAM, right?

37d105

Filip Strand@filipstrand

@antirez This is amazing! Do you think it might run decently on the 256GB version of the M3 Ultra too?

Anyways, I’ve been running the Flash version (2 bit) a lot this weekend on my M3 Ultra and it is seriously impressive, huge thanks for your amazing work on this project!!

37d2693

Seregon@SeregonWar

@WeakConqueror @antirez At that point I think it would be better to get 2 RTX 5090s; it wouldn't make sense. Keep in mind that it's a prototype, or intended for us consumers

37d80

veridicus (e/acc, limitless)@eaglebuildz

@antirez And it's normally listed as more powerful than Sonnet. It's amazing to have such a powerful model locally, but I wonder how much quality is lost to quantization?

37d1763