
antirez releases 2-bit quantized DeepSeek V4 Pro GGUF model on Hugging Face as a 433 GB file that runs at 13 tokens per second on Mac Studio M3 Ultra

AI Judge changed title after evaluation, original title: "Antirez releases quantized DeepSeek V4 Flash model on Hugging Face"

An 80.8 GiB DeepSeek V4 Flash GGUF variant was also released, sparking discussion around single-GPU inference on RTX Pro 6000 hardware.

Original post

It's already reasonably established that it preserves a lot of general capability. It would be interesting to test this on *knowledge* against gpt-oss-120B, as they're actually close in on-disk size.

7:07 AM · May 16, 2026 · 7.2K Views
Sentiment

Users are excited and grateful that quantized DeepSeek V4 models perform well on high-end setups such as the M3 Ultra Mac Studio with 512GB RAM or a single RTX GPU, highlighting their practical value.

Positive 92.3% · Negative 7.7% · 15 comments with sentiment.
Posts from X
Most Activity
Views 147.1K · Bookmarks 359 · Likes 998 · Retweets 76 · Replies 50
antirez@antirez

I didn't expect DeepSeek v4 PRO (not Flash) to run well on the Mac Studio M3 Ultra with 512GB of RAM. This is 2 bit quantized with the same DwarfStar recipe used for Flash. 433GB GGUF file. 130 t/s prefill, 13 t/s generation. Prefill in the video is low because small prompt.

20d · Views 147.1K · Likes 998 · Bookmarks 359
Espen JD@Snixtp

DeepSeek V4 Flash on a single RTX Pro 6000? 👀

https://huggingface.co/antirez/deepseek-v4-gguf

21d · Views 26.3K · Likes 121 · Bookmarks 88
Dan@Relativ3pa1n

@Snixtp Oh, so not llama.cpp? https://github.com/antirez/ds4

21d · Views 191 · Likes 2 · Bookmarks 3
antirez@antirez

The questions now are: will the 2-bit quantized PRO be as resilient to quantization as the Flash? Or is it just big, and not better than Flash when quantized in this way? To start, I need to make sure the inference graph is totally correct, and also generate imatrix 2-bit quants for fairness.

20d · Views 1.3K · Likes 9
antirez@antirez

Imagine having a 1.6T parameters model running at your home.

20d · Views 230 · Likes 2
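
As a rough check on what "2 bit quantized" means for a file of this size, the two numbers quoted in the thread can be turned into an average bits-per-weight figure. This is only a sketch: it assumes the 1.6T figure is the total parameter count of the PRO model and that "433GB" means 433 × 10^9 bytes, neither of which the thread states explicitly.

```python
# Back-of-the-envelope: average bits stored per parameter in the PRO GGUF.
# Assumptions (not confirmed by the thread): 1.6T is the total parameter
# count of DeepSeek V4 PRO, and "433GB" means 433 * 10**9 bytes.

def bits_per_weight(file_bytes: float, n_params: float) -> float:
    """Average bits per parameter, counting everything stored in the file."""
    return file_bytes * 8 / n_params

PRO_FILE_BYTES = 433e9  # from the post: "433GB GGUF file"
PRO_PARAMS = 1.6e12     # from the post: "1.6T parameters"

bpw = bits_per_weight(PRO_FILE_BYTES, PRO_PARAMS)
print(f"{bpw:.3f} bits per weight")
```

Under these assumptions the average comes out slightly above 2 bits per weight, which is what one would expect if some tensors and metadata are kept at higher precision. If the file size were instead 433 GiB, the same calculation would give about 2.32.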
antirez@antirez

@filipstrand No way to fit it into 256GB, but I also believe that, 2-bit quantized, it may not perform as well as the Flash. The Flash is the real deal IMHO.

20d · Views 218 · Likes 6
Michel Laclé@micheltamanda

@antirez @iotcoi Good morning from Miami! Solid progress @antirez. I am following you to learn from you. Thank you for sharing your knowledge.

20d · Views 292 · Likes 1 · Bookmarks 1
Ajeya@_Ajeya

@antirez this is mental, but isn't 13 t/s too slow for coding and agentic tasks?

20d · Views 562 · Likes 3
antirez@antirez

@_Ajeya 13 t/s generation could kinda work, but it is slow. 130 t/s prefill is the real blocker for agentic tasks IMHO. Not impossible, but annoying. Much better to use DS4 Flash currently.

20d · Views 540 · Likes 4
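
To see why prefill speed matters more than generation speed for agentic use, the quoted rates can be turned into wall-clock time per turn. The sketch below assumes constant throughput at the quoted 130 t/s and 13 t/s and no prompt caching between turns; the prompt and output sizes are hypothetical, and real generation speed drops somewhat as context grows.

```python
# Rough per-turn latency from the throughput figures quoted in the thread.
# Assumptions: constant rates, no prompt caching, hypothetical token counts.

def turn_latency_s(prompt_tokens: int, output_tokens: int,
                   prefill_tps: float = 130.0, gen_tps: float = 13.0):
    """Return (prefill_seconds, generation_seconds) for one request."""
    prefill_s = prompt_tokens / prefill_tps
    gen_s = output_tokens / gen_tps
    return prefill_s, gen_s

# Hypothetical prompt sizes, from a short chat up to a large agentic context.
for prompt, output in [(2_000, 300), (20_000, 300), (100_000, 300)]:
    prefill_s, gen_s = turn_latency_s(prompt, output)
    print(f"prompt={prompt:>7} tokens: prefill {prefill_s:7.1f} s, "
          f"generation {gen_s:5.1f} s")
```

With a fixed 300-token answer, generation time stays the same while prefill time scales linearly with the prompt, so a 20,000-token context already costs roughly two and a half minutes before the first output token under these assumptions.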
antirez@antirez

@SeregonWar Yes, at this point I'll definitely buy at least one when it comes out. But I think it's scheduled for October.

20d · Views 486 · Likes 3
antirez@antirez

@SeregonWar @WeakConqueror But with two RTX 5090s you're at a total of 64GB of VRAM, right?

20d · Views 105
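
The memory figures scattered through the thread can be compared directly with a simple fit check. This is a minimal sketch, not a statement about what actually runs: it compares raw file size against total memory only, ignoring KV cache, activations, and OS overhead. The 96GB figure for the RTX Pro 6000 is an assumption not stated in the thread, and all device sizes are treated as decimal gigabytes.

```python
# Does the model file alone fit in the memory of each setup mentioned?
# Assumptions: device sizes are decimal GB; RTX Pro 6000 = 96GB (not from
# the thread); runtime overhead (KV cache, activations, OS) is ignored.

GB = 10**9
GIB = 2**30

models = {
    "DeepSeek V4 PRO 2-bit GGUF": 433 * GB,    # "433GB GGUF file"
    "DeepSeek V4 Flash GGUF": 80.8 * GIB,      # "80.8 GiB" variant
}
devices = {
    "Mac Studio M3 Ultra 512GB": 512 * GB,
    "Mac Studio M3 Ultra 256GB": 256 * GB,
    "2x RTX 5090": 64 * GB,                    # total quoted in the thread
    "RTX Pro 6000 (assumed 96GB)": 96 * GB,
}

def fits(model_bytes: float, device_bytes: float) -> bool:
    """True if the weights alone are no larger than the device memory."""
    return model_bytes <= device_bytes

for m_name, m_bytes in models.items():
    for d_name, d_bytes in devices.items():
        verdict = "fits" if fits(m_bytes, d_bytes) else "does not fit"
        print(f"{m_name} on {d_name}: {verdict}")
```

A "fits" verdict here is a necessary condition only; a real deployment needs additional headroom for context, and on a Mac not all unified memory is available to the GPU.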
Filip Strand@filipstrand

@antirez This is amazing! Do you think it might run decently on the 256GB version of the M3 Ultra too?

Anyway, I’ve been running the Flash version (2 bit) a lot this weekend on my M3 Ultra and it is seriously impressive. Huge thanks for your amazing work on this project!!

20d · Views 269 · Likes 3
Seregon@SeregonWar

@WeakConqueror @antirez At that point I think you're better off getting 2 RTX 5090s; it wouldn't make sense. Keep in mind that it's a prototype or aimed at us consumers.

20d · Views 80

@antirez And it's normally listed as more powerful than Sonnet. It's amazing to have such a powerful model locally, but I wonder what the loss in quality from quantization is.

20d · Views 176 · Likes 3
Thanh Nguyen@ng_thanh8

@antirez What about performance in terms of coding?

20d · Views 32 · Bookmarks 1
lifcc@mylifcc

@antirez 433GB GGUF is the wild part. Does the 13 t/s generation hold once the prompt grows, or is the M3 Ultra mostly winning on prefill here?

20d · Views 671 · Likes 1
antirez@antirez

@mylifcc The slope of the generation speed is similar to Flash: it decreases as context grows, but gently. So it remains usable at long contexts.

20d · Views 626 · Likes 4
antirez@antirez

@djnmrjnvc Upload in progress, but I also have to update the implementation with PRO support. It will take some time for quality checks.

20d · Views 532 · Likes 4
AlexK@AlexKi1993

@antirez 130 t/s prefill is pretty much unusable.

20d · Views 127 · Likes 1