It is already reasonably well established that it preserves a lot of general capability. It would be interesting to test this on *knowledge* against gpt-oss-120B, as they're actually close in on-disk size.
antirez releases 2-bit quantized DeepSeek V4 Pro GGUF model on Hugging Face as a 433 GB file that runs at 13 tokens per second on Mac Studio M3 Ultra
AI Judge changed title after evaluation, original title: "Antirez releases quantized DeepSeek V4 Flash model on Hugging Face"
An 80.8 GiB DeepSeek V4 Flash GGUF variant was also released, sparking discussion around single-GPU inference on RTX Pro 6000 hardware.
Users are excited and grateful that quantized DeepSeek V4 models perform well on high-end setups such as the M3 Ultra Mac Studio with 512GB RAM or a single RTX GPU, highlighting their practical value.
I didn't expect DeepSeek v4 PRO (not Flash) to run well on the Mac Studio M3 Ultra with 512GB of RAM. This is 2 bit quantized with the same DwarfStar recipe used for Flash. 433GB GGUF file. 130 t/s prefill, 13 t/s generation. Prefill in the video is low because small prompt.
DeepSeek V4 Flash on a single RTX Pro 6000? 👀
https://huggingface.co/antirez/deepseek-v4-gguf

@Snixtp Oh, so not llama.cpp? https://github.com/antirez/ds4

The questions now are: will the 2-bit quantized PRO be as resilient to quantization as the Flash? Or is it just big and, quantized in this way, no better than Flash? To start, I need to make sure the inference graph is totally correct, and also generate imatrix 2-bit quants for fairness.

Imagine having a 1.6T-parameter model running at your home.
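
As a rough sanity check on those numbers, the 433 GB file size and the 1.6T parameter count quoted in this thread give an average storage cost per weight. This is only a sketch: the thread does not say whether "GB" means 10^9 or 2^30 bytes, so both readings are computed, and the helper name `bits_per_weight` is made up for illustration.

```python
def bits_per_weight(file_bytes: float, n_params: float) -> float:
    """Average bits stored per parameter, counting everything in the file."""
    return file_bytes * 8 / n_params


N_PARAMS = 1.6e12  # "1.6T parameters", as quoted in the thread
FILE_GB = 433      # reported GGUF file size

# Assumption: the unit is ambiguous, so try both decimal and binary gigabytes.
decimal = bits_per_weight(FILE_GB * 10**9, N_PARAMS)
binary = bits_per_weight(FILE_GB * 2**30, N_PARAMS)

print(f"{decimal:.2f} bits/weight if GB = 10^9 bytes")
print(f"{binary:.2f} bits/weight if GB = 2^30 bytes")
```

Either way the average lands a little above 2 bits per weight, which is in line with the "2 bit quantized" description.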

@filipstrand No way to run it in 256GB, but I also believe that, 2-bit quantized, it may not perform as well as the Flash. The Flash is the real deal IMHO.

@antirez @iotcoi Good morning from Miami! Solid progress @antirez. I am following you to learn from you. Thank you for sharing your knowledge.

@antirez this is mental, but isn't 13 t/s too slow for coding and agentic tasks?

@_Ajeya 13 t/s generation could kinda work, but it is slow. 130 t/s prefill is the real blocker for agentic tasks IMHO. Not impossible, but annoying. Much better to use DS4 Flash currently.
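
To make the prefill concern concrete, here is a back-of-the-envelope estimate of how long one agentic turn takes at the reported 130 t/s prefill and 13 t/s generation. It is a sketch under simplifying assumptions: both rates are treated as constant (the thread notes that generation slows gently as context grows and that prefill is lower on small prompts), there is no prompt caching, and the function name and example prompt sizes are hypothetical.

```python
def turn_seconds(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float = 130.0, gen_tps: float = 13.0) -> float:
    """Seconds to process a prompt and then generate a reply at fixed rates."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps


# Hypothetical prompt sizes, each followed by a 500-token reply.
for prompt in (1_000, 10_000, 50_000):
    wait = prompt / 130.0             # time before the first output token
    total = turn_seconds(prompt, 500)
    print(f"{prompt:>6} prompt tokens: {wait:6.1f} s to first token, "
          f"{total:6.1f} s for the whole turn")
```

With these rates the wait before the first token grows linearly with prompt length, so long agentic contexts are dominated by prefill rather than by generation.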

@SeregonWar Yes, at this point I'll definitely buy at least one when it comes out. But I think it's scheduled for October.

@SeregonWar @WeakConqueror But with two RTX 5090s you're at a total of 64GB of VRAM, right?
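
The VRAM question can be checked with simple arithmetic against the 80.8 GiB Flash file. This is a minimal sketch with loud assumptions: it counts weights only (no KV cache, activations, or runtime overhead), treats the "64GB" from the thread as GiB, and assumes 96 GiB for the RTX Pro 6000, a figure that does not appear in the thread. The `fits` helper is hypothetical.

```python
GIB = 2**30


def fits(model_bytes: float, vram_bytes: float, headroom_bytes: float = 0) -> bool:
    """True if the model file plus optional headroom fits in the given VRAM."""
    return model_bytes + headroom_bytes <= vram_bytes


FLASH_BYTES = 80.8 * GIB  # Flash GGUF size reported in the thread

setups = {
    "2x RTX 5090 (64 GB total, per the thread)": 64 * GIB,
    "1x RTX Pro 6000 (assumed 96 GB, not stated in the thread)": 96 * GIB,
}

for name, vram in setups.items():
    verdict = "fits" if fits(FLASH_BYTES, vram) else "does not fit"
    print(f"{name}: weights alone {verdict}")
```

Under these assumptions the Flash weights exceed the combined memory of two RTX 5090s but leave some headroom on a single 96 GB card.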

@antirez This is amazing! Do you think it might run decent on the 256GB version of the M3 Ultra too?
Anyways, I’ve been running the Flash version (2 bit) a lot this weekend on my M3 Ultra and it is seriously impressive, huge thanks for your amazing work on this project!!

@WeakConqueror @antirez At that point I think it would be better to get 2 RTX 5090s, it wouldn't make sense. Keep in mind that it's a prototype or aimed at us consumers.

@antirez And it's normally listed as more powerful than Sonnet. It's amazing to have such a powerful model locally, but I wonder what the loss in quality from quantization is?

@antirez What about performance in terms of coding?

@Snixtp @_akhaliq Sure! You can.

@antirez 433GB GGUF is the wild part. Does the 13 t/s generation hold once the prompt grows, or is the M3 Ultra mostly winning on prefill here?

@mylifcc The slope of the generation speed is similar to Flash, it decreases as context grows but gently. So it remains usable at long contexts.

@djnmrjnvc Upload in progress, but I also have to update the implementation with PRO support. It will take some time for quality checks.

@antirez 130 t/s prefill is pretty much unusable.