DeepSeek V4 sustains performance at two bits per weight

DeepSeek V4 maintains functional performance at roughly two average bits per weight under variable-bit-rate post-training quantization. Benchmarks on GPQA Diamond, SuperGPQA, and AIME 2025 showed only modest degradation compared with higher-precision baselines. The result illustrates that model knowledge density has decoupled from weight precision: earlier models such as Llama 3 suffered sharp capability drops below four bits, whereas current variable-rate methods avoid comparable losses at the two-bit level.
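To make "roughly two average bits per weight" concrete: variable-bit-rate schemes spend the bit budget unevenly across tensors, so the headline figure is a weighted average. A minimal sketch with an entirely hypothetical allocation (the tensor sizes and bit widths below are illustrative, not DeepSeek V4's actual scheme):

```python
# Sketch: what "2-ish average bits per weight" means under variable-bit-rate
# quantization. The allocation is hypothetical, not DeepSeek V4's real one.
allocation = [
    # (num_weights, bits) -- sensitive tensors keep more precision
    (1_000_000, 4),   # e.g. attention projections kept at 4 bits
    (2_000_000, 2),   # bulk of FFN weights at 2 bits
    (1_000_000, 1),   # least-sensitive weights pushed down to 1 bit
]

total_bits = sum(n * b for n, b in allocation)
total_weights = sum(n for n, _ in allocation)
avg_bits = total_bits / total_weights
print(avg_bits)  # 2.25 average bits per weight
```

No individual tensor needs to sit at exactly two bits for the average to land there.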

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

So one thing that has changed in the last couple years is that model knowledge density has decoupled from weight precision. Remember when we were noticing how Llama 3 is "less quantizable" to, like, Q4? DSV4 functions okay-ishly at *two bits*.

12:03 PM · May 15, 2026

kalomaze @kalomaze

@teortaxesTex well, "2 bits" is sort of colloquial here; it's more like 2 "avg" bits per weight. All serious modern low-bit PTQ is VBR; if you try to do GPTQ-style uniform rounding in the 2-bit range, shit obviously explodes to this very day.

8:42 PM · May 15, 2026 · 482 Views
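Why uniform rounding "explodes" at 2 bits is easy to see: a uniform grid has only four levels, and a single outlier stretches the grid until almost every ordinary weight lands far from a level. A toy sketch of plain round-to-nearest onto a uniform grid (GPTQ proper adds second-order error compensation, which this deliberately omits):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=10_000)
w[0] = 20.0  # a single outlier, typical of transformer weight tensors

def uniform_quantize(x, bits):
    # Uniform asymmetric grid: 2**bits evenly spaced levels spanning
    # [min, max], with round-to-nearest onto that grid.
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (levels - 1)
    return np.round((x - lo) / scale) * scale + lo

err2 = np.mean((w - uniform_quantize(w, 2)) ** 2)  # 4 levels
err4 = np.mean((w - uniform_quantize(w, 4)) ** 2)  # 16 levels
print(err2, err4)  # 2-bit reconstruction error dwarfs the 4-bit one
```

With four levels stretched over the outlier's range, the bulk of the Gaussian mass sits roughly half a grid step from the nearest level, which is why practical low-bit schemes go variable-rate, grouped, or non-uniform instead.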

kalomaze @kalomaze

@teortaxesTex it's possible that there is a regime of activations that conventional PTQ calibration makes systematically worse. Let's say you have a diverse English calibration set that preserves bits in the FFNs most relevant to English: gratz, you maybe(?) tanked Chinese perf.

8:46 PM · May 15, 2026 · 311 Views
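The calibration-set worry can be sketched with the usual GPTQ-style importance proxy, the diagonal of the calibration Hessian X^T X: which channels get their precision protected depends entirely on which activations you calibrated on. The "English" and "Chinese" activations below are synthetic stand-ins, not real model traces:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Hypothetical stand-ins: activation energy concentrated on different
# channels, imitating English-heavy vs Chinese-heavy calibration text.
english_acts = rng.normal(size=(1000, d)) * np.array([3, 3, 3, 3, 1, 1, 1, 1])
chinese_acts = rng.normal(size=(1000, d)) * np.array([1, 1, 1, 1, 3, 3, 3, 3])

def importance(X):
    # GPTQ-style proxy: diagonal of the calibration Hessian X^T X.
    # Channels with large entries get their precision protected.
    return np.einsum("ni,ni->i", X, X)

rank_en = np.argsort(-importance(english_acts))
rank_zh = np.argsort(-importance(chinese_acts))
print(rank_en[:4], rank_zh[:4])  # the protected channel sets disagree
```

Under a tight bit budget, the channels one calibration set deems unimportant are exactly where the other distribution's quality silently degrades.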

kalomaze @kalomaze

@teortaxesTex on the bloomer side: there is a manifold-hypothesis-adjacent version of this thought experiment where PTQ calibrated for on-manifold pretraining webtext loses you ~nothing in KLD accuracy outside of high-entropy, meaning-devoid noise sequences.

8:48 PM · May 15, 2026 · 255 Views
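A toy illustration of why KLD loss could concentrate on high-entropy sequences: apply the same logit perturbation (standing in for quantization error) to a confident next-token prediction and to a near-flat one. This is a caricature with synthetic logits, not a measurement on a quantized model:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1000  # toy vocabulary size

def kl(p_logits, q_logits):
    # KL(P || Q) in nats between the softmax distributions of two logit vectors.
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))

# On-manifold text: one token utterly dominates. Noise sequence: near-flat.
confident = np.zeros(V); confident[0] = 30.0
flat = np.zeros(V)
noise = rng.normal(0.0, 0.3, size=V)  # same perturbation applied to both

print(kl(confident, confident + noise))  # tiny: the top token absorbs it
print(kl(flat, flat + noise))            # larger: every logit matters
```

When the reference distribution is peaked, perturbing the tail logits barely moves the KLD; when it is flat, the same perturbation reshuffles probability mass everywhere, so the divergence lands on exactly the meaning-devoid sequences.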

kalomaze @kalomaze

@teortaxesTex i think the bloomer side is closer to the truth, and that aggressive weight-side PTQ works better than it has any right to in practice tho

9:01 PM · May 15, 2026 · 105 Views