DeepSeek V4 sustains performance at two bits per weight
DeepSeek V4 maintains functional performance at roughly two bits per weight on average under variable-bit-rate post-training quantization. Benchmarks on GPQA Diamond, SuperGPQA, and AIME 2025 showed only modest degradation compared with higher-precision baselines. The result illustrates a decoupling of model knowledge density from weight precision: earlier models such as Llama 3 suffered sharp capability drops below four bits, whereas current variable-rate methods avoid comparable losses at the two-bit level.
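For a sense of what "two average bits per weight" buys, here is a back-of-the-envelope footprint calculation; the parameter count below is a placeholder for illustration, not a published DeepSeek V4 figure.

```python
# Back-of-the-envelope weight footprint at different precisions.
# NOTE: the parameter count is a placeholder, NOT a published DeepSeek V4 figure.
params = 1.0e12  # hypothetical total parameter count

for name, bits_per_weight in [("bf16", 16), ("int4", 4), ("~2-bit VBR", 2)]:
    gib = params * bits_per_weight / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name:>10}: {gib:>8,.0f} GiB of weights")
```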
So one thing that has changed in the last couple years is that model knowledge density has decoupled from weight precision. Remember when we were noticing how Llama 3 is "less quantizable" to, like, Q4? DSV4 functions okay-ishly at *two bits*.

@teortaxesTex well, 2 bits is sort of colloquial here, it's more like 2 "avg" bits per weight. all serious modern low bit ptq is VBR; if you try to do GPTQ style uniform rounding for the 2 bit range, shit obviously explodes to this very day
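A toy numpy sketch of that distinction, with made-up bit widths and a per-column round-to-nearest quantizer standing in for real GPTQ/VBR machinery: a fixed 2-bit grid everywhere versus spending 4 bits on a small slice of high-energy columns so the average still lands near 2 bits.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
W[:, :8] *= 20.0  # a few high-magnitude "outlier" columns, common in LLM weights

def quantize(w, bits):
    """Toy symmetric round-to-nearest with a per-column scale (not real GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

# fixed 2-bit grid for every weight
W_fixed = quantize(W, 2)

# toy "VBR": 4 bits for the 16 highest-energy columns, 2 bits elsewhere,
# which works out to about 2.1 bits per weight on average
energy = np.abs(W).mean(axis=0)
hot = np.argsort(energy)[-16:]
W_vbr = quantize(W, 2)
W_vbr[:, hot] = quantize(W[:, hot], 4)
avg_bits = (4 * 16 + 2 * (256 - 16)) / 256

mse = lambda a, b: float(((a - b) ** 2).mean())
print(f"fixed 2-bit reconstruction MSE : {mse(W, W_fixed):.4f}")
print(f"~{avg_bits:.2f}-bit VBR reconstruction MSE: {mse(W, W_vbr):.4f}")
```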
@teortaxesTex it's possible that there is a regime of activations that conventional PTQ calibration makes systematically worse. let's say you have a diverse english calibration set that preserves bits in the ffns most relevant to english. gratz, you maybe(?) tanked chinese perf
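A minimal sketch of that failure mode, using synthetic activation statistics and an AWQ-style magnitude proxy for channel importance; everything here is illustrative, not any lab's actual calibration pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in = 512

# Synthetic stand-ins for calibration activations; in a real PTQ pipeline these
# would come from forward passes over actual calibration text, not randn.
# The two sets are constructed to emphasize opposite input channels.
acts_english_only = rng.normal(size=(4096, d_in)) * np.linspace(0.2, 2.0, d_in)
acts_mixed_lang   = rng.normal(size=(4096, d_in)) * np.linspace(2.0, 0.2, d_in)

def protected_channels(acts, k=64):
    """AWQ-style proxy: channels with larger mean |activation| matter more to the
    layer output, so a calibration-aware quantizer keeps their weights at higher
    precision. Returns the k channels such a scheme would protect."""
    importance = np.abs(acts).mean(axis=0)
    return set(np.argsort(importance)[-k:])

en_only = protected_channels(acts_english_only)
mixed = protected_channels(acts_mixed_lang)
print(f"protected-channel overlap: {len(en_only & mixed)} / 64")
```

With calibration sets that stress different channels, the two runs protect almost disjoint sets of weights, which is the worry about an english-only calibration set in miniature.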
@teortaxesTex on the bloomer side: there is a manifold hypothesis adjacent version of this thought experiment where PTQ calibrated for on-manifold pretraining webtext loses you ~nothing in KLD accuracy outside of high entropy meaning-devoid noise sequences
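One way to test that reading is to compare per-token KL divergence between the full-precision and quantized model on webtext versus random token sequences. A sketch, assuming transformers-style models whose outputs expose `.logits`; the model and tensor names are placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_kld(model_fp, model_q, input_ids):
    """Mean per-token KL(full-precision || quantized) over next-token distributions."""
    logp_fp = F.log_softmax(model_fp(input_ids).logits, dim=-1)
    logp_q = F.log_softmax(model_q(input_ids).logits, dim=-1)
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v)), averaged over positions
    return (logp_fp.exp() * (logp_fp - logp_q)).sum(-1).mean().item()

# Hypothetical usage (all names below are placeholders):
#   webtext_ids = tokenized pretraining-style text
#   noise_ids   = uniformly random token ids of the same shape
#   kld_web   = mean_token_kld(model_fp, model_q, webtext_ids)
#   kld_noise = mean_token_kld(model_fp, model_q, noise_ids)
# The "bloomer" reading predicts kld_web stays near zero while kld_noise blows up.
```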
@teortaxesTex i think the bloomer side is closer to the truth and that aggressive weight-side PTQ works better than it has any right to in practice tho