DeepSeek V4 sustains performance at two bits per weight
DeepSeek V4 maintains functional performance at roughly two bits per weight on average under variable-bit-rate post-training quantization. Benchmarks on GPQA Diamond, SuperGPQA, and AIME 2025 showed only modest degradation compared with higher-precision baselines. The result illustrates a decoupling of model knowledge density from weight precision: earlier models such as Llama 3 suffered sharp capability drops below four bits, whereas current variable-rate methods avoid comparable losses at the two-bit level.
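For a sense of what "two average bits per weight" buys, here is a back-of-the-envelope footprint calculation; the parameter count below is a placeholder for illustration, not a published DeepSeek V4 figure.

```python
# Back-of-the-envelope weight footprint at different precisions.
# NOTE: the parameter count is a placeholder, NOT a published DeepSeek V4 figure.
params = 1.0e12  # hypothetical total parameter count

for name, bits_per_weight in [("bf16", 16), ("int4", 4), ("~2-bit VBR", 2)]:
    gib = params * bits_per_weight / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name:>10}: {gib:>8,.0f} GiB of weights")
```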
So one thing that has changed in the last couple years is that model knowledge density has decoupled from weight precision. Remember when we were noticing how Llama 3 is "less quantizable" to, like, Q4? DSV4 functions okay-ishly at *two bits*.

@teortaxesTex well, 2 bits is sort of colloquial here, it's more like 2 "avg" bits per weight. all serious modern low bit ptq is VBR; if you try to do GPTQ style uniform rounding for the 2 bit range, shit obviously explodes to this very day
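A toy numpy sketch of that distinction, with made-up bit widths and a per-column round-to-nearest quantizer standing in for real GPTQ/VBR machinery: a fixed 2-bit grid everywhere versus spending 4 bits on a small slice of high-energy columns so the average still lands near 2 bits.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
W[:, :8] *= 20.0  # a few high-magnitude "outlier" columns, common in LLM weights

def quantize(w, bits):
    """Toy symmetric round-to-nearest with a per-column scale (not real GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

# fixed 2-bit grid for every weight
W_fixed = quantize(W, 2)

# toy "VBR": 4 bits for the 16 highest-energy columns, 2 bits elsewhere,
# which works out to about 2.1 bits per weight on average
energy = np.abs(W).mean(axis=0)
hot = np.argsort(energy)[-16:]
W_vbr = quantize(W, 2)
W_vbr[:, hot] = quantize(W[:, hot], 4)
avg_bits = (4 * 16 + 2 * (256 - 16)) / 256

mse = lambda a, b: float(((a - b) ** 2).mean())
print(f"fixed 2-bit reconstruction MSE : {mse(W, W_fixed):.4f}")
print(f"~{avg_bits:.2f}-bit VBR reconstruction MSE: {mse(W, W_vbr):.4f}")
```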
@teortaxesTex it's possible that there is a regime of activations that conventional PTQ calibration makes systematically worse. let's say you have a diverse english calibration set that preserves bits in the ffns most relevant to english. gratz, you maybe(?) tanked chinese perf
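A minimal sketch of that failure mode, using synthetic activation statistics and an AWQ-style magnitude proxy for channel importance; everything here is illustrative, not any lab's actual calibration pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in = 512

# Synthetic stand-ins for calibration activations; in a real PTQ pipeline these
# would come from forward passes over actual calibration text, not randn.
# The two sets are constructed to emphasize opposite input channels.
acts_english_only = rng.normal(size=(4096, d_in)) * np.linspace(0.2, 2.0, d_in)
acts_mixed_lang   = rng.normal(size=(4096, d_in)) * np.linspace(2.0, 0.2, d_in)

def protected_channels(acts, k=64):
    """AWQ-style proxy: channels with larger mean |activation| matter more to the
    layer output, so a calibration-aware quantizer keeps their weights at higher
    precision. Returns the k channels such a scheme would protect."""
    importance = np.abs(acts).mean(axis=0)
    return set(np.argsort(importance)[-k:])

en_only = protected_channels(acts_english_only)
mixed = protected_channels(acts_mixed_lang)
print(f"protected-channel overlap: {len(en_only & mixed)} / 64")
```

With calibration sets that stress different channels, the two runs protect almost disjoint sets of weights, which is the worry about an english-only calibration set in miniature.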
@teortaxesTex on the bloomer side: there is a manifold hypothesis adjacent version of this thought experiment where PTQ calibrated for on-manifold pretraining webtext loses you ~nothing in KLD accuracy outside of high entropy meaning-devoid noise sequences
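One way to test that reading is to compare per-token KL divergence between the full-precision and quantized model on webtext versus random token sequences. A sketch, assuming transformers-style models whose outputs expose `.logits`; the model and tensor names are placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_kld(model_fp, model_q, input_ids):
    """Mean per-token KL(full-precision || quantized) over next-token distributions."""
    logp_fp = F.log_softmax(model_fp(input_ids).logits, dim=-1)
    logp_q = F.log_softmax(model_q(input_ids).logits, dim=-1)
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v)), averaged over positions
    return (logp_fp.exp() * (logp_fp - logp_q)).sum(-1).mean().item()

# Hypothetical usage (all names below are placeholders):
#   webtext_ids = tokenized pretraining-style text
#   noise_ids   = uniformly random token ids of the same shape
#   kld_web   = mean_token_kld(model_fp, model_q, webtext_ids)
#   kld_noise = mean_token_kld(model_fp, model_q, noise_ids)
# The "bloomer" reading predicts kld_web stays near zero while kld_noise blows up.
```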
@teortaxesTex i think the bloomer side is closer to the truth and that aggressive weight-side PTQ works better than it has any right to in practice tho