
User Runs 1T-Parameter Kimi K2.5 Model on RTX 3060 at 4 Tokens per Second

Original post

Somebody just ran a one-trillion-parameter model (Kimi K2.5) on a single RTX 3060 12GB GPU backed by 768GB of second-hand Intel Optane memory, at over 4 tokens/sec.

What happened is that a sparse model met an unusual memory tier that could hold its enormous body while the GPU handled the most time-sensitive organs. That is, the bulk of the sparse expert weights live in a larger, cheaper memory tier and are pulled into the computation as needed.

This worked because Kimi K2.5 is a Mixture-of-Experts model: it has 1T total parameters but activates only 32B per token. The RTX 3060’s 12GB of VRAM holds the latency-sensitive parts: routing, attention, dense layers, and shared experts. The huge expert weights sit in Optane PMem configured as RAM, while 192GB of DDR4 ECC acts as a cache.

The setup uses six 128GB Optane PMem (DCPMM) modules, for 768GB in total. This retired memory format was made to bridge the gap between DRAM and SSD performance: it beats the best NVMe SSDs on latency by a wide margin but remains 2x to 3x slower than DRAM.

llama.cpp handled the hybrid GPU/CPU inference, with tensor placement tuned through flags like --override-tensor. The result was roughly 4 tokens/sec, which is slow for chat but impressive for a local 1T-parameter model on cheap retired enterprise hardware. In short: the DDR4 acted as cache, the Optane as a giant memory pool, and the 12GB GPU took routing and other critical tensors.
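To make the placement idea and the numbers concrete, here is a rough Python sketch. It is not llama.cpp's implementation and not the poster's actual configuration. The tensor names follow the llama.cpp-style naming I assume for MoE models (routed experts containing "_exps"), the 4 bits per weight figure is purely an assumption (the post does not state the quantization), and the helper names are hypothetical. Only the 1T, 32B, 6x128GB, and 4 tokens/sec figures come from the post.

```python
import re

# Figures quoted in the post.
TOTAL_PARAMS = 1_000_000_000_000   # 1T total parameters
ACTIVE_PARAMS = 32_000_000_000     # 32B parameters activated per token
OPTANE_MODULES = 6
OPTANE_MODULE_GB = 128
TOKENS_PER_SEC = 4                 # "roughly 4 tokens/sec"

# ASSUMPTION (not from the post): average bits stored per weight.
ASSUMED_BITS_PER_WEIGHT = 4

# ASSUMPTION: llama.cpp-style tensor names, where routed expert weights
# contain "ffn_gate_exps", "ffn_up_exps", or "ffn_down_exps".
ROUTED_EXPERT = re.compile(r"\.ffn_(gate|up|down)_exps\.")


def place(tensor_name: str) -> str:
    """Return "CPU" for routed-expert tensors and "GPU" for everything else."""
    return "CPU" if ROUTED_EXPERT.search(tensor_name) else "GPU"


def active_fraction() -> float:
    """Share of all parameters that take part in a single token."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


def optane_capacity_gb() -> int:
    """Total Optane capacity across all modules, in GB."""
    return OPTANE_MODULES * OPTANE_MODULE_GB


def weight_gb_read_per_token(bits_per_weight: float = ASSUMED_BITS_PER_WEIGHT) -> float:
    """GB of weights touched per token under the assumed quantization."""
    return ACTIVE_PARAMS * bits_per_weight / 8 / 1e9


def implied_weight_traffic_gb_per_sec(bits_per_weight: float = ASSUMED_BITS_PER_WEIGHT) -> float:
    """Aggregate weight traffic across VRAM, DRAM, and Optane at the quoted speed."""
    return weight_gb_read_per_token(bits_per_weight) * TOKENS_PER_SEC


if __name__ == "__main__":
    for name in (
        "blk.10.attn_q.weight",          # attention
        "blk.10.ffn_gate_inp.weight",    # router
        "blk.10.ffn_gate_shexp.weight",  # shared expert
        "blk.10.ffn_gate_exps.weight",   # routed experts
    ):
        print(f"{name} -> {place(name)}")
    print(f"active fraction: {active_fraction():.1%}")
    print(f"Optane capacity: {optane_capacity_gb()} GB")
    print(f"weights per token (assumed 4-bit): {weight_gb_read_per_token():.1f} GB")
    print(f"implied weight traffic: {implied_weight_traffic_gb_per_sec():.1f} GB/s")
```

The traffic figure is only a back-of-envelope bound: it lumps together reads served from VRAM, the DDR4 cache, and Optane, ignores the KV cache, and changes directly with whatever quantization was really used.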

11:12 PM · May 23, 2026
Rohan Paul @rohanpaul_ai


6:12 AM · May 24, 2026 · 4K Views
6:12 AM · May 24, 2026 · 853 Views