
Gemma 4 26B MoE Runs On 8GB VRAM GPU At 20+ Tokens Per Second


Original post

Pete Skomoroch#1425

Alok (@analogalok)

Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec

If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware.

Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card.

The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?"

Today, I’m delivering exactly that.

I am running a massive 26B-parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context!

If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed.

The performance metrics are astonishing:

- 20 tokens/sec flat decode throughput.
- Stable, flat decode speed even with massive prompts.
- I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame.

# What about prefill?

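Yes, Time To First Token (TTFT) is higher when swallowing massive contexts: at a solid 200 tokens/sec prefill, a 60k-token prompt takes about 5 minutes to process before the first output token appears. For everyday prompts of a few thousand tokens, the wait is short and the setup stays highly usable.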
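To put numbers on the prefill wait, here is a minimal sketch that estimates TTFT from the reported 200 tokens/sec prefill speed. It assumes prefill speed stays constant across prompt sizes, which the post does not measure; the function name and the prompt sizes in the loop are illustrative only.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float = 200.0) -> float:
    """Estimate time to first token as prompt size divided by prefill speed.

    Assumes a constant prefill rate (200 tokens/sec is the figure from the post).
    """
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


if __name__ == "__main__":
    # Illustrative prompt sizes: a short prompt, the 60k test prompt,
    # and the full 248000-token context from the flags.
    for tokens in (2_000, 60_000, 248_000):
        seconds = prefill_seconds(tokens)
        print(f"{tokens:>7} tokens -> {seconds:7.0f} s ({seconds / 60:.1f} min)")
```

Under that assumption, the wait scales linearly with prompt length, so the decode speed stays flat while the up-front cost of a full context is the part to plan around.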

And this is running completely without Multi Token Prediction (MTP) active.

How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4.

The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse.

# The Test Setup:

CPU: Intel Core i7

RAM: 16GB System RAM

GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM)

# The Secret Sauce (The -cmoe Flag)

To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp.

This flag keeps the heavy MoE expert weights in system memory (CPU/RAM) while letting your GPU focus strictly on the attention layers and the KV cache.

It prevents VRAM spillage and holds the throughput rock solid.

# The flags:

-m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v
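The flags above need a llama.cpp binary in front of them. A minimal sketch of assembling the full command follows, assuming the `llama-server` binary (the post mentions a UI on localhost but does not name the executable) and a model file in the current directory. The helper name `build_llama_command` is hypothetical.

```python
import shlex


def build_llama_command(
    model_path: str,
    ctx_size: int = 248_000,
    binary: str = "llama-server",  # assumption: the post does not name the binary
) -> list[str]:
    """Build the argument list using only the flags given in the post."""
    return [
        binary,
        "-m", model_path,      # path to the GGUF model file
        "-cmoe",               # keep MoE expert weights in system RAM
        "-c", str(ctx_size),   # context size in tokens
        "-v",                  # verbose logging
    ]


cmd = build_llama_command("gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf")

if __name__ == "__main__":
    # Print a shell-safe command line; pass `cmd` to subprocess.run to launch it.
    print(shlex.join(cmd))
```

Building the command as a list rather than one string avoids quoting problems if the model path contains spaces.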

Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi-step thinking.

Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies.

4:23 AM · Jun 7, 2026 · 256.4K Views


