/AI9h ago

Qwen3.6 And Gemma4 Surpass GPT-4o In Chat Benchmarks

8779247.1K
Original postFlorian Brand#1117
N8 Programs@N8Programs

Back in March, I tested whether Qwen3.5 4B was as good as GPT-4o for gen chat. The result was that they were ~similar. But lots of models have since come out - Qwen3.6, Gemma4 - how do they stack up? The answer is decisive: they beat GPT-4o handily for general LMArena-like chat.

8:44 AM · Jun 7, 2026 · 7.1K Views
Sentiment

Many users are enthusiastic about Gemma 4 and Qwen3.6 outperforming GPT-4o in chat benchmarks because smaller open MoEs deliver higher-quality results than larger closed models.

Pos
100.0%
Neg
0.0%
9 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Most Activity
VIEWS62
N8 Programs@N8Programs

For these tests, I used the same 1000 prompts I did in the Qwen3.5 4B comparison. I additionally used local 6-bit quants of both Qwen3.6 35B A3B and Gemma 4 26B A4B, w/ thinking off - this is to capture 'local, low-latency chat' - the same regime GPT-4o excelled at.

9hViews 62Likes 1
LIKES3
N8 Programs@N8Programs

Thanks to @lmstudio's new backend, I was able to timely evaluate both models in just a few hours and then run judging w/ Claude Opus 4.6 (same judge I used back in March, as I didn't want to change an additional latent factor).

9hViews 54Likes 3
REPLIES2
N8 Programs@N8Programs

@lmstudio Link to original thread for reference:

9hViews 20Likes 2
N8 Programs@N8Programs

@lmstudio The only place where GPT-4o still excels, likely due to its param count, is domains that require high-knowledge, high-reasoning:

9hViews 42Likes 2
N8 Programs@N8Programs

The result, summarized: Qwen3.6 35B and Gemma 4 26B punch way above their active param counts:

9hViews 39Likes 2

@N8Programs Great research!!

9hViews 52Likes 1
N8 Programs@N8Programs

@lmstudio Gemma 4 26B A4B beats GPT-4o 86.2% of the time, even w/ thinking mode off. This means that even without TTC scaffolding, modern smaller MoEs still pack enough bunch to beat frontier models from ~2 years ago in gen chat.

9hViews 50Likes 1
N8 Programs@N8Programs

@lmstudio Thank you for reading!

9hViews 27Likes 1
N8 Programs@N8Programs

@lmstudio Length-wise, GPT-4o is quite concise, Gemma a bit less so, and Qwen3.6 much less. Qwen3.5 4B + Llama 3.1 8B included as references.

9hViews 26Likes 1
N8 Programs@N8Programs

@lmstudio For replicability, I inferenced Gemma 26B A4B at a 6-bit MLX quant through LM studio w/ temp=1.0, top_k=64, top_p=0.95. I inferenced Qwen 3.6 35B A3B at a 6-bit MLX quant through LM studio w/ temp=0.6, top_k=20, top_p=0.95.

9hViews 23Likes 1
N8 Programs@N8Programs

So, tentatively: for modern MoEs like Gemma 26B A4B and Qwen3.6 35B A4B - even w/ thinking mode off - you can get general chat quality that exceeds GPT-4o significantly by winrate.

9hViews 18Likes 1
N8 Programs@N8Programs

@lmstudio I think this is important for people who seek to have a single, consistent, LLM to talk to - local solutions have surpassed the famously-enrapturing 4o, and are obviously immune from model switching, routing, or deprecation.

9hViews 10Likes 1
N8 Programs@N8Programs

@lmstudio One thing to additionally note: I intentionally ran these tests w/ the 6-bit quant through lmstudio to avoid any discrepancy between vllm/API-side deployment and the local deployment that any reader of this thread would consider running on their machine.

9hViews 20

Love seeing how far the open models have come!

For those looking to reproduce your setup, essentially 1. @lmstudio to generate batch of responses with models, 2. Claude opus 4.6 then to perform pair-wise comparison to pick which is best, 3. Custom UI which looks like LMArea to visualize?

9hViews 20
Vincent@InsiderPresider

@N8Programs @lmstudio benchmarks are one thing but testing real world performance is where the truth usually comes out

9h
N8 Programs@N8Programs

@xeophon Thank you!!!

9hViews 9Likes 1
N8 Programs@N8Programs

@GlavinW @lmstudio Yes - essentially - though there's no UI, its just gpt-5.5 generated matplotlib charts. Code here for repro: https://github.com/N8python/qwen4BvsGPT-4o

9hViews 7
N8 Programs@N8Programs

@InsiderPresider @lmstudio im fairly sure you're a bot, but these 1000 prompts are *intentionally drawn from what real users asked*.

9hLikes 1
N8 Programs@N8Programs

Another thing to additionally note: this thread's conclusions rests on a two main assumptions:

Claude Opus 4.6 reflects human preferences and isn't irrationally length-biased.

These 1000 prompts are representative of general chat use cases.

9h