/Tech9h ago

Qwen3.6 And Gemma4 Surpass GPT-4o In Chat Benchmarks

--0--

Original post unavailable.

/Tech9h ago

Qwen3.6 And Gemma4 Surpass GPT-4o In Chat Benchmarks

--0--

Original post unavailable.

Sentiment

Users are celebrating Gemma 4 and Qwen3.6 outperforming GPT-4o in chat benchmarks because smaller open MoE models deliver higher quality than the closed flagship, enabled by local tools for quick evaluation.

Pos

100.0%

Neg

0.0%

9 comments with sentiment.

Cluster Engagement

Views

Comments

Reposts

Bookmarks

Expand data

Posts from X

Most Activity

N8 Programs@N8Programs

For these tests, I used the same 1000 prompts I did in the Qwen3.5 4B comparison. I additionally used local 6-bit quants of both Qwen3.6 35B A3B and Gemma 4 26B A4B, w/ thinking off - this is to capture 'local, low-latency chat' - the same regime GPT-4o excelled at.

9h621

LIKES3

N8 Programs@N8Programs

Thanks to @lmstudio's new backend, I was able to timely evaluate both models in just a few hours and then run judging w/ Claude Opus 4.6 (same judge I used back in March, as I didn't want to change an additional latent factor).

9h543

REPLIES2

N8 Programs@N8Programs

@lmstudio Link to original thread for reference:

9h202

N8 Programs@N8Programs

@lmstudio The only place where GPT-4o still excels, likely due to its param count, is domains that require high-knowledge, high-reasoning:

9h422

N8 Programs@N8Programs

The result, summarized: Qwen3.6 35B and Gemma 4 26B punch way above their active param counts:

9h392

Florian Brand@xeophon

@N8Programs Great research!!

9h521

N8 Programs@N8Programs

@lmstudio Gemma 4 26B A4B beats GPT-4o 86.2% of the time, even w/ thinking mode off. This means that even without TTC scaffolding, modern smaller MoEs still pack enough bunch to beat frontier models from ~2 years ago in gen chat.

9h501

N8 Programs@N8Programs

@lmstudio Thank you for reading!

9h271

N8 Programs@N8Programs

@lmstudio Length-wise, GPT-4o is quite concise, Gemma a bit less so, and Qwen3.6 much less. Qwen3.5 4B + Llama 3.1 8B included as references.

9h261

N8 Programs@N8Programs

@lmstudio For replicability, I inferenced Gemma 26B A4B at a 6-bit MLX quant through LM studio w/ temp=1.0, top_k=64, top_p=0.95. I inferenced Qwen 3.6 35B A3B at a 6-bit MLX quant through LM studio w/ temp=0.6, top_k=20, top_p=0.95.

9h231

N8 Programs@N8Programs

So, tentatively: for modern MoEs like Gemma 26B A4B and Qwen3.6 35B A4B - even w/ thinking mode off - you can get general chat quality that exceeds GPT-4o significantly by winrate.

9h181

N8 Programs@N8Programs

@lmstudio I think this is important for people who seek to have a single, consistent, LLM to talk to - local solutions have surpassed the famously-enrapturing 4o, and are obviously immune from model switching, routing, or deprecation.

9h101

N8 Programs@N8Programs

@lmstudio One thing to additionally note: I intentionally ran these tests w/ the 6-bit quant through lmstudio to avoid any discrepancy between vllm/API-side deployment and the local deployment that any reader of this thread would consider running on their machine.

9h20

Glavin Wiechert👨‍💻@GlavinW

Love seeing how far the open models have come!

For those looking to reproduce your setup, essentially 1. @lmstudio to generate batch of responses with models, 2. Claude opus 4.6 then to perform pair-wise comparison to pick which is best, 3. Custom UI which looks like LMArea to visualize?

9h20

Vincent@InsiderPresider

@N8Programs @lmstudio benchmarks are one thing but testing real world performance is where the truth usually comes out

N8 Programs@N8Programs

@xeophon Thank you!!!

9h91

N8 Programs@N8Programs

@GlavinW @lmstudio Yes - essentially - though there's no UI, its just gpt-5.5 generated matplotlib charts. Code here for repro: https://github.com/N8python/qwen4BvsGPT-4o

9h7

N8 Programs@N8Programs

@InsiderPresider @lmstudio im fairly sure you're a bot, but these 1000 prompts are *intentionally drawn from what real users asked*.

9h1

N8 Programs@N8Programs

Another thing to additionally note: this thread's conclusions rests on a two main assumptions:

Claude Opus 4.6 reflects human preferences and isn't irrationally length-biased.

These 1000 prompts are representative of general chat use cases.