Back in March, I tested whether Qwen3.5 4B was as good as GPT-4o for gen chat. The result was that they were ~similar. But lots of models have since come out - Qwen3.6, Gemma4 - how do they stack up? The answer is decisive: they beat GPT-4o handily for general LMArena-like chat.
Many users are enthusiastic about Gemma 4 and Qwen3.6 outperforming GPT-4o in chat benchmarks because smaller open MoEs deliver higher-quality results than larger closed models.
Most Activity

For these tests, I used the same 1000 prompts I did in the Qwen3.5 4B comparison. I additionally used local 6-bit quants of both Qwen3.6 35B A3B and Gemma 4 26B A4B, w/ thinking off - this is to capture 'local, low-latency chat' - the same regime GPT-4o excelled at.

Thanks to @lmstudio's new backend, I was able to timely evaluate both models in just a few hours and then run judging w/ Claude Opus 4.6 (same judge I used back in March, as I didn't want to change an additional latent factor).

@lmstudio Link to original thread for reference:

@lmstudio The only place where GPT-4o still excels, likely due to its param count, is domains that require high-knowledge, high-reasoning:

The result, summarized: Qwen3.6 35B and Gemma 4 26B punch way above their active param counts:

@N8Programs Great research!!

@lmstudio Gemma 4 26B A4B beats GPT-4o 86.2% of the time, even w/ thinking mode off. This means that even without TTC scaffolding, modern smaller MoEs still pack enough bunch to beat frontier models from ~2 years ago in gen chat.

@lmstudio Thank you for reading!

@lmstudio Length-wise, GPT-4o is quite concise, Gemma a bit less so, and Qwen3.6 much less. Qwen3.5 4B + Llama 3.1 8B included as references.

@lmstudio For replicability, I inferenced Gemma 26B A4B at a 6-bit MLX quant through LM studio w/ temp=1.0, top_k=64, top_p=0.95. I inferenced Qwen 3.6 35B A3B at a 6-bit MLX quant through LM studio w/ temp=0.6, top_k=20, top_p=0.95.

So, tentatively: for modern MoEs like Gemma 26B A4B and Qwen3.6 35B A4B - even w/ thinking mode off - you can get general chat quality that exceeds GPT-4o significantly by winrate.

@lmstudio I think this is important for people who seek to have a single, consistent, LLM to talk to - local solutions have surpassed the famously-enrapturing 4o, and are obviously immune from model switching, routing, or deprecation.

@lmstudio One thing to additionally note: I intentionally ran these tests w/ the 6-bit quant through lmstudio to avoid any discrepancy between vllm/API-side deployment and the local deployment that any reader of this thread would consider running on their machine.

Love seeing how far the open models have come!
For those looking to reproduce your setup, essentially 1. @lmstudio to generate batch of responses with models, 2. Claude opus 4.6 then to perform pair-wise comparison to pick which is best, 3. Custom UI which looks like LMArea to visualize?

@N8Programs @lmstudio benchmarks are one thing but testing real world performance is where the truth usually comes out

@xeophon Thank you!!!

@GlavinW @lmstudio Yes - essentially - though there's no UI, its just gpt-5.5 generated matplotlib charts. Code here for repro: https://github.com/N8python/qwen4BvsGPT-4o

@InsiderPresider @lmstudio im fairly sure you're a bot, but these 1000 prompts are *intentionally drawn from what real users asked*.

Another thing to additionally note: this thread's conclusions rests on a two main assumptions:
Claude Opus 4.6 reflects human preferences and isn't irrationally length-biased.
These 1000 prompts are representative of general chat use cases.