MiniMax M3 scores above DeepSeek V4 Pro on DeepSWE, but below other Chinese competitors
Many users dismissed MiniMax M3's benchmark win over DeepSeek V4 Pro as unreliable or meaningless after DeepSWE retracted the score over testing issues, while more positive users praised the rapid progress of Chinese labs and models such as GLM and Kimi.
DeepSWE informed me that the MiniMax-M3 run had several issues during the benchmarking and that they had to retract the score
The ranking as shown in the image should be seen more as a lower bound.

@scaling01 deepswe, programbench, and gba eval somewhat destroyed the notion that chinese models were anywhere close in terms of agentic coding (which swe bench was falsely showing previously)
scores were removed because they were incomplete/invalid

@scaling01 DeepSeek solos every other Chinese LLM

@spicey_lemonade @scaling01 Just wait until they benchmax on those too

@scaling01 retracted score still sitting in the image tho
curious how much headroom they think it actually has

@scaling01 I don't understand why qwen 3.7 isn't benchmarked on deepswe

@scaling01 Where do you think Composer 2.5 would sit on their benchmark? I've been interested in trying some open source models but after seeing the top one ranks below gpt-5.4-mini, I think I'll pass.

@scaling01 This bench is stupid. 5.4 mini right next to opus 4.6?

@scaling01 I dislike the model, feels like a small model

@spicey_lemonade @scaling01 can you or someone do a new frontier calibration on these new benchmarks + benchmarks that are closed

@scaling01 @melvynx

@scaling01 DeepSWE really feels like the real deal. I don't know what their standards are, but it feels mostly accurate.

@scaling01 score lower than grok build is kinda crazy fr

@scaling01 This looks fake I don’t see it in deepSWE website

@scaling01 i don't see this in deepswe website its not listed yet

@scaling01 disappointed to see months old k2.6 beats latest minimax

@scaling01 famous last words "mini" in the name of an LLM in 2026
must be a real short attention span demo

@scaling01 kimi k2.6 still the best chinese model