/Tech · 19d ago

MiniMax M3 Outperforms DeepSeek V4 Pro On DeepSWE Leaderboard


Original post

Lisan al Gaib@scaling01 · #1215 in Tech

MiniMax M3 scores above DeepSeek V4 Pro on DeepSWE, but below other chinese competitors

7:55 PM · Jun 2, 2026 · 58K Views

Sentiment

Most commenters dismissed MiniMax M3's benchmark win over DeepSeek V4 Pro as unreliable or meaningless after DeepSWE retracted the score over issues with the benchmark run, while the positive minority praised the rapid progress of Chinese labs and models such as GLM and Kimi.

Pos: 25.0%

Neg: 75.0%

18 comments with sentiment.



Posts from X


Views: 24.7K · Bookmarks: 20 · Likes: 172 · Retweets: 5 · Replies: 10

Lisan al Gaib@scaling01

DeepSWE informed me that the MiniMax-M3 run had several issues during the benchmarking and that they had to retract the score

The ranking as shown in the image should be seen more as a lower bound.

Lisan al Gaib@scaling01

MiniMax M3 scores above DeepSeek V4 Pro on DeepSWE, but below other chinese competitors

18d · 24.7K views · 172 likes · 20 bookmarks

spicylemonade@spicey_lemonade

@scaling01 deepswe, programbench, and gba eval somewhat destroyed the notion that chinese models were anywhere close in terms of agentic coding (which swe bench was falsely showing previously)

19d · 1.7K · 172

Lisan al Gaib@scaling01

scores were removed because they were incomplete/invalid

Lisan al Gaib@scaling01

MiniMax M3 scores above DeepSeek V4 Pro on DeepSWE, but below other chinese competitors

18d · 3K · 102

Lisan al Gaib@scaling01

UPDATE

Lisan al Gaib@scaling01

DeepSWE informed me that the MiniMax-M3 run had several issues during the benchmarking and that they had to retract the score

The ranking as shown in the image should be seen more as a lower bound.

18d · 2.9K · 110

Pheyls@Pheylz

@scaling01 DeepSeek solos every other Chinese LLM

19d · 1.8K · 10

TheTinman@NguyenTinMan

@spicey_lemonade @scaling01 Just wait until they benchmax on those too

19d1724

Lunari@0x_lun

@scaling01 retracted score still sitting in the image tho

curious how much headroom they think it actually has

18d2991

sunnycity2.0@0Sunnycity2

@scaling01 I don't understand why qwen 3.7 isn't benchmarked on deepswe

18d7025

The Noble Simian@thenoblesimian

@scaling01 Where do you think Composer 2.5 would sit on their benchmark? I've been interested in trying some open source models but after seeing the top one ranks below gpt-5.4-mini, I think I'll pass.

19d · 1.1K · 4