Tech · 21d ago

Alibaba's Qwen releases Qwen3.7-Max closed-weights model that sustains 35-hour autonomous kernel optimization runs with over 1,000 tool calls and scores 56.6 on the Artificial Analysis Intelligence Index

Qwen3.7-Max supports 1 million token context via Together AI.

Original post

In other notes, Qwen 3.7-Max is the strongest Chinese model on CritPt, and in fact stronger than Gemini 3.5 Flash and Opus 4.6/4.7. The leap from the last generation is almost 4x, the largest I've ever seen.

Naeem@identity_matrix

qwen 3.7 max scores are live on Artificial Analysis !

10:20 AM · May 20, 2026 · 38.6K Views
Sentiment

Positive users praise Qwen3.7-Max for its agent capabilities, coding benchmarks, and RL progress, while negative users call out high prices, question benchmark reliability, and want more versatile models.

Positive: 66.3%
Negative: 33.7%
265 comments with sentiment.
Cluster Engagement
Posts from X
Views 1.1M · Bookmarks 1.1K · Likes 5K · Retweets 641 · Replies 280
Qwen@Alibaba_Qwen

📣Meet Qwen3.7-Max — our latest flagship, made for the Agent Era.

A versatile foundation for agents that actually get things done:
🧑‍💻 Coding agent, end to end. Frontend prototypes, multi-file refactors, real debugging: nails it.
🗂️ A reliable office and productivity assistant. Get your work done through MCP integrations and multi-agent orchestration.
⏱️ Long-horizon autonomy. 35 hours straight on a kernel optimization task: 1,000+ tool calls, zero hand-holding.
🔌 Scaffold-agnostic. Claude Code, OpenClaw, Qwen Code, or your own stack. Consistent reliability everywhere.

API's up on Alibaba Model Studio. You can also take it for a spin on Qwen Studio.

Go build something wild!🏃🏃‍♂️

📖 Blog: https://qwen.ai/blog?id=qwen3.7
✅ Qwen Studio: https://chat.qwen.ai/?models=qwen3.7-max
⚡️ API: https://modelstudio.console.alibabacloud.com/ap-southeast-1?tab=doc#/doc/?type=model&url=2840914_2&modelId=qwen3.7-max&serviceSite=international

21d · Views 1.1M · Likes 5K · Bookmarks 1.1K

For anyone who isn't sure, this is how you release a model and talk about the performance. Not 3-5 cherry-picked benchmarks.

Qwen@Alibaba_Qwen

Performance: Qwen3.7-Max performs strongly across benchmarks in coding agents, and improves massively in general-purpose agents. Qwen3.7-Max also demonstrates exceptional strength on the hardest reasoning benchmarks, and stands out in general capabilities and multilingualism.

21d · Views 73.8K · Likes 839 · Bookmarks 155
Teknium 🪽@Teknium

Anyone tried in Hermes yet? New OS king?

Sudo su@sudoingX

qwen is unreal. they just dropped 3.7 max and it is beating opus 4.6 max on most of the benchmarks they ran.

terminal bench, mcp use, math, instruction following, humanity's last exam. and the apex math number, 44.5 against opus 34.5, that is not a small gap.

the 35 hours straight on a kernel optimization task with 1000+ tool calls is the part i keep rereading. that is the agent era thing actually happening, not a slide.

the speed alibaba is shipping at right now is the whole story, 3.6 was last month, 3.7 max today, nobody else is moving like this.

one thing though, please open source this one too. 3.6 dense made the entire local llm ecosystem better. the max tier going api only would close a door we have been keeping open. give us the weights eventually.

21d · Views 63.7K · Likes 628 · Bookmarks 150
Chujie Zheng@ChujieZheng

For Qwen3.7-Max, we have invested far more compute into RL training than ever before. Its top-tier AA score confirms the resulting general and agentic capabilities.

This is just the start. We will firmly push forward RL scaling to build more powerful Qwen models. Stay tuned!

Artificial Analysis@ArtificialAnlys

Alibaba’s new Qwen3.7 Max model scores 56.6 on the Artificial Analysis Intelligence Index, 4.8 points higher than Qwen3.6 Max Preview (51.8). While Alibaba still trails models from OpenAI, Anthropic and Google, Qwen3.7 Max is the closest they have been to the frontier

Qwen3.7 Max is @Alibaba_Qwen's latest proprietary flagship, scoring 56.6 on the Intelligence Index, a 4.8 point gain over Qwen3.6 Max Preview (51.8) released in April. Qwen3.7 Max continues Alibaba's pattern, in place since Qwen2.5 Max (January 2025), of releasing Max and Plus models as closed weights while the rest of the Qwen line remains open weights. The leading open weights Qwen on the Intelligence Index is Qwen3.6 27B (Reasoning, 45.8) released in April 2026, and the leading open weights MoE Qwen is Qwen3.5 397B A17B (Reasoning, 45.0) released in February 2026

Key takeaways for the reasoning variant:

➤ The Intelligence Index gains over Qwen3.6 Max Preview are concentrated in scientific reasoning, agentic capability and coding. CritPt +9.7 p.p (3.7% to 13.4%), HLE +9.2 p.p (28.9% to 38.1%), TerminalBench Hard +6.9 p.p (43.9% to 50.8%) and GDPval-AA +42 Elo (1504 to 1546). Scores on other benchmarks in the Intelligence Index are flat compared to Qwen3.6 Max Preview

➤ A significant share of the Intelligence Index gain is driven by higher abstention on AA-Omniscience, not higher accuracy. Qwen3.7 Max's accuracy on AA-Omniscience dropped 7.6 p.p (37.7% to 30.1%), while its hallucination rate dropped 21.3 p.p (44.2% to 22.9%). The model is choosing not to answer more questions rather than recalling more facts. Because hallucination rate and accuracy both feed into the Intelligence Index, the hallucination reduction is one of the larger single contributors to the +4.8 point gain on the Intelligence Index

➤ Qwen3.7 Max used 96.7M output tokens to run the Intelligence Index, ~31% more than Qwen3.6 Max Preview (73.9M). It sits mid-pack on frontier token usage: above GPT-5.5 (high, 44.5M) and Gemini 3.1 Pro Preview (57.3M), below Claude Opus 4.7 (Adaptive Reasoning, Max Effort, 112M), Kimi K2.6 (166M) and DeepSeek V4 Pro (Reasoning, Max Effort, 187M)
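The arithmetic behind these takeaways can be checked with a short sketch. It uses only the figures quoted in the post above; the function and variable names (`pp_gain`, `ratio`, `pct_increase`, `SCORES`, `OUTPUT_TOKENS_M`) are made up for illustration and are not part of any Artificial Analysis tooling.

```python
# Scores copied from the post above, in percent:
# (Qwen3.6 Max Preview, Qwen3.7 Max). Nothing here is re-measured.
SCORES = {
    "CritPt": (3.7, 13.4),
    "HLE": (28.9, 38.1),
    "TerminalBench Hard": (43.9, 50.8),
}

# Output tokens (in millions) used to run the Intelligence Index, same post.
OUTPUT_TOKENS_M = {
    "GPT-5.5 (high)": 44.5,
    "Gemini 3.1 Pro Preview": 57.3,
    "Qwen3.6 Max Preview": 73.9,
    "Qwen3.7 Max": 96.7,
    "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)": 112.0,
    "Kimi K2.6": 166.0,
    "DeepSeek V4 Pro (Reasoning, Max Effort)": 187.0,
}


def pp_gain(old: float, new: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(new - old, 1)


def ratio(old: float, new: float) -> float:
    """Relative multiple of the new score over the old one."""
    return new / old


def pct_increase(old: float, new: float) -> float:
    """Relative increase of new over old, in percent."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    for name, (old, new) in SCORES.items():
        print(f"{name}: +{pp_gain(old, new)} p.p., {ratio(old, new):.2f}x")

    growth = pct_increase(
        OUTPUT_TOKENS_M["Qwen3.6 Max Preview"], OUTPUT_TOKENS_M["Qwen3.7 Max"]
    )
    print(f"Output tokens vs Qwen3.6 Max Preview: +{growth:.1f}%")

    # Rank the listed models from fewest to most output tokens.
    for model, tokens in sorted(OUTPUT_TOKENS_M.items(), key=lambda kv: kv[1]):
        print(f"{tokens:7.1f}M  {model}")
```

On these numbers the CritPt multiple works out to about 3.6x, consistent with the "almost 4x" remark in the original post, and the token increase rounds to the quoted ~31%.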

Key model details:
➤ Context window: 1M tokens (up from 256K on Qwen3.6 Max Preview)
➤ Multimodality: Text input and output only
➤ Pricing: Yet to be announced (Qwen3.6 Max Preview is priced at $1.30/$7.80 per 1M input/output tokens on the @alibaba_cloud first-party API)
➤ Licensing: Proprietary, closed weights
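The AA-Omniscience takeaway above (accuracy down, hallucination rate down much further) is easier to see as a split of all questions into correct, incorrect and abstained. The sketch below is a rough illustration only. It ASSUMES the hallucination rate is the share of non-correct responses that were wrong answers rather than abstentions, and it ignores any partial-credit category; the post does not state the definition, so treat the output as indicative. `split_responses` is a hypothetical helper.

```python
def split_responses(accuracy_pct: float, hallucination_pct: float) -> dict:
    """Split 100% of questions into correct / incorrect / abstained shares.

    ASSUMPTION (not confirmed by the post):
        hallucination rate = incorrect / (incorrect + abstained)
    """
    correct = accuracy_pct
    not_correct = 100.0 - correct
    incorrect = not_correct * hallucination_pct / 100.0
    abstained = not_correct - incorrect
    return {"correct": correct, "incorrect": incorrect, "abstained": abstained}


# Figures quoted in the post: (accuracy %, hallucination rate %).
qwen36_max_preview = split_responses(37.7, 44.2)
qwen37_max = split_responses(30.1, 22.9)

if __name__ == "__main__":
    for label, shares in [
        ("Qwen3.6 Max Preview", qwen36_max_preview),
        ("Qwen3.7 Max", qwen37_max),
    ]:
        parts = ", ".join(f"{k} {v:.1f}%" for k, v in shares.items())
        print(f"{label}: {parts}")
```

Under that assumed definition, wrong answers fall from roughly 27.5% to 16.0% of questions while abstentions rise from roughly 34.8% to 53.9%, which is the "choosing not to answer" pattern the post describes.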

21d · Views 70.6K · Likes 939 · Bookmarks 99
Chubby♨️@kimmonismus

Alibaba released Qwen 3.7 max. Benchmarks incredible.

Their new model ran autonomously for 35 hours, made 1,158 tool calls, and achieved a 10x speedup - on a single attention kernel.

This isn't "AI improving itself across the board." It's a model grinding through compile-profile-rewrite loops on one well-defined optimization target.

Impressive? Absolutely. The kind of self-improvement people will imagine when they see the headline? Not yet.

The actually interesting claim is buried deeper: Qwen says agentic capabilities generalize from diverse training environments the same way language capabilities generalize from diverse text. If that holds, it's a bigger deal than any benchmark number.

21d · Views 60.1K · Likes 693 · Bookmarks 126

Chinese labs are starting to show off with tasks relevant for automating research

📇 We explored kernel generation in 3.7, a major leap over the previous one. Still a real gap vs human experts and production workloads, but here's what blew us away: tested on the just-released PPU, after 35h/431 turns, it got a 10x faster kernel. Self-evolution is going to be a big deal soon

20d · Views 16K · Likes 227 · Bookmarks 79
Shannon Sands@max_paperclips

hyped, it looks like they've officially passed Opus with this one

21d · Views 22.8K · Likes 217 · Bookmarks 34

Great news! Qwen3.7 Max is now on @OpenRouter @ $7.5/1M and decent output at 96 tps.

OpenRouter@OpenRouter

The new Qwen3.7-Max from @Alibaba_Qwen is live on OpenRouter.

The flagship of the Qwen3.7 series, built for agent-centric work: coding, office and productivity tasks, and long-horizon autonomous execution. Big jumps in coding and agent benchmarks over Qwen3.6, with explicit prompt caching for repeated context.

21d · Views 20.5K · Likes 188 · Bookmarks 22
Rohan Paul@rohanpaul_ai

Qwen 3.7 Max is super close to the frontier models for coding and agentic abilities.

And it’s now available on AI/ML API.

Agent reliability is the center of the story, and on Artificial Analysis it's sitting at 5th, pretty much on par with GPT 5.4 (xhigh) and a notch above the just-released Gemini 3.5 Flash.

AI/ML API is also giving away free codes for users who want to try it. see the quoted tweet.

AI/ML API@aimlapi

Qwen3.7-Max on AI/ML API - built for the agent era

GPQA Diamond (92.4), HMMT (97.1), Apex (44.5)
Sustains 35+ hours of autonomous execution
Works with Claude Code, Qwen Code & more

Comment Qwen to get Free promo code

21d · Views 10.5K · Likes 68 · Bookmarks 23
Qwen@Alibaba_Qwen

Cowork Productivity Assistant: Qwen3.7-Max serves as your advanced coworker for real-world productivity.

21d · Views 15.2K · Likes 106 · Bookmarks 13
Lisan al Gaib@scaling01

benchmarks look really good

but it still uses a ton of reasoning tokens

21d · Views 7.5K · Likes 105 · Bookmarks 7
Rohan Paul@rohanpaul_ai

Alibaba just released Qwen3.7-Max.

Their best flagship model built for real-world tasks and production environments.

- Agent reliability is the center of the story: the model must plan steps, call tools, inspect results, fix mistakes, and continue without collapsing after the first wrong turn.

- 56.6 on the Artificial Analysis Intelligence Index, up 4.8 points from Qwen3.6 Max Preview. Qwen 3.7 Max is sitting at 5th, pretty much on par with GPT 5.4 (xhigh).

- The Intelligence Index gains over Qwen3.6 Max Preview are concentrated in scientific reasoning, agentic capability and coding.

- One important layer of the serving stack, the inference kernel, was optimized heavily, going from near-baseline speed to a 10.0x geometric mean speedup after many rounds of low-level GPU optimization.

21d · Views 5.8K · Likes 52 · Bookmarks 3
Together AI@togethercompute

Introducing Qwen3.7-Max from @Alibaba_Qwen, Qwen’s flagship model for the agent era with 1M context and leading performance across agentic coding, reasoning, and long-horizon autonomy.

AI natives can now use Qwen3.7-Max on Together Serverless Inference for production-scale agent workflows.

21d · Views 3.9K · Likes 43 · Bookmarks 3
Qwen@Alibaba_Qwen

Performance: Qwen3.7-Max performs strongly across benchmarks in coding agents, and improves massively in general-purpose agents. Qwen3.7-Max also demonstrates exceptional strength on the hardest reasoning benchmarks, and stands out in general capabilities and multilingualism.

21d · Views 4K · Likes 55 · Bookmarks 2
Lisan al Gaib@scaling01

@max_paperclips looks good but i doubt it

no ProgramBench, no METR, no MirrorCode, no UK AISI, no WeirdML, no GSO, no PostTrainBench, no ARC-AGI-2

basically nothing that matters for take-off or something indicative of long time horizons

21d · Views 1.4K · Likes 36 · Bookmarks 1
Qwen@Alibaba_Qwen

Agent Scaling: Building on Qwen3.5's environment scaling approach, we've aggressively expanded the quality and diversity of agentic training environments in Qwen3.7: agentic capabilities generalize from diverse environments, just as language models do from diverse text. The figure below shows a clear and consistent improvement trajectory, with Qwen3.7-Max achieving a top-3 average ranking that approaches Claude-4.6-Opus-Max.

21d · Views 2.7K · Likes 36
Qwen@Alibaba_Qwen

Self-Evolving in the Wild: Over the course of ~35 hours of continuous autonomous execution, the model performed 432 kernel evaluations across 1,158 tool calls. It wrote, compiled, profiled, and iteratively improved the Extend Attention Kernel entirely on its own, reaching a 10.0x geometric mean speedup over the Triton reference, measured across multiple workloads.

More details on the blog: https://qwen.ai/blog?id=qwen3.7
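For readers unfamiliar with the metric, a "geometric mean speedup" across workloads can be sketched as below. The per-workload speedups are HYPOTHETICAL placeholders chosen so the arithmetic is easy to follow; the post does not list the individual workloads or their results. The run averages use the post's own figures, with "~35 hours" treated as exactly 35.

```python
import math


def geometric_mean(speedups):
    """Geometric mean of positive speedup factors (n-th root of their product)."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty sequence of positive numbers")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def run_averages(hours, evaluations, tool_calls):
    """Average tool calls and wall-clock minutes per kernel evaluation."""
    return {
        "tool_calls_per_evaluation": tool_calls / evaluations,
        "minutes_per_evaluation": hours * 60 / evaluations,
    }


# HYPOTHETICAL per-workload speedups over a reference kernel (not from the post).
HYPOTHETICAL_SPEEDUPS = [5.0, 10.0, 20.0]

# Figures from the post: ~35 hours, 432 kernel evaluations, 1,158 tool calls.
AVERAGES = run_averages(hours=35, evaluations=432, tool_calls=1158)

if __name__ == "__main__":
    gm = geometric_mean(HYPOTHETICAL_SPEEDUPS)
    am = sum(HYPOTHETICAL_SPEEDUPS) / len(HYPOTHETICAL_SPEEDUPS)
    print(f"geometric mean: {gm:.2f}x, arithmetic mean: {am:.2f}x")
    print(f"tool calls per evaluation: {AVERAGES['tool_calls_per_evaluation']:.2f}")
    print(f"minutes per evaluation: {AVERAGES['minutes_per_evaluation']:.1f}")
```

The geometric mean is the usual choice for averaging ratios because one outlier workload pulls it up less than it would an arithmetic mean. Python's standard library also provides `statistics.geometric_mean` for the same calculation.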

21d · Views 2.2K · Likes 25 · Bookmarks 1
Ahmad@TheAhmadOsman

@Sentdex “Google is new to the game, take it easy on them” 🤡

21d · Views 1.2K · Likes 32