
CMU's Beidi Chen releases Vortex, an agentic system that designs sparse attention kernels to speed up SGLang inference

The system boosted GLM-4.7-Flash throughput by up to 4.7x while matching full-attention accuracy.

Original post

Infini-AI-Lab@InfiniAILab

🌀 Introducing Vortex — sparse attention designed by AI agents, efficient at scale.

📈 Same accuracy, way more throughput, across every model we tried 👇
🔹 GLM-4.7-Flash (MLA) → 4.7× faster
🔹 MiniMax-M2.7 (229B) → 1.37× faster
🔹 Qwen3-1.7B (agent-discovered!) → 3.46× faster

🤖 How? An agent writes a flow in a few lines of Python; Vortex compiles it into fused kernels in a real serving stack (SGLang) and benchmarks it end-to-end.

🏗️ The design: a Python frontend (vFlow) over a page-centric tensor abstraction (vTensor) + a serving-integrated backend.

📄 https://arxiv.org/abs/2606.06453
💻 https://github.com/Infini-AI-Lab/vortex_torch
🌐 https://infini-ai-lab.github.io/vortex_torch/
📚 https://infini-ai-lab.github.io/vortex_torch/docs/

12:43 PM · Jun 5, 2026 · 53.6K Views

Sentiment

Users are excited that frontier AI agents can autonomously design sparse attention with Vortex, which delivers up to 4.7x higher throughput and scales to 229B-parameter models.

Pos: 100.0% · Neg: 0.0%

5 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 5K · Bookmarks: 23 · Likes: 39 · Retweets: 4

Banghua Zhu@BanghuaZ

Agents providing improvements in production environments with SGLang. Very cool work!

Infini-AI-Lab@InfiniAILab

[5/6] Does it scale? We went to 229B. 🏔️

At this size — MiniMax-M2.7 across 4× B200 (TP=4) — even *running* a sparse-attention experiment is basically impossible without Vortex.

With it: up to 1.37× faster on AIME26, accuracy even nudging *above* full attention. Sparse attention still pays off at the frontier of model size. 💪

Infini-AI-Lab@InfiniAILab

[2/6] What if the AI agent did the research? 🤔

Hand it Vortex + a goal, and it just goes: ✍️ write a flow → 🚀 compile to real kernels → 📊 benchmark → 🔁 improve → repeat.

18 hours later, fully on its own: 92 algorithms over 23 rounds, frontier pushed way out — 3.46× faster on AIME24 (3,437 → 11,894 tok/s), zero accuracy lost. 🔥
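
The write → compile → benchmark → improve loop described above can be illustrated with a toy search. This is a minimal sketch under loud assumptions, not Vortex's agent or API: `toy_benchmark`, `page_budget`, `FULL_PAGES`, and the accuracy model are all hypothetical stand-ins invented for illustration. Only the two throughput figures and the 23 rounds are taken from the post; 4 proposals per round is simply 92 / 23.

```python
from dataclasses import dataclass

BASELINE_TOK_S = 3437.0      # full-attention throughput quoted in the post
QUOTED_BEST_TOK_S = 11894.0  # agent-discovered throughput quoted in the post
FULL_PAGES = 64              # hypothetical: pages read by full attention in this toy


@dataclass(frozen=True)
class Result:
    page_budget: int
    tok_per_s: float
    accuracy: float


def toy_benchmark(page_budget: int) -> Result:
    """Hypothetical stand-in for 'compile to kernels, then benchmark end-to-end'."""
    tok_per_s = BASELINE_TOK_S * FULL_PAGES / page_budget  # fewer pages, faster
    accuracy = min(1.0, page_budget / 16)  # toy rule: accuracy drops below 16 pages
    return Result(page_budget, tok_per_s, accuracy)


def search(rounds: int, per_round: int, accuracy_floor: float):
    """Greedy loop: propose, benchmark, keep the fastest design that holds accuracy."""
    best = toy_benchmark(FULL_PAGES)
    tried = 0
    for _ in range(rounds):
        proposals = [max(1, best.page_budget - k) for k in range(1, per_round + 1)]
        results = [toy_benchmark(p) for p in proposals]
        tried += len(results)
        ok = [r for r in results if r.accuracy >= accuracy_floor]
        if ok:
            best = max([best, *ok], key=lambda r: r.tok_per_s)
    return best, tried


best, tried = search(rounds=23, per_round=4, accuracy_floor=1.0)
print(tried, best.page_budget, best.tok_per_s / BASELINE_TOK_S)
print(round(QUOTED_BEST_TOK_S / BASELINE_TOK_S, 2))  # 3.46, matching the post
```

The last line only checks the arithmetic behind the quoted speedup (11,894 / 3,437 ≈ 3.46). The toy search itself says nothing about which designs the real agent found.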

Infini-AI-Lab@InfiniAILab

[3/6] Not a fluke either. 🎯

Three different frontier agents — Claude Opus 4.7, Claude Sonnet 4.6, GPT-5 — each generate structurally diverse designs, and after a staged filtering pipeline, the selected ones are efficient: full-attention accuracy at 2–3.1× higher throughput across three benchmarks.

The design space is theirs to explore. 🤖
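
The post does not say what the stages of the filtering pipeline are. As a rough sketch of the general idea, one could run cheap checks first and expensive ones last. The three stages below (compiles, matches full-attention accuracy, beats full-attention throughput), the `Design` fields, and all numbers are assumptions made up for illustration, not measurements or the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    compiles: bool    # hypothetical stage 1: did the design build at all?
    accuracy: float   # hypothetical stage 2: benchmark accuracy
    tok_per_s: float  # hypothetical stage 3: end-to-end throughput


def staged_filter(designs, full_accuracy, full_tok_per_s, tolerance=0.0):
    """Apply cheap checks first, then keep designs that match accuracy and run faster."""
    stage1 = [d for d in designs if d.compiles]
    stage2 = [d for d in stage1 if d.accuracy >= full_accuracy - tolerance]
    stage3 = [d for d in stage2 if d.tok_per_s > full_tok_per_s]
    return sorted(stage3, key=lambda d: d.tok_per_s, reverse=True)


# Made-up toy candidates, chosen only to exercise each stage.
candidates = [
    Design("a", compiles=False, accuracy=0.9, tok_per_s=5000.0),
    Design("b", compiles=True, accuracy=0.5, tok_per_s=4000.0),
    Design("c", compiles=True, accuracy=0.6, tok_per_s=2000.0),
    Design("d", compiles=True, accuracy=0.62, tok_per_s=3100.0),
]
selected = staged_filter(candidates, full_accuracy=0.6, full_tok_per_s=1000.0)
print([d.name for d in selected])
```

Ordering the stages from cheapest to most expensive means a design that fails to build never costs a full benchmark run.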

Infini-AI-Lab@InfiniAILab

[4/6] New architecture? No problem. 🧬

MLA (the attention behind DeepSeek & GLM) is tricky — it squeezes the KV cache into one shared latent. In Vortex, we sketched a rope-aware block sparse attention for it in a few lines.

GLM-4.7-Flash result: up to 4.7× faster, matching full attention on mean@16, pass@4 & pass@8. ⚡
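
For readers new to block sparse attention, here is a minimal pure-Python sketch of the generic idea: group keys into fixed-size pages, score each page cheaply, and run softmax attention only over the top-scoring pages. This is not the rope-aware MLA design from the post and not Vortex code. It ignores the shared latent and RoPE entirely, and the mean-key page score, function names, and parameters are assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Dense softmax attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def page_sparse_attend(query, keys, values, page_size, top_k):
    """Attend only to the top_k pages, ranked by query . mean(keys in page)."""
    pages = [range(i, min(i + page_size, len(keys)))
             for i in range(0, len(keys), page_size)]

    def page_score(page):
        mean_key = [sum(keys[i][d] for i in page) / len(page)
                    for d in range(len(query))]
        return dot(query, mean_key)

    chosen = sorted(pages, key=page_score, reverse=True)[:top_k]
    idx = sorted(i for page in chosen for i in page)  # keep original token order
    return attend(query, [keys[i] for i in idx], [values[i] for i in idx])


# Four tokens in two pages; the query aligns with the first page only.
q = [1.0, 0.0]
K = [[5.0, 0.0], [5.0, 0.0], [-5.0, 0.0], [-5.0, 0.0]]
V = [[1.0], [1.0], [0.0], [0.0]]
print(page_sparse_attend(q, K, V, page_size=2, top_k=1))
```

With `top_k` equal to the number of pages, the function reduces to dense attention, which is a handy sanity check for any sparse variant.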

Infini-AI-Lab@InfiniAILab

[6/6] The big picture 👇

Sparse-attention research should be a loop an agent can run on its own. The hard part was never the idea — it was turning math into fast, production-ready kernels. Vortex removes that wall.

When trying an algorithm is as easy as describing it, humans + AI agents can co-discover the next generation of efficient attention. 🌀

Thanks to @chenzhuoming911, @XinruiZhongx, Qilong Feng, @RJ_Sadhukhan, @IronSteveZhou, @michaelqshieh, @JiaZhihao, @BeidiChen. Come build it with us 👇
