/AI · 16h ago

InfiniAI Lab releases Vortex, an agent-designed sparse attention framework that accelerates LLM inference throughput by up to 4.7x

It compiles agent-generated attention flows into SGLang kernels.

Original post

Ying Sheng#608

Infini-AI-Lab@InfiniAILab

🌀 Introducing Vortex — sparse attention designed by AI agents, efficient at scale.

📈 Same accuracy, way more throughput, across every model we tried 👇
🔹 GLM-4.7-Flash (MLA) → 4.7× faster
🔹 MiniMax-M2.7 (229B) → 1.37× faster
🔹 Qwen3-1.7B (agent-discovered!) → 3.46× faster

🤖 How? An agent writes a flow in a few lines of Python; Vortex compiles it into fused kernels in a real serving stack (SGLang) and benchmarks it end-to-end.

🏗️ The design: a Python frontend (vFlow) over a page-centric tensor abstraction (vTensor) + a serving-integrated backend.

📄 https://arxiv.org/abs/2606.06453
💻 https://github.com/Infini-AI-Lab/vortex_torch
🌐 https://infini-ai-lab.github.io/vortex_torch/
📚 https://infini-ai-lab.github.io/vortex_torch/docs/

12:43 PM · Jun 5, 2026 · 51.4K Views
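
To make the "page-centric" idea in the post concrete, here is a minimal sketch of page-level top-k sparse attention in plain Python. It is not Vortex's vFlow or vTensor API; the function names, the mean-key page summary, and the `page_size` and `top_pages` parameters are all assumptions chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dense softmax attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def page_sparse_attend(query, keys, values, page_size, top_pages):
    """Attend only over the highest-scoring pages of the KV cache.

    Each page is summarized by its mean key (an assumed heuristic, not
    necessarily what any Vortex flow does).
    """
    n_pages = math.ceil(len(keys) / page_size)
    page_scores = []
    for p in range(n_pages):
        page_keys = keys[p * page_size:(p + 1) * page_size]
        centroid = [sum(col) / len(page_keys) for col in zip(*page_keys)]
        page_scores.append((dot(query, centroid), p))
    # Keep the top pages, then restore cache order.
    chosen = sorted(p for _, p in sorted(page_scores, reverse=True)[:top_pages])
    idx = [i for p in chosen
           for i in range(p * page_size, min((p + 1) * page_size, len(keys)))]
    out = attend(query, [keys[i] for i in idx], [values[i] for i in idx])
    return out, chosen
```

Selecting every page reproduces dense attention exactly, so the sparsity budget (`top_pages` here) is the knob that trades accuracy for throughput.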

Sentiment

Users are excited that AI agents can autonomously design sparse attention with Vortex, which delivers up to 4.7x faster inference and scales to a 229B-parameter model.

Pos 100.0% · Neg 0.0%

5 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 39.5K

chen zhuoming@chenzhuoming911

Vortex builds a layer that lets current AI agents make themselves more efficient, if you view AI agents as a species. We will soon see AI agents programming FPGAs directly, defining efficient and expressive architectures, writing efficient low-level implementations, and training themselves. And I do not know what can stop this. At that point, whether AI can conduct research on something novel (rather than simply replicate what humans can already do) will determine whether 2025-2026 is a real genesis moment.

14h · 39.5K · 64

BOOKMARKS 23 · LIKES 39

Banghua Zhu@BanghuaZ

Agents providing improvements in production environments with SGLang. Very cool work!

10h · 4.4K · 3923

RETWEETS 7 · REPLIES 1

Beidi Chen@BeidiChen

📢Vortex feels like an early glimpse of recursive self-improvement for ML systems: agents discovering better architectures, compiling them into real serving stacks, and shifting the bottleneck from "can it code?" to "can we still steer it?"

AI building AI just went from theory to benchmark chart.

16h · 4.3K · 2616

Infini-AI-Lab@InfiniAILab

[2/6] What if the AI agent did the research? 🤔

Hand it Vortex + a goal, and it just goes: ✍️ write a flow → 🚀 compile to real kernels → 📊 benchmark → 🔁 improve → repeat.

18 hours later, fully on its own: 92 algorithms over 23 rounds, frontier pushed way out — 3.46× faster on AIME24 (3,437 → 11,894 tok/s), zero accuracy lost. 🔥
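
The write → compile → benchmark → improve loop above can be sketched as a simple feedback search. Everything here is a toy stand-in: `propose` plays the agent, `benchmark` plays the compile-and-measure step, and the throughput and accuracy model is invented for illustration rather than taken from Vortex. The `speedup` helper only checks the arithmetic of the figures quoted in the post.

```python
import random

def speedup(baseline_tok_s, new_tok_s):
    """Throughput ratio of a new design over the baseline."""
    return new_tok_s / baseline_tok_s

def propose(best, rng):
    """Hypothetical agent step: explore at random, then refine the best design."""
    if best is None:
        return {"top_pages": rng.randint(1, 16)}
    step = rng.choice([-1, 0, 1])
    return {"top_pages": max(1, best[2]["top_pages"] + step)}

def benchmark(candidate):
    """Hypothetical compile + benchmark step using a toy cost model:
    fewer pages means higher throughput but, below 8 pages, lower accuracy."""
    pages = candidate["top_pages"]
    throughput = 12000.0 / pages
    accuracy = min(pages, 8) / 8
    return throughput, accuracy

def search(rounds, per_round, min_accuracy, seed=0):
    """Keep the fastest candidate that still meets the accuracy floor."""
    rng = random.Random(seed)
    best = None
    tried = 0
    for _ in range(rounds):
        for _ in range(per_round):
            cand = propose(best, rng)
            tput, acc = benchmark(cand)
            tried += 1
            if acc >= min_accuracy and (best is None or tput > best[0]):
                best = (tput, acc, cand)
    return best, tried

# Figures quoted in the post: 3,437 -> 11,894 tok/s.
print(round(speedup(3437, 11894), 2))
# 23 rounds of 4 candidates each gives the 92 algorithms mentioned above.
best, tried = search(rounds=23, per_round=4, min_accuracy=1.0)
print(tried, best)
```

The four-candidates-per-round split is an assumption; the post gives only the totals (92 algorithms, 23 rounds).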

16h · 36

Infini-AI-Lab@InfiniAILab

[3/6] Not a fluke either. 🎯

Three different frontier agents — Claude Opus 4.7, Claude Sonnet 4.6, GPT-5 — each generate structurally diverse designs, and after a staged filtering pipeline, the selected ones are efficient: full-attention accuracy at 2–3.1× higher throughput across three benchmarks.
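
The post does not spell out the stages of the filtering pipeline, so the following is only one plausible shape for it: drop designs that fail to compile, drop those that fall below full-attention accuracy, then rank the survivors by throughput. The stage order, field names, and sample numbers are all hypothetical.

```python
def staged_filter(candidates, baseline_accuracy, tolerance=0.0, keep=2):
    """Three assumed stages: compile check, accuracy gate, throughput ranking."""
    stage1 = [c for c in candidates if c["compiles"]]
    stage2 = [c for c in stage1
              if c["accuracy"] >= baseline_accuracy - tolerance]
    stage3 = sorted(stage2, key=lambda c: c["throughput"], reverse=True)
    return stage3[:keep]

# Made-up candidates, for illustration only.
candidates = [
    {"name": "a", "compiles": False, "accuracy": 0.90, "throughput": 9000.0},
    {"name": "b", "compiles": True, "accuracy": 0.50, "throughput": 8000.0},
    {"name": "c", "compiles": True, "accuracy": 0.80, "throughput": 5000.0},
    {"name": "d", "compiles": True, "accuracy": 0.80, "throughput": 7000.0},
    {"name": "e", "compiles": True, "accuracy": 0.85, "throughput": 6000.0},
]
selected = staged_filter(candidates, baseline_accuracy=0.80, keep=2)
print([c["name"] for c in selected])
```

Putting the cheap checks first means the expensive end-to-end benchmark only has to rank designs that already match the baseline.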

The design space is theirs to explore. 🤖

16h · 30

Infini-AI-Lab@InfiniAILab

[4/6] New architecture? No problem. 🧬

MLA (the attention behind DeepSeek & GLM) is tricky: it squeezes the KV cache into one shared latent. In Vortex, we sketched a RoPE-aware block-sparse attention for it in a few lines.

GLM-4.7-Flash result: up to 4.7× faster, matching full attention on mean@16, pass@4 & pass@8. ⚡

16h · 19

Infini-AI-Lab@InfiniAILab

[5/6] Does it scale? We went to 229B. 🏔️

At this size — MiniMax-M2.7 across 4× B200 (TP=4) — even *running* a sparse-attention experiment is basically impossible without Vortex.

With it: up to 1.37× faster on AIME26, accuracy even nudging *above* full attention. Sparse attention still pays off at the frontier of model size. 💪

16h · 17

Infini-AI-Lab@InfiniAILab

[6/6] The big picture 👇

Sparse-attention research should be a loop an agent can run on its own. The hard part was never the idea — it was turning math into fast, production-ready kernels. Vortex removes that wall.

When trying an algorithm is as easy as describing it, humans + AI agents can co-discover the next generation of efficient attention. 🌀

Thanks to @chenzhuoming911, @XinruiZhongx, Qilong Feng, @RJ_Sadhukhan, @IronSteveZhou, @michaelqshieh, @JiaZhihao, @BeidiChen. Come build it with us 👇

16h · 17