InfiniAI Lab releases Vortex, an agent-designed sparse attention framework that raises LLM inference throughput by up to 4.7x

It compiles agent-written attention flows into fused kernels that run inside SGLang.

Original post
Infini-AI-Lab@InfiniAILab

🌀 Introducing Vortex: sparse attention designed by AI agents, efficient at scale.

📈 Same accuracy, way more throughput, across every model we tried 👇
🔹 GLM-4.7-Flash (MLA) → 4.7× faster
🔹 MiniMax-M2.7 (229B) → 1.37× faster
🔹 Qwen3-1.7B (agent-discovered!) → 3.46× faster

🤖 How? An agent writes a flow in a few lines of Python; Vortex compiles it into fused kernels in a real serving stack (SGLang) and benchmarks it end-to-end.

🏗️ The design: a Python frontend (vFlow) over a page-centric tensor abstraction (vTensor) + a serving-integrated backend.

📄 https://arxiv.org/abs/2606.06453
💻 https://github.com/Infini-AI-Lab/vortex_torch
🌐 https://infini-ai-lab.github.io/vortex_torch/
📚 https://infini-ai-lab.github.io/vortex_torch/docs/

12:43 PM · Jun 5, 2026 · 51.4K Views
Sentiment

Users are excited that AI agents can autonomously design sparse attention with Vortex, which delivers up to 4.7x faster inference and scales to a 229B-parameter model.

Positive: 100.0% · Negative: 0.0% (5 comments with sentiment)
Cluster Engagement
Posts from X
chen zhuoming@chenzhuoming911

Vortex builds a layer for the current AI agents to make themselves more efficient, if you view AI agents as a species. We are soon to see AI agents directly program on an FPGA, defining efficient and expressive architecture, writing efficient low-level implementations, and training themselves. And, I do not know what can stop this. At that moment, whether AI can conduct research on something novel (rather than simply replicate what humans can already do) will determine whether 2025-2026 is a real genesis moment.

14h · Views 39.5K · Likes 6 · Bookmarks 4
Banghua Zhu@BanghuaZ

Agents providing improvements in production environments with SGLang. Very cool work!

10h · Views 4.4K · Likes 39 · Bookmarks 23
Retweets 7 · Replies 1
Beidi Chen@BeidiChen

📢 Vortex feels like an early glimpse of recursive self-improvement for ML systems: agents discovering better architectures, compiling them into real serving stacks, and shifting the bottleneck from "can it code?" to "can we still steer it?"

AI building AI just went from theory to benchmark chart.

16h · Views 4.3K · Likes 26 · Bookmarks 16
Infini-AI-Lab@InfiniAILab

[2/6] What if the AI agent did the research? 🤔

Hand it Vortex + a goal, and it just goes: ✍️ write a flow → 🚀 compile to real kernels → 📊 benchmark → 🔁 improve → repeat.

18 hours later, fully on its own: 92 algorithms over 23 rounds, with the frontier pushed way out. 3.46× faster on AIME24 (3,437 → 11,894 tok/s), zero accuracy lost. 🔥

16h · Views 36
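
The write, compile, benchmark, improve loop in [2/6] can be sketched roughly as below. This is a minimal illustration under stated assumptions, not Vortex's API: `propose` and `benchmark` are hypothetical stand-ins for the agent writing a flow and for compiling and benchmarking it, the random scores are placeholders rather than measurements, and the even split of 4 candidates per round is assumed only because 92 / 23 = 4. The accuracy/throughput frontier bookkeeping and the speedup arithmetic are the parts meant literally.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    accuracy: float    # task accuracy in [0, 1]
    throughput: float  # tokens per second


def propose(round_idx, per_round):
    """Hypothetical stand-in for the agent writing new flows."""
    return [f"flow_r{round_idx}_{i}" for i in range(per_round)]


def benchmark(name, rng):
    """Hypothetical stand-in for compile + end-to-end benchmark (random placeholder scores)."""
    return Result(name, rng.uniform(0.5, 0.8), rng.uniform(3000.0, 12000.0))


def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.throughput >= b.throughput
    better = a.accuracy > b.accuracy or a.throughput > b.throughput
    return no_worse and better


def update_frontier(frontier, new):
    """Keep only non-dominated results (the accuracy/throughput Pareto frontier)."""
    if any(dominates(old, new) for old in frontier):
        return frontier
    return [old for old in frontier if not dominates(new, old)] + [new]


def search(rounds, per_round, seed=0):
    """Run the propose -> benchmark -> keep-the-frontier loop."""
    rng = random.Random(seed)
    frontier, evaluated = [], 0
    for round_idx in range(rounds):
        for name in propose(round_idx, per_round):
            frontier = update_frontier(frontier, benchmark(name, rng))
            evaluated += 1
    return frontier, evaluated


def speedup(baseline_tps, new_tps):
    """Throughput ratio, e.g. the tok/s figures quoted in the post."""
    return new_tps / baseline_tps
```

With the numbers from the post, `speedup(3437, 11894)` rounds to 3.46, and `search(23, 4)` evaluates 92 candidates.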
Infini-AI-Lab@InfiniAILab

[3/6] Not a fluke either. 🎯

Three different frontier agents (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5) each generate structurally diverse designs, and after a staged filtering pipeline, the selected ones are efficient: full-attention accuracy at 2–3.1× higher throughput across three benchmarks.

The design space is theirs to explore. 🤖

16h · Views 30
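
The post does not say what the stages of the filtering pipeline are. As a sketch only, one plausible shape is cheap checks first and expensive ones last: drop designs that fail to compile, then those below full-attention accuracy, then those that are no faster than full attention. The `Candidate` fields, the stage order, and the `acc_tol` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    compiles: bool
    accuracy: float    # task accuracy in [0, 1]
    throughput: float  # tokens per second


def staged_filter(cands, full_acc, full_tps, acc_tol=0.0):
    """Hypothetical three-stage filter; each stage only sees survivors of the previous one."""
    # Stage 1 (cheapest): the design must compile at all.
    stage1 = [c for c in cands if c.compiles]
    # Stage 2: accuracy must match full attention within the tolerance.
    stage2 = [c for c in stage1 if c.accuracy >= full_acc - acc_tol]
    # Stage 3: throughput must beat the full-attention baseline.
    stage3 = [c for c in stage2 if c.throughput > full_tps]
    # Fastest survivors first.
    return sorted(stage3, key=lambda c: c.throughput, reverse=True)
```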
Infini-AI-Lab@InfiniAILab

[4/6] New architecture? No problem. 🧬

MLA (the attention behind DeepSeek & GLM) is tricky: it squeezes the KV cache into one shared latent. In Vortex, we sketched a RoPE-aware block-sparse attention for it in a few lines.

GLM-4.7-Flash result: up to 4.7× faster, matching full attention on mean@16, pass@4 & pass@8. ⚡

16h · Views 19
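
For readers new to block-sparse attention, here is a generic single-query sketch in plain Python. It is not the RoPE-aware MLA design from the post and not Vortex code. The page-scoring rule (query dotted with each page's mean key) is one common heuristic chosen for illustration, and the names `page_size` and `top_k` are assumptions. The idea is that keys are grouped into fixed-size pages, each page gets a cheap score, and exact attention runs only over the top-scoring pages.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Dense scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def block_sparse_attend(q, keys, values, page_size, top_k):
    """Score pages cheaply, then attend only over the top_k pages.

    The scoring rule is an illustrative assumption, not the method from the post.
    """
    n_pages = math.ceil(len(keys) / page_size)
    pages = [
        range(p * page_size, min((p + 1) * page_size, len(keys)))
        for p in range(n_pages)
    ]

    def page_score(idx):
        # Cheap proxy: query dotted with the mean key of the page.
        mean_key = [sum(keys[i][d] for i in idx) / len(idx) for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(n_pages), key=lambda p: page_score(pages[p]), reverse=True)
    chosen = sorted(ranked[:top_k])  # restore original token order
    selected = [i for p in chosen for i in pages[p]]
    out = attend(q, [keys[i] for i in selected], [values[i] for i in selected])
    return out, chosen
```

Setting `top_k` to the number of pages recovers dense attention, which gives a simple correctness check for any sparse variant.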
Infini-AI-Lab@InfiniAILab

[5/6] Does it scale? We went to 229B. 🏔️

At this size (MiniMax-M2.7 across 4× B200, TP=4), even *running* a sparse-attention experiment is basically impossible without Vortex.

With it: up to 1.37× faster on AIME26, accuracy even nudging *above* full attention. Sparse attention still pays off at the frontier of model size. 💪

16h · Views 17
Infini-AI-Lab@InfiniAILab

[6/6] The big picture 👇

Sparse-attention research should be a loop an agent can run on its own. The hard part was never the idea; it was turning math into fast, production-ready kernels. Vortex removes that wall.

When trying an algorithm is as easy as describing it, humans + AI agents can co-discover the next generation of efficient attention. 🌀

Thanks to @chenzhuoming911, @XinruiZhongx, Qilong Feng, @RJ_Sadhukhan, @IronSteveZhou, @michaelqshieh, @JiaZhihao, @BeidiChen. Come build it with us 👇

16h · Views 17