Together AI increases MiniMax-M3 inference throughput by up to 125% using custom sparse attention kernels

The optimizations target agentic traffic with 1M-token context windows

Original post

Together AI (@togethercompute)

MiniMax-M3 combines 1M context, native multimodality, and MiniMax Sparse Attention.

The next layer is serving it efficiently: KV-block-major sparse attention, paged MSA decode, optimized index scoring, and multimodal preprocessing before the GPU worker.

Together’s Inference and Kernel teams improved throughput by 81–125% across common agentic-shape traffic.

We go deeper in this deep dive from @ywangfirstlean, @zhyncs42, @realDanFu and the team.

Article by Together AI (@togethercompute): http://x.com/i/article/2061891247762026496

12:38 PM · Jun 2, 2026 · 5.1K Views
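
The post names its techniques without describing them. As a rough illustration of what scoring KV blocks and then decoding over only the selected pages of a paged cache can look like, here is a generic NumPy sketch. It is not Together's kernels and not MiniMax Sparse Attention itself: the page size, the mean-key block score, the top-k selection rule, and every name below (`BLOCK`, `sparse_decode`, `block_table`) are assumptions made for the example.

```python
import numpy as np

BLOCK = 4  # tokens per KV page (hypothetical; real systems use larger pages)


def sparse_decode(q, k_pages, v_pages, block_table, top_k):
    """One decode step for a single head: score blocks, keep top_k, attend.

    q:           (d,) query vector for the token being generated
    k_pages:     (num_pages, BLOCK, d) physical key pages
    v_pages:     (num_pages, BLOCK, d) physical value pages
    block_table: (n_blocks,) int array, logical block index -> physical page id
    """
    d = q.shape[0]
    # Index scoring (assumed form): one cheap score per logical block,
    # using the mean key of the block as its summary.
    summaries = k_pages[block_table].mean(axis=1)   # (n_blocks, d)
    scores = summaries @ q                          # (n_blocks,)
    # Keep the top_k highest-scoring blocks, restored to logical order.
    keep = np.sort(np.argsort(scores)[-top_k:])
    pages = block_table[keep]
    # Gather only the selected pages, then run ordinary softmax attention.
    k = k_pages[pages].reshape(-1, d)
    v = v_pages[pages].reshape(-1, d)
    logits = (k @ q) / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v, keep


# Usage: a sequence of 4 logical blocks scattered over 6 physical pages.
rng = np.random.default_rng(0)
d, num_pages = 8, 6
k_pages = rng.standard_normal((num_pages, BLOCK, d))
v_pages = rng.standard_normal((num_pages, BLOCK, d))
block_table = np.array([5, 2, 0, 3])
q = rng.standard_normal(d)
out, kept_blocks = sparse_decode(q, k_pages, v_pages, block_table, top_k=2)
```

With `top_k` equal to the number of blocks, the function reduces to dense attention over the whole sequence. The saving in a sparse scheme of this kind comes from reading only the selected pages at each decode step, which matters most when the context is very long.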
