Dan Fu, Together AI VP of Kernels, releases ParallelKernelBench to evaluate LLMs on writing multi-GPU kernels

It assesses complex parallel constraints omitted by single-GPU benchmarks.

Original post

Nathan@asplencmnt

Excited to release ParallelKernelBench (PKB), a benchmark for measuring LLMs’ ability to write fast multi-GPU kernels! 😀

Multi-GPU kernel generation compounds several hard problems:

- a large parallelism design space
- a new communication axis to optimize
- and hardware-specific decisions around communication mechanisms

Existing kernel-generation benchmarks mostly target single-GPU workloads, so we built PKB to cover real-world multi-GPU workloads (many of which do not have existing optimized solutions). 🧵👇

1:24 PM · Jun 23, 2026 · 2.9K Views

Sentiment

Some users express gratitude to coauthors, collaborators, and mentors for releasing ParallelKernelBench to test LLMs on multi-GPU kernel generation.

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)

Posts from X

Dan Fu@realDanFu

Excited to release PKB Parallel Kernel Bench, led by Willy Chan and Nathan Paek @asplencmnt!!

A benchmark of mostly net-new multi-GPU kernel problems (solutions are independently useful for real-world workloads).

Together AI@togethercompute

LLMs write fast single-GPU kernels. Ask for a multi-GPU one and they fall apart.

ParallelKernelBench measures how they fail by benchmarking against 87 problems pulled from real codebases including Megatron-LM, DeepSpeed, DeepEP, TensorRT-LLM, NeMo-RL.

New research from Willy Chan @asplencmnt @simonguozirui @simran_s_arora and @realDanFu

Replies (2)

Dan Fu@realDanFu

Fun fact - this is (my) first paper that I've posted on @askalphaxiv instead of arXiv. arXiv's moderation policies have become increasingly onerous (this one on hold for a month) - alphaXiv has a lot more features and is becoming my go-to!

Auto-blog https://www.alphaxiv.org/overview/2606.parallel-kernel-bench

Dan Fu@realDanFu

Highlight - we built this benchmark to be largely net new multi-GPU problems. We provide the problem definition and reference PyTorch code, but no kernel solution to train against.

We were able to generate some new kernels that would have been entire papers during my PhD!

Together AI@togethercompute

But why are multi-GPU kernels so different? Single-GPU kernels are bottlenecked by compute and memory bandwidth. Multi-GPU kernels by the interconnect.

Each PKB task hands the model a PyTorch + NCCL reference and asks it to communicate directly across GPUs via symmetric memory.
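To make the interconnect point concrete, here is a back-of-the-envelope sketch in Python. It compares the bandwidth-only cost of a ring all-reduce with the cost of reading the same buffer once from local GPU memory. The cost model is the standard ring all-reduce estimate and ignores latency; every bandwidth figure and buffer size below is a hypothetical placeholder, not a measurement from PKB or from any specific GPU.

```python
def ring_allreduce_seconds(num_bytes: float, num_gpus: int, link_bytes_per_s: float) -> float:
    """Bandwidth-only ring all-reduce estimate (latency ignored).

    Each GPU sends roughly 2 * (N - 1) / N of the buffer over its link:
    one reduce-scatter pass plus one all-gather pass.
    """
    if num_gpus < 2:
        return 0.0  # nothing to communicate on a single GPU
    return 2 * (num_gpus - 1) / num_gpus * num_bytes / link_bytes_per_s


def local_read_seconds(num_bytes: float, mem_bytes_per_s: float) -> float:
    """Time to stream the buffer once from local device memory."""
    return num_bytes / mem_bytes_per_s


# Hypothetical placeholder values, chosen only to illustrate the ratio.
BUFFER_BYTES = 1 << 30   # 1 GiB buffer
LINK_BW = 400e9          # assumed per-GPU interconnect bandwidth, bytes/s
MEM_BW = 3e12            # assumed local memory bandwidth, bytes/s
NUM_GPUS = 8

comm = ring_allreduce_seconds(BUFFER_BYTES, NUM_GPUS, LINK_BW)
local = local_read_seconds(BUFFER_BYTES, MEM_BW)
ratio = comm / local

print(f"all-reduce: {comm * 1e3:.2f} ms, local read: {local * 1e3:.2f} ms, ratio: {ratio:.1f}x")
```

Under these assumed numbers the communication term is roughly an order of magnitude larger than the local read, which is why overlapping or shrinking communication matters more than shaving local memory traffic.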

Together AI@togethercompute

Frontier models struggle.

→ Best zero-shot: 28/87 correct, 22 beat the PyTorch + NCCL baseline
→ With 3 attempts: 36/87 correct, but fast1@3 tops out at 31%

Weak models fail to compile. Strong reasoners compile cleanly and return wrong answers.
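The counts above can be turned into rates with a short Python sketch. It assumes that fast_p follows the KernelBench-style definition (the fraction of problems whose kernel is both correct and more than p times faster than the baseline) and that "@3" means best of three attempts; the thread does not spell out either definition, and the `Attempt` and `fast_p_at_k` names are hypothetical, not taken from the PKB code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Attempt:
    correct: bool
    speedup: float  # baseline_time / candidate_time; only meaningful when correct


def fast_p_at_k(results: Sequence[Sequence[Attempt]], p: float = 1.0) -> float:
    """Fraction of problems where any attempt is correct AND has speedup > p.

    Assumed definition (KernelBench-style fast_p, best-of-k over attempts).
    """
    if not results:
        return 0.0
    hits = sum(
        any(a.correct and a.speedup > p for a in attempts) for attempts in results
    )
    return hits / len(results)


def percent(count: int, total: int = 87) -> float:
    """Convert a raw problem count into a percentage of the 87 PKB problems."""
    return 100.0 * count / total


# Rates implied by the counts quoted in the thread.
print(f"zero-shot correct:      {percent(28):.1f}%")
print(f"zero-shot beats NCCL:   {percent(22):.1f}%")
print(f"correct with 3 tries:   {percent(36):.1f}%")

# Toy data (made up) showing how the metric separates "correct" from "fast".
toy = [
    [Attempt(True, 1.5)],                       # correct and faster
    [Attempt(True, 0.8)],                       # correct but slower than baseline
    [Attempt(False, 0.0)],                      # wrong answer
    [Attempt(False, 0.0), Attempt(True, 1.2)],  # second attempt succeeds
]
toy_fast1 = fast_p_at_k(toy, p=1.0)
print(f"toy fast_1: {toy_fast1:.2f}")
```

The first two rates round to about 32% and 25%, which matches the single-shot figures quoted elsewhere in the thread.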

Dan Fu@realDanFu

Check out Nathan's @asplencmnt thread below or the Together for more details, including detailed error analysis of where models go wrong, and the relative frontier of where closed vs. open models are today.

Nathan@asplencmnt

Excited to release ParallelKernelBench (PKB), a benchmark for measuring LLMs’ ability to write fast multi-GPU kernels! 😀

Multi-GPU kernel generation compounds several hard problems:

- a large parallelism design space
- a new communication axis to optimize
- and hardware-specific decisions around communication mechanisms

Existing kernel-generation benchmarks mostly target single-GPU workloads, so we built PKB to cover real-world multi-GPU workloads (many of which do not have existing optimized solutions). 🧵👇

Dan Fu@realDanFu

@askalphaxiv PKB of course builds on great work from friends and collaborators - the original KernelBench from @simonguozirui and @anneouyang, as well as great work on the fundamentals like ParallelKittens by @stuart_sul.

Excited to push on this frontier!

Simon Guo@simonguozirui

okay a much more accurate picture

Check it out at https://github.com/togethercomputer/ParallelKernelBench

Dan Fu@realDanFu

Check out the code for examples and details on running it: https://github.com/togethercomputer/ParallelKernelBench
Blog: https://www.together.ai/blog/parallelkernelbench
And paper: https://www.alphaxiv.org/abs/2606.parallel-kernel-bench

Ryan Yang-Liu@ryanyang0

@simonguozirui @khshind thoughts on https://uccl-project.github.io/posts/commbench/ ?

Nathan@asplencmnt

That said, a few models did find solutions faster than the original repo code, producing net-new kernels! The example below speeds up a real vision workload.

Nathan@asplencmnt

Each problem starts from a standard PyTorch + NCCL implementation and a description of the hardware topology. We task LLMs with replacing that reference with CUDA that communicates directly across GPUs using symmetric memory.
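The task shape described here can be illustrated with a CPU-only toy in Python. It is a simulation under stated assumptions, not real CUDA and not any PyTorch or NCCL API: the "reference" stands in for a collective all-reduce, and the "candidate" shows what one rank does when every peer's buffer is directly addressable, which is the idea behind symmetric memory. All function names are hypothetical.

```python
from typing import List

Buffer = List[float]


def reference_all_reduce(rank_buffers: List[Buffer]) -> List[Buffer]:
    """Stand-in for the collective reference: every rank ends with the elementwise sum."""
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]


def one_shot_all_reduce(symmetric: List[Buffer], rank: int) -> Buffer:
    """Simulated per-rank work when peer buffers are directly readable.

    The rank reads each peer's buffer and accumulates locally instead of
    calling a collective. A real kernel would need a cross-GPU barrier
    before these reads; this simulation is sequential, so none is needed.
    """
    world = len(symmetric)
    out = [0.0] * len(symmetric[rank])
    for step in range(world):
        # Stagger the starting peer so ranks do not all read the same peer first.
        peer = (rank + step) % world
        for i, value in enumerate(symmetric[peer]):
            out[i] += value
    return out


def matches(candidate: Buffer, expected: Buffer, tol: float = 1e-6) -> bool:
    """Elementwise comparison with a tolerance, as a correctness check would do."""
    return len(candidate) == len(expected) and all(
        abs(c - e) <= tol for c, e in zip(candidate, expected)
    )


# Four simulated ranks, each holding a two-element buffer (made-up values).
buffers = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0], [1000.0, 2000.0]]
expected = reference_all_reduce(buffers)
all_ok = all(
    matches(one_shot_all_reduce(buffers, r), expected[r]) for r in range(len(buffers))
)
print("candidate matches reference on every rank:", all_ok)
```

The toy also hints at where generated kernels can go wrong: the result is only correct if every peer's data is in place before it is read, which is a rank-coordination problem rather than a syntax problem.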

Nathan@asplencmnt

We curated PKB’s 87 problems from open-source production repos such as Megatron-LM, DeepEP, TensorRT-LLM, NeMo-RL, with wide coverage over parallelism types — TP, DP, CP, EP, FSDP/ZeRO, etc, and combinations of the above.

Yaroslav Bulatov@yaroslavvb

@realDanFu @askalphaxiv cc @tdietterich

Nathan@asplencmnt

We evaluated frontier LLMs and found they struggle: single-shot correctness plateaus at 32%, and only 25% of cases beat an unoverlapped PyTorch+NCCL baseline.

Santosh Mohan@theycallmeMohan

@simonguozirui @realDanFu @khshind You don’t need to write CUDA to write Ring attention though

Simon Guo@simonguozirui

Struggling to write Ring Attention on TPUs/GPUs with @khshind was one of the original motivations for KernelBench 😅

It feels full circle with ParallelKernelBench — a dedicated eval to see whether LLMs can write fast multi-GPU kernels 📡

Introducing the latest KernelBench family member: PKB, led by awesome undergrad researchers @opengroundsFX & @NathanPaek9368! (+ the always amazing @simran_s_arora @realDanFu for their guidance 🙏)

Together AI@togethercompute

An agentic loop (compile, test, profile, revise) helps. Gemini 3 Pro went from 24 to 35/87 correct, then plateaued after ~20 steps.

Feedback fixes syntax, not rank coordination, collective ordering, or transfer-mechanism choice. TMA and NVLS stay almost unused.
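A compile, test, profile, revise loop of the kind described can be sketched in Python as follows. This is a minimal sketch, not the harness used in the paper: the `Feedback` fields, the `refine` function, and the toy `generate`/`evaluate` stubs are all hypothetical, and the 20-step default simply echoes the plateau mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Feedback:
    compiled: bool
    correct: bool
    runtime_ms: Optional[float]  # None unless the kernel compiled and passed tests
    log: str


def refine(
    generate: Callable[[Optional[str], Optional[Feedback]], str],
    evaluate: Callable[[str], Feedback],
    max_steps: int = 20,
) -> Tuple[Optional[str], Optional[float]]:
    """Run the revise loop and return the fastest correct kernel seen, if any."""
    best_src: Optional[str] = None
    best_ms: Optional[float] = None
    src: Optional[str] = None
    fb: Optional[Feedback] = None
    for _ in range(max_steps):
        src = generate(src, fb)  # revise using the previous source and feedback
        fb = evaluate(src)       # compile, test, and profile the new candidate
        if fb.compiled and fb.correct and fb.runtime_ms is not None:
            if best_ms is None or fb.runtime_ms < best_ms:
                best_src, best_ms = src, fb.runtime_ms
    return best_src, best_ms


# Toy stubs standing in for an LLM call and a real build/test/profile harness.
def toy_generate(prev_src: Optional[str], fb: Optional[Feedback]) -> str:
    version = 0 if prev_src is None else int(prev_src.split("_v")[1]) + 1
    return f"kernel_v{version}"


def toy_evaluate(src: str) -> Feedback:
    version = int(src.split("_v")[1])
    if version == 0:
        return Feedback(False, False, None, "compile error")
    if version == 1:
        return Feedback(True, False, None, "output mismatch on rank 1")
    # Runtime improves for a few versions, then stops improving at 4.0 ms.
    return Feedback(True, True, max(4.0, 8.0 - version), "ok")


best = refine(toy_generate, toy_evaluate, max_steps=6)
print(best)
```

Keeping the best correct candidate separately from the latest one matters in a loop like this, because a later revision can regress in speed or correctness.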

Nathan@asplencmnt

Super grateful to my coauthor Willy Chan, collaborator @simonguozirui, and awesome mentors @simran_s_arora and @realDanFu! And huge thanks to @togethercompute for making the project happen! Check out PKB here:

Blog 🌐: https://www.together.ai/blog/parallelkernelbench
Paper 📜: https://www.alphaxiv.org/abs/2606.parallel-kernel-bench
GitHub 💻: https://github.com/togethercomputer/ParallelKernelBench
HuggingFace 🤗: https://huggingface.co/datasets/togethercomputer/ParallelKernelBench_Problems
