What does this writing signal mean?

Together AI published ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet). This talking signal gives public context for research themes, product direction, policy, or launch framing. High-signal details: Substantive benchmark post revealing LLM limitations. onlylabs links this event to 1 captured evidence page and 6 related writing signals.

Together AI Writing: ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)

Captured source

together.ai/together.ai/blog/parallelkernelbench

published Jun 23, 2026 · seen 3d · captured 3d · http 200 · method plain

Published 6/23/2026

ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)

The best frontier model solves under a third of 87 real-world problems — but a few generated kernels beat anything publicly available.

Authors

Willy Chan, Nathan Paek, Simon Guo, Simran Arora, Daniel Y. Fu

Links in this article

Paper HuggingFace Code

Summary

LLMs have gotten surprisingly good at writing GPU kernels [1][2][3], but almost all current benchmarks measuring that progress are single-GPU. In production, communication is often the bottleneck: communication overhead can account for over 20% of inference latency [4], and that gap keeps widening as compute scales faster than interconnect bandwidth. ParallelKernelBench (PKB) offers a benchmark and evaluation framework for multi-GPU kernel generation and includes 87 problems from real codebases where the task is replacing PyTorch + NCCL with a CUDA kernel that moves data directly over NVLink. We tested frontier coding models such as GPT-5.5, Gemini 3 Pro, Opus 4.7, and others. The evaluation revealed significant performance gaps across the board: under a third of problems were solved correctly, and fewer than a quarter of those beat the naive baseline. We'll cover why they fail, what the patterns look like, and a few cases where models surprisingly produced kernels faster than anything publicly available, including one for NVIDIA NeMo-RL's GRPO training loop, which has no prior optimized public reference.

Why multi-GPU is different from single-GPU kernel generation

LLMs have made progress on GPU kernel generation, but that progress has mostly been measured on a single GPU. Production AI workloads no longer fit that frame: they span multiple GPUs, and performance is increasingly shaped by communication rather than just local compute and memory. That shift makes multi-GPU kernel generation a different problem in three ways:

The design space expands combinatorially. Practitioners compose tensor, expert, data, context, and sequence parallelism to fit the hardware, and each composition creates a different communication pattern.

The performance model changes. A single-GPU roofline is built around compute and memory bandwidth. In multi-GPU code, the bottleneck is often the interconnect.

Multi-GPU kernel generation introduces a critical new design choice: how to move data between GPUs (through the copy engine, TMA, SM load/store, or NVLS) and whether to fuse that movement with compute.
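To make the change in performance model concrete, here is a minimal sketch of a communication-aware roofline in Python. It assumes that compute, local memory traffic, and interconnect traffic overlap perfectly, so the slowest of the three sets a lower bound on runtime. The `Hardware` and `Workload` classes, the `roofline_lower_bound` function, and every numeric value are hypothetical placeholders for illustration; none of them come from the post or from a real GPU datasheet.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    # All values are illustrative placeholders, not real GPU specifications.
    peak_flops: float      # FLOP/s
    mem_bandwidth: float   # bytes/s of local memory bandwidth
    link_bandwidth: float  # bytes/s of per-GPU interconnect bandwidth


@dataclass
class Workload:
    flops: float       # floating-point operations per GPU
    mem_bytes: float   # bytes read or written in local memory per GPU
    comm_bytes: float  # bytes sent over the interconnect per GPU


def roofline_lower_bound(hw: Hardware, wl: Workload) -> tuple[float, str]:
    """Return (time lower bound in seconds, name of the binding resource).

    Assumes the three resources overlap perfectly, so the slowest one
    determines the bound.
    """
    times = {
        "compute": wl.flops / hw.peak_flops,
        "memory": wl.mem_bytes / hw.mem_bandwidth,
        "interconnect": wl.comm_bytes / hw.link_bandwidth,
    }
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck


# Hypothetical hardware and workload: the interconnect term dominates.
hw = Hardware(peak_flops=1e15, mem_bandwidth=3e12, link_bandwidth=4e11)
wl = Workload(flops=2e11, mem_bytes=6e8, comm_bytes=4e8)
bound, bottleneck = roofline_lower_bound(hw, wl)
print(f"lower bound: {bound * 1e3:.2f} ms, bound by {bottleneck}")
```

With these placeholder numbers the compute and memory terms are each 0.2 ms while the interconnect term is 1 ms, which is the situation the post describes: a kernel can be far from compute-bound and still limited by how fast data crosses GPUs.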

ParallelKernelBench

We built PKB to test whether models can move beyond pure torch.distributed calls and actually write production multi-GPU kernels. Each problem starts from a standard PyTorch + NCCL implementation and a description of the hardware topology. The model then has to replace that reference with a CUDA kernel that communicates directly across GPUs using symmetric memory.

Figure: PKB evaluation pipeline. Each problem provides a task, hardware topology, and PyTorch + NCCL reference; the model generates a custom CUDA kernel that is evaluated for correctness, wall-clock speedup, and communication roofline.

To make sure the 87 problems cover the real space of production parallelism types, we built them from a taxonomy of distributed workloads. First, we identified the major ways models get sharded (tensor, context, data, expert, sequence, and FSDP/ZeRO) along with the communication patterns each one creates. Then we chose 87 problems to cover that space, taken from the codebases of systems like Megatron-LM, DeepSpeed, DeepEP, TensorRT-LLM, and NeMo-RL, as well as a long tail of non-LLM workloads: GNN routing, distributed FFTs, Gaussian splatting, etc.

Because PKB references are written in standard PyTorch + NCCL, the benchmark is also not tied to any particular hardware generation. Instead, it is designed to evolve alongside next-generation hardware architectures.
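The three checks named in the evaluation pipeline (correctness, wall-clock speedup, and distance from the communication roofline) can be sketched as follows. This is not PKB's actual harness: the `EvalResult` class, the `allclose` and `evaluate` helpers, the tolerances, and the timings are all assumptions made for illustration, and outputs are modeled as flat lists of floats rather than GPU tensors.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalResult:
    correct: bool
    speedup: float            # reference time / candidate time
    roofline_fraction: float  # roofline lower bound / candidate time


def allclose(a, b, rtol=1e-3, atol=1e-5):
    """Tolerance check on two flat sequences (tolerances are assumed values)."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rtol, abs_tol=atol) for x, y in zip(a, b)
    )


def evaluate(ref_out, cand_out, ref_time_s, cand_time_s, roofline_time_s):
    """Score one candidate kernel against its reference implementation."""
    return EvalResult(
        correct=allclose(ref_out, cand_out),
        speedup=ref_time_s / cand_time_s,
        roofline_fraction=roofline_time_s / cand_time_s,
    )


# Hypothetical measurements for a single problem.
result = evaluate(
    ref_out=[1.0, 2.0],
    cand_out=[1.0, 2.0005],
    ref_time_s=2.0e-3,
    cand_time_s=1.6e-3,
    roofline_time_s=1.0e-3,
)
print(result)
```

Keeping the three numbers separate matters for reading results: a candidate can be correct and faster than the reference (speedup above 1) while still sitting well below the roofline, which means headroom remains.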

Figure: Taxonomy for parallelizing standard transformer blocks. Different sharding strategies create distinct communication patterns across normalization, attention, and MLP, illustrated here for a representative Gemma3-27B layer.

Figure: PKB problem coverage across parallelism types (left) and source codebases (right), spanning RL post-training, LLM training, kernel libraries, vision models, GNNs, and more.

Before evaluating models, we first checked whether the PyTorch + NCCL baselines leave real headroom. A communication-aware roofline says yes: most PKB problems are bottlenecked by NVLink, and the baselines run far below the hardware ceiling. So the next question is simple: can models close that gap?

How frontier models do on PKB

Not well. In the zero-shot setting, the best model solves 28 of 87 problems, and only 22 of those solutions are faster than the PyTorch + NCCL baseline. Sampling three attempts improves the best result to 36 correct solutions and 27 faster-than-baseline solutions, but fast_1@3 still tops out at 31%.

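Each cell shows the count at 1 attempt → 3 attempts (pass@1→3 and fast_1@1→3). The # column is the number of problems in the category.

| Category | # | GPT-5.5 pass | GPT-5.5 fast_1 | Claude Opus 4.7 pass | Claude Opus 4.7 fast_1 | Gemini 3 Pro pass | Gemini 3 Pro fast_1 | GLM-5.1 pass | GLM-5.1 fast_1 | GLM-5.2 pass | GLM-5.2 fast_1 | DeepSeek V4 Pro pass | DeepSeek V4 Pro fast_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Collective Primitive | 8 | 3→5 | 3→4 | 4→5 | 3→4 | 6→6 | 2→2 | 2→3 | 2→3 | 1→3 | 1→1 | 0→0 | 0→0 |
| Tensor Parallel | 17 | 2→2 | 1→2 | 1→3 | 1→3 | 3→3 | 1→1 | 1→1 | 0→0 | 0→1 | 0→0 | 0→0 | 0→0 |
| Sequence Parallel | 2 | 1→1 | 1→1 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 |
| Context Parallel | 12 | 7→8 | 5→6 | 7→7 | 3→4 | 7→9 | 5→7 | 2→2 | 0→0 | 2→2 | 0→0 | 2→3 | 0→1 |
| Pipeline Parallel | 1 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 | 0→0 |

...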
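The pass and fast_1 counts in the table can be computed along these lines. This sketch assumes one reading of the metrics that the excerpt does not spell out: a problem counts toward pass@k if any of its first k attempts is correct, and toward fast_1@k if any of its first k attempts is both correct and faster than the PyTorch + NCCL baseline (speedup above 1). The `Attempt` type, both function names, and the toy data are hypothetical.

```python
from typing import NamedTuple


class Attempt(NamedTuple):
    correct: bool
    speedup: float  # reference time / candidate time


def pass_at_k(problems, k):
    """Number of problems with at least one correct attempt among the first k."""
    return sum(any(a.correct for a in attempts[:k]) for attempts in problems.values())


def fast1_at_k(problems, k):
    """Number of problems with a correct, faster-than-baseline attempt among the first k."""
    return sum(
        any(a.correct and a.speedup > 1.0 for a in attempts[:k])
        for attempts in problems.values()
    )


# Toy data: three made-up problems with three attempts each.
problems = {
    "allreduce": [Attempt(True, 1.4), Attempt(True, 0.9), Attempt(False, 0.0)],
    "tp_matmul": [Attempt(False, 0.0), Attempt(True, 0.8), Attempt(True, 1.1)],
    "cp_attention": [Attempt(False, 0.0), Attempt(False, 0.0), Attempt(False, 0.0)],
}

print("pass@1→3:", pass_at_k(problems, 1), "→", pass_at_k(problems, 3))
print("fast_1@1→3:", fast1_at_k(problems, 1), "→", fast1_at_k(problems, 3))

# The post's headline figure: 27 faster-than-baseline solutions out of 87 problems.
best_fast1_at_3_pct = round(100 * 27 / 87)
```

The last line reproduces the 31% quoted in the text from the reported counts of 27 and 87.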

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Substantive benchmark post revealing LLM limitations.