togethercomputer/ParallelKernelBench
Python
Language: Python
Stars: 0
Forks: 0
Open issues: 6
Created: 2026-06-03T19:58:41Z
Pushed: 2026-06-23T06:09:38Z
Default branch: main
Fork: no
Archived: no
README:
ParallelKernelBench: Can LLMs write fast multi-GPU kernels?
ParallelKernelBench (PKB) is a benchmark aimed at enabling LLMs to optimize multi-GPU kernels. Specifically, we investigate how well models can turn existing PyTorch + NCCL reference code into fine-grained CUDA (or related DSL) kernels.
The design is heavily inspired by KernelBench.
📄 Paper · 🤗 Hugging Face · 🌐 Project website
---
👋 Overview
PKB asks models to optimize multi-GPU kernels: each problem has a PyTorch + NCCL reference under reference/; candidates go in solutions_/ (CUDA, Triton, ParallelKittens, or run-specific trees from generation).
Correctness: eval mode runs reference and candidate on the same inputs and compares per-rank outputs (rank_*.pt) within --atol / --rtol.
Performance: optional timing reports speedup vs reference. We follow the benchmark rigor of ThunderKittens 2: 500 warmup iterations, then 100 timed iterations (see worker / perf utilities).
Roofline (approximate): reference_rooflines_code/ provides utilization estimates; contributions welcome.
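The correctness and performance checks described above can be sketched in plain Python. This is a minimal illustration, not the harness's actual code: the function names, the default tolerances, and the injectable clock are assumptions made for the example. The tolerance rule is the one torch.allclose uses, |candidate - reference| <= atol + rtol * |reference|. Real GPU timing additionally needs device synchronization (for example CUDA events), which this host-side sketch omits.

```python
import time
from typing import Callable, Sequence


def outputs_match(candidate: Sequence[float], reference: Sequence[float],
                  atol: float = 1e-2, rtol: float = 1e-2) -> bool:
    """Elementwise check |c - r| <= atol + rtol * |r| (defaults are hypothetical)."""
    if len(candidate) != len(reference):
        return False
    return all(abs(c - r) <= atol + rtol * abs(r)
               for c, r in zip(candidate, reference))


def mean_time(fn: Callable[[], object], warmup: int = 500, iters: int = 100,
              clock: Callable[[], float] = time.perf_counter) -> float:
    """Run `warmup` untimed calls, then return mean seconds over `iters` timed calls."""
    for _ in range(warmup):
        fn()
    start = clock()
    for _ in range(iters):
        fn()
    return (clock() - start) / iters


def speedup(reference_seconds: float, candidate_seconds: float) -> float:
    """A value above 1.0 means the candidate is faster than the reference."""
    return reference_seconds / candidate_seconds
```

In a multi-GPU setting the comparison would run once per rank (one rank_*.pt pair each), and a candidate would count as correct only if every rank passes.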
---
⚙️ Setup
PKB uses [uv](https://docs.astral.sh/uv/) for reproducible Python environments.
Prerequisites
- OS: Linux with NVIDIA GPUs (multi-GPU runs need matching torchrun / NCCL).
- Driver: Recent enough for CUDA 12.8 wheels (H100 nodes typically satisfy this).
- ParallelKittens backend (optional): clone ThunderKittens and set THUNDERKITTENS_ROOT to the repo root (Modal/Together images do this automatically).
Install with uv
# Install uv (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

cd ParallelKernelBench
cd kernelgen
git clone https://github.com/SWE-agent/mini-swe-agent.git
cd ..
uv sync

# Verify the environment
uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
uv run pytest tests/ -q
This creates .venv/ at the repo root and installs all dependencies from the pinned uv.lock (commit that file when you change pyproject.toml).
API keys (generation / cloud eval)
- Google models: GEMINI_API_KEY or GOOGLE_API_KEY.
- Together models: TOGETHER_API_KEY (see ALLOWED_MODELS in kernelgen/generate_kernel.py).
- Anthropic models: ANTHROPIC_API_KEY.
- OpenAI models: OPENAI_API_KEY.
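Before launching a generation run, it can help to check that the key for the chosen provider is actually set. The helper below is a hypothetical sketch and not part of the repository: the provider labels, the function name, and the choice to try GEMINI_API_KEY before GOOGLE_API_KEY are assumptions. Only the environment variable names come from the list above.

```python
import os
from typing import Mapping, Optional

# Provider label (hypothetical) -> accepted environment variables, as listed above.
PROVIDER_KEYS = {
    "google": ("GEMINI_API_KEY", "GOOGLE_API_KEY"),
    "together": ("TOGETHER_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
}


def find_api_key_var(provider: str,
                     env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the first non-empty key variable for `provider`, else None."""
    env = os.environ if env is None else env
    for name in PROVIDER_KEYS[provider]:
        if env.get(name):  # unset and empty values are both treated as missing
            return name
    return None
```

Calling find_api_key_var("together") with no second argument inspects the real process environment; passing a dict makes the check easy to test.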
---
🚀 Usage
Generate a single solution
[kernelgen/generate_kernel.py](kernelgen/generate_kernel.py) assembles a kernel-generation prompt from [kernelgen/prompts.toml](kernelgen/prompts.toml) and optionally calls an LLM. You can (1) print the prompt only (--print-prompt) or (2) generate a solution file for one problem and backend.
# [print-prompt] Inspect the assembled user prompt; no API call, no file written
# --precision: fp32 | fp16 | bf16 (must match an entry in prompts.toml)
# --hardware: h100_8 | b200_72 (optional; omitted → "none" in the output directory name)
# --backend: cuda | triton | parallelkittens (must match [backends] in prompts.toml)
# --model: must be in ALLOWED_MODELS in generate_kernel.py (ignored with --print-prompt)
uv run python kernelgen/generate_kernel.py \
  --precision bf16 \
  --hardware h100_8 \
  --problem 1 \
  --model zai-org/GLM-5.1 \
  --backend cuda \
  --print-prompt

# [generate] Call the LLM and write e.g. solutions_cuda_bf16_h100_8_together_zai-org_GLM-5.1/1_allreduce_cuda.py
uv run python kernelgen/generate_kernel.py \
  --precision bf16 \
  --hardware h100_8 \
  --problem 1 \
  --model zai-org/GLM-5.1 \
  --backend cuda

# Other backends (same flags; different output dir prefix and filename suffix)
uv run python kernelgen/generate_kernel.py --precision bf16 --hardware h100_8 --problem 1 --model gemini-3-pro-preview --backend triton
uv run python kernelgen/generate_kernel.py --precision bf16 --hardware h100_8 --problem 1 --model gemini-3-pro-preview --backend parallelkittens

# Optional: custom prompt template
uv run python kernelgen/generate_kernel.py --paths-to-prompts-template /path/to/prompts.toml --precision bf16 --problem 1 --backend cuda --print-prompt
Outputs: without --print-prompt, each run writes under solutions_____/ as {stem}_{backend}.py (for example 1_allreduce_cuda.py). Pass that directory to [run_local.py](run_local.py) via --solutions-dir when evaluating.
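When scripting around generated files, the {stem}_{backend}.py naming makes it possible to recover the problem stem and backend from a filename. The helper below is a hypothetical sketch (its name and behavior are not taken from the repository); it assumes only the three backend names listed for --backend.

```python
from pathlib import Path
from typing import Optional, Tuple

BACKENDS = ("cuda", "triton", "parallelkittens")  # values accepted by --backend


def split_solution_name(path: str) -> Optional[Tuple[str, str]]:
    """Split '{stem}_{backend}.py' into (stem, backend); None if it does not match."""
    p = Path(path)
    if p.suffix != ".py":
        return None
    for backend in BACKENDS:
        suffix = "_" + backend
        # Require a non-empty problem stem before the backend suffix.
        if p.stem.endswith(suffix) and len(p.stem) > len(suffix):
            return p.stem[: -len(suffix)], backend
    return None


print(split_solution_name("1_allreduce_cuda.py"))  # ('1_allreduce', 'cuda')
```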
Generate a single solution (mini-SWE-agent)
We provide a script ([kernelgen/generate_kernel_agent.py](kernelgen/generate_kernel_agent.py)) that uses mini-swe-agent to generate a kernel. Note that we have verified this functionality on Google models.
To use this script, you must install mini-swe-agent separately:
cd kernelgen
git clone https://github.com/SWE-agent/mini-swe-agent.git
cd ..
pip install -e kernelgen/mini-swe-agent
An example command:
python kernelgen/generate_kernel_agent.py \
--problem 1 \
--backend cuda \
--model gemini-3-flash-preview \
--step-limit 3 \
--timeout 600 \
--remote-dryrun-command \
'python run_local.py --num-procs-per-node 4 --mode dryrun --problem {problem_arg} --solution {backend} --measure-perf' \
--remote-eval-command \
'python run_local.py --num-procs-per-node 4 --mode eval --problem {problem_arg} --solution {backend} --measure-perf'

Generate multiple or all solutions
The --problem flag also accepts all or a list of problem numbers, which makes it simple to generate solutions for multiple problems:
# generate every problem under reference/
python kernelgen/generate_kernel.py \
  --precision bf16 \
  --hardware h100_8 \
  --problem all \
  --model deepseek-ai/DeepSeek-V4-Pro \
  --backend cuda

# generate a specific subset of problems
python kernelgen/generate_kernel.py \
  --precision bf16 \
  --hardware h100_8 \
  --problem '[72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88]' \
  --model zai-org/GLM-5.1 \
  --backend cuda
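The three accepted shapes of --problem (a single number, all, or a bracketed list) could be normalized along the following lines. This is a sketch under stated assumptions, not the parser in generate_kernel.py: the function name is hypothetical, and the caller is assumed to supply the full list of problem numbers found under reference/.

```python
import ast
from typing import List, Sequence


def parse_problem_arg(value: str, all_problems: Sequence[int]) -> List[int]:
    """Normalize '1', 'all', or '[72, 73]' into a list of problem numbers."""
    value = value.strip()
    if value == "all":
        return list(all_problems)
    # literal_eval safely parses Python literals such as 1 or [72, 73].
    parsed = ast.literal_eval(value)
    if isinstance(parsed, int):
        return [parsed]
    if isinstance(parsed, (list, tuple)) and all(isinstance(p, int) for p in parsed):
        return list(parsed)
    raise ValueError(f"unsupported --problem value: {value!r}")
```

Using ast.literal_eval rather than eval keeps the flag from executing arbitrary code while still accepting the quoted list syntax shown above.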
Evaluate a single problem (locally)
[run_local.py](run_local.py) is the local multi-GPU harness (via torchrun). Assuming you have an environment with multiple GPUs connected via NVLink, you can (1) run one backend in isolation (dryrun) or (2) compare a candidate kernel against the reference (eval).
#...
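For scripted evaluation, a run_local.py invocation can be assembled programmatically. The sketch below is hypothetical and uses only the flags that appear in the agent example earlier in this README (--num-procs-per-node, --mode, --problem, --solution, --measure-perf); the function name, the defaults, and the assumption that --problem takes a plain number here are illustrative only.

```python
import shlex
from typing import List


def run_local_cmd(problem: int, backend: str, mode: str = "eval",
                  num_procs: int = 4, measure_perf: bool = True) -> List[str]:
    """Build an argv list for run_local.py in dryrun or eval mode."""
    if mode not in ("dryrun", "eval"):
        raise ValueError(f"mode must be 'dryrun' or 'eval', got {mode!r}")
    cmd = [
        "python", "run_local.py",
        "--num-procs-per-node", str(num_procs),
        "--mode", mode,
        "--problem", str(problem),
        "--solution", backend,
    ]
    if measure_perf:
        cmd.append("--measure-perf")
    return cmd


# Print the command for inspection; pass the list to subprocess.run to execute it.
print(shlex.join(run_local_cmd(1, "cuda")))
```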
Excerpt shown — open the source for the full document.