Repo · CoreWeave · published Jun 29, 2022

coreweave/nccl-tests

Shell



Description: NVIDIA NCCL Tests for Distributed Training

Language: Shell

Stars: 146

Forks: 32

Open issues: 6

Created: 2022-06-29T10:49:49Z

Pushed: 2026-06-10T16:04:56Z

Default branch: master

Fork: no

Archived: no

README:

NCCL for Distributed Training

CoreWeave supports the NVIDIA Collective Communication Library (NCCL) for powering multi-GPU and multi-node neural network training. NCCL underpins most distributed training frameworks, including DeepSpeed, PyTorch Distributed, and Horovod.

NCCL is supported across CoreWeave NVIDIA GPUs over Ethernet and InfiniBand. In addition, the specialized GB200 NVL72 clusters are built with NVIDIA Quantum-X800 InfiniBand networking and in-network collectives using NVIDIA SHARP to deliver the highest distributed training performance possible.

  • [NCCL for Distributed Training](#nccl-for-distributed-training)
  • [Docker Images](#docker-images)
  • [Running NCCL Tests](#running-nccl-tests)
  • [MPI Operator](#mpi-operator)
  • [Running Jobs](#running-jobs)
  • [Slurm](#slurm)
  • [Running Jobs](#running-jobs-1)
  • [Enroot](#enroot)
  • [Running DeepSpeed Training Jobs](#running-deepspeed-training-jobs)
  • [GDRCopy](#gdrcopy)
  • [Expected Performance](#expected-performance)
  • [GB200](#gb200)
  • [Single Rack](#single-rack)
  • [2 Racks](#2-racks)
  • [20 Racks](#20-racks)

Docker Images

This repository includes Dockerfiles that can be used directly or as a template for your distributed training applications. The Dockerfiles include the following components:

  • OFED userspace components. The kernel side is installed on our bare-metal nodes and does not need to be installed by users. The OFED drivers are necessary for optimized InfiniBand communication.
  • NVIDIA HPC-X packaging of OpenMPI and UCX
  • NVIDIA HPC-X OpenMPI compiled with external PMIx to enable Slurm integration
  • GDRCopy, which uses GPUDirect RDMA for improved GPU to host memory copy performance in certain applications. The kernel support for GDRCopy exists on CoreWeave's bare-metal nodes.
  • … for SHARP support in NCCL
  • … and benchmarking purposes
  • NVIDIA DCGM for GPU tests and health checks
  • … utility
  • RDMA Perftest with GPUDirect
  • OpenSSH server and related settings to enable images to easily be used as MPI Runners

CoreWeave also publishes images built from these Dockerfiles that can be used as a base for your own images. The images below include NCCL v2.30.4-1, HPC-X v2.26, and cuDNN v9.20.0.48-1. Each image is multi-arch and can be used for both linux/amd64 and linux/arm64 containers. Compute capabilities up to Blackwell (10.0 & 12.0) are supported.

Ubuntu 24.04

| Image Tag                                                                  | CUDA   |
|----------------------------------------------------------------------------|--------|
| ghcr.io/coreweave/nccl-tests:13.2.1-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 13.2.1 |
| ghcr.io/coreweave/nccl-tests:13.1.1-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 13.1.1 |
| ghcr.io/coreweave/nccl-tests:13.0.2-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 13.0.2 |
| ghcr.io/coreweave/nccl-tests:12.9.1-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 12.9.1 |

Ubuntu 22.04

| Image Tag                                                                  | CUDA   |
|----------------------------------------------------------------------------|--------|
| ghcr.io/coreweave/nccl-tests:13.2.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 13.2.1 |
| ghcr.io/coreweave/nccl-tests:13.1.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 13.1.1 |
| ghcr.io/coreweave/nccl-tests:13.0.2-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 13.0.2 |
| ghcr.io/coreweave/nccl-tests:12.9.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 12.9.1 |
| ghcr.io/coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 12.8.1 |
| ghcr.io/coreweave/nccl-tests:12.6.3-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 12.6.3 |
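The tags in the tables above follow a regular pattern, so an image reference can be composed from a CUDA version and an Ubuntu release. The following is a minimal sketch, not part of the repository: the helper name `nccl_image_tag` and the `PULL_IMAGE` switch are hypothetical, and the NCCL version and commit suffix are copied from the published tags. Not every combination exists (for example, CUDA 12.8.1 is listed only for Ubuntu 22.04), so check the tables before pulling.

```shell
#!/bin/sh
# Hypothetical helper: build an image reference from the tag pattern
# used in the tables above.
#   $1 = CUDA version (e.g. 13.2.1)
#   $2 = Ubuntu release (e.g. 24.04)
nccl_image_tag() {
  printf 'ghcr.io/coreweave/nccl-tests:%s-devel-ubuntu%s-nccl2.30.4-1-2eedd7c\n' "$1" "$2"
}

image="$(nccl_image_tag 13.2.1 24.04)"
echo "$image"

# Pull only when explicitly requested (PULL_IMAGE is an assumption of
# this sketch, not a variable the repository defines).
if [ "${PULL_IMAGE:-0}" = "1" ]; then
  docker pull "$image"
fi
```

Setting `PULL_IMAGE=1` runs `docker pull` with the composed reference; by default the script only prints it.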

Running NCCL Tests

This repo includes sample jobs showing how to run distributed NCCL tests with the following workload managers:

  • [MPI Operator](#mpi-operator)
  • [Slurm](#slurm)

MPI Operator

CoreWeave provides a managed instance of the MPI Operator for running MPI jobs in a container-native fashion. No installation is required; simply submit an MPIJob manifest in your namespace.
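Submitting a job could look roughly like the sketch below. It assumes `kubectl` is already configured for your namespace and that you run it from a checkout of this repository. The helper name `mpijob_manifest` and the `SUBMIT_JOB` switch are hypothetical; the path pattern is taken from the example manifests in the `mpi-operator/` directory.

```shell
#!/bin/sh
# Hypothetical helper: build the path of an example manifest.
#   $1 = manifest variant, e.g. h100-64 or a100-64-sharp
mpijob_manifest() {
  printf 'mpi-operator/nccl-test-distributed-%s-mpijob.yaml\n' "$1"
}

manifest="$(mpijob_manifest h100-64)"
echo "kubectl apply -f $manifest"

# Submit only when explicitly requested (SUBMIT_JOB is an assumption of
# this sketch, not a variable the repository defines).
if [ "${SUBMIT_JOB:-0}" = "1" ]; then
  kubectl apply -f "$manifest"   # create the MPIJob in the current namespace
  kubectl get mpijobs            # list MPIJob resources to confirm it was created
fi
```

By default the script only prints the command it would run, which makes it easy to check the manifest path before submitting anything to the cluster.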

Example manifests are provided in the mpi-operator/ directory. There you'll find the following examples of 64-GPU (8-node) runs:

  • [A40](./mpi-operator/nccl-test-distributed-a40-64-mpijob.yaml)
  • [A100](./mpi-operator/nccl-test-distributed-a100-64-mpijob.yaml)
  • [A100 with GDRCopy](./mpi-operator/nccl-test-distributed-a100-64-gdrcopy-mpijob.yaml)
  • [A100 without InfiniBand](./mpi-operator/nccl-test-distributed-a100-64-noib-mpijob.yaml)
  • [A100 with SHARP](./mpi-operator/nccl-test-distributed-a100-64-sharp-mpijob.yaml)
  • [H100](./mpi-operator/nccl-test-distributed-h100-64-mpijob.yaml)
  • [H100 with SHARP](./mpi-operator/nccl-test-distributed-h100-64-sharp-mpijob.yaml)
  • [B200](./mpi-operator/nccl-test-distributed-b200-64-mpijob.yaml)
  • [B200 with SHARP](./mpi-operator/nccl-test-distributed-b200-64-sharp-mpijob.yaml)
  • [B300](./mpi-operator/nccl-test-distributed-b300-64-mpijob.yaml)
  • [B300 with SHARP](./mpi-operator/nccl-test-distributed-b300-64-sharp-mpijob.yaml)
  • [GB200 NVL72](./mpi-operator/nccl-test-distributed-gb200-nvl72-mpijob.yaml)
  • [GB200 128 GPU multi-rack](./mpi-operator/nccl-test-distributed-gb200-128-multirack-mpijob.yaml)
  • [GB300 NVL72…
