What does this repo signal mean?

CoreWeave published coreweave/nccl-tests (Shell). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo coreweave/nccl-tests · language Shell · NCCL testing repo from CoreWeave, 147 stars. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

CoreWeave Repo: coreweave/nccl-tests

Captured source

GitHub: github.com/coreweave/nccl-tests

coreweave/nccl-tests repository metadata

published Jun 29, 2022 · seen Jun 5 · captured Jun 11 · HTTP 200 · method plain

coreweave/nccl-tests

Description: NVIDIA NCCL Tests for Distributed Training

Language: Shell

Stars: 146

Forks: 32

Open issues: 6

Created: 2022-06-29T10:49:49Z

Pushed: 2026-06-10T16:04:56Z

Default branch: master

Fork: no

Archived: no

README:

NCCL for Distributed Training

CoreWeave supports the NVIDIA Collective Communication Library (NCCL) for powering multi-GPU and multi-node neural network training. NCCL underpins the vast majority of distributed training frameworks, such as DeepSpeed, PyTorch Distributed, and Horovod.

NCCL is supported across CoreWeave NVIDIA GPUs over Ethernet and InfiniBand. In addition, the specialized GB200 NVL72 clusters are built with NVIDIA Quantum-X800 InfiniBand networking and in-network collectives using NVIDIA SHARP to deliver the highest distributed training performance possible.

[NCCL for Distributed Training](#nccl-for-distributed-training)
[Docker Images](#docker-images)
[Running NCCL Tests](#running-nccl-tests)
[MPI Operator](#mpi-operator)
[Running Jobs](#running-jobs)
[Slurm](#slurm)
[Running Jobs](#running-jobs-1)
[Enroot](#enroot)
[Running DeepSpeed Training Jobs](#running-deepspeed-training-jobs)
[GDRCopy](#gdrcopy)
[Expected Performance](#expected-performance)
[GB200](#gb200)
[Single Rack](#single-rack)
[2 Racks](#2-racks)
[20 Racks](#20-racks)

Docker Images

This repository includes Dockerfiles that can be used directly or as a template for your distributed training applications. The Dockerfiles include the following components:

NVIDIA Mellanox OFED Driver userspace components. The kernel side is installed on our bare-metal nodes and does not need to be installed by users. The OFED drivers are necessary for optimized InfiniBand communication.
NVIDIA HPC-X which is a packaging of OpenMPI and UCX
NVIDIA HPC-X OpenMPI compiled with external PMIx to enable SLURM integration
NVIDIA GDRCopy libraries leverage GPUDirect RDMA for improved GPU to host memory copy performance in certain applications. The kernel support for GDRCopy exists on CoreWeave's bare-metal nodes.
NVIDIA NCCL SHARP Plugin for SHARP support in NCCL
NVIDIA NCCL Tests for verification and benchmarking purposes
NVIDIA DCGM for GPU tests and health checks
NVIDIA bandwidthTest utility
RDMA Perftest with GPUDirect
OpenSSH server and related settings to enable images to easily be used as MPI Runners

CoreWeave also publishes images built from these Dockerfiles that can be used as a base for your own images. The images below include NCCL v2.30.4-1, HPC-X v2.26, and cuDNN v9.20.0.48-1. Each image is multi-arch and can be used for both linux/amd64 and linux/arm64 containers. Compute capabilities up to Blackwell (10.0 & 12.0) are supported.

Ubuntu 24.04

| Image Tag                                                                  | CUDA     |
|----------------------------------------------------------------------------|----------|
| ghcr.io/coreweave/nccl-tests:13.2.1-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 13.2.1   |
| ghcr.io/coreweave/nccl-tests:13.1.1-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 13.1.1   |
| ghcr.io/coreweave/nccl-tests:13.0.2-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 13.0.2   |
| ghcr.io/coreweave/nccl-tests:12.9.1-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 12.9.1   |

Ubuntu 22.04

| Image Tag                                                                  | CUDA     |
|----------------------------------------------------------------------------|----------|
| ghcr.io/coreweave/nccl-tests:13.2.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 13.2.1   |
| ghcr.io/coreweave/nccl-tests:13.1.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 13.1.1   |
| ghcr.io/coreweave/nccl-tests:13.0.2-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 13.0.2   |
| ghcr.io/coreweave/nccl-tests:12.9.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 12.9.1   |
| ghcr.io/coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 12.8.1   |
| ghcr.io/coreweave/nccl-tests:12.6.3-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 12.6.3   |
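
The published tags appear to follow one pattern: CUDA version, `devel`, Ubuntu release, NCCL version, and a short trailing suffix. As a minimal sketch, a tag can be assembled from those parts. The helper name `nccl_tests_image` is hypothetical and not part of the repository, and its defaults are simply the values listed for the images above, so they may not match later releases.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): builds an image
# reference following the tag pattern of the published images.
nccl_tests_image() {
  local cuda
  local ubuntu
  local nccl
  local suffix
  cuda="$1"                  # CUDA version, e.g. 13.2.1
  ubuntu="$2"                # Ubuntu release, e.g. 24.04
  nccl="${3:-2.30.4-1}"      # NCCL version; default copied from the listed tags
  suffix="${4:-2eedd7c}"     # trailing suffix; default copied from the listed tags
  printf 'ghcr.io/coreweave/nccl-tests:%s-devel-ubuntu%s-nccl%s-%s\n' \
    "$cuda" "$ubuntu" "$nccl" "$suffix"
}

# Prints: ghcr.io/coreweave/nccl-tests:13.2.1-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c
nccl_tests_image 13.2.1 24.04
```

The result can be passed to `docker pull`. Because the images are described as multi-arch, `docker pull --platform linux/arm64 "$(nccl_tests_image 13.2.1 24.04)"` should select the arm64 variant. The helper does not check that a combination exists; CUDA 12.8.1 and 12.6.3, for example, are listed only for Ubuntu 22.04.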

Running NCCL Tests

There are many sample jobs in this repo showing how to run distributed NCCL tests, using the following workload managers:

MPI Operator

CoreWeave provides a managed instance of the MPI Operator to allow running MPI Jobs in a container-native fashion. No installation is required by the user; simply apply an MPIJob manifest in your namespace.
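
In practice, submitting a job comes down to applying a manifest with `kubectl`. The sketch below assumes `kubectl` is already configured for your namespace and that it is run from a checkout of the repository. The helper names `mpijob_manifest` and `submit_mpijob` and the `DRY_RUN` variable are hypothetical. The file-name pattern is inferred from the 64 GPU example manifests in the repository's `mpi-operator/` directory, so it will not match every file there (the GB200 manifests, for instance, are named differently).

```shell
#!/bin/sh
# Hypothetical helpers (not part of the repository).

# Build the path of a 64 GPU example manifest from a GPU type and an
# optional variant such as "sharp", "gdrcopy", or "noib".
mpijob_manifest() {
  local gpu
  local variant
  gpu="$1"
  variant="${2:-}"
  printf 'mpi-operator/nccl-test-distributed-%s-64%s-mpijob.yaml\n' \
    "$gpu" "${variant:+-$variant}"
}

# Print the kubectl command by default; set DRY_RUN=0 to actually apply it.
submit_mpijob() {
  local manifest
  manifest="$(mpijob_manifest "$@")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "kubectl apply -f $manifest"
  else
    kubectl apply -f "$manifest"
  fi
}

submit_mpijob h100 sharp
```

Defaulting to a dry run keeps the sketch safe to paste: nothing is submitted to the cluster until `DRY_RUN=0` is set explicitly.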

Example manifests are provided in the mpi-operator/ directory. There you'll find the following examples of 64 GPU (8 node) runs:

[A40](./mpi-operator/nccl-test-distributed-a40-64-mpijob.yaml)
[A100](./mpi-operator/nccl-test-distributed-a100-64-mpijob.yaml)
[A100 with GDRCopy](./mpi-operator/nccl-test-distributed-a100-64-gdrcopy-mpijob.yaml)
[A100 without InfiniBand](./mpi-operator/nccl-test-distributed-a100-64-noib-mpijob.yaml)
[A100 with SHARP](./mpi-operator/nccl-test-distributed-a100-64-sharp-mpijob.yaml)
[H100](./mpi-operator/nccl-test-distributed-h100-64-mpijob.yaml)
[H100 with SHARP](./mpi-operator/nccl-test-distributed-h100-64-sharp-mpijob.yaml)
[B200](./mpi-operator/nccl-test-distributed-b200-64-mpijob.yaml)
[B200 with SHARP](./mpi-operator/nccl-test-distributed-b200-64-sharp-mpijob.yaml)
[B300](./mpi-operator/nccl-test-distributed-b300-64-mpijob.yaml)
[B300 with SHARP](./mpi-operator/nccl-test-distributed-b300-64-sharp-mpijob.yaml)
[GB200 NVL72](./mpi-operator/nccl-test-distributed-gb200-nvl72-mpijob.yaml)
[GB200 128 GPU multi-rack](./mpi-operator/nccl-test-distributed-gb200-128-multirack-mpijob.yaml)
[GB300 NVL72...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

NCCL testing repo from CoreWeave, 147 stars.