NVIDIA/nccl nccl4py-v0.3.1

NCCL4py v0.3.1 Release

Repository: NVIDIA/nccl

Tag: nccl4py-v0.3.1

Published: 2026-06-11T18:45:48Z

Prerelease: no

Release notes:

Highlights

  • Added nccl.ep, a Pythonic interface to libnccl_ep.so for expert parallel dispatch/combine workflows. The package exposes Group, Handle, Tensor, typed config dataclasses, Algorithm, Layout, PassDir, and the named input/output structs used by the NCCL EP API.
  • Added nccl.core.device.cute, enabling CuTeDSL kernels to call NCCL device APIs.
  • Added top-level stack diagnostics with nccl.get_version() and nccl.show_versions(), reporting nccl4py, libnccl.so, and libnccl_ep.so versions, CUDA build variants, and loaded shared-library paths.
  • Added free-threaded CPython support.

New Features

NCCL EP Python API

  • New nccl.ep package provides Pythonic access to the NCCL EP extension library.
  • Group.create() creates EP groups from a Communicator and GroupConfig; Group.create_handle() creates handles with an explicit Layout.
  • Handle supports update(), dispatch(), combine(), complete(), and destroy().
  • DispatchInputs, DispatchOutputs, CombineInputs, CombineOutputs, and LayoutInfo provide named containers for the tensors and metadata used by dispatch, combine, and handle setup.
  • Tensor resolves Python buffers into ncclEpTensor_t descriptors.
  • GroupConfig, HandleConfig, DispatchConfig, CombineConfig, and AllocConfig expose typed configuration objects.
  • AllocFn and FreeFn expose caller-controlled EP allocation hooks.
  • nccl.ep.interop.torch.get_nccl_comm_from_group() provides PyTorch interop for creating an NCCL communicator from a PyTorch process group's rank and world-size information.
  • Importing nccl.ep sets default NCCL_EP_HOME when bundled EP JIT headers are present, and NCCL_HOME when NCCL public headers are available from the installed nvidia.nccl package.
  • nccl.ep checks that the loaded libnccl.so and libnccl_ep.so were built with the same CUDA major version. CUDA minor differences are accepted.
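
The CUDA major-version rule in the last item can be illustrated with a small stand-alone check. This is only a sketch of the rule as stated, not the code nccl.ep runs; the function names and the plain "major.minor" string inputs are assumptions made for the example.

```python
def cuda_major(version: str) -> int:
    """Return the major component of a 'major.minor' CUDA version string."""
    return int(version.split(".", 1)[0])


def cuda_majors_match(nccl_cuda: str, ep_cuda: str) -> bool:
    """True when both libraries were built against the same CUDA major version."""
    return cuda_major(nccl_cuda) == cuda_major(ep_cuda)


# Hypothetical version strings, chosen only to exercise both outcomes.
print(cuda_majors_match("13.0", "13.1"))  # True: only the minor version differs
print(cuda_majors_match("12.9", "13.0"))  # False: the major versions differ
```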

Communicator Configuration

  • Added graph_stream_ordering to NCCLConfig.

Device API and CuTe DSL

  • New nccl.core.device.cute module exposes the NCCL device API to CuTeDSL kernels, including communicator/window access, GIN primitives, barrier operations, and typed structs.
  • Added bindings/nccl4py/examples/cute/main.py, a GIN put/wait example with host-side validation.
  • Added gin_strong_signals_required and gin_va_signals_required to NCCLDevCommRequirements for configuring device communicator requirements.
  • Added NcclGinType.GPI for the GPU-Push Interface transport.

Version and Diagnostics API

  • Top-level nccl.get_version() returns a VersionInfo dataclass containing the nccl4py package version plus LibraryInfo entries for the loaded libnccl.so and, when available, libnccl_ep.so.
  • Top-level nccl.show_versions() prints the same stack information in a human-readable version block.
  • Direct library probes are available for each native library: nccl.core.get_lib_version() and nccl.core.get_lib_path() report the loaded libnccl.so; nccl.ep.get_lib_version() and nccl.ep.get_lib_path() report the loaded libnccl_ep.so.
  • Each LibraryInfo includes release version, CUDA build variant, and loaded shared-library path.
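
To make the shape of this information concrete, the sketch below defines local stand-in dataclasses that mirror the description above. The class names, every field name, and the placeholder values are hypothetical; the real VersionInfo and LibraryInfo attributes should be taken from the nccl4py documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchLibraryInfo:
    # Hypothetical field names; the release notes list only the kinds of data.
    version: str       # release version of the native library
    cuda_variant: str  # CUDA build variant
    path: str          # loaded shared-library path


@dataclass(frozen=True)
class SketchVersionInfo:
    nccl4py: str
    libnccl: SketchLibraryInfo
    libnccl_ep: Optional[SketchLibraryInfo] = None  # None when EP is not loaded


def format_versions(info: SketchVersionInfo) -> str:
    """Render one line per component, skipping libraries that are not loaded."""
    lines = [f"nccl4py: {info.nccl4py}"]
    libs = (("libnccl.so", info.libnccl), ("libnccl_ep.so", info.libnccl_ep))
    for name, lib in libs:
        if lib is not None:
            lines.append(f"{name}: {lib.version} ({lib.cuda_variant}) at {lib.path}")
    return "\n".join(lines)


# Placeholder values only; none of these come from a real installation.
info = SketchVersionInfo(
    nccl4py="0.3.1",
    libnccl=SketchLibraryInfo("x.y.z", "cu13", "/path/to/libnccl.so"),
)
print(format_versions(info))
```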

Installation and Packaging

  • CuTeDSL support can be installed through the CUDA-specific extras: nccl4py[cu12] installs nvidia-cutlass-dsl>=4.5.2,<5.0.

  • Wheels include package data for nccl/ep/lib/libnccl_ep.so plus EP JIT headers. The bundled libnccl_ep.so is built with CUDA 13, regardless of whether the cu12 or cu13 extra is installed. Users who want to use a CUDA 12 build of libnccl_ep.so must provide that library themselves, for example through LD_PRELOAD or LD_LIBRARY_PATH.

  • Wheels are available for free-threaded CPython 3.14t.
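
For the LD_LIBRARY_PATH route mentioned above, the dynamic loader generally reads that variable when a process starts, so it is usually set before the Python interpreter is launched rather than from inside it. The sketch below builds such an environment for a child process. The helper name and the directory are hypothetical, and whether the path takes effect depends on how the library is located at import time.

```python
import os
from typing import Dict, Mapping


def env_with_library_dir(lib_dir: str, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with lib_dir placed first on LD_LIBRARY_PATH."""
    env = dict(base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


# Hypothetical directory holding a user-provided CUDA 12 build of libnccl_ep.so.
child_env = env_with_library_dir("/opt/nccl-ep-cu12/lib", os.environ)

# A child interpreter would then be started with this environment, for example
# subprocess.run([sys.executable, "train.py"], env=child_env).
print(child_env["LD_LIBRARY_PATH"].split(os.pathsep)[0])
```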

Examples and Documentation

  • Added Python examples for:
      • multiple devices in one process: docs/examples/01_communicators/01_multiple_devices_single_process/python/
      • one device per MPI process: docs/examples/01_communicators/03_one_device_per_process_mpi/python/
      • point-to-point ring pattern: docs/examples/02_point_to_point/01_ring_pattern/python/
      • allreduce: docs/examples/03_collectives/01_allreduce/python/
      • user-buffer allreduce: docs/examples/04_user_buffer_registration/01_allreduce/python/
      • symmetric-memory allreduce: docs/examples/05_symmetric_memory/01_allreduce/python/
      • symmetric-memory allgather: docs/examples/05_symmetric_memory/02_allgather/python/
  • Added nccl4py documentation under docs/userguide/source/nccl4py/, with the main entry point at docs/userguide/source/nccl4py.rst.

Breaking Changes

Removed APIs

  • nccl.core.group_simulate_end() has been removed. Use nccl.core.group_end(simulate=True):

from nccl.core import group_end, group_start

group_start()
# enqueue operations
info = group_end(simulate=True)

  • NCCL_SPLIT_NOCOLOR has been removed from the public constants. Use color=None when a rank should opt out of Communicator.split().
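
As an illustration of the color=None migration, the sketch below computes a per-rank color in plain Python, returning None where older code would have passed NCCL_SPLIT_NOCOLOR. The helper name and the grouping scheme are hypothetical; only the use of None as the opt-out value comes from the note above.

```python
from typing import Optional


def split_color(rank: int, ranks_per_group: int, active_ranks: int) -> Optional[int]:
    """Return the sub-communicator color for a rank, or None to opt out."""
    if rank >= active_ranks:
        return None  # takes the place of the removed NCCL_SPLIT_NOCOLOR constant
    return rank // ranks_per_group


# Eight ranks, groups of two, with the last two ranks opting out.
colors = [split_color(r, ranks_per_group=2, active_ranks=6) for r in range(8)]
print(colors)  # [0, 0, 1, 1, 2, 2, None, None]
```

Each rank would then pass its own value as color= to Communicator.split().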

Deprecated APIs

  • nccl.core.get_version() remains available, but is deprecated. Use top-level nccl.get_version() for structured version information, or nccl.show_versions() for human-readable output.
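
Code that has to run against both older releases and 0.3.1 could prefer the top-level call and fall back to the deprecated one. The helper below is a hypothetical sketch of that pattern, not part of nccl4py; it returns None when the package is not installed, and the two calls may return differently shaped values.

```python
import importlib


def read_nccl_versions():
    """Return version info from nccl4py, or None when the package is absent."""
    try:
        nccl = importlib.import_module("nccl")
    except ImportError:
        return None
    getter = getattr(nccl, "get_version", None)
    if getter is not None:
        return getter()  # top-level API added in 0.3.1
    # Older releases: fall back to the now-deprecated nccl.core.get_version().
    return importlib.import_module("nccl.core").get_version()
```

Callers should treat the result as opaque unless they know which branch produced it.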

Other Compatibility Notes

  • Public NCCL enum wrappers are pure-Python IntEnum or IntFlag classes. Integer compatibility is preserved, and dtype conversion remains supported. Code that depends on binding-backed enum class identity from earlier releases may need updates.
  • Enum members now follow the Python enum convention of UPPER_SNAKE_CASE names, such as CTAPolicy.DEFAULT, CommShrinkFlag.ABORT, WindowFlag.COLL_SYMMETRIC, and NcclCommMemStat.GPU_MEM_TOTAL. The previous PascalCase/camelCase aliases, such as CTAPolicy.Default and NcclCommMemStat.GpuMemTotal, still work in 0.3.1 for compatibility, but will be removed in a future release. New code should use the uppercase names.
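
The sketch below shows one way a pure-Python IntEnum can carry a legacy alias next to its uppercase name, using the standard library's rule that a repeated value becomes an alias. The class is a local stand-in with hypothetical members and values; nccl4py may implement its aliases differently.

```python
from enum import IntEnum


class CTAPolicySketch(IntEnum):
    # Hypothetical values; only the naming pattern mirrors the release notes.
    DEFAULT = 0
    EXAMPLE = 1
    # Legacy alias: a repeated value makes this a second name for DEFAULT.
    Default = 0


# Both spellings resolve to the same member, so existing comparisons keep working.
assert CTAPolicySketch.Default is CTAPolicySketch.DEFAULT
print(CTAPolicySketch.Default.name)       # DEFAULT
print(int(CTAPolicySketch.DEFAULT))       # 0
print([m.name for m in CTAPolicySketch])  # ['DEFAULT', 'EXAMPLE']
```

Because aliases are skipped during iteration and report the canonical name, code that logs or serializes member names sees the uppercase form either way.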

Fixes and Enhancements

  • Fixed pointer lifetime handling for non-blocking communicator and window initialization.
  • Torch interop covers torch.uint32 and torch.uint64 when those dtypes are available.

API Stability

  • nccl.ep and nccl.core.device.cute are initial API support. Their public interfaces may change in future releases as the NCCL EP and CuTeDSL device API integration...
