What does this release signal mean?

NVIDIA published the nccl4py-v0.3.1 release in the NVIDIA/nccl repository. This release signal is evidence of what shipped, changed, or was packaged for users. High-signal details: a routine point release of the nccl4py library, published 2026-06-11T18:45:48Z and not marked as a prerelease. The release notes open with: Highlights - Added `nccl.ep`, a Pythonic.... onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/nccl nccl4py-v0.3.1

Captured source

GitHub/github.com/NVIDIA/nccl

published Jun 11, 2026 · seen 1d · captured 1d · http 200 · method plain

NCCL4py v0.3.1 Release

Repository: NVIDIA/nccl

Tag: nccl4py-v0.3.1

Published: 2026-06-11T18:45:48Z

Prerelease: no

Release notes:

Highlights

Added nccl.ep, a Pythonic interface to libnccl_ep.so for expert parallel dispatch/combine workflows. The package exposes Group, Handle, Tensor, typed config dataclasses, Algorithm, Layout, PassDir, and the named input/output structs used by the NCCL EP API.

Added nccl.core.device.cute, enabling CuTeDSL kernels to call NCCL device APIs.

Added top-level stack diagnostics with nccl.get_version() and nccl.show_versions(), reporting nccl4py, libnccl.so, and libnccl_ep.so versions, CUDA build variants, and loaded shared-library paths.

Added free-threaded CPython support.

New Features

NCCL EP Python API

New nccl.ep package provides Pythonic access to the NCCL EP extension library.

Group.create() creates EP groups from a Communicator and GroupConfig; Group.create_handle() creates handles with an explicit Layout.

Handle supports update(), dispatch(), combine(), complete(), and destroy().

DispatchInputs, DispatchOutputs, CombineInputs, CombineOutputs, and LayoutInfo provide named containers for the tensors and metadata used by dispatch, combine, and handle setup.

Tensor resolves Python buffers into ncclEpTensor_t descriptors.

GroupConfig, HandleConfig, DispatchConfig, CombineConfig, and AllocConfig expose typed configuration objects.

AllocFn and FreeFn expose caller-controlled EP allocation hooks.

nccl.ep.interop.torch.get_nccl_comm_from_group() provides PyTorch interop for creating an NCCL communicator from a PyTorch process group's rank and world-size information.

Importing nccl.ep sets default NCCL_EP_HOME when bundled EP JIT headers are present, and NCCL_HOME when NCCL public headers are available from the installed nvidia.nccl package.

nccl.ep checks that the loaded libnccl.so and libnccl_ep.so were built with the same CUDA major version. CUDA minor differences are accepted.
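
The compatibility rule described above (same CUDA major version required, minor differences tolerated) can be illustrated with a small standalone sketch. This is not nccl4py's implementation: the helper names `cuda_major` and `check_cuda_compat` are hypothetical, and the "major.minor" string format for the CUDA build version is an assumption made for illustration.

```python
def cuda_major(version: str) -> int:
    """Extract the major component from a 'major.minor' CUDA version string."""
    return int(version.split(".")[0])


def check_cuda_compat(nccl_cuda: str, ep_cuda: str) -> None:
    """Raise if the two libraries were built against different CUDA majors.

    Hypothetical helper: mirrors the rule stated in the release notes,
    not the actual check performed inside nccl.ep.
    """
    if cuda_major(nccl_cuda) != cuda_major(ep_cuda):
        raise RuntimeError(
            f"CUDA major mismatch: libnccl.so built with CUDA {nccl_cuda}, "
            f"libnccl_ep.so built with CUDA {ep_cuda}"
        )


# Minor-version differences pass silently.
check_cuda_compat("13.0", "13.1")

# A major-version difference is rejected.
try:
    check_cuda_compat("12.9", "13.0")
except RuntimeError as exc:
    print(exc)
```

Comparing only the major component keeps the check permissive enough for routine minor-version drift between the two libraries.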

Communicator Configuration

Added graph_stream_ordering to NCCLConfig.

Device API and CuTe DSL

New nccl.core.device.cute module exposes the NCCL device API to CuTeDSL kernels, including communicator/window access, GIN primitives, barrier operations, and typed structs.

Added bindings/nccl4py/examples/cute/main.py, a GIN put/wait example with host-side validation.

Added gin_strong_signals_required and gin_va_signals_required to NCCLDevCommRequirements for configuring device communicator requirements.

Added NcclGinType.GPI for the GPU-Push Interface transport.

Version and Diagnostics API

Top-level nccl.get_version() returns a VersionInfo dataclass containing the nccl4py package version plus LibraryInfo entries for the loaded libnccl.so and, when available, libnccl_ep.so.

Top-level nccl.show_versions() prints the same stack information in a human-readable version block.

Direct library probes are available for each native library: nccl.core.get_lib_version() and nccl.core.get_lib_path() report the loaded libnccl.so; nccl.ep.get_lib_version() and nccl.ep.get_lib_path() report the loaded libnccl_ep.so.

Each LibraryInfo includes release version, CUDA build variant, and loaded shared-library path.
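
A minimal sketch of calling the top-level diagnostics defensively, so that a script still runs where nccl4py is absent or older than 0.3.1. The wrapper name `describe_nccl_stack` and its `module_name` parameter are hypothetical; only `nccl.get_version()` comes from the release notes, and the field names of the returned VersionInfo are not assumed here.

```python
import importlib


def describe_nccl_stack(module_name: str = "nccl"):
    """Return the result of <module>.get_version(), or None if unavailable.

    Hypothetical wrapper: it only relies on the top-level get_version()
    named in the release notes and does not inspect the returned fields.
    """
    try:
        module = importlib.import_module(module_name)
    except (ImportError, OSError):
        return None  # package missing, or its native library failed to load
    get_version = getattr(module, "get_version", None)
    if get_version is None:
        return None  # older release without the top-level diagnostics API
    return get_version()


info = describe_nccl_stack()
if info is None:
    print("top-level nccl.get_version() is not available here")
else:
    print(info)  # the dataclass repr; show_versions() prints a readable block
```

Looking the function up with `getattr` avoids an AttributeError on releases that predate the top-level API.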

Installation and Packaging

CuTeDSL support can be installed through the CUDA-specific extras:

nccl4py[cu12] installs nvidia-cutlass-dsl>=4.5.2,=4.5.2,<5.0.

Wheels include package data for nccl/ep/lib/libnccl_ep.so plus EP JIT headers. The bundled libnccl_ep.so is built with CUDA 13, regardless of whether the cu12 or cu13 extra is installed. Users who want to use a CUDA 12 build of libnccl_ep.so must provide that library themselves, for example through LD_PRELOAD or LD_LIBRARY_PATH.

Wheels are available for free-threaded CPython 3.14t.
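
Before relying on the free-threaded wheels, it can help to confirm which interpreter build is actually running. The sketch below uses only the standard library; the helper names are hypothetical, and `sys._is_gil_enabled()` exists only on CPython 3.13 and later, so the code falls back to assuming the GIL is on when the probe is missing.

```python
import sys
import sysconfig


def is_free_threaded_build() -> bool:
    """True if this interpreter was compiled with the GIL made optional."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_is_enabled() -> bool:
    """True if the GIL is active at runtime (always True on older CPython)."""
    probe = getattr(sys, "_is_gil_enabled", None)
    return True if probe is None else bool(probe())


print("free-threaded build:", is_free_threaded_build())
print("GIL enabled at runtime:", gil_is_enabled())
```

The two checks differ because a free-threaded build can still re-enable the GIL at runtime, for example when an extension module does not declare free-threading support.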

Examples and Documentation

Added Python examples for:
multiple devices in one process: docs/examples/01_communicators/01_multiple_devices_single_process/python/;
one device per MPI process: docs/examples/01_communicators/03_one_device_per_process_mpi/python/;
point-to-point ring pattern: docs/examples/02_point_to_point/01_ring_pattern/python/;
allreduce: docs/examples/03_collectives/01_allreduce/python/;
user-buffer allreduce: docs/examples/04_user_buffer_registration/01_allreduce/python/;
symmetric-memory allreduce: docs/examples/05_symmetric_memory/01_allreduce/python/;
symmetric-memory allgather: docs/examples/05_symmetric_memory/02_allgather/python/.

Added nccl4py documentation under docs/userguide/source/nccl4py/, with the main entry point at docs/userguide/source/nccl4py.rst.

Breaking Changes

Removed APIs

nccl.core.group_simulate_end() has been removed. Use nccl.core.group_end(simulate=True):

from nccl.core import group_end, group_start

group_start()
# enqueue operations
info = group_end(simulate=True)

NCCL_SPLIT_NOCOLOR has been removed from the public constants. Use color=None when a rank should opt out of Communicator.split().
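
As a rough migration sketch, code that used to pass NCCL_SPLIT_NOCOLOR can compute an optional color instead. The helper `split_color` and its grouping scheme are hypothetical and exist only to show where None takes over from the removed constant; the full signature of Communicator.split() is not given in these notes, so no call to it is shown.

```python
from typing import Optional


def split_color(rank: int, ranks_per_group: int, participates: bool) -> Optional[int]:
    """Pick the color for a rank, or None to opt out of the split.

    Hypothetical helper: None replaces the removed NCCL_SPLIT_NOCOLOR constant.
    """
    if not participates:
        return None
    return rank // ranks_per_group


# Four ranks, two per group; rank 3 opts out of the split.
colors = [split_color(rank, 2, participates=(rank != 3)) for rank in range(4)]
print(colors)  # [0, 0, 1, None]
```

Each rank would then pass its value as the color argument of Communicator.split().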

Deprecated APIs

nccl.core.get_version() remains available, but is deprecated. Use top-level nccl.get_version() for structured version information, or nccl.show_versions() for human-readable output.

Other Compatibility Notes

Public NCCL enum wrappers are pure-Python IntEnum or IntFlag classes. Integer compatibility is preserved, and dtype conversion remains supported. Code that depends on binding-backed enum class identity from earlier releases may need updates.

Enum members now follow the Python enum convention of UPPER_SNAKE_CASE names, such as CTAPolicy.DEFAULT, CommShrinkFlag.ABORT, WindowFlag.COLL_SYMMETRIC, and NcclCommMemStat.GPU_MEM_TOTAL. The previous PascalCase/camelCase aliases, such as CTAPolicy.Default and NcclCommMemStat.GpuMemTotal, still work in 0.3.1 for compatibility, but will be removed in a future release. New code should use the uppercase names.
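
To make the naming change concrete, the sketch below uses a stand-in IntEnum rather than the real nccl4py classes. `DemoPolicy`, its member values, and the `to_upper_snake` helper are all hypothetical, and declaring the legacy name as a duplicate-value alias is only one way such compatibility could be provided; the notes do not say how nccl4py implements it.

```python
import re
from enum import IntEnum


class DemoPolicy(IntEnum):
    """Stand-in enum; names and values are illustrative, not nccl4py's."""
    DEFAULT = 0
    EFFICIENCY = 1
    Default = 0      # legacy alias: same value, so it resolves to DEFAULT
    Efficiency = 1   # legacy alias for EFFICIENCY


def to_upper_snake(name: str) -> str:
    """Convert a PascalCase/camelCase member name to UPPER_SNAKE_CASE."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).upper()


# Aliases are the same object as the canonical member.
print(DemoPolicy.Default is DemoPolicy.DEFAULT)   # True
# IntEnum members still compare equal to plain integers.
print(DemoPolicy.DEFAULT == 0)                    # True
# Only canonical names appear when iterating.
print([m.name for m in DemoPolicy])               # ['DEFAULT', 'EFFICIENCY']
# Renaming helper for migrating old call sites.
print(to_upper_snake("GpuMemTotal"))              # GPU_MEM_TOTAL
```

A helper like `to_upper_snake` could drive a search-and-replace over existing code before the legacy aliases are removed.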

Fixes and Enhancements

Fixed pointer lifetime handling for non-blocking communicator and window initialization.

Torch interop covers torch.uint32 and torch.uint64 when those dtypes are available.

API Stability

nccl.ep and nccl.core.device.cute are initial API support. Their public interfaces may change in future releases as the NCCL EP and CuTeDSL device API integration...

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine point release of nccl4py library.