
Conversation

@malfet commented Oct 1, 2025

Fixes spurious failures when PyTorch is linked statically with NCCL-2.28.3, where the error is not drained but instead propagates into the next CUDA kernel invocation.

Fixes pytorch/pytorch#164402

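To make the failure mode concrete, here is a small self-contained CUDA sketch (not NCCL code, and not the actual PR change) showing how a launch error that nobody drains gets reported against the next, unrelated kernel launch:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void noop() {}

int main(void) {
  // Provoke a launch error (4096 threads per block exceeds the limit)
  // and deliberately do NOT drain it with cudaGetLastError().
  noop<<<1, 4096>>>();

  // ... unrelated code runs, e.g. the next library kernel launch ...
  noop<<<1, 1>>>();

  // The stale error from the first launch now shows up here, blaming
  // an innocent kernel -- the "spurious failure" described above.
  printf("error after second launch: %s\n",
         cudaGetErrorString(cudaGetLastError()));
  return 0;
}
```

Draining the error right after the failing launch (or avoiding that launch entirely) keeps it from leaking into later launches, which is the kind of drain the description above refers to.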
@mnicely (Collaborator) commented Oct 2, 2025

@sjeaugey @xiaofanl

@xiaofanl-nvidia (Collaborator)

ACK. We are looking into this issue now.

marksantesson added a commit that referenced this pull request Oct 18, 2025
GPU-Initiated Networking (GIN):
 * Provides a device-side API for integrating GPU-Initiated Networking
   capability into application kernels (see the conceptual sketch after
   this list).
 * New transport layer called DOCA GPUNetIO.
 * New ncclGin construct to create, destroy and manipulate GIN contexts.
 * New ncclGinBarrierSession to provide synchronization functionality.
 * New put, signal, counter operations for data movement and signaling.
 * GIN API signatures and functionalities are subject to change.
 * GIN Support Requirements
   * CUDA 12.2 or later when compiling the GPU code
   * NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3
   * NVIDIA NICs: CX4 or newer. rdma-core >= 44.0
   * Requires nvidia-peermem or DMABUF support. When using DMABUF, Linux
     kernel >= 6.1 is required.
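The device-side API itself is not spelled out in these notes (and the signatures are explicitly subject to change), so the following is only a conceptual sketch of the intended pattern; gin_put, gin_signal, and gin_barrier are hypothetical placeholders stubbed out locally, not the real ncclGin device interface:

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholders, stubbed so the sketch compiles; they stand in
// for GIN-style put/signal/barrier operations and are NOT the NCCL GIN API.
__device__ void gin_put(int peer, void* dst, const void* src, size_t bytes) {}
__device__ void gin_signal(int peer, int signalId) {}
__device__ void gin_barrier() {}

// Shape of an application kernel that mixes compute with GPU-initiated
// communication instead of returning control to the host between steps.
__global__ void computeAndSend(float* local, float* remote, size_t n, int peer) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) local[i] *= 2.0f;                         // device-side compute
  __syncthreads();
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    gin_put(peer, remote, local, n * sizeof(float));   // push data to the peer
    gin_signal(peer, /*signalId=*/0);                  // tell the peer it landed
  }
  gin_barrier();                                       // synchronize before reuse
}
```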

New ncclCommRevoke API for fault tolerance:
 * Introduces ncclCommRevoke to quiesce ongoing NCCL work on a
   communicator without freeing resources (a hedged usage sketch follows
   this list).
 * This answers the need for a lightweight way to cancel in-flight
   collectives and bring a communicator to a safe state before
   split/shrink/finalize/destroy.
 * Includes optional cross-rank coordination (global barrier) and
   supports blocking/non-blocking usage.
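A hedged sketch of how this could slot into an error-handling path; the single-argument ncclCommRevoke call below is an assumed form (the barrier and non-blocking options mentioned above are not shown), while ncclCommGetAsyncError, ncclCommFinalize, and ncclCommDestroy are existing NCCL host APIs:

```c
#include <nccl.h>

// Assumed usage sketch, not the documented signature.
ncclResult_t quiesceAndDestroy(ncclComm_t comm) {
  ncclResult_t asyncErr = ncclSuccess;
  ncclCommGetAsyncError(comm, &asyncErr);  // did an in-flight operation fail?
  if (asyncErr != ncclSuccess) {
    ncclCommRevoke(comm);                  // assumed form: cancel in-flight work,
                                           // keep resources so cleanup stays safe
  }
  ncclCommFinalize(comm);                  // flush whatever remains
  return ncclCommDestroy(comm);            // then release resources
}
```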

New NCCL Environment Plugin:
 * The env plugin allows users to set NCCL environment variables, for
   example, after loading them from a centralized database.
 * The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external
   environment plugin (see the snippet after this list).
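For illustration, a minimal way to opt into such a plugin before the first NCCL call; only the NCCL_ENV_PLUGIN variable comes from these notes, and the plugin file name is hypothetical:

```c
#include <stdlib.h>

// Hypothetical plugin library name; NCCL_ENV_PLUGIN is the variable the
// release notes describe for loading an external environment plugin.
void pointNcclAtEnvPlugin(void) {
  setenv("NCCL_ENV_PLUGIN", "libnccl-env-db.so", 1);
  // The plugin can then populate NCCL_* variables, e.g. from a central
  // database, before NCCL reads them during initialization.
}
```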

New NCCL Examples on GitHub:
 * The NCCL examples directory provides users and developers with
   practical code samples that highlight NCCL’s core features.
 * It covers basic operations like communicator initialization,
   point-to-point communication, and collective operations, as well as
   advanced features such as user buffer registration, symmetric memory,
   and the device API (a minimal initialization and all-reduce sketch
   follows this list).
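In that spirit, a minimal single-process, multi-GPU all-reduce using the long-standing host API looks roughly like this (error handling trimmed; buffer size is arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  const size_t count = 1 << 20;

  ncclComm_t*   comms   = malloc(ndev * sizeof(ncclComm_t));
  cudaStream_t* streams = malloc(ndev * sizeof(cudaStream_t));
  float** sendbuf = malloc(ndev * sizeof(float*));
  float** recvbuf = malloc(ndev * sizeof(float*));

  // One communicator per local GPU (devices 0..ndev-1).
  ncclCommInitAll(comms, ndev, NULL);

  for (int i = 0; i < ndev; i++) {
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
    cudaMalloc((void**)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuf[i], count * sizeof(float));
  }

  // Group the per-GPU calls so NCCL treats them as one collective launch.
  ncclGroupStart();
  for (int i = 0; i < ndev; i++)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; i++) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("all-reduce done on %d GPU(s)\n", ndev);
  return 0;
}
```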

Device API improvements:
 * Adds ncclFindWindow API.
 * Adds new ncclBarrierSession to provide hybrid synchronization
   functionality.
 * Makes multimem available with as few as two ranks.
 * Removes distance (NCCL_P2P_LEVEL) considerations from determining the
   availability of symmetric memory.

Enhanced NCCL RAS output:
 * Extends RAS subsystem with JSON format to support machine-parsable
   metrics collection.
 * Enables structured data export for monitoring tools, dashboards, and
   automated analysis systems.

GitHub Pull Requests resolved:
 * Fast Init - CPU Optimizations for NCCL Initialization at Large Scale.
   (PR #1789)
 * Fast Init - Improve Bootstrap AllGather by 2x at large scale by
   sending bootstrap information bidirectionally. (PR #1791)
 * Fixes spurious failures when PyTorch is statically linked with
   NCCL-2.28.3, where the error is not drained but instead propagates
   into the next CUDA kernel invocation. (PR #1864)

Other notable improvements:
 * Fixes multicast object leaks in case of failed NVLS user buffer
   registrations, which could lead to crashes. Avoids such registration
   attempts when incompatible memory allocators are used.
 * Fixes potential data corruption with built-in symmetric kernels for
   small messages with size granularity under 8 bytes or when multiple
   symmetric operations were aggregated in a group.
 * Generalizes the existing point-to-point scheduling to the case of an
   uneven GPU count per node.
 * Fixes a crash when network plugin assignment fails.
 * Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain
   split mask settings, where NCCL cannot find a viable ring.
 * Fixes a crash when NCCL is compiled with recent CUDA versions but runs
   on hosts with certain older CUDA drivers.
@Skylion007

@malfet This should be fixed in master now with the latest commit (2.28.7).

Successfully merging this pull request may close these issues:

NCCL-2.28.3 build locally is unusable on H100