
Conversation

@malfet commented Oct 1, 2025

Fixes spurious failures when PyTorch is linked statically with NCCL-2.28.3, where the error is not drained but instead propagates into the next CUDA kernel invocation.

Fixes pytorch/pytorch#164402

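To make the failure mode concrete, here is a small self-contained CUDA sketch (not NCCL code, and not the actual PR change) showing how a launch error that nobody drains gets reported against the next, unrelated kernel launch:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void noop() {}

int main(void) {
  // Provoke a launch error (4096 threads per block exceeds the limit)
  // and deliberately do NOT drain it with cudaGetLastError().
  noop<<<1, 4096>>>();

  // ... unrelated code runs, e.g. the next library kernel launch ...
  noop<<<1, 1>>>();

  // The stale error from the first launch now shows up here, blaming
  // an innocent kernel -- the "spurious failure" described above.
  printf("error after second launch: %s\n",
         cudaGetErrorString(cudaGetLastError()));
  return 0;
}
```

Draining the error right after the failing launch (or avoiding that launch entirely) keeps it from leaking into later launches, which is the kind of drain the description above refers to.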
@mnicely (Collaborator) commented Oct 2, 2025

@sjeaugey @xiaofanl

@xiaofanl-nvidia (Collaborator)

ACK. We are looking into this issue now.

marksantesson added a commit that referenced this pull request Oct 18, 2025
GPU-Initiated Networking (GIN):
 * Provides a device-side API for integrating GPU-Initiated Networking
   capability into application kernels (see the conceptual sketch after
   this list).
 * New transport layer called DOCA GPUNetIO.
 * New ncclGin construct to create, destroy and manipulate GIN contexts.
 * New ncclGinBarrierSession to provide synchronization functionality.
 * New put, signal, counter operations for data movement and signaling.
 * GIN API signatures and functionalities are subject to change.
 * GIN Support Requirements
   * CUDA 12.2 or later when compiling the GPU code
   * NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3
   * NVIDIA NICs: CX4 or newer. rdma-core >= 44.0
   * Requires nvidia-peermem or DMABUF support. When using DMABUF, Linux
     kernel >= 6.1 is required.
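The device-side API itself is not spelled out in these notes (and the signatures are explicitly subject to change), so the following is only a conceptual sketch of the intended pattern; gin_put, gin_signal, and gin_barrier are hypothetical placeholders stubbed out locally, not the real ncclGin device interface:

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholders, stubbed so the sketch compiles; they stand in
// for GIN-style put/signal/barrier operations and are NOT the NCCL GIN API.
__device__ void gin_put(int peer, void* dst, const void* src, size_t bytes) {}
__device__ void gin_signal(int peer, int signalId) {}
__device__ void gin_barrier() {}

// Shape of an application kernel that mixes compute with GPU-initiated
// communication instead of returning control to the host between steps.
__global__ void computeAndSend(float* local, float* remote, size_t n, int peer) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) local[i] *= 2.0f;                         // device-side compute
  __syncthreads();
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    gin_put(peer, remote, local, n * sizeof(float));   // push data to the peer
    gin_signal(peer, /*signalId=*/0);                  // tell the peer it landed
  }
  gin_barrier();                                       // synchronize before reuse
}
```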

New ncclCommRevoke API for fault tolerance:
 * Introduces ncclCommRevoke to quiesce ongoing NCCL work on a
   communicator without freeing resources (a hedged usage sketch follows
   this list).
 * This answers the need for a lightweight way to cancel in-flight
   collectives and bring a communicator to a safe state before
   split/shrink/finalize/destroy.
 * Includes optional cross-rank coordination (global barrier) and
   supports blocking/non-blocking usage.
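A hedged sketch of how this could slot into an error-handling path; the single-argument ncclCommRevoke call below is an assumed form (the barrier and non-blocking options mentioned above are not shown), while ncclCommGetAsyncError, ncclCommFinalize, and ncclCommDestroy are existing NCCL host APIs:

```c
#include <nccl.h>

// Assumed usage sketch, not the documented signature.
ncclResult_t quiesceAndDestroy(ncclComm_t comm) {
  ncclResult_t asyncErr = ncclSuccess;
  ncclCommGetAsyncError(comm, &asyncErr);  // did an in-flight operation fail?
  if (asyncErr != ncclSuccess) {
    ncclCommRevoke(comm);                  // assumed form: cancel in-flight work,
                                           // keep resources so cleanup stays safe
  }
  ncclCommFinalize(comm);                  // flush whatever remains
  return ncclCommDestroy(comm);            // then release resources
}
```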

New NCCL Environment Plugin:
 * The env plugin allows users to set NCCL environment variables, for
   example, after loading them from a centralized database.
 * The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external
   environment plugin (see the snippet after this list).
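For illustration, a minimal way to opt into such a plugin before the first NCCL call; only the NCCL_ENV_PLUGIN variable comes from these notes, and the plugin file name is hypothetical:

```c
#include <stdlib.h>

// Hypothetical plugin library name; NCCL_ENV_PLUGIN is the variable the
// release notes describe for loading an external environment plugin.
void pointNcclAtEnvPlugin(void) {
  setenv("NCCL_ENV_PLUGIN", "libnccl-env-db.so", 1);
  // The plugin can then populate NCCL_* variables, e.g. from a central
  // database, before NCCL reads them during initialization.
}
```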

New NCCL Examples on GitHub:
 * The NCCL examples directory provides users and developers with
   practical code samples that highlight NCCL’s core features.
 * It covers basic operations like communicator initialization,
   point-to-point communication, and collective operations, as well as
   advanced features such as user buffer registration, symmetric memory,
   and the device API (a minimal initialization and all-reduce sketch
   follows this list).
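In that spirit, a minimal single-process, multi-GPU all-reduce using the long-standing host API looks roughly like this (error handling trimmed; buffer size is arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  const size_t count = 1 << 20;

  ncclComm_t*   comms   = malloc(ndev * sizeof(ncclComm_t));
  cudaStream_t* streams = malloc(ndev * sizeof(cudaStream_t));
  float** sendbuf = malloc(ndev * sizeof(float*));
  float** recvbuf = malloc(ndev * sizeof(float*));

  // One communicator per local GPU (devices 0..ndev-1).
  ncclCommInitAll(comms, ndev, NULL);

  for (int i = 0; i < ndev; i++) {
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
    cudaMalloc((void**)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuf[i], count * sizeof(float));
  }

  // Group the per-GPU calls so NCCL treats them as one collective launch.
  ncclGroupStart();
  for (int i = 0; i < ndev; i++)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; i++) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("all-reduce done on %d GPU(s)\n", ndev);
  return 0;
}
```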

Device API improvements:
 * Adds ncclFindWindow API.
 * Adds new ncclBarrierSession to provide hybrid synchronization
   functionality.
 * Makes multimem available with as few as two ranks.
 * Removes distance (NCCL_P2P_LEVEL) considerations from determining the
   availability of symmetric memory.

Enhanced NCCL RAS output:
 * Extends RAS subsystem with JSON format to support machine-parsable
   metrics collection.
 * Enables structured data export for monitoring tools, dashboards, and
   automated analysis systems.

GitHub Pull Requests resolved:
 * Fast Init - CPU Optimizations for NCCL Initialization at Large Scale.
   (PR #1789)
 * Fast Init - Improve Bootstrap AllGather by 2x at large scale by
   sending bootstrap information bidirectionally. (PR #1791)
 * Fixes spurious failures when PyTorch is statically linked with
   NCCL-2.28.3, where the error is not drained but instead propagates
   into the next CUDA kernel invocation. (PR #1864)

Other notable improvements:
 * Fixes multicast object leaks in case of failed NVLS user buffer
   registrations, which could lead to crashes. Avoids such registration
   attempts when incompatible memory allocators are used.
 * Fixes potential data corruption with built-in symmetric kernels for
   small messages with size granularity under 8 bytes or when multiple
   symmetric operations were aggregated in a group.
 * Generalizes the existing point-to-point scheduling to the case of an
   uneven GPU count per node.
 * Fixes a crash when network plugin assignment fails.
 * Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain
   split mask settings, where NCCL cannot find a viable ring.
 * Fixes a crash when NCCL is compiled with recent CUDA versions but runs
   on hosts with certain older CUDA drivers.
@Skylion007

@malfet This should be fixed in master now with the latest commit (2.28.7).

Successfully merging this pull request may close these issues:

NCCL-2.28.3 build locally is unusable on H100