Drain error code when kernel is not found #1864
Open
malfet wants to merge 2 commits into NVIDIA:master from malfet:patch-1
Conversation
Fixes spurious failures when PyTorch is linked statically with NCCL-2.28.3 because the error is not drained, but instead propagates into the next CUDA kernel invocation.

Fixes pytorch/pytorch#164402
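For context, a minimal sketch of why draining matters, using only standard CUDA runtime calls (the kernel and launch configuration below are illustrative, not NCCL's internal code): a failed launch leaves an error behind that cudaGetLastError() must consume; otherwise the next, unrelated launch is the one that reports it.

```cpp
// Hedged sketch: an undrained launch error leaks into the next launch;
// cudaGetLastError() drains it so the failure is reported where it happened.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* p) { if (p) p[threadIdx.x] = 0.f; }

int main() {
  float* buf = nullptr;
  cudaMalloc((void**)&buf, 256 * sizeof(float));

  // Suppose this launch fails (here via an invalid configuration; in the PR's
  // scenario, the kernel image is not found in the statically linked binary).
  dummyKernel<<<1, 2048>>>(buf);               // 2048 threads/block exceeds the limit
  cudaError_t launchErr = cudaGetLastError();  // <-- drains the error right here
  if (launchErr != cudaSuccess) {
    std::printf("launch failed: %s\n", cudaGetErrorString(launchErr));
    // Handle/report the failure; the error state is now cleared.
  }

  // Because the error was drained above, this unrelated launch is no longer
  // blamed for the earlier failure.
  dummyKernel<<<1, 256>>>(buf);
  std::printf("next launch: %s\n", cudaGetErrorString(cudaGetLastError()));

  cudaFree(buf);
  return 0;
}
```

The same pattern is what the PR title refers to: when the requested kernel is not found, the launch error has to be consumed where it occurs instead of surfacing at the caller's next kernel invocation.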
malfet commented Oct 1, 2025

Collaborator

ACK. We are looking into this issue now.
marksantesson added a commit that referenced this pull request on Oct 18, 2025
GPU-Initiated Networking (GIN):
* Provides device-side API for integrating GPU-Initiated Networking
capability into application kernels.
* New transport layer called DOCA GPUNetIO.
* New ncclGin construct to create, destroy and manipulate GIN contexts.
* New ncclGinBarrierSession to provide synchronization functionality.
* New put, signal, counter operations for data movement and signaling.
* GIN API signatures and functionalities are subject to change.
* GIN Support Requirements
* CUDA 12.2 or later when compiling the GPU code
* NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3
* NVIDIA NICs: CX4 or newer. rdma-core >= 44.0
* Requires nvidia-peermem or DMABUF support. When using DMABUF, linux
kernel >= 6.1 is required.
New ncclCommRevoke API for fault tolerance:
* Introduces ncclCommRevoke to quiesce ongoing NCCL work on a
communicator without freeing resources.
* This answers the need for a lightweight way to cancel in-flight
collectives and bring a communicator to a safe state before
split/shrink/finalize/destroy.
* Includes optional cross-rank coordination (global barrier) and
supports blocking/non-blocking usage.
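Purely as a hypothetical host-side sketch of how a fault-tolerance path might use this (the ncclCommRevoke signature shown, taking only the communicator, is an assumption not stated in these notes; ncclCommGetAsyncError and ncclCommDestroy are existing NCCL host APIs):

```cpp
// Hypothetical sketch only: the exact ncclCommRevoke() prototype is assumed
// (communicator-only); consult nccl.h in 2.28+ for the real signature.
#include <nccl.h>

// Called when a rank detects a failure and wants to stop in-flight work on
// `comm` before tearing it down (or before split/shrink/finalize).
void quiesceAndTearDown(ncclComm_t comm) {
  ncclResult_t async = ncclSuccess;
  ncclCommGetAsyncError(comm, &async);   // existing API: query asynchronous state
  if (async != ncclSuccess) {
    ncclCommRevoke(comm);                // assumed signature: cancel in-flight collectives
    ncclCommDestroy(comm);               // resources are freed separately from the revoke
  }
}
```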
New NCCL Environment Plugin:
* The env plugin allows users to set NCCL environment variables, for
example, after loading them from a centralized database.
* The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external
environment plugin.
New NCCL Examples on GitHub:
* The NCCL examples directory provides users and developers with
practical code samples that highlight NCCL’s core features.
* It covers basic operations like communicator initialization,
point-to-point communication, and collective operations, as well as
advanced features such as user buffer registration, symmetric memory,
and the device API.
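Along the lines of those basic samples, a minimal single-process, multi-GPU all-reduce might look like the sketch below (standard NCCL and CUDA runtime APIs; the device count, buffer size, and omitted error checking are illustrative simplifications):

```cpp
// Minimal sketch of a single-process, multi-GPU ncclAllReduce, in the spirit
// of the basic examples. Sizes and device count are illustrative.
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);

  std::vector<ncclComm_t> comms(nDev);
  ncclCommInitAll(comms.data(), nDev, nullptr);   // one communicator per visible GPU

  const size_t count = 1 << 20;
  std::vector<float*> sendbuf(nDev), recvbuf(nDev);
  std::vector<cudaStream_t> streams(nDev);
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void**)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Group the per-GPU calls so NCCL can launch them together.
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```

Wrapping the per-GPU calls in ncclGroupStart/ncclGroupEnd is what allows a single thread to drive several communicators without deadlocking on sequential blocking calls.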
Device API improvements:
* Adds ncclFindWindow API.
* Adds new ncclBarrierSession to provide hybrid synchronization
functionality.
* Makes multimem available with as few as two ranks.
* Removes distance (NCCL_P2P_LEVEL) considerations from determining the
availability of symmetric memory.
Enhanced NCCL RAS output:
* Extends RAS subsystem with JSON format to support machine-parsable
metrics collection.
* Enables structured data export for monitoring tools, dashboards, and
automated analysis systems.
GitHub Pull Requests resolved:
* Fast Init - CPU Optimizations for NCCL Initialization at Large Scale.
  (PR #1789)
* Fast Init - Improve Bootstrap AllGather by 2x at large scale by
sending bootstrap information bidirectionally. (PR #1791)
* Fixes spurious failures when PyTorch is statically linked with
  NCCL-2.28.3 because the error is not drained, but instead propagates
  into the next CUDA kernel invocation. (PR #1864)
Other notable improvements:
* Fixes multicast object leaks in case of failed NVLS user buffer
  registrations, which could lead to crashes. Avoids such registration
  attempts when incompatible memory allocators are used.
* Fixes potential data corruption with built-in symmetric kernels for
small messages with size granularity under 8 bytes or when multiple
symmetric operations were aggregated in a group.
* Generalizes the existing point-to-point scheduling to the case of an
  uneven GPU count per node.
* Fixes a crash when network plugin assignment fails.
* Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain
split mask settings, where NCCL cannot find a viable ring.
* Fixes a crash when NCCL is compiled with recent CUDA versions but
  run on hosts with certain older CUDA drivers.
@malfet Should be fixed in master now with the latest commit, 2.28.7.