Commit 72d2432

NCCL 2.27.3-1
Symmetric memory API and symmetric kernels
* Redesign from the ground up, enabling major latency and bandwidth improvements.
* Add new API calls to register user-allocated memory among communicator ranks into a NCCL window: ncclCommWindowRegister() and ncclCommWindowDeregister(). The calls currently support symmetric registration for P2P and NVLS, and require VMM memory buffers (i.e., CUMEM must be operational).
* Implement specialized kernels taking advantage of symmetrically registered memory, with performance gains expected particularly for small to medium message sizes.
* The kernels support 32 bit floating point types and smaller, and sum as the reduction operator, with no more than one collective operation per group.
* Floating point summation is always done in fp32 accumulators (with the exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus, the accuracy with fp8 and fp16 data types should be much improved.
* This initial implementation supports non-network communicators only (P2P and NVLS transports).
* To explore this functionality users need to use the new memory registration API calls with the NCCL_WIN_COLL_SYMMETRIC flag and all ranks of a communicator must pass buffers at the same offset in the same registration when invoking a collective NCCL operation.

Add support for DGX Spark.

Add support for DirectNIC (CX8) to the internal IB plugin.

Add a new ncclCommShrink() API call
* It is a non-collective call similar to ncclCommSplit(), which makes it possible to exclude some (possibly unresponsive) ranks from the parent communicator.

Add support for loading multiple network plugins
* This enables the creation of generic containers that can work across a range of providers.
* Allow NCCL_NET_PLUGIN to accept a comma-separated list of plugins to load.

NVLink SHARP (NVLS) improvements
* Implement NVLS+IB SHARP support for AllGather and ReduceScatter with user buffer registration. This improves performance and reduces the number of CTAs needed to achieve peak bandwidth.
* Gracefully fall back by default to other transports if NVLS initialization fails (the old behavior of returning an error code from a NCCL call can be preserved by setting NCCL_NVLS_ENABLE=1).
* Decrease the NVLS channel count to 24 on Blackwell systems with multiple NVLink domains per communicator.
* Enable fine-tuning of NCCL behavior per communicator using new "ncclConfig_t" members "collnetEnable", "CTAPolicy", and "nvlsCTAs".

Profiler improvements
* Extend the init function by adding communicator name, comm id (hash), rank, number of ranks, number of nodes, and the NCCL log function to the argument list. This makes the name and the comm id available to all events in the communicator without explicitly passing them to each individual event. Add the communicator id and rank to the profiler trace filename. Now, the communicator name can be set via a new "ncclConfig_t" member "commName".
* Improve the accuracy of the GPU kernel events by providing GPU-generated timestamps for the start and stop of every NCCL operation.
* Harmonize proxy events, removing overlaps between ProxyOp and ProxyStep states.
* Add support for network-defined event updates (through "recordEventState").
* Report the correct number of channels used by every collective/p2p operation (used to be set to nMaxChannels for collectives and absent for p2ps).
* Fix the logic on proxyCtrl Idle/Active events (Issue #1162).
* Fix an issue where the network proxy profiler could lose track of an event identifier (Issue #1682).
* Improve the backward compatibility with plugins older than v4.
* Ensure that the work counters are 0-initialized.
* Fix a potential race condition in the network profiler that could result in an event being linked to a wrong parent.

MNNVL improvements
* Increase to 16 the number of NICs used to communicate between MNNVL domains on GB200 systems, to optimize the performance of collective operations.
* Add support for more complex MNNVL topologies with up to 32 NICs per node.
* If the MNNVL fabric initialization was unsuccessful, NCCL will now fail by default, so as to avoid inadvertently falling back to a potentially much slower network transport. Such failures are typically due to a misconfigured IMEX support on the system. To continue without MNNVL, restart the job with NCCL_MNNVL_ENABLE=0.
* Fix a potential hang in alltoall-like communication patterns at a scale of over 80 ranks.
* Make NCCL_P2P_DISABLE=1 imply NCCL_MNNVL_ENABLE=0 (so the latter no longer needs to be specified on MNNVL systems).
* Fix an initialization failure when NCCL_TOPO_FILE is used on MNNVL systems.
* Fix the graph search to exclude non-local NICs.
* Fix the SHM transport to use fabric handles on MNNVL systems.

NIC Fusion improvements
* Disable the creation of fused NICs for physical devices that haven't been merged.
* Flatten multiple ports to a single PCI device within the internal IB plugin and reparent dual-port NICs under the first PCI parent. If the parent is not a PCI switch, PCI devices for fused NICs won't be duplicated.
* Route traffic on GB200-CX8 systems through DirectNIC, not the host interface.

Improve support for platforms with C2C connectivity (e.g., GB200)
* Enable GPUDirect RDMA for the NICs by default.
* Add support for P2C (PXN over C2C) and the LL128 protocol.

Extend NCCL fault tolerance in multithreaded scenarios
* Support the creation of multiple nonblocking communicators within a single group and polling in parallel for the completion using multiple threads (one per communicator).

Enable ncclImplicitOrderLaunch for CUDA 12.9+
* This can potentially speed up NCCL_IMPLICIT_LAUNCH_ORDER.

Improve the netSocket transport latency and control
* Provide finer control over the size of the socket send/receive buffers, the task size, and the number of sockets that a single peer can open.
* Add support for the inlining of small messages behind the header when using multiple sockets per connection.

Improve the readability of the CPU affinity in the debug output
* Print it as a range string rather than a bitmask.

Fix a potential race condition in graph execution
* A contention could arise when mixing graph and non-graph execution.

Improve PXN connection code
* Avoid duplicate and unused connections.

RAS fixes
* Fix a memory corruption at job termination time in case of a previously failed initialization of a RAS socket connection.
* Fix a race condition leading to a crash when generating a RAS report during communicator initialization (Issues #1669, #1718).
* Fix a potential race condition when gathering data for a RAS status report.

Fix a potential memory corruption in ncclCommSplit()
* Memory could get corrupted when resource sharing was in use and the size of the NVLink domain in the new communicator was smaller than in the old one.

Fix asynchronous graph upload
* Fix a small memory leak.
* Fix oversynchronization.

Add a check for out-of-memory conditions in ncclMemAlloc()

Clean up the NCCL socket code
* accept() will retry also if just reading the magic failed (Issue #1613).
* connect() will retry also if poll() did not return a POLLOUT event (Issue #1618).
* Add error checking in a few instances (Issue #1539).
* Fix the loop condition in ncclFindInterfaceMatchSubnet() (Issue #1574).
* Clean up the debug output, downgrading WARN messages to INFO in non-critical cases, and printing the peer's address where relevant.

Switch NCCL_DEBUG_FILE to line buffering
* This should help avoid mixed-up partial output lines in multithreaded cases.

Other minor fixes
* Improve the checks for buffer overflows in the graph code (Issue #1585).
* Extend logging and state clearing to all four events in the internal IB plugin (Issue #1650).
* Fix the error path in case IB communication is not ready (Issue #1489).
* Add ECE logging for IB fabric.
* Fix various minor issues in the graph module (Issue #1635).
* Clean up the debug output in the graph code, downgrading WARN messages to INFO in non-critical cases.
* Add a missing argument to a directSend() call (Issue #1628).
* Remove duplicate code in sendProxySetup() (Issue #1420).
* Fix the order of arguments of cudaDeviceCanAccessPeer() (Issue #1507).
* Fix compiler warnings with GCC 14.
* Fix a typo in a comment (Issue #1236).
1 parent 8171af6 commit 72d2432
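
The commit message says the CPU affinity is now printed as a range string rather than a bitmask. As a rough illustration of that kind of conversion (not NCCL's actual code, which this page does not show), the sketch below formats a 64-bit mask as comma-separated ranges. The helper name `mask_to_ranges` and the 64-CPU limit are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper (not part of NCCL): format the set bits of `mask`
// as comma-separated ranges, e.g. bits 0,1,2,3,8 become "0-3,8".
// Returns the number of characters written (excluding the terminator),
// or -1 if `out` is too small to hold the full string.
static int mask_to_ranges(uint64_t mask, char* out, size_t outLen) {
  assert(out != NULL && outLen > 0);
  size_t pos = 0;
  out[0] = '\0';
  int cpu = 0;
  while (cpu < 64) {
    if (!((mask >> cpu) & 1)) { cpu++; continue; }
    int first = cpu;
    // Extend the range while the next bit is also set.
    while (cpu + 1 < 64 && ((mask >> (cpu + 1)) & 1)) cpu++;
    int last = cpu;
    int n;
    if (first == last)
      n = snprintf(out + pos, outLen - pos, "%s%d", pos ? "," : "", first);
    else
      n = snprintf(out + pos, outLen - pos, "%s%d-%d", pos ? "," : "", first, last);
    if (n < 0 || (size_t)n >= outLen - pos) return -1;  // truncated
    pos += (size_t)n;
    cpu++;
  }
  return (int)pos;
}
```

A range string stays short for contiguous CPU sets, which is presumably why it reads better in debug output than a long hexadecimal mask.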

99 files changed: 7216 additions, 2022 deletions

ext-net/example/nccl/common.h

Lines changed: 6 additions & 0 deletions
@@ -7,9 +7,15 @@
 #ifndef COMMON_H_
 #define COMMON_H_

+#include <stdint.h>
+
 typedef enum {NCCL_LOG_NONE=0, NCCL_LOG_VERSION=1, NCCL_LOG_WARN=2, NCCL_LOG_INFO=3, NCCL_LOG_ABORT=4, NCCL_LOG_TRACE=5} ncclDebugLogLevel;
 typedef enum {NCCL_INIT=1, NCCL_COLL=2, NCCL_P2P=4, NCCL_SHM=8, NCCL_NET=16, NCCL_GRAPH=32, NCCL_TUNING=64, NCCL_ENV=128, NCCL_ALLOC=256, NCCL_CALL=512, NCCL_PROXY=1024, NCCL_NVLS=2048, NCCL_BOOTSTRAP=4096, NCCL_REG=8192, NCCL_ALL=~0} ncclDebugLogSubSys;

 typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);

+enum { ncclProfilerNetEventStart = 0, ncclProfilerNetEventStop, ncclProfilerNetEventUpdate, ncclProfilerNetEventUpdateAndStop };
+
+typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* phandle, int64_t pluginId, void* extData);
+
 #endif
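
The header above declares `ncclDebugLogger_t` as a printf-style, variadic function pointer. The following is a minimal sketch of a function with a matching signature that formats into a buffer with `vsnprintf`. The two enums and the typedef are copied from the diff so the example compiles on its own; `captureLog`, `lastMessage`, `lastLevel`, and `logFn` are hypothetical names, and the output format is an assumption rather than what NCCL's own logger produces.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

// Copied from the diff above so that the example is self-contained.
typedef enum {NCCL_LOG_NONE=0, NCCL_LOG_VERSION=1, NCCL_LOG_WARN=2, NCCL_LOG_INFO=3, NCCL_LOG_ABORT=4, NCCL_LOG_TRACE=5} ncclDebugLogLevel;
typedef enum {NCCL_INIT=1, NCCL_COLL=2, NCCL_P2P=4, NCCL_SHM=8, NCCL_NET=16, NCCL_GRAPH=32, NCCL_TUNING=64, NCCL_ENV=128, NCCL_ALLOC=256, NCCL_CALL=512, NCCL_PROXY=1024, NCCL_NVLS=2048, NCCL_BOOTSTRAP=4096, NCCL_REG=8192, NCCL_ALL=~0} ncclDebugLogSubSys;
typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);

// Hypothetical state for the example: the most recent formatted message.
static char lastMessage[256];
static ncclDebugLogLevel lastLevel = NCCL_LOG_NONE;

// Hypothetical logger with the ncclDebugLogger_t signature. It ignores the
// subsystem flags and stores "file:line message" in lastMessage.
static void captureLog(ncclDebugLogLevel level, unsigned long flags,
                       const char* file, int line, const char* fmt, ...) {
  (void)flags;
  assert(file != NULL && fmt != NULL);
  char body[192];
  va_list ap;
  va_start(ap, fmt);
  vsnprintf(body, sizeof body, fmt, ap);
  va_end(ap);
  snprintf(lastMessage, sizeof lastMessage, "%s:%d %s", file, line, body);
  lastLevel = level;
}

// A plugin would typically store the pointer it is handed and call through it.
static ncclDebugLogger_t logFn = captureLog;
```

A call such as `logFn(NCCL_LOG_INFO, NCCL_NET, "net.c", 42, "opened %d devices", 2)` then goes through the same pointer type that the v4 profiler `init` receives as `logfn`.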

ext-net/example/nccl/net.h

Lines changed: 1 addition & 3 deletions
@@ -8,9 +8,9 @@
 #include <stdint.h>
 #include <stdlib.h>

-#include "common.h"
 #include "err.h"
 #include "net_device.h"
+#include "common.h"

 #define NCCL_NET_HANDLE_MAXSIZE 128
 #define NCCL_MAX_NET_SIZE_BYTES (1*1024*1024*1024*1024L) //1TB
@@ -23,8 +23,6 @@
 // Maximum number of requests per comm object
 #define NCCL_NET_MAX_REQUESTS 32

-typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* phandle, int64_t pluginId, void* extData);
-
 #include "net_v10.h"
 #include "net_v9.h"
 #include "net_v8.h"

ext-profiler/README.md

Lines changed: 78 additions & 49 deletions
@@ -49,9 +49,9 @@ of newer ones.
 The `nccl/` directory is populated with `profiler_vX.h` files extracting all relevant definitions
 from old API versions. It also provides error codes in `err.h`.

-# API (v3)
+# API (v4)

-Below is the main `ncclProfiler_v3` struct. Each function is explained in later sections.
+Below is the main `ncclProfiler_v4` struct. Each function is explained in later sections.

 ```
 typedef struct {
@@ -60,17 +60,23 @@ typedef struct {
   // init - initialize the profiler plugin
   // Input
   // - context : opaque profiler context object for separating profiler behavior across comms
+  // - commName : user assigned communicator name
+  // - commHash : communicator id
+  // - nNodes : number of nodes in communicator
+  // - nranks : number of ranks in communicator
+  // - rank : rank identifier in communicator
+  // - logfn : logger function
   // Output
   // - eActivationMask: bitmask of active events set by the plugin
-  ncclResult_t (*init)(void** context, int* eActivationMask);
+  ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);

   // startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
   // Input
   // - context: opaque profiler context object
   // - eDescr : pointer to ncclProfilerEventDescr_t object
   // Output
   // - eHandle: return event handle for supplied event descriptor object
-  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v3_t* eDescr);
+  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);

   // stopEvent - stop/finalize an event inside and event set
   // Input
@@ -82,13 +88,13 @@ typedef struct {
   // - eHandle : handle to event object created through startEvent
   // - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
   // - eState : event state transition
-  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v3_t eState, ncclProfilerEventStateArgs_v3_t* eStateArgs);
+  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);

   // finalize - finalize the profiler plugin
   // Input
   // - context: opaque profiler context object
   ncclResult_t (*finalize)(void* context);
-} ncclProfiler_v3_t;
+} ncclProfiler_v4_t;
 ```

 ## Error codes
@@ -147,29 +153,26 @@ typedef struct {
   int rank; // rank that generated the event
   union {
     struct { // collective events metadata
-      const char* name; // string containing name of the communicator
-      uint64_t commHash; // unique hash/id for the communicator
       uint64_t seqNumber; // sequence number of this collective operation in the communicator
       const char* func; // string containing name of the collective
       void const* sendBuff; // address of send buffer
       void* recvBuff; // address of recv buffer
       size_t count; // data count
       int root; // root rank
       const char* datatype; // string containing the name of the datatype
-      uint8_t nMaxChannels; // max number of channels for this collective
+      uint8_t nChannels; // number of channels for this collective
       uint8_t nWarps; // number of GPU warps for this collective
       const char* algo; // string containing name of the algorithm for this collective
       const char* proto; // string containing name of the protocol for this collective
     } coll;

     struct { // point-to-point events metadata
-      const char* name;
-      uint64_t commHash;
       const char* func;
       void* buff;
       const char* datatype;
       size_t count;
       int peer; // peer rank for this point-to-point
+      uint8_t nChannels; // number of channels for this p2p
     } p2p;

     struct { // proxyOp events metadata
@@ -178,7 +181,7 @@ typedef struct {
       int peer; // peer rank
       int nSteps; // number of network transfers/steps required by the `ncclProxyOp`
       int chunkSize; // chunk size for this `ncclProxyOp`
-      int isSend; // set to 1 for sends and 0 for recvs
+      int isSend; // type of network operation
     } proxyOp;

     struct { // proxyStep events metadata
@@ -187,14 +190,15 @@ typedef struct {

     struct {
       uint8_t channelId; // id of the channel used by the kernel
+      uint64_t ptimer; // kernel supplied timestamp
     } kernelCh;

     struct {
       int64_t id; // net plugin id (used by net and profiler plugins to agree on event definitions)
       void* data; // pointer to network plugin defined event
     } netPlugin;
   };
-} ncclProfilerEventDescr_v3_t;
+} ncclProfilerEventDescr_v4_t;
 ```

 NCCL defines the following events: `ncclProfileGroup`, `ncclProfileColl`, `ncclProfileP2p`,
@@ -212,45 +216,57 @@ handle after `eventStop` is undefined behavior.
 Some events can only be started and stopped. For example, `ncclProfileGroup`, `ncclProfileColl`,
 `ncclProfileP2p`, cannot be updated through calls to `recordEventState`.

-`ncclProfileProxyOp`, `ncclProfileProxyStep` and `ncclProfileProxyCtrl` can be updated through
-calls to `recordEventState`.
+`ncclProfileProxyOp`, `ncclProfileProxyStep`, `ncclProfileNetPlugin`, `ncclProfileKernelCh`, and
+`ncclProfileProxyCtrl` can be updated through calls to `recordEventState`.

-The state of proxy generated events can be updated, along with event attributes, using
-`recordEventState`. These events can go through several states during their lifecycle.
-The list of supported states for the proxy-defined events is reported below.
+The state of these events can be updated, along with event attributes, using `recordEventState`.
+These events can go through several states during their lifecycle.
+
+The list of supported states for the updatable events is reported below.

 ```
 typedef enum {
   // ncclProfileProxyOp event states
-  ncclProfilerProxyOpSendPosted, // state marks the posting of send buffer to GPU for given network transfer/step
-  ncclProfilerProxyOpSendRemFifoWait, // state marks the waiting of CTS credits from peer rank
-  ncclProfilerProxyOpSendTransmitted, // state marks the sending of network transfer/step to peer rank
-  ncclProfilerProxyOpSendDone, // state marks the ending of network transfer/step
-  ncclProfilerProxyOpRecvPosted, // state marks the posting of recv to network for given network transfer/step
-  ncclProfilerProxyOpRecvReceived, // state marks the recving of network transfer/step from peer rank
-  ncclProfilerProxyOpRecvTransmitted, // state marks the ending of the network transfer/step
-  ncclProfilerProxyOpRecvDone, // state marks the consuming of data from GPU
+  ncclProfilerProxyOpSendPosted = 0, // deprecated in v4
+  ncclProfilerProxyOpSendRemFifoWait = 1, // deprecated in v4
+  ncclProfilerProxyOpSendTransmitted = 2, // deprecated in v4
+  ncclProfilerProxyOpSendDone = 3, // deprecated in v4
+  ncclProfilerProxyOpRecvPosted = 4, // deprecated in v4
+  ncclProfilerProxyOpRecvReceived = 5, // deprecated in v4
+  ncclProfilerProxyOpRecvTransmitted = 6, // deprecated in v4
+  ncclProfilerProxyOpRecvDone = 7, // deprecated in v4
+  ncclProfilerProxyOpInProgress_v4 = 19,// state marks transition of proxy op to progress

   // ncclProfileProxyStep event states
-  ncclProfilerProxyStepSendGPUWait, // state marks the waiting of send data from GPU for given network transfer/step
-  ncclProfilerProxyStepSendWait, // state marks the waiting of send data from network for given network transfer/step
-  ncclProfilerProxyStepRecvWait, // state marks the waiting of recv data from network for given network transfer/step
-  ncclProfilerProxyStepRecvFlushWait, // state marks the waiting of recv data flush to GPU for given network transfer/step
-  ncclProfilerProxyStepRecvGPUWait, // state marks the waiting of recv data consumption from GPU for given network transfer/step
+  ncclProfilerProxyStepSendGPUWait = 8, // state marks the waiting of send data from GPU for given network transfer/step
+  ncclProfilerProxyStepSendPeerWait_v4 = 20,// state marks the waiting of recv clear to send credits for given network transfer/step
+  ncclProfilerProxyStepSendWait = 9, // state marks the waiting of send data from network for given network transfer/step
+  ncclProfilerProxyStepRecvWait = 10,// state marks the waiting of recv data from network for given network transfer/step
+  ncclProfilerProxyStepRecvFlushWait = 11,// state marks the waiting of recv data flush to GPU for given network transfer/step
+  ncclProfilerProxyStepRecvGPUWait = 12,// state marks the waiting of recv data consumption from GPU for given network transfer/step

   // ncclProfileProxyCtrl event states
-  ncclProfilerProxyCtrlIdle, // state marks proxy progress thread idle
-  ncclProfilerProxyCtrlActive, // state marks proxy progress thread active
-  ncclProfilerProxyCtrlSleep, // state marks proxy progress thread sleeping
-  ncclProfilerProxyCtrlWakeup, // state marks proxy progress thread waking up
-  ncclProfilerProxyCtrlAppend, // state marks append of new network work item begin
-  ncclProfilerProxyCtrlAppendEnd, // state marks append of new network work item end
-} ncclProfilerEventState_v3_t;
+  ncclProfilerProxyCtrlIdle = 13,// state marks proxy progress thread idle
+  ncclProfilerProxyCtrlActive = 14,// state marks proxy progress thread active
+  ncclProfilerProxyCtrlSleep = 15,// state marks proxy progress thread sleeping
+  ncclProfilerProxyCtrlWakeup = 16,// state marks proxy progress thread waking up
+  ncclProfilerProxyCtrlAppend = 17,// state marks append of new network work item begin
+  ncclProfilerProxyCtrlAppendEnd = 18,// state marks append of new network work item end
+
+  // ncclProfileNetPlugin event states
+  ncclProfilerNetPluginUpdate = 21,// state marks update of network defined event
+
+  // ncclProfileKernelCh event states
+  ncclProfilerKernelChStop = 22,// state marks stop of kernelCh event and timestamp update
+} ncclProfilerEventState_v4_t;
 ```

 `ncclProfileProxyOp` events are generated by the proxy progress thread while it is processing
 network requests for the GPU kernel. ProxyOp events are generated for every active channel and
-provide a summary of the activity of the proxy progress thread for that channel.
+provide a summary of the activity of the proxy progress thread for that channel. Most of the
+states for this event were duplicated with `ncclProfileProxyStep` events. Therefore, starting
+with version 4 of the profiler interface these states have been deprecated. The same level of
+information can still be obtained through the `ncclProfileProxyStep` events.

 `ncclProfileProxyStep` events are generated by the proxy progress thread while it is processing
 network requests for the GPU kernel. ProxyStep events describe individual network transfer in
@@ -348,15 +364,22 @@ reason the profiler defines the `ncclProfilerEventStateArgs_t` struct, reported

 ```
 typedef union {
-  struct { // attributes to update for ncclProfileProxyOp events
-    size_t transSize; // data transferred thus far
-    int steps; // network transfer/steps processed thus far
-  } proxyOp;
+  struct { // attributes for update for ncclProfileProxyStep events
+    size_t transSize; // transfer size field for this proxy step
+  } proxyStep;

-  struct { // attributes to update for ncclProfileProxyCtrl
+  struct { // attributes to update for ncclProfileProxyCtrl events
     int appendedProxyOps; // number of appended proxy ops thus far
   } proxyCtrl;
-} ncclProfilerEventStateArgs_v3_t;
+
+  struct { // attributes to update for ncclProfileNetPlugin events
+    void* data; // network plugin opaque update data field
+  } netPlugin;
+
+  struct { // attribute to update for ncclProfileKernelCh events
+    uint64_t pTimer; // timestamp provided by the NCCL kernel
+  } kernelCh;
+} ncclProfilerEventStateArgs_v4_t;
 ```

 The example profiler in `ext-profiler/example` contains details on how to capture and use the events above.
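
To make the v4 state-args union above more concrete, here is a hedged sketch of how a plugin's `recordEventState` callback might dispatch on the new states. The union is copied from the diff, the enum is a three-value subset with the same numeric values, and `ncclResult_t` is a local stand-in so the example compiles alone. `struct myEvent` and `myRecordEventState` are hypothetical, and the pairing of each state with a union member is inferred from the comments in the diff rather than from NCCL's sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Stand-in so the example compiles alone; a real plugin would include the
// headers from the `nccl/` directory instead.
typedef enum { ncclSuccess = 0 } ncclResult_t;

// Subset of the v4 states listed in the diff, with the same numeric values.
typedef enum {
  ncclProfilerProxyCtrlAppendEnd = 18,
  ncclProfilerNetPluginUpdate    = 21,
  ncclProfilerKernelChStop       = 22
} ncclProfilerEventState_v4_t;

// Copy of the v4 state-args union from the diff.
typedef union {
  struct { size_t transSize; } proxyStep;
  struct { int appendedProxyOps; } proxyCtrl;
  struct { void* data; } netPlugin;
  struct { uint64_t pTimer; } kernelCh;
} ncclProfilerEventStateArgs_v4_t;

// Hypothetical per-event record kept by the plugin.
struct myEvent {
  uint64_t stopTimestamp;  // last kernel-supplied stop timestamp
  int appendedProxyOps;    // last reported number of appended proxy ops
  void* netData;           // last opaque network plugin update
  int updates;             // number of state updates seen
};

// Sketch of a recordEventState callback. Which union member is valid for
// which state is an assumption based on the struct comments in the diff.
static ncclResult_t myRecordEventState(void* eHandle,
                                       ncclProfilerEventState_v4_t eState,
                                       ncclProfilerEventStateArgs_v4_t* eStateArgs) {
  struct myEvent* ev = (struct myEvent*)eHandle;
  if (ev == NULL) return ncclSuccess;          // event is not tracked
  ev->updates++;
  if (eStateArgs == NULL) return ncclSuccess;  // the argument is optional
  switch (eState) {
    case ncclProfilerKernelChStop:
      ev->stopTimestamp = eStateArgs->kernelCh.pTimer;
      break;
    case ncclProfilerProxyCtrlAppendEnd:
      ev->appendedProxyOps = eStateArgs->proxyCtrl.appendedProxyOps;
      break;
    case ncclProfilerNetPluginUpdate:
      ev->netData = eStateArgs->netPlugin.data;
      break;
    default:
      break;
  }
  return ncclSuccess;
}
```

Because the argument is a union, only the member that corresponds to the reported state should be read; the `switch` keeps that pairing in one place.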
@@ -396,12 +419,12 @@ ProxyCtrl event
 ## Profiling of collective and p2p operations

 The NCCL code is instrumented with profiler callbacks at different levels to capture start/stop of groups,
-collective and point-to-point operations, as well as proxy progress activity. Due to the asynchronous nature
+collective and point-to-point operations, as well as proxy, kernel and network activity. Due to the asynchronous nature
 of NCCL operations, events associated to collective and point-to-point operations are not easy to delimit
 precisely. For example, without both proxy and/or kernel activity it is impossible for the profiler to
 figure out when a collective operation completes. Therefore, `stopEvent` for collectives simply indicates to
-the profiler that the collective has been enqueued. The profiler can leverage proxy event information, if
-these are enabled, to estimate when the collective ends. In this case, the profiler can look at the `stopEvent`
+the profiler that the collective has been enqueued. The profiler can leverage proxy and/or kernel event information, if
+these are enabled, to estimate when the collective ends. For example, the profiler can look at the `stopEvent`
 call of the last `ncclProfileProxyOp` event to mark the completion of the associated collective event. This
 can be achieved by reference counting the collective event and letting calls to `startEvent` and `stopEvent`
 increment and decrement the reference counter, respectively.
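
The reference-counting idea described in the hunk above could look roughly like the sketch below. It is not the example plugin's code: `struct collEvent`, `collInit`, `collRef`, and `collUnref` are hypothetical names, and timestamps are passed in as plain doubles. The counter is a C11 atomic on the assumption that the collective is started on the caller's thread while ProxyOp events arrive from the proxy progress thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical record a profiler plugin keeps for one collective.
struct collEvent {
  atomic_int refCount;  // the collective itself plus its child events
  double stopTime;      // completion time, valid once `done` is true
  bool done;
};

static void collInit(struct collEvent* c) {
  assert(c != NULL);
  atomic_init(&c->refCount, 0);
  c->stopTime = 0.0;
  c->done = false;
}

// Called from startEvent for the collective and for each child event
// (for example a ProxyOp) that names the collective as its parent.
static void collRef(struct collEvent* c) {
  atomic_fetch_add(&c->refCount, 1);
}

// Called from stopEvent for the collective and for each child event.
// Returns true only for the call that drops the count to zero; that call
// records `now` as the completion time of the collective.
static bool collUnref(struct collEvent* c, double now) {
  int previous = atomic_fetch_sub(&c->refCount, 1);
  assert(previous > 0);
  if (previous != 1) return false;
  c->stopTime = now;
  c->done = true;
  return true;
}
```

With this scheme the `stopEvent` call for the collective itself (the enqueue) only drops one reference, and the collective is marked complete when the last outstanding child event stops.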
@@ -425,8 +448,14 @@ enqueue can be time stamped by the profiler (at start and stop) to reconstruct t
 collective. However, this time only represents the launch time of the collective and not the actual
 execution time. To reconstruct the execution time more accurately proxy and kernel events are provided.

+With version 3 of the profiler interface network activity is no longer required to do intra-node profiling.
 Kernel events instrumentation leverages counters exposed by the kernel to the host and the proxy progress
 thread. Thus, the proxy progress thread infrastructure is shared between the network and the profiler. If
 the proxy is serving network requests the kernel profiling probing can be delayed, causing loss of
 accuracy. Similarly, if the CPU is under heavy load and the scheduling of the proxy progress thread is
-delayed, a similar loss of accuracy can be encountered. Keep this in mind when using kernel events.
+delayed, a similar loss of accuracy can be encountered.
+
+To mitigate this effect, with version 4 of the profiler NCCL uses a per-channel ring buffer of 64 elements.
+Every counter is complemented by two timestamps (ptimers) supplied by the NCCL kernel (one for start and one
+for stop of the operation in the kernel). NCCL propagates these timestamps to the profiler plugin that it can
+convert them to CPU time domain.
