Commit 8a2e3d8

Revise Code - Update NCCL Example (#22)
1 parent a5df795 commit 8a2e3d8

8 files changed, 2566 additions and 10 deletions

README.md

Lines changed: 4 additions & 0 deletions
@@ -10,6 +10,10 @@ Below is an example of NPKit timeline result. Green blocks are LL128 data transf
 
 ![NPKit Result Example](./npkit_result_example.png)
 
+## Quick Start
+
+Please check `nccl_samples` for the NCCL quick start, `rccl_samples` for the RCCL quick start, and `msccl_samples` for the MSCCL quick start.
+
 ## Build
 
 NPKit is a series of patches for specific versions of NCCL/RCCL/MSCCL. Users need to apply these patches to the correct NCCL/RCCL/MSCCL version and build NCCL/RCCL/MSCCL with the expected profiling events specified. In this section, we take NCCL 2.10.3-1, RCCL develop branch commit 4643a17 and MSCCL master branch commit e52c525 as examples. Assume we want to jointly profile LL128 data transfer time on the GPU and net send/recv time on the CPU:

msccl_samples/npkit_runner.sh

Lines changed: 2 additions & 3 deletions
@@ -9,10 +9,9 @@ set -x
 # <npkit_dump_dir> <npkit_result_dir>
 function msccl_test() {
   mpirun --allow-run-as-root \
-    -np 8 -host localhost:8 \
+    -map-by ppr:8:node --bind-to numa \
     -x LD_PRELOAD=$2/build/lib/libnccl.so.2:$LD_PRELOAD \
-    -x NCCL_DEBUG=INFO \
-    -x NCCL_DEBUG_SUBSYS=INIT,GRAPH \
+    -x NCCL_DEBUG=WARN \
     -x NCCL_ALGO=$4 \
     -x NCCL_PROTO=$5 \
     -x NPKIT_DUMP_DIR=$8 \

nccl_samples/README.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+## Introduction
+
+This folder contains scripts for the NPKit sample workflow for NCCL. The sample workflow first builds NCCL with NPKit enabled, then runs nccl-tests to collect NPKit event dump files, and finally generates the NPKit trace file.
+
+## Dependencies
+
+[NCCL 2.17.1-1](https://github.com/nvidia/nccl/tree/v2.17.1-1) and [nccl-tests](https://github.com/nvidia/nccl-tests).
+
+## Usage
+
+1) Get NCCL version 2.17.1-1 and apply `npkit-for-nccl-2.17.1-1.diff` to the source repo.
+
+2) Make sure the parameters in `npkit_launcher.sh` are valid. Note that NPKit currently supports collecting only non-overlapped events on the GPU, and `NPKIT_FLAGS` should follow this rule.
+
+3) Make sure the `nccl_test` function in `npkit_runner.sh` is a valid command to run the `nccl-tests` binary. Note that NPKit currently supports only 1 GPU per process, so `-g 1` mode is required in `nccl-tests` commands.
+
+4) Run the command `bash npkit_launcher.sh`.
+
+5) The generated trace file `npkit_event_trace.json` (zipped in `npkit_result.tar.gz`) is in [Google Trace Event Format](https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview) and can be viewed in trace viewers.
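The usage steps in the new `nccl_samples/README.md` could be sketched as a small bash helper. This is an illustrative sketch, not part of the commit: the function names `run` and `npkit_nccl_workflow`, the `DRY_RUN` switch, the default paths, and the location of `npkit_result.tar.gz` are all assumptions (the result directory is actually configured in `npkit_launcher.sh`). Steps 2 and 3 stay manual, because they require checking `npkit_launcher.sh` and `npkit_runner.sh` by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper sketching the README usage steps; not shipped with NPKit.

# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Arguments (all assumed, adjust to your layout):
#   $1  directory to clone NCCL into
#   $2  absolute path to npkit-for-nccl-2.17.1-1.diff
#   $3  directory where npkit_launcher.sh writes npkit_result.tar.gz
# Call this from the nccl_samples directory so npkit_launcher.sh is found.
npkit_nccl_workflow() {
  local nccl_src_dir="${1:-$HOME/nccl}"
  local npkit_patch="${2:-$HOME/NPKit/npkit-for-nccl-2.17.1-1.diff}"
  local result_dir="${3:-$HOME/npkit_result}"

  # Step 1: get NCCL 2.17.1-1 and apply the NPKit patch to it.
  run git clone --branch v2.17.1-1 --depth 1 \
    https://github.com/NVIDIA/nccl.git "$nccl_src_dir" || return 1
  run git -C "$nccl_src_dir" apply "$npkit_patch" || return 1

  # Steps 2-3 are manual: verify the parameters in npkit_launcher.sh
  # and the nccl_test function in npkit_runner.sh before continuing.

  # Step 4: build NCCL with NPKit, run nccl-tests, generate the trace.
  run bash npkit_launcher.sh || return 1

  # Step 5: unpack the archive that contains npkit_event_trace.json.
  run tar -xzf "$result_dir/npkit_result.tar.gz" -C "$result_dir" || return 1
}
```

With `DRY_RUN=1` exported, calling `npkit_nccl_workflow /tmp/nccl /path/to/npkit-for-nccl-2.17.1-1.diff /tmp/npkit_result` only prints the commands, which is a cheap way to review them before running them for real.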
