Integrate MSCCL++ DSL to torch workload #620
Merged
Changes from all commits · 133 commits
# MSCCL++ DSL Integration Guide

The MSCCL++ DSL (domain-specific language) lets you express collective algorithms concisely as Python functions. MSCCL++ provides Pythonic utilities to author, JIT-compile, register, and select execution plans. This guide walks through two integration paths: a customized MSCCL++ communicator, and NCCL interposition that accelerates existing PyTorch `backend="nccl"` workloads.

## Initial Setup

Run the following from the repository root after completing the basic project setup:

1. Install Python dependencies.
   ```bash
   pip install -r ./python/<requirements_file>
   ```
   Replace `<requirements_file>` with the file that matches your environment (e.g., `requirements_cuda11.txt`, `requirements_cuda12.txt`, or `requirements_rocm6.txt`).

2. Install the module and generate the default algorithm plans.
   ```bash
   pip install . && python3 -m mscclpp --install
   ```

## Integration Options

MSCCL++ DSL integrates into your training or inference workload in two ways:

1. **Custom MSCCL++ Communicator** — directly manage an MSCCL++ communicator and launch collectives with the MSCCL++ executor.
2. **NCCL Interposition** — keep using `backend="nccl"`; MSCCL++ intercepts NCCL calls at runtime for drop-in acceleration.

Both paths follow the same high-level flow:

1. Author (or reuse) a collective algorithm with the MSCCL++ DSL.
2. Compile it into an execution plan.
3. Register the plan with the MSCCL++ runtime.
4. Configure a selector to choose the plan for each collective call.

Below we show an AllReduce example and then detail each integration option.

### Example: AllReduce in the MSCCL++ DSL

The snippet below defines an AllReduce that uses NVLS for an intra-node reduce-scatter followed by a broadcast.

```python
# The DSL symbols used below (CollectiveProgram, SwitchChannel, MemoryChannel,
# Rank, BufferType, SyncType) are provided by the mscclpp Python package.
def allreduce_nvls(spec: mscclpp.AlgoSpec) -> CollectiveProgram:
    gpu_size = spec.world_size
    with CollectiveProgram(
        spec.name,
        spec.collective,
        gpu_size,
        instances=8,
        protocol=spec.protocol,
        num_threads_per_block=spec.num_threads_per_block,
        min_message_size=spec.min_message_size,
        max_message_size=spec.max_message_size,
    ) as program:
        # Create the channels
        nvls_chan = SwitchChannel(rank_list=[gpu for gpu in range(gpu_size)], buffer_type=BufferType.input)
        channels = {}
        for gpu in range(gpu_size):
            for peer in range(gpu_size):
                if peer != gpu:
                    channels[(peer, gpu)] = MemoryChannel(peer, gpu)

        # Synchronize to ensure all GPUs are ready
        for gpu in range(gpu_size):
            src_rank = gpu
            for peer in range(gpu_size):
                if peer != src_rank:
                    dst_rank = peer
                    channels[(dst_rank, src_rank)].signal(tb=0, relaxed=True)
            for peer in range(gpu_size):
                if peer != src_rank:
                    dst_rank = peer
                    channels[(dst_rank, src_rank)].wait(tb=0, relaxed=True, data_sync=SyncType.after)
        # Reduce and store the data
        for gpu in range(gpu_size):
            buffer_offset = gpu
            rank = Rank(gpu)
            input_buffer = rank.get_input_buffer()
            nvls_chan.at_rank(gpu).reduce(
                buffer_offset=buffer_offset, size=1, dst_chunk=input_buffer[gpu : gpu + 1], tb=0
            )
            nvls_chan.at_rank(gpu).broadcast(
                src_chunk=input_buffer[gpu : gpu + 1], buffer_offset=buffer_offset, size=1, tb=0
            )
        # Synchronize to ensure all GPUs have finished
        for gpu in range(gpu_size):
            src_rank = gpu
            for peer in range(gpu_size):
                if peer != src_rank:
                    dst_rank = peer
                    channels[(dst_rank, src_rank)].signal(tb=0, relaxed=True, data_sync=SyncType.before)
            for peer in range(gpu_size):
                if peer != src_rank:
                    dst_rank = peer
                    channels[(dst_rank, src_rank)].wait(tb=0, relaxed=True)

    return program
```
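To make the compile/register/select flow concrete, here is a toy, framework-free sketch of the plan-registry pattern. The `Plan` and `PlanRegistry` names are hypothetical illustrations, not the real mscclpp API; the actual runtime registers plans produced by compiling a `CollectiveProgram` and selects among them per collective call.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Plan:
    """Hypothetical stand-in for a compiled execution plan."""
    name: str
    collective: str
    min_message_size: int
    max_message_size: int

class PlanRegistry:
    """Toy registry: register plans, then select one per collective call."""

    def __init__(self) -> None:
        self._plans: Dict[str, List[Plan]] = {}

    def register(self, plan: Plan) -> None:
        self._plans.setdefault(plan.collective, []).append(plan)

    def select(self, collective: str, message_size: int) -> Optional[Plan]:
        # Pick the first registered plan whose size window covers the message;
        # returning None means the caller falls back to a default implementation.
        for plan in self._plans.get(collective, []):
            if plan.min_message_size <= message_size <= plan.max_message_size:
                return plan
        return None

registry = PlanRegistry()
registry.register(Plan("allreduce_nvls", "allreduce", 0, 1 << 20))
registry.register(Plan("allreduce_ring", "allreduce", (1 << 20) + 1, 1 << 40))

small = registry.select("allreduce", 4096)     # small message -> NVLS plan
large = registry.select("allreduce", 1 << 30)  # large message -> ring plan
```

The message-size window mirrors the `min_message_size`/`max_message_size` arguments of `CollectiveProgram` above: a plan only applies to calls whose payload falls inside its declared range.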
### Integrate with the MSCCL++ customized communicator

Use this path when you want a PyTorch-compatible interface with fine-grained control. You manage the communicator, compile and register DSL plans, and invoke collectives through a thin wrapper. The example below shows an AllReduce built on the MSCCL++ communicator and executor.

Example source directory:
```
examples/torch-integration
```
Key file: `customized_comm.py`.

#### Launch (single node)
```bash
MSCCLPP_MASTER_ADDR=<master_ip> MSCCLPP_MASTER_PORT=<port> torchrun --nnodes=1 --nproc_per_node=8 customized_comm.py
```
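As a minimal sketch of what the launch command provides to each process: `MSCCLPP_MASTER_ADDR`/`MSCCLPP_MASTER_PORT` come from the command line, while `torchrun` sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` per worker. The `rendezvous_info` helper below is hypothetical, showing only how a wrapper script might read these variables.

```python
import os

def rendezvous_info(env=os.environ):
    """Collect the rendezvous settings visible to one launched process."""
    return {
        "master_addr": env["MSCCLPP_MASTER_ADDR"],     # from the launch command
        "master_port": int(env["MSCCLPP_MASTER_PORT"]),
        "rank": int(env.get("RANK", 0)),               # set by torchrun
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
    }

# Example: the values rank 3 would see under
# torchrun --nnodes=1 --nproc_per_node=8 (hypothetical address/port).
info = rendezvous_info({
    "MSCCLPP_MASTER_ADDR": "10.0.0.1",
    "MSCCLPP_MASTER_PORT": "50000",
    "RANK": "3",
    "WORLD_SIZE": "8",
    "LOCAL_RANK": "3",
})
```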
### Integrate via NCCL Interposition

Keep your script as-is: initialize PyTorch with `backend="nccl"`, and MSCCL++ intercepts the NCCL calls for drop-in acceleration.

Example source directory:
```
examples/torch-integration
```
Key file: `dsl_with_nccl_api.py`.

#### Launch with interposition

To run with NCCL interposition, preload the MSCCL++ shim so it transparently intercepts NCCL calls made by PyTorch's `nccl` backend.
```bash
LD_PRELOAD=<MSCCLPP_REPO>/build/apps/nccl/libmscclpp_nccl.so torchrun --nnodes=1 --nproc_per_node=8 dsl_with_nccl_api.py
```

## Notices

- When using NCCL interposition, the algorithm selection order is:
  1. Check for registered DSL plans matching the collective call.
  2. Check for a customized kernel implementation if no DSL plan fits.
  3. Fall back to the default NCCL implementation (set `MSCCLPP_NCCL_LIB_PATH` to the original NCCL library).
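The selection order above amounts to a simple fallback chain. The sketch below is illustrative only, not the interposition layer's actual code; it just models "first available wins" across the three tiers.

```python
def choose_implementation(dsl_plan, custom_kernel, nccl_fallback):
    """Return the first available implementation, mirroring the order:
    registered DSL plan -> customized kernel -> default NCCL library."""
    for impl in (dsl_plan, custom_kernel, nccl_fallback):
        if impl is not None:
            return impl
    raise RuntimeError("no implementation available")

# A matching DSL plan wins over everything else.
assert choose_implementation("dsl", "kernel", "nccl") == "dsl"
# Without a DSL plan, the customized kernel is used.
assert choose_implementation(None, "kernel", "nccl") == "kernel"
# Otherwise fall back to the NCCL library named by MSCCLPP_NCCL_LIB_PATH.
assert choose_implementation(None, None, "nccl") == "nccl"
```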