Merged
Binary file added docs/figs/zero_copy_offset_diagram.png
18 changes: 18 additions & 0 deletions docs/guide/mscclpp-dsl.md
@@ -244,5 +244,23 @@ The following picture shows the overall workflow for running with MSCCL++ DSL:
Overall workflow for running with MSCCL++ DSL
```

## Algorithm Details
In MSCCL++, the executor does not communicate or synchronize memory offsets between ranks. For zero-copy algorithms, every rank must therefore place its input and output buffers at the same offsets from the base address of its registered memory region. Non-zero-copy algorithms, where communication occurs exclusively through a scratch buffer, have no such requirement: data exchanged between remote machines passes only through the scratch buffer, so differences in input or output buffer offsets across ranks do not affect correctness.

More concretely:
- The input buffer offset (the distance from the base of the memory region to where the input data begins) must be the same on all ranks, and likewise for the output buffer offset.
- The input offset and the output offset do not have to equal each other (i.e., the input and output regions can be located at different positions, as long as these positions are consistent across ranks).

If different ranks allocate their input or output buffers at different offsets, the executor will not be able to correctly interpret the shared memory layout, which will likely lead to incorrect behavior or runtime errors.
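The two rules above can be illustrated with a small check. This is a minimal sketch, not part of the MSCCL++ API: the `RankBuffers` type, its field names, and the addresses are hypothetical, and it assumes each rank can report the base of its memory region together with its input and output pointers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankBuffers:
    """Addresses reported by one rank (hypothetical, illustrative values)."""
    base_ptr: int    # start of the registered memory region
    input_ptr: int   # start of the input buffer
    output_ptr: int  # start of the output buffer


def offsets(buf: RankBuffers) -> tuple[int, int]:
    """Return (input offset, output offset) relative to the region base."""
    return buf.input_ptr - buf.base_ptr, buf.output_ptr - buf.base_ptr


def offsets_consistent(ranks: list[RankBuffers]) -> bool:
    """True if every rank has the same (input, output) offset pair."""
    return len({offsets(r) for r in ranks}) <= 1


# Different base addresses, same offsets: valid for zero-copy.
consistent = [
    RankBuffers(0x1000, 0x1100, 0x1800),
    RankBuffers(0x9000, 0x9100, 0x9800),
]

# Rank 1 places its input buffer 0x200 past the base instead of 0x100: invalid.
inconsistent = [
    RankBuffers(0x1000, 0x1100, 0x1800),
    RankBuffers(0x9000, 0x9200, 0x9800),
]
```

Note that in the consistent case the input offset (0x100) and the output offset (0x800) differ from each other, which is allowed; only the per-buffer offsets have to match across ranks.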

```{figure} ../figs/zero_copy_offset_diagram.png
:name: diagram zero copy offset executor
:alt: diagram zero copy offset executor
:align: center

Input/Output Offset Consistency Across Ranks
```

As shown in the figure, each channel stores only the base address of the registered memory region (RegMem) and assumes that the buffer pointer (e.g., sendbuff) is at the same offset (DIFF) from the base pointer (SrcBasePtr) across all ranks. If any rank uses a different offset, remote addresses will misalign, causing data corruption or runtime errors. This design removes the need for offset synchronization, preserving zero-copy efficiency and minimizing setup overhead.
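The address arithmetic described here can be modeled in a few lines. The sketch below is a simplified model under stated assumptions, not the executor's implementation: remote memory is simulated with a `bytearray`, and `remote_write` is a hypothetical helper that applies the sender's local offset (DIFF) to the start of the remote region.

```python
def remote_write(remote_region: bytearray, src_base_ptr: int,
                 sendbuff: int, payload: bytes) -> int:
    """Write payload into the remote region at the sender's local DIFF.

    Only the remote base (index 0 of remote_region) is known; the sender
    assumes the remote buffer sits at the same DIFF as its own buffer.
    """
    diff = sendbuff - src_base_ptr
    remote_region[diff:diff + len(payload)] = payload
    return diff


def demo(sender_diff: int, receiver_diff: int) -> bytes:
    """Return what the receiver reads from its own buffer after the write."""
    region = bytearray(16)       # receiver's registered memory region
    src_base_ptr = 0x1000        # illustrative sender base address
    remote_write(region, src_base_ptr, src_base_ptr + sender_diff, b"abcd")
    return bytes(region[receiver_diff:receiver_diff + 4])
```

With matching offsets, `demo(4, 4)` returns the payload. With mismatched offsets, for example `demo(8, 4)`, the data lands 4 bytes past the receiver's buffer and the receiver reads untouched memory, which is the misalignment the figure warns about.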

## All2All support
Currently, the DSL only supports the static all2all algorithm. To support all2allv, we need to obtain the send/receive sizes at runtime. This may require using placeholders in the JSON execution plan, which would be replaced with the actual sizes during execution. If we can make the chunk size variable, the same approach could be used to support all2allv.
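One way to picture the placeholder idea is the following sketch. Everything in it is an assumption made for illustration: the `ops`, `name`, `peer`, and `size` fields and the `$name` placeholder syntax are invented here and do not reflect the actual MSCCL++ execution plan schema.

```python
import json


def resolve_placeholders(node, sizes):
    """Recursively replace "$name" strings with runtime values from sizes."""
    if isinstance(node, dict):
        return {key: resolve_placeholders(value, sizes) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_placeholders(value, sizes) for value in node]
    if isinstance(node, str) and node.startswith("$"):
        return sizes[node[1:]]  # raises KeyError if a size was not provided
    return node


# Hypothetical plan fragment with sizes left open until execution time.
plan_text = """
{
  "ops": [
    {"name": "put",  "peer": 1, "size": "$send_size_1"},
    {"name": "wait", "peer": 1, "size": "$recv_size_1"}
  ]
}
"""

# Sizes that would only be known at runtime in an all2allv call.
runtime_sizes = {"send_size_1": 4096, "recv_size_1": 1024}
resolved_plan = resolve_placeholders(json.loads(plan_text), runtime_sizes)
```

After substitution, `resolved_plan` contains only concrete sizes, so the same plan file could be reused across calls with different send/receive counts.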