
Commit 17247cd

DSL Quick Start (#689)
Fix #675

Co-authored-by: Binyang Li <[email protected]>
1 parent 8b8593b commit 17247cd

File tree

2 files changed: +178 -0 lines changed


docs/dsl_quick_start.md

Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@
# DSL Quick Start

The MSCCL++ DSL (Domain Specific Language) provides a high-level Python API for defining custom collective communication algorithms. This guide will help you get started with writing and testing your own communication patterns.

## Installation

Follow the same steps as in the [Quick Start](quickstart).

After finishing the installation from the quick start guide, run the following command to install some default DSL algorithms:

```bash
python3 -m mscclpp --install
```
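
As an optional sanity check, you can verify that the `mscclpp.language` package (used throughout the examples below) is importable:

```python
# Optional sanity check: the DSL is exposed through the mscclpp.language package.
# If this runs without an ImportError, the DSL is ready to use.
import mscclpp.language  # noqa: F401

print("MSCCL++ DSL is available")
```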

## Your First Algorithm: AllGather

Let's walk through a simple AllGather algorithm to understand the DSL basics. This example demonstrates the key concepts without diving into all the advanced features.

### Complete Example

```python
from mscclpp.language import *

def simple_allgather(name):
    """
    A simple AllGather implementation using the MSCCL++ DSL.

    This example demonstrates a 2-GPU AllGather where each GPU sends
    its data to all other GPUs, so all GPUs end up with everyone's data.

    Args:
        name: Algorithm name for identification
    """
    num_gpus = 2
    chunk_factor = 1  # Split data into num_gpus chunks

    # Define the collective operation
    collective = AllGather(num_gpus, chunk_factor, inplace=True)

    # Create the program context
    with CollectiveProgram(
        name,
        collective,
        num_gpus,
        protocol="Simple",  # Use Simple protocol (vs "LL" for low-latency)
        min_message_size=0,
        max_message_size=2**30  # 1GB
    ):
        # Loop over each source GPU rank
        for src_rank in range(num_gpus):
            # Create a Rank object for the source GPU
            rank = Rank(src_rank)
            # Get the output buffer where the data is stored
            src_buffer = rank.get_output_buffer()
            # Take a slice corresponding to this rank's data
            src_chunk = src_buffer[src_rank:src_rank + 1]

            # Loop over each destination GPU rank
            for dst_rank in range(num_gpus):
                # Skip sending from a rank to itself
                if src_rank != dst_rank:
                    # Create a Rank object for the destination GPU
                    dst_rank_obj = Rank(dst_rank)
                    # Get the destination buffer where data will be sent
                    dst_buffer = dst_rank_obj.get_output_buffer()
                    # Take a slice where the data will be placed
                    dst_chunk = dst_buffer[src_rank:src_rank + 1]

                    # Define a channel from src_rank → dst_rank
                    channel = MemoryChannel(dst_rank, src_rank)

                    # Step 1: Source signals it is ready to send data
                    channel.signal(tb=0, relaxed=True)

                    # Step 2: Wait for destination to be ready
                    channel.wait(tb=0, data_sync=SyncType.after, relaxed=True)

                    # Step 3: Source rank sends data to destination rank
                    channel.put(dst_chunk, src_chunk, tb=0)

                    # Step 4: Signal that put operation is complete
                    channel.signal(tb=0, data_sync=SyncType.before)

                    # Step 5: Wait for acknowledgment
                    channel.wait(tb=0, data_sync=SyncType.after)

        print(JSON())

simple_allgather("simple_allgather_2gpus")
```

### Key Concepts Explained

**1. Collective Definition**
```python
collective = AllGather(num_gpus, chunk_factor=1, inplace=True)
```
- Defines what collective operation to implement (AllGather in this case)
- `chunk_factor` determines the data chunking strategy
- `inplace=True` means input and output use the same buffer. For AllGather, the input buffer is a slice of the output buffer: on rank 0 it is the first half of the output buffer, and on rank 1 it is the second half (see the sketch below).
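To make the in-place layout concrete, here is an illustrative picture for the two-GPU case (a conceptual sketch, not DSL API):

```python
# Conceptual in-place AllGather layout for num_gpus = 2 (illustration only).
# Every rank's output buffer holds num_gpus chunks, indexed 0 and 1.
#
#   rank 0: output = [ chunk_0 | chunk_1 ]   input = output[0:1]  (first half)
#   rank 1: output = [ chunk_0 | chunk_1 ]   input = output[1:2]  (second half)
#
# After the collective completes, chunk_i on every rank holds rank i's data.
```
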
**2. Program Context**
```python
with CollectiveProgram(name, collective, num_gpus, ...):
```
- Sets up the execution environment
- Configures protocol, threading, and message size ranges

**3. Ranks and Buffers**
```python
rank = Rank(src_rank)
src_buffer = rank.get_output_buffer()
src_chunk = src_buffer[src_rank:src_rank + 1]
```
- `Rank` represents a GPU in the collective
- Buffers hold the data being communicated
- Chunks are slices of buffers representing data portions

**4. Channels**
```python
channel = MemoryChannel(dst_rank, src_rank)
```
- Establishes communication paths between GPUs
- `MemoryChannel` for intra-node (fast, direct memory access)
- Created for each source-destination pair
- Can also use `PortChannel` for inter-node communication (see the sketch below)
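For multi-node algorithms, the same pattern can be expressed with a port-based channel. A minimal sketch, which assumes `PortChannel` takes the same `(dst_rank, src_rank)` arguments as `MemoryChannel`; check the DSL API reference for the exact signature:

```python
# Hypothetical sketch: an inter-node channel between the same pair of ranks.
# Assumes PortChannel accepts (dst_rank, src_rank) like MemoryChannel does;
# consult the DSL API reference for the exact constructor signature.
channel = PortChannel(dst_rank, src_rank)
# The surrounding signal/wait/put pattern from the example stays the same.
```
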
**5. Synchronization and Data Transfer**
```python
channel.signal(tb=0, relaxed=True)
channel.wait(tb=0, data_sync=SyncType.after, relaxed=True)
channel.put(dst_chunk, src_chunk, tb=0)
```
- `signal()`: Notify remote GPU of state changes
- `wait()`: Wait for remote GPU to reach a certain state
- `put()`: Write data from local to remote GPU memory
- `tb=0` assigns operations to thread block 0
- `relaxed=True` uses relaxed memory ordering for performance

For more advanced concepts like synchronization, scratch buffers, and pipelining, refer to the [full DSL documentation](py_api).

## Testing Your Algorithm

Once you've written your algorithm, run the script and save its JSON output:

```bash
python3 path/to/simple_allgather.py > /path/to/simple_allgather.json
```

Then use `executor_test.py` to validate correctness and measure performance:

```bash
# Test with 2 GPUs on a single node
mpirun --allow-run-as-root -np 2 python3 python/test/executor_test.py \
    -path /path/to/simple_allgather.json \
    --size 1M \
    --in_place
```

## Next Steps

Now that you understand the basics:

1. **Explore Examples**: Check `python/mscclpp/language/tests/` for more algorithm examples
2. **Optimize**: Experiment with different chunk strategies, pipelining, and synchronization patterns
3. **Advanced Features**: Learn about scratch buffers, thread block groups, and packet-based communication

For detailed API documentation and advanced features, refer to:
- [Programming Guide](programming_guide)
- [Tutorials](tutorials)

## Troubleshooting

**Import Error**: If you see `ModuleNotFoundError: No module named 'mscclpp'`, ensure you've installed the package with `pip install .`

For more help, please file an issue on the [GitHub repository](https://github.com/microsoft/mscclpp/issues).

docs/index.rst

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,7 @@ You can find the followings from this documentation.

- **Overview:** An overview of MSCCL++ and its features. :doc:`🔗 <overview>`
- **Quick Start:** A guide to build, install, and run MSCCL++. :doc:`🔗 <quickstart>`
+- **DSL Quick Start:** A guide to get started with the MSCCL++ DSL for defining custom algorithms. :doc:`🔗 <dsl_quick_start>`
- **Tutorials:** A step-by-step guide for GPU communication using MSCCL++. :doc:`🔗 <tutorials>`
- **Programming Guide:** Advanced topics and best practices for using MSCCL++. :doc:`🔗 <programming_guide>`
- **C++ API Reference:** Detailed documentation of the MSCCL++ C++ API. :doc:`🔗 <cpp_api>`

@@ -21,6 +22,7 @@ You can find the followings from this documentation.

overview
quickstart
+dsl_quick_start
tutorials
programming_guide
cpp_api
