[EP] Yang debugging amd normal #548
base: main
Conversation
…to zm-amd-port-debug
ep/src/internode.cu (Outdated)
```diff
-    if ((token_idx - token_start_idx) % kNumDispatchRDMASenderWarps !=
-        warp_id)
-      continue;
+    if ((token_idx - token_start_idx) % 2 != warp_id) continue;
```
I set kNumDispatchRDMASenderWarps = 2 here, and it works! But I'm still investigating. There are two possible causes:
- an RDMA transaction porting error
- the RDMA sender warp count: setting it to 2 limits the rate of RDMA command commits, which avoids the atomic errors.
@zhenhuang12 indeed, setting to 2 makes it work! Nice!
@YangZhou1997 The changes don't explain the cause of the error, but they give us directions for troubleshooting. I'm planning to do a more in-depth investigation.
Hi, @MaoZiming @YangZhou1997, I changed the RDMA window of the normal-kernel dispatch to a sequence lock, which temporarily fixes the bug. EP16 performance:

BF16:
```
[tuning] SMs 64, NVL chunk 44, RDMA chunk 24, transmit: 1518.00 us, notify: 168.23 us, BW: 77.05 GB/s (RDMA), 252.32 GB/s (NVL)
[tuning] SMs 64, NVL chunk 44, RDMA chunk 24, transmit: 1524.00 us, notify: 93.31 us, BW: 76.79 GB/s (RDMA), 251.79 GB/s (NVL)
[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 1529.00 us, notify: 153.89 us, BW: 76.50 GB/s (RDMA), 250.50 GB/s (NVL)
[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 1537.00 us, notify: 81.10 us, BW: 76.14 GB/s (RDMA), 249.66 GB/s (NVL)
[tuning] SMs 64, NVL chunk 44, RDMA chunk 32, transmit: 1576.00 us, notify: 176.63 us, BW: 74.22 GB/s (RDMA), 243.03 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 64, NVL chunk 8, RDMA chunk 4, transmit: 1443.00 us, notify: 174.91 us, BW: 81.06 GB/s (RDMA), 265.43 GB/s (NVL)
```

FP8:
```
[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 948.76 us, notify: 342.82 us, BW: 63.57 GB/s (RDMA), 208.16 GB/s (NVL)
[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 1149.00 us, notify: 79.49 us, BW: 52.52 GB/s (RDMA), 172.20 GB/s (NVL)
[tuning] SMs 64, NVL chunk 44, RDMA chunk 32, transmit: 926.40 us, notify: 163.21 us, BW: 65.10 GB/s (RDMA), 213.18 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 64, NVL chunk 12, RDMA chunk 12, transmit: 866.26 us, notify: 162.48 us, BW: 69.62 GB/s (RDMA), 227.98 GB/s (NVL)
```
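The sequence-lock change itself isn't shown in this excerpt. As a rough illustration of the idea only, here is a host-side C++ sketch (all names, and the use of `std::atomic`/`std::thread` rather than device code, are assumptions, not the actual kernel change): each sender spin-waits until a shared counter reaches its own ticket, commits, then passes the turn on, so commits are strictly serialized no matter how threads are scheduled.

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> g_seq{0};  // shared sequence counter: "whose turn is it"

// Spin until the counter reaches this sender's ticket, do the commit
// (push_back stands in for posting to the RDMA window), then release the
// next ticket. Only one thread is ever inside the commit section at a time.
void commit_in_order(int ticket, std::vector<int>* log) {
  while (g_seq.load(std::memory_order_acquire) != ticket) { /* spin */ }
  log->push_back(ticket);  // serialized: no concurrent access despite races
  g_seq.store(ticket + 1, std::memory_order_release);
}

// Launch n senders in reverse order to show scheduling doesn't matter.
std::vector<int> run(int n) {
  g_seq.store(0);
  std::vector<int> log;
  std::vector<std::thread> ts;
  for (int i = n - 1; i >= 0; --i) ts.emplace_back(commit_in_order, i, &log);
  for (auto& t : ts) t.join();
  return log;  // always 0, 1, ..., n-1
}
```

This also matches the earlier hypothesis that rate-limiting RDMA command commits avoids the atomic errors: a sequence lock is the extreme case where at most one commit is in flight.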
Pull request overview
This PR implements debugging improvements and AMD/ROCm platform compatibility fixes for the EP (Expert Parallelism) module. The changes address platform-specific issues with CUDA synchronization, RDMA operations, and kernel launch configurations for AMD GPUs.
Key changes:
- Added AMD-specific sequential locking mechanism for RDMA operations to replace lock-based synchronization
- Fixed GPU architecture detection and compilation flags for ROCm platform
- Added debugging synchronization and error checking in CUDA operations
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| ep/src/uccl_ep.cc | Added CUDA stream synchronization and error checking for debugging |
| ep/src/rdma.cpp | Enhanced error messages and commented out tail WriteImm logic with added assertions |
| ep/src/proxy.cpp | Removed commented-out debug print statements and whitespace |
| ep/src/internode.cu | Implemented AMD-specific sequential locking for RDMA, fixed data copy size calculation, and added scale copying logic |
| ep/setup.py | Fixed ROCm architecture detection and compilation flag handling |
| ep/include/ep_utils.cuh | Fixed missing return value assignment and removed AMD __syncwarp macro |
| ep/include/ep_launch.cuh | Removed AMD-specific kernel launch configuration macros |
| ep/include/ep_configs.cuh | Added AMD-specific timeout cycle values |
| ep/bench/run_ep.sh | Modified test parameters (reduced num-tokens, added logging) |
```cpp
assert(aimm.GetValue() == cmd.atomic_val);
assert(aimm.GetOff() == cmd.atomic_offset);
```

Copilot AI, Dec 1, 2025
These assertions appear to be debugging code that should be removed or converted to proper error handling. Assertions in production code can cause crashes. Consider using runtime checks with error logging instead, or remove if this is temporary debugging code.
Suggested change:

```diff
-assert(aimm.GetValue() == cmd.atomic_val);
-assert(aimm.GetOff() == cmd.atomic_offset);
+if (aimm.GetValue() != cmd.atomic_val) {
+  fprintf(stderr, "Error: AtomicsImm value mismatch: expected %ld, got %ld\n",
+          static_cast<long>(cmd.atomic_val), static_cast<long>(aimm.GetValue()));
+  std::abort();
+}
+if (aimm.GetOff() != cmd.atomic_offset) {
+  fprintf(stderr, "Error: AtomicsImm offset mismatch: expected %ld, got %ld\n",
+          static_cast<long>(cmd.atomic_offset), static_cast<long>(aimm.GetOff()));
+  std::abort();
+}
```
```cpp
UNROLLED_WARP_COPY(5, lane_id, num_bytes_per_token / sizeof(int4),
                   reinterpret_cast<int4*>(dst_shifted),
                   reinterpret_cast<int4*>(shifted), ld_nc_global,
                   st_na_global);
```

Copilot AI, Dec 1, 2025
[nitpick] The magic number '5' in the UNROLLED_WARP_COPY macro is unclear. Consider replacing it with a named constant (e.g., 'UNROLL_FACTOR' or 'NUM_ITERATIONS') to improve code readability and maintainability.
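As a purely illustrative sketch of the suggestion (the constant name and helper function here are hypothetical; the real `UNROLLED_WARP_COPY` macro is defined in the EP sources), a named constant makes the fixed inner trip count self-documenting:

```cpp
// Hypothetical name for the bare "5" carried over from the diff above.
constexpr int kWarpCopyUnroll = 5;

// A copy loop whose inner loop has a compile-time trip count, so the
// compiler can fully unroll it; a scalar tail handles the remainder.
void unrolled_copy(const int* src, int* dst, int n) {
  int i = 0;
  for (; i + kWarpCopyUnroll <= n; i += kWarpCopyUnroll)
    for (int j = 0; j < kWarpCopyUnroll; ++j)  // unrolled by the compiler
      dst[i + j] = src[i + j];
  for (; i < n; ++i) dst[i] = src[i];  // remainder elements
}
```

A call site then reads `UNROLLED_WARP_COPY(kWarpCopyUnroll, ...)` instead of a bare literal, and changing the factor happens in one place.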
```python
        arch_list = list(matches)
...
    else:
        gpu_archs = gpu_archs.split(",")
```

Copilot AI, Dec 1, 2025
The variable 'arch_list' is only defined inside the 'if not matches' block (line 182) but is used here unconditionally. If 'gpu_archs' is provided via environment variable, 'arch_list' will be undefined, causing a NameError. The logic should use 'gpu_archs' when set, or define 'arch_list' in both branches.
Suggested change:

```diff
-        gpu_archs = gpu_archs.split(",")
+        arch_list = gpu_archs.split(",")
```
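To make the fix concrete, here is a small Python sketch of the corrected branch logic, where both paths define `arch_list`. The function name, regex, and dedup step are illustrative assumptions, not the actual setup.py code:

```python
import re


def resolve_arch_list(gpu_archs, compiler_output):
    """Both branches define arch_list, so no NameError is possible."""
    if gpu_archs:  # e.g. "gfx942,gfx90a" supplied via environment variable
        arch_list = gpu_archs.split(",")
    else:
        # Fall back to architectures detected in compiler/tool output.
        matches = re.findall(r"gfx\w+", compiler_output)
        arch_list = list(dict.fromkeys(matches))  # dedupe, keep order
    return arch_list


# Usage: the environment-provided list takes priority over detection.
print(resolve_arch_list("gfx942,gfx90a", ""))          # ['gfx942', 'gfx90a']
print(resolve_arch_list(None, "arch: gfx90a gfx90a"))  # ['gfx90a']
```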
```python
cxx_flags.append("-DUSE_GRACE_HOPPER")
nvcc_flags.append("-DUSE_GRACE_HOPPER")
```

Copilot AI, Dec 1, 2025
The USE_GRACE_HOPPER flag is being added in the AMD/ROCm code path, but Grace Hopper is an NVIDIA architecture. This appears to be a copy-paste error and should likely be removed from the AMD-specific section.
Suggested change:

```diff
-cxx_flags.append("-DUSE_GRACE_HOPPER")
-nvcc_flags.append("-DUSE_GRACE_HOPPER")
+# Removed erroneous Grace Hopper flag for AMD/ROCm (no action needed)
```
Description
Please include a summary of the changes and the related issue.
Fixes # (issue)
Type of Change
How Has This Been Tested?
Include any tests here.
Checklist
- Run format.sh.
- Run build_and_install.sh to verify compilation.