
Conversation

@YangZhou1997 (Member)

Description

Please include a summary of the changes and the related issue.

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

if ((token_idx - token_start_idx) % kNumDispatchRDMASenderWarps != warp_id)
  continue;
if ((token_idx - token_start_idx) % 2 != warp_id) continue;

@zhenhuang12 (Collaborator) commented on Nov 26, 2025

I set kNumDispatchRDMASenderWarps = 2 here and it works! But I'm still investigating. There are two possible reasons (see the sketch after this list):

  • an error in the RDMA transaction porting
  • limiting the RDMA sender warps to 2 slows the rate at which RDMA commands are committed, which avoids the atomic errors
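
A minimal sketch of the warp striding being discussed, assuming a dispatch loop shaped like the snippet above; the function name and loop body are illustrative, not the actual internode.cu code:

// Illustrative only: with kNumDispatchRDMASenderWarps == 2, tokens are striped
// across just two sender warps, so at most two warps build and commit RDMA
// send commands at any given time.
constexpr int kNumDispatchRDMASenderWarps = 2;

__device__ void dispatch_sender_loop(int token_start_idx, int token_end_idx,
                                     int warp_id) {
  for (int token_idx = token_start_idx; token_idx < token_end_idx; ++token_idx) {
    // Same striding as the snippet above: each sender warp takes every
    // kNumDispatchRDMASenderWarps-th token.
    if ((token_idx - token_start_idx) % kNumDispatchRDMASenderWarps != warp_id)
      continue;
    // ... build and commit the RDMA send command for token_idx (omitted) ...
  }
}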

@YangZhou1997 (Member, Author) commented on Nov 28, 2025

@zhenhuang12 indeed, setting it to 2 makes it work! Nice!

cc @MaoZiming, @CalebZ9909

Collaborator

@YangZhou1997 The changes don't explain the cause of the error, but they do give directions for troubleshooting. I'm planning to do a more in-depth investigation.

@zhenhuang12 (Collaborator)

Hi @MaoZiming @YangZhou1997, I changed the RDMA window of the normal-kernel dispatch to a sequence lock, which temporarily fixes the bug (a rough sketch of the pattern is shown below).
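
A minimal sketch of a sequence-lock over one RDMA window slot, assuming a single writer per slot; the struct layout, names, and fence choices here are assumptions for illustration, not the actual uccl_ep implementation:

// Illustrative seqlock: the writer makes the sequence odd while it updates the
// slot, readers retry if they observe an odd or changed sequence.
struct RdmaWindowSlot {
  unsigned long long seq;  // even = stable, odd = write in progress
  int4 payload[8];         // slot data; size is illustrative
};

__device__ void slot_write(RdmaWindowSlot* slot, int4 const* src, int n) {
  atomicAdd(&slot->seq, 1ULL);  // seq becomes odd: write in progress
  __threadfence_system();
  for (int i = 0; i < n; ++i) slot->payload[i] = src[i];
  __threadfence_system();
  atomicAdd(&slot->seq, 1ULL);  // seq becomes even: slot published
}

__device__ void slot_read(RdmaWindowSlot* slot, int4* dst, int n) {
  unsigned long long before, after;
  do {
    before = atomicAdd(&slot->seq, 0ULL);  // atomic snapshot of the sequence
    __threadfence_system();
    for (int i = 0; i < n; ++i) dst[i] = slot->payload[i];
    __threadfence_system();
    after = atomicAdd(&slot->seq, 0ULL);
  } while ((before & 1ULL) != 0 || before != after);  // retry on a racing write
}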

The performance of EP16 with num_sms=64:
BF16

[tuning] SMs 64, NVL chunk 44, RDMA chunk 24, transmit: 1518.00 us, notify: 168.23 us, BW: 77.05 GB/s (RDMA), 252.32 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 24, transmit: 1524.00 us, notify: 93.31 us, BW: 76.79 GB/s (RDMA), 251.79 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 1529.00 us, notify: 153.89 us, BW: 76.50 GB/s (RDMA), 250.50 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 1537.00 us, notify: 81.10 us, BW: 76.14 GB/s (RDMA), 249.66 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 32, transmit: 1576.00 us, notify: 176.63 us, BW: 74.22 GB/s (RDMA), 243.03 GB/s (NVL) 
[tuning] Best dispatch (BF16): SMs 64, NVL chunk 8, RDMA chunk 4, transmit: 1443.00 us, notify: 174.91 us, BW: 81.06 GB/s (RDMA), 265.43 GB/s (NVL)

FP8

[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 948.76 us, notify: 342.82 us, BW: 63.57 GB/s (RDMA), 208.16 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 1149.00 us, notify: 79.49 us, BW: 52.52 GB/s (RDMA), 172.20 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 32, transmit: 926.40 us, notify: 163.21 us, BW: 65.10 GB/s (RDMA), 213.18 GB/s (NVL) 
[tuning] Best dispatch (FP8): SMs 64, NVL chunk 12, RDMA chunk 12, transmit: 866.26 us, notify: 162.48 us, BW: 69.62 GB/s (RDMA), 227.98 GB/s (NVL)

Copilot AI left a comment

Pull request overview

This PR implements debugging improvements and AMD/ROCm platform compatibility fixes for the EP (Expert Parallelism) module. The changes address platform-specific issues with CUDA synchronization, RDMA operations, and kernel launch configurations for AMD GPUs.

Key changes:

  • Added an AMD-specific sequence-lock mechanism for RDMA operations to replace the previous lock-based synchronization
  • Fixed GPU architecture detection and compilation flags for the ROCm platform
  • Added debugging synchronization and error checking in CUDA operations

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

Summary per file:

  • ep/src/uccl_ep.cc: Added CUDA stream synchronization and error checking for debugging
  • ep/src/rdma.cpp: Enhanced error messages and commented out tail WriteImm logic with added assertions
  • ep/src/proxy.cpp: Removed commented-out debug print statements and whitespace
  • ep/src/internode.cu: Implemented AMD-specific sequential locking for RDMA, fixed data copy size calculation, and added scale copying logic
  • ep/setup.py: Fixed ROCm architecture detection and compilation flag handling
  • ep/include/ep_utils.cuh: Fixed missing return value assignment and removed AMD __syncwarp macro
  • ep/include/ep_launch.cuh: Removed AMD-specific kernel launch configuration macros
  • ep/include/ep_configs.cuh: Added AMD-specific timeout cycle values
  • ep/bench/run_ep.sh: Modified test parameters (reduced num-tokens, added logging)


Comment on lines +925 to +926
assert(aimm.GetValue() == cmd.atomic_val);
assert(aimm.GetOff() == cmd.atomic_offset);

Copilot AI, Dec 1, 2025

These assertions appear to be debugging code that should be removed or converted to proper error handling. Assertions in production code can cause crashes. Consider using runtime checks with error logging instead, or remove if this is temporary debugging code.

Suggested change:

- assert(aimm.GetValue() == cmd.atomic_val);
- assert(aimm.GetOff() == cmd.atomic_offset);
+ if (aimm.GetValue() != cmd.atomic_val) {
+   fprintf(stderr, "Error: AtomicsImm value mismatch: expected %ld, got %ld\n",
+           static_cast<long>(cmd.atomic_val), static_cast<long>(aimm.GetValue()));
+   std::abort();
+ }
+ if (aimm.GetOff() != cmd.atomic_offset) {
+   fprintf(stderr, "Error: AtomicsImm offset mismatch: expected %ld, got %ld\n",
+           static_cast<long>(cmd.atomic_offset), static_cast<long>(aimm.GetOff()));
+   std::abort();
+ }

Comment on lines +1205 to +1208
UNROLLED_WARP_COPY(5, lane_id, num_bytes_per_token / sizeof(int4),
                   reinterpret_cast<int4*>(dst_shifted),
                   reinterpret_cast<int4*>(shifted), ld_nc_global,
                   st_na_global);

Copilot AI, Dec 1, 2025

[nitpick] The magic number '5' in the UNROLLED_WARP_COPY macro is unclear. Consider replacing it with a named constant (e.g., 'UNROLL_FACTOR' or 'NUM_ITERATIONS') to improve code readability and maintainability.
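
For example, one possible shape for this suggestion; the constant name is illustrative, and the call mirrors the snippet above:

// Hypothetical named constant for the unroll factor used in the copy above.
constexpr int kDispatchCopyUnrollFactor = 5;

UNROLLED_WARP_COPY(kDispatchCopyUnrollFactor, lane_id,
                   num_bytes_per_token / sizeof(int4),
                   reinterpret_cast<int4*>(dst_shifted),
                   reinterpret_cast<int4*>(shifted), ld_nc_global,
                   st_na_global);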

    arch_list = list(matches)

else:
    gpu_archs = gpu_archs.split(",")

Copilot AI, Dec 1, 2025

The variable 'arch_list' is only defined inside the 'if not matches' block (line 182) but is used here unconditionally. If 'gpu_archs' is provided via environment variable, 'arch_list' will be undefined, causing a NameError. The logic should use 'gpu_archs' when set, or define 'arch_list' in both branches.

Suggested change:

- gpu_archs = gpu_archs.split(",")
+ arch_list = gpu_archs.split(",")

Comment on lines +198 to +199
cxx_flags.append("-DUSE_GRACE_HOPPER")
nvcc_flags.append("-DUSE_GRACE_HOPPER")

Copilot AI, Dec 1, 2025

The USE_GRACE_HOPPER flag is being added in the AMD/ROCm code path, but Grace Hopper is an NVIDIA architecture. This appears to be a copy-paste error and should likely be removed from the AMD-specific section.

Suggested change:

- cxx_flags.append("-DUSE_GRACE_HOPPER")
- nvcc_flags.append("-DUSE_GRACE_HOPPER")
+ # Removed erroneous Grace Hopper flag for AMD/ROCm
+ # (No action needed)
