
Conversation

@YangZhou1997 (Member)

Description

Please include a summary of the changes and the related issue.

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

if ((token_idx - token_start_idx) % kNumDispatchRDMASenderWarps != warp_id)
  continue;
if ((token_idx - token_start_idx) % 2 != warp_id) continue;

@zhenhuang12 (Collaborator) commented on Nov 26, 2025

I set kNumDispatchRDMASenderWarps = 2 here and it works! But I'm still investigating. There are two possible reasons (see the sketch after this list):

  • an error in the RDMA transaction porting
  • limiting the RDMA sender warps to 2 slows the rate at which RDMA commands are committed, which avoids the atomic errors
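
A minimal sketch of the warp striding being discussed, assuming a dispatch loop shaped like the snippet above; the function name and loop body are illustrative, not the actual internode.cu code:

// Illustrative only: with kNumDispatchRDMASenderWarps == 2, tokens are striped
// across just two sender warps, so at most two warps build and commit RDMA
// send commands at any given time.
constexpr int kNumDispatchRDMASenderWarps = 2;

__device__ void dispatch_sender_loop(int token_start_idx, int token_end_idx,
                                     int warp_id) {
  for (int token_idx = token_start_idx; token_idx < token_end_idx; ++token_idx) {
    // Same striding as the snippet above: each sender warp takes every
    // kNumDispatchRDMASenderWarps-th token.
    if ((token_idx - token_start_idx) % kNumDispatchRDMASenderWarps != warp_id)
      continue;
    // ... build and commit the RDMA send command for token_idx (omitted) ...
  }
}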

@YangZhou1997 (Member, Author) commented on Nov 28, 2025

@zhenhuang12 indeed, setting it to 2 makes it work! Nice!

cc @MaoZiming, @CalebZ9909

Collaborator

@YangZhou1997 The changes don't explain the cause of the error, but they do give directions for troubleshooting. I'm planning to do a more in-depth investigation.

@zhenhuang12 (Collaborator)

Hi @MaoZiming @YangZhou1997, I changed the RDMA window of the normal-kernel dispatch to a sequence lock, which temporarily fixes the bug (a rough sketch of the pattern is shown below).
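
A minimal sketch of a sequence-lock over one RDMA window slot, assuming a single writer per slot; the struct layout, names, and fence choices here are assumptions for illustration, not the actual uccl_ep implementation:

// Illustrative seqlock: the writer makes the sequence odd while it updates the
// slot, readers retry if they observe an odd or changed sequence.
struct RdmaWindowSlot {
  unsigned long long seq;  // even = stable, odd = write in progress
  int4 payload[8];         // slot data; size is illustrative
};

__device__ void slot_write(RdmaWindowSlot* slot, int4 const* src, int n) {
  atomicAdd(&slot->seq, 1ULL);  // seq becomes odd: write in progress
  __threadfence_system();
  for (int i = 0; i < n; ++i) slot->payload[i] = src[i];
  __threadfence_system();
  atomicAdd(&slot->seq, 1ULL);  // seq becomes even: slot published
}

__device__ void slot_read(RdmaWindowSlot* slot, int4* dst, int n) {
  unsigned long long before, after;
  do {
    before = atomicAdd(&slot->seq, 0ULL);  // atomic snapshot of the sequence
    __threadfence_system();
    for (int i = 0; i < n; ++i) dst[i] = slot->payload[i];
    __threadfence_system();
    after = atomicAdd(&slot->seq, 0ULL);
  } while ((before & 1ULL) != 0 || before != after);  // retry on a racing write
}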

The performance of EP16 with num_sms=64:
BF16

[tuning] SMs 64, NVL chunk 44, RDMA chunk 24, transmit: 1518.00 us, notify: 168.23 us, BW: 77.05 GB/s (RDMA), 252.32 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 24, transmit: 1524.00 us, notify: 93.31 us, BW: 76.79 GB/s (RDMA), 251.79 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 1529.00 us, notify: 153.89 us, BW: 76.50 GB/s (RDMA), 250.50 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 1537.00 us, notify: 81.10 us, BW: 76.14 GB/s (RDMA), 249.66 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 32, transmit: 1576.00 us, notify: 176.63 us, BW: 74.22 GB/s (RDMA), 243.03 GB/s (NVL) 
[tuning] Best dispatch (BF16): SMs 64, NVL chunk 8, RDMA chunk 4, transmit: 1443.00 us, notify: 174.91 us, BW: 81.06 GB/s (RDMA), 265.43 GB/s (NVL)

FP8

[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 948.76 us, notify: 342.82 us, BW: 63.57 GB/s (RDMA), 208.16 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 28, transmit: 1149.00 us, notify: 79.49 us, BW: 52.52 GB/s (RDMA), 172.20 GB/s (NVL) 
[tuning] SMs 64, NVL chunk 44, RDMA chunk 32, transmit: 926.40 us, notify: 163.21 us, BW: 65.10 GB/s (RDMA), 213.18 GB/s (NVL) 
[tuning] Best dispatch (FP8): SMs 64, NVL chunk 12, RDMA chunk 12, transmit: 866.26 us, notify: 162.48 us, BW: 69.62 GB/s (RDMA), 227.98 GB/s (NVL)

Copilot AI left a comment

Pull request overview

This PR implements debugging improvements and AMD/ROCm platform compatibility fixes for the EP (Expert Parallelism) module. The changes address platform-specific issues with CUDA synchronization, RDMA operations, and kernel launch configurations for AMD GPUs.

Key changes:

  • Added an AMD-specific sequence-lock mechanism for RDMA operations to replace the previous lock-based synchronization
  • Fixed GPU architecture detection and compilation flags for the ROCm platform
  • Added debugging synchronization and error checking in CUDA operations

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

Summary per file:

  • ep/src/uccl_ep.cc: Added CUDA stream synchronization and error checking for debugging
  • ep/src/rdma.cpp: Enhanced error messages and commented out tail WriteImm logic with added assertions
  • ep/src/proxy.cpp: Removed commented-out debug print statements and whitespace
  • ep/src/internode.cu: Implemented AMD-specific sequential locking for RDMA, fixed data copy size calculation, and added scale copying logic
  • ep/setup.py: Fixed ROCm architecture detection and compilation flag handling
  • ep/include/ep_utils.cuh: Fixed missing return value assignment and removed AMD __syncwarp macro
  • ep/include/ep_launch.cuh: Removed AMD-specific kernel launch configuration macros
  • ep/include/ep_configs.cuh: Added AMD-specific timeout cycle values
  • ep/bench/run_ep.sh: Modified test parameters (reduced num-tokens, added logging)


Comment on lines +925 to +926
assert(aimm.GetValue() == cmd.atomic_val);
assert(aimm.GetOff() == cmd.atomic_offset);

Copilot AI, Dec 1, 2025

These assertions appear to be debugging code that should be removed or converted to proper error handling. Assertions in production code can cause crashes. Consider using runtime checks with error logging instead, or remove if this is temporary debugging code.

Suggested change:

- assert(aimm.GetValue() == cmd.atomic_val);
- assert(aimm.GetOff() == cmd.atomic_offset);
+ if (aimm.GetValue() != cmd.atomic_val) {
+   fprintf(stderr, "Error: AtomicsImm value mismatch: expected %ld, got %ld\n",
+           static_cast<long>(cmd.atomic_val), static_cast<long>(aimm.GetValue()));
+   std::abort();
+ }
+ if (aimm.GetOff() != cmd.atomic_offset) {
+   fprintf(stderr, "Error: AtomicsImm offset mismatch: expected %ld, got %ld\n",
+           static_cast<long>(cmd.atomic_offset), static_cast<long>(aimm.GetOff()));
+   std::abort();
+ }

Comment on lines +1205 to +1208
UNROLLED_WARP_COPY(5, lane_id, num_bytes_per_token / sizeof(int4),
                   reinterpret_cast<int4*>(dst_shifted),
                   reinterpret_cast<int4*>(shifted), ld_nc_global,
                   st_na_global);

Copilot AI, Dec 1, 2025

[nitpick] The magic number '5' in the UNROLLED_WARP_COPY macro is unclear. Consider replacing it with a named constant (e.g., 'UNROLL_FACTOR' or 'NUM_ITERATIONS') to improve code readability and maintainability.
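
For example, one possible shape for this suggestion; the constant name is illustrative, and the call mirrors the snippet above:

// Hypothetical named constant for the unroll factor used in the copy above.
constexpr int kDispatchCopyUnrollFactor = 5;

UNROLLED_WARP_COPY(kDispatchCopyUnrollFactor, lane_id,
                   num_bytes_per_token / sizeof(int4),
                   reinterpret_cast<int4*>(dst_shifted),
                   reinterpret_cast<int4*>(shifted), ld_nc_global,
                   st_na_global);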

    arch_list = list(matches)

else:
    gpu_archs = gpu_archs.split(",")

Copilot AI, Dec 1, 2025

The variable 'arch_list' is only defined inside the 'if not matches' block (line 182) but is used here unconditionally. If 'gpu_archs' is provided via environment variable, 'arch_list' will be undefined, causing a NameError. The logic should use 'gpu_archs' when set, or define 'arch_list' in both branches.

Suggested change:

- gpu_archs = gpu_archs.split(",")
+ arch_list = gpu_archs.split(",")

Comment on lines +198 to +199
cxx_flags.append("-DUSE_GRACE_HOPPER")
nvcc_flags.append("-DUSE_GRACE_HOPPER")

Copilot AI, Dec 1, 2025

The USE_GRACE_HOPPER flag is being added in the AMD/ROCm code path, but Grace Hopper is an NVIDIA architecture. This appears to be a copy-paste error and should likely be removed from the AMD-specific section.

Suggested change:

- cxx_flags.append("-DUSE_GRACE_HOPPER")
- nvcc_flags.append("-DUSE_GRACE_HOPPER")
+ # Removed erroneous Grace Hopper flag for AMD/ROCm
+ # (No action needed)
