Hi,
I’m testing DeepEP across multiple nodes and found that bandwidth looks normal on 2 nodes, but drops sharply when scaling to 4 nodes.
Could you please suggest possible causes or configurations that might affect this? Any tips for debugging or tuning would be appreciated; I have also included a raw all-to-all baseline sketch below the results in case that helps narrow things down.
Setup:
GPUs: NVIDIA H200
Network: CX7 400 Gb/s InfiniBand
DeepEP version: commit a84a248
Test results (best configurations only):
# 2 nodes
[tuning] Best combine: SMs 24, NVL chunk 2, RDMA chunk 20: 44.37 GB/s (RDMA), 145.16 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 28, RDMA chunk 16: 40.89 GB/s (RDMA), 133.78 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 28, RDMA chunk 24: 38.19 GB/s (RDMA), 124.92 GB/s (NVL)
# 4 nodes
[tuning] Best combine: SMs 24, NVL chunk 2, RDMA chunk 8: 12.53 GB/s (RDMA), 25.05 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 12, RDMA chunk 4: 12.27 GB/s (RDMA), 24.51 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 4, RDMA chunk 4: 13.15 GB/s (RDMA), 26.27 GB/s (NVL)
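To separate fabric/NIC behaviour from DeepEP's kernels, I can also run a plain NCCL all-to-all micro-benchmark that does not involve DeepEP at all and compare 2 vs. 4 nodes. Below is a minimal sketch of what I have in mind; the script name, launch command, message size, and 8-GPUs-per-node assumption are illustrative, not taken from the DeepEP tests:

```python
# alltoall_bench.py (illustrative name)
# Launch example (assuming 8 GPUs per node):
#   torchrun --nnodes 4 --nproc_per_node 8 \
#       --rdzv_backend c10d --rdzv_endpoint <master>:29500 alltoall_bench.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # 256 MiB of BF16 per rank, split evenly across all peers.
    numel = 128 * 1024 * 1024
    send = torch.randn(numel, dtype=torch.bfloat16, device="cuda")
    recv = torch.empty_like(send)

    # Warm-up iterations so NCCL setup cost is excluded from timing.
    for _ in range(5):
        dist.all_to_all_single(recv, send)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    iters = 20
    start.record()
    for _ in range(iters):
        dist.all_to_all_single(recv, send)
    end.record()
    torch.cuda.synchronize()

    elapsed_s = start.elapsed_time(end) / 1e3 / iters  # ms -> s, per iteration
    # Outbound bytes per rank per iteration (local shard excluded).
    bytes_out = send.element_size() * numel * (world_size - 1) / world_size
    if rank == 0:
        print(f"all-to-all busbw: {bytes_out / elapsed_s / 1e9:.2f} GB/s per rank")

if __name__ == "__main__":
    main()
```

If this baseline also collapses at 4 nodes, the problem is likely in the IB fabric or NCCL/NIC configuration rather than DeepEP itself; if it stays near line rate, the regression is more likely in DeepEP's internode path or tuning parameters.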
Thanks!