Description
Dear developers,
I have two questions:
(1) The detected GPU-NIC mapping is wrong in my environment. I am running 16 GPUs with RCCL, with 4 NICs per node.
Because the topology that RCCL detects automatically is wrong, we have to supply a topology file through NCCL_TOPO_FILE.
Detected best GPU-NIC mapping:
GPU 0 -> NIC mlx5_0, dev_idx: 0
GPU 1 -> NIC mlx5_1, dev_idx: 1
GPU 2 -> NIC mlx5_2, dev_idx: 2
GPU 3 -> NIC mlx5_3, dev_idx: 3
GPU 4 -> NIC mlx5_2, dev_idx: 2
GPU 5 -> NIC mlx5_0, dev_idx: 0
GPU 6 -> NIC mlx5_0, dev_idx: 0
GPU 7 -> NIC mlx5_3, dev_idx: 3
Can UCCL provide a way to map GPUs to NICs manually, similar to NCCL_TOPO_FILE?
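To make the imbalance in the detected mapping concrete, here is a minimal sketch that parses the log lines above and counts how many GPUs share each NIC. The helper names (`parse_mapping`, `gpus_per_nic`) are hypothetical and are not part of UCCL, RCCL, or NCCL; the script only assumes the log format shown in this issue.

```python
import re
from collections import Counter

# Mapping lines copied from the log in this issue.
MAPPING_LOG = """\
GPU 0 -> NIC mlx5_0, dev_idx: 0
GPU 1 -> NIC mlx5_1, dev_idx: 1
GPU 2 -> NIC mlx5_2, dev_idx: 2
GPU 3 -> NIC mlx5_3, dev_idx: 3
GPU 4 -> NIC mlx5_2, dev_idx: 2
GPU 5 -> NIC mlx5_0, dev_idx: 0
GPU 6 -> NIC mlx5_0, dev_idx: 0
GPU 7 -> NIC mlx5_3, dev_idx: 3
"""

# Assumed line format: "GPU <id> -> NIC <name>, dev_idx: <idx>"
LINE_RE = re.compile(r"GPU (\d+) -> NIC ([^,\s]+), dev_idx: (\d+)")


def parse_mapping(log: str) -> dict[int, str]:
    """Return {gpu_id: nic_name} for every mapping line found in the log."""
    return {int(m.group(1)): m.group(2) for m in LINE_RE.finditer(log)}


def gpus_per_nic(mapping: dict[int, str]) -> Counter:
    """Count how many GPUs were assigned to each NIC."""
    return Counter(mapping.values())


if __name__ == "__main__":
    counts = gpus_per_nic(parse_mapping(MAPPING_LOG))
    for nic in sorted(counts):
        print(f"{nic}: {counts[nic]} GPU(s)")
```

With the mapping above, mlx5_0 serves three GPUs (0, 5, 6) while mlx5_1 serves only one (GPU 1), so the GPUs are not spread evenly across the 4 NICs.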
(2) In my environment I need to run one process with 8 GPUs. In that mode, the allreduce performance should match RCCL once question (1) is solved.
However, running one process per GPU gives poor performance.
Could you help me analyze this issue?
Thank you.