
GPU and IB NIC map error on my env #471

@shanleo2024


Hi dear developers,
I have two questions:
(1) The GPU-to-NIC mapping detected in my environment is wrong. I am using RCCL to run 16 GPUs, with 4 NICs per node.
Because the topology that RCCL detects on its own is also wrong, we have to supply a topology file through NCCL_TOPO_FILE.

Detected best GPU-NIC mapping:
GPU 0 -> NIC mlx5_0, dev_idx: 0
GPU 1 -> NIC mlx5_1, dev_idx: 1
GPU 2 -> NIC mlx5_2, dev_idx: 2
GPU 3 -> NIC mlx5_3, dev_idx: 3
GPU 4 -> NIC mlx5_2, dev_idx: 2
GPU 5 -> NIC mlx5_0, dev_idx: 0
GPU 6 -> NIC mlx5_0, dev_idx: 0
GPU 7 -> NIC mlx5_3, dev_idx: 3
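
To make the mapping above easier to read, here is a minimal Python sketch that groups GPUs by the NIC they were assigned to. It assumes the log lines have exactly the format shown above; the helper name `gpus_per_nic` is made up for illustration and is not part of UCCL or RCCL. It only summarizes the log and does not work out the correct mapping from the PCIe topology.

```python
import re
from collections import defaultdict

# Mapping lines copied from the log above.
LOG = """\
GPU 0 -> NIC mlx5_0, dev_idx: 0
GPU 1 -> NIC mlx5_1, dev_idx: 1
GPU 2 -> NIC mlx5_2, dev_idx: 2
GPU 3 -> NIC mlx5_3, dev_idx: 3
GPU 4 -> NIC mlx5_2, dev_idx: 2
GPU 5 -> NIC mlx5_0, dev_idx: 0
GPU 6 -> NIC mlx5_0, dev_idx: 0
GPU 7 -> NIC mlx5_3, dev_idx: 3
"""

# Assumed line format: "GPU <id> -> NIC <name>, dev_idx: <idx>"
LINE_RE = re.compile(r"GPU (\d+) -> NIC ([^,\s]+), dev_idx: (\d+)")


def gpus_per_nic(log: str) -> dict[str, list[int]]:
    """Return {nic_name: [gpu_ids]} for every mapping line found in the log."""
    groups = defaultdict(list)
    for match in LINE_RE.finditer(log):
        gpu_id, nic_name = int(match.group(1)), match.group(2)
        groups[nic_name].append(gpu_id)
    return dict(groups)


# Print one line per NIC with the GPUs assigned to it.
for nic, gpus in sorted(gpus_per_nic(LOG).items()):
    print(f"{nic}: GPUs {gpus}")
```

On this log, mlx5_0 ends up shared by three GPUs (0, 5, 6) while mlx5_1 serves only GPU 1.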

Can UCCL provide a way to map GPUs to NICs manually, similar to NCCL_TOPO_FILE?

(2) In my environment, I need to run one process that drives all 8 GPUs; in that mode, allreduce performance should match RCCL once question (1) is solved.
With one process per GPU, however, performance is poor.
Can you help me analyze this issue?

Thank you.
