When running test_internode.py on two nodes (with 8 GPUs per node), an error occurs.
Execution Command::
MASTER_ADDR=${MASTER_ADDR:-"10.124.173.185"}
MASTER_PORT=${MASTER_PORT:-10556}
WORLD_SIZE=${WORLD_SIZE:-2}
RANK=${RANK:-${ARNOLD_ID:-0}}
MASTER_ADDR="$MASTER_ADDR" MASTER_PORT=$MASTER_PORT WORLD_SIZE=$WORLD_SIZE RANK=$RANK
python ./tests/test_internode.py 2>&1 | tee debug_${RANK}.log
Error Logs:
rank0:
https://github.com/SaltFish11/DeepEP/blob/develop/debug_0.log
rank1:
https://github.com/SaltFish11/DeepEP/blob/develop/debug_1.log