[Issue]: nccl receiving an external TCP request causes the proxy thread's ncclProxyService to hang #1808

@YeSho-cpp

Description

How is this issue impacting you?

Application hang

Share Your Debug Logs

2025-08-12 09:37:02.216313 worker-9 >> :168:1228 [2] NCCL INFO Connected all trees
2025-08-12 09:37:02.25487 worker-9 >> :169:1234 [3] NCCL INFO Connected all trees
2025-08-12 09:37:02.259538 worker-9 >> :170:1232 [4] NCCL INFO Connected all trees
2025-08-12 09:37:02.259607 worker-9 >> :171:1235 [5] NCCL INFO Connected all trees
2025-08-12 09:37:02.259932 worker-9 >> :167:1230 [1] NCCL INFO Connected all trees
2025-08-12 09:37:02.259934 worker-9 >> :166:1233 [0] NCCL INFO Connected all trees
2025-08-12 09:37:02.259936 worker-9 >> :173:1229 [7] NCCL INFO Connected all trees
2025-08-12 09:37:02.259938 worker-9 >> :172:1231 [6] NCCL INFO Connected all trees
2025-08-12 09:37:03.463948 worker-9 >> [2025-08-12 01:37:03] :166:1226 [0] misc/socket.cc:484 NCCL WARN socketFinalizeAccept: wrong magic 74656d2f20544547 != 7970663b906d70b9
2025-08-12 09:37:14.267026 worker-9 >> :166:1835 [0] NCCL INFO Channel 00/0 : 80[0] -> 87[7] via P2P/CUMEM
2025-08-12 09:37:14.28165 worker-9 >> :168:1836 [2] NCCL INFO Channel 03/0 : 82[2] -> 90[2] [send] via NET/IBext_v10/10/GDRDMA
2025-08-12 09:37:14.28173 worker-9 >> :168:1836 [2] NCCL INFO Channel 11/0 : 82[2] -> 90[2] [send] via NET/IBext_v10/10/GDRDMA
2025-08-12 09:37:14.288816 worker-9 >> :171:1837 [5] NCCL INFO Channel 04/0 : 85[5] -> 93[5] [send] via NET/IBext_v10/13/GDRDMA
2025-08-12 09:37:14.288886 worker-9 >> :171:1837 [5] NCCL INFO Channel 12/0 : 85[5] -> 93[5] [send] via NET/IBext_v10/13/GDRDMA
2025-08-12 09:37:14.291391 worker-9 >> :172:1838 [6] NCCL INFO Channel 07/0 : 86[6] -> 94[6] [send] via NET/IBext_v10/14/GDRDMA
2025-08-12 09:37:14.291459 worker-9 >> :172:1838 [6] NCCL INFO Channel 15/0 : 86[6] -> 94[6] [send] via NET/IBext_v10/14/GDRDMA
2025-08-12 09:37:14.297593 worker-9 >> :170:1839 [4] NCCL INFO Channel 05/0 : 84[4] -> 92[4] [send] via NET/IBext_v10/12/GDRDMA
2025-08-12 09:37:14.297662 worker-9 >> :170:1839 [4] NCCL INFO Channel 13/0 : 84[4] -> 92[4] [send] via NET/IBext_v10/12/GDRDMA
2025-08-12 09:37:14.304893 worker-9 >> :167:1840 [1] NCCL INFO Channel 00/0 : 81[1] -> 89[1] [send] via NET/IBext_v10/9/GDRDMA
2025-08-12 09:37:14.304961 worker-9 >> :167:1840 [1] NCCL INFO Channel 08/0 : 81[1] -> 89[1] [send] via NET/IBext_v10/9/GDRDMA
2025-08-12 09:37:14.317681 worker-9 >> :169:1841 [3] NCCL INFO Channel 02/0 : 83[3] -> 91[3] [send] via NET/IBext_v10/11/GDRDMA
2025-08-12 09:37:14.317772 worker-9 >> :169:1841 [3] NCCL INFO Channel 10/0 : 83[3] -> 91[3] [send] via NET/IBext_v10/11/GDRDMA
2025-08-12 09:37:14.400754 worker-9 >> :168:1836 [2] NCCL INFO Channel 02/0 : 82[2] -> 81[1] via P2P/CUMEM
2025-08-12 09:37:14.401317 worker-9 >> :168:1836 [2] NCCL INFO Channel 10/0 : 82[2] -> 81[1] via P2P/CUMEM
2025-08-12 09:37:14.402336 worker-9 >> :170:1839 [4] NCCL INFO Channel 04/0 : 84[4] -> 83[3] via P2P/CUMEM
2025-08-12 09:37:14.402604 worker-9 >> :170:1839 [4] NCCL INFO Channel 12/0 : 84[4] -> 83[3] via P2P/CUMEM
2025-08-12 09:37:14.413799 worker-9 >> :172:1838 [6] NCCL INFO Channel 06/0 : 86[6] -> 85[5] via P2P/CUMEM
2025-08-12 09:37:14.415856 worker-9 >> :172:1838 [6] NCCL INFO Channel 14/0 : 86[6] -> 85[5] via P2P/CUMEM
2025-08-12 09:38:03.463005 worker-9 >> [2025-08-12 01:38:03] :166:1226 [0] misc/socket.cc:484 NCCL WARN socketFinalizeAccept: wrong magic 74656d2f20544547 != 7970663b906d70b9

The whole program is stuck at this line.
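The value reported as the wrong magic is itself a clue about what connected. Below is a small sketch that decodes the 8 bytes as ASCII. The helper name magicToAscii is made up for illustration, and it assumes the 64-bit value was read directly from the socket on a little-endian host.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of NCCL): render a 64-bit value as the 8 bytes
// that arrived on the wire, assuming a little-endian host.
// Non-printable bytes are shown as '.'.
std::string magicToAscii(uint64_t magic) {
  std::string out;
  for (int i = 0; i < 8; i++) {
    // Byte i of a little-endian integer is the i-th byte received.
    unsigned char c = static_cast<unsigned char>((magic >> (8 * i)) & 0xff);
    out += (c >= 0x20 && c < 0x7f) ? static_cast<char>(c) : '.';
  }
  return out;
}
```

For the value in the log, magicToAscii(0x74656d2f20544547ULL) yields "GET /met", which looks like the start of an HTTP GET request line. This is consistent with an external HTTP client hitting the proxy's listening port, although the log alone does not identify the sender.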

Steps to Reproduce the Issue

Here's the entire unexpected reproduction process:

After a connection is established, the proxy thread enters for (int s=0; s<maxnpeers; s++). The ncclSocketTryRecv call there receives the data sent by the main thread: type, connection, reqSize, respSize, and reqBuff. Slots with pollfds[s].fd == -1 are skipped immediately, so the for loop ends and control returns to while (stop == PROXY_RUNNING || npeers > 0), which calls poll again.

Under normal circumstances the main thread sends its data at this point, the proxy thread sees the read event (pollfds[s].revents is set), and it receives the data.

The hang occurs when an additional, external connection arrives after the main thread has connected to the proxy but before the proxy has received the main thread's data. Even if the main thread has already sent its data, poll reports the new connection event first, so the proxy thread tries to establish that connection. Verification fails, "wrong magic" is printed, and the proxy thread accepts again, so the whole program gets stuck.
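The sequence described above can be illustrated with a toy model. This is a sketch of the reported behavior only, not NCCL's code: every type and function name here (Model, PendingConn, blockingAccept, serviceOnce) is hypothetical, and real sockets are replaced by simple queues and counters.

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Toy model only; none of these types exist in NCCL.
struct PendingConn { uint64_t magic; };

enum class Outcome { Served, BlockedInAccept };

struct Model {
  uint64_t expectedMagic = 0;
  std::deque<PendingConn> listenQueue;  // connections waiting on the listening socket
  int peerMessagesPending = 0;          // data already sent on an accepted socket
  int peerMessagesServed = 0;
  std::vector<std::string> log;
};

// Models a blocking accept that, on a magic mismatch, warns and accepts again.
// Returns false when the queue runs dry, which is where a real blocking
// accept would sleep until another connection arrives.
bool blockingAccept(Model& m) {
  while (!m.listenQueue.empty()) {
    PendingConn c = m.listenQueue.front();
    m.listenQueue.pop_front();
    if (c.magic == m.expectedMagic) return true;
    m.log.push_back("wrong magic");
  }
  return false;
}

// One pass of the service loop: a pending connection event is handled before
// any data from already-connected peers, as the issue describes.
Outcome serviceOnce(Model& m) {
  if (!m.listenQueue.empty() && !blockingAccept(m)) {
    return Outcome::BlockedInAccept;  // peer data is never reached
  }
  m.peerMessagesServed += m.peerMessagesPending;
  m.peerMessagesPending = 0;
  return Outcome::Served;
}
```

With one stray connection carrying the wrong magic in listenQueue and one message in peerMessagesPending, serviceOnce returns Outcome::BlockedInAccept and peerMessagesServed stays at 0. That mirrors the report: the main thread's request is already waiting, but the proxy thread is waiting inside accept instead of reading it.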

NCCL Version

NCCL version 2.26.5+cuda12.9 and NCCL version 2.27.6

Your platform details

No response

Error Message & Behavior

The root cause is that the proxy thread uses poll, and establishing a connection and then receiving its requests takes multiple poll iterations. If an unexpected connection arrives between those iterations, it disrupts the sequence.
