Description
I'm trying to use torch RPC for distributed training with a parameter server architecture. With fewer than 20 workers everything works fine, but once I increase the number of workers to 20 or more, I get the following runtime error:
terminate called after throwing an instance of 'std::runtime_error' what(): In connectFromLoop at tensorpipe/transport/uv/uv.h:297 "rv < 0: too many open files"
followed by:
[W tensorpipe_agent.cpp:726] RPC agent for worker:2 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:8 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:3 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:9 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:18 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:13 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:0 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:4 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:5 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:16 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:1 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:6 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:7 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:17 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:10 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:14 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:12 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:11 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:15 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:19 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
I call init_rpc with these arguments:
rpc.init_rpc('worker:{}'.format(rank-num_ps), rank=rank, world_size=world_size, rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method='env://', _transports=["uv"],))
I'm using PyTorch 1.13 with CUDA toolkit 11.7, but I previously hit a similar issue with PyTorch 1.8.1 and CUDA 10.2.
Running cat /proc/sys/fs/file-max gives 9223372036854775807, and by logging the number of open files I can confirm that this limit is never reached. I'm curious where the issue might be coming from and how it should be fixed.
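For reference, /proc/sys/fs/file-max is the system-wide ceiling, which is separate from the per-process descriptor limit (RLIMIT_NOFILE, the value `ulimit -n` reports). Below is a minimal sketch of how the per-process limit and the current descriptor count could be logged from inside each process, assuming Linux. The helper names (fd_limits, open_fd_count, raise_soft_limit) are hypothetical, and whether this limit is what triggers the error above is an assumption, not something confirmed here.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors open in this process (Linux only, via /proc/self/fd).

    The count includes the descriptor that listdir itself opens briefly.
    """
    return len(os.listdir("/proc/self/fd"))


def raise_soft_limit(target):
    """Try to raise the soft limit to `target`, never beyond the hard limit.

    Returns the soft limit in effect afterwards. Hypothetical helper: it does
    nothing if the soft limit is already unlimited or at least `target`.
    """
    soft, hard = fd_limits()
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("soft limit:", soft, "hard limit:", hard)
    if os.path.isdir("/proc/self/fd"):
        print("open descriptors:", open_fd_count())
```

Calling fd_limits() and open_fd_count() in the ps process and in each worker just before rpc.init_rpc would show whether the soft limit (often 1024 by default) is being approached as the worker count grows.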
Thank you!