rv < 0: too many open files #460

@hamidralmasi

I'm trying to use torch RPC for distributed training in a parameter server architecture. With fewer than 20 workers, everything works fine, but once I increase the number of workers to 20 or more, I get the following runtime error:

terminate called after throwing an instance of 'std::runtime_error' what(): In connectFromLoop at tensorpipe/transport/uv/uv.h:297 "rv < 0: too many open files"

followed by:

[W tensorpipe_agent.cpp:726] RPC agent for worker:2 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:8 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:3 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:9 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:18 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:13 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:0 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:4 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:5 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:16 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:1 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:6 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:7 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:17 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:10 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:14 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:12 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:11 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:15 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:19 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)

I call init_rpc with these arguments:

rpc.init_rpc(
    'worker:{}'.format(rank - num_ps),
    rank=rank,
    world_size=world_size,
    rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
        init_method='env://',
        _transports=["uv"],
    ),
)

I'm using PyTorch 1.13 with CUDA toolkit 11.7, but I previously experienced a similar issue with PyTorch 1.8.1 and CUDA 10.2 as well.

Running cat /proc/sys/fs/file-max prints 9223372036854775807, and by logging the number of open files I can confirm that this limit is never reached. I'm curious where the issue might be coming from and how it should be fixed.
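
One thing that may be worth checking: /proc/sys/fs/file-max is the system-wide ceiling, whereas "too many open files" normally corresponds to EMFILE, which is governed by the per-process limit (RLIMIT_NOFILE, the value reported by ulimit -n). That limit is often far lower, commonly 1024 by default. Below is a minimal sketch for inspecting and raising it from inside each process before init_rpc is called. It assumes Linux and Python's standard resource module; the helper names and the 4096 target are illustrative and not part of PyTorch or TensorPipe, and whether this resolves the error in this setup is untested.

```python
import os
import resource


def count_open_fds():
    """Return the number of file descriptors open in this process.

    Assumes a Linux-style /proc filesystem; returns None elsewhere.
    The count includes the descriptor briefly used to list the directory.
    """
    fd_dir = "/proc/self/fd"
    if not os.path.isdir(fd_dir):
        return None
    return len(os.listdir(fd_dir))


def raise_nofile_soft_limit(target=4096):
    """Try to raise the soft RLIMIT_NOFILE to `target`, capped by the hard limit.

    Returns (old_soft, new_soft, hard). The soft limit is never lowered.
    `target` is an illustrative value, not a recommendation.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited:
        return soft, soft, hard  # nothing to raise

    new_soft = target if hard == unlimited else min(target, hard)
    if new_soft <= soft:
        return soft, soft, hard  # already at or above the requested value

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        return soft, soft, hard  # the OS refused; leave the limit unchanged
    return soft, new_soft, hard


if __name__ == "__main__":
    old_soft, new_soft, hard = raise_nofile_soft_limit()
    print("RLIMIT_NOFILE soft: {} -> {} (hard: {})".format(old_soft, new_soft, hard))
    print("open file descriptors:", count_open_fds())
```

Calling raise_nofile_soft_limit() at the top of each worker and parameter server process, before rpc.init_rpc, would apply the change to that process only. Raising the hard limit itself usually requires administrator configuration (for example via ulimit -n in the launching shell or the system's limits configuration).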

Thank you!
