dask-mpi fails with wheel packaging #83

@mahendrapaipuri

Description

Using pip install dask-mpi

$ pip install dask-mpi
$ mpirun -np 2 dask-mpi --name=test-worker --nthreads=1 --memory-limit=0 --scheduler-file=test.json
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://172.16.66.109:36539
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.66.109:36297'
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Using python setup.py install

$ python setup.py install
$ mpirun -np 2 dask-mpi --name=test-worker --nthreads=1 --memory-limit=0 --scheduler-file=test.json
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://172.16.66.109:44933
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.66.109:42437'
distributed.diskutils - INFO - Found stale lock file and directory '/home/mpaipuri/downloads/dask-mpi/dask-worker-space/worker-6h2hf4i6', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.66.109:37893
distributed.worker - INFO -          Listening to:  tcp://172.16.66.109:37893
distributed.worker - INFO -          dashboard at:        172.16.66.109:45119
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.66.109:44933
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/mpaipuri/downloads/dask-mpi/dask-worker-space/worker-t48hj0dc
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.16.66.109:37893', name: rascil-worker-1, status: undefined, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.16.66.109:37893
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:  tcp://172.16.66.109:44933
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection

What happened: Running dask-mpi installed via wheel packaging fails, while the same command works normally with egg packaging (python setup.py install). Tested on 2 different systems and the same behaviour is observed on both.

What you expected to happen: dask-mpi to work with both packaging methods.

Anything else we need to know?: The only difference between the two approaches is the generated dask-mpi command-line executable.
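For context, the two install methods emit differently shaped launcher scripts. The sketch below reproduces their typical structure for comparison; the exact text varies by pip/setuptools version, and the entry point `dask_mpi.cli:go` is assumed from the package's console_scripts declaration, so verify against your own install with `cat "$(command -v dask-mpi)"`.

```python
# Sketch of the two launcher styles (assumed shapes, not copied from any
# particular install). A wheel install generates a thin script that imports
# the entry-point function directly:
wheel_script = """\
#!/usr/bin/env python
import re
import sys
from dask_mpi.cli import go
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\\.pyw|\\.exe)?$', '', sys.argv[0])
    sys.exit(go())
"""

# An egg install (python setup.py install) instead generates a wrapper that
# resolves the entry point at runtime through pkg_resources:
egg_script = """\
#!/usr/bin/env python
import re
import sys
from pkg_resources import load_entry_point
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\\.pyw|\\.exe)?$', '', sys.argv[0])
    sys.exit(
        load_entry_point('dask-mpi', 'console_scripts', 'dask-mpi')()
    )
"""

# The structural difference: a direct import versus a pkg_resources lookup.
print("wheel imports the CLI directly:",
      "from dask_mpi.cli import go" in wheel_script)
print("egg resolves via pkg_resources:",
      "load_entry_point" in egg_script)
```

Comparing the actual launchers on an affected system would show whether the direct-import path is what triggers the orte_init failure.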

  • Dask version: 2021.11.2
  • Python version: 3.8
  • Operating System: Debian 11
  • Install method (conda, pip, source): pip and source
