Non-MPI backends (e.g. Slurm) #97

@lgarrison

Since a "major makeover" of dask-mpi was mentioned in dask/distributed#7192, I thought I would ask whether support for non-MPI backends is a possibility. That is, could something like Slurm's srun launch the ranks even when MPI is not available?

I think this may be possible because, as far as I can tell, MPI is only used to broadcast the scheduler's address to the workers. After that, all communication appears to go over direct TCP. If this is correct, could the startup instead be achieved by, e.g., writing the address to a file in a shared location?
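
To make the shared-file idea concrete, here is a minimal sketch using only the Python standard library. It is not how dask-mpi works today; the helper names (`publish_address`, `wait_for_address`, `role_for_rank`), the JSON layout, and the "rank 0 is the scheduler" convention are all assumptions made for illustration. One process writes the scheduler address atomically to a file on a shared filesystem, and the others poll until it appears.

```python
import json
import os
import tempfile
import time


def publish_address(path, address):
    """Write the scheduler address to `path` atomically (hypothetical helper).

    The data goes to a temporary file in the same directory first and is then
    renamed into place, so a reader never sees a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"address": address}, f)
    os.replace(tmp_path, path)


def wait_for_address(path, timeout=60.0, poll=0.1):
    """Poll `path` until it exists, then return the address stored in it.

    Raises TimeoutError if the file has not appeared after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as f:
                return json.load(f)["address"]
        except FileNotFoundError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no scheduler file at {path!r}")
            time.sleep(poll)


def role_for_rank(rank):
    """Assumed convention for this sketch: rank 0 is the scheduler."""
    return "scheduler" if rank == 0 else "worker"
```

Under srun, each task could read its rank from the `SLURM_PROCID` environment variable that Slurm sets, call `role_for_rank`, and then either publish or wait for the address. As far as I know, distributed already offers a similar file-based handshake through its scheduler-file option, so a non-MPI launcher might be able to build on that rather than on custom helpers like these.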

The benefit would be eliminating a fairly heavyweight MPI dependency. When users mix, e.g., a conda install of MPI with an Lmod install of MPI, their jobs often get confused and fail. It seems like dask-mpi has a chance to sidestep these issues completely.
