
Conversation

@vanman-nguyen

When building Open MPI v6.0.x and main with internal PRRTE, the generated binaries are named ompi-prte-*, and then symlinked to their usual names. It seems the symbolic link for prted was missing, as mpirun was complaining about it.
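
For reference, the workaround on an affected installation is simply to recreate the link by hand. A minimal sketch, assuming the renamed daemon is installed as ompi-prted and ${PREFIX} stands for the --prefix passed to configure:

cd ${PREFIX}/bin
ln -s ompi-prted prted   # mirror the links already present for the other ompi-prte-* binaries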

@github-actions

github-actions bot commented Dec 3, 2025

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f83ce29: fix: added symlink for prted

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
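
A typical fix, assuming the missing sign-off is the only change needed (<pr-branch> is a placeholder for the actual PR branch name):

git commit --amend -s            # append a Signed-off-by line to the HEAD commit
git push -f origin <pr-branch>   # force-push the amended commit back to the PR branch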

Signed-off-by: Van Man NGUYEN <[email protected]>
@hppritcha
Member

Could you show the error message mpirun is reporting? Also what configure options were you using?

@vanman-nguyen
Author

Could you show the error message mpirun is reporting? Also what configure options were you using?

I got this detailed error log by running salloc prterun hostname, which pointed me to the missing symbolic link for prted:

prterun --prtemca plm_base_verbose 100  --debug-daemons hostname
[login1:2662941] mca: base: component_find: searching NULL for plm components
[login1:2662941] mca: base: find_dyn_components: checking NULL for plm components
[login1:2662941] pmix:mca: base: components_register: registering framework plm components
[login1:2662941] pmix:mca: base: components_register: found loaded component slurm
[login1:2662941] pmix:mca: base: components_register: component slurm register function successful
[login1:2662941] pmix:mca: base: components_register: found loaded component ssh
[login1:2662941] pmix:mca: base: components_register: component ssh register function successful
[login1:2662941] mca: base: components_open: opening plm components
[login1:2662941] mca: base: components_open: found loaded component slurm
[login1:2662941] mca: base: components_open: component slurm open function successful
[login1:2662941] mca: base: components_open: found loaded component ssh
[login1:2662941] mca: base: components_open: component ssh open function successful
[login1:2662941] mca:base:select: Auto-selecting plm components
[login1:2662941] mca:base:select:(  plm) Querying component [slurm]
[login1:2662941] [[INVALID],UNDEFINED] plm:slurm: available for selection
[login1:2662941] mca:base:select:(  plm) Query of component [slurm] set priority to 75
[login1:2662941] mca:base:select:(  plm) Querying component [ssh]
[login1:2662941] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[login1:2662941] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[login1:2662941] mca:base:select:(  plm) Selected component [slurm]
[login1:2662941] mca: base: close: component ssh closed
[login1:2662941] mca: base: close: unloading component ssh
[login1:2662941] [prterun-login1-2662941@0,0] plm:base:receive start comm
[login1:2662941] [prterun-login1-2662941@0,0] plm:slurm: LAUNCH DAEMONS CALLED
[login1:2662941] [prterun-login1-2662941@0,0] plm:base:setup_vm
[login1:2662941] [prterun-login1-2662941@0,0] plm:base:setup_vm creating map
[login1:2662941] [prterun-login1-2662941@0,0] plm:base:setup_vm add new daemon [prterun-login1-2662941@0,1]
[login1:2662941] [prterun-login1-2662941@0,0] plm:base:setup_vm assigning new daemon [prterun-login1-2662941@0,1] to node pm8-nod12
[login1:2662941] [prterun-login1-2662941@0,0] plm:base:setup_vm add new daemon [prterun-login1-2662941@0,2]
[login1:2662941] [prterun-login1-2662941@0,0] plm:base:setup_vm assigning new daemon [prterun-login1-2662941@0,2] to node pm8-nod13
[login1:2662941] [prterun-login1-2662941@0,0] plm:slurm: launching on nodes pm8-nod12,pm8-nod13
[login1:2662941] [prterun-login1-2662941@0,0] plm:slurm: final top-level argv:
        srun --external-launcher --ntasks-per-node=1 --kill-on-bad-exit --mpi=none --cpu-bind=none --ntasks=2 prted --debug-daemons --prtemca ess "slurm" --prtemca ess_base_nspace "prterun-login1-2662941@0" --prtemca ess_base_vpid "1" --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prterun-login1-2662941@0.0;tcp://10.3.0.3,172.0.0.248:48757:14,16" --prtemca PREFIXES "errmgr,ess,filem,grpcomm,iof,odls,plm,prtebacktrace,prtedl,prteinstalldirs,prtereachable,ras,rmaps,schizo,state,hwloc,if,reachable" --prtemca plm_base_verbose "100" --prtemca pmix_session_server "1"
[login1:2662941] [prterun-login1-2662941@0,0] plm:slurm: reset PATH: /home/nguyenv/tools/ompi-5_commu/bin:/home/nguyenv/tools/ompi-5_commu/bin:/home/nguyenv/tools/ubcl/bin:/home/nguyenv/.local/bin:/home/nguyenv/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
[login1:2662941] [prterun-login1-2662941@0,0] plm:slurm: reset LD_LIBRARY_PATH: /home/nguyenv/tools/ompi-5_commu/lib:/home/nguyenv/tools/ompi-5_commu/lib/pmix:/home/nguyenv/tools/ompi-5_commu/lib/prrte:/home/nguyenv/tools/ompi-5_commu/lib:/home/nguyenv/tools/ubcl/lib:/home/nguyenv/.local/lib:/home/nguyenv/.local/lib:/home/nguyenv/.local/lib
srun: error: pm8-nod12: task 0: Exited with exit code 2
srun: Terminating StepId=873986.10
slurmstepd: error: execve(): prted: No such file or directory
slurmstepd: error: execve(): prted: No such file or directory
srun: error: pm8-nod13: task 1: Exited with exit code 2
[login1:2662941] [prterun-login1-2662941@0,0]:../../../../3rd-party/prrte/src/prted/prte.c(1071) updating exit status to -6
--------------------------------------------------------------------------
srun returned non-zero exit status (512) from launching
the per-node daemon. You may debug this problem further
by augmenting the cmd line with:

* "--debug-daemons"
* "--leave-session-attached"
* "--prtemca plm_base_verbose N" where N > 0
--------------------------------------------------------------------------
[login1:2662941] [prterun-login1-2662941@0,0] plm:base:receive stop comm
[login1:2662941] mca: base: close: component slurm closed
[login1:2662941] mca: base: close: unloading component slurm

Open MPI was built using this configure line:

#!/bin/bash

../configure \
        --prefix=${HOME}/tools/ompi-commu-bis \
        --enable-mpi1-compatibility \
        --enable-mca-dso=yes \
        --enable-prte-prefix-by-default \
        --with-cma \
        --with-pmix=internal \
        --with-xpmem=/usr/ \
        --with-libevent=/usr \
        --with-lustre=no \
        --with-hwloc=/usr \
        --with-ubcl=${UBCL_ROOT} \
        --without-portals4 \
        --without-ucx \
        --with-cuda=/usr/local/cuda \
        --enable-picky=no \
        --enable-debug \
        CFLAGS='-O3 -g -pipe' \
        CXXFLAGS='-O3 -g -pipe' \
        FCFLAGS='-O3 -g -pipe '

Unfortunately I didn't keep that installation around, and I overwrote it when I took another shot at reproducing the bug today. I can no longer reproduce it, neither with the main or v6.0.x tarballs nor with my git repo. Maybe my environment or PATH wasn't clean enough and it picked up the wrong version of the executables?
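
A quick sanity check for that would have been to confirm which executables the shell actually resolves before launching; a sketch, assuming the installation layout described above:

type -a mpirun prterun prted                             # every match on PATH, in resolution order
ls -l "$(dirname "$(command -v prterun)")" | grep prte   # inspect the prte* symlinks in the install's bin directory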

I can close this PR if it was indeed a user bug.

@rhc54

rhc54 commented Dec 4, 2025

If you ran prterun, it would indeed be looking for prted. I believe the point of this fork was to rename the executables to be ompi-xxx, so if you ran mpirun it would look for ompi-prted. I guess they are still installing a prterun, which is probably a mistake, or else this PR would be required to make that work.
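
If the prterun link is kept, the installed bin directory would presumably need to look something like this for the daemons to resolve (a purely hypothetical layout, following the ompi- renaming scheme described above):

mpirun -> ompi-prterun
ompi-prted
ompi-prterun
prted -> ompi-prted       # the link this PR adds
prterun -> ompi-prterun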
