Commit a12ecb0

Set up the new LSFClusterManager.jl package, and remove all non-LSF code, tests, and docs (#3)
1 parent 2a2d8c6 commit a12ecb0

21 files changed: 27 additions & 1,071 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 51 deletions
@@ -16,7 +16,6 @@ jobs:
     timeout-minutes: 10
     needs:
       - unit-tests
-      - test-slurm
     # Important: the next line MUST be `if: always()`.
     # Do not change that line.
     # That line is necessary to make sure that this job runs even if tests fail.
@@ -25,13 +24,11 @@ jobs:
     steps:
       - run: |
           echo unit-tests: ${{ needs.unit-tests.result }}
-          echo test-slurm: ${{ needs.test-slurm.result }}
       - run: exit 1
         # The last line must NOT end with ||
         # All other lines MUST end with ||
         if: |
-          (needs.unit-tests.result != 'success') ||
-          (needs.test-slurm.result != 'success')
+          (needs.unit-tests.result != 'success')
   unit-tests:
     runs-on: ubuntu-latest
     timeout-minutes: 20
@@ -51,50 +48,3 @@ jobs:
         with:
           version: ${{ matrix.version }}
       - uses: julia-actions/julia-runtest@v1
-  test-slurm:
-    runs-on: ubuntu-latest
-    timeout-minutes: 20
-    strategy:
-      fail-fast: false
-      matrix:
-        version:
-          # Please note: You must specify the full Julia version number (major.minor.patch).
-          # This is because the value here will be directly interpolated into a download URL.
-          # - '1.2.0' # minimum Julia version supported in Project.toml
-          - '1.6.7' # previous LTS
-          - '1.10.7' # current LTS
-          - '1.11.2' # currently the latest stable release
-    steps:
-      - uses: actions/checkout@v4
-        with:
-          persist-credentials: false
-      - name: Print Docker version
-        run: |
-          docker --version
-          docker version
-      # This next bit of code is taken from:
-      # https://github.com/kleinhenz/SlurmClusterManager.jl
-      # Original author: Joseph Kleinhenz
-      # License: MIT
-      - name: Setup Slurm inside Docker
-        run: |
-          docker version
-          docker compose version
-          docker build --build-arg "JULIA_VERSION=${MATRIX_JULIA_VERSION:?}" -t slurm-cluster-julia -f ci/Dockerfile .
-          docker compose -f ci/docker-compose.yml up -d
-          docker ps
-        env:
-          MATRIX_JULIA_VERSION: ${{matrix.version}}
-      - name: Print some information for debugging purposes
-        run: |
-          docker exec -t slurmctld pwd
-          docker exec -t slurmctld ls -la
-          docker exec -t slurmctld ls -la ClusterManagers
-      - name: Instantiate package
-        run: docker exec -t slurmctld julia --project=ClusterManagers -e 'import Pkg; @show Base.active_project(); Pkg.instantiate(); Pkg.status()'
-      - name: Run tests without a Slurm allocation
-        run: docker exec -t slurmctld julia --project=ClusterManagers -e 'import Pkg; Pkg.test(; test_args=["slurm"])'
-      - name: Run tests inside salloc
-        run: docker exec -t slurmctld salloc -t 00:10:00 -n 2 julia --project=ClusterManagers -e 'import Pkg; Pkg.test(test_args=["slurm"])'
-      - name: Run tests inside sbatch
-        run: docker exec -t slurmctld ClusterManagers/ci/run_my_sbatch.sh
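
The removed `test-slurm` job drove the test suite with `Pkg.test(; test_args=["slurm"])`. For context, here is a minimal sketch of how a `test/runtests.jl` can gate scheduler-specific tests on `test_args`; the group names are illustrative and this is not the repository's actual test file.

```julia
using Test

# Pkg.test(; test_args=["slurm"]) forwards the strings to the test process as ARGS.
const groups = isempty(ARGS) ? ["unit"] : ARGS

if "unit" in groups
    @testset "unit" begin
        @test 1 + 1 == 2  # placeholder for scheduler-independent tests
    end
end

if "slurm" in groups
    @testset "slurm" begin
        # Scheduler-dependent tests only make sense where the Slurm CLI is available.
        @test Sys.which("sinfo") !== nothing
    end
end
```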

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+# macOS-specific:
+.DS_Store

Project.toml

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
-name = "ClusterManagers"
-uuid = "34f1f09b-3a8b-5176-ab39-66d58a4d544e"
-version = "0.4.7"
+name = "LSFClusterManager"
+uuid = "af02cf76-cbe3-4eeb-96a8-af9391005858"
+version = "1.0.0-DEV"
 
 [deps]
 Distributed = "8ba89e20-285c-5b6f-9357-94700520ee1b"
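
With the rename, downstream projects take a dependency on `LSFClusterManager` (new name and UUID) rather than `ClusterManagers`. A hedged sketch of picking up the package while it is at `1.0.0-DEV`, assuming it is not yet registered and assuming the repository URL from the package name:

```julia
import Pkg

# Assumed URL, inferred from the new package name; adjust if the repository lives elsewhere.
Pkg.add(url="https://github.com/JuliaParallel/LSFClusterManager.jl")

using LSFClusterManager  # replaces `using ClusterManagers` for LSF-only users
```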

README.md

Lines changed: 10 additions & 129 deletions
@@ -1,79 +1,22 @@
-# ClusterManagers.jl
+# LSFClusterManager.jl
 
-The `ClusterManager.jl` package implements code for different job queue systems commonly used on compute clusters.
+The `LSFClusterManager.jl` package implements code for the LSF (Load Sharing Facility) compute cluster job queue system.
 
-> [!WARNING]
-> This package is not currently being actively maintained or tested.
->
-> We are in the process of splitting this package up into multiple smaller packages, with a separate package for each job queue systems.
->
-> We are seeking maintainers for these new packages. If you are an active user of any of the job queue systems listed below and are interested in being a maintainer, please open a GitHub issue - say that you are interested in being a maintainer, and specify which job queue system you use.
-
-## Available job queue systems
+`LSFManager` supports IBM's scheduler. See the `addprocs_lsf` docstring
+for more information.
 
-Implemented in this package (the `ClusterManagers.jl` package):
+Implemented in this package (the `LSFClusterManager.jl` package):
 
 | Job queue system | Command to add processors |
 | ---------------- | ------------------------- |
 | Load Sharing Facility (LSF) | `addprocs_lsf(np::Integer; bsub_flags=``, ssh_cmd=``)` or `addprocs(LSFManager(np, bsub_flags, ssh_cmd, retry_delays, throttle))` |
-| Sun Grid Engine (SGE) via `qsub` | `addprocs_sge(np::Integer; qsub_flags=``)` or `addprocs(SGEManager(np, qsub_flags))` |
-| Sun Grid Engine (SGE) via `qrsh` | `addprocs_qrsh(np::Integer; qsub_flags=``)` or `addprocs(QRSHManager(np, qsub_flags))` |
-| PBS (Portable Batch System) | `addprocs_pbs(np::Integer; qsub_flags=``)` or `addprocs(PBSManager(np, qsub_flags))` |
-| Scyld | `addprocs_scyld(np::Integer)` or `addprocs(ScyldManager(np))` |
-| HTCondor[^1] | `addprocs_htc(np::Integer)` or `addprocs(HTCManager(np))` |
-| Slurm | `addprocs_slurm(np::Integer; kwargs...)` or `addprocs(SlurmManager(np); kwargs...)` |
-| Local manager with CPU affinity setting | `addprocs(LocalAffinityManager(;np=CPU_CORES, mode::AffinityMode=BALANCED, affinities=[]); kwargs...)` |
-
-[^1]: HTCondor was previously named Condor.
-
-Implemented in external packages:
-
-| Job queue system | Command to add processors |
-| ---------------- | ------------------------- |
-| Kubernetes (K8s) via [K8sClusterManagers.jl](https://github.com/beacon-biosignals/K8sClusterManagers.jl) | `addprocs(K8sClusterManagers(np; kwargs...))` |
-| Azure scale-sets via [AzManagers.jl](https://github.com/ChevronETC/AzManagers.jl) | `addprocs(vmtemplate, n; kwargs...)` |
-
-You can also write your own custom cluster manager; see the instructions in the [Julia manual](https://docs.julialang.org/en/v1/manual/distributed-computing/#ClusterManagers).
-
-### Slurm: a simple example
+### LSF: a simple interactive example
 
 ```julia
-using Distributed, ClusterManagers
-
-# Arguments to the Slurm srun(1) command can be given as keyword
-# arguments to addprocs. The argument name and value is translated to
-# a srun(1) command line argument as follows:
-# 1) If the length of the argument is 1 => "-arg value",
-#    e.g. t="0:1:0" => "-t 0:1:0"
-# 2) If the length of the argument is > 1 => "--arg=value"
-#    e.g. time="0:1:0" => "--time=0:1:0"
-# 3) If the value is the empty string, it becomes a flag value,
-#    e.g. exclusive="" => "--exclusive"
-# 4) If the argument contains "_", they are replaced with "-",
-#    e.g. mem_per_cpu=100 => "--mem-per-cpu=100"
-addprocs(SlurmManager(2), partition="debug", t="00:5:00")
+julia> using LSFClusterManager
 
-hosts = []
-pids = []
-for i in workers()
-    host, pid = fetch(@spawnat i (gethostname(), getpid()))
-    push!(hosts, host)
-    push!(pids, pid)
-end
-
-# The Slurm resource allocation is released when all the workers have
-# exited
-for i in workers()
-    rmprocs(i)
-end
-```
-
-### SGE - a simple interactive example
-
-```julia
-julia> using ClusterManagers
-
-julia> ClusterManagers.addprocs_sge(5; qsub_flags=`-q queue_name`)
+julia> LSFClusterManager.addprocs_sge(5; qsub_flags=`-q queue_name`)
 job id is 961, waiting for job to start .
 5-element Array{Any,1}:
 2
@@ -93,13 +36,13 @@ julia> From worker 2: compute-6
 From worker 3: compute-6
 ```
 
-Some clusters require the user to specify a list of required resources.
+Some clusters require the user to specify a list of required resources.
 For example, it may be necessary to specify how much memory will be needed by the job - see this [issue](https://github.com/JuliaLang/julia/issues/10390).
 The keyword `qsub_flags` can be used to specify these and other options.
 Additionally the keyword `wd` can be used to specify the working directory (which defaults to `ENV["HOME"]`).
 
 ```julia
-julia> using Distributed, ClusterManagers
+julia> using Distributed, LSFClusterManager
 
 julia> addprocs_sge(5; qsub_flags=`-q queue_name -l h_vmem=4G,tmem=4G`, wd=mktempdir())
 Job 5672349 in queue.
@@ -116,70 +59,8 @@ julia> pmap(x->run(`hostname`),workers());
 julia> From worker 26: lum-7-2.local
 From worker 23: pace-6-10.local
 From worker 22: chong-207-10.local
-From worker 24: pace-6-11.local
 From worker 25: cheech-207-16.local
 
 julia> rmprocs(workers())
 Task (done)
 ```
-
-### SGE via qrsh
-
-`SGEManager` uses SGE's `qsub` command to launch workers, which communicate the
-TCP/IP host:port info back to the master via the filesystem. On filesystems
-that are tuned to make heavy use of caching to increase throughput, launching
-Julia workers can frequently timeout waiting for the standard output files to appear.
-In this case, it's better to use the `QRSHManager`, which uses SGE's `qrsh`
-command to bypass the filesystem and captures STDOUT directly.
-
-### Load Sharing Facility (LSF)
-
-`LSFManager` supports IBM's scheduler. See the `addprocs_lsf` docstring
-for more information.
-
-### Using `LocalAffinityManager` (for pinning local workers to specific cores)
-
-- Linux only feature.
-- Requires the Linux `taskset` command to be installed.
-- Usage : `addprocs(LocalAffinityManager(;np=CPU_CORES, mode::AffinityMode=BALANCED, affinities=[]); kwargs...)`.
-
-where
-
-- `np` is the number of workers to be started.
-- `affinities`, if specified, is a list of CPU IDs. As many workers as entries in `affinities` are launched. Each worker is pinned
-to the specified CPU ID.
-- `mode` (used only when `affinities` is not specified, can be either `COMPACT` or `BALANCED`) - `COMPACT` results in the requested number
-of workers pinned to cores in increasing order, For example, worker1 => CPU0, worker2 => CPU1 and so on. `BALANCED` tries to spread
-the workers. Useful when we have multiple CPU sockets, with each socket having multiple cores. A `BALANCED` mode results in workers
-spread across CPU sockets. Default is `BALANCED`.
-
-### Using `ElasticManager` (dynamically adding workers to a cluster)
-
-The `ElasticManager` is useful in scenarios where we want to dynamically add workers to a cluster.
-It achieves this by listening on a known port on the master. The launched workers connect to this
-port and publish their own host/port information for other workers to connect to.
-
-On the master, you need to instantiate an instance of `ElasticManager`. The constructors defined are:
-
-```julia
-ElasticManager(;addr=IPv4("127.0.0.1"), port=9009, cookie=nothing, topology=:all_to_all, printing_kwargs=())
-ElasticManager(port) = ElasticManager(;port=port)
-ElasticManager(addr, port) = ElasticManager(;addr=addr, port=port)
-ElasticManager(addr, port, cookie) = ElasticManager(;addr=addr, port=port, cookie=cookie)
-```
-
-You can set `addr=:auto` to automatically use the host's private IP address on the local network, which will allow other workers on this network to connect. You can also use `port=0` to let the OS choose a random free port for you (some systems may not support this). Once created, printing the `ElasticManager` object prints the command which you can run on workers to connect them to the master, e.g.:
-
-```julia
-julia> em = ElasticManager(addr=:auto, port=0)
-ElasticManager:
-  Active workers : []
-  Number of workers to be added : 0
-  Terminated workers : []
-  Worker connect command :
-    /home/user/bin/julia --project=/home/user/myproject/Project.toml -e 'using ClusterManagers; ClusterManagers.elastic_worker("4cOSyaYpgSl6BC0C","127.0.1.1",36275)'
-```
-
-By default, the printed command uses the absolute path to the current Julia executable and activates the same project as the current session. You can change either of these defaults by passing `printing_kwargs=(absolute_exename=false, same_project=false))` to the first form of the `ElasticManager` constructor.
-
-Once workers are connected, you can print the `em` object again to see them added to the list of active workers.
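
One inconsistency worth noting in the new README above: the "LSF: a simple interactive example" section still calls `LSFClusterManager.addprocs_sge`, while the table lists only the LSF entry points `addprocs_lsf` and `LSFManager`. Based on the signature shown in that table, an LSF session would look roughly like the following sketch (queue name and worker count are placeholders):

```julia
using Distributed, LSFClusterManager

# Ask LSF (via bsub) for 4 workers; `-q normal` selects a queue (placeholder name).
addprocs_lsf(4; bsub_flags=`-q normal`)

# Confirm where the workers landed.
for i in workers()
    host = fetch(@spawnat i gethostname())
    println("worker $i is running on $host")
end

# Removing the workers ends the corresponding LSF jobs.
rmprocs(workers())
```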

ci/Dockerfile

Lines changed: 0 additions & 21 deletions
This file was deleted.

ci/docker-compose.yml

Lines changed: 0 additions & 48 deletions
This file was deleted.

ci/my_sbatch.sh

Lines changed: 0 additions & 14 deletions
This file was deleted.

ci/run_my_sbatch.sh

Lines changed: 0 additions & 14 deletions
This file was deleted.

slurm_test.jl

Lines changed: 0 additions & 21 deletions
This file was deleted.
