BwUniCluster2.0/Slurm: Difference between revisions

From bwHPC Wiki
 
(64 intermediate revisions by 12 users not shown)
Line 26: Line 26:
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job (obsoleted!) [[https://slurm.schedmd.com/scancel.html scancel]]
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job (obsoleted!) [[https://slurm.schedmd.com/scancel.html scancel]]
|}
|}
'''IMPORTANT HINT: As soon as Slurm has allocated nodes to your batch job, you are allowed to log in via ssh to the allocated nodes.'''


<br>
<br>
Line 112: Line 111:
| --export=[ALL,] ''env-variables''
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission <br> environment are propagated to the launched application. Default <br> is ALL. If adding to the submission environment instead of <br> replacing it is intended, the argument ALL must be added.
| Identifies which environment variables from the submission <br> environment are propagated to the launched application. Default <br> is ALL. If adding an environment variable to the submission<br> environment is intended, the argument ALL must be added.
|-
|-
|- style="vertical-align:top;"
|- style="vertical-align:top;"
Line 123: Line 122:
| #SBATCH --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|-
|- style="vertical-align:top;"
|- style="vertical-align:top;"
Line 130: Line 134:
|-
|-
|- style="vertical-align:top;"
|- style="vertical-align:top;"
| -C ''BEEOND'' or --constraint=''BEEOND''
| -C ''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)'' or --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND file system.
| Job constraint BeeOND file system.
|-
|-
Line 155: Line 159:
#SBATCH --ntasks=1
#SBATCH --ntasks=1
#SBATCH --time=10
#SBATCH --time=10
#SBATCH --mem=200gb
#SBATCH --mem=180gb
#SBATCH --job-name=simple
#SBATCH --job-name=simple
</source>
</source>
and execute the modified script with the command line option ''--partition ???|???'' (with ''--partition ???'' maximum ''--mem=96gb'' is possible):
and execute the modified script with the command line option ''--partition=fat'' (with ''--partition=(dev_)single'' maximum ''--mem=96gb'' is possible):
<pre>
<pre>
$ sbatch --partition=??? job.sh # on ForHLR I
$ sbatch --partition=fat job.sh
</pre>
</pre>
Note that sbatch command line options override script options.
Note that sbatch command line options override script options.
Line 172: Line 176:
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) a number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) a number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
<br>
<br>
'''Because hyperthreading is enabled on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n if you want to use n threads.'''
To submit a batch job called ''OpenMP_Test'' that runs a fourfold threaded program ''omp_exe'' which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:
<br>
To submit a batch job called ''OpenMP_Test'' that runs a 40-fold threaded program ''omp_exe'' which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:
<br>
<br>
a) execute:
a) execute:
<pre>
<pre>
$ sbatch -p ??? --export=ALL,OMP_NUM_THREADS=28 -J OpenMP_Test -N 1 -c 28? -t 40 --mem=6000 omp_exe
$ sbatch -p single --export=ALL,OMP_NUM_THREADS=40 -J OpenMP_Test -N 1 -c 80 -t 40 --mem=6000 ./omp_exe
</pre>
</pre>
or
or
Line 184: Line 190:
#!/bin/bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH --cpus-per-task=80
#SBATCH --time=40:00
#SBATCH --time=40:00
#SBATCH --mem=6gb
#SBATCH --mem=6000mb
#SBATCH --export=ALL,EXECUTABLE=./omp_exe
#SBATCH --export=ALL,EXECUTABLE=./omp_exe
#SBATCH -J OpenMP_Test
#SBATCH -J OpenMP_Test
Line 195: Line 201:
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE


export OMP_NUM_THREADS=${SLURM_JOB_CPUS_PER_NODE}
export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))
echo "Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"
echo "Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
startexe=${EXECUTABLE}
Line 201: Line 207:
exec $startexe
exec $startexe
</source>
</source>
Using Intel compiler the environment variable KMP_AFFINITY switches on binding of threads to specific cores and, if necessary, replace <placeholder> with the required modulefile to enable the OpenMP environment and execute the script '''job_omp.sh''' adding the queue class ''???'' as sbatch option:
When using the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If necessary, replace <placeholder> with the required modulefile to enable the OpenMP environment, then execute the script '''job_omp.sh''', adding the queue class ''single'' as sbatch option:
<pre>
<pre>
$ sbatch -p ??? job_omp.sh
$ sbatch -p single job_omp.sh
</pre>
</pre>
Note that sbatch command line options override script options, e.g.,
Note that sbatch command line options override script options, e.g.,
<pre>
<pre>
$ sbatch --partition=??? --mem=200 job_omp.sh
$ sbatch --partition=single --mem=200 job_omp.sh
</pre>
</pre>
overwrites the script setting of 6000 MByte with 200 MByte.
overwrites the script setting of 6000 MByte with 200 MByte.
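
The hyperthreading rule above (request 2*n CPUs for n threads) can be sketched as a small helper that derives the OpenMP thread count from the allocation. This is a minimal sketch, not part of the original job script: the function name <code>threads_from_cpus</code> is hypothetical, and it assumes exactly two hardware threads per physical core and that <code>SLURM_CPUS_PER_TASK</code> is set because <code>--cpus-per-task</code> was given.

```shell
#!/bin/bash
# Sketch: derive the OpenMP thread count from the Slurm allocation.
# Assumption: two hardware threads per physical core (hyperthreading on).

# threads_from_cpus is a hypothetical helper, not a Slurm command.
threads_from_cpus() {
    local cpus="${1:-2}"        # logical CPUs granted per task
    local n=$(( cpus / 2 ))     # one OpenMP thread per physical core
    (( n < 1 )) && n=1          # never go below one thread
    echo "${n}"
}

# Outside of a Slurm job the variable is unset, so fall back to 2 CPUs.
export OMP_NUM_THREADS="$(threads_from_cpus "${SLURM_CPUS_PER_TASK:-2}")"
echo "Using ${OMP_NUM_THREADS} OpenMP threads"
```

With <code>--cpus-per-task=80</code> the helper yields 40 threads, which matches the 40-fold threaded example above.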
Line 231: Line 237:
<source lang="bash">
<source lang="bash">
#!/bin/bash
#!/bin/bash
# Use when a defined module environment related to OpenMPI is wished
# Use when using the module environment for OpenMPI
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/openmpi/<placeholder_for_version>
module load mpi/openmpi/<placeholder_for_mpi_version>
mpirun --bind-to core --map-by core -report-bindings my_par_program
mpirun --bind-to core --map-by core -report-bindings my_par_program
</source>
</source>
Line 239: Line 246:
Considering 4 OpenMPI tasks on a single node, each requiring 2000 MByte, and running for 1 hour, execute:
Considering 4 OpenMPI tasks on a single node, each requiring 2000 MByte, and running for 1 hour, execute:
<pre>
<pre>
$ sbatch -p ??? -N 4 -n 160 --mem=2000 --time=01:00:00 job_ompi.sh
$ sbatch -p single -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh
</pre>
</pre>
<br>
<br>
Line 249: Line 256:
#!/bin/bash
#!/bin/bash
# Use when a defined module environment related to Intel MPI is wished
# Use when a defined module environment related to Intel MPI is wished
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/impi/<placeholder_for_version>
module load mpi/impi/<placeholder_for_version>
mpiexec.hydra -bootstrap slurm my_par_program
mpiexec.hydra -bootstrap slurm my_par_program
Line 257: Line 265:
Launching and running 200 Intel MPI tasks on 5 nodes, each requiring 80 GByte, and running for 5 hours, execute:
Launching and running 200 Intel MPI tasks on 5 nodes, each requiring 80 GByte, and running for 5 hours, execute:
<pre>
<pre>
$ sbatch --partition ??? -N 5 --ntasks-per-node=40 --mem=80gb -t 300 job_impi.sh
$ sbatch --partition=multiple -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh
</pre>
</pre>
<br>
<br>
Line 268: Line 276:


==== Multithreaded + MPI parallel Programs ====
==== Multithreaded + MPI parallel Programs ====
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes.
Multithreaded + MPI parallel programs run faster than serial programs on systems with multiple multi-core CPUs. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned across different nodes. '''Because hyperthreading is enabled on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n if you want to use n threads.'''
<br>
<br>
<br>
<br>
Line 279: Line 287:
#!/bin/bash
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=28
#SBATCH --cpus-per-task=56
#SBATCH --time=03:00:00
#SBATCH --time=03:00:00
#SBATCH --mem=84gb
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.03 GB
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"
#SBATCH --output="parprog_hybrid_%j.out"
Line 287: Line 295:
# Use when a defined module environment related to OpenMPI is wished
# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
module load ${MPI_MODULE}
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${SLURM_CPUS_PER_TASK} -report-bindings"
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export NUM_CORES=$((${SLURM_NTASKS}*${SLURM_CPUS_PER_TASK}))
export NUM_CORES=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
Line 297: Line 305:
Execute the script '''job_ompi_omp.sh''' by command sbatch:
Execute the script '''job_ompi_omp.sh''' by command sbatch:
<pre>
<pre>
$ sbatch -p ??? job_ompi_omp.sh
$ sbatch -p multiple ./job_ompi_omp.sh
</pre>
</pre>
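
The memory comment in the hybrid job script converts a requirement given in MByte into a GByte value for <code>--mem</code>. A minimal sketch of that arithmetic follows, assuming 1 GB = 1024 MB as in the script comment and rounding up so the request never falls below the requirement. The function name <code>mb_to_gb_ceil</code> is hypothetical.

```shell
#!/bin/bash
# Sketch: convert a memory requirement in MB to whole GB, rounding up.
# Assumption: 1 GB = 1024 MB, as in the job script comment.

# mb_to_gb_ceil is a hypothetical helper, not a Slurm command.
mb_to_gb_ceil() {
    local mb="$1"
    echo $(( (mb + 1023) / 1024 ))   # integer ceiling division
}

echo "--mem=$(mb_to_gb_ceil 84000)gb"
```

For 84000 MB this gives 83, the value used in the script.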
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
Line 310: Line 318:
Multiple Intel MPI tasks must be launched by the MPI parallel program '''mpiexec.hydra'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
Multiple Intel MPI tasks must be launched by the MPI parallel program '''mpiexec.hydra'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).


'''For Intel MPI''' a job-script to submit a batch job called ''job_impi_omp.sh'' that runs an Intel MPI program with 4 tasks and a 40-fold threaded program ''impi_omp_program'' requiring 96000 MByte of total physical memory per task and total wall clock time of 1 hour looks like:
'''For Intel MPI''' a job-script to submit a batch job called ''job_impi_omp.sh'' that runs an Intel MPI program with 10 tasks and a 40-fold threaded program ''impi_omp_program'' requiring 96000 MByte of total physical memory per task and total wall clock time of 1 hour looks like:


<!--b)-->
<!--b)-->
<source lang="bash">
<source lang="bash">
#!/bin/bash
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=40
#SBATCH --cpus-per-task=80
#SBATCH --time=60
#SBATCH --time=60
#SBATCH --mem=96000
#SBATCH --mem=96000
Line 329: Line 337:
# Use when a defined module environment related to Intel MPI is wished
# Use when a defined module environment related to Intel MPI is wished
module load ${MPI_MODULE}
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="-binding "domain=omp:compact" -print-rank-map -envall"
export MPIRUN_OPTIONS="-binding "domain=omp:compact" -print-rank-map -envall"
export NUM_PROCS=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))
export NUM_PROCS=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))
Line 347: Line 355:
Execute the script '''job_impi_omp.sh''' by command sbatch:
Execute the script '''job_impi_omp.sh''' by command sbatch:
<pre>
<pre>
$ sbatch -p ??? job_impi_omp.sh
$ sbatch -p multiple ./job_impi_omp.sh
</pre>
</pre>
<br>
<br>
Line 379: Line 387:
while [ ${myloop_counter} -le ${max_nojob} ] ; do
while [ ${myloop_counter} -le ${max_nojob} ] ; do
##
##
## Differ msub_opt depending on chain link number
## Differ slurm_opt depending on chain link number
if [ ${myloop_counter} -eq 1 ] ; then
if [ ${myloop_counter} -eq 1 ] ; then
slurm_opt=""
slurm_opt=""
Line 403: Line 411:
done
done
</source>
</source>
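
The chain job loop above varies <code>slurm_opt</code> by chain link number: the first link has no dependency, later links depend on the previous job. A minimal sketch of that decision follows. The helper name <code>build_dep_opt</code> and the default dependency type <code>afterok</code> are assumptions for illustration; the dependency type used in the original script is not shown here.

```shell
#!/bin/bash
# Sketch: build the sbatch dependency option for one link of a job chain.
# Assumptions: build_dep_opt is a hypothetical helper; "afterok" is chosen
# as an illustrative default dependency type.

build_dep_opt() {
    local link="$1" prev_jobid="$2" dep_type="${3:-afterok}"
    if [ "${link}" -eq 1 ] || [ -z "${prev_jobid}" ]; then
        echo ""                                   # first link: no dependency
    else
        echo "--dependency=${dep_type}:${prev_jobid}"
    fi
}

# Possible use inside the loop (chain_link_job.sh is a placeholder name):
#   slurm_opt="$(build_dep_opt "${myloop_counter}" "${jobID}")"
#   jobID=$(sbatch --parsable ${slurm_opt} chain_link_job.sh)
echo "link 2 option: $(build_dep_opt 2 18089884)"
```
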
<br>

==== GPU jobs ====

The nodes in the gpu_4 and gpu_8 queues have 4 or 8 NVIDIA Tesla V100 GPUs. Just submitting a job to these queues is not enough to allocate one or more GPUs; you have to do so using the "--gres=gpu" parameter. You have to specify how many GPUs your job needs, e.g. "--gres=gpu:2" will request two GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) add after the initial line of your script job.sh the line including the
information about the GPU usage:<br> #SBATCH --gres=gpu:2
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh
</pre>
<br/>
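
Inside a job script it can be useful to check that the expected number of GPUs was actually granted. The following is a minimal sketch under the assumption that the batch system exports <code>CUDA_VISIBLE_DEVICES</code> as a comma-separated list of the GPUs bound to the job, which is common for "--gres=gpu" allocations but is not stated on this page. The helper name <code>count_gpus</code> is hypothetical.

```shell
#!/bin/bash
# Sketch: count the GPUs visible to the job.
# Assumption: CUDA_VISIBLE_DEVICES holds a comma-separated device list.

# count_gpus is a hypothetical helper.
count_gpus() {
    local devices="${1:-}"
    if [ -z "${devices}" ]; then
        echo 0
        return
    fi
    local IFS=','
    set -- ${devices}      # split the list on commas
    echo "$#"
}

echo "GPUs visible to this job: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```
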
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Sun Mar 29 15:20:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:3A:00.0 Off | 0 |
| N/A 29C P0 39W / 300W | 9MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 8MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 14228 G /usr/bin/X 8MiB |
| 1 14228 G /usr/bin/X 8MiB |
+-----------------------------------------------------------------------------+
</pre>

<br/>
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
However, there may be warnings, e.g. when running
<pre>
$ module load compiler/gnu/10.3 mpi/openmpi devel/cuda
$ mpirun -np 2 ./mpi_cuda_app
--------------------------------------
WARNING: There are more than one active ports on host 'uc2n520', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
</pre>

Please run Open MPI's mpirun using the following command:
<pre>
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
</pre>
or disable the (older) communication layer Byte Transfer Layer (BTL for short) altogether:
<pre>
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
</pre>

(Please note that CUDA, as of v11.4, is only available with GCC versions up to 10.)
<br>
<br>
<br>


==== LSDF Online Storage ====
==== LSDF Online Storage ====
On ForHLR you can use for special cases the LSDF Online Storage on the HPC cluster nodes. Please request for this service seperately ([https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request]).
On bwUniCluster 2.0 you can, in special cases, use the LSDF Online Storage on the HPC cluster nodes. Please request this service separately ([https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request]).
To mount the LSDF Online Storage on the compute nodes during the job runtime,
To mount the LSDF Online Storage on the compute nodes during the job runtime,
the constraint flag "LSDF" has to be set.
the constraint flag "LSDF" has to be set.
Line 422: Line 514:
or b) execute:
or b) execute:
<pre>
<pre>
$ sbatch -p queue -n1 -t 2:00 --mem 200 job.sh -C LSDF
$ sbatch -p <queue> -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh
</pre>
</pre>
<br>
<br>
Line 432: Line 524:
====BeeOND (BeeGFS On-Demand)====
====BeeOND (BeeGFS On-Demand)====


BeeOND instances are integrated into the prolog and epilog script of the cluster batch system, Slurm. It can be used on the compute nodes during the job runtime with the constraint flag "BEEOND" ([[ForHLR_-_SLURM_Batch_Jobs#sbatch_Command_Parameters | Slurm Command Parameters]])
BeeOND instances are integrated into the prolog and epilog script of the cluster batch system Slurm. It can be used on the exclusive compute nodes during the job runtime with the constraint flag "BEEOND", "BEEOND_4MDS" or "BEEOND_MAXMDS" ([[BwUniCluster_2.0_Slurm_common_Features#sbatch_Command_Parameters|Slurm Command Parameters]])
* BEEOND: one metadata server is started on the first node
<pre>
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.
* BEEOND_MAXMDS: on every node of your job a metadata server for the on_demand file system is started

As a starting point we recommend using the "BEEOND" option. If you are unsure whether this is sufficient for you, feel free to contact the support team.
<source lang="bash">
#!/bin/bash
#!/bin/bash
#SBATCH ...
#SBATCH ...
#SBATCH --constraint=BEEOND
#SBATCH --constraint=BEEOND # or BEEOND_4MDS or BEEOND_MAXMDS
</pre>
</source>


After your job has started you can find the private on-demand file system in '''/mnt/odfs/$SLURM_JOB_ID''' directory. The mountpoint comes with three pre-configured directories:
After your job has started you can find the private on-demand file system in '''/mnt/odfs/${SLURM_JOB_ID}''' directory. The mountpoint comes with five pre-configured directories:
<source lang="bash">
<pre>
#for small files (stripe count = 1)
# For small files (stripe count = 1)
/mnt/odfs/$SLURM_JOB_ID/stripe_1
/mnt/odfs/${SLURM_JOB_ID}/stripe_1
#stripe count = 4
# Stripe count = 4
/mnt/odfs/$SLURM_JOB_ID/stripe_default
/mnt/odfs/${SLURM_JOB_ID}/stripe_default
# or
#stripe count = 8, 16 or 32, use this directories for medium sized and large files or when using MPI-IO
/mnt/odfs/$SLURM_JOB_ID/stripe_8, /mnt/odfs/$SLURM_JOB_ID/stripe_16 or /mnt/odfs/$SLURM_JOB_ID/stripe_32
/mnt/odfs/${SLURM_JOB_ID}/stripe_4
# Stripe count = 8, 16 or 32, use these directories for medium sized and large files or when using MPI-IO
</pre>
/mnt/odfs/${SLURM_JOB_ID}/stripe_8
/mnt/odfs/${SLURM_JOB_ID}/stripe_16
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_32
</source>


If you request less nodes than stripe count, the stripe count will be max number of nodes,
If you request fewer nodes than the stripe count, the stripe count will be the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 has only a stripe count of 8.
e.g., You only request 8 nodes , so the directory with stripe count 16 is basically only with a stripe count 8.


; <font color=red>'''Attention:'''</font><br>
The capacity of the private file system depends on the number of nodes. For each node you get 250Gbyte.
:Be careful when creating large files: always use the directory with the max stripe count for large files.
:If you create large files use a higher stripe count. For example, if your largest file is 1.1 Tb, then you have to use a stripe count larger than 2,
:otherwise the used disk space is exceeded.


The capacity of the private file system depends on the number of nodes. For each node you get 750 Gbyte.
!!! Be careful when creating large files, use always the directory with the max stripe count for large files.
If you request 100 nodes for your job, the private file system is 100 * 750 Gbyte ~ 75 Tbyte (approx) capacity.
If you create large files use a higher stripe count. For example, if your largest file is 1.1 Tb, then you have to use a stripe count larger>4 (4 x 250GB).

If you request 100 nodes for your job, the private file system is 100 * 250 Gbyte ~ 25 Tbyte (approx) capacity.
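
The capacity and stripe count rules above can be sketched as simple arithmetic. This is a rough estimate only, assuming the 750 GByte per node figure; the helper names are hypothetical, and the minimum stripe count computed here is a lower bound, so choosing a higher stripe count as the text recommends leaves headroom.

```shell
#!/bin/bash
# Sketch: BeeOND capacity and stripe count estimates.
# Assumption: 750 GByte of on-demand file system capacity per node.
PER_NODE_GB=750

# Total capacity of the private file system for a given node count.
odfs_capacity_gb() {
    echo $(( $1 * PER_NODE_GB ))
}

# Smallest stripe count so that each stripe of a file fits on one node.
min_stripe_count() {
    local size_gb="$1"
    echo $(( (size_gb + PER_NODE_GB - 1) / PER_NODE_GB ))
}

# Stripe count actually applied: never more than the number of nodes.
effective_stripe() {
    local want="$1" nodes="$2"
    if [ "${want}" -gt "${nodes}" ]; then echo "${nodes}"; else echo "${want}"; fi
}

echo "100 nodes: $(odfs_capacity_gb 100) GByte"
echo "1100 GByte file: stripe count of at least $(min_stripe_count 1100)"
echo "stripe_16 on 8 nodes: effective stripe count $(effective_stripe 16 8)"
```
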

'''Recommendation:'''

The private file system uses its own metadata server, which is started on the first nodes. Depending on your application, the metadata server can consume a considerable amount of CPU power. Adding an extra node to your job may improve the usability of the on-demand file system. Start your application with the MPI option:
<pre>
mpirun -nolocal myapplication
</pre>
With the -nolocal option the node where mpirun is initiated is not used for your application. This node is fully available for the meta data server of your requested on-demand file system.


Example job script:
<pre>
#!/bin/bash
#very simple example on how to use a private on-demand file system
#SBATCH -N 10
#SBATCH --constraint=BEEOND

#create a workspace
ws_allocate myresults-$SLURM_JOB_ID 90
RESULTDIR=`ws_find myresults-$SLURM_JOB_ID`

#Set ENV variable to on-demand file system
ODFSDIR=/mnt/odfs/$SLURM_JOB_ID/stripe_16/

#start application and write results to on-demand file system
mpirun -nolocal myapplication -o $ODFSDIR/results

#Copy back data after your job application end
rsync -av $ODFSDIR/results $RESULTDIR
</pre>
<br>
<br>


== Start time of job or resources : squeue --start ==
== Start time of job or resources : squeue --start ==
Line 520: Line 590:


=== Examples ===
=== Examples ===
''squeue'' example on ForHLR I <small>(Only your own jobs are displayed!)</small>.
''squeue'' example on bwUniCluster 2.0 <small>(Only your own jobs are displayed!)</small>.
<pre>
<pre>
$ squeue
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
382 multinode job_ompi ku8089 PD 0:00 4 (AssocGrpJobsLimit)
18088744 single CPV.sbat ab1234 PD 0:00 1 (Priority)
381 multinode job_ompi ku8089 R 0:19 4 fhbn[005-008]
18098414 multiple CPV.sbat ab1234 PD 0:00 2 (Priority)
380 multinode job_ompi ku8089 R 0:23 4 fhbn[001-004]
18090089 multiple CPV.sbat ab1234 R 2:27 2 uc2n[127-128]

$ squeue -l
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
382 multinode job_ompi ku8089 PENDING 0:00 1:00:00 4 (AssocGrpJobsLimit)
18088654 single CPV.sbat ab1234 COMPLETI 4:29 2:00:00 1 uc2n374
381 multinode job_ompi ku8089 RUNNING 0:42 1:00:00 4 fhbn[005-008]
18088785 single CPV.sbat ab1234 PENDING 0:00 2:00:00 1 (Priority)
380 multinode job_ompi ku8089 RUNNING 0:46 1:00:00 4 fhbn[001-004]
18098414 multiple CPV.sbat ab1234 PENDING 0:00 2:00:00 2 (Priority)
18088683 single CPV.sbat ab1234 RUNNING 0:14 2:00:00 1 uc2n413
</pre>
</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
Line 550: Line 620:
* The following command displays what resources are available for immediate use for the whole partition.
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
<pre>$ sinfo_t_idle
Partition dev_multiple : 8 nodes idle
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
develop up 30:00 0 n/a
Partition multiple : 332 nodes idle
singlenode up 3-00:00:00 0 n/a
Partition dev_single : 4 nodes idle
multinode up 3-00:00:00 0 n/a
Partition single : 76 nodes idle
fat up 4-00:00:00 7 idle fh1n[802-803,805,808-810,813]
Partition long : 80 nodes idle
login up infinite 0 n/a
Partition fat : 5 nodes idle
Partition dev_special : 342 nodes idle
service up infinite 0 n/a
slurm up infinite 0 n/a
Partition special : 342 nodes idle
Partition dev_multiple_e: 7 nodes idle
transfer up infinite 0 n/a
Partition multiple_e : 335 nodes idle
headnode up infinite 0 n/a
Partition gpu_4 : 12 nodes idle
Partition gpu_8 : 6 nodes idle
</pre>
</pre>
* For the above example the request for 1 node in the partition fat can be run immediately.
* For the above example jobs in all partitions can be run immediately.
<br>
<br>
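
To decide where a job could start immediately, the ''sinfo_t_idle'' output can be filtered for partitions with enough idle nodes. The following is a minimal sketch that assumes the output format shown above ("Partition <name> : <count> nodes idle"); the helper name <code>partitions_with_idle</code> is hypothetical, and the demo feeds it sample lines instead of calling the real command.

```shell
#!/bin/bash
# Sketch: list partitions that have at least a given number of idle nodes.
# Assumption: input lines look like "Partition <name> : <count> nodes idle".

# partitions_with_idle is a hypothetical helper reading from stdin.
partitions_with_idle() {
    local need="$1"
    awk -v need="${need}" '
        $1 == "Partition" {
            name = $2
            sub(/:$/, "", name)        # the colon may be glued to the name
            idle = $(NF - 2)           # the count precedes "nodes idle"
            if (idle + 0 >= need + 0) print name
        }'
}

# Demo with sample lines; on the cluster one would pipe the real command:
#   sinfo_t_idle | partitions_with_idle 8
partitions_with_idle 8 <<'EOF'
Partition dev_multiple  :      8 nodes idle
Partition fat           :      5 nodes idle
Partition dev_multiple_e:      7 nodes idle
Partition gpu_4         :     12 nodes idle
EOF
```
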


Line 585: Line 657:
| (n/a)
| (n/a)
| Detailed mode
| Detailed mode
| Example: Display the state with jobid 8370992 in detailed mode. <br> <pre>scontrol -d show job 8370992</pre>
| Example: Display the state with jobid 18089884 in detailed mode. <br> <pre>scontrol -d show job 18089884</pre>
|}
|}
<br>
<br>
Line 591: Line 663:


=== Scontrol show job Example ===
=== Scontrol show job Example ===
Here is an example from ForHLR I.
Here is an example from bwUniCluster 2.0.
<pre>
<pre>
squeue # show my own jobs (here the userid is replaced!)
squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
451750 multinode job_ompi ab1234 PD 0:00 4 (JobHeldAdmin)
18089884 multiple CPV.sbat bq0742 R 33:44 2 uc2n[165-166]


$
$
$ # now, see what's up with my pending job with jobid 451750
$ # now, see what's up with my job with jobid 18089884
$
$
$ scontrol show job 451750
$ scontrol show job 18089884

JobId=451750 JobName=job_ompi.sh
JobId=18089884 JobName=CPV.sbatch
UserId=ab1234(8975) GroupId=fh1-project-devel(500376) MCS_label=N/A
UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A
Priority=0 Nice=0 Account=fh1-scs QOS=(null)
Priority=3 Nice=0 Account=kit QOS=normal
JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2018-11-30T14:40:22 EligibleTime=2018-11-30T14:40:22
SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54
AccrueTime=2020-03-16T14:14:54
StartTime=Unknown EndTime=Unknown Deadline=N/A
StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51
Partition=multinode AllocNode:Sid=fh1n988:19636
Partition=multiple AllocNode:Sid=uc2n995:5064
ReqNodeList=(null) ExcNodeList=(null)
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NodeList=uc2n[165-166]
BatchHost=uc2n165
NumNodes=4-4 NumCPUs=80 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=80,mem=4000,node=4
TRES=cpu=160,mem=96320M,node=2,billing=160
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryNode=1000M MinTmpDiskNode=0
MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/data3/project/fh1-project-devel/ab1234/Slurm/job_ompi.sh
Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch
WorkDir=/pfs/data3/project/fh1-project-devel/ab1234/Slurm
WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin
StdErr=/pfs/data3/project/fh1-project-devel/ab1234/Slurm/slurm-451750.out
StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
StdIn=/dev/null
StdIn=/dev/null
StdOut=/pfs/data3/project/fh1-project-devel/ab1234/Slurm/slurm-451750.out
StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
Power=
Power=
MailUser=(null) MailType=NONE
</pre>
</pre>
<br>
<br>
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
* Is the job still pending?
* In which state is the job?
<pre>$ scontrol show job 451750 | grep -i pending
<pre>$ scontrol show job 18089884 | grep -i State
JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
JobState=COMPLETED Reason=None Dependency=(null)
</pre>
</pre>
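
For use in scripts, the state can be extracted as a single word instead of a whole line. This is a minimal sketch assuming the <code>JobState=...</code> field appears as in the example output above; the helper name <code>job_state</code> is hypothetical, and the demo uses sample text rather than a live <code>scontrol</code> call.

```shell
#!/bin/bash
# Sketch: extract the job state from "scontrol show job" output.
# Assumption: the output contains a field of the form JobState=<STATE>.

# job_state is a hypothetical helper reading from stdin.
job_state() {
    sed -n 's/.*JobState=\([A-Z_]*\).*/\1/p' | head -n 1
}

# Demo with sample text; on the cluster one would run:
#   scontrol show job 18089884 | job_state
printf '%s\n' \
    'JobId=18089884 JobName=CPV.sbatch' \
    '   JobState=RUNNING Reason=None Dependency=(null)' | job_state
```
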
<br>
<br>
Line 706: Line 781:
|-
|-
| SLURM_SUBMIT_DIR
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which msub was invoked.
| Job submit folder. The directory from which sbatch was invoked.
|-
|-
| SLURM_JOB_USER
| SLURM_JOB_USER
Line 724: Line 799:
|-
|-
| SLURM_STEP_NUM_TASKS
| SLURM_STEP_NUM_TASKS
| Task count (number of PI ranks)
| Task count (number of MPI ranks)
|-
|-
| SLURM_JOB_CONSTRAINT
| SLURM_JOB_CONSTRAINT
Line 752: Line 827:
[...]
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
exit_code=$?
exit_code=$?
?? [ "$exit_code" -eq 0 ] && echo "all clean..." || \
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
[...]
Line 762: Line 837:
<br>
<br>
----
----
[[Category:bwUniCluster 2.0|bwUniCluster 2.0]]
[[#top|Back to top]]
[[#top|Back to top]]

Latest revision as of 10:22, 6 June 2024

Slurm HPC Workload Manager

Specification

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 2.0 requires the user to define the calculation as a single command or a sequence of commands, together with the required run time, number of CPU cores, and main memory, and to submit all of this, i.e., the batch job, to a resource and workload managing software. bwUniCluster 2.0 uses the workload managing software Slurm. Therefore any job submission by the user is to be executed by commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.

Slurm Commands (excerpt)

Some of the most used Slurm commands for non-administrators working on bwUniCluster 2.0.

Slurm commands Brief explanation
sbatch Submits a job and queues it in an input queue [sbatch]
scontrol show job Displays detailed job state information [scontrol]
squeue Displays information about active, eligible, blocked, and/or recently completed jobs [squeue]
squeue --start Returns start time of submitted job or requested resources [squeue]
sinfo_t_idle Shows what resources are available for immediate use [sinfo]
scancel Cancels a job (obsoleted!) [scancel]



Job Submission : sbatch

Batch jobs are submitted with the command sbatch. The main purpose of the sbatch command is to specify the resources that are needed to run the job. sbatch then queues the batch job. When the batch job actually starts depends on the availability of the requested resources and on the fair sharing value.

sbatch Command Parameters

The syntax and use of sbatch can be displayed via:

$ man sbatch

sbatch options can be used from the command line or in your job script.

sbatch Options
Command line Script Purpose
-t time or --time=time #SBATCH --time=time Wall clock time limit.
-N count or --nodes=count #SBATCH --nodes=count Number of nodes to be used.
-n count or --ntasks=count #SBATCH --ntasks=count Number of tasks to be launched.
--ntasks-per-node=count #SBATCH --ntasks-per-node=count Maximum count (<= 28 and <= 40 resp.) of tasks per node.
(Replaces the option ppn of MOAB.)
-c count or --cpus-per-task=count #SBATCH --cpus-per-task=count Number of CPUs required per (MPI-)task.
--mem=value_in_MB #SBATCH --mem=value_in_MB Memory in MegaByte per node.
(Default value is 128000 and 96000 MB resp., i.e. you should omit
the setting of this option.)
--mem-per-cpu=value_in_MB #SBATCH --mem-per-cpu=value_in_MB Minimum Memory required per allocated CPU.
(Replaces the option pmem of MOAB. You should omit
the setting of this option.)
--mail-type=type #SBATCH --mail-type=type Notify user by email when certain event types occur.
Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
--mail-user=mail-address #SBATCH --mail-user=mail-address The specified mail-address receives email notification of state
changes as defined by --mail-type.
--output=name #SBATCH --output=name File in which job output is stored.
--error=name #SBATCH --error=name File in which job error messages are stored.
-J name or --job-name=name #SBATCH --job-name=name Job name.
--export=[ALL,] env-variables #SBATCH --export=[ALL,] env-variables Identifies which environment variables from the submission
environment are propagated to the launched application. Default
is ALL. If adding an environment variable to the submission
environment is intended, the argument ALL must be added.
-A group-name or --account=group-name #SBATCH --account=group-name Charge resources used by this job to the specified group. You may
need this option if your account is assigned to more
than one group. With the command "scontrol show job" the project
group the job is accounted on can be seen behind "Account=".
-p queue-name or --partition=queue-name #SBATCH --partition=queue-name Request a specific queue for the resource allocation.
--reservation=reservation-name #SBATCH --reservation=reservation-name Use a specific reservation for the resource allocation.
-C LSDF or --constraint=LSDF #SBATCH --constraint=LSDF Job constraint LSDF Filesystems.
-C BEEOND (BEEOND_4MDS, BEEOND_MAXMDS) or --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS) #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS) Job constraint BeeOND file system.


sbatch --partition queues

Queue classes define maximum resources such as walltime, nodes and processes per node and queue of the compute system. Details can be found here:


sbatch Examples

Serial Programs

To submit a serial job that runs the script job.sh and that requires 5000 MB of main memory and 10 minutes of wall clock time

a) execute:

$ sbatch -p dev_single -n 1 -t 10:00 --mem=5000  job.sh

or b) add after the initial line of your script job.sh the lines (here with a high memory request):

#SBATCH --ntasks=1
#SBATCH --time=10
#SBATCH --mem=180gb
#SBATCH --job-name=simple

and execute the modified script with the command line option --partition=fat (with --partition=single or --partition=dev_single at most --mem=96gb is possible):

$ sbatch --partition=fat job.sh

Note that sbatch command line options override script options.
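
The examples on this page write the time limit in several forms (-t 10:00, --time=10, --time=03:00:00). According to the sbatch man page, the accepted formats are "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". The following is a minimal sketch that converts such a value to seconds so the forms can be compared; the helper name slurm_time_to_seconds is hypothetical and not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): convert a Slurm time limit to seconds.
slurm_time_to_seconds() {
    local spec="$1" days=0 h=0 m=0 s=0
    local -a f
    if [[ "$spec" == *-* ]]; then
        # Form "days-hours[:minutes[:seconds]]"
        days="${spec%%-*}"
        spec="${spec#*-}"
        IFS=: read -r -a f <<< "$spec"
        h="${f[0]:-0}"; m="${f[1]:-0}"; s="${f[2]:-0}"
    else
        IFS=: read -r -a f <<< "$spec"
        case "${#f[@]}" in
            1) m="${f[0]}" ;;                             # minutes
            2) m="${f[0]}"; s="${f[1]}" ;;                # minutes:seconds
            3) h="${f[0]}"; m="${f[1]}"; s="${f[2]}" ;;   # hours:minutes:seconds
            *) echo "unsupported time format: $1" >&2; return 1 ;;
        esac
    fi
    # The 10# prefix forces base 10 so that values like "08" are not read as octal.
    echo $(( ((10#$days * 24 + 10#$h) * 60 + 10#$m) * 60 + 10#$s ))
}

slurm_time_to_seconds 10:00      # prints 600
slurm_time_to_seconds 10         # prints 600
slurm_time_to_seconds 03:00:00   # prints 10800
```

Both -t 10:00 and --time=10 from the serial example therefore request the same 10 minutes.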

Multithreaded Programs

Multithreaded programs can run faster than serial programs on CPUs with multiple cores.
Moreover, multiple threads of one process share resources such as memory.
For multithreaded programs based on Open Multi-Processing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
Because hyperthreading is enabled on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n if you want to use n threads.
To submit a batch job called OpenMP_Test that runs a 40-fold threaded program omp_exe which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:
a) execute:

$ sbatch -p single --export=ALL,OMP_NUM_THREADS=40 -J OpenMP_Test -N 1 -c 80 -t 40 --mem=6000 ./omp_exe

or b) generate the script job_omp.sh containing the following lines:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=80
#SBATCH --time=40:00
#SBATCH --mem=6000mb   
#SBATCH --export=ALL,EXECUTABLE=./omp_exe
#SBATCH -J OpenMP_Test

#Usually you should set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))
echo "Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe

When using the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If necessary, replace <placeholder> with the required modulefile to enable the OpenMP environment. Then execute the script job_omp.sh, adding the queue class single as sbatch option:

$ sbatch -p single job_omp.sh

Note that sbatch command line options override script options, e.g.,

$ sbatch --partition=single --mem=200 job_omp.sh

overrides the script setting of 6000 MByte with 200 MByte.

MPI Parallel Programs

MPI parallel programs can run faster than serial programs on multi-CPU and multi-core systems. The N processes spawned by the MPI program, i.e., the MPI tasks, run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spread over different nodes.
Multiple MPI tasks must be launched via mpirun, e.g. 4 MPI tasks of my_par_program:

$ mpirun -n 4 my_par_program

This command runs 4 MPI tasks of my_par_program on the node you are logged in to. To run this command with a loaded Intel MPI, the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset ($ unset I_MPI_HYDRA_BOOTSTRAP).

When MPI parallel programs run in a batch job, the interactive environment, particularly the loaded modules, is also set in the batch job. If you want a defined module environment in your batch job, you have to purge all modules before loading the desired modules.

OpenMPI

If you want to run jobs on batch nodes, generate a wrapper script job_ompi.sh for OpenMPI containing the following lines:

#!/bin/bash
# Use when using the module environment for OpenMPI
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/openmpi/<placeholder_for_mpi_version>
mpirun --bind-to core --map-by core -report-bindings my_par_program

Attention: Do NOT add the mpirun option -n <number_of_processes> or any other option defining processes or nodes, since Slurm instructs mpirun about the number of processes and the node hostnames. ALWAYS use the MPI options --bind-to core and --map-by core|socket|node. Type mpirun --help for an explanation of the different values of the mpirun option --map-by.
To run 4 OpenMPI tasks on a single node, each requiring 2000 MByte, for 1 hour, execute:

$ sbatch -p single -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh


Intel MPI

Generate a wrapper script for Intel MPI, job_impi.sh containing the following lines:

#!/bin/bash
# Use when a defined module environment related to Intel MPI is wished
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/impi/<placeholder_for_version>   
mpiexec.hydra -bootstrap slurm my_par_program

Attention:
Do NOT add mpirun options -n <number_of_processes> or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.
To launch 200 Intel MPI tasks on 5 nodes, each node requiring 80 GByte, with a run time of 5 hours, execute:

$ sbatch --partition=multiple -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh


If you want to use 128 or more nodes, you must also set the environment variable as follows:
export I_MPI_HYDRA_BRANCH_COUNT=-1
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
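
The 128-node rule above can be applied automatically inside a job script. The following is a minimal sketch, assuming the script runs inside a Slurm job where SLURM_JOB_NUM_NODES is set; the function name configure_impi_for_job_size is hypothetical, and outside a job a node count of 1 is assumed.

```shell
#!/bin/bash
# Sketch: set I_MPI_HYDRA_BRANCH_COUNT only for jobs with 128 or more nodes.
# SLURM_JOB_NUM_NODES is provided by Slurm inside a job; the default of 1 is
# an assumption so that the snippet also runs outside a job.
configure_impi_for_job_size() {
    local nodes="${SLURM_JOB_NUM_NODES:-1}"
    if [ "$nodes" -ge 128 ]; then
        export I_MPI_HYDRA_BRANCH_COUNT=-1
    fi
}

configure_impi_for_job_size
echo "I_MPI_HYDRA_BRANCH_COUNT=${I_MPI_HYDRA_BRANCH_COUNT:-not set}"
```

Call the function before mpiexec.hydra in the wrapper script so that the variable is part of the environment of the launcher.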

Multithreaded + MPI parallel Programs

Multithreaded + MPI parallel programs can run faster than serial programs on systems with multiple CPUs and multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spread over different nodes. Because hyperthreading is enabled on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n if you want to use n threads.

OpenMPI with Multithreading

Multiple MPI tasks using OpenMPI must be launched by the MPI parallel program mpirun. For multithreaded programs based on Open Multi-Processing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
For OpenMPI, a job script job_ompi_omp.sh that runs the MPI program ompi_omp_program with 4 tasks and 28 threads per task, requiring 3000 MByte of physical memory per thread (with 28 threads per MPI task this gives 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours, looks like:

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=56
#SBATCH --time=03:00:00
#SBATCH --mem=83gb    # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up to 83 GB
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"  

# Use when a defined module environment related to OpenMPI is desired
module load ${MPI_MODULE}
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe

Execute the script job_ompi_omp.sh by command sbatch:

$ sbatch -p multiple ./job_ompi_omp.sh
  • With the mpirun option --bind-to core MPI tasks and OpenMP threads are bound to physical cores.
  • With the option --map-by node:PE=<value> neighboring MPI tasks will be attached to different nodes and each MPI task is bound to the first core of a node. <value> must be set to ${OMP_NUM_THREADS}.
  • The option -report-bindings shows the bindings between MPI tasks and physical cores.
  • The mpirun-options --bind-to core, --map-by socket|...|node:PE=<value> should always be used when running a multithreaded MPI program.
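
The resource arithmetic of the hybrid example above (threads from --cpus-per-task, total cores, memory per task) can be sketched as follows. The function name hybrid_resources and its output format are hypothetical; the numbers are the ones used in the job script job_ompi_omp.sh.

```shell
#!/bin/bash
# Sketch of the resource arithmetic for a hybrid MPI + OpenMP job.
# Arguments: <ntasks> <cpus_per_task> <mem_per_thread_mb>
hybrid_resources() {
    local ntasks="$1" cpus_per_task="$2" mem_per_thread_mb="$3"
    # Hyperthreading: two logical CPUs per physical core, so threads = CPUs / 2.
    local threads=$(( cpus_per_task / 2 ))
    local cores=$(( ntasks * threads ))
    local mem_mb=$(( threads * mem_per_thread_mb ))
    # Round up to whole GB (1 GB = 1024 MB), as done for --mem=83gb.
    local mem_gb=$(( (mem_mb + 1023) / 1024 ))
    echo "threads=${threads} cores=${cores} mem_per_task_mb=${mem_mb} mem_per_task_gb=${mem_gb}"
}

# Values from the example: 4 tasks, --cpus-per-task=56, 3000 MByte per thread
hybrid_resources 4 56 3000
# prints: threads=28 cores=112 mem_per_task_mb=84000 mem_per_task_gb=83
```

Note that shell arithmetic needs the $(( ... )) form; a plain assignment such as NUM_CORES=4*28 stores the text "4*28" instead of 112.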


Intel MPI with Multithreading

Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes.

Multiple Intel MPI tasks must be launched by the MPI parallel program mpiexec.hydra. For multithreaded programs based on Open Multi-Processing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

For Intel MPI, a job script job_impi_omp.sh that runs the Intel MPI program impi_omp_program with 10 tasks and 40 threads per task, requiring 96000 MByte of total physical memory per task and a total wall clock time of 1 hour, looks like:

#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=80
#SBATCH --time=60
#SBATCH --mem=96000
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program
#SBATCH --output="parprog_impi_omp_%j.out"

#If using more than one MPI task per node please set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,scatter  prints messages concerning the supported affinity 
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

# Use when a defined module environment related to Intel MPI is desired
module load ${MPI_MODULE}
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="-binding domain=omp:compact -print-rank-map -envall"
export NUM_PROCS=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))
echo "${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}"
echo $startexe
exec $startexe

When using the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If you only run one MPI task per node please set KMP_AFFINITY=compact,1,0.
If you want to use 128 or more nodes, you must also set the environment variable as follows:
export I_MPI_HYDRA_BRANCH_COUNT=-1
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.

Execute the script job_impi_omp.sh by command sbatch:

$ sbatch -p multiple ./job_impi_omp.sh


The mpirun option -print-rank-map shows the bindings between MPI tasks and nodes (not very beneficial). The option -binding binds MPI tasks (processes) to a particular processor; domain=omp means that the domain size is determined by the number of threads. If you choose 2 MPI tasks per node, you should use -binding "cell=unit;map=bunch"; this binding maps one MPI process to each socket.

Chain jobs

The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem by a job chain. A job chain is a sequence of jobs where each job automatically starts its successor.

#!/bin/bash
####################################
## simple Slurm submitter script to setup   ## 
## a chain of jobs using Slurm                    ##
####################################
## ver.  : 2018-11-27, KIT, SCC

## Define maximum number of jobs via positional parameter 1, default is 5
max_nojob=${1:-5}

## Define your jobscript (e.g. "~/chain_job.sh")
chain_link_job=${PWD}/chain_job.sh

## Define type of dependency via positional parameter 2, default is 'afterok'
dep_type="${2:-afterok}"
## -> List of all dependencies:
## https://slurm.schedmd.com/sbatch.html

myloop_counter=1
## Submit loop
while [ ${myloop_counter} -le ${max_nojob} ] ; do
   ##
   ## Differ slurm_opt depending on chain link number
   if [ ${myloop_counter} -eq 1 ] ; then
      slurm_opt=""
   else
      slurm_opt="-d ${dep_type}:${jobID}"
   fi
   ##
   ## Print current iteration number and sbatch command
   echo "Chain job iteration = ${myloop_counter}"
   echo "   sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}"
   ## Store job ID for next iteration by stripping the text from the sbatch output
   jobID=$(sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2>&1 | sed 's/[S,a-z]* //g')
   ##
   ## Check if an error occurred: on success only the numeric job ID remains
   if ! [[ "${jobID}" =~ ^[0-9]+$ ]] ; then
      echo "   -> submission failed!" ; exit 1
   else
      echo "   -> job number = ${jobID}"
   fi
   ##
   ## Increase counter
   let myloop_counter+=1
done
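
The chain script above depends on extracting the job ID from the line "Submitted batch job <id>" that sbatch prints on success. The following is a minimal sketch of that step on its own, with a hypothetical helper name extract_job_id; it returns a non-zero status when the text does not contain a job ID, which makes the error check explicit. As an alternative, sbatch offers the option --parsable, which prints only the job ID (followed by a semicolon and the cluster name, if present).

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from the sbatch output line
# "Submitted batch job <id>". Returns non-zero if no job ID is found.
extract_job_id() {
    local id
    id=$(printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p')
    [ -n "$id" ] || return 1
    echo "$id"
}

# Sample text as printed by sbatch on success
extract_job_id "Submitted batch job 18089884"   # prints 18089884

# Sample error text (illustrative only): no job ID, so the helper fails
extract_job_id "sbatch: error: submission failed" || echo "no job ID found"
```

In the submit loop this could be used as jobID=$(extract_job_id "$(sbatch ... 2>&1)") || exit 1.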


GPU jobs

The nodes in the gpu_4 and gpu_8 queues have 4 or 8 NVIDIA Tesla V100 GPUs. Just submitting a job to these queues is not enough to also allocate one or more GPUs; you have to request them with the "--gres=gpu" parameter. You have to specify how many GPUs your job needs, e.g. "--gres=gpu:2" will request two GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) add after the initial line of your script job.sh the line including the information about the GPU usage:
#SBATCH --gres=gpu:2

#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:2

or b) execute:

$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh


If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:

$ nvidia-smi
Sun Mar 29 15:20:05 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   29C    P0    39W / 300W |      9MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |      8MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     14228      G   /usr/bin/X                                     8MiB |
|    1     14228      G   /usr/bin/X                                     8MiB |
+-----------------------------------------------------------------------------+
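
Inside a job script the number of allocated GPUs can also be checked without nvidia-smi. The following is a minimal sketch that counts the entries of CUDA_VISIBLE_DEVICES. That Slurm exports this variable for jobs using --gres=gpu on this cluster is an assumption here (it is the usual behaviour of Slurm's GPU GRES handling), and the function name count_visible_gpus is hypothetical.

```shell
#!/bin/bash
# Sketch: count the GPUs listed in CUDA_VISIBLE_DEVICES (comma-separated IDs).
# Assumption: the variable is set by Slurm for jobs submitted with --gres=gpu:<n>.
count_visible_gpus() {
    # Use the first argument if given, otherwise the environment variable.
    local list="${1-${CUDA_VISIBLE_DEVICES:-}}"
    if [ -z "$list" ]; then
        echo 0
        return 0
    fi
    # Number of entries = number of commas + 1
    local commas
    commas=$(printf '%s' "$list" | tr -cd ',' | wc -c)
    echo $(( commas + 1 ))
}

count_visible_gpus "0,1"   # prints 2, as for a job submitted with --gres=gpu:2
```

A job script could compare this value with the number of GPUs it expects and stop early if they differ.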


In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware. However, there may be warnings, e.g. when running

$ module load compiler/gnu/10.3 mpi/openmpi devel/cuda
$ mpirun -np 2 ./mpi_cuda_app
--------------------------------------
WARNING: There are more than one active ports on host 'uc2n520', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------

Please run Open MPI's mpirun using the following command:

$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app

or disable the (older) communication layer Byte Transfer Layer (BTL) altogether:

$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app

(Please note that CUDA as of v11.4 is only available with GCC up to version 10.)

LSDF Online Storage

On bwUniCluster 2.0 you can use the LSDF Online Storage on the HPC cluster nodes for special cases. Please request this service separately (LSDF Storage Request). To mount the LSDF Online Storage on the compute nodes during the job runtime, the constraint flag "LSDF" has to be set.

a) add after the initial line of your script job.sh the line including the information about the LSDF Online Storage usage:
#SBATCH --constraint=LSDF

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=120
#SBATCH --mem=200
#SBATCH --constraint=LSDF

or b) execute:

$ sbatch -p <queue> -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh


For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
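
A job that needs the LSDF Online Storage can check early whether these variables are present, for example when the constraint flag was forgotten. The following is a minimal sketch with a hypothetical helper name require_env_vars; it only tests that the variables are set in the environment and makes no assumption about their values.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if all named environment variables are
# set to a non-empty value; report each missing one on stderr.
require_env_vars() {
    local name missing=0
    for name in "$@"; do
        if [ -z "$(printenv "$name")" ]; then
            echo "environment variable ${name} is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

if require_env_vars LSDF LSDFPROJECTS LSDFHOME; then
    echo "LSDF environment is available"
else
    echo "LSDF environment is missing (was --constraint=LSDF set?)"
fi
```

In a real job script the else branch would typically end the job with exit 1 before any data is read or written.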

BeeOND (BeeGFS On-Demand)

BeeOND instances are integrated into the prolog and epilog scripts of the cluster batch system Slurm. BeeOND can be used on exclusively allocated compute nodes during the job runtime by setting the constraint flag "BEEOND", "BEEOND_4MDS" or "BEEOND_MAXMDS" (see Slurm Command Parameters):

  • BEEOND: one metadata server is started on the first node
  • BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.
  • BEEOND_MAXMDS: a metadata server for the on-demand file system is started on every node of your job

As a starting point we recommend using the "BEEOND" option. If you are unsure whether this is sufficient for you, feel free to contact the support team.

#!/bin/bash
#SBATCH ...
#SBATCH --constraint=BEEOND   # or BEEOND_4MDS or BEEOND_MAXMDS

After your job has started you can find the private on-demand file system in /mnt/odfs/${SLURM_JOB_ID} directory. The mountpoint comes with five pre-configured directories:

# For small files (stripe count = 1)
/mnt/odfs/${SLURM_JOB_ID}/stripe_1
# Stripe count = 4
/mnt/odfs/${SLURM_JOB_ID}/stripe_default 
# or 
/mnt/odfs/${SLURM_JOB_ID}/stripe_4
# Stripe count = 8, 16 or 32, use these directories for medium sized and large files or when using MPI-IO
/mnt/odfs/${SLURM_JOB_ID}/stripe_8
/mnt/odfs/${SLURM_JOB_ID}/stripe_16 
# or 
/mnt/odfs/${SLURM_JOB_ID}/stripe_32

If you request fewer nodes than the stripe count, the stripe count will be the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 has a stripe count of only 8.

Attention:
Be careful when creating large files: always use the directory with the maximum stripe count for large files.
If you create large files, use a higher stripe count. For example, if your largest file is 1.1 TB, then you have to use a stripe count larger than 2,
otherwise the available disk space is exceeded.

The capacity of the private file system depends on the number of nodes. For each node you get 750 Gbyte. If you request 100 nodes for your job, the private file system is 100 * 750 Gbyte ~ 75 Tbyte (approx) capacity.
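
The two rules above (750 Gbyte per node, and a stripe count that is capped by the number of nodes) can be sketched as simple shell arithmetic. The function names beeond_capacity_gb and effective_stripe_count are hypothetical helpers for illustration only.

```shell
#!/bin/bash
# Sketch: BeeOND capacity and effective stripe count for a given job size.

# Capacity of the private file system: 750 Gbyte per node.
beeond_capacity_gb() {
    local nodes="$1"
    echo $(( nodes * 750 ))
}

# The stripe count of a directory cannot exceed the number of nodes.
# Arguments: <stripe_count_of_directory> <nodes>
effective_stripe_count() {
    local requested="$1" nodes="$2"
    echo $(( requested < nodes ? requested : nodes ))
}

beeond_capacity_gb 100          # prints 75000 (about 75 Tbyte)
effective_stripe_count 16 8     # prints 8: stripe_16 on 8 nodes
effective_stripe_count 4 100    # prints 4
```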

Start time of job or resources : squeue --start

The command squeue --start can be used by any user to display the estimated start time of a job. The estimate is based on a number of analysis types: historical usage, earliest available reservable resources, and priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).

Access

By default, this command can be run by any user.

List of your submitted jobs : squeue

Displays information about YOUR active, pending and/or recently completed jobs. Only your own jobs are displayed. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).

Access

By default, this command can be run by any user.

Flags

Flag Description
-l, --long Report more of the available information for the selected jobs or job steps, subject to any constraints specified.


Examples

squeue example on bwUniCluster 2.0 (Only your own jobs are displayed!).

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18088744    single CPV.sbat   ab1234 PD       0:00      1 (Priority)
          18098414  multiple CPV.sbat   ab1234 PD       0:00      2 (Priority) 
          18090089  multiple CPV.sbat   ab1234  R       2:27      2 uc2n[127-128]
$ squeue -l
            JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON) 
         18088654    single CPV.sbat   ab1234 COMPLETI       4:29   2:00:00      1 uc2n374
         18088785    single CPV.sbat   ab1234  PENDING       0:00   2:00:00      1 (Priority)
         18098414  multiple CPV.sbat   ab1234  PENDING       0:00   2:00:00      2 (Priority)
         18088683    single CPV.sbat   ab1234  RUNNING       0:14   2:00:00      1 uc2n413  
  • The output of squeue shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
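
To get the number of jobs per state without counting lines by hand, the default squeue output can be piped through a small filter. The following is a minimal sketch that reads squeue output from standard input; the function name count_jobs_by_state is hypothetical, and the parsing assumes the default column layout shown above (state in column 5) and job names without blanks. The sample data is taken from the example above.

```shell
#!/bin/bash
# Sketch: count jobs per state from default squeue output on stdin.
# Assumes the header line comes first and the state (ST) is the 5th column.
count_jobs_by_state() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the squeue example; in a real session use:
#   squeue | count_jobs_by_state
count_jobs_by_state <<'EOF'
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18088744    single CPV.sbat   ab1234 PD       0:00      1 (Priority)
          18098414  multiple CPV.sbat   ab1234 PD       0:00      2 (Priority)
          18090089  multiple CPV.sbat   ab1234  R       2:27      2 uc2n[127-128]
EOF
```

For the sample input the filter reports two pending jobs (PD 2) and one running job (R 1).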


Shows free resources : sinfo_t_idle

The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.

Access

By default, this command can be used by any user or administrator.

Example

  • The following command displays what resources are available for immediate use for the whole partition.
$ sinfo_t_idle
Partition dev_multiple  :      8 nodes idle
Partition multiple      :    332 nodes idle
Partition dev_single    :      4 nodes idle
Partition single        :     76 nodes idle
Partition long          :     80 nodes idle
Partition fat           :      5 nodes idle
Partition dev_special   :    342 nodes idle
Partition special       :    342 nodes idle
Partition dev_multiple_e:      7 nodes idle
Partition multiple_e    :    335 nodes idle
Partition gpu_4         :     12 nodes idle
Partition gpu_8         :      6 nodes idle
  • For the above example jobs in all partitions can be run immediately.


Detailed job information : scontrol show job

scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
Display the state of all your jobs in normal mode: scontrol show job
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>

Access

  • End users can use scontrol show job to view the status of their own jobs only.


Arguments

Option Default Description Example
-d (n/a) Detailed mode Example: Display the state with jobid 18089884 in detailed mode.
scontrol -d show job 18089884



Scontrol show job Example

Here is an example from bwUniCluster 2.0.

$ squeue    # show my own jobs (here the userid is replaced!)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18089884  multiple CPV.sbat   bq0742  R      33:44      2 uc2n[165-166]

$
$ # now, see what's up with my running job with jobid 18089884
$ 
$ scontrol show job 18089884

JobId=18089884 JobName=CPV.sbatch
   UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A
   Priority=3 Nice=0 Account=kit QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54
   AccrueTime=2020-03-16T14:14:54
   StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51
   Partition=multiple AllocNode:Sid=uc2n995:5064
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=uc2n[165-166]
   BatchHost=uc2n165
   NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=160,mem=96320M,node=2,billing=160
   Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
   MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch
   WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin
   StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
   StdIn=/dev/null
   StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
   Power=
   MailUser=(null) MailType=NONE


You can use standard Linux pipe commands to filter the very detailed scontrol show job output.

  • In which state is the job?
$ scontrol show job 18089884 | grep -i State
   JobState=COMPLETED Reason=None Dependency=(null)
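
When only the value of one field is needed, e.g. in a script, the key=value pairs of the scontrol show job output can be split up. The following is a minimal sketch with a hypothetical helper name job_field; it assumes that the value contains no blanks, which holds for fields such as JobState, Partition or NodeList in the example above.

```shell
#!/bin/bash
# Sketch: print the value of one key from "scontrol show job" output on stdin.
# Assumes "key=value" pairs separated by blanks and values without blanks.
job_field() {
    local key="$1"
    tr ' ' '\n' | sed -n "s/^${key}=//p" | head -n 1
}

# Sample lines copied from the scontrol example; in a real session use:
#   scontrol show job <jobid> | job_field JobState
printf '%s\n' \
    'JobId=18089884 JobName=CPV.sbatch' \
    '   JobState=RUNNING Reason=None Dependency=(null)' \
    | job_field JobState
# prints: RUNNING
```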


Cancel Slurm Jobs

The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).

Canceling own jobs : scancel

scancel is used to signal or cancel jobs, job arrays or job steps. The command is:

$ scancel [-i] <job-id>
$ scancel -t <job_state_name>


Flag Default Description Example
-i, --interactive (n/a) Interactive mode. Cancel the job 987654 interactively.
 scancel -i 987654 
-t, --state (n/a) Restrict the scancel operation to jobs in a certain state.
"job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
Cancel all jobs in state "PENDING".
 scancel -t "PENDING" 


Resource Managers

Batch Job (Slurm) Variables

The following environment variables of Slurm are added to your environment once your job has started (only an excerpt of the most important ones).

Environment Brief explanation
SLURM_JOB_CPUS_PER_NODE Number of CPUs per node dedicated to the job
SLURM_JOB_NODELIST List of nodes dedicated to the job
SLURM_JOB_NUM_NODES Number of nodes dedicated to the job
SLURM_MEM_PER_NODE Memory per node dedicated to the job
SLURM_NPROCS Total number of processes dedicated to the job
SLURM_CLUSTER_NAME Name of the cluster executing the job
SLURM_CPUS_PER_TASK Number of CPUs requested per task
SLURM_JOB_ACCOUNT Account name
SLURM_JOB_ID Job ID
SLURM_JOB_NAME Job Name
SLURM_JOB_PARTITION Partition/queue running the job
SLURM_JOB_UID User ID of the job's owner
SLURM_SUBMIT_DIR Job submit folder. The directory from which sbatch was invoked.
SLURM_JOB_USER User name of the job's owner
SLURM_RESTART_COUNT Number of times job has restarted
SLURM_PROCID Task ID (MPI rank)
SLURM_NTASKS The total number of tasks available for the job
SLURM_STEP_ID Job step ID
SLURM_STEP_NUM_TASKS Task count (number of MPI ranks)
SLURM_JOB_CONSTRAINT Job constraints
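
A job script can write these variables to its output file to document how the job was run. The following is a minimal sketch with a hypothetical function name print_job_summary. Default values are used for variables that are not set, because some of them are only defined under certain conditions; SLURM_CPUS_PER_TASK, for example, is only set if --cpus-per-task was specified.

```shell
#!/bin/bash
# Sketch: print a short job summary from the Slurm environment variables.
# The ${VAR:-default} form keeps the script usable when a variable is unset.
print_job_summary() {
    echo "job ${SLURM_JOB_ID:-unknown} (${SLURM_JOB_NAME:-unnamed}) in partition ${SLURM_JOB_PARTITION:-unknown}"
    echo "nodes=${SLURM_JOB_NUM_NODES:-?} tasks=${SLURM_NTASKS:-?} cpus_per_task=${SLURM_CPUS_PER_TASK:-?}"
    echo "submitted from ${SLURM_SUBMIT_DIR:-unknown}"
}

print_job_summary
```

Outside a batch job the function prints only the default values.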

See also:


Job Exit Codes

A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.

Displaying Exit Codes and Signals

SLURM displays a job's exit code in the output of the scontrol show job and the sview utility.
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon(:).
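
The "exit code:signal" notation and the 8 bit range can be illustrated with a few lines of shell. The following is a minimal sketch; the function names split_exit_code and to_unsigned_8bit are hypothetical, and the input format ExitCode=<code>:<signal> is the one shown in the scontrol show job output.

```shell
#!/bin/bash
# Sketch: split the ExitCode field of "scontrol show job" into its two parts.
split_exit_code() {
    local value="${1#ExitCode=}"          # drop the key, keep "<code>:<signal>"
    echo "code=${value%%:*} signal=${value##*:}"
}

# Sketch: map an exit code to the unsigned 8 bit range 0 - 255.
to_unsigned_8bit() {
    echo $(( $1 & 255 ))
}

split_exit_code "ExitCode=0:0"   # prints code=0 signal=0
split_exit_code "ExitCode=0:9"   # prints code=0 signal=9
to_unsigned_8bit -1              # prints 255
```

The last call shows why a program that returns -1 appears with exit code 255.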

Submitting Termination Signal

Here is an example of how to 'save' a Slurm termination signal in a typical jobscript.

[...]
mpirun  -np <#cores>  <EXE_BIN_DIR>/<executable> ... (options)  2>&1
# Save the exit code of mpirun immediately after the command has finished
exit_code=$?
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
   echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
  • Do not use 'time' mpirun! The exit code will be the one returned by the first (time) program.
  • You do not need an exit $exit_code in the scripts.



