BinAC2/Slurm: Difference between revisions
| F Bartusch (talk | contribs)  (Correct link to BinAC2 partitions.) |  (updated GPU section) | ||
| (22 intermediate revisions by 2 users not shown) | |||
| Line 24: | Line 24: | ||
| : A partition (usually called queue outside SLURM) is a waiting line in which jobs are put by users. | : A partition (usually called queue outside SLURM) is a waiting line in which jobs are put by users. | ||
| ;<span id=" | ;<span id="Socket"></span>Socket     | ||
| : Receptacle on the motherboard for one physically packaged processor (each of which can contain one or more cores). | |||
| : A CPU in Slurm means a single core. This is different from the more common terminology, where a CPU (a microprocessor chip) consists of multiple cores. Slurm uses the term '''sockets''' when talking about CPU chips. | |||
| ;<span id="Core"></span>Core     | |||
| : A complete private set of registers, execution units, and retirement queues needed to execute programs. | |||
| ;<span id="Thread"></span>Thread     | |||
| : One or more hardware contexts within a single core. Each thread has attributes of one core, managed & scheduled as a single logical processor by the OS. | |||
| ;<span id="CPU"></span>CPU | |||
| : A '''CPU''' in Slurm means a '''single core'''. This is different from the more common terminology, where a CPU (a microprocessor chip) consists of multiple cores. Slurm uses the term '''sockets''' when talking about CPU chips. Depending on the system configuration, a CPU can be either a '''core''' or a '''thread'''. On '''BinAC 2, hyperthreading is activated on every machine'''. This means that the operating system and Slurm see each physical core as two logical cores. | |||
| = Slurm Commands = | = Slurm Commands = | ||
| Line 47: | Line 55: | ||
| |} | |} | ||
| == Interactive Jobs == | |||
| You can run interactive jobs for testing and developing your job scripts. | |||
| Several nodes are reserved for interactive work, so your jobs should start right away. | |||
| You can only submit one job to this partition at a time. A job can run for up to 10 hours (about one workday). | |||
| This example command gives you 16 cores and 128 GB of memory for four hours on one of the reserved nodes: | |||
| <pre> | |||
| salloc --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb | |||
| </pre> | |||
| You can also use srun to request the same resources: | |||
| <pre> | |||
| srun --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb --pty bash | |||
| </pre> | |||
| == Job Submission : sbatch == | == Job Submission : sbatch == | ||
| Line 105: | Line 129: | ||
| | <code>--mem=<size>[units]</code> | | <code>--mem=<size>[units]</code> | ||
| | #SBATCH --mem=''value_in_MB''  | | #SBATCH --mem=''value_in_MB''  | ||
| | Memory in MegaByte per node.  | | Memory in MegaByte per node.</br><code>[units]</code> can be one of <code>[K<nowiki>|</nowiki>M<nowiki>|</nowiki>G<nowiki>|</nowiki>T]</code>. | ||
| |  | | <code>--mem=10g</code> Request 10GB RAM per node </br> <code>--mem=0</code> Request all memory on node | ||
| | Depends on Slurm configuration.</br>It | | Depends on Slurm configuration.</br>It is better to specify <code>--mem</code> in every case. | ||
| |- | |- | ||
| |- style="vertical-align:top;" | |- style="vertical-align:top;" | ||
| Line 173: | Line 197: | ||
| === sbatch Examples === | === sbatch Examples === | ||
| If you are coming from Moab/Torque on BinAC 1 or are new to HPC/Slurm, the <code>sbatch</code> options may be confusing. The following examples show how to run typical workloads on BinAC 2. | |||
| You can find every file mentioned on this Wiki page on BinAC 2 at: <code>/pfs/10/project/examples</code> | |||
| ==== Serial Programs ==== | ==== Serial Programs ==== | ||
| When you use serial programs that use only one process, you can omit most of the <code>sbatch</code> parameters, as the default values are sufficient. | |||
| To submit a serial job that runs the script '''job.sh''' and that requires 5000 MB of main memory and 10 minutes of wall clock time | |||
| To submit a serial job that runs the script <code>serial_job.sh</code> and requires 5000 MB of main memory and 10 minutes of wall clock time, use one of the following options. Slurm will allocate one '''physical''' core to your job. | |||
| a) execute: | a) execute: | ||
| <pre> | <pre> | ||
| $ sbatch -p  | $ sbatch -p compute -t 10:00 --mem=5000m  serial_job.sh | ||
| </pre> | </pre> | ||
| or | or | ||
| b) add after the initial line of your script ''' | b) add after the initial line of your script '''serial_job.sh''' the lines: | ||
| <source lang="bash"> | <source lang="bash"> | ||
| #SBATCH -- | #SBATCH --time=10:00 | ||
| #SBATCH -- | #SBATCH --mem=5000m | ||
| #SBATCH -- | #SBATCH --job-name=simple-serial-job | ||
| #SBATCH --job-name=simple | |||
| </source> | </source> | ||
| and execute the modified script with the command line option ''--partition= | and execute the modified script with the command line option ''--partition=compute'' | ||
| <pre> | <pre> | ||
| $ sbatch - | $ sbatch -p compute serial_job.sh | ||
| </pre> | </pre> | ||
| Note, that sbatch command line options overrule script options. | Note, that sbatch command line options overrule script options. | ||
| Line 197: | Line 227: | ||
| ==== Multithreaded Programs ==== | ==== Multithreaded Programs ==== | ||
| Multithreaded programs operate faster than serial programs on CPUs with multiple cores.<br> | |||
| Multithreaded programs run their processes on multiple threads and share resources such as memory.<br> | |||
| You may use a program that includes a built-in option for multithreading (e.g., options like <code>--threads</code>).<br> | |||
| <br> | |||
| For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) a number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1). | For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) a number of threads is defined by the environment variable <code>OMP_NUM_THREADS</code>. By default, this variable is set to 1 (<code>OMP_NUM_THREADS=1</code>).  | ||
| <br> | |||
| '''Important:''' Hyperthreading is activated on bwForCluster BinAC 2. Hyperthreading can be beneficial for some applications and codes, but it can also degrade performance in other cases. We therefore recommend to run a small test job with and without hyperthreading to determine the best choice. ''' | |||
| '''Because hyperthreading is switched on on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n, if you want to use n threads.''' | |||
| <br> | |||
| '''a) Program with built-in multithreading option''' | |||
| To submit a batch job called ''OpenMP_Test'' that runs a 40-fold threaded program ''omp_exe'' which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes: | |||
| <br> | |||
| The example uses the common Bioinformatics software called <code>samtools</code> as example for using built-in multithreading. | |||
| a) execute: | |||
| The module <code>bio/samtools/1.21</code> provides an example jobscript that requests 4 CPUs and runs <code>samtools sort</code> with 4 threads. | |||
| <pre> | <pre> | ||
| $ sbatch -p single --export=ALL,OMP_NUM_THREADS=40 -J OpenMP_Test -N 1 -c 80 -t 40 --mem=6000 ./omp_exe | |||
| </pre> | |||
| or | |||
| --> | |||
| * generate the script '''job_omp.sh''' containing the following lines: | |||
| <source lang="bash"> | |||
| #!/bin/bash | #!/bin/bash | ||
| #SBATCH --time=19:00 | |||
| #SBATCH --nodes=1 | #SBATCH --nodes=1 | ||
| #SBATCH --cpus-per-task= | #SBATCH --cpus-per-task=4 | ||
| #SBATCH -- | #SBATCH --mem=5000m | ||
| #SBATCH -- | #SBATCH --partition compute | ||
| [...] | |||
| #SBATCH --export=ALL,EXECUTABLE=./omp_exe | |||
| samtools sort -@ 4 sample.bam -o sample.sorted.bam | |||
| #SBATCH -J OpenMP_Test | |||
| </pre> | |||
| You can use the example jobscript with this command | |||
| #Usually you should set | |||
| export KMP_AFFINITY=compact,1,0 | |||
| #export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity | |||
| #KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE | |||
| export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2)) | |||
| echo "Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads" | |||
| startexe=${EXECUTABLE} | |||
| echo $startexe | |||
| exec $startexe | |||
| </source> | |||
| Using Intel compiler the environment variable KMP_AFFINITY switches on binding of threads to specific cores and, if necessary, replace <placeholder> with the required modulefile to enable the OpenMP environment and execute the script '''job_omp.sh''' adding the queue class ''single'' as sbatch option: | |||
| <pre> | <pre> | ||
| sbatch /opt/bwhpc/common/bio/samtools/1.21/bwhpc-examples/binac2-samtools-1.21-bwhpc-examples.slurm | |||
| $ sbatch -p single job_omp.sh | |||
| </pre> | </pre> | ||
| Note, that sbatch command line options overrule script options, e.g., | |||
| '''b) OpenMP''' | |||
| We will run an example OpenMP Hello-World program. The jobscript looks like this: | |||
| <pre> | <pre> | ||
| #!/bin/bash | |||
| $ sbatch --partition=single --mem=200 job_omp.sh | |||
| #SBATCH --nodes=1 | |||
| #SBATCH --cpus-per-task=4 | |||
| #SBATCH --time=1:00 | |||
| #SBATCH --mem=5000m    | |||
| #SBATCH -J OpenMP-Hello-World | |||
| export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2)) | |||
| echo "Executable running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads" | |||
| # Run parallel Hello World | |||
| /pfs/10/project/examples/openmp_hello_world | |||
| </pre> | </pre> | ||
| overwrites the script setting of 6000 MByte with 200 MByte. | |||
| <br> | |||
| <br> | |||
| Submit the job to the <code>compute</code> partition and get the output (in the stdout-file) | |||
| ==== MPI Parallel Programs ==== | |||
| MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., '''MPI tasks''',  run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes. | |||
| <br> | |||
| Multiple MPI tasks must be launched via '''mpirun''', e.g. 4 MPI tasks of ''my_par_program'': | |||
| <pre> | <pre> | ||
| sbatch --partition=compute /pfs/10/project/examples/openmp_hello_world.sh | |||
| $ mpirun -n 4 my_par_program | |||
| Executable  running on 4 cores with 4 threads | |||
| Hello from process: 0 | |||
| Hello from process: 2 | |||
| Hello from process: 1 | |||
| Hello from process: 3 | |||
| </pre> | </pre> | ||
| This command runs 4 MPI tasks of ''my_par_program'' on the node you are logged in. | |||
| To run this command with a loaded Intel MPI the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset ( --> $ unset I_MPI_HYDRA_BOOTSTRAP). | |||
| ==== OpenMPI ==== | |||
| Running MPI parallel programs in a batch job the interactive environment - particularly the loaded modules - will also be set in the batch job. If you want to set a defined module environment in your batch job you have to purge all modules before setting the wished modules.  | |||
| <br> | |||
| If you want to run MPI-jobs on batch nodes, generate a wrapper script <code>mpi_hello_world.sh</code> for '''OpenMPI''' containing the following lines: | |||
| <br> | |||
| ===== OpenMPI ===== | |||
| If you want to run jobs on batch nodes, generate a wrapper script ''job_ompi.sh'' for '''OpenMPI''' containing the following lines: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| #!/bin/bash | #!/bin/bash | ||
| # Use when using the module environment for OpenMPI | |||
| #SBATCH --partition compute | |||
| module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version> | |||
| #SBATCH --nodes=2 | |||
| module load mpi/openmpi/<placeholder_for_mpi_version> | |||
| #SBATCH --ntasks-per-node=2 | |||
| mpirun --bind-to core --map-by core -report-bindings my_par_program | |||
| #SBATCH --cpus-per-task=2 | |||
| #SBATCH --mem-per-cpu=2000 | |||
| #SBATCH --time=05:00 | |||
| # Load the MPI implementation of your choice | |||
| module load mpi/openmpi/4.1-gnu-14.2 | |||
| # Run your MPI program | |||
| mpirun --bind-to core --map-by core --report-bindings mpi_hello_world | |||
| </source> | </source> | ||
| '''Attention:''' Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames. Use '''ALWAYS''' the MPI options '''''--bind-to core''''' and '''''--map-by core|socket|node'''''. Please type ''mpirun --help'' for an explanation of the meaning of the different options of mpirun option ''--map-by''. | |||
| '''Attention:''' Do '''NOT''' add mpirun options <code>-n <number_of_processes></code> or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames. | |||
| <br> | |||
| Considering 4 OpenMPI tasks on a single node, each requiring 2000 MByte, and running for 1 hour, execute: | |||
| Use '''ALWAYS''' the MPI options <code>--bind-to core</code> and <code>--map-by core|socket|node</code>. | |||
| Please type <code>man mpirun</code> for an explanation of the meaning of the different options of mpirun option <code>--map-by</code>. | |||
| The above jobscript runs four OpenMPI tasks, distributed between two nodes. Because of hyperthreading you have to set <code>--cpus-per-task=2</code>. This means each MPI-task will get one physical core. If you omit <code>--cpus-per-task=2</code> MPI will fail. | |||
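The resource request of the jobscript above can be checked with a little arithmetic. The following is a minimal sketch, not part of the BinAC 2 examples: the variable names are hypothetical, and the values are copied from the <code>#SBATCH</code> lines of the jobscript.

```shell
#!/bin/bash
# Values copied from the #SBATCH lines of the OpenMPI jobscript above.
# All variable names are hypothetical and only used for this illustration.
nodes=2
ntasks_per_node=2
cpus_per_task=2        # two logical CPUs = one physical core with hyperthreading
mem_per_cpu_mb=2000

# Total number of MPI tasks across all nodes.
total_tasks=$(( nodes * ntasks_per_node ))
# Total number of logical CPUs Slurm has to allocate.
total_cpus=$(( total_tasks * cpus_per_task ))
# Memory requested on each node, derived from --mem-per-cpu.
mem_per_node_mb=$(( ntasks_per_node * cpus_per_task * mem_per_cpu_mb ))

echo "MPI tasks: ${total_tasks}"
echo "Logical CPUs: ${total_cpus}"
echo "Memory per node: ${mem_per_node_mb} MB"
```

With these values the job asks for four MPI tasks, eight logical CPUs (four physical cores), and 8000 MB of memory on each of the two nodes.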
| '''Attention:''' Not all compute nodes are connected via Infiniband. Tell Slurm you want Infiniband via <code>--constraint=ib</code> when submitting or add <code>#SBATCH --constraint=ib</code> to your jobscript. | |||
| <pre> | <pre> | ||
| $ sbatch --constraint=ib /pfs/10/project/examples/mpi_hello_world.sh | |||
| $ sbatch -p single -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh | |||
| </pre> | </pre> | ||
| <br> | |||
| This will run a simple Hello World program: | |||
| ===== Intel MPI ===== | |||
| Generate a wrapper script for '''Intel MPI''', ''job_impi.sh'' containing the following lines: | |||
| <source lang="bash"> | |||
| #!/bin/bash | |||
| # Use when a defined module environment related to Intel MPI is wished | |||
| module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version> | |||
| module load mpi/impi/<placeholder_for_version>    | |||
| mpiexec.hydra -bootstrap slurm my_par_program | |||
| </source> | |||
| <font color=red>'''Attention:'''</font><br> | |||
| Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames. | |||
| <br> | |||
| Launching and running 200 Intel MPI tasks on 5 nodes, each requiring 80 GByte, and running for 5 hours, execute: | |||
| <pre> | <pre> | ||
| [...] | |||
| $ sbatch --partition=multiple -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh | |||
| Hello world from processor node2-031, rank 3 out of 4 processors | |||
| </pre>  | |||
| Hello world from processor node2-031, rank 2 out of 4 processors | |||
| <br> | |||
| Hello world from processor node2-030, rank 1 out of 4 processors | |||
| If you want to use 128 or more nodes, you must also set the environment variable as follows:           <BR> | |||
| Hello world from processor node2-030, rank 0 out of 4 processors | |||
| export I_MPI_HYDRA_BRANCH_COUNT=-1 | |||
| <br> | |||
| </pre> | |||
| If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off. | |||
| <br> | |||
| <br> | |||
| ==== Multithreaded + MPI parallel Programs ==== | ==== Multithreaded + MPI parallel Programs ==== | ||
| Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes. '''Because hyperthreading is switched on on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n, if you want to use n threads.''' | |||
| Multithreaded + MPI parallel programs operate faster than serial programs on systems with multiple CPUs and multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned over different nodes. '''Because hyperthreading is switched on on BinAC 2, the option --cpus-per-task (-c) must be set to 2*n if you want to use n threads.''' | |||
| <br> | <br> | ||
| <br> | <br> | ||
| Line 331: | Line 367: | ||
| Execute the script '''job_ompi_omp.sh''' by command sbatch: | Execute the script '''job_ompi_omp.sh''' by command sbatch: | ||
| <pre> | <pre> | ||
| $ sbatch -p  | $ sbatch -p compute ./job_ompi_omp.sh | ||
| </pre> | </pre> | ||
| * With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores. | * With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores. | ||
| Line 337: | Line 373: | ||
| * The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores. | * The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores. | ||
| * The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program. | * The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program. | ||
| <br> | |||
| ===== Intel MPI with Multithreading ===== | |||
| Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes.   | |||
| Multiple Intel MPI tasks must be launched by the MPI parallel program '''mpiexec.hydra'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) number of threads are defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1). | |||
| '''For Intel MPI''' a job-script to submit a batch job called ''job_impi_omp.sh'' that runs a Intel MPI program with 10 tasks and a 40-fold threaded program ''impi_omp_program'' requiring 96000 MByte of total physical memory per task and total wall clock time of 1 hours looks like:  | |||
| <!--b)-->  | |||
| <source lang="bash"> | |||
| #!/bin/bash | |||
| #SBATCH --ntasks=10 | |||
| #SBATCH --cpus-per-task=80 | |||
| #SBATCH --time=60 | |||
| #SBATCH --mem=96000 | |||
| #SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program | |||
| #SBATCH --output="parprog_impi_omp_%j.out" | |||
| #If using more than one MPI task per node please set | |||
| export KMP_AFFINITY=compact,1,0 | |||
| #export KMP_AFFINITY=verbose,scatter  prints messages concerning the supported affinity  | |||
| #KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE | |||
| # Use when a defined module environment related to Intel MPI is wished  | |||
| module load ${MPI_MODULE} | |||
| export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2)) | |||
| export MPIRUN_OPTIONS="-binding "domain=omp:compact" -print-rank-map -envall" | |||
| export NUM_PROCS=eval(${SLURM_NTASKS}*${OMP_NUM_THREADS}) | |||
| echo "${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads" | |||
| startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}" | |||
| echo $startexe | |||
| exec $startexe | |||
| </source> | |||
| Using Intel compiler the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If you only run one MPI task per node please set KMP_AFFINITY=compact,1,0. | |||
| <BR> | |||
| If you want to use 128 or more nodes, you must also set the environment variable as follows:           <BR> | |||
| export I_MPI_HYDRA_BRANCH_COUNT=-1 | |||
| <br> | |||
| If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.  | |||
| <br> | |||
| <br> | |||
| Execute the script '''job_impi_omp.sh''' by command sbatch: | |||
| <pre> | |||
| $ sbatch -p multiple ./job_impi_omp.sh | |||
| </pre> | |||
| <br> | |||
| The mpirun option ''-print-rank-map'' shows the bindings between MPI tasks and nodes (not very beneficial). The option ''-binding'' binds MPI tasks (processes) to a particular processor; ''domain=omp'' means that the domain size is determined by the number of threads. If you would choose 2 MPI tasks per node, you should choose ''-binding "cell=unit;map=bunch"''; this binding maps one MPI process to each socket.  | |||
| <br> | |||
| <br> | |||
| ==== Chain jobs ==== | |||
| The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem by a job chain. A job chain is a sequence of jobs where each job automatically starts its successor.  | |||
| <source lang="bash"> | |||
| #!/bin/bash | |||
| #################################### | |||
| ## simple Slurm submitter script to setup   ##  | |||
| ## a chain of jobs using Slurm                    ## | |||
| #################################### | |||
| ## ver.  : 2018-11-27, KIT, SCC | |||
| ## Define maximum number of jobs via positional parameter 1, default is 5 | |||
| max_nojob=${1:-5} | |||
| ## Define your jobscript (e.g. "~/chain_job.sh") | |||
| chain_link_job=${PWD}/chain_job.sh | |||
| ## Define type of dependency via positional parameter 2, default is 'afterok' | |||
| dep_type="${2:-afterok}" | |||
| ## -> List of all dependencies: | |||
| ## https://slurm.schedmd.com/sbatch.html | |||
| myloop_counter=1 | |||
| ## Submit loop | |||
| while [ ${myloop_counter} -le ${max_nojob} ] ; do | |||
|    ## | |||
|    ## Differ slurm_opt depending on chain link number | |||
|    if [ ${myloop_counter} -eq 1 ] ; then | |||
|       slurm_opt="" | |||
|    else | |||
|       slurm_opt="-d ${dep_type}:${jobID}" | |||
|    fi | |||
|    ## | |||
|    ## Print current iteration number and sbatch command | |||
|    echo "Chain job iteration = ${myloop_counter}" | |||
|    echo "   sbatch --export=myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}" | |||
|    ## Store job ID for next iteration by storing output of sbatch command with empty lines | |||
|    jobID=$(sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2>&1 | sed 's/[S,a-z]* //g') | |||
|    ##    | |||
|    ## Check if ERROR occured | |||
|    if [[ "${jobID}" =~ "ERROR" ]] ; then | |||
|       echo "   -> submission failed!" ; exit 1 | |||
|    else | |||
|       echo "   -> job number = ${jobID}" | |||
|    fi | |||
|    ## | |||
|    ## Increase counter | |||
|    let myloop_counter+=1 | |||
| done | |||
| </source> | |||
| <br> | <br> | ||
| ==== GPU jobs ==== | ==== GPU jobs ==== | ||
| The nodes in the  | The nodes in the <code>gpu</code> queue have 2 or 8 NVIDIA A30/A100/H200 GPUs. Submitting a job to this queue is not enough to allocate one or more GPUs; you have to request them with the "--gres=gpu" parameter. You have to specify how many GPUs your job needs, e.g. "--gres=gpu:a30:2" will request two NVIDIA A30 GPUs. | ||
| The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs. | The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs. | ||
| a) add after the initial line of your script job.sh the line including the | a) add after the initial line of your script job.sh the line including the | ||
| information about the GPU usage:<br>   #SBATCH --gres=gpu:2 | information about the GPU usage:<br>   #SBATCH --gres=gpu:a30:2 | ||
| <pre> | <pre> | ||
| #!/bin/bash | #!/bin/bash | ||
| Line 452: | Line 388: | ||
| #SBATCH --time=02:00:00 | #SBATCH --time=02:00:00 | ||
| #SBATCH --mem=4000 | #SBATCH --mem=4000 | ||
| #SBATCH --gres=gpu:2 | #SBATCH --gres=gpu:a30:2 | ||
| </pre> | </pre> | ||
| or b) execute: | or b) execute: | ||
| <pre> | <pre> | ||
| $ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh | $ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:a30:2 job.sh | ||
| </pre> | </pre> | ||
| <br/> | <br/> | ||
| Line 485: | Line 421: | ||
| +-----------------------------------------------------------------------------+ | +-----------------------------------------------------------------------------+ | ||
| </pre> | </pre> | ||
| Upon successful GPU resource allocation, SLURM will set the environment variable <code>CUDA_VISIBLE_DEVICES</code> appropriately. <b>Do not change this variable!</b> | |||
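A jobscript can read <code>CUDA_VISIBLE_DEVICES</code> without modifying it, for example to log how many GPUs were assigned. The following is a minimal sketch; the function name <code>count_gpus</code> is hypothetical, and it assumes the variable holds a comma-separated list of device identifiers.

```shell
#!/bin/bash
# Hypothetical helper: count the entries of a CUDA_VISIBLE_DEVICES-style,
# comma-separated list. The variable itself is only read, never changed.
count_gpus() {
    local devices=$1
    if [ -z "${devices}" ]; then
        echo 0
        return
    fi
    # Split the list at commas and count the resulting fields.
    local IFS=','
    set -- ${devices}
    echo $#
}

# Outside of a GPU job the variable is usually unset, so this reports 0.
echo "GPUs visible to this job: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```

For a job submitted with <code>--gres=gpu:a30:2</code> the list would be expected to contain two entries.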
| <br/> | <br/> | ||
| Line 519: | Line 457: | ||
| </pre> | </pre> | ||
| (Please note, that CUDA per  | (Please note, that CUDA per v12.8 is only officially supported with up to GCC-11) | ||
| <br> | <br> | ||
| <br> | <br> | ||
| ==== LSDF Online Storage ==== | |||
| On bwUniCluster 2.0 you can use for special cases the LSDF Online Storage on the HPC cluster nodes. Please request for this service separately ([https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request]). | |||
| To mount the LSDF Online Storage on the compute nodes during the job runtime the | |||
| the constraint flag "LSDF" has to be set.   | |||
| a) add after the initial line of your script job.sh the line including the | |||
| information about the LSDF Online Storage usage:<br>   #SBATCH --constraint=LSDF | |||
| <pre> | |||
| #!/bin/bash | |||
| #SBATCH --ntasks=1 | |||
| #SBATCH --time=120 | |||
| #SBATCH --mem=200 | |||
| #SBATCH --constraint=LSDF | |||
| </pre> | |||
| or b) execute: | |||
| <pre> | |||
| $ sbatch -p <queue> -n 1 -t 2:00:00 --mem 200 job.sh -C LSDF | |||
| </pre> | |||
| <br> | |||
| For the usage of the LSDF Online Storage | |||
| the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME. | |||
| <br> | |||
| <br> | |||
| == Start time of job or resources : squeue --start == | == Start time of job or resources : squeue --start == | ||
| Line 578: | Line 490: | ||
| === Examples === | === Examples === | ||
| ''squeue'' example on  | ''squeue'' example on BinAC 2 <small>(Only your own jobs are displayed!)</small>. | ||
| <pre> | <pre> | ||
| $ squeue  | $ squeue  | ||
| Line 651: | Line 563: | ||
| === Scontrol show job Example === | === Scontrol show job Example === | ||
| Here is an example from  | Here is an example from BinAC 2. | ||
| <pre> | <pre> | ||
| squeue    # show my own jobs (here the userid is replaced!) | squeue    # show my own jobs (here the userid is replaced!) | ||
Latest revision as of 09:42, 8 September 2025
General information about Slurm
Any calculation on the compute nodes of bwForCluster BinAC 2 must be defined as a batch job: a single command or a sequence of commands, together with the required run time, number of CPU cores, and main memory. The batch job is then submitted to a resource and workload manager. BinAC 2 uses the workload manager Slurm, so all jobs are submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.
External Slurm documentation
You can find the official Slurm configuration and some other material here:
- Slurm documentation: https://slurm.schedmd.com/documentation.html
- Slurm cheat sheet: https://slurm.schedmd.com/pdfs/summary.pdf
- Slurm tutorials: https://slurm.schedmd.com/tutorials.html
SLURM terminology
SLURM knows and mirrors the division of the cluster into nodes with several cores. When queuing jobs, there are several ways of requesting resources and it is important to know which term means what in SLURM. Here are some basic SLURM terms:
- Job
- A job is a self-contained computation that may encompass multiple tasks and is given specific resources like individual CPUs/GPUs, a specific amount of RAM or entire nodes. These resources are said to have been allocated for the job.
- Task
- A task is a single run of a single process. By default, one task is run per node and one CPU is assigned per task.
- Partition
- A partition (usually called queue outside SLURM) is a waiting line in which jobs are put by users.
- Socket
- Receptacle on the motherboard for one physically packaged processor (each of which can contain one or more cores).
- Core
- A complete private set of registers, execution units, and retirement queues needed to execute programs.
- Thread
- One or more hardware contexts within a single core. Each thread has attributes of one core, managed & scheduled as a single logical processor by the OS.
- CPU
- A CPU in Slurm means a single core. This is different from the more common terminology, where a CPU (a microprocessor chip) consists of multiple cores. Slurm uses the term sockets when talking about CPU chips. Depending on the system configuration, a CPU can be either a core or a thread. On BinAC 2, hyperthreading is activated on every machine. This means that the operating system and Slurm see each physical core as two logical cores.
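Because each physical core appears as two logical cores, the value for --cpus-per-task is twice the number of threads that should each get a physical core. The following is a minimal sketch of that conversion; the function name `cpus_for_threads` is hypothetical and the factor of 2 assumes two hardware threads per core.

```shell
#!/bin/bash
# Hypothetical helper: number of Slurm CPUs (logical cores) to request
# so that each of the given threads gets its own physical core.
cpus_for_threads() {
    local threads=$1
    # Assumption: hyperthreading with two hardware threads per physical core.
    local threads_per_core=${2:-2}
    echo $(( threads * threads_per_core ))
}

# Example: a program that should run 4 threads on 4 physical cores.
echo "--cpus-per-task=$(cpus_for_threads 4)"
```

The printed option can be passed to sbatch or added as an #SBATCH line in a jobscript.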
Slurm Commands
| Slurm commands | Brief explanation | 
|---|---|
| sbatch | Submits a job and queues it in an input queue | 
| salloc | Requests resources for an interactive job | 
| squeue | Displays information about active, eligible, blocked, and/or recently completed jobs | 
| scontrol | Displays detailed job state information | 
| sstat | Displays status information about a running job | 
| scancel | Cancels a job | 
Interactive Jobs
You can run interactive jobs for testing and developing your job scripts. Several nodes are reserved for interactive work, so your jobs should start right away. You can only submit one job to this partition at a time. A job can run for up to 10 hours (about one workday).
This example command gives you 16 cores and 128 GB of memory for four hours on one of the reserved nodes:
salloc --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb
You can also use srun to request the same resources:
srun --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb --pty bash
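Inside the interactive session you may want to confirm what was actually allocated. The following is a minimal sketch that prints a few environment variables Slurm sets inside a job; the function name `show_allocation` is hypothetical, and which variables are set can depend on the options used, so each one falls back to "unset".

```shell
#!/bin/bash
# Hypothetical helper: print the resources Slurm reports for the current job.
# The "unset" fallback only makes the snippet runnable outside of a job.
show_allocation() {
    echo "Job ID:        ${SLURM_JOB_ID:-unset}"
    echo "Partition:     ${SLURM_JOB_PARTITION:-unset}"
    echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-unset}"
    echo "Memory (MB):   ${SLURM_MEM_PER_NODE:-unset}"
}

show_allocation
```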
Job Submission : sbatch
Batch jobs are submitted with the command sbatch. The main purpose of the sbatch command is to specify the resources that are needed to run the job. sbatch will then queue the batch job. When the batch job actually starts depends on the availability of the requested resources and the fair sharing value.
sbatch Command Parameters
The syntax and use of sbatch can be displayed via:
$ man sbatch
sbatch options can be used from the command line or in your job script. The following table shows the syntax and provides examples for each option.
| sbatch Options | ||||
|---|---|---|---|---|
| Command line | Job Script | Purpose | Example | Default value | 
| -t time or --time=time | #SBATCH --time=time | Wall clock time limit. | -t 2:30:00: limits run time to 2 h 30 min; -t 2-12: limits run time to 2 days and 12 hours. | Depends on Slurm partition. | 
| -N count or --nodes=count | #SBATCH --nodes=count | Number of nodes to be used. | -N 1: run job on one node; -N 2: run job on two nodes (have to use MPI!) | |
| -n count or --ntasks=count | #SBATCH --ntasks=count | Number of tasks to be launched. | -n 2: launch two tasks in the job. | One task per node | 
| --ntasks-per-node=count | #SBATCH --ntasks-per-node=count | Maximum count of tasks per node. (Replaces the option ppn of MOAB.) | --ntasks-per-node=2: run 2 tasks per node | 1 task per node | 
| -c count or --cpus-per-task=count | #SBATCH --cpus-per-task=count | Number of CPUs required per (MPI-)task. | -c 2: request two CPUs per (MPI-)task. | 1 CPU per (MPI-)task | 
| --mem=<size>[units] | #SBATCH --mem=value_in_MB | Memory in MegaByte per node. [units] can be one of K, M, G, T. | --mem=10g: request 10 GB RAM per node; --mem=0: request all memory on node | Depends on Slurm configuration. It is better to specify --mem in every case. | 
| --mem-per-cpu=value_in_MB | #SBATCH --mem-per-cpu=value_in_MB | Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.) | ||
| --mail-type=type | #SBATCH --mail-type=type | Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL. | ||
| --mail-user=mail-address | #SBATCH --mail-user=mail-address | The specified mail-address receives email notification of state changes as defined by --mail-type. | ||
| --output=name | #SBATCH --output=name | File in which job output is stored. | ||
| --error=name | #SBATCH --error=name | File in which job error messages are stored. | ||
| -J name or --job-name=name | #SBATCH --job-name=name | Job name. | ||
| --export=[ALL,] env-variables | #SBATCH --export=[ALL,] env-variables | Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added. | ||
| -A group-name or --account=group-name | #SBATCH --account=group-name | Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=". | ||
| -p queue-name or --partition=queue-name | #SBATCH --partition=queue-name | Request a specific queue for the resource allocation. | ||
| --reservation=reservation-name | #SBATCH --reservation=reservation-name | Use a specific reservation for the resource allocation. | ||
| -C LSDF or --constraint=LSDF | #SBATCH --constraint=LSDF | Job constraint LSDF Filesystems. | ||
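As an illustration of how several options from the table combine in one job script, here is a minimal sketch. The job name, the output file names and all resource values are placeholders chosen for this example, not recommendations for BinAC 2. The #SBATCH lines are comments to the shell, so the script also runs outside of Slurm, where it falls back to default values.

```shell
#!/bin/bash
# Minimal sketch of a job script header; all values are illustrative placeholders.
#SBATCH --job-name=options-demo
#SBATCH --partition=compute
#SBATCH --time=2:30:00            # 2 h 30 min wall clock limit
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=10g                 # memory per node
#SBATCH --output=options-demo_%j.out
#SBATCH --error=options-demo_%j.err
#SBATCH --mail-type=END,FAIL

# Outside of a Slurm job these variables are unset, so fall back to defaults.
job_id="${SLURM_JOB_ID:-not-in-a-job}"
cpus="${SLURM_CPUS_PER_TASK:-1}"

summary="job ${job_id} uses ${cpus} CPU(s) per task"
echo "${summary}"
```

In the file name patterns, %j is replaced by the job ID, which keeps the output of different runs apart.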
sbatch --partition queues
Queue classes define maximum resources such as walltime, nodes and processes per node and queue of the compute system. Details can be found here:
sbatch Examples
If you are coming from Moab/Torque on BinAC 1 or you are new to HPC/Slurm, the sbatch options may confuse you. The following examples show how to run typical workloads on BinAC 2.
You can find every file mentioned on this Wiki page on BinAC 2 at: /pfs/10/project/examples
Serial Programs
When you use serial programs that use only one process, you can omit most of the sbatch parameters, as the default values are sufficient.
To submit a serial job that runs the script serial_job.sh and requires 5000 MB of main memory and 10 minutes of wall clock time, use one of the two options below. Slurm will allocate one physical core to your job.
a) execute:
$ sbatch -p compute -t 10:00 --mem=5000m serial_job.sh
or b) add after the initial line of your script serial_job.sh the lines:
#SBATCH --time=10:00
#SBATCH --mem=5000m
#SBATCH --job-name=simple-serial-job
and submit the modified script with the command line option --partition=compute:
$ sbatch --partition=compute serial_job.sh
Note that sbatch command line options override script options.
Multithreaded Programs
Multithreaded programs run as a single process with multiple threads that share resources such as memory.
You may use a program that includes a built-in option for multithreading (e.g., options like --threads).
For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default, this variable is set to 1 (OMP_NUM_THREADS=1). 
Important: Hyperthreading is activated on bwForCluster BinAC 2. Hyperthreading can be beneficial for some applications and codes, but it can also degrade performance in other cases. We therefore recommend running a small test job with and without hyperthreading to determine the best choice.
a) Program with built-in multithreading option
This example uses the common bioinformatics software samtools to demonstrate built-in multithreading.
The module bio/samtools/1.21 provides an example jobscript that requests 4 CPUs and runs samtools sort with 4 threads.
#!/bin/bash
#SBATCH --time=19:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5000m
#SBATCH --partition compute
[...]
samtools sort -@ 4 sample.bam -o sample.sorted.bam
You can use the example jobscript with this command:
sbatch /opt/bwhpc/common/bio/samtools/1.21/bwhpc-examples/binac2-samtools-1.21-bwhpc-examples.slurm
b) OpenMP
We will run an example OpenMP Hello-World program. The jobscript looks like this:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=1:00
#SBATCH --mem=5000m   
#SBATCH -J OpenMP-Hello-World
export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))
echo "Executable running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"
# Run parallel Hello World
/pfs/10/project/examples/openmp_hello_world
Submit the job to the compute partition and get the output (in the stdout-file):
sbatch --partition=compute /pfs/10/project/examples/openmp_hello_world.sh
Executable running on 4 cores with 4 threads
Hello from process: 0
Hello from process: 2
Hello from process: 1
Hello from process: 3
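When a job spans several nodes, SLURM_JOB_CPUS_PER_NODE can carry a suffix such as 4(x2), so dividing it directly may fail. A possible alternative, assuming the job was submitted with --cpus-per-task, is to derive the thread count from SLURM_CPUS_PER_TASK. The sketch below also makes it easy to compare runs with and without hyperthreading. The helper name threads_per_task is hypothetical and not part of Slurm.

```shell
#!/bin/bash
# Sketch: derive an OpenMP thread count from the CPUs Slurm assigned per task.
# threads_per_task is a hypothetical helper name, not part of Slurm.
threads_per_task() {
    local cpus="${1:-1}"          # logical CPUs per task
    local use_ht="${2:-no}"       # "yes": one thread per logical CPU
    local threads
    if [ "${use_ht}" = "yes" ]; then
        threads="${cpus}"
    else
        threads=$(( cpus / 2 ))   # two logical CPUs per physical core
    fi
    # Never return less than one thread.
    if [ "${threads}" -lt 1 ]; then
        threads=1
    fi
    echo "${threads}"
}

# Outside of Slurm the variable is unset, so fall back to 1 CPU.
export OMP_NUM_THREADS="$(threads_per_task "${SLURM_CPUS_PER_TASK:-1}" no)"
echo "Using ${OMP_NUM_THREADS} OpenMP thread(s)"
```

Passing yes as the second argument uses every logical CPU, which gives the second variant for a with/without hyperthreading test job.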
OpenMPI
If you want to run MPI-jobs on batch nodes, generate a wrapper script mpi_hello_world.sh for OpenMPI containing the following lines:
#!/bin/bash
#SBATCH --partition compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2000
#SBATCH --time=05:00
# Load the MPI implementation of your choice
module load mpi/openmpi/4.1-gnu-14.2
# Run your MPI program
mpirun --bind-to core --map-by core --report-bindings mpi_hello_world
Attention: Do NOT add the mpirun option -n <number_of_processes> or any other option defining processes or nodes, since Slurm tells mpirun the number of processes and the node hostnames.
ALWAYS use the MPI options --bind-to core and --map-by core|socket|node.
Type man mpirun for an explanation of the different values of the mpirun option --map-by.
The above jobscript runs four OpenMPI tasks, distributed between two nodes. Because of hyperthreading, you have to set --cpus-per-task=2. This means each MPI task will get one physical core. If you omit --cpus-per-task=2, MPI will fail.
Attention: Not all compute nodes are connected via Infiniband. Tell Slurm you want Infiniband via --constraint=ib when submitting or add #SBATCH --constraint=ib to your jobscript.
$ sbatch --constraint=ib /pfs/10/project/examples/mpi_hello_world.sh
This will run a simple Hello World program:
[...]
Hello world from processor node2-031, rank 3 out of 4 processors
Hello world from processor node2-031, rank 2 out of 4 processors
Hello world from processor node2-030, rank 1 out of 4 processors
Hello world from processor node2-030, rank 0 out of 4 processors
Multithreaded + MPI parallel Programs
Multithreaded + MPI parallel programs operate faster than serial programs on multiple CPUs with multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned across different nodes. Because hyperthreading is switched on on BinAC 2, the option --cpus-per-task (-c) must be set to 2*n if you want to use n threads.
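The 2*n rule can be sketched as a small helper that turns a task count and a thread count into the matching sbatch options. This is only an illustration that assumes two logical CPUs per physical core. The function name hybrid_sbatch_args is hypothetical.

```shell
#!/bin/bash
# Sketch: translate "n threads per MPI task" into sbatch resource options,
# assuming two logical CPUs per physical core (hyperthreading enabled).
# hybrid_sbatch_args is a hypothetical helper name.
hybrid_sbatch_args() {
    local ntasks="$1"
    local threads="$2"
    echo "--ntasks=${ntasks} --cpus-per-task=$(( 2 * threads ))"
}

# 4 MPI tasks with 28 threads each
hybrid_sbatch_args 4 28
```

For 4 tasks with 28 threads each this prints --ntasks=4 --cpus-per-task=56.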
OpenMPI with Multithreading
Multiple MPI tasks using OpenMPI must be launched by the MPI parallel program mpirun. For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default, this variable is set to 1 (OMP_NUM_THREADS=1).
For OpenMPI, a job script job_ompi_omp.sh that runs the MPI program ompi_omp_program with 4 tasks and 28 threads per task, requiring 3000 MByte of physical memory per thread (28 threads per MPI task give 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours, looks like:
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=56
#SBATCH --time=03:00:00
#SBATCH --mem=83gb    # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"  
# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
Execute the script job_ompi_omp.sh by command sbatch:
$ sbatch -p compute ./job_ompi_omp.sh
- With the mpirun option --bind-to core MPI tasks and OpenMP threads are bound to physical cores.
- With the option --map-by node:PE=<value>, neighboring MPI tasks will be attached to different nodes and each MPI task is bound to the first core of a node. <value> must be set to ${OMP_NUM_THREADS}.
- The option -report-bindings shows the bindings between MPI tasks and physical cores.
- The mpirun-options --bind-to core, --map-by socket|...|node:PE=<value> should always be used when running a multithreaded MPI program.
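The memory figure used in the job script above can be checked with shell arithmetic. The sketch below multiplies threads by memory per thread and rounds up to whole GB, taking 1 GB as 1024 MB as the comment in the job script does. The helper name mem_per_task_gb is hypothetical.

```shell
#!/bin/bash
# Sketch: memory needed per MPI task, rounded up to whole GB (1 GB = 1024 MB here).
# mem_per_task_gb is a hypothetical helper name.
mem_per_task_gb() {
    local threads="$1"
    local mb_per_thread="$2"
    local total_mb=$(( threads * mb_per_thread ))
    echo $(( (total_mb + 1023) / 1024 ))   # integer ceiling division
}

# 28 threads with 3000 MB each: 84000 MB, which rounds up to 83 GB
mem_per_task_gb 28 3000
```

Since --mem is a per-node value according to the options table, the result would have to be scaled accordingly if several MPI tasks share one node.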
GPU jobs
The nodes in the gpu queue have 2 or 8 NVIDIA A30/A100/H200 GPUs. Just submitting a job to this queue is not enough to also allocate one or more GPUs; you have to do so using the "--gres=gpu" parameter. You have to specify how many GPUs your job needs, e.g. "--gres=gpu:a30:2" will request two NVIDIA A30 GPUs.
The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.
a) add after the initial line of your script job.sh the line including the information about the GPU usage:
   #SBATCH --gres=gpu:a30:2
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:a30:2
or b) execute:
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:a30:2 job.sh
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
$ nvidia-smi
Sun Mar 29 15:20:05 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   29C    P0    39W / 300W |      9MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |      8MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     14228      G   /usr/bin/X                                     8MiB |
|    1     14228      G   /usr/bin/X                                     8MiB |
+-----------------------------------------------------------------------------+
Upon successful GPU resource allocation, SLURM will set the environment variable CUDA_VISIBLE_DEVICES appropriately. Do not change this variable!
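A job script can read CUDA_VISIBLE_DEVICES without changing it, for example to log how many GPUs were allocated. The following is a minimal sketch that assumes the variable holds a comma-separated list of device identifiers. The helper name count_visible_gpus is hypothetical.

```shell
#!/bin/bash
# Sketch: count allocated GPUs by reading (never modifying) CUDA_VISIBLE_DEVICES.
# count_visible_gpus is a hypothetical helper name.
count_visible_gpus() {
    local devices="$1"
    if [ -z "${devices}" ]; then
        echo 0
        return
    fi
    # Split the comma-separated list into an array and report its length.
    local IFS=','
    local -a ids
    read -r -a ids <<< "${devices}"
    echo "${#ids[@]}"
}

# Outside of a GPU job the variable is unset and the count is 0.
echo "GPUs visible to this job: $(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```

The count can then be passed on to a program that expects the number of GPUs as an argument.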
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
However, there may be warnings, e.g. when running
$ module load compiler/gnu/10.3 mpi/openmpi devel/cuda
$ mpirun -np 2 ./mpi_cuda_app
--------------------------------------
WARNING: There are more than one active ports on host 'uc2n520', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
Please see this FAQ entry for more details:
  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
Please run Open MPI's mpirun using the following command:
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
or disable the (older) communication layer Byte Transfer Layer (BTL for short) altogether:
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
(Please note that CUDA, as of v12.8, is officially supported only with GCC up to version 11.)
Start time of job or resources : squeue --start
The command squeue --start can be used by any user to display the estimated start time of a job. The estimate is based on a number of analysis types that consider historical usage, the earliest available reservable resources, and the priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue). 
Access
By default, this command can be run by any user. 
List of your submitted jobs : squeue
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
Access
By default, this command can be run by any user.
Flags
| Flag | Description | 
|---|---|
| -l, --long | Report more of the available information for the selected jobs or job steps, subject to any constraints specified. | 
Examples
squeue example on BinAC 2 (Only your own jobs are displayed!).
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18088744    single CPV.sbat   ab1234 PD       0:00      1 (Priority)
          18098414  multiple CPV.sbat   ab1234 PD       0:00      2 (Priority) 
          18090089  multiple CPV.sbat   ab1234  R       2:27      2 uc2n[127-128]
$ squeue -l
            JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON) 
         18088654    single CPV.sbat   ab1234 COMPLETI       4:29   2:00:00      1 uc2n374
         18088785    single CPV.sbat   ab1234  PENDING       0:00   2:00:00      1 (Priority)
         18098414  multiple CPV.sbat   ab1234  PENDING       0:00   2:00:00      2 (Priority)
         18088683    single CPV.sbat   ab1234  RUNNING       0:14   2:00:00      1 uc2n413  
- The output of squeue shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
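To get the number of jobs per state without counting by hand, the squeue -l output can be piped through a small filter. The sketch below assumes the column layout shown in the example above, with the state in the fifth column, and only looks at lines that start with a job ID. The helper name count_job_states is hypothetical.

```shell
#!/bin/bash
# Sketch: summarize job states from "squeue -l"-style output read on stdin.
# count_job_states is a hypothetical helper; it assumes STATE is the 5th column.
count_job_states() {
    awk '$1 ~ /^[0-9]/ && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the example above; on the cluster one could run
# "squeue -l | count_job_states" instead.
count_job_states <<'EOF'
JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
18088654    single CPV.sbat   ab1234 COMPLETI       4:29   2:00:00      1 uc2n374
18088785    single CPV.sbat   ab1234  PENDING       0:00   2:00:00      1 (Priority)
18098414  multiple CPV.sbat   ab1234  PENDING       0:00   2:00:00      2 (Priority)
18088683    single CPV.sbat   ab1234  RUNNING       0:14   2:00:00      1 uc2n413
EOF
```

Header and timestamp lines are skipped because their first field does not start with a digit.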
Shows free resources : sinfo_t_idle
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. 
Access
By default, this command can be used by any user or administrator. 
Example
- The following command displays what resources are available for immediate use for the whole partition.
$ sinfo_t_idle
Partition dev_multiple : 8 nodes idle
Partition multiple : 332 nodes idle
Partition dev_single : 4 nodes idle
Partition single : 76 nodes idle
Partition long : 80 nodes idle
Partition fat : 5 nodes idle
Partition dev_special : 342 nodes idle
Partition special : 342 nodes idle
Partition dev_multiple_e: 7 nodes idle
Partition multiple_e : 335 nodes idle
Partition gpu_4 : 12 nodes idle
Partition gpu_8 : 6 nodes idle
- For the above example, jobs in all partitions can be run immediately.
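If a job needs a certain number of nodes, the output can be filtered for partitions that currently have at least that many idle nodes. The sketch below assumes the line format "Partition <name> : <count> nodes idle" from the example above. The real output of the script may differ. The helper name partitions_with_idle is hypothetical.

```shell
#!/bin/bash
# Sketch: list partitions with at least <min> idle nodes.
# Assumes input lines of the form "Partition <name> : <count> nodes idle".
# partitions_with_idle is a hypothetical helper name.
partitions_with_idle() {
    local min="${1:-1}"
    sed -n 's/^Partition[[:space:]]*\([^ :]*\)[[:space:]]*:[[:space:]]*\([0-9][0-9]*\) nodes idle.*$/\1 \2/p' |
        awk -v min="${min}" '$2 >= min { print $1 }'
}

# Sample lines taken from the example above; on the cluster one could run
# "sinfo_t_idle | partitions_with_idle 8" instead.
partitions_with_idle 8 <<'EOF'
Partition dev_multiple : 8 nodes idle
Partition dev_single : 4 nodes idle
Partition fat : 5 nodes idle
Partition dev_multiple_e: 7 nodes idle
Partition gpu_4 : 12 nodes idle
EOF
```

With a minimum of 8, only dev_multiple and gpu_4 from the sample lines are listed.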
Detailed job information : scontrol show job
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol). 
Display the state of all your jobs in normal mode: scontrol show job
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
Access
- End users can use scontrol show job to view the status of their own jobs only.
Arguments
| Option | Default | Description | Example | 
|---|---|---|---|
| -d | (n/a) | Detailed mode | Example: Display the state with jobid 18089884 in detailed mode. scontrol -d show job 18089884 | 
Scontrol show job Example
Here is an example from BinAC 2.
squeue    # show my own jobs (here the userid is replaced!)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18089884  multiple CPV.sbat   bq0742  R      33:44      2 uc2n[165-166]
$
$ # now, see what's up with my running job with jobid 18089884
$ 
$ scontrol show job 18089884
JobId=18089884 JobName=CPV.sbatch
   UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A
   Priority=3 Nice=0 Account=kit QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54
   AccrueTime=2020-03-16T14:14:54
   StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51
   Partition=multiple AllocNode:Sid=uc2n995:5064
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=uc2n[165-166]
   BatchHost=uc2n165
   NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=160,mem=96320M,node=2,billing=160
   Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
   MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch
   WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin
   StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
   StdIn=/dev/null
   StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
   Power=
   MailUser=(null) MailType=NONE
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
- Which state is the job in?
$ scontrol show job 18089884 | grep -i State
JobState=COMPLETED Reason=None Dependency=(null)
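When only the state itself is needed, for example in a script that waits for a job, the value can be cut out of the JobState= field. This is a minimal sketch that reads scontrol show job output on stdin. The helper name job_state_of is hypothetical.

```shell
#!/bin/bash
# Sketch: extract the value of the JobState= field from scontrol output on stdin.
# job_state_of is a hypothetical helper name.
job_state_of() {
    sed -n 's/.*JobState=\([A-Z_]*\).*/\1/p' | head -n 1
}

# Sample line from the output above; on the cluster one could run
# "scontrol show job <jobid> | job_state_of" instead.
printf '   JobState=RUNNING Reason=None Dependency=(null)\n' | job_state_of
```

For the sample line this prints RUNNING.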
Cancel Slurm Jobs
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).   
Canceling own jobs : scancel
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
| Flag | Default | Description | Example | 
|---|---|---|---|
| -i, --interactive | (n/a) | Interactive mode. | Cancel the job 987654 interactively. scancel -i 987654 | 
| -t, --state | (n/a) | Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED". | Cancel all jobs in state "PENDING". scancel -t "PENDING" | 
Resource Managers
Batch Job (Slurm) Variables
The following environment variables of Slurm are added to your environment once your job has started (only an excerpt of the most important ones).
| Environment | Brief explanation | 
|---|---|
| SLURM_JOB_CPUS_PER_NODE | Number of CPUs per node dedicated to the job | 
| SLURM_JOB_NODELIST | List of nodes dedicated to the job | 
| SLURM_JOB_NUM_NODES | Number of nodes dedicated to the job | 
| SLURM_MEM_PER_NODE | Memory per node dedicated to the job | 
| SLURM_NPROCS | Total number of processes dedicated to the job | 
| SLURM_CLUSTER_NAME | Name of the cluster executing the job | 
| SLURM_CPUS_PER_TASK | Number of CPUs requested per task | 
| SLURM_JOB_ACCOUNT | Account name | 
| SLURM_JOB_ID | Job ID | 
| SLURM_JOB_NAME | Job Name | 
| SLURM_JOB_PARTITION | Partition/queue running the job | 
| SLURM_JOB_UID | User ID of the job's owner | 
| SLURM_SUBMIT_DIR | Job submit folder. The directory from which sbatch was invoked. | 
| SLURM_JOB_USER | User name of the job's owner | 
| SLURM_RESTART_COUNT | Number of times job has restarted | 
| SLURM_PROCID | Task ID (MPI rank) | 
| SLURM_NTASKS | The total number of tasks available for the job | 
| SLURM_STEP_ID | Job step ID | 
| SLURM_STEP_NUM_TASKS | Task count (number of MPI ranks) | 
| SLURM_JOB_CONSTRAINT | Job constraints | 
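A common use of these variables is to log the job context at the top of a job script, which helps when reading output files later. The sketch below prints a few of the variables from the table. The fallback value unknown is only used when the script runs outside of Slurm, and the helper name print_job_summary is hypothetical.

```shell
#!/bin/bash
# Sketch: log the most relevant Slurm variables at the start of a job script.
# Defaults after ":-" are only used when the script runs outside of Slurm.
# print_job_summary is a hypothetical helper name.
print_job_summary() {
    echo "Job ID:        ${SLURM_JOB_ID:-unknown}"
    echo "Job name:      ${SLURM_JOB_NAME:-unknown}"
    echo "Partition:     ${SLURM_JOB_PARTITION:-unknown}"
    echo "Nodes:         ${SLURM_JOB_NUM_NODES:-unknown} (${SLURM_JOB_NODELIST:-unknown})"
    echo "Tasks:         ${SLURM_NTASKS:-unknown}"
    echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-unknown}"
    echo "Submit dir:    ${SLURM_SUBMIT_DIR:-unknown}"
}

print_job_summary
```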
Job Exit Codes
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record. 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
Displaying Exit Codes and Signals
SLURM displays a job's exit code in the output of the scontrol show job and the sview utility.
 
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon(:).
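The exit code and signal pair can be split with plain shell parameter expansion, for example after extracting the ExitCode= field from scontrol show job. The sketch below also illustrates the 8-bit range by masking a negative value. The helper name parse_exit_code is hypothetical.

```shell
#!/bin/bash
# Sketch: split an "ExitCode=<code>:<signal>" field as shown by scontrol show job.
# parse_exit_code is a hypothetical helper name.
parse_exit_code() {
    local field="${1#ExitCode=}"   # strip the "ExitCode=" prefix if present
    local code="${field%%:*}"      # part before the colon
    local signal="${field##*:}"    # part after the colon
    echo "exit code ${code}, signal ${signal}"
}

parse_exit_code "ExitCode=0:0"
parse_exit_code "ExitCode=0:9"

# Exit codes are 8-bit unsigned values: masking with 255 shows how a
# negative value such as -1 ends up in the 0 - 255 range.
echo "-1 is displayed as $(( -1 & 255 ))"
```

The last line prints 255, the unsigned 8-bit representation of -1.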
Submitting Termination Signal
Here is an example of how to 'save' a Slurm termination signal in a typical jobscript.
[...]
mpirun  -np <#cores>  <EXE_BIN_DIR>/<executable> ... (options)  2>&1
exit_code=$?
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
   echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
- Do not use 'time' mpirun! The exit code will be the one submitted by the first (time) program.
- You do not need an exit $exit_code in the scripts.