BwUniCluster2.0/Slurm
Slurm HPC Workload Manager
Specification
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Any calculation on the compute nodes of bwUniCluster 2.0 requires the user to define it as a single command or a sequence of commands, together with the required run time, number of CPU cores, and main memory, and to submit all of this, i.e., the batch job, to resource and workload management software. bwUniCluster 2.0 uses the workload manager Slurm, so every job submission is carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.
Slurm Commands (excerpt)
Some of the most used Slurm commands for non-administrators working on bwUniCluster 2.0.
Slurm commands | Brief explanation |
---|---|
sbatch | Submits a job and queues it in an input queue [sbatch] |
scontrol show job | Displays detailed job state information [scontrol] |
squeue | Displays information about active, eligible, blocked, and/or recently completed jobs [squeue] |
squeue --start | Returns start time of submitted job or requested resources [squeue] |
sinfo_t_idle | Shows what resources are available for immediate use [sinfo] |
scancel | Cancels a job [scancel] |
IMPORTANT HINT: As soon as Slurm has allocated nodes to your batch job, you are allowed to log in to the allocated nodes via ssh.
Job Submission : sbatch
Batch jobs are submitted by using the command sbatch. The main purpose of the sbatch command is to specify the resources that are needed to run the job. sbatch will then queue the batch job. However, when the batch job starts depends on the availability of the requested resources and the fair sharing value.
sbatch Command Parameters
The syntax and use of sbatch can be displayed via:
$ man sbatch
sbatch options can be used from the command line or in your job script.
sbatch Options | ||
---|---|---|
Command line | Script | Purpose |
-t time or --time=time | #SBATCH --time=time | Wall clock time limit. |
-N count or --nodes=count | #SBATCH --nodes=count | Number of nodes to be used. |
-n count or --ntasks=count | #SBATCH --ntasks=count | Number of tasks to be launched. |
--ntasks-per-node=count | #SBATCH --ntasks-per-node=count | Maximum count (<= 28 or <= 40) of tasks per node. (Replaces the option ppn of MOAB.) |
-c count or --cpus-per-task=count | #SBATCH --cpus-per-task=count | Number of CPUs required per (MPI-)task. |
--mem=value_in_MB | #SBATCH --mem=value_in_MB | Memory in MegaByte per node. (Default value is 64000 MB, i.e. you should omit the setting of this option.) |
--mem-per-cpu=value_in_MB | #SBATCH --mem-per-cpu=value_in_MB | Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.) |
--mail-type=type | #SBATCH --mail-type=type | Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL. |
--mail-user=mail-address | #SBATCH --mail-user=mail-address | The specified mail-address receives email notification of state changes as defined by --mail-type. |
--output=name | #SBATCH --output=name | File in which job output is stored. |
--error=name | #SBATCH --error=name | File in which job error messages are stored. |
-J name or --job-name=name | #SBATCH --job-name=name | Job name. |
--export=[ALL,] env-variables | #SBATCH --export=[ALL,] env-variables | Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding to the submission environment instead of replacing it is intended, the argument ALL must be added. |
-A group-name or --account=group-name | #SBATCH --account=group-name | Charge resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=". |
-p queue-name or --partition=queue-name | #SBATCH --partition=queue-name | Request a specific queue for the resource allocation. |
-C LSDF or --constraint=LSDF | #SBATCH --constraint=LSDF | Job constraint LSDF Filesystems. |
-C BEEOND or --constraint=BEEOND | #SBATCH --constraint=BEEOND | Job constraint BeeOND file system. |
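The time limit given to -t/--time accepts several notations, and the examples on this page use more than one of them (for instance -t 10:00, -t 40 and --time=3:00:00). As a rough aid, the following sketch converts such a string into seconds. The function name slurm_time_to_seconds is made up for illustration and is not part of Slurm; it assumes the formats listed in the sbatch man page (minutes, minutes:seconds, hours:minutes:seconds, and the variants prefixed with days-).

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): convert a time limit string to seconds.
slurm_time_to_seconds() {
    local spec="$1" days=0 h=0 m=0 s=0
    if [[ "$spec" == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days="${spec%%-*}"
        spec="${spec#*-}"
        IFS=: read -r h m s <<< "$spec"
    else
        local a b c
        IFS=: read -r a b c <<< "$spec"
        if [[ -n "$c" ]]; then      # hours:minutes:seconds
            h="$a"; m="$b"; s="$c"
        elif [[ -n "$b" ]]; then    # minutes:seconds
            m="$a"; s="$b"
        else                        # minutes only
            m="$a"
        fi
    fi
    # The 10# prefix forces base 10, so values like "08" are not read as octal.
    echo $(( 10#${days:-0} * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}

slurm_time_to_seconds 10:00        # prints 600   (10 minutes)
slurm_time_to_seconds 40           # prints 2400  (40 minutes)
slurm_time_to_seconds 3:00:00      # prints 10800 (3 hours)
slurm_time_to_seconds 3-00:00:00   # prints 259200 (3 days)
```

Note in particular that a value with a single colon is read as minutes:seconds, so -t 2:00 requests two minutes, not two hours.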
sbatch --partition queues
Queue classes define the maximum resources, such as walltime, nodes, and processes per node, for each queue of the compute system. Note that queue settings of ForHLR clusters are not identical, but differ due to their different prerequisites, such as HPC performance, scalability and throughput levels. Details can be found here:
sbatch Examples
Serial Programs
To submit a serial job that runs the script job.sh and that requires 5000 MB of main memory and 10 minutes of wall clock time
a) execute:
$ sbatch -p develop -n 1 -t 10:00 --mem=5000 job.sh # on both clusters
or b) add after the initial line of your script job.sh the lines (here with a high memory request):
#SBATCH --ntasks=1
#SBATCH --time=3:00:00
#SBATCH --mem=200gb
#SBATCH --job-name=simple
and execute the modified script with the command line option --partition fat|visu (with --partition singlenode|normal maximum --mem=64gb is possible):
$ sbatch --partition=fat job.sh # on ForHLR I
$ sbatch --partition=visu job.sh # on ForHLR II
Note that sbatch command line options overrule script options.
Multithreaded Programs
Multithreaded programs operate faster than serial programs on CPUs with multiple cores.
Moreover, multiple threads of one process share resources such as memory.
For multithreaded programs based on Open Multi-Processing (OpenMP) a number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
To submit a batch job called OpenMP_Test that runs a fourfold threaded program omp_exe which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:
a) execute:
# ForHLR I
$ sbatch -p singlenode --export=ALL,OMP_NUM_THREADS=4 -J OpenMP_Test -N 1 -c 4 -t 40 --mem=6000 omp_exe
# ForHLR II
$ sbatch -p normal --export=ALL,OMP_NUM_THREADS=4 -J OpenMP_Test -N 1 -c 4 -t 40 --mem=6000 omp_exe
or b) generate the script job_omp.sh containing the following lines:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=40:00
#SBATCH --mem=6gb
#SBATCH --export=ALL,EXECUTABLE=./omp_exe
#SBATCH -J OpenMP_Test
#Usually you should set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE
export OMP_NUM_THREADS=${SLURM_JOB_CPUS_PER_NODE}
echo "Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe
When using the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If necessary, replace <placeholder> with the required modulefile to enable the OpenMP environment. Then execute the script job_omp.sh, adding the queue class singlenode|normal as sbatch option:
$ sbatch -p singlenode job_omp.sh # on ForHLR I
$ sbatch -p normal job_omp.sh # on ForHLR II
Note that sbatch command line options overrule script options, e.g.,
$ sbatch --partition=singlenode --mem=200 job_omp.sh
overrides the script setting of --mem=6gb with 200 MByte on ForHLR I.
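As an alternative to deriving the thread count from SLURM_JOB_CPUS_PER_NODE, a job script can take it from SLURM_CPUS_PER_TASK, which Slurm sets when -c/--cpus-per-task is requested. The following is only a minimal sketch; the fallback to 1 thread is an assumption chosen so that the snippet also behaves sensibly outside a batch job.

```shell
#!/bin/bash
# Inside a batch job Slurm sets SLURM_CPUS_PER_TASK if -c/--cpus-per-task was given.
# Outside a job (or without -c) the variable is unset, so fall back to 1 thread.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
```

With #SBATCH --cpus-per-task=4, as in job_omp.sh, this would set OMP_NUM_THREADS to 4.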
MPI Parallel Programs
MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., MPI tasks, run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
Multiple MPI tasks must be launched via mpirun, e.g. 4 MPI tasks of my_par_program:
$ mpirun -n 4 my_par_program
This command runs 4 MPI tasks of my_par_program on the node you are logged in. To run this command on ForHLR I/II with a loaded Intel MPI the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset ( --> $ unset I_MPI_HYDRA_BOOTSTRAP).
When running MPI parallel programs in a batch job, the interactive environment, particularly the loaded modules, is also set in the batch job. If you want a defined module environment in your batch job, you have to purge all modules before loading the desired modules.
OpenMPI
If you want to run jobs on batch nodes, generate a wrapper script job_ompi.sh for OpenMPI containing the following lines:
#!/bin/bash
# Use when a defined module environment related to OpenMPI is wished
module load mpi/openmpi/<placeholder_for_version>
mpirun --bind-to core --map-by core -report-bindings my_par_program
Attention: Do NOT add mpirun options -n <number_of_processes> or any other option defining processes or nodes, since Slurm instructs mpirun about the number of processes and the node hostnames. ALWAYS use the MPI options --bind-to core and --map-by core|socket|node. Type mpirun --help for an explanation of the different values of the mpirun option --map-by.
To launch 80 OpenMPI tasks on 4 nodes, with 1000 MByte of memory per node and a run time of 1 hour, execute:
$ sbatch -p multinode -N 4 -n 80 --mem=1000 --time=01:00:00 job_ompi.sh # on ForHLR I
$ sbatch -p normal -N 4 -n 80 --mem=1000 --time=01:00:00 job_ompi.sh # on ForHLR II
Intel MPI
Generate a wrapper script for Intel MPI, job_impi.sh containing the following lines:
#!/bin/bash
# Use when a defined module environment related to Intel MPI is wished
module load mpi/impi/<placeholder_for_version>
mpiexec.hydra -bootstrap slurm my_par_program
Attention:
Do NOT add mpirun options -n <number_of_processes> or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.
To launch and run 100 Intel MPI tasks on 5 nodes, each requiring 10 GByte, for 5 hours, execute:
$ sbatch --partition multinode -N 5 --ntasks-per-node=20 --mem=10gb -t 300 job_impi.sh # on ForHLR I
$ sbatch --partition normal -N 5 --ntasks-per-node=20 --mem=10gb -t 300 job_impi.sh # on ForHLR II
If you want to use 128 or more nodes, you must also set the environment variable as follows:
export I_MPI_HYDRA_BRANCH_COUNT=-1
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
Multithreaded + MPI parallel Programs
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes.
OpenMPI with Multithreading
Multiple MPI tasks using OpenMPI must be launched by the MPI parallel program mpirun. For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
For OpenMPI, a job script job_ompi_omp.sh that runs an MPI program with 4 tasks, each running the tenfold threaded program ompi_omp_program, requiring 3000 MByte of physical memory per thread (with 10 threads per MPI task this amounts to 10*3000 MByte = 30000 MByte per MPI task) and a total wall clock time of 3 hours looks like:
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=10
#SBATCH --time=03:00:00
#SBATCH --mem=30gb
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"
# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${SLURM_CPUS_PER_TASK} -report-bindings"
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export NUM_CORES=$((SLURM_NTASKS * SLURM_CPUS_PER_TASK))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
Execute the script job_ompi_omp.sh by command sbatch:
$ sbatch -p multinode job_ompi_omp.sh # on ForHLR I
$ sbatch -p normal job_ompi_omp.sh # on ForHLR II
- With the mpirun option --bind-to core MPI tasks and OpenMP threads are bound to physical cores.
- With the option --map-by socket:PE=<value> (neighbored) MPI tasks will be attached to different sockets and each MPI task is bound to the (in <value>) specified number of cpus. <value> must be set to ${OMP_NUM_THREADS}.
- The option -report-bindings shows the bindings between MPI tasks and physical cores.
- The mpirun-options --bind-to core, --map-by socket|...|node:PE=<value> should always be used when running a multithreaded MPI program.
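The resource arithmetic behind the hybrid job above (4 MPI tasks, 10 threads each, 3000 MByte per thread) can be sketched with shell arithmetic. The helper name hybrid_layout is hypothetical and only illustrates the calculation; it does not query Slurm.

```shell
#!/bin/bash
# Hypothetical helper: summarize a hybrid MPI + OpenMP layout.
# Arguments: <mpi_tasks> <threads_per_task> <mem_per_thread_mb>
hybrid_layout() {
    local tasks="$1" threads="$2" mem_per_thread="$3"
    local cores=$(( tasks * threads ))                  # one core per thread
    local mem_per_task=$(( threads * mem_per_thread ))  # memory needed by one MPI task
    echo "cores=${cores} mem_per_task_mb=${mem_per_task}"
}

hybrid_layout 4 10 3000   # prints cores=40 mem_per_task_mb=30000
```

Inside a job script the same product is available as $((SLURM_NTASKS * SLURM_CPUS_PER_TASK)).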
Intel MPI with Multithreading
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes.
Multiple Intel MPI tasks must be launched by the MPI parallel program mpiexec.hydra. For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
For Intel MPI, a job script job_impi_omp.sh that runs an Intel MPI program with 4 tasks, each running the 20-fold threaded program impi_omp_program, requiring 64000 MByte of total physical memory per task and a total wall clock time of 1 hour looks like:
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=20
#SBATCH --time=60
#SBATCH --mem=64000
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program
#SBATCH --output="parprog_impi_omp_%j.out"
#If using more than one MPI task per node please set
export KMP_AFFINITY=scatter
#export KMP_AFFINITY=verbose,scatter prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE
# Use when a defined module environment related to Intel MPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="-binding domain=omp:compact -print-rank-map -envall"
export NUM_PROCS=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}"
echo $startexe
exec $startexe
Using Intel compiler the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If you only run one MPI task per node please set KMP_AFFINITY=compact,1,0.
If you want to use 128 or more nodes, you must also set the environment variable as follows:
export I_MPI_HYDRA_BRANCH_COUNT=-1
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
Execute the script job_impi_omp.sh by command sbatch:
$ sbatch -p multinode job_impi_omp.sh # on ForHLR I
$ sbatch -p normal job_impi_omp.sh # on ForHLR II
The mpirun option -print-rank-map shows the bindings between MPI tasks and nodes (not very beneficial). The option -binding binds MPI tasks (processes) to a particular processor; domain=omp means that the domain size is determined by the number of threads. In the above examples (2 MPI tasks per node) you could also choose -binding "cell=unit;map=bunch"; this binding maps one MPI process to each socket.
Chain jobs
The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem by a job chain. A job chain is a sequence of jobs where each job automatically starts its successor.
#!/bin/bash
####################################
## simple Slurm submitter script to setup ##
## a chain of jobs using Slurm ##
####################################
## ver. : 2018-11-27, KIT, SCC
## Define maximum number of jobs via positional parameter 1, default is 5
max_nojob=${1:-5}
## Define your jobscript (e.g. "~/chain_job.sh")
chain_link_job=${PWD}/chain_job.sh
## Define type of dependency via positional parameter 2, default is 'afterok'
dep_type="${2:-afterok}"
## -> List of all dependencies:
## https://slurm.schedmd.com/sbatch.html
myloop_counter=1
## Submit loop
while [ ${myloop_counter} -le ${max_nojob} ] ; do
##
## Set slurm_opt depending on chain link number
if [ ${myloop_counter} -eq 1 ] ; then
slurm_opt=""
else
slurm_opt="-d ${dep_type}:${jobID}"
fi
##
## Print current iteration number and sbatch command
echo "Chain job iteration = ${myloop_counter}"
echo " sbatch --export=myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}"
## Store job ID for next iteration by storing output of sbatch command with empty lines
jobID=$(sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2>&1 | sed 's/[S,a-z]* //g')
##
## Check if ERROR occurred
if [[ "${jobID}" =~ "ERROR" ]] ; then
echo " -> submission failed!" ; exit 1
else
echo " -> job number = ${jobID}"
fi
##
## Increase counter
let myloop_counter+=1
done
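The submitter script above recovers the job ID by stripping words from the sbatch message with sed. A slightly stricter variant, sketched below, accepts only the message "Submitted batch job <id>" and treats anything else as a failed submission. The function name parse_job_id is made up for illustration, and the sample string stands in for real sbatch output so that the sketch runs without Slurm.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from sbatch's default message
# "Submitted batch job <id>"; return non-zero for any other text.
parse_job_id() {
    local msg="$1"
    if [[ "$msg" =~ ^Submitted\ batch\ job\ ([0-9]+) ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# Stand-in for the output of a real sbatch call.
sample="Submitted batch job 451750"
if jobID=$(parse_job_id "$sample"); then
    echo "next chain link would use: -d afterok:${jobID}"
else
    echo "submission failed: ${sample}" >&2
fi
```

sbatch also offers the option --parsable, which prints only the job ID (optionally followed by a cluster name) and may make such parsing unnecessary.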
LSDF Online Storage
On ForHLR you can use the LSDF Online Storage on the HPC cluster nodes for special cases. Please request this service separately (LSDF Storage Request). To mount the LSDF Online Storage on the compute nodes during the job runtime, the constraint flag "LSDF" has to be set.
a) add after the initial line of your script job.sh the line including the information about the LSDF Online Storage usage:
#SBATCH --constraint=LSDF
For example:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=120
#SBATCH --mem=200
#SBATCH --constraint=LSDF
or b) execute:
$ sbatch -p queue -n1 -t 2:00:00 --mem 200 -C LSDF job.sh
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
BeeOND (BeeGFS On-Demand)
BeeOND instances are integrated into the prolog and epilog scripts of the cluster batch system, Slurm. BeeOND can be used on the compute nodes during the job runtime by setting the constraint flag "BEEOND" (see sbatch Command Parameters):
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=BEEOND
After your job has started you can find the private on-demand file system in /mnt/odfs/$SLURM_JOB_ID directory. The mountpoint comes with three pre-configured directories:
#for small files (stripe count = 1)
/mnt/odfs/$SLURM_JOB_ID/stripe_1
#stripe count = 4
/mnt/odfs/$SLURM_JOB_ID/stripe_default
#stripe count = 8, 16 or 32, use these directories for medium sized and large files or when using MPI-IO
/mnt/odfs/$SLURM_JOB_ID/stripe_8, /mnt/odfs/$SLURM_JOB_ID/stripe_16 or /mnt/odfs/$SLURM_JOB_ID/stripe_32
If you request fewer nodes than the stripe count, the stripe count is limited to the number of nodes. For example, if you request only 8 nodes, the directory with stripe count 16 effectively has a stripe count of only 8.
The capacity of the private file system depends on the number of nodes. For each node you get 250 GByte.
!!! Be careful when creating large files: always use the directory with the max stripe count for large files. For example, if your largest file is 1.1 TByte, you have to use a stripe count larger than 4 (4 x 250 GByte = 1 TByte).
If you request 100 nodes for your job, the private file system is 100 * 250 Gbyte ~ 25 Tbyte (approx) capacity.
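The capacity and stripe count rules above can be written down as a small calculation. This is only a sketch: the 250 GByte per node and the stripe counts 1, 4, 8, 16 and 32 are taken from the text, while the function names beeond_capacity_gb and min_stripe_count are hypothetical.

```shell
#!/bin/bash
# Assumption taken from the text: each node contributes 250 GByte.
GB_PER_NODE=250

# Total capacity in GByte of the on-demand file system for <nodes> nodes.
beeond_capacity_gb() {
    echo $(( $1 * GB_PER_NODE ))
}

# Smallest pre-configured stripe count that can hold a single file of <size_gb>.
min_stripe_count() {
    local size_gb="$1" count
    for count in 1 4 8 16 32; do
        if (( count * GB_PER_NODE >= size_gb )); then
            echo "$count"
            return 0
        fi
    done
    return 1   # file is larger than 32 x 250 GByte
}

beeond_capacity_gb 100   # prints 25000
min_stripe_count 1100    # prints 8
```

Keep in mind that the effective stripe count cannot exceed the number of nodes in the job, so the result is only usable if at least that many nodes were requested.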
Recommendation:
The private file system uses its own metadata server, which is started on the first nodes. Depending on your application, the metadata server can consume a considerable amount of CPU power. Adding an extra node to your job may therefore improve the usability of the on-demand file system. Start your application with the MPI option:
mpirun -nolocal myapplication
With the -nolocal option the node where mpirun is initiated is not used for your application. This node is fully available for the meta data server of your requested on-demand file system.
Example job script:
#!/bin/bash
#very simple example on how to use a private on-demand file system
#SBATCH -N 10
#SBATCH --constraint=BEEOND
#create a workspace
ws_allocate myresults-$SLURM_JOB_ID 90
RESULTDIR=`ws_find myresults-$SLURM_JOB_ID`
#Set ENV variable to on-demand file system
ODFSDIR=/mnt/odfs/$SLURM_JOB_ID/stripe_16/
#start application and write results to on-demand file system
mpirun -nolocal myapplication -o $ODFSDIR/results
#Copy back data after your job application end
rsync -av $ODFSDIR/results $RESULTDIR
Start time of job or resources : squeue --start
The command squeue --start can be used by any user to display the estimated start time of a job. The estimate is based on a number of analysis types: historical usage, earliest available reservable resources, and priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
Access
By default, this command can be run by any user.
List of your submitted jobs : squeue
Displays information about active, pending and/or recently completed jobs. The command displays all own active and pending jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
Access
By default, this command can be run by any user.
Flags
Flag | Description |
---|---|
-l, --long | Report more of the available information for the selected jobs or job steps, subject to any constraints specified. |
Examples
squeue example on ForHLR I (Only your own jobs are displayed!).
$ squeue
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
382   multinode job_ompi ku8089 PD 0:00 4     (AssocGrpJobsLimit)
381   multinode job_ompi ku8089 R  0:19 4     fhbn[005-008]
380   multinode job_ompi ku8089 R  0:23 4     fhbn[001-004]
$ squeue -l
JOBID PARTITION NAME     USER   STATE   TIME TIME_LIMI NODES NODELIST(REASON)
382   multinode job_ompi ku8089 PENDING 0:00 1:00:00   4     (AssocGrpJobsLimit)
381   multinode job_ompi ku8089 RUNNING 0:42 1:00:00   4     fhbn[005-008]
380   multinode job_ompi ku8089 RUNNING 0:46 1:00:00   4     fhbn[001-004]
- The output of squeue shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
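Counting jobs per state from this output can be done with standard tools. The sketch below assumes the default column layout shown above, where the fifth column is the state (ST); the function name count_job_states is made up for illustration, and the here-document reuses the sample output so that the snippet runs without Slurm.

```shell
#!/bin/bash
# Hypothetical helper: count jobs per state from squeue-style output on stdin.
# Assumes the header is on line 1 and the state (ST) is in column 5.
count_job_states() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

count_job_states <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
382 multinode job_ompi ku8089 PD 0:00 4 (AssocGrpJobsLimit)
381 multinode job_ompi ku8089 R 0:19 4 fhbn[005-008]
380 multinode job_ompi ku8089 R 0:23 4 fhbn[001-004]
EOF
```

For the sample this reports one pending (PD) and two running (R) jobs. On the cluster the function would be fed with squeue | count_job_states.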
Shows free resources : sinfo_t_idle
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
Access
By default, this command can be used by any user or administrator.
Example
- The following command displays what resources are available for immediate use for the whole partition.
$ sinfo_t_idle
PARTITION  AVAIL TIMELIMIT  NODES STATE NODELIST
develop    up    30:00      0     n/a
singlenode up    3-00:00:00 0     n/a
multinode  up    3-00:00:00 0     n/a
fat        up    4-00:00:00 7     idle  fh1n[802-803,805,808-810,813]
login      up    infinite   0     n/a
service    up    infinite   0     n/a
slurm      up    infinite   0     n/a
transfer   up    infinite   0     n/a
headnode   up    infinite   0     n/a
- For the above example the request for 1 node in the partition fat can be run immediately.
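To pick out only the partitions that currently report idle nodes, the output can be filtered with awk. This is a sketch under the assumption that sinfo_t_idle keeps the column layout shown above (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST); the function name idle_partitions is hypothetical, and the here-document contains an excerpt of the sample output.

```shell
#!/bin/bash
# Hypothetical helper: print "<partition> <idle node count>" for every
# partition whose STATE column (5) is "idle" and whose NODES column (4) is > 0.
idle_partitions() {
    awk 'NR > 1 && $5 == "idle" && $4 > 0 { print $1, $4 }'
}

idle_partitions <<'EOF'
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
develop up 30:00 0 n/a
singlenode up 3-00:00:00 0 n/a
multinode up 3-00:00:00 0 n/a
fat up 4-00:00:00 7 idle fh1n[802-803,805,808-810,813]
EOF
```

For the sample this prints the single line "fat 7". On the cluster the function would be used as sinfo_t_idle | idle_partitions.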
Detailed job information : scontrol show job
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
Display the state of all your jobs in normal mode: scontrol show job
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
Access
- End users can use scontrol show job to view the status of their own jobs only.
Arguments
Option | Default | Description | Example |
---|---|---|---|
-d | (n/a) | Detailed mode | Example: Display the state with jobid 8370992 in detailed mode. scontrol -d show job 8370992 |
Scontrol show job Example
Here is an example from ForHLR I.
$ squeue # show my own jobs (here the userid is replaced!)
JOBID  PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
451750 multinode job_ompi ab1234 PD 0:00 4     (JobHeldAdmin)
$
$ # now, see what's up with my pending job with jobid 451750
$
$ scontrol show job 451750
JobId=451750 JobName=job_ompi.sh
   UserId=ab1234(8975) GroupId=fh1-project-devel(500376) MCS_label=N/A
   Priority=0 Nice=0 Account=fh1-scs QOS=(null)
   JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2018-11-30T14:40:22 EligibleTime=2018-11-30T14:40:22
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=multinode AllocNode:Sid=fh1n988:19636
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=4-4 NumCPUs=80 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=80,mem=4000,node=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/data3/project/fh1-project-devel/ab1234/Slurm/job_ompi.sh
   WorkDir=/pfs/data3/project/fh1-project-devel/ab1234/Slurm
   StdErr=/pfs/data3/project/fh1-project-devel/ab1234/Slurm/slurm-451750.out
   StdIn=/dev/null
   StdOut=/pfs/data3/project/fh1-project-devel/ab1234/Slurm/slurm-451750.out
   Power=
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
- Is the job still pending?
$ scontrol show job 451750 | grep -i pending
   JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
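Beyond grep, a single value can be pulled out of the Key=Value pairs. The following sketch splits the output at whitespace and prints the value for one key; the function name job_field is hypothetical, and the sample text is copied from the example above so that the snippet runs without Slurm.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one key from
# "scontrol show job" style Key=Value output read on stdin.
job_field() {
    local key="$1"
    # Put every Key=Value pair on its own line, then strip "key=" from the match.
    tr -s ' \t' '\n' | sed -n "s|^${key}=||p"
}

sample='   JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
   TRES=cpu=80,mem=4000,node=4'

echo "$sample" | job_field JobState   # prints PENDING
echo "$sample" | job_field TRES       # prints cpu=80,mem=4000,node=4
```

On the cluster this could be combined as scontrol show job <jobid> | job_field Reason. Values that themselves contain spaces are not handled by this simple split.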
Cancel Slurm Jobs
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
Canceling own jobs : scancel
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
Flag | Default | Description | Example |
---|---|---|---|
-i, --interactive | (n/a) | Interactive mode. | Cancel the job 987654 interactively. scancel -i 987654 |
-t, --state | (n/a) | Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED". | Cancel all jobs in state "PENDING". scancel -t "PENDING" |
Resource Managers
Batch Job (Slurm) Variables
The following environment variables of Slurm are added to your environment once your job has started (only an excerpt of the most important ones).
Environment | Brief explanation |
---|---|
SLURM_JOB_CPUS_PER_NODE | Number of processes per node dedicated to the job |
SLURM_JOB_NODELIST | List of nodes dedicated to the job |
SLURM_JOB_NUM_NODES | Number of nodes dedicated to the job |
SLURM_MEM_PER_NODE | Memory per node dedicated to the job |
SLURM_NPROCS | Total number of processes dedicated to the job |
SLURM_CLUSTER_NAME | Name of the cluster executing the job |
SLURM_CPUS_PER_TASK | Number of CPUs requested per task |
SLURM_JOB_ACCOUNT | Account name |
SLURM_JOB_ID | Job ID |
SLURM_JOB_NAME | Job Name |
SLURM_JOB_PARTITION | Partition/queue running the job |
SLURM_JOB_UID | User ID of the job's owner |
SLURM_SUBMIT_DIR | Job submit folder. The directory from which sbatch was invoked. |
SLURM_JOB_USER | User name of the job's owner |
SLURM_RESTART_COUNT | Number of times job has restarted |
SLURM_PROCID | Task ID (MPI rank) |
SLURM_NTASKS | The total number of tasks available for the job |
SLURM_STEP_ID | Job step ID |
SLURM_STEP_NUM_TASKS | Task count (number of MPI ranks) |
SLURM_JOB_CONSTRAINT | Job constraints |
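A job script can log some of these variables at start-up, which helps when comparing runs later. The sketch below prints a selection of them; the function name job_summary and the choice of variables are illustrative, and "n/a" is shown for anything that is unset, for example when the script is run outside a Slurm job.

```shell
#!/bin/bash
# Hypothetical helper: print a short summary of the job environment.
job_summary() {
    local var
    for var in SLURM_JOB_ID SLURM_JOB_NAME SLURM_JOB_PARTITION \
               SLURM_JOB_NUM_NODES SLURM_NTASKS SLURM_CPUS_PER_TASK; do
        # ${!var} is bash indirect expansion: the value of the variable named in $var.
        printf '%s=%s\n' "$var" "${!var:-n/a}"
    done
}

job_summary
```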
See also:
Job Exit Codes
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
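The 8 bit range can be observed directly in the shell, independent of Slurm. In the following sketch a child shell exits with a given value and the parent prints the exit code it actually receives; the function name observed_exit_code is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: run a child shell that exits with the given value
# and print the exit code seen by the parent.
observed_exit_code() {
    local rc=0
    bash -c "exit $1" || rc=$?
    echo "$rc"
}

observed_exit_code 0     # prints 0
observed_exit_code 3     # prints 3
observed_exit_code 300   # prints 44, i.e. 300 modulo 256
```

Values outside 0 to 255 are therefore not preserved, which is why a program should only use exit codes within that range to signal distinct errors.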
Displaying Exit Codes and Signals
SLURM displays a job's exit code in the output of the scontrol show job and the sview utility.
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon(:).
Submitting Termination Signal
Here is an example of how to 'save' the exit code of a program in a typical ForHLR submit script.
[...]
echo "### Calling YOUR_PROGRAM command ..."
mpirun -np 'NUMBER_OF_CORES' $YOUR_PROGRAM_BIN_DIR/runproc ... (options) 2>&1
# Save the exit code immediately after the program call
exit_code=$?
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable ${YOUR_PROGRAM_BIN_DIR}/runproc finished with exit code ${exit_code}"
[...]
- Do not use 'time' mpirun! The exit code will be the one submitted by the first (time) program.
- You do not need an exit $exit_code in the scripts.