BwForCluster JUSTUS 2 Slurm HOWTO: Difference between revisions
||
(157 intermediate revisions by 4 users not shown) | |||
Line 1: | Line 1: | ||
{{Justus2}} |
{{Justus2}} |
||
This is a collection of howtos and convenient Slurm commands for JUSTUS 2. |
|||
Slurm Howto |
|||
= PREFACE = |
|||
This is a collection of howtos and convenient Slurm commands I initially |
|||
wrote for internal use at Ulm only. Scripts and commands have been tested |
|||
within our Slurm test environment at JUSTUS (running Slurm 19.05 at the |
|||
moment). |
|||
Users may find this collection useful as well. Especially sections 1 - 4 may |
|||
be a good starting point for migration from Moab to Slurm. |
|||
Some commands behave slightly differently depending on whether they are executed |
Some commands behave slightly differently depending on whether they are executed |
||
by a system administrator or by a regular user |
by a system administrator or by a regular user, as Slurm prevents regular users from accessing critical system information and viewing job and usage information of other users. |
||
privacy reasons, Slurm prevents regular users from accessing critical system information and from viewing job and usage information of any user other than themselves. |
|||
= GENERAL INFORMATION = |
|||
This applies in particular to privileged commands from Section 5, which are |
|||
predominantly the responsibility of the system administrators. |
|||
== How to find a general quick start user guide? == |
|||
= GENERAL = |
|||
https://slurm.schedmd.com/quickstart.html |
|||
== How to find Slurm FAQ? == |
== How to find Slurm FAQ? == |
||
https://slurm.schedmd.com/faq.html |
https://slurm.schedmd.com/faq.html |
||
== How to find a Slurm cheat sheet? == |
== How to find a Slurm cheat sheet? == |
||
https://slurm.schedmd.com/pdfs/summary.pdf |
https://slurm.schedmd.com/pdfs/summary.pdf |
||
== How to find Slurm tutorials? == |
== How to find Slurm tutorials? == |
||
Line 36: | Line 24: | ||
https://slurm.schedmd.com/tutorials.html |
https://slurm.schedmd.com/tutorials.html |
||
== How to get more information on Slurm? == |
|||
== How to get more information? == |
|||
(Almost) every Slurm command has a man page. Use it. |
(Almost) every Slurm command has a man page. Use it. |
||
Online versions: https://slurm.schedmd.com/man_index.html |
Online versions: https://slurm.schedmd.com/man_index.html |
||
== How to find hardware specific details about JUSTUS 2? == |
|||
See our Wiki page: [[Hardware and Architecture (bwForCluster JUSTUS 2)|Hardware and Architecture]] |
|||
= JOB SUBMISSION = |
= JOB SUBMISSION = |
||
== How to submit a serial batch job? == |
|||
Use [https://slurm.schedmd.com/sbatch.html sbatch] command: |
|||
<pre>$ sbatch <job-script> </pre> |
|||
Sample job script template for serial job: |
|||
<source lang="bash"> |
|||
#!/bin/bash |
|||
# Allocate one node |
|||
#SBATCH --nodes=1 |
|||
# Number of program instances to be executed |
|||
#SBATCH --ntasks-per-node=1 |
|||
# 8 GB memory required per node |
|||
#SBATCH --mem=8G |
|||
# Maximum run time of job |
|||
#SBATCH --time=1:00:00 |
|||
# Give job a reasonable name |
|||
#SBATCH --job-name=serial_job |
|||
# File name for standard output (%j will be replaced by job id) |
|||
#SBATCH --output=serial_job-%j.out |
|||
# File name for error output |
|||
#SBATCH --error=serial_job-%j.err |
|||
# Load software modules as needed, e.g. |
|||
# module load foo/bar |
|||
# Run serial program |
|||
./my_serial_program |
|||
</source> |
|||
Sample code for serial program: [[Media:Hello_serial.c | Hello_serial.c]] |
|||
'''Notes:''' |
|||
* --nodes=1 and --ntasks-per-node=1 may be replaced by --ntasks=1. |
|||
* If not specified, stdout and stderr are both written to slurm-%j.out. |
|||
== How to find working sample scripts for my program? == |
|||
Most software modules for applications provide working sample batch scripts. |
|||
Check with [[Software_Modules_Lmod#Module_specific_help | module help]] command, e.g. |
|||
<pre> |
|||
$ module help chem/vasp # display module help for VASP |
|||
$ module help math/matlab # display module help for Matlab |
|||
</pre> |
|||
== How to harden job scripts against common errors? == |
|||
The bash shell provides several options that support users in disclosing hidden bugs and writing safer job scripts. |
|||
In order to activate these safeguard settings users can insert the following lines in their scripts (after all #SBATCH directives): |
|||
<source lang="bash"> |
|||
[...] |
|||
set -o errexit   # (or set -e) causes the batch script to exit immediately when a command fails |
|||
set -o pipefail  # makes a pipeline fail if any command in it fails (not only the last one), so that errexit also catches errors inside pipelines |
|||
set -o nounset   # (or set -u) causes the script to treat unset variables as an error and exit immediately |
|||
[...] |
|||
</source> |
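The effect of these options can be tried out on a login node without submitting a job. The following is a minimal sketch, not part of any job script template: each snippet runs in its own bash process so that a failure does not terminate the demo itself. The variable names (status_default etc.) and UNDEFINED_VARIABLE_FOR_DEMO are chosen for illustration only.

```shell
#!/bin/bash
# Illustration only: observe how the safeguard options change exit behavior.

# Without pipefail, the exit status of a pipeline is that of its last command,
# so the failing "false" goes unnoticed and the script continues.
status_default=0
bash -c 'false | true; echo "reached"' >/dev/null || status_default=$?

# With errexit and pipefail, the failing pipeline aborts the script.
status_pipefail=0
bash -c 'set -o errexit; set -o pipefail; false | true; echo "not reached"' >/dev/null || status_pipefail=$?

# With nounset, expanding an unset variable is treated as an error.
status_nounset=0
bash -c 'set -o nounset; echo "$UNDEFINED_VARIABLE_FOR_DEMO"' >/dev/null 2>&1 || status_nounset=$?

echo "default=$status_default pipefail=$status_pipefail nounset=$status_nounset"
```

Only the first snippet is expected to report exit status 0; the other two should report a non-zero status.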
|||
== How to submit an interactive job? == |
== How to submit an interactive job? == |
||
Use [https://slurm.schedmd.com/ |
Use [https://slurm.schedmd.com/salloc.html salloc] command, e.g.: |
||
<pre>$ salloc --nodes=1 --ntasks-per-node=8</pre> |
|||
'''Note:''' |
|||
In Slurm versions prior to 20.11, [https://slurm.schedmd.com/srun.html srun] was the recommended way to launch interactive jobs, e.g.: |
|||
<pre>$ srun --nodes=1 --ntasks-per-node=8 --pty bash </pre> |
<pre>$ srun --nodes=1 --ntasks-per-node=8 --pty bash </pre> |
||
Although this still works with current Slurm versions, it is considered '''deprecated''' as it may cause issues when launching additional job steps from within the interactive job environment. Use the [https://slurm.schedmd.com/salloc.html salloc] command instead. |
|||
== How to enable X11 forwarding for an interactive job? == |
== How to enable X11 forwarding for an interactive job? == |
||
Use --x11 flag, e.g. |
Use '--x11' flag, e.g. |
||
<pre> |
<pre> |
||
$ |
$ salloc --nodes=1 --ntasks-per-node=8 --x11 # run shell with X11 forwarding enabled |
||
$ srun --nodes=1 --ntasks-per-node=8 --pty --x11 xterm # directly launch terminal window on node |
|||
</pre> |
</pre> |
||
Line 63: | Line 120: | ||
* For X11 forwarding to work, you must also enable X11 forwarding for your ssh login from your local computer to the cluster, i.e.: |
* For X11 forwarding to work, you must also enable X11 forwarding for your ssh login from your local computer to the cluster, i.e.: |
||
<pre>local> ssh -X <username>@justus2.uni-ulm.de</pre> |
<pre>local> ssh -X <username>@justus2.uni-ulm.de</pre> |
||
== How to submit a batch job? == |
|||
Use [https://slurm.schedmd.com/sbatch.html sbatch] command: |
|||
<pre> $ sbatch <job-script> </pre> |
|||
== How to convert Moab batch job scripts to Slurm? == |
== How to convert Moab batch job scripts to Slurm? == |
||
Line 146: | Line 195: | ||
| Number of processes || $MOAB_PROCCOUNT || $PBS_TASKNUM || $SLURM_NTASKS |
| Number of processes || $MOAB_PROCCOUNT || $PBS_TASKNUM || $SLURM_NTASKS |
||
|- |
|- |
||
| Requested tasks per node || - || $PBS_NUM_PPN || $SLURM_NTASKS_PER_NODE |
| Requested tasks per node || --- || $PBS_NUM_PPN || $SLURM_NTASKS_PER_NODE |
||
|- |
|- |
||
| Requested CPUs per task || --- || --- || $SLURM_CPUS_PER_TASK |
| Requested CPUs per task || --- || --- || $SLURM_CPUS_PER_TASK |
||
Line 168: | Line 217: | ||
* See [https://slurm.schedmd.com/sbatch.html sbatch] man page for a complete list of flags and environment variables. |
* See [https://slurm.schedmd.com/sbatch.html sbatch] man page for a complete list of flags and environment variables. |
||
== How to view information about submitted jobs? == |
|||
Use [https://slurm.schedmd.com/squeue.html squeue] command, e.g.: |
|||
<pre> |
|||
$ squeue # all users (admins only) |
|||
$ squeue -u <username> # jobs of specific user |
|||
$ squeue -t PENDING # pending jobs only |
|||
</pre> |
|||
Note: The output format of [https://slurm.schedmd.com/squeue.html squeue] (and most other Slurm commands) is highly configurable to your needs. Look for the --format or --Format options. |
|||
== How to cancel jobs? == |
|||
Use [https://slurm.schedmd.com/scancel.html scancel] command, e.g. |
|||
<pre> |
|||
$ scancel <jobid> # cancel specific job |
|||
$ scancel <jobid>_<index> # cancel indexed job in a job array |
|||
$ scancel -u <username> # cancel all jobs of specific user |
|||
$ scancel -t PENDING # cancel pending jobs |
|||
</pre> |
|||
== How to submit a serial batch job? == |
|||
Sample job script template for serial job: |
|||
<source lang="bash"> |
|||
#!/bin/bash |
|||
# Allocate one node |
|||
#SBATCH --nodes=1 |
|||
# Number of program instances to be executed |
|||
#SBATCH --tasks-per-node=1 |
|||
# 8 GB memory required per node |
|||
#SBATCH --mem=8G |
|||
# Maximum run time of job |
|||
#SBATCH --time=1:00:00 |
|||
# Give job a reasonable name |
|||
#SBATCH --job-name=serial_job |
|||
# File name for standard output (%j will be replaced by job id) |
|||
#SBATCH --output=serial_job-%j.out |
|||
# File name for error output |
|||
#SBATCH --error=serial_job-%j.err |
|||
# Load software modules as needed, e.g. |
|||
# module load foo/bar |
|||
# Run serial program |
|||
./my_serial_program |
|||
</source> |
|||
Sample code for serial program: "hello_serial.c":https://projects.uni-konstanz.de/attachments/download/16815/hello_serial.c |
|||
'''Notes:''' |
|||
* --nodes=1 and --tasks-per-node=1 may be replaced by --ntasks=1. |
|||
* If not specified, stdout and stderr are both written to slurm-%j.out. |
|||
== How to emulate Moab output file names? == |
== How to emulate Moab output file names? == |
||
Line 237: | Line 226: | ||
#SBATCH --error="%x.e%j" |
#SBATCH --error="%x.e%j" |
||
</pre> |
</pre> |
||
== How to pass command line arguments to the job script? == |
== How to pass command line arguments to the job script? == |
||
Line 267: | Line 255: | ||
'''Notes:''' |
'''Notes:''' |
||
* Do '''not''' add any unit (such as --gres=scratch:100G). This |
* Do '''not''' add any unit (such as --gres=scratch:100G). This would be treated as requesting an amount of 10^9 * 100GB of scratch space. |
||
* Multinode jobs get nnn GB of local scratch space on every node of the job. |
* Multinode jobs get nnn GB of local scratch space on every node of the job. |
||
* Environment variable $SCRATCH will point to |
* Environment variable '''$SCRATCH''' will point to |
||
** /scratch/<user>.<jobid> when local scratch has been requested |
** /scratch/<user>.<jobid> when local scratch has been requested. This will be on locally attached SSD/NVMe devices. |
||
** /tmp/<user>.<jobid> when no local scratch has |
** /tmp/<user>.<jobid> when no local scratch has been requested. This will be in memory and, thus, be limited in size. |
||
* Environment variable $TMPDIR always points |
* Environment variable '''$TMPDIR''' always points to /tmp/<user>.<jobid>. This will always be in memory and, thus, limited in size. |
||
* For backward compatibility environment variable $RAMDISK always points to /tmp/<user>.<jobid> |
* For backward compatibility environment variable $RAMDISK always points to /tmp/<user>.<jobid> |
||
Line 283: | Line 271: | ||
* Data written to $TMPDIR will always count against allocated memory. |
* Data written to $TMPDIR will always count against allocated memory. |
||
* Data written to local scratch space will automatically be removed at the end of the job. |
|||
== How to request GPGPU nodes at job submission? == |
|||
Use '--gres=gpu:<count>' option to allocate 1 or 2 GPUs per node for the entire job. |
|||
Example: '--gres=gpu:1' will allocate one GPU per node for this job. |
|||
'''Notes:''' |
|||
* GPGPU nodes are equipped with two Nvidia V100S cards |
|||
* Environment variables $CUDA_VISIBLE_DEVICES, $SLURM_JOB_GPUS and $GPU_DEVICE_ORDINAL will denote card(s) allocated for the job. |
|||
* CUDA Toolkit is available as software module devel/cuda. |
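Inside a job script it can be useful to check how many GPUs were actually allocated before starting the application. The following is a minimal sketch; the helper name count_visible_gpus is hypothetical, and it only assumes that $CUDA_VISIBLE_DEVICES holds a comma-separated device list as described above.

```shell
#!/bin/bash
# Illustration only: count the GPUs listed in a comma-separated device list.
count_visible_gpus() {
    # $1: device list such as "0,1"; an empty string means no GPU allocated
    if [ -z "$1" ]; then
        echo 0
        return
    fi
    printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

# Outside of a GPU job the variable is usually unset, which counts as zero.
echo "GPUs visible to this job: $(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```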
|||
== How to clean-up or save files before a job times out? == |
|||
Possibly you would like to clean up the work directory or save intermediate result files in case a job times out. |
|||
The following sample script may serve as a blueprint for implementing a pre-termination function to perform clean-up or file recovery actions. |
|||
<source lang="bash"> |
|||
#!/bin/bash |
|||
# Allocate one node |
|||
#SBATCH --nodes=1 |
|||
# Number of program instances to be executed |
|||
#SBATCH --ntasks-per-node=1 |
|||
# 2 GB memory required per node |
|||
#SBATCH --mem=2G |
|||
# Request 10 GB local scratch space |
|||
#SBATCH --gres=scratch:10 |
|||
# Maximum run time of job |
|||
#SBATCH --time=10:00 |
|||
# Send the USR1 signal 120 seconds before end of time limit |
|||
#SBATCH --signal=B:USR1@120 |
|||
# Give job a reasonable name |
|||
#SBATCH --job-name=signal_job |
|||
# File name for standard output (%j will be replaced by job id) |
|||
#SBATCH --output=signal_job-%j.out |
|||
# File name for error output |
|||
#SBATCH --error=signal_job-%j.err |
|||
# Define the signal handler function |
|||
# Note: This is not executed here, but rather when the associated |
|||
# signal is received by the shell. |
|||
finalize_job() |
|||
{ |
|||
# Do whatever cleanup you want here. In this example we copy |
|||
# output file(s) back to $SLURM_SUBMIT_DIR, but you may implement |
|||
# your own job finalization code here. |
|||
echo "function finalize_job called at `date`" |
|||
cd "$SCRATCH" |
|||
mkdir -vp "$SLURM_SUBMIT_DIR"/results |
|||
tar czvf "$SLURM_SUBMIT_DIR"/results/${SLURM_JOB_ID}.tgz output*.txt |
|||
exit |
|||
} |
|||
# Call finalize_job function as soon as we receive USR1 signal |
|||
trap 'finalize_job' USR1 |
|||
# Copy input files for this job to the scratch directory (if needed). |
|||
# Note: Environment variable $SCRATCH always points to a scratch directory |
|||
# automatically created for this job. Environment variable $SLURM_SUBMIT_DIR |
|||
# points to the path where this script was submitted from. |
|||
# Example: |
|||
# cp -v "$SLURM_SUBMIT_DIR"/input*.txt "$SCRATCH" |
|||
# Change working directory to local scratch directory |
|||
cd "$SCRATCH" |
|||
# Load software modules as needed, e.g. |
|||
# module load foo/bar |
|||
# This is where the actual work is done. In this case we just create |
|||
# a sample output file for 900 (=15*60) seconds, but since we asked |
|||
# Slurm for 600 seconds only, it will not be able to finish within this |
|||
# wall time. |
|||
# Note: It is important to run this task in the background |
|||
# by placing the & symbol at the end. Otherwise the signal handler |
|||
# would not be executed until that process has finished, which is not |
|||
# what we want. |
|||
(for i in `seq 15`; do echo "Hello World at `date +%H:%M:%S`."; sleep 60; done) >output.txt 2>&1 & |
|||
# Note: The command above is just for illustration. Normally you would just run |
|||
# my_program >output.txt 2>&1 & |
|||
# Tell the shell to wait for background task(s) to finish. |
|||
# Note: This is important because otherwise the parent shell |
|||
# (this script) would proceed (and terminate) without waiting for |
|||
# background task(s) to finish. |
|||
wait |
|||
# If we get here, the job did not time out but finished in time. |
|||
# Release user defined signal handler for USR1 |
|||
trap - USR1 |
|||
# Do regular cleanup and save files. In this example we simply call |
|||
# the same function that we defined as a signal handler above, but you |
|||
# may implement your own code here. |
|||
finalize_job |
|||
exit |
|||
</source> |
|||
'''Notes:''' |
|||
* The number of seconds specified in the --signal option must be long enough to cover the runtime of the pre-termination function and must not exceed 65535 seconds. |
|||
* Due to the resolution of event handling by Slurm, the signal may be sent a little earlier than specified. |
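The interplay of trap, background execution and wait can be tested locally before submitting a real job. The following is a minimal sketch that emulates the pre-termination signal without Slurm: a helper subshell stands in for the scheduler and sends USR1 to the script after one second. The names on_usr1, handler_called and wait_status are chosen for illustration only.

```shell
#!/bin/bash
# Illustration only: emulate the time-limit signal locally, without Slurm.
handler_called=no

on_usr1() {
    # In a real job script this is where result files would be saved.
    handler_called=yes
}
trap 'on_usr1' USR1

# The long-running "work" runs in the background so the shell can react to signals.
sleep 30 &
work_pid=$!

# Stand-in for Slurm: deliver USR1 to this shell after one second.
( sleep 1; kill -USR1 $$ ) &

# wait returns early with a status greater than 128 when a trapped signal arrives;
# the handler is executed right afterwards.
wait_status=0
wait "$work_pid" || wait_status=$?

# Stop the unfinished work and release the handler.
kill "$work_pid" 2>/dev/null || true
trap - USR1
echo "handler_called=$handler_called wait_status=$wait_status"
```

If the work were started in the foreground instead, the handler would only run after the 30 seconds had elapsed.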
|||
== How to submit a multithreaded batch job? == |
== How to submit a multithreaded batch job? == |
||
Line 293: | Line 392: | ||
#SBATCH --nodes=1 |
#SBATCH --nodes=1 |
||
# Number of program instances to be executed |
# Number of program instances to be executed |
||
#SBATCH -- |
#SBATCH --ntasks-per-node=1 |
||
# Number of cores per program instance |
# Number of cores per program instance |
||
#SBATCH --cpus-per-task=8 |
#SBATCH --cpus-per-task=8 |
||
Line 317: | Line 416: | ||
</source> |
</source> |
||
Sample code for multithreaded program: |
Sample code for multithreaded program: [[Media:Hello_openmp.c | Hello_openmp.c]] |
||
'''Notes:''' |
'''Notes:''' |
||
* In our configuration each physical core is considered a "CPU". |
* In our configuration each physical core is considered a "CPU". |
||
* On JUSTUS 2 it is recommended to specify a number of cores per task ('--cpus-per-task') that is either an integer divisor of 24 or (at most) 48. |
|||
* Required memory can also be specified per allocated CPU with '--mem-per-cpu' option. |
* Required memory can also be specified per allocated CPU with '--mem-per-cpu' option. |
||
* The '--mem' and '--mem-per-cpu' options are mutually exclusive. |
* The '--mem' and '--mem-per-cpu' options are mutually exclusive. |
||
* In terms of core allocation '--tasks-per-node=1' or '--ntasks=1' together with '--cpus-per-task=8' is almost equivalent to '--tasks-per-node=8' or '--ntasks=8' and omitting '--cpus-per-task=8'. However, there are subtle differences when multiple tasks are spawned within one job by means of srun command. |
|||
** See: https://stackoverflow.com/questions/39186698/what-does-the-ntasks-or-n-tasks-does-in-slurm |
|||
== How to submit an array job? == |
== How to submit an array job? == |
||
Use |
Use [https://slurm.schedmd.com/sbatch.html#OPT_array -a] (or [https://slurm.schedmd.com/sbatch.html#OPT_array --array]) option, e.g. |
||
<pre> sbatch -a 1-16%8 ...</pre> |
<pre>$ sbatch -a 1-16%8 ...</pre> |
||
This will submit 16 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 16, but will limit the number of simultaneously running tasks from this job array to 8. |
This will submit 16 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 16, but will limit the number of simultaneously running tasks from this job array to 8. |
||
Line 341: | Line 439: | ||
<source lang="bash"> |
<source lang="bash"> |
||
#!/bin/bash |
|||
# Number of cores per individual array task |
# Number of cores per individual array task |
||
#SBATCH --ntasks=1 |
#SBATCH --ntasks=1 |
||
Line 366: | Line 465: | ||
* Every sub job in an array job will have its own unique environment variable $SLURM_JOB_ID. Environment variable $SLURM_ARRAY_JOB_ID will be set to the job ID of the first array task for all tasks. |
* Every sub job in an array job will have its own unique environment variable $SLURM_JOB_ID. Environment variable $SLURM_ARRAY_JOB_ID will be set to the job ID of the first array task for all tasks. |
||
* The remaining options in the sample job script are the same as the options used in other, non-array jobs. In the example above, we are requesting that each array task be allocated 1 CPU (--ntasks=1) and 4 GB of memory (--mem=4G) for up to one hour (--time=01:00:00). |
|||
* More information: https://slurm.schedmd.com/job_array.html |
* More information: https://slurm.schedmd.com/job_array.html |
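A common pattern is to let each array task pick its own input based on $SLURM_ARRAY_TASK_ID. The following is a minimal sketch under the assumption that the inputs are listed in a text file, one per line, with line N belonging to array task N. The helper name select_input and the file names are hypothetical.

```shell
#!/bin/bash
# Illustration only: select one input per array task.

# Print line number $1 of the list file $2.
select_input() {
    sed -n "${1}p" "$2"
}

# Hypothetical list of inputs, one per line.
input_list="$(mktemp)"
printf '%s\n' sample_a.dat sample_b.dat sample_c.dat > "$input_list"

# Outside of Slurm, SLURM_ARRAY_TASK_ID is unset, so default to 1 for testing.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
input_file="$(select_input "$task_id" "$input_list")"
rm -f "$input_list"

echo "Array task ${task_id} would process: ${input_file}"
```

In a real array job the list file would be created once before submission, and the range given to the array option would match its number of lines.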
||
== How to delay the start of a job? == |
|||
Use [https://slurm.schedmd.com/sbatch.html#OPT_begin -b] (or [https://slurm.schedmd.com/sbatch.html#OPT_begin --begin]) option in order to defer the allocation of the job until the specified time. |
|||
Examples: |
|||
<pre> |
|||
sbatch --begin=20:00 ... # job can start after 8 p.m. |
|||
sbatch --begin=now+1hour ... # job can start 1 hour after submission |
|||
sbatch --begin=teatime ... # job can start at teatime (4 p.m.) |
|||
sbatch --begin=2023-12-24T20:00:00 ... # job can start after specified date/time |
|||
</pre> |
|||
== How to submit dependency (chain) jobs? == |
|||
Use [https://slurm.schedmd.com/sbatch.html#OPT_dependency -d] (or [https://slurm.schedmd.com/sbatch.html#OPT_dependency --dependency]) option, e.g. |
|||
<pre>$ sbatch -d afterany:123456 ...</pre> |
|||
This will defer the submitted job until the specified job 123456 has terminated. |
|||
Slurm supports a number of different dependency types, e.g.: |
|||
<pre> |
|||
-d after:123456 # job can begin execution after the specified job has begun execution |
|||
-d afterany:123456 # job can begin execution after the specified job has finished |
|||
-d afternotok:123456 # job can begin execution after the specified job has failed (exit code not equal zero) |
|||
-d afterok:123456 # job can begin execution after the specified job has successfully finished (exit code zero) |
|||
-d singleton # job can begin execution after all previously launched jobs with the same job name and user have finished |
|||
</pre> |
|||
'''Note:''' Multiple jobs can be specified by separating their job ids by colon characters (:), e.g. |
|||
<pre> $ sbatch -d afterany:123456:123457 ... </pre> |
|||
This will defer the submitted job until the specified jobs 123456 and 123457 have both finished. |
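Chain jobs can also be submitted from a small wrapper script that captures the job ID of each submission. The following is a minimal sketch; the helper dependency_spec and the job script names step1.sh and step2.sh are hypothetical. It relies on the --parsable option of sbatch, which prints only the job ID (optionally followed by a semicolon and the cluster name). Where sbatch or the scripts are not available, the sketch only prints a dependency specification.

```shell
#!/bin/bash
# Illustration only: build a dependency specification and chain two jobs.

# Usage: dependency_spec <type> <jobid> [<jobid> ...]
# Prints e.g. "afterok:123456:123457".
dependency_spec() {
    dep_type="$1"
    shift
    spec="$dep_type"
    for id in "$@"; do
        spec="${spec}:${id}"
    done
    printf '%s\n' "$spec"
}

if command -v sbatch >/dev/null 2>&1 && [ -f step1.sh ] && [ -f step2.sh ]; then
    # Submit the first job and keep only the numeric job ID.
    first_id="$(sbatch --parsable step1.sh | cut -d';' -f1)"
    # The second job starts only after the first one finished successfully.
    sbatch -d "$(dependency_spec afterok "$first_id")" step2.sh
else
    # Without Slurm, just show what would be passed to the -d option.
    dependency_spec afterok 123456 123457
fi
```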
|||
== How to deal with invalid job dependencies? == |
|||
Use [https://slurm.schedmd.com/sbatch.html#OPT_kill-on-invalid-dep --kill-on-invalid-dep=yes] option in order to automatically terminate jobs which can never run due to invalid dependencies. By default the job stays pending with reason 'DependencyNeverSatisfied' to allow review and appropriate action by the user. |
|||
'''Note:''' A job dependency may also become invalid if a job has been submitted with '-d afterok:<jobid>' but the specified dependency job has failed, e.g. because it timed out (i.e. exceeded its wall time limit). |
|||
== How to submit an MPI batch job? == |
== How to submit an MPI batch job? == |
||
Line 380: | Line 524: | ||
#SBATCH --nodes=2 |
#SBATCH --nodes=2 |
||
# Number of program instances to be executed |
# Number of program instances to be executed |
||
#SBATCH -- |
#SBATCH --ntasks-per-node=48 |
||
# Allocate 32 GB memory per node |
# Allocate 32 GB memory per node |
||
#SBATCH --mem=32gb |
#SBATCH --mem=32gb |
||
Line 424: | Line 568: | ||
</source> |
</source> |
||
Sample code for MPI program: |
Sample code for MPI program: [[Media:Hello_mpi.c | Hello_mpi.c]] |
||
'''Notes''' |
'''Notes''' |
||
* SchedMD recommends using srun and many (most?) sites do so as well. The rationale is that srun is more tightly integrated with the scheduler and provides more consistent and reliable resource tracking and accounting for individual jobs and job steps. mpirun may behave differently for different MPI implementations and versions. There are reports that claim "strange behavior" of mpirun especially when using task affinity and core binding. Using srun is supposed to resolve these issues and is therefore highly recommended. |
* SchedMD recommends using srun and many (most?) sites do so as well. The rationale is that srun is more tightly integrated with the scheduler and provides more consistent and reliable resource tracking and accounting for individual jobs and job steps. mpirun may behave differently for different MPI implementations and versions. There are reports that claim "strange behavior" of mpirun especially when using task affinity and core binding. Using srun is supposed to resolve these issues and is therefore highly recommended. |
||
* Do not run batch jobs that launch a large number (hundreds or thousands) of short-running (a few minutes or less) MPI programs, e.g. from a shell loop. Every single MPI invocation generates its own job step and sends remote procedure calls to the Slurm controller server. This can degrade performance for both Slurm and the application, especially if many such jobs happen to run at the same time. Jobs of that kind can even get stuck without showing any further activity until they hit the wall time limit. For high throughput computing (e.g. processing a large number of files, with every task running independently of the others and only for a very short time), consider a more appropriate parallelization paradigm that invokes independent serial (non-MPI) processes in parallel. This approach is sometimes referred to as a "[https://en.wikipedia.org/wiki/Embarrassingly_parallel pleasingly parallel]" workload. GNU Parallel is a shell tool that facilitates executing serial tasks in parallel. On JUSTUS 2 this tool is available as the software module "system/parallel". |
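The idea of running many short serial tasks inside one allocation, without creating a job step per task, can be sketched with standard tools. The example below deliberately uses 'xargs -P' as a stand-in for GNU Parallel so that it can be tried anywhere; the task itself (writing one small result file per input number) and the degree of parallelism (4) are made up for illustration.

```shell
#!/bin/bash
# Illustration only: run many short serial tasks concurrently within ONE job,
# instead of launching one MPI job step per task.
out_dir="$(mktemp -d)"

# Stand-in for a short serial program: writes one result file per input.
# At most 4 tasks run at the same time; in a real job this would be the
# number of cores allocated to the job.
seq 1 20 | xargs -P 4 -I{} sh -c 'echo "processed input $1" > "$2/result_$1.txt"' _ {} "$out_dir"

n_results="$(ls "$out_dir" | wc -l | tr -d ' ')"
echo "Produced $n_results result files in $out_dir"
```

None of these tasks contacts the Slurm controller, so the number of tasks does not affect the scheduler.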
|||
== How to submit a hybrid MPI/OpenMP job? == |
== How to submit a hybrid MPI/OpenMP job? == |
||
Line 439: | Line 584: | ||
#SBATCH --nodes=4 |
#SBATCH --nodes=4 |
||
# Number of MPI instances (ranks) to be executed per node |
# Number of MPI instances (ranks) to be executed per node |
||
#SBATCH -- |
#SBATCH --ntasks-per-node=2 |
||
# Number of threads per MPI instance |
# Number of threads per MPI instance |
||
#SBATCH --cpus-per-task= |
#SBATCH --cpus-per-task=24 |
||
# Allocate 8 GB memory per node |
# Allocate 8 GB memory per node |
||
#SBATCH --mem=8gb |
#SBATCH --mem=8gb |
||
Line 465: | Line 610: | ||
</source> |
</source> |
||
Sample code for hybrid program: |
Sample code for hybrid program: [[Media:Hello_hybrid.c | Hello_hybrid.c]] |
||
'''Notes:''' |
'''Notes:''' |
||
Line 473: | Line 618: | ||
== How to request specific node(s) at job submission? == |
== How to request specific node(s) at job submission? == |
||
Use |
Use [https://slurm.schedmd.com/sbatch.html#OPT_nodelist -w] (or [https://slurm.schedmd.com/sbatch.html#OPT_nodelist --nodelist]) option, e.g.: |
||
<pre>$ sbatch -w <node1>,<node2> ...</pre> |
<pre>$ sbatch -w <node1>,<node2> ...</pre> |
||
Also see |
Also see [https://slurm.schedmd.com/sbatch.html#OPT_nodefile -F] (or [https://slurm.schedmd.com/sbatch.html#OPT_nodefile --nodefile]) option. |
||
== How to exclude specific nodes from job? == |
== How to exclude specific nodes from job? == |
||
Use |
Use [https://slurm.schedmd.com/sbatch.html#OPT_exclude -x] (or [https://slurm.schedmd.com/sbatch.html#OPT_exclude --exclude]) option, e.g.: |
||
<pre>$ sbatch -x <node1>,<node2> ...</pre> |
<pre>$ sbatch -x <node1>,<node2> ...</pre> |
||
== How to get exclusive jobs? == |
== How to get exclusive jobs? == |
||
Line 503: | Line 646: | ||
* Depending on configuration, exclusive=user may (and probably will) be the default node access policy on JUSTUS 2. |
* Depending on configuration, exclusive=user may (and probably will) be the default node access policy on JUSTUS 2. |
||
== How to submit batch job without job script? == |
|||
Use [https://slurm.schedmd.com/sbatch.html#OPT_wrap --wrap] option. |
|||
== How to show job script of a running job? == |
|||
Example: |
|||
Use [https://slurm.schedmd.com/scontrol.html scontrol] command: |
|||
<pre>$ sbatch --nodes=2 --ntasks-per-node=16 --wrap "sleep 600"</pre> |
|||
'''Note:''' May be useful for testing purposes. |
|||
= JOB MONITORING AND CONTROL = |
|||
== How to prevent Slurm performance degradation? == |
|||
Almost every invocation of a Slurm client command (e.g. squeue, sacct, sprio or sshare) sends a remote procedure call (RPC) to the Slurm control daemon and/or database. |
|||
If enough remote procedure calls come in at once, this can result in a degradation of performance of the Slurm services for all users, possibly resulting in a denial of service. |
|||
Therefore, '''do not run Slurm client commands that send remote procedure calls from loops in shell scripts or other programs''' (such as 'watch squeue'). Always limit calls to squeue, sstat, sacct etc. to the minimum necessary for the information you are trying to gather. |
|||
Slurm does collect RPC counts and timing statistics by message type and user for diagnostic purposes. |
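One way to keep the number of remote procedure calls low is to query the queue once and evaluate the saved output locally. The following is a minimal sketch; the helper count_state is hypothetical, and the sample data used when squeue is not available is made up so that the sketch can be tried outside the cluster.

```shell
#!/bin/bash
# Illustration only: take ONE snapshot of the queue and evaluate it locally,
# instead of calling squeue once per question or in a loop.
snapshot="$(mktemp)"

if command -v squeue >/dev/null 2>&1; then
    # One RPC: job ID and state of all own jobs, without header line.
    squeue --me --noheader --format="%i %T" > "$snapshot"
else
    # Made-up sample data so the sketch can be tried without Slurm.
    printf '%s\n' "1001 RUNNING" "1002 PENDING" "1003 PENDING" > "$snapshot"
fi

# Count jobs in state $1 from snapshot file $2 (no further RPCs needed).
count_state() {
    awk -v state="$1" '$2 == state { n++ } END { print n + 0 }' "$2"
}

n_running="$(count_state RUNNING "$snapshot")"
n_pending="$(count_state PENDING "$snapshot")"
echo "running=$n_running pending=$n_pending"
rm -f "$snapshot"
```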
|||
== How to view information about submitted jobs? == |
|||
Use [https://slurm.schedmd.com/squeue.html squeue] command, e.g.: |
|||
<pre> |
<pre> |
||
$ squeue # all jobs owned by user (all jobs owned by all users for admins) |
|||
$ scontrol write batch_script <job_id> <file> |
|||
$ squeue --me # all jobs owned by user (same as squeue for regular users) |
|||
$ scontrol write batch_script <job_id> - |
|||
$ squeue -u <username> # jobs of specific user |
|||
$ squeue -t PENDING # pending jobs only |
|||
$ squeue -t RUNNING # running jobs only |
|||
</pre> |
</pre> |
||
'''Notes:''' |
|||
* If file name is omitted default file name will be slurm-<job_id>.sh |
|||
* If file name is - (i.e. dash) job script will be written to stdout. |
|||
* The output format of [https://slurm.schedmd.com/squeue.html squeue] (and most other Slurm commands) is highly configurable to your needs. Look for the --format or --Format options. |
|||
* Every invocation of squeue sends a remote procedure call to the Slurm database server. '''Do not run squeue or other Slurm client commands from loops in shell scripts or other programs''' as this can result in a degradation of performance. Ensure that programs limit calls to squeue to the minimum necessary for the information you are trying to gather. |
|||
== How to submit batch job without job script? == |
|||
== How to cancel jobs? == |
|||
Use '--wrap' option. |
|||
Use [https://slurm.schedmd.com/scancel.html scancel] command, e.g. |
|||
Example: |
|||
<pre> |
|||
<pre>$ sbatch --nodes=2 --ntasks-per-node=16 --wrap "sleep 600"</pre> |
|||
$ scancel <jobid> # cancel specific job |
|||
$ scancel <jobid>_<index> # cancel indexed job in a job array |
|||
$ scancel -u <username> # cancel all jobs of specific user |
|||
$ scancel -t PENDING # cancel pending jobs |
|||
$ scancel -t RUNNING # cancel running jobs |
|||
</pre> |
|||
== How to show job script of a running job? == |
|||
'''Note:''' May be useful for testing purposes. |
|||
Use [https://slurm.schedmd.com/scontrol.html scontrol] command: |
|||
<pre> |
|||
= JOB MONITORING = |
|||
$ scontrol write batch_script <job_id> <file> |
|||
$ scontrol write batch_script <job_id> - |
|||
</pre> |
|||
* If file name is omitted default file name will be slurm-<job_id>.sh |
|||
* If file name is - (i.e. dash) job script will be written to stdout. |
|||
== How to get estimated start time of a job? == |
== How to get estimated start time of a job? == |
||
Line 534: | Line 713: | ||
<pre>$ squeue --start</pre> |
<pre>$ squeue --start</pre> |
||
'''Notes:''' |
|||
'''Note:''' Estimated start times are dynamic and can change at any moment. Exact start times of individual jobs are usually unpredictable. |
|||
* Estimated start times are dynamic and can change at any moment. Exact start times of individual jobs are usually unpredictable. |
|||
* Slurm will report N/A for the start time estimate if nodes are not currently being reserved by the scheduler for the job to run on. |
|||
== How to show remaining walltime of running jobs? == |
|||
Use [https://slurm.schedmd.com/squeue.html squeue] with format option "%L", e.g.: |
|||
<pre> $ squeue -t r -o "%u %i %L" </pre> |
|||
== How to check priority of jobs? == |
== How to check priority of jobs? == |
||
Line 557: | Line 743: | ||
</pre> |
</pre> |
||
== How to prevent (hold) jobs from being scheduled for execution? == |
|||
<pre> |
|||
$ scontrol hold <job_id> |
|||
</pre> |
|||
== How to unhold job? == |
|||
<pre> |
|||
$ scontrol release <job_id> |
|||
</pre> |
|||
== How to suspend a running job? == |
|||
<pre> |
|||
$ scontrol suspend <job_id> |
|||
</pre> |
|||
== How to resume a suspended job? == |
|||
<pre> |
|||
$ scontrol resume <job_id> |
|||
</pre> |
|||
== How to requeue (cancel and resubmit) a particular job? == |
|||
<pre> |
|||
$ scontrol requeue <job_id> |
|||
</pre> |
|||
==How to monitor resource usage of running job(s)? == |
== How to monitor resource usage of running job(s)? == |
||
Use the [https://slurm.schedmd.com/sstat.html sstat] command. |
Use the [https://slurm.schedmd.com/sstat.html sstat] command. |
||
'sstat -e' command shows |
'sstat -e' command shows a list of fields that can be specified with the '--format' option. |
||
Example: |
|||
<pre> |
|||
$ sstat --format=JobId,AveCPU,AveRSS,MaxRSS -j <jobid> |
|||
</pre> |
|||
This will show average CPU time, average and maximum memory consumption of all tasks in the running job. |
|||
Ideally, average CPU time equals the number of cores allocated for the job multiplied by the current run time of the job. |
|||
The maximum memory consumption gives an estimate of the peak amount of memory actually needed so far. This can be compared with the amount of memory requested for the job. Over-requesting memory can result in significant waste of compute resources. |
|||
'''Notes:''' |
'''Notes:''' |
||
Line 568: | Line 793: | ||
* Users can also ssh into compute nodes that they have one or more running jobs on. Once logged in, they can use standard Linux process monitoring tools like ps, (h)top, free, vmstat, iostat, du, ...

* Users can also attach an interactive shell under an already allocated job by running the following command: <pre>srun --jobid <job> --overlap --pty /bin/bash</pre> Once logged in, they can again use standard Linux process monitoring tools like ps, (h)top, free, vmstat, iostat, du, ... For a single node job the user does not even need to know the node that the job is running on. For a multinode job, the user can still use '-w <node>' option to specify a specific node.
||
== How to get job information |
== How to get detailed job information == |
||
<pre> |
<pre> |
||
Line 579: | Line 803: | ||
</pre> |
</pre> |
||
== How to modify a pending/running job? == |
|||
Use |
|||
<pre>$ scontrol update JobId=<jobid> ...</pre> |
|||
E.g.: <pre>$ scontrol update JobId=42 TimeLimit=7-0</pre> |
|||
This will modify the time limit of the job to 7 days. |
|||
'''Note:''' Update requests for '''running''' jobs are mostly restricted to Slurm administrators. In particular, only an administrator can increase the TimeLimit of a job. |
|||
== How to show accounting data of completed job(s)? == |
== How to show accounting data of completed job(s)? == |
||
Line 586: | Line 821: | ||
'sacct -e' command shows a list of fields that can be |
'sacct -e' command shows a list of fields that can be |
||
specified with the '--format' option. |
specified with the '--format' option. |
||
== How to retrieve job history and accounting? == |
== How to retrieve job history and accounting? == |
||
Line 616: | Line 850: | ||
</pre> |
</pre> |
||
'''Note:''' |
|||
You can also set the environment variable $SACCT_FORMAT to specify the default format. To get a general idea of how efficiently a job utilized its resources, the following format can be used: |
|||
<pre> |
|||
export SACCT_FORMAT="JobID,JobName,Elapsed,NCPUs,TotalCPU,CPUTime,ReqMem,MaxRSS,MaxDiskRead,MaxDiskWrite,State,ExitCode" |
|||
</pre> |
|||
To find how efficiently the CPUs were used, divide TotalCPU by CPUTime. To find how efficiently memory was used, divide MaxRSS by ReqMem. Be aware, however, that sacct memory usage measurement doesn't catch very rapid memory spikes. If your job got killed for running out of memory, it '''did run out of memory''' even if sacct reports a lower memory usage than would trigger an out-of-memory-kill. A job that reads or writes excessively to disk might be bogged down significantly by I/O operations.
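The TotalCPU / CPUTime ratio described above can be computed with a small helper. The following is only a sketch: the function names <code>to_seconds</code> and <code>cpu_efficiency</code> are made up for illustration, and it assumes the time strings have the form [DD-]HH:MM:SS or MM:SS[.mmm].

```shell
#!/bin/bash
# Hypothetical helpers (not part of Slurm): convert a time string of the
# form [DD-]HH:MM:SS or MM:SS[.mmm] to whole seconds.
to_seconds() {
    local t="${1%%.*}"          # drop fractional seconds, if present
    local days=0 a b c
    if [[ "$t" == *-* ]]; then  # leading "DD-" part
        days="${t%%-*}"
        t="${t#*-}"
    fi
    IFS=: read -r a b c <<< "$t"
    if [[ -z "$c" ]]; then      # only MM:SS given
        c="$b"; b="$a"; a=0
    fi
    # 10# forces base 10 so that values like "08" are not read as octal
    echo $(( 10#$days * 86400 + 10#$a * 3600 + 10#$b * 60 + 10#$c ))
}

# Print CPU efficiency in percent: 100 * TotalCPU / CPUTime
cpu_efficiency() {
    local used avail
    used=$(to_seconds "$1")
    avail=$(to_seconds "$2")
    if [ "$avail" -eq 0 ]; then
        echo 0
    else
        echo $(( 100 * used / avail ))
    fi
}

# Example call with the TotalCPU and CPUTime values of a job:
# cpu_efficiency 06:00:00 08:00:00
```

The result is rounded down to a whole percentage, which is usually precise enough to spot jobs that leave most of their allocated cores idle.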
|||
== How to get efficiency information of completed job(s)? == |
== How to get efficiency information of completed job(s)? == |
||
Line 621: | Line 864: | ||
Use <pre>$ seff <jobid> </pre> command for some brief information. |
Use <pre>$ seff <jobid> </pre> command for some brief information. |
||
'''Note:''' It is good practice to have a look at the efficiency of your job(s) on completion '''and we expect you to do so'''. This way you can improve your job specifications in the future. |
|||
== How to get complete field values from sstat and sacct commands? ==

When using the [https://slurm.schedmd.com/sacct.html#OPT_format --format] option for listing various fields you can put a %NUMBER afterwards to specify how many characters should be printed.

E.g. '--format=User%30' will print 30 characters for the user name (right justified). A %-30 will print 30 characters left justified.

sstat and sacct also provide the '--parsable' and '--parsable2' options to always print full field values delimited with a pipe '|' character by default.

The delimiting character can be specified by using the '--delimiter' option, e.g. '--delimiter=","' for comma separated values.
|||
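Pipe-delimited output is convenient to post-process in shell scripts. A minimal sketch, assuming the first two requested fields are JobID and State and that the header line is still present; the function name <code>list_unfinished</code> is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: read pipe-delimited records (header line first) on
# stdin and print "<JobID> <State>" for every record that is not COMPLETED.
list_unfinished() {
    local jobid state rest
    # Skip the header line, then split each record at the '|' delimiter
    tail -n +2 | while IFS='|' read -r jobid state rest; do
        if [ "$state" != "COMPLETED" ]; then
            echo "$jobid $state"
        fi
    done
}

# Possible use on the cluster:
# sacct -X --parsable2 -o JobID,State,Elapsed | list_unfinished
```

Using '--parsable2' rather than '--parsable' avoids a trailing delimiter at the end of each line, which keeps the field splitting simple.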
== How to retrieve job records for all jobs running/pending at a certain point in time? ==

Use [https://slurm.schedmd.com/sacct.html sacct] with [https://slurm.schedmd.com/sacct.html#OPT_state -s <state>] and [https://slurm.schedmd.com/sacct.html#OPT_starttime -S <start time>] options, e.g.:

<pre>
$ sacct -n -a -X -S 2021-04-01T00:00:00 -s R -o JobID,User%15,Account%10,NCPUS,NNodes,NodeList%1500
</pre>

'''Note:''' When specifying the state "-s <state>" '''and''' the start time "-S <start time>", the default time window will be set to end time "-E" equal to start time. Thus, you will get a snapshot of all running/pending jobs at the instant given by "-S <start time>".

== How to get a parsable list of hostnames from $SLURM_JOB_NODELIST? ==

<pre>
$ scontrol show hostnames $SLURM_JOB_NODELIST
</pre>
||
= ADMINISTRATION =

'''Note:''' Most commands in this section are restricted to system administrators.

== How to stop Slurm from scheduling jobs? ==

You can stop Slurm from scheduling jobs on a per partition basis by setting that partition's state to DOWN. Set its state UP to resume scheduling. For example:

<pre>
$ scontrol update PartitionName=foo State=DOWN
$ scontrol update PartitionName=foo State=UP
</pre>

== How to print actual hardware configuration of a node? ==

<pre>
$ slurmd -C  # print hardware configuration plus uptime
$ slurmd -G  # print generic resource configuration
</pre>

== How to reboot (all) nodes as soon as they become idle? ==

<pre>
$ scontrol reboot ASAP nextstate=RESUME <node1>,<node2>  # specific nodes
$ scontrol reboot ASAP nextstate=RESUME ALL              # all nodes
</pre>

== How to cancel pending reboot of nodes? ==

<pre>
$ scontrol cancel_reboot <node1>,<node2>
</pre>

== How to check current node status? ==

<pre>
$ scontrol show node <node>
</pre>

== How to instruct all Slurm daemons to re-read the configuration file ==

<pre>
$ scontrol reconfigure
</pre>
||
== How to prevent a user from submitting new jobs? == |
== How to prevent a user from submitting new jobs? == |
||
Line 720: | Line 960: | ||
* Use the following command to release the limit: |
* Use the following command to release the limit: |
||
<pre>
$ sacctmgr update user <username> set maxsubmitjobs=-1
</pre>
||
== How to drain node(s)? == |
== How to drain node(s)? == |
||
Line 736: | Line 975: | ||
* Do '''not''' just set state DOWN to drain nodes. This will kill any active jobs that may be running on those nodes.
||
== How to list reason for nodes being drained or down? == |
|||
<pre> |
|||
$ sinfo -R |
|||
</pre> |
|||
== How to resume node state? == |
== How to resume node state? == |
||
Line 742: | Line 986: | ||
$ scontrol update NodeName=<node1>,<node2> State=RESUME |
$ scontrol update NodeName=<node1>,<node2> State=RESUME |
||
</pre> |
</pre> |
||
== How to create a reservation on nodes? == |
== How to create a reservation on nodes? == |
||
Suggested reading: https://slurm.schedmd.com/reservations.html |
|||
<pre> |
<pre> |
||
Line 752: | Line 997: | ||
</pre> |
</pre> |
||
'''Note:''' Add "FLEX" flag to allow jobs that qualify for the reservation to start before the reservation begins (and continue after it starts). |
|||
See: https://slurm.schedmd.com/reservations.html |
|||
Add "MAGNETIC" flag to attract jobs that qualify for the reservation to run in that reservation without having requested it at submit time. |
|||
== How to create a floating reservation on nodes? == |
|||
Use the flag "TIME_FLOAT" and a start time that is relative to the current time (use the keyword "now"). |
|||
In the example below, the nodes are prevented from starting any jobs exceeding a walltime of 2 days. |
|||
<pre> |
|||
$ scontrol create reservation user=root starttime=now+2days duration=UNLIMITED flags=maint,ignore_jobs,time_float nodes=<node1>,<node2> |
|||
</pre> |
|||
'''Note:''' Floating reservations are not intended to run jobs, but to prevent long running jobs from being initiated on specific nodes. Attempts by users to make use of a floating reservation will be rejected. When ready to perform the maintenance, place the nodes in DRAIN state and delete the reservation.
|||
== How to use a reservation? == |
== How to use a reservation? == |
||
Line 760: | Line 1,016: | ||
$ sbatch --reservation=foo_6 ... script.slurm |
$ sbatch --reservation=foo_6 ... script.slurm |
||
</pre> |
</pre> |
||
== How to delete a reservation? == |
== How to delete a reservation? == |
||
Line 767: | Line 1,022: | ||
$ scontrol delete ReservationName=foo_6 |
$ scontrol delete ReservationName=foo_6 |
||
</pre> |
</pre> |
||
== How to get node oriented information similar to 'mdiag -n'? == |
== How to get node oriented information similar to 'mdiag -n'? == |
||
Line 785: | Line 1,039: | ||
n0003 standard* 0/0/ N/A 128000 N/A down* Not responding |
n0003 standard* 0/0/ N/A 128000 N/A down* Not responding |
||
</pre> |
</pre> |
||
== How to get node oriented information similar to 'pbsnodes'? == |
== How to get node oriented information similar to 'pbsnodes'? == |
||
<pre>
$ scontrol show nodes                      # One paragraph per node (all nodes)
$ scontrol show nodes <node1>,<node2>      # One paragraph per node (specified nodes)
$ scontrol -o show nodes                   # One line per node (all nodes)
$ scontrol -o show nodes <node1>,<node2>   # One line per node (specified nodes)
</pre>
|||
== How to update multiple jobs of a user with a single scontrol command? == |
== How to update multiple jobs of a user with a single scontrol command? == |
||
Line 830: | Line 1,073: | ||
However, Slurm does not allow the UserID filter alone. |
However, Slurm does not allow the UserID filter alone. |
||
== How to create a new account? == |
== How to create a new account? == |
||
Line 845: | Line 1,087: | ||
$ sacctmgr add account <accountname> parent=<parent_accountname> |
$ sacctmgr add account <accountname> parent=<parent_accountname> |
||
</pre> |
</pre> |
||
== How to move account to another parent? == |
== How to move account to another parent? == |
||
Line 852: | Line 1,093: | ||
$ sacctmgr modify account name=<accountname> set parent=<new_parent_accountname> |
$ sacctmgr modify account name=<accountname> set parent=<new_parent_accountname> |
||
</pre> |
</pre> |
||
== How to delete an account? == |
== How to delete an account? == |
||
Line 859: | Line 1,099: | ||
$ sacctmgr delete account name=<accountname> |
$ sacctmgr delete account name=<accountname> |
||
</pre> |
</pre> |
||
== How to add a new user? == |
== How to add a new user? == |
||
Line 866: | Line 1,105: | ||
$ sacctmgr add user <username> DefaultAccount=<accountname> |
$ sacctmgr add user <username> DefaultAccount=<accountname> |
||
</pre> |
</pre> |
||
== How to add/remove users from an account? == |
== How to add/remove users from an account? == |
||
Line 876: | Line 1,114: | ||
</pre> |
</pre> |
||
== How to change default account of a user? == |
|||
<pre> |
|||
$ sacctmgr modify user where user=<username> set DefaultAccount=<default_account> |
|||
</pre> |
|||
'''Note:''' The user must already be associated with the account you want to set as default. |
|||
== How to show account information? == |
== How to show account information? == |
||
Line 883: | Line 1,128: | ||
$ sacctmgr show assoc tree |
$ sacctmgr show assoc tree |
||
</pre> |
</pre> |
||
== How to implement user resource throttling policies? == |
== How to implement user resource throttling policies? == |
||
Line 901: | Line 1,145: | ||
Again, the QOS would be overriding the base priority that could be set |
Again, the QOS would be overriding the base priority that could be set |
||
in the associations. |
in the associations. |
||
== How to set a resource limit for an individual user? == |
== How to set a resource limit for an individual user? == |
||
Suggested reading: https://slurm.schedmd.com/resource_limits.html |
|||
Example: |
Example: |
||
Line 913: | Line 1,158: | ||
</pre> |
</pre> |
||
== How to retrieve historical resource usage for a specific user or account? == |
|||
|||
Use [https://slurm.schedmd.com/sreport.html sreport] command. |
|||
Examples: |
|||
<pre> |
|||
$ sreport cluster UserUtilizationByAccount Start=2021-01-01 End=2021-12-31 -t Hours user=<username> # Report cluster utilization of given user broken down by accounts |
|||
$ sreport cluster AccountUtilizationByUser Start=2021-01-01 End=2021-12-31 -t Hours account=<account> # Report cluster utilization of given account broken down by users |
|||
</pre> |
|||
'''Notes:''' |
|||
* By default CPU resources will be reported. Use '-T' option for other trackable resources, e.g. '-T cpu,mem,gres/gpu,gres/scratch'. |
|||
* On JUSTUS 2 registered compute projects ("Rechenvorhaben") are uniquely mapped to Slurm accounts of the same name. Thus, 'AccountUtilizationByUser' can also be used to report the aggregated cluster utilization of compute projects. |
|||
* The command can also be executed by regular users, in which case Slurm will only report their own usage records (along with the total usage of the associated account in the case of 'AccountUtilizationByUser').
|||
== How to fix/reset a user's RawUsage value? == |
|||
<pre> |
|||
$ sacctmgr modify user <username> where Account=<account> set RawUsage=<number> |
|||
</pre> |
|||
== How to create/modify/delete QOSes? == |
== How to create/modify/delete QOSes? == |
||
Line 933: | Line 1,196: | ||
</pre> |
</pre> |
||
== How to find (and fix) runaway jobs? == |
|||
<pre>$ sacctmgr show runaway</pre> |
|||
'''Notes:''' |
|||
* Runaway jobs are orphaned jobs that don't exist in the Slurm controller but have a start time and no end time in the Slurm database. Runaway jobs mess with accounting and affect new jobs of users who have too many runaway jobs.
|||
* If there are jobs in this state this command will also provide an option to fix them. This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed. |
|||
== How to show a history of database transactions? == |
== How to show a history of database transactions? == |
||
<pre>$ sacctmgr list transactions</pre>

'''Note:''' Useful to get timestamps for when a user/account/qos has been created/modified/removed etc.
Latest revision as of 17:21, 23 October 2024
The bwForCluster JUSTUS 2 is a state-wide high-performance compute resource dedicated to Computational Chemistry and Quantum Sciences in Baden-Württemberg, Germany.
This is a collection of howtos and convenient Slurm commands for JUSTUS 2.
Some commands behave slightly differently depending on whether they are executed by a system administrator or by a regular user, as Slurm prevents regular users from accessing critical system information and viewing job and usage information of other users.
GENERAL INFORMATION
How to find a general quick start user guide?
https://slurm.schedmd.com/quickstart.html
How to find Slurm FAQ?
https://slurm.schedmd.com/faq.html
How to find a Slurm cheat sheet?
https://slurm.schedmd.com/pdfs/summary.pdf
How to find Slurm tutorials?
https://slurm.schedmd.com/tutorials.html
How to get more information on Slurm?
(Almost) every Slurm command has a man page. Use it.
Online versions: https://slurm.schedmd.com/man_index.html
How to find hardware specific details about JUSTUS 2?
See our Wiki page: Hardware and Architecture
JOB SUBMISSION
How to submit a serial batch job?
Use sbatch command:
$ sbatch <job-script>
Sample job script template for serial job:
#!/bin/bash
# Allocate one node
#SBATCH --nodes=1
# Number of program instances to be executed
#SBATCH --ntasks-per-node=1
# 8 GB memory required per node
#SBATCH --mem=8G
# Maximum run time of job
#SBATCH --time=1:00:00
# Give job a reasonable name
#SBATCH --job-name=serial_job
# File name for standard output (%j will be replaced by job id)
#SBATCH --output=serial_job-%j.out
# File name for error output
#SBATCH --error=serial_job-%j.err
# Load software modules as needed, e.g.
# module load foo/bar
# Run serial program
./my_serial_program
Sample code for serial program: Hello_serial.c
Notes:
- --nodes=1 and --ntasks-per-node=1 may be replaced by --ntasks=1.
- If not specified, stdout and stderr are both written to slurm-%j.out.
How to find working sample scripts for my program?
Most software modules for applications provide working sample batch scripts. Check with module help command, e.g.
$ module help chem/vasp     # display module help for VASP
$ module help math/matlab   # display module help for Matlab
How to harden job scripts against common errors?
The bash shell provides several options that support users in disclosing hidden bugs and writing safer job scripts. In order to activate these safeguard settings users can insert the following lines in their scripts (after all #SBATCH directives):
[...]
set -o errexit # (or set -e) cause batch script to exit immediately when a command fails.
set -o pipefail # cause batch script to exit immediately also when the command that failed is embedded in a pipeline
set -o nounset # (or set -u) causes the script to treat unset variables as an error and exit immediately
[...]
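With these settings active, commands that are allowed to fail and variables that may be unset need explicit handling. A minimal sketch of two common patterns; the variable <code>MY_INPUT_DIR</code> is hypothetical and only serves as an example:

```shell
#!/bin/bash
set -o errexit
set -o pipefail
set -o nounset

# nounset: give optional variables an explicit default instead of
# relying on them being set (MY_INPUT_DIR is a hypothetical variable)
input_dir="${MY_INPUT_DIR:-$PWD}"

# errexit/pipefail: grep exits with status 1 when nothing matches.
# Append "|| true" where a non-zero exit status is expected and harmless.
matches=$(printf 'alpha\nbeta\n' | grep -c 'gamma' || true)

echo "input_dir=$input_dir matches=$matches"
```

Without the "|| true", the script would stop at the grep line even though "no match" is a perfectly valid outcome here.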
How to submit an interactive job?
Use salloc command, e.g.:
$ salloc --nodes=1 --ntasks-per-node=8
Note:
In Slurm versions before 20.11, srun was the recommended way to launch interactive jobs, e.g.:
$ srun --nodes=1 --ntasks-per-node=8 --pty bash
Although this still works with current Slurm versions, it is considered deprecated because it may cause issues when launching additional job steps from within the interactive job environment. Use the salloc command instead.
How to enable X11 forwarding for an interactive job?
Use '--x11' flag, e.g.
$ salloc --nodes=1 --ntasks-per-node=8 --x11 # run shell with X11 forwarding enabled
Note:
- For X11 forwarding to work, you must also enable X11 forwarding for your ssh login from your local computer to the cluster, i.e.:
local> ssh -X <username>@justus2.uni-ulm.de
How to convert Moab batch job scripts to Slurm?
Replace Moab/Torque job specification flags and environment variables in your job scripts by their corresponding Slurm counterparts.
Commonly used Moab job specification flags and their Slurm equivalents
Option | Moab (msub) | Slurm (sbatch) |
---|---|---|
Script directive | #MSUB | #SBATCH |
Job name | -N <name> | --job-name=<name> (-J <name>) |
Account | -A <account> | --account=<account> (-A <account>) |
Queue | -q <queue> | --partition=<partition> (-p <partition>) |
Wall time limit | -l walltime=<hh:mm:ss> | --time=<hh:mm:ss> (-t <hh:mm:ss>) |
Node count | -l nodes=<count> | --nodes=<count> (-N <count>) |
Core count | -l procs=<count> | --ntasks=<count> (-n <count>) |
Process count per node | -l ppn=<count> | --ntasks-per-node=<count> |
Core count per process | | --cpus-per-task=<count> |
Memory limit per node | -l mem=<limit> | --mem=<limit> |
Memory limit per process | -l pmem=<limit> | --mem-per-cpu=<limit> |
Job array | -t <array indices> | --array=<indices> (-a <indices>) |
Node exclusive job | -l naccesspolicy=singlejob | --exclusive |
Initial working directory | -d <directory> (default: $HOME) | --chdir=<directory> (-D <directory>) (default: submission directory) |
Standard output file | -o <file path> | --output=<file> (-o <file>) |
Standard error file | -e <file path> | --error=<file> (-e <file>) |
Combine stdout/stderr to stdout | -j oe | --output=<combined stdout/stderr file> |
Mail notification events | -m <event> | --mail-type=<events> (valid types include: NONE, BEGIN, END, FAIL, ALL) |
Export environment to job | -V | --export=ALL (default) |
Don't export environment to job | (default) | --export=NONE |
Export environment variables to job | -v <var[=value][,var2=value2[, ...]]> | --export=<var[=value][,var2=value2[,...]]> |
Notes:
- Default initial job working directory is $HOME for Moab. For Slurm the default working directory is where you submit your job from.
- By default Moab does not export any environment variables to the job's runtime environment. With Slurm most of the login environment variables are exported to your job's runtime environment. This includes environment variables from software modules that were loaded at job submission time (and also $HOSTNAME variable).
Commonly used Moab/Torque script environment variables and their Slurm equivalents
Information | Moab | Torque | Slurm |
---|---|---|---|
Job name | $MOAB_JOBNAME | $PBS_JOBNAME | $SLURM_JOB_NAME |
Job ID | $MOAB_JOBID | $PBS_JOBID | $SLURM_JOB_ID |
Submit directory | $MOAB_SUBMITDIR | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR |
Number of nodes allocated | $MOAB_NODECOUNT | $PBS_NUM_NODES | $SLURM_JOB_NUM_NODES (and: $SLURM_NNODES) |
Node list | $MOAB_NODELIST | cat $PBS_NODEFILE | $SLURM_JOB_NODELIST |
Number of processes | $MOAB_PROCCOUNT | $PBS_TASKNUM | $SLURM_NTASKS |
Requested tasks per node | --- | $PBS_NUM_PPN | $SLURM_NTASKS_PER_NODE |
Requested CPUs per task | --- | --- | $SLURM_CPUS_PER_TASK |
Job array index | $MOAB_JOBARRAYINDEX | $PBS_ARRAY_INDEX | $SLURM_ARRAY_TASK_ID |
Job array range | $MOAB_JOBARRAYRANGE | - | $SLURM_ARRAY_TASK_COUNT |
Queue name | $MOAB_CLASS | $PBS_QUEUE | $SLURM_JOB_PARTITION |
QOS name | $MOAB_QOS | --- | $SLURM_JOB_QOS |
--- | $PBS_NUM_PPN | $SLURM_TASKS_PER_NODE | |
Job user | $MOAB_USER | $PBS_O_LOGNAME | $SLURM_JOB_USER |
Hostname | $MOAB_MACHINE | $PBS_O_HOST | $SLURMD_NODENAME |
Note:
- See sbatch man page for a complete list of flags and environment variables.
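For simple job scripts, part of the conversion can be automated. The following is a rough sketch, not a complete converter: the function name <code>moab2slurm</code> is made up, only a handful of directives and variables from the tables above are handled, and the result should always be reviewed by hand.

```shell
#!/bin/bash
# Hypothetical helper: rewrite a few common #MSUB directives and Moab
# environment variables to their Slurm counterparts.
# Reads a job script on stdin and writes the converted script to stdout.
moab2slurm() {
    sed -E \
        -e 's/^#MSUB[[:space:]]+-N[[:space:]]+/#SBATCH --job-name=/' \
        -e 's/^#MSUB[[:space:]]+-q[[:space:]]+/#SBATCH --partition=/' \
        -e 's/^#MSUB[[:space:]]+-l[[:space:]]+walltime=/#SBATCH --time=/' \
        -e 's/^#MSUB[[:space:]]+-l[[:space:]]+nodes=/#SBATCH --nodes=/' \
        -e 's/\$MOAB_JOBID/$SLURM_JOB_ID/g' \
        -e 's/\$MOAB_SUBMITDIR/$SLURM_SUBMIT_DIR/g'
}

# Possible use (file names are placeholders):
# moab2slurm < job.moab > job.slurm
```

Directives that are not matched are passed through unchanged, so any remaining "#MSUB" lines in the output indicate what still has to be converted manually.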
How to emulate Moab output file names?
Use the following directives:
#SBATCH --output="%x.o%j"
#SBATCH --error="%x.e%j"
How to pass command line arguments to the job script?
Run
$ sbatch <job-script> arg1 arg2 ...
Inside the job script the arguments can be accessed as $1, $2, ...
E.g.:
[...]
infile="$1"
outfile="$2"
./my_serial_program < "$infile" > "$outfile" 2>&1
[...]
Notes:
- Do not use $1, $2, ... in "#SBATCH" lines. These parameters can be used only within the regular shell script.
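It can help to validate the arguments at the top of the job script, so that a job with missing arguments fails immediately. A minimal sketch, assuming the two-argument layout from the example above; <code>check_args</code> is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: verify that exactly two arguments were given
# and that the input file (first argument) is readable.
check_args() {
    if [ "$#" -ne 2 ]; then
        echo "Usage: sbatch <job-script> <infile> <outfile>" >&2
        return 1
    fi
    if [ ! -r "$1" ]; then
        echo "Input file '$1' is not readable" >&2
        return 1
    fi
}

# In the job script (after all #SBATCH lines):
# check_args "$@" || exit 1
```

Failing early this way avoids wasting queue time on a job that would otherwise stop later with a less obvious error.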
How to request local scratch (SSD/NVMe) at job submission?
Use '--gres=scratch:nnn' option to allocate nnn GB of local (i.e. node-local) scratch space for the entire job.
Example: '--gres=scratch:100' will allocate 100 GB scratch space on a locally attached NVMe device.
Notes:
- Do not add any unit (such as --gres=scratch:100G). This would be treated as requesting an amount of 10^9 * 100GB of scratch space.
- Multinode jobs get nnn GB of local scratch space on every node of the job.
- Environment variable $SCRATCH will point to
- /scratch/<user>.<jobid> when local scratch has been requested. This will be on locally attached SSD/NVMe devices.
- /tmp/<user>.<jobid> when no local scratch has been requested. This will be in memory and, thus, be limited in size.
- Environment variable $TMPDIR always points to /tmp/<user>.<jobid>. This will always be in memory and, thus, limited in size.
- For backward compatibility environment variable $RAMDISK always points to /tmp/<user>.<jobid>
- Scratch space allocation in /scratch will be enforced by quota limits
- Data written to $TMPDIR will always count against allocated memory.
- Data written to local scratch space will automatically be removed at the end of the job.
How to request GPGPU nodes at job submission?
Use '--gres=gpu:<count>' option to allocate 1 or 2 GPUs per node for the entire job.
Example: '--gres=gpu:1' will allocate one GPU per node for this job.
Notes:
- GPGPU nodes are equipped with two Nvidia V100S cards
- Environment variables $CUDA_VISIBLE_DEVICES, $SLURM_JOB_GPUS and $GPU_DEVICE_ORDINAL will denote card(s) allocated for the job.
- CUDA Toolkit is available as software module devel/cuda.
How to clean-up or save files before a job times out?
Possibly you would like to clean up the work directory or save intermediate result files in case a job times out.
The following sample script may serve as a blueprint for implementing a pre-termination function to perform clean-up or file recovery actions.
#!/bin/bash
# Allocate one node
#SBATCH --nodes=1
# Number of program instances to be executed
#SBATCH --ntasks-per-node=1
# 2 GB memory required per node
#SBATCH --mem=2G
# Request 10 GB local scratch space
#SBATCH --gres=scratch:10
# Maximum run time of job
#SBATCH --time=10:00
# Send the USR1 signal 120 seconds before end of time limit
#SBATCH --signal=B:USR1@120
# Give job a reasonable name
#SBATCH --job-name=signal_job
# File name for standard output (%j will be replaced by job id)
#SBATCH --output=signal_job-%j.out
# File name for error output
#SBATCH --error=signal_job-%j.err
# Define the signal handler function
# Note: This is not executed here, but rather when the associated
# signal is received by the shell.
finalize_job()
{
# Do whatever cleanup you want here. In this example we copy
# output file(s) back to $SLURM_SUBMIT_DIR, but you may implement
# your own job finalization code here.
echo "function finalize_job called at `date`"
cd "$SCRATCH"
mkdir -vp "$SLURM_SUBMIT_DIR"/results
tar czvf "$SLURM_SUBMIT_DIR"/results/${SLURM_JOB_ID}.tgz output*.txt
exit
}
# Call finalize_job function as soon as we receive USR1 signal
trap 'finalize_job' USR1
# Copy input files for this job to the scratch directory (if needed).
# Note: Environment variable $SCRATCH always points to a scratch directory
# automatically created for this job. Environment variable $SLURM_SUBMIT_DIR
# points to the path where this script was submitted from.
# Example:
# cp -v "$SLURM_SUBMIT_DIR"/input*.txt "$SCRATCH"
# Change working directory to local scratch directory
cd "$SCRATCH"
# Load software modules as needed, e.g.
# module load foo/bar
# This is where the actual work is done. In this case we just create
# a sample output file for 900 (=15*60) seconds, but since we asked
# Slurm for 600 seconds only it will not be able to finish within this
# wall time.
# Note: It is important to run this task in the background
# by placing the & symbol at the end. Otherwise the signal handler
# would not be executed until that process has finished, which is not
# what we want.
(for i in `seq 15`; do echo "Hello World at `date +%H:%M:%S`."; sleep 60; done) >output.txt 2>&1 &
# Note: The command above is just for illustration. Normally you would just run
# my_program >output.txt 2>&1 &
# Tell the shell to wait for background task(s) to finish.
# Note: This is important because otherwise the parent shell
# (this script) would proceed (and terminate) without waiting for
# background task(s) to finish.
wait
# If we get here, the job did not time out but finished in time.
# Release user defined signal handler for USR1
trap - USR1
# Do regular cleanup and save files. In this example we simply call
# the same function that we defined as a signal handler above, but you
# may implement your own code here.
finalize_job
exit
Notes:
- The number of seconds specified in the --signal option should be large enough to cover the runtime of the pre-termination function and must not exceed 65535 seconds.
- Due to the resolution of event handling by Slurm, the signal may be sent a little earlier than specified.
How to submit a multithreaded batch job?
Sample job script template for a job running one multithreaded program instance:
#!/bin/bash
# Allocate one node
#SBATCH --nodes=1
# Number of program instances to be executed
#SBATCH --ntasks-per-node=1
# Number of cores per program instance
#SBATCH --cpus-per-task=8
# 8 GB memory required per node
#SBATCH --mem=8G
# Maximum run time of job
#SBATCH --time=1:00:00
# Give job a reasonable name
#SBATCH --job-name=multithreaded_job
# File name for standard output (%j will be replaced by job id)
#SBATCH --output=multithreaded_job-%j.out
# File name for error output
#SBATCH --error=multithreaded_job-%j.err
# Load software modules as needed, e.g.
# module load foo/bar
export OMP_NUM_THREADS=${SLURM_JOB_CPUS_PER_NODE}
export MKL_NUM_THREADS=${SLURM_JOB_CPUS_PER_NODE}
# Run multithreaded program
./my_multithreaded_program
Sample code for multithreaded program: Hello_openmp.c
Notes:
- In our configuration each physical core is considered a "CPU".
- On JUSTUS 2 it is recommended to specify a number of cores per task ('--cpus-per-task') that is either an integer divisor of 24 or (at most) 48.
- Required memory can also be specified per allocated CPU with '--mem-per-cpu' option.
- The '--mem' and '--mem-per-cpu' options are mutually exclusive.
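As an alternative way to set the thread count, the value of '--cpus-per-task' can be used directly. Slurm sets $SLURM_CPUS_PER_TASK only if '--cpus-per-task' was specified, so a fallback is advisable. A minimal sketch; the helper name <code>threads_for_task</code> is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: use the value of --cpus-per-task if it was given,
# otherwise fall back to a single thread.
threads_for_task() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(threads_for_task)"
export MKL_NUM_THREADS="$OMP_NUM_THREADS"
```

The fallback to 1 keeps a program from starting one thread per core of the node when the job script was submitted without '--cpus-per-task'.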
How to submit an array job?
Use -a (or --array) option, e.g.
$ sbatch -a 1-16%8 ...
This will submit 16 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 16, but will limit the number of simultaneously running tasks from this job array to 8.
Sample job script template for an array job:
#!/bin/bash
# Number of cores per individual array task
#SBATCH --ntasks=1
#SBATCH --array=1-16%8
#SBATCH --mem=4G
#SBATCH --time=01:00:00
#SBATCH --job-name=array_job
#SBATCH --output=array_job-%A_%a.out
#SBATCH --error=array_job-%A_%a.err
# Load software modules as needed, e.g.
# module load foo/bar
# Print the task id.
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
# Add lines here to run your computations, e.g.
# ./my_program <input.$SLURM_ARRAY_TASK_ID
Notes:
- Placeholder %A will be replaced by the master job id, %a will be replaced by the array task id.
- Every sub job in an array job will have its own unique environment variable $SLURM_JOB_ID. Environment variable $SLURM_ARRAY_JOB_ID will be set to the job id of the first array task (the master job id) and is identical for all tasks.
- The remaining options in the sample job script are the same as the options used in other, non-array jobs. In the example above, we are requesting that each array task be allocated 1 CPU (--ntasks=1) and 4 GB of memory (--mem=4G) for up to one hour (--time=01:00:00).
- More information: https://slurm.schedmd.com/job_array.html
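A common pattern is to let each array task pick its own input from a list file, one entry per line. The following is a minimal sketch of that idea. The function name input_for_task and the file name inputs.txt are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print line number <task_id> of <list_file>.
# <task_id> defaults to $SLURM_ARRAY_TASK_ID.
# Fails if the id is missing or not a number, or if the file has no such line.
input_for_task() {
    local list_file task_id line
    list_file=$1
    task_id=${2:-${SLURM_ARRAY_TASK_ID:-}}
    case $task_id in
        ''|*[!0-9]*) return 1 ;;
    esac
    line=$(sed -n "${task_id}p" "$list_file")
    [ -n "$line" ] || return 1
    printf '%s\n' "$line"
}

# Example usage inside an array job script (inputs.txt is an assumed file name):
#   input=$(input_for_task inputs.txt) || exit 1
#   ./my_program <"$input"
```

With '--array=1-16', line 1 of the list is processed by task 1, line 2 by task 2, and so on.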
How to delay the start of a job?
Use -b (or --begin) option in order to defer the allocation of the job until the specified time.
Examples:
sbatch --begin=20:00 ...               # job can start after 8 p.m.
sbatch --begin=now+1hour ...           # job can start 1 hour after submission
sbatch --begin=teatime ...             # job can start at teatime (4 p.m.)
sbatch --begin=2023-12-24T20:00:00 ... # job can start after specified date/time
How to submit dependency (chain) jobs?
Use -d (or --dependency) option, e.g.
$ sbatch -d afterany:123456 ...
This will defer the submitted job until the specified job 123456 has terminated.
Slurm supports a number of different dependency types, e.g.:
-d after:123456      # job can begin execution after the specified job has begun execution
-d afterany:123456   # job can begin execution after the specified job has finished
-d afternotok:123456 # job can begin execution after the specified job has failed (exit code not equal zero)
-d afterok:123456    # job can begin execution after the specified job has successfully finished (exit code zero)
-d singleton         # job can begin execution after all previously launched jobs with the same job name and user have finished
Note: Multiple jobs can be specified by separating their job ids by colon characters (:), e.g.
$ sbatch -d afterany:123456:123457 ...
This will defer the submitted job until the specified jobs 123456 and 123457 have both finished.
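When jobs are chained from a script, the job ids have to be captured at submission time. The following is a minimal sketch, assuming the ids are collected with 'sbatch --parsable', which prints the job id instead of the usual message. The function name dependency_arg and the script names step1.slurm to step3.slurm are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: join a dependency type and zero or more job ids into
# the argument expected by 'sbatch -d', e.g. "afterok:123456:123457".
dependency_arg() {
    local dep id
    dep=$1
    shift
    for id in "$@"; do
        dep="$dep:$id"
    done
    printf '%s\n' "$dep"
}

# Example usage (script names are assumptions):
#   id1=$(sbatch --parsable step1.slurm)
#   id2=$(sbatch --parsable step2.slurm)
#   sbatch -d "$(dependency_arg afterok "$id1" "$id2")" step3.slurm
```

In this usage example the third job starts only after both earlier jobs have finished successfully.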
How to deal with invalid job dependencies?
Use --kill-on-invalid-dep=yes option in order to automatically terminate jobs which can never run due to invalid dependencies. By default the job stays pending with reason 'DependencyNeverSatisfied' to allow review and appropriate action by the user.
Note: A job dependency may also become invalid if a job has been submitted with '-d afterok:<jobid>' but the specified dependency job has failed, e.g. because it timed out (i.e. exceeded its wall time limit).
How to submit an MPI batch job?
Suggested reading: https://slurm.schedmd.com/mpi_guide.html
Sample job script template for an MPI job:
#!/bin/bash
# Allocate two nodes
#SBATCH --nodes=2
# Number of program instances to be executed
#SBATCH --ntasks-per-node=48
# Allocate 32 GB memory per node
#SBATCH --mem=32gb
# Maximum run time of job
#SBATCH --time=1:00:00
# Give job a reasonable name
#SBATCH --job-name=mpi_job
# File name for standard output (%j will be replaced by job id)
#SBATCH --output=mpi_job-%j.out
# File name for error output
#SBATCH --error=mpi_job-%j.err
# Add lines here to run your computations, e.g.
#
# Option 1: Launch MPI tasks by using mpirun
#
# for OpenMPI and GNU compiler:
#
# module load compiler/gnu
# module load mpi/openmpi
# mpirun ./my_mpi_program
#
# for Intel MPI and Intel compiler:
#
# module load compiler/intel
# module load mpi/impi
# mpirun ./my_mpi_program
#
# Option 2: Launch MPI tasks by using srun
#
# for OpenMPI and GNU compiler:
#
# module load compiler/gnu
# module load mpi/openmpi
# srun ./my_mpi_program
#
# for Intel MPI and Intel compiler:
#
module load compiler/intel
module load mpi/impi
srun ./my_mpi_program
Sample code for MPI program: Hello_mpi.c
Notes:
- SchedMD recommends using srun, and many (most?) sites do so as well. The rationale is that srun is more tightly integrated with the scheduler and provides more consistent and reliable resource tracking and accounting for individual jobs and job steps. mpirun may behave differently for different MPI implementations and versions. There are reports of "strange behavior" of mpirun, especially when using task affinity and core binding. Using srun is supposed to resolve these issues and is therefore highly recommended.
- Do not run batch jobs that launch a large number (hundreds or thousands) of short-running (a few minutes or less) MPI programs, e.g. from a shell loop. Every single MPI invocation generates its own job step and sends remote procedure calls to the Slurm controller server. This can degrade performance for both Slurm and the application, especially if many such jobs happen to run at the same time. Jobs of that kind can even get stuck without showing any further activity until they hit the wall time limit. For high throughput computing (e.g. processing a large number of files, with every task running independently of the others and only briefly), consider a more appropriate parallelization paradigm that invokes independent serial (non-MPI) processes in parallel. This approach is sometimes referred to as a "pleasingly parallel" workload. GNU Parallel is a shell tool that facilitates executing serial tasks in parallel. On JUSTUS 2 this tool is available as a software module "system/parallel".
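The idea of running many independent serial tasks side by side within one job can be sketched as follows. The note above names GNU Parallel. To stay free of extra modules, this sketch swaps in 'xargs -P', so it is an illustration of the paradigm rather than the recommended tool. The function name run_serial_tasks and all file names are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: read one item per line from stdin and run the given
# command once per item, with at most <max_procs> processes at a time.
# Uses 'xargs -P' (supported by GNU and BSD xargs) in place of GNU Parallel.
run_serial_tasks() {
    local max_procs
    max_procs=$1
    shift
    # Each input line replaces '{}' and is passed as the last argument.
    xargs -P "$max_procs" -I '{}' "$@" '{}'
}

# Example usage inside a single-node job (file names are assumptions):
#   ls input_*.dat | run_serial_tasks "$SLURM_CPUS_PER_TASK" ./my_serial_program
```

This creates no additional job steps, so the Slurm controller is not contacted for every task. Item names containing quotes or backslashes would need extra care with xargs.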
How to submit a hybrid MPI/OpenMP job?
Sample job script template for a hybrid job:
#!/bin/bash
# Number of nodes to allocate
#SBATCH --nodes=4
# Number of MPI instances (ranks) to be executed per node
#SBATCH --ntasks-per-node=2
# Number of threads per MPI instance
#SBATCH --cpus-per-task=24
# Allocate 8 GB memory per node
#SBATCH --mem=8gb
# Maximum run time of job
#SBATCH --time=1:00:00
# Give job a reasonable name
#SBATCH --job-name=hybrid_job
# File name for standard output (%j will be replaced by job id)
#SBATCH --output=hybrid_job-%j.out
# File name for error output
#SBATCH --error=hybrid_job-%j.err
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK}
module load compiler/intel
module load mpi/impi
srun ./my_hybrid_program
# or:
# mpirun ./my_hybrid_program
Sample code for hybrid program: Hello_hybrid.c
Notes:
- $SLURM_CPUS_PER_TASK is only set if the '--cpus-per-task' option is specified.
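Because $SLURM_CPUS_PER_TASK may be unset, a job script can guard the thread count with a default. The following is a minimal sketch. The function name threads_per_task is hypothetical, and the fallback value of 1 is an assumed conservative choice.

```shell
#!/bin/bash
# Print the thread count for one task: the value of SLURM_CPUS_PER_TASK if it
# is set and non-empty, otherwise 1 (an assumed, conservative default).
threads_per_task() {
    printf '%s\n' "${SLURM_CPUS_PER_TASK:-1}"
}

# Example usage in a job script:
#   export OMP_NUM_THREADS=$(threads_per_task)
#   export MKL_NUM_THREADS=$(threads_per_task)
```

Without such a guard, an empty OMP_NUM_THREADS leaves the thread count to the OpenMP runtime, which may not match the cores allocated to the task.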
How to request specific node(s) at job submission?
Use -w (or --nodelist) option, e.g.:
$ sbatch -w <node1>,<node2> ...
Also see -F (or --nodefile) option.
How to exclude specific nodes from job?
Use -x (or --exclude) option, e.g.:
$ sbatch -x <node1>,<node2> ...
How to get exclusive jobs?
Use '--exclusive' option on job submission. This makes sure that there will be no other jobs running on your nodes. Very useful for benchmarking!
Note:
- The '--exclusive' option does not mean that you automatically get full access to all the resources the node provides. You still have to request them explicitly.
How to avoid sharing nodes with other users?
Use '--exclusive=user' option on job submission. This will still allow multiple jobs of one and the same user on the nodes.
Note:
- Depending on configuration, '--exclusive=user' may (and probably will) be the default node access policy on JUSTUS 2.
How to submit batch job without job script?
Use --wrap option.
Example:
$ sbatch --nodes=2 --ntasks-per-node=16 --wrap "sleep 600"
Note: May be useful for testing purposes.
JOB MONITORING AND CONTROL
How to prevent Slurm performance degradation?
Almost every invocation of a Slurm client command (e.g. squeue, sacct, sprio or sshare) sends a remote procedure call (RPC) to the Slurm control daemon and/or database. If enough remote procedure calls come in at once, this can result in a degradation of performance of the Slurm services for all users, possibly resulting in a denial of service.
Therefore, do not run Slurm client commands that send remote procedure calls from loops in shell scripts or other programs (such as 'watch squeue'). Always limit calls to squeue, sstat, sacct etc. to the minimum necessary for the information you are trying to gather.
Slurm does collect RPC counts and timing statistics by message type and user for diagnostic purposes.
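One way to limit the number of remote procedure calls is to cache the output of a query and reuse it for a while. The following is a minimal sketch of that idea, not an official tool. The function name cached_run and the cache location in the usage example are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: run a command at most once per <max_age> seconds and
# serve its saved output in between.
# Usage: cached_run <max_age_seconds> <cache_file> <command> [args...]
cached_run() {
    local max_age cache now stamp
    max_age=$1
    cache=$2
    shift 2
    now=$(date +%s)
    stamp=0
    if [ -f "$cache.stamp" ]; then
        stamp=$(cat "$cache.stamp")
    fi
    # Refresh only if there is no cached output yet or it is too old.
    if [ ! -f "$cache" ] || [ $(( now - stamp )) -ge "$max_age" ]; then
        "$@" >"$cache.tmp" || return 1
        mv "$cache.tmp" "$cache"
        printf '%s\n' "$now" >"$cache.stamp"
    fi
    cat "$cache"
}

# Example usage (cache location is an assumption):
#   cached_run 60 "$HOME/.cache/squeue.out" squeue --me
```

With a maximum age of 60 seconds, repeated calls within one minute read the saved file and send no further request to Slurm.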
How to view information about submitted jobs?
Use squeue command, e.g.:
$ squeue               # all jobs owned by user (all jobs owned by all users for admins)
$ squeue --me          # all jobs owned by user (same as squeue for regular users)
$ squeue -u <username> # jobs of specific user
$ squeue -t PENDING    # pending jobs only
$ squeue -t RUNNING    # running jobs only
Notes:
- The output format of squeue (and most other Slurm commands) is highly configurable to your needs. Look for the --format or --Format options.
- Every invocation of squeue sends a remote procedure call to the Slurm control daemon. Do not run squeue or other Slurm client commands from loops in shell scripts or other programs as this can result in a degradation of performance. Ensure that programs limit calls to squeue to the minimum necessary for the information you are trying to gather.
How to cancel jobs?
Use scancel command, e.g.
$ scancel <jobid>         # cancel specific job
$ scancel <jobid>_<index> # cancel indexed job in a job array
$ scancel -u <username>   # cancel all jobs of specific user
$ scancel -t PENDING      # cancel pending jobs
$ scancel -t RUNNING      # cancel running jobs
How to show job script of a running job?
Use scontrol command:
$ scontrol write batch_script <job_id> <file>
$ scontrol write batch_script <job_id> -
- If file name is omitted default file name will be slurm-<job_id>.sh
- If file name is - (i.e. dash) job script will be written to stdout.
How to get estimated start time of a job?
$ squeue --start
Notes:
- Estimated start times are dynamic and can change at any moment. Exact start times of individual jobs are usually unpredictable.
- Slurm will report N/A for the start time estimate if nodes are not currently being reserved by the scheduler for the job to run on.
How to show remaining walltime of running jobs?
Use squeue with format option "%L", e.g.:
$ squeue -t r -o "%u %i %L"
How to check priority of jobs?
Use squeue with format options "%Q" and/or "%p", e.g.:
$ squeue -o "%8i %8u %15a %.10r %.10L %.5D %.10Q"
Use sprio command to display the priority components (age/fairshare/...) for each job:
$ sprio
Use sshare command for listing the shares of associations, e.g. accounts:
$ sshare
How to prevent (hold) jobs from being scheduled for execution?
$ scontrol hold <job_id>
How to unhold job?
$ scontrol release <job_id>
How to suspend a running job?
$ scontrol suspend <job_id>
How to resume a suspended job?
$ scontrol resume <job_id>
How to requeue (cancel and resubmit) a particular job?
$ scontrol requeue <job_id>
How to monitor resource usage of running job(s)?
Use sstat command.
'sstat -e' command shows a list of fields that can be specified with the '--format' option.
Example:
$ sstat --format=JobId,AveCPU,AveRSS,MaxRSS -j <jobid>
This will show average CPU time, average and maximum memory consumption of all tasks in the running job. Ideally, average CPU time equals the number of cores allocated for the job multiplied by the current run time of the job. The maximum memory consumption gives an estimate of the peak amount of memory actually needed so far. This can be compared with the amount of memory requested for the job. Over-requesting memory can result in significant waste of compute resources.
Notes:
- Users can also ssh into compute nodes that they have one or more running jobs on. Once logged in, they can use standard Linux process monitoring tools like ps, (h)top, free, vmstat, iostat, du, ...
- Users can also attach an interactive shell under an already allocated job by running the following command:
srun --jobid <job> --overlap --pty /bin/bash
Once logged in, they can again use standard Linux process monitoring tools like ps, (h)top, free, vmstat, iostat, du, ... For a single node job the user does not even need to know the node that the job is running on. For a multinode job, the user can still use '-w <node>' option to specify a specific node.
How to get detailed job information
$ scontrol show job 1234 # For job id 1234
$ scontrol show jobs     # For all jobs
$ scontrol -o show jobs  # One line per job
How to modify a pending/running job?
Use
$ scontrol update JobId=<jobid> ...
E.g.:
$ scontrol update JobId=42 TimeLimit=7-0
This will modify the time limit of the job to 7 days.
Note: Update requests for running jobs are mostly restricted to Slurm administrators. In particular, only an administrator can increase the TimeLimit of a job.
How to show accounting data of completed job(s)?
Use sacct command.
'sacct -e' command shows a list of fields that can be specified with the '--format' option.
How to retrieve job history and accounting?
For a specific job:
$ sacct -j <jobid> --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
For a specific user:
$ sacct -u <user> --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
Note: Default time window is the current day.
Starting from a specific date:
$ sacct -u <user> -S 2020-01-15 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
Within a time window:
$ sacct -u <user> -S 2020-01-15 -E 2020-01-31 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
Note:
You can also set the environment variable $SACCT_FORMAT to specify the default format. To get a general idea of how efficiently a job utilized its resources, the following format can be used:
export SACCT_FORMAT="JobID,JobName,Elapsed,NCPUs,TotalCPU,CPUTime,ReqMem,MaxRSS,MaxDiskRead,MaxDiskWrite,State,ExitCode"
To find how efficiently the CPUs were used, divide TotalCPU by CPUTime. To find how efficiently memory was used, divide MaxRSS by ReqMem. But be aware that sacct memory usage measurement doesn't catch very rapid memory spikes. If your job got killed for running out of memory, it did run out of memory even if sacct reports a lower memory usage than would trigger an out-of-memory kill. A job that reads or writes excessively to disk might be bogged down significantly by I/O operations.
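The TotalCPU to CPUTime ratio can be computed in the shell once the time strings are converted to seconds. The following is a minimal sketch, assuming the time values appear as [D-]HH:MM:SS, MM:SS or MM:SS.mmm. The function names slurm_time_to_seconds and cpu_efficiency_percent are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: convert a time string of the form [D-]HH:MM:SS, MM:SS
# or MM:SS.mmm to whole seconds (fractional seconds are truncated).
slurm_time_to_seconds() {
    awk -v t="$1" 'BEGIN {
        days = 0
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, f, ":")
        if (n == 3)      secs = f[1] * 3600 + f[2] * 60 + f[3]
        else if (n == 2) secs = f[1] * 60 + f[2]
        else             secs = f[1]
        printf "%d\n", days * 86400 + secs
    }'
}

# Hypothetical helper: print TotalCPU / CPUTime as an integer percentage.
# Usage: cpu_efficiency_percent <TotalCPU> <CPUTime>
cpu_efficiency_percent() {
    local used total
    used=$(slurm_time_to_seconds "$1")
    total=$(slurm_time_to_seconds "$2")
    [ "$total" -gt 0 ] || return 1
    echo $(( 100 * used / total ))
}

# Example usage with values copied from sacct output:
#   cpu_efficiency_percent 06:00:00 08:00:00
```

The result is rounded down by integer division. The seff command gives a comparable figure without manual calculation.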
How to get efficiency information of completed job(s)?
Use
$ seff <jobid>
command for some brief information.
Note: It is good practice to have a look at the efficiency of your job(s) on completion and we expect you to do so. This way you can improve your job specifications in the future.
How to get complete field values from sstat and sacct commands?
When using the --format option for listing various fields you can put a %NUMBER afterwards to specify how many characters should be printed.
E.g. '--format=User%30' will print 30 characters for the user name (right justified). A %-30 will print 30 characters left justified.
sstat and sacct also provide the '--parsable' and '--parsable2' options to always print full field values, delimited with a pipe '|' character by default. The delimiting character can be specified by using the '--delimiter' option, e.g. '--delimiter=","' for comma separated values.
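Delimited output is convenient for further processing in scripts. The following is a minimal sketch that extracts one column by its header name from such output read on stdin. The function name column_by_name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read delimited text with a header line from stdin and
# print the values of the column whose header equals <name>.
# Usage: column_by_name <name> [delimiter]   (delimiter defaults to '|')
column_by_name() {
    local name delim
    name=$1
    delim=${2:-|}
    awk -F "$delim" -v name="$name" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) exit 1     # header not found
            next
        }
        { print $col }
    '
}

# Example usage:
#   sacct -j <jobid> --parsable2 --format=JobID,State,Elapsed | column_by_name State
```

Selecting by header name keeps the script working if the order of fields in '--format' changes.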
How to retrieve job records for all jobs running/pending at a certain point in time?
Use sacct with -s <state> and -S <start time> options, e.g.:
$ sacct -n -a -X -S 2021-04-01T00:00:00 -s R -o JobID,User%15,Account%10,NCPUS,NNodes,NodeList%1500
Note: When specifying the state "-s <state>" and the start time "-S <start time>", the default time window will be set to end time "-E" equal to start time. Thus, you will get a snapshot of all running/pending jobs at the instant given by "-S <start time>".
How to get a parsable list of hostnames from $SLURM_JOB_NODELIST?
$ scontrol show hostnames $SLURM_JOB_NODELIST
ADMINISTRATION
Note: Most commands in this section are restricted to system administrators.
How to stop Slurm from scheduling jobs?
You can stop Slurm from scheduling jobs on a per partition basis by setting that partition's state to DOWN. Set its state UP to resume scheduling. For example:
$ scontrol update PartitionName=foo State=DOWN
$ scontrol update PartitionName=foo State=UP
How to print actual hardware configuration of a node?
$ slurmd -C # print hardware configuration plus uptime
$ slurmd -G # print generic resource configuration
How to reboot (all) nodes as soon as they become idle?
$ scontrol reboot ASAP nextstate=RESUME <node1>,<node2> # specific nodes
$ scontrol reboot ASAP nextstate=RESUME ALL             # all nodes
How to cancel pending reboot of nodes?
$ scontrol cancel_reboot <node1>,<node2>
How to check current node status?
$ scontrol show node <node>
How to instruct all Slurm daemons to re-read the configuration file
$ scontrol reconfigure
How to prevent a user from submitting new jobs?
Use the following sacctmgr command:
$ sacctmgr update user <username> set maxsubmitjobs=0
Notes:
- Job submission is then rejected with the following message:
$ sbatch job.slurm
sbatch: error: AssocMaxSubmitJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
- Use the following command to release the limit:
$ sacctmgr update user <username> set maxsubmitjobs=-1
How to drain node(s)?
$ scontrol update NodeName=<node1>,<node2> State=DRAIN Reason="Some Reason"
Notes:
- Reason is mandatory.
- Do not just set state DOWN to drain nodes. This will kill any active jobs that may be running on those nodes.
How to list reason for nodes being drained or down?
$ sinfo -R
How to resume node state?
$ scontrol update NodeName=<node1>,<node2> State=RESUME
How to create a reservation on nodes?
Suggested reading: https://slurm.schedmd.com/reservations.html
$ scontrol create reservation user=root starttime=now duration=UNLIMITED flags=maint,ignore_jobs nodes=ALL
$ scontrol create reservation user=root starttime=2020-12-24T17:00 duration=12:00:00 flags=maint,ignore_jobs nodes=<node1>,<node2>
$ scontrol show reservation
Note: Add "FLEX" flag to allow jobs that qualify for the reservation to start before the reservation begins (and continue after it starts). Add "MAGNETIC" flag to attract jobs that qualify for the reservation to run in that reservation without having requested it at submit time.
How to create a floating reservation on nodes?
Use the flag "TIME_FLOAT" and a start time that is relative to the current time (use the keyword "now"). In the example below, the nodes are prevented from starting any jobs exceeding a walltime of 2 days.
$ scontrol create reservation user=root starttime=now+2days duration=UNLIMITED flags=maint,ignore_jobs,time_float nodes=<node1>,<node2>
Note: Floating reservations are not intended to run jobs, but to prevent long running jobs from being initiated on specific nodes. Attempts by users to make use of a floating reservation will be rejected. When ready to perform the maintenance, place the nodes in DRAIN state and delete the reservation.
How to use a reservation?
$ sbatch --reservation=foo_6 ... script.slurm
How to delete a reservation?
$ scontrol delete ReservationName=foo_6
How to get node oriented information similar to 'mdiag -n'?
$ sinfo -N -l
Fields can be individually customized. See sinfo man page. For example:
$ sinfo -N --format="%8N %12P %.4C %.8O %.6m %.6e %.8T %.20E"
NODELIST PARTITION    CPUS CPU_LOAD MEMORY FREE_M    STATE               REASON
n0001    standard*    0/16     0.01 128000 120445     idle                 none
n0002    standard*    0/16     0.01 128000 120438     idle                 none
n0003    standard*    0/0/      N/A 128000    N/A    down*       Not responding
How to get node oriented information similar to 'pbsnodes'?
$ scontrol show nodes                    # One paragraph per node (all nodes)
$ scontrol show nodes <node1>,<node2>    # One paragraph per node (specified nodes)
$ scontrol -o show nodes                 # One line per node (all nodes)
$ scontrol -o show nodes <node1>,<node2> # One line per node (specified nodes)
How to update multiple jobs of a user with a single scontrol command?
Not possible. But you can e.g. use squeue to build the script taking advantage of its filtering and formatting options.
$ squeue -tpd -h -o "scontrol update jobid=%i priority=1000" >my.script
You can also identify the list of jobs and add them to the JobID all at once, for example:
$ scontrol update JobID=123 qos=reallylargeqos
$ scontrol update JobID=123,456,789 qos=reallylargeqos
$ scontrol update JobID=[123-400],[500-600] qos=reallylargeqos
Another option is to use the JobName, if all the jobs have the same name.
$ scontrol update JobName="foobar" UserID=johndoe qos=reallylargeqos
However, Slurm does not allow the UserID filter alone.
How to create a new account?
Add account at top level in association tree:
$ sacctmgr add account <accountname> Cluster=justus Description="Account description" Organization="none"
Add account as child of some parent account in association tree:
$ sacctmgr add account <accountname> parent=<parent_accountname>
How to move account to another parent?
$ sacctmgr modify account name=<accountname> set parent=<new_parent_accountname>
How to delete an account?
$ sacctmgr delete account name=<accountname>
How to add a new user?
$ sacctmgr add user <username> DefaultAccount=<accountname>
How to add/remove users from an account?
$ sacctmgr add user <username> account=<accountname>          # Add user to account
$ sacctmgr add user <username> account=<accountname2>         # Add user to a second account
$ sacctmgr remove user <username> where account=<accountname> # Remove user from this account
How to change default account of a user?
$ sacctmgr modify user where user=<username> set DefaultAccount=<default_account>
Note: The user must already be associated with the account you want to set as default.
How to show account information?
$ sacctmgr show assoc
$ sacctmgr show assoc tree
How to implement user resource throttling policies?
Quoting from https://bugs.schedmd.com/show_bug.cgi?id=3600#c4
With Slurm, the associations are meant to establish base limits on the defined partitions, accounts and users. Because limits propagate down through the association tree, you only need to define limits at a high level and those limits will be applied to all partitions, accounts and users that are below it (parent to child). You can also override those high level (parent) limits by explicitly setting different limits at any lower level (on the child). So using the association tree is the best way to get some base limits applied that you want for most cases. QOS's are meant to override any of those base limits for exceptional cases. Like Maui, you can use QOS's to set a different priority. Again, the QOS would be overriding the base priority that could be set in the associations.
How to set a resource limit for an individual user?
Suggested reading: https://slurm.schedmd.com/resource_limits.html
Example:
$ sacctmgr modify user <username> set maxjobs=1             # Limit maximum number of running jobs for user
$ sacctmgr list assoc user=<username> format=user,maxjobs   # Show that limit
$ sacctmgr modify user <username> set maxjobs=-1            # Remove that limit
How to retrieve historical resource usage for a specific user or account?
Use sreport command.
Examples:
$ sreport cluster UserUtilizationByAccount Start=2021-01-01 End=2021-12-31 -t Hours user=<username>   # Report cluster utilization of given user broken down by accounts
$ sreport cluster AccountUtilizationByUser Start=2021-01-01 End=2021-12-31 -t Hours account=<account> # Report cluster utilization of given account broken down by users
Notes:
- By default CPU resources will be reported. Use '-T' option for other trackable resources, e.g. '-T cpu,mem,gres/gpu,gres/scratch'.
- On JUSTUS 2 registered compute projects ("Rechenvorhaben") are uniquely mapped to Slurm accounts of the same name. Thus, 'AccountUtilizationByUser' can also be used to report the aggregated cluster utilization of compute projects.
- sreport can be executed by regular users as well, in which case Slurm will only report their own usage records (but along with the total usage of the associated account in the case of 'AccountUtilizationByUser').
How to fix/reset a user's RawUsage value?
$ sacctmgr modify user <username> where Account=<account> set RawUsage=<number>
How to create/modify/delete QOSes?
Suggested reading: https://slurm.schedmd.com/qos.html
Examples:
$ sacctmgr show qos                                      # Show existing QOSes
$ sacctmgr add qos verylong                              # Create new QOS verylong
$ sacctmgr modify qos verylong set MaxWall=28-00:00:00   # Set maximum walltime limit
$ sacctmgr modify qos verylong set MaxTRESPerUser=cpu=4  # Set maximum number of CPUs a user can allocate at a given time
$ sacctmgr modify qos verylong set flags=denyonlimit     # Prevent submission if job requests exceed any limits of QOS
$ sacctmgr modify user <username> set qos+=verylong      # Add a QOS to a user account
$ sacctmgr modify user <username> set qos-=verylong      # Remove a QOS from a user account
$ sacctmgr delete qos verylong                           # Delete that QOS
How to find (and fix) runaway jobs?
$ sacctmgr show runaway
Notes:
- Runaway jobs are orphaned jobs that don't exist in the Slurm controller but have a start time and no end time in the Slurm database. Runaway jobs mess with accounting and affect new jobs of users who have too many runaway jobs.
- If there are jobs in this state this command will also provide an option to fix them. This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
How to show a history of database transactions?
$ sacctmgr list transactions
Note: Useful to get timestamps for when a user/account/qos has been created/modified/removed etc.