General information about Slurm

The bwForCluster Helix uses Slurm as its batch system.

Slurm documentation: https://slurm.schedmd.com/documentation.html
Slurm cheat sheet: https://slurm.schedmd.com/pdfs/summary.pdf
Slurm tutorials: https://slurm.schedmd.com/tutorials.html

Slurm Command Overview

Slurm commands	Brief explanation
sbatch	Submits a job and queues it in an input queue
salloc	Request resources for an interactive job
squeue	Displays information about active, eligible, blocked, and/or recently completed jobs
scontrol	Displays detailed job state information
sstat	Displays status information about a running job
scancel	Cancels a job

Job Submission

Batch jobs are submitted with the command:

$ sbatch <job-script>

A job script contains options for Slurm in lines beginning with #SBATCH, followed by the commands you want to execute on the compute nodes. For example:

#!/bin/bash
#SBATCH --partition=cpu-single
#SBATCH --ntasks=1
#SBATCH --time=00:20:00
#SBATCH --mem=1gb
#SBATCH --export=NONE
echo 'Hello world'

This job requests one core (--ntasks=1) and 1 GB memory (--mem=1gb) for 20 minutes (--time=00:20:00) on nodes provided by the partition 'cpu-single'.

For better reproducibility of jobs, it is recommended to use the option --export=NONE, which prevents environment variables from the submit session from propagating into the job environment, and to load the required software modules in the job script.
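When jobs are submitted from a wrapper script, it is often useful to keep the job ID for later monitoring. The following is a minimal sketch, assuming the standard "--parsable" option of sbatch, which prints the job ID, optionally followed by ";" and the cluster name. The function name submit_job is hypothetical.

```shell
# Submit a job script and print only the numeric job ID.
# Usage: submit_job <job-script>
submit_job() {
    local out
    # --parsable makes sbatch print "<jobid>" or "<jobid>;<cluster>".
    out=$(sbatch --parsable "$1") || return 1
    # Strip an optional ";<cluster>" suffix.
    echo "${out%%;*}"
}
```

The returned ID can then be passed to commands such as scontrol, sstat or seff.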

Partitions

On bwForCluster Helix, jobs must request a partition with "--partition=<partition_name>" on submission; jobs submitted without a partition run in the default partition devel. Within a partition, job allocations are routed automatically to the most suitable compute node(s) for the requested resources (e.g. number of nodes and cores, memory, number of GPUs).

The partitions devel, cpu-single and gpu-single are operated in shared mode, i.e. jobs from different users can run on the same node. Jobs can get exclusive access to compute nodes in these partitions with the "--exclusive" option. The partitions cpu-multi and gpu-multi are operated in exclusive mode. Jobs in these partitions automatically get exclusive access to the requested compute nodes.

Partition	Node Access Policy	Node Types	Default	Limits
devel	shared	cpu, gpu4	ntasks=1, time=00:10:00, mem-per-cpu=2gb	nodes=2, time=00:30:00
cpu-single	shared	cpu, fat	ntasks=1, time=00:30:00, mem-per-cpu=2gb	nodes=1, time=120:00:00
gpu-single	shared	gpu4, gpu8	ntasks=1, time=00:30:00, mem-per-cpu=2gb	nodes=1, time=120:00:00
cpu-multi	job exclusive	cpu	nodes=2, time=00:30:00	nodes=32, time=48:00:00
gpu-multi	job exclusive	gpu4	nodes=2, time=00:30:00	nodes=8, time=48:00:00
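The table above can be turned into a simple selection rule. The following sketch is an illustration only: it picks one of the four production partitions from the number of nodes and GPUs per node, and it ignores the devel partition as well as the time limits. The function name choose_partition is hypothetical.

```shell
# Suggest a partition from the requested node count and GPUs per node.
# Usage: choose_partition <nodes> <gpus-per-node>
choose_partition() {
    local nodes=$1 gpus=$2
    if [ "$nodes" -gt 1 ]; then
        # More than one node: only the exclusive multi-node partitions fit.
        if [ "$gpus" -gt 0 ]; then echo gpu-multi; else echo cpu-multi; fi
    else
        # Single node: the shared single-node partitions are limited to nodes=1.
        if [ "$gpus" -gt 0 ]; then echo gpu-single; else echo cpu-single; fi
    fi
}
```

For example, "choose_partition 1 0" prints cpu-single and "choose_partition 4 4" prints gpu-multi.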

GPU requests

For the partitions gpu-single and gpu-multi it is required to request GPU resources.

The number of GPUs is requested with the option "--gres=gpu:<number-of-gpus>".
A specific GPU type can be requested with the option "--gres=gpu:<gpu-type>:<number-of-gpus>". Possible values for <gpu-type> are listed in the row 'GPU Type' of the Compute Nodes table.
GPUs that are suitable for a specific GPU memory requirement can be requested with the option "--gres=gpu:<number-of-gpus>,gpumem_per_gpu:<required-gpumem>GB". This only restricts the selection of possible GPU types. The job can use the total GPU memory per GPU as listed in the row 'GPU memory per GPU' of the Compute Nodes table.
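The three request forms above can be assembled with a small helper. This is a sketch, not an official tool: the function name gres_option and its "type" and "mem" keywords are hypothetical, and the memory value in the usage example is only illustrative.

```shell
# Build a --gres option in one of the three forms described above.
# Usage: gres_option <number-of-gpus>
#        gres_option <number-of-gpus> type <gpu-type>
#        gres_option <number-of-gpus> mem <required-gpumem-in-GB>
gres_option() {
    local count=$1
    case "${2:-}" in
        type) echo "--gres=gpu:${3}:${count}" ;;
        mem)  echo "--gres=gpu:${count},gpumem_per_gpu:${3}GB" ;;
        *)    echo "--gres=gpu:${count}" ;;
    esac
}
```

For example, "gres_option 2 type A40" prints --gres=gpu:A40:2, and "gres_option 1 mem 40" prints --gres=gpu:1,gpumem_per_gpu:40GB. The result can be passed on the sbatch command line.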

Constraints

It is possible to refine the resource request for a job with the option "--constraint=<feature>".

Feature	Meaning
fp64	request GPU types with FP64 capability (double precision)

Examples

Here you can find some example scripts for batch jobs.

Serial Programs

#!/bin/bash
#SBATCH --partition=cpu-single
#SBATCH --ntasks=1
#SBATCH --time=20:00:00
#SBATCH --mem=4gb
./my_serial_program

Notes:

Jobs with "--mem" up to 236gb can run on all node types associated with the cpu-single partition.

Multi-threaded Programs

#!/bin/bash
#SBATCH --partition=cpu-single
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=01:30:00
#SBATCH --mem=50gb
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./my_multithreaded_program

Notes:

Jobs with "--ntasks-per-node" up to 64 and "--mem" up to 236gb can run on all node types associated with the cpu-single partition.
With "export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}" the number of threads is set to the number of cores requested with "--cpus-per-task".
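Slurm only sets SLURM_CPUS_PER_TASK when "--cpus-per-task" is given, so a script reused without that option would export an empty OMP_NUM_THREADS. A minimal sketch of a defensive variant, assuming that a single thread is an acceptable fallback (the function name set_omp_threads is hypothetical):

```shell
# Export OMP_NUM_THREADS from the Slurm allocation,
# falling back to 1 if --cpus-per-task was not requested.
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

set_omp_threads
echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
```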

MPI Programs

#!/bin/bash
#SBATCH --partition=cpu-multi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --time=12:00:00
module load compiler/gnu
module load mpi/openmpi
srun ./my_mpi_program

Notes:

"--mem" requests the memory per node. The maximum is 236gb.
The compiler and MPI modules used for the compilation must be loaded before the start of the program.
It is recommended to start MPI programs with 'srun'.
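In the example above, srun starts nodes x tasks per node = 2 x 64 = 128 MPI ranks. The following sketch computes that number inside a job from the environment variables SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE, which Slurm sets for the job (the latter only if "--ntasks-per-node" was given). The function name mpi_rank_count is hypothetical.

```shell
# Print the number of MPI ranks that srun is expected to start:
# allocated nodes multiplied by the requested tasks per node.
mpi_rank_count() {
    echo $(( ${SLURM_JOB_NUM_NODES:-1} * ${SLURM_NTASKS_PER_NODE:-1} ))
}

echo "Expecting $(mpi_rank_count) MPI rank(s)"
```

Logging this value before the srun line makes it easy to spot a mismatch between the request and what the program reports.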

GPU Programs

#!/bin/bash
#SBATCH --partition=gpu-single
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --gres=gpu:A40:4
#SBATCH --time=12:00:00
#SBATCH --mem=200gb
module load devel/cuda
export OMP_NUM_THREADS=${SLURM_NTASKS}
./my_cuda_program

Notes:

The number of GPUs per node is requested with the option "--gres=gpu:<number-of-gpus>".
It is recommended to request a suitable GPU type for your application with the option "--gres=gpu:<gpu-type>:<number-of-gpus>". For <gpu-type> put the 'GPU Type' listed in the Compute Nodes table.
- Example for a request of two A40 GPUs: --gres=gpu:A40:2
- Example for a request of one A100 GPU: --gres=gpu:A100:1
If you are unsure which GPU type runs your code faster, run a test case on each type and compare the run times. In general the following applies:
- A40 GPUs are optimized for single precision computations.
- A100 and H200 GPUs offer better performance for double precision computations or if the code makes use of tensor cores.
The CUDA module used for compilation must be loaded before the start of the program.
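A quick way to confirm that a job sees the requested GPUs is to count the entries in CUDA_VISIBLE_DEVICES. This sketch assumes that Slurm exports CUDA_VISIBLE_DEVICES for the allocated GPUs, which is its usual behaviour for GPU resources but is not stated on this page. The function name visible_gpu_count is hypothetical.

```shell
# Count the GPUs listed in CUDA_VISIBLE_DEVICES (comma-separated indices).
visible_gpu_count() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
        return
    fi
    echo "$CUDA_VISIBLE_DEVICES" | awk -F, '{ print NF }'
}

echo "GPUs visible to this job: $(visible_gpu_count)"
```

With the request "--gres=gpu:A40:4" from the example above, the expected count is 4.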

More examples

Further batch script examples are available on bwForCluster Helix in the directory: /opt/bwhpc/common/system/slurm-examples

Interactive Jobs

Interactive jobs must NOT run on the login nodes. Instead, request resources for an interactive job with salloc. The following example requests an interactive session on 1 core for 2 hours:

$ salloc --partition=cpu-single --ntasks=1 --time=2:00:00

After executing this command, wait until the queueing system has granted the requested resources. Once they are granted, you are automatically logged in to the allocated compute node.

If you use applications or tools which provide a GUI, enable X-forwarding for your interactive session with:

$ salloc --partition=cpu-single --ntasks=1 --time=2:00:00 --x11

Once the walltime limit has been reached you will be automatically logged out from the compute node.

For convenient access to specific GUI applications (JupyterLab, ...) on the cluster, we provide a web-based platform: bwVisu

Job Monitoring

Information about submitted jobs

For an overview of your submitted jobs use the command:

$ squeue

To get detailed information about a specific job use the command:

$ scontrol show job <jobid>

Notes:
A job start may be delayed for various reasons:

(QOSMaxCpuPerUserLimit) - There is a limit to how many CPU cores a user can use at the same time. The job exceeds this limit.
(QOSMaxGRESPerUser) - There is a limit to how many GPUs a user can use at the same time. The job exceeds this limit.
(QOSMinGRES) - The job was submitted to a gpu partition without requesting a GPU.
(launch failed requeued held) - The job has failed to start. You may be able to resume it using scontrol (for example with "scontrol release <jobid>"). Alternatively you can cancel it and submit it again.

For further reasons please refer to: https://slurm.schedmd.com/job_reason_codes.html
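The reason code of a pending job appears in the output of "scontrol show job" as a "Reason=..." field. The following sketch extracts that field from the output. It assumes the usual key=value layout of scontrol, and the function name job_reason is hypothetical.

```shell
# Read "scontrol show job <jobid>" output on stdin and print the Reason field.
job_reason() {
    sed -n 's/.*Reason=\([^ ]*\).*/\1/p' | head -n 1
}
```

Usage: scontrol show job <jobid> | job_reason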

Information about resource usage of running jobs

You can monitor the resource usage of running jobs with the sstat command. For example:

$ sstat --format=JobId,AveCPU,AveRSS,MaxRSS -j <jobid>

This shows the average CPU time and the average and maximum memory consumption of all tasks in the running job.

The command 'sstat -e' shows a list of fields that can be specified with the '--format' option.
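Memory fields such as AveRSS and MaxRSS are printed with a unit suffix. The sketch below converts such a value to megabytes for easier comparison with the "--mem" request. It assumes suffixes K, M, G or T (base 1024) and treats a value without suffix as bytes. The function name rss_to_mb is hypothetical.

```shell
# Convert a memory value such as "2048K" or "1.5G" to megabytes.
# Usage: rss_to_mb <value>
rss_to_mb() {
    echo "$1" | awk '{
        value = $1 + 0                    # numeric prefix, e.g. 2048 from "2048K"
        unit = substr($1, length($1))     # last character is the unit suffix
        if (unit == "K") value /= 1024
        else if (unit == "G") value *= 1024
        else if (unit == "T") value *= 1024 * 1024
        else if (unit != "M") value /= 1024 * 1024   # no suffix: assume bytes
        printf "%.1f\n", value
    }'
}
```

For example, "rss_to_mb 2048K" prints 2.0 and "rss_to_mb 1.5G" prints 1536.0.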

Interactive access to running jobs

It is also possible to attach an interactive shell to a running job with the command:

$ srun --jobid=<jobid> --overlap --pty /bin/bash

Commands like 'top' show you the busiest processes on the node. To exit 'top', type 'q'.

To monitor your GPU processes use the command 'nvidia-smi'.

Job Feedback

You get feedback on resource usage and job efficiency for completed jobs with the command:

$ seff <jobid>

Example Output:

============================= JOB FEEDBACK =============================
Job ID: 12345678
Cluster: helix
User/Group: hd_ab123/hd_hd
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 64
CPU Utilized: 3-04:11:46
CPU Efficiency: 97.90% of 3-05:49:52 core-walltime
Job Wall-clock time: 00:36:29
Memory Utilized: 432.74 GB (estimated maximum)
Memory Efficiency: 85.96% of 503.42 GB (251.71 GB/node)

Explanation:

Nodes: Number of allocated nodes for the job.
Cores per node: Number of physical cores per node allocated for the job.
CPU Utilized: Sum of utilized core time.
CPU Efficiency: 'CPU Utilized' with respect to core-walltime (= 'Nodes' x 'Cores per node' x 'Job Wall-clock time') in percent.
Job Wall-clock time: runtime of the job.
Memory Utilized: Sum of memory used. For multi node MPI jobs the sum is only correct when srun is used instead of mpirun.
Memory Efficiency: 'Memory Utilized' with respect to total allocated memory for the job.
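The core-walltime and CPU efficiency in the example output can be reproduced by hand from the other fields. The sketch below does the arithmetic for the values shown above. The helper name to_seconds is hypothetical and only handles the [D-]HH:MM:SS form used in this output.

```shell
# Convert a duration of the form [D-]HH:MM:SS to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) printf "%d\n", $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         printf "%d\n", $1 * 3600 + $2 * 60 + $3
    }'
}

nodes=2
cores_per_node=64
wall=$(to_seconds 00:36:29)       # Job Wall-clock time
used=$(to_seconds 3-04:11:46)     # CPU Utilized

# core-walltime = Nodes x Cores per node x Job Wall-clock time
core_walltime=$(( nodes * cores_per_node * wall ))

# CPU Efficiency = CPU Utilized / core-walltime, in percent
efficiency=$(awk -v u="$used" -v c="$core_walltime" 'BEGIN { printf "%.2f", 100 * u / c }')

echo "core-walltime: ${core_walltime} s, CPU efficiency: ${efficiency}%"
# prints: core-walltime: 280192 s, CPU efficiency: 97.90%
```

280192 seconds correspond to 3-05:49:52, which matches the core-walltime in the example output.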

Job Monitoring Portal

For more detailed information about your jobs visit the job monitoring portal: https://helix-monitoring.bwservices.uni-heidelberg.de

Accounting

Jobs are billed for allocated CPU cores, memory and GPUs.

To see the accounting data of a specific job:

$ sacct -j <jobid> --format=user,jobid,account,nnodes,ncpus,time,elapsed,AllocTRES%50

To retrieve the job history for a specific user for a certain time frame:

$ sacct -u <user> -S 2022-08-20 -E 2022-08-30 --format=user,jobid,account,nnodes,ncpus,time,elapsed,AllocTRES%50
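The AllocTRES column lists the billed resources as a comma-separated list of key=value pairs. A minimal sketch for reading a single entry from such a string follows. The function name tres_value is hypothetical, and the strings in the usage note are illustrative rather than taken from a real job.

```shell
# Print the value of one key from an AllocTRES string.
# Usage: tres_value <key> <AllocTRES-string>
tres_value() {
    echo "$2" | tr ',' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}
```

For example, tres_value cpu "billing=128,cpu=128,mem=472G,node=2" prints 128, and tres_value gres/gpu "cpu=40,gres/gpu=4,mem=200G,node=1" prints 4.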

Overview about free resources

On the login nodes the following command shows what resources are available for immediate use:

$ sinfo_t_idle
