BwUniCluster3.0/Running Jobs


Purpose and function of a queuing system

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively.
General procedure: see the hints on Running Calculations.

Job submission process

bwUniCluster 3.0 uses the workload manager Slurm. All job submissions therefore have to be made with Slurm commands. Slurm queues and runs user jobs based on fair share policies.

Slurm

HPC Workload Manager on bwUniCluster 3.0 is Slurm. Slurm is a cluster management and job scheduling system. Slurm has three key functions.

  • It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
  • It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
  • It arbitrates contention for resources by managing a queue of pending work.

Any calculation on the compute nodes of bwUniCluster 3.0 requires the user to define it as a sequence of commands, specify the required run time, number of CPU cores and main memory, and submit all of this, i.e. the batch job, to the resource and workload manager.

Terms and definitions

Partitions

Slurm manages job queues for different partitions. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

  • CPU-only nodes
    • 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
    • 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
  • GPU-accelerated nodes
    • 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
    • 4-socket node with 4x AMD Instinct accelerators

Queues

Job queues are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:

  • Regular queues
    • cpu: Jobs that request CPU-only nodes.
    • gpu: Jobs that request GPU-accelerated nodes.
  • Development queues (dev)
    • Short, usually interactive jobs used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without long waiting times. They are the place to run short but heavy computations immediately without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit within the limits imposed by the queue. A resource request on bwUniCluster 3.0 requires at least the specification of the queue and the time, as shown in the example below.
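A minimal sketch of such a request, assuming a hypothetical job script my_job.sh; the queue names and time limits are taken from the tables below:

$ sbatch -p cpu -t 02:00:00 my_job.sh       # batch job in queue "cpu" with 2 hours of wall time
$ salloc -p dev_cpu -t 30                   # interactive job in queue "dev_cpu" with 30 minutes of wall time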

Jobs

Jobs can be run non-interactively as batch jobs or as interactive jobs.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This script is queued and executed as soon as the requested compute resources are available and allocated. Batch jobs are enqueued with the sbatch command. For interactive jobs, the resources are requested with the salloc command. As soon as the resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
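A minimal sketch of a batch script (the script name simple_job.sh and the program ./my_program are hypothetical; the #SBATCH options are explained in the sbatch section below):

#!/bin/bash
#SBATCH --partition=cpu        # queue, see the queue tables below
#SBATCH --ntasks=1             # a single task
#SBATCH --time=00:30:00        # 30 minutes of wall time
#SBATCH --mem-per-cpu=2000     # memory per allocated core in MB

./my_program                   # the actual computation

The script is queued with:

$ sbatch simple_job.sh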


Please remember:

  • Heavy computations are not allowed on the login nodes.
    Use a development or a regular job queue instead! Please refer to Allowed Activities on Login Nodes.
  • Development queues are meant for development tasks.
    Do not misuse these queues for regular, short-running jobs or chain jobs! Only one job may run at a time, and at most 3 jobs may be queued.

Queues on bwUniCluster 3.0

Policy

Computing time is provided in accordance with the fair share policy, which takes into account the investment share of the respective university and the resources already used by its members. In addition, the following throttling policy is active: a single user may use at most 1920 cores at any given time, aggregated over all running jobs. This corresponds to 30 nodes on the Ice Lake partition (30 x 64 cores) or 20 nodes on the standard partition (20 x 96 cores). The aim is to minimize waiting times and to maximize the number of users who can access computing time at the same time.

Regular Queues

Queue                   | Node Type                          | Default Resources                     | Minimal Resources | Maximum Resources
cpu_il                  | CPU nodes, Ice Lake                | mem-per-cpu=2000mb                    | -                 | time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
cpu                     | CPU nodes, Standard                | mem-per-cpu=2000mb                    | -                 | time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
highmem                 | CPU nodes, High Memory             | mem-per-cpu=12090mb                   | mem=380001mb      | time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
gpu_h100                | GPU nodes, NVIDIA GPU x4           | mem-per-gpu=193300mb, cpus-per-gpu=24 | -                 | time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
gpu_mi300               | GPU node, AMD GPU x4               | mem-per-gpu=128200mb, cpus-per-gpu=24 | -                 | time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
gpu_a100_il/gpu_h100_il | GPU nodes, Ice Lake, NVIDIA GPU x4 | mem-per-gpu=127500mb, cpus-per-gpu=16 | -                 | time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)

Table 1: Regular Queues
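As a hedged illustration of how a GPU node might be requested, assuming the generic Slurm option --gres=gpu:<count> is available on these partitions; the script content and program name are hypothetical:

#!/bin/bash
#SBATCH --partition=gpu_h100   # GPU queue from Table 1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24   # matches the default cpus-per-gpu=24
#SBATCH --gres=gpu:1           # request one of the four GPUs on the node (assumed generic Slurm syntax)
#SBATCH --time=04:00:00

./my_gpu_program               # hypothetical GPU application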

Development Queues

Only for development, i.e. debugging or performance optimization ...

Queue           | Node Type                | Default Resources                     | Minimal Resources | Maximum Resources
dev_cpu_il      | CPU nodes, Ice Lake      | mem-per-cpu=1950mb                    | -                 | time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
dev_cpu         | CPU nodes, Standard      | mem-per-cpu=1125mb                    | -                 | time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
dev_gpu_h100    | GPU nodes, NVIDIA GPU x4 | mem-per-cpu=1125mb, cpus-per-gpu=24   | -                 | time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
dev_gpu_a100_il | GPU nodes, NVIDIA GPU x4 | mem-per-gpu=127500mb, cpus-per-gpu=16 | -                 | time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)

Table 2: Development Queues


The default resources of a queue define the number of tasks and the amount of memory that are used if they are not explicitly specified with the sbatch command. The resource options --time, --ntasks, --nodes, --mem and --mem-per-cpu are described in the sbatch Options table below.

Check available resources: sinfo_t_idle

The Slurm command sinfo is used to view partition and node information for a system running Slurm. It takes downtime, reservations, and node state information into account when determining the available backfill window. On bwUniCluster 3.0 the sinfo command itself can only be used by the administrator.
SCC has therefore prepared a special script (sinfo_t_idle) to find out how many nodes are available for immediate use in each partition. Users can use this information to submit jobs that fit into the free resources and thus obtain short job turnaround times.

  • The following command displays which resources are available for immediate use in each partition.
$ sinfo_t_idle
Partition dev_cpu                 :      2 nodes idle
Partition cpu                     :     68 nodes idle
Partition highmem                 :      4 nodes idle
Partition dev_gpu_h100            :      0 nodes idle
Partition gpu_h100                :     11 nodes idle
Partition gpu_mi300               :      1 nodes idle
Partition dev_cpu_il              :      0 nodes idle
Partition cpu_il                  :      0 nodes idle
Partition dev_gpu_a100_il         :      0 nodes idle
Partition gpu_a100_il             :      0 nodes idle
Partition gpu_h100_il             :      0 nodes idle


Running Jobs

Slurm Commands (excerpt)

Important Slurm commands for non-administrators working on bwUniCluster 3.0.

Slurm command     | Brief explanation
sbatch            | Submits a job and puts it into the queue [sbatch]
salloc            | Requests resources for an interactive job [salloc]
scontrol show job | Displays detailed job state information [scontrol]
squeue            | Displays information about active, eligible, blocked, and/or recently completed jobs [squeue]
squeue --start    | Returns the start time of a submitted job [squeue]
sinfo_t_idle      | Shows what resources are available for immediate use [sinfo]
scancel           | Cancels a job [scancel]



Batch Jobs: sbatch

Batch jobs are submitted with the command sbatch. The main purpose of the sbatch command is to specify the resources that are needed to run the job. sbatch then places the batch job in the queue. When the job actually starts depends on the availability of the requested resources and on the fair share value.

  • The syntax and use of sbatch can be displayed via:
$ man sbatch

sbatch options can be given on the command line or inside your job script. Different defaults for some of these options are set per queue and are listed in the queue tables above.

sbatch Options
Command line                                         | Script                                                    | Purpose
-t, --time=time                                      | #SBATCH --time=time                                       | Wall clock time limit.
-N, --nodes=count                                    | #SBATCH --nodes=count                                     | Number of nodes to be used.
-n, --ntasks=count                                   | #SBATCH --ntasks=count                                    | Number of tasks to be launched.
--ntasks-per-node=count                              | #SBATCH --ntasks-per-node=count                           | Maximum count of tasks per node.
-c, --cpus-per-task=count                            | #SBATCH --cpus-per-task=count                             | Number of CPUs required per (MPI-)task.
--mem=value_in_MB                                    | #SBATCH --mem=value_in_MB                                 | Memory in MegaByte per node. (You should omit this option.)
--mem-per-cpu=value_in_MB                            | #SBATCH --mem-per-cpu=value_in_MB                         | Minimum memory required per allocated CPU. (You should omit this option.)
--mail-type=type                                     | #SBATCH --mail-type=type                                  | Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
--mail-user=mail-address                             | #SBATCH --mail-user=mail-address                          | The specified mail-address receives email notifications of state changes as defined by --mail-type.
--output=name                                        | #SBATCH --output=name                                     | File in which job output is stored.
--error=name                                         | #SBATCH --error=name                                      | File in which job error messages are stored.
-J, --job-name=name                                  | #SBATCH --job-name=name                                   | Job name.
--export=[ALL,] env-variables                        | #SBATCH --export=[ALL,] env-variables                     | Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
-A, --account=group-name                             | #SBATCH --account=group-name                              | Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The project group the job is accounted on is shown behind "Account=" in the output of "scontrol show job".
-p, --partition=queue-name                           | #SBATCH --partition=queue-name                            | Request a specific queue for the resource allocation.
--reservation=reservation-name                       | #SBATCH --reservation=reservation-name                    | Use a specific reservation for the resource allocation.
-C, --constraint=LSDF                                | #SBATCH --constraint=LSDF                                 | Job constraint for LSDF filesystems.
-C, --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS) | #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS) | Job constraint for BeeOND filesystem.
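A hedged sketch of a job script that combines several of these options; the script name my_analysis.sh, the job name, the mail address, the output file and the program are all hypothetical:

#!/bin/bash
#SBATCH --job-name=my_analysis              # job name shown by squeue
#SBATCH --partition=cpu                     # regular CPU queue (see Table 1)
#SBATCH --nodes=2                           # two nodes
#SBATCH --ntasks-per-node=96                # all 96 cores per node
#SBATCH --time=24:00:00                     # 24 hours of wall time
#SBATCH --mail-type=END,FAIL                # send mail when the job ends or fails
#SBATCH --mail-user=your.name@example.org   # hypothetical mail address
#SBATCH --output=my_analysis_%j.out         # %j is replaced by the job id

mpirun ./my_mpi_program                     # hypothetical MPI application

It is submitted with:

$ sbatch my_analysis.sh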


Interactive Jobs: salloc

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that require more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, the following command has to be executed:

$ salloc -p cpu -n 1 -t 120 --mem=5000

You will then get one core on a compute node within the partition "cpu". After executing this command, DO NOT CLOSE your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core. To run a serial program on the granted core you only have to type the name of the executable.

$ ./<my_serial_program>

Please be aware that in this example your serial job must finish within 2 hours, otherwise it will be killed by the system during runtime.


You can also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. You can start it with the command:

$ xterm

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.


An interactive parallel application running on one or several compute nodes (e.g. 5 nodes with 96 cores each) usually requires a certain amount of memory per node (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00  --mem=50gb

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node. If you want to have access to another node, you have to open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to connect to the running interactive job and then to a specific node:

$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash

With the command:

$ squeue

the jobid and the nodelist can be shown.

If you want to run MPI programs, you can do so by simply typing mpirun <program_name>; your program will then run on all 480 allocated cores. A very simple example for starting a parallel job is:

$ mpirun <my_mpi_program>

You can also start the debugger ddt by the commands:

$ module add devel/ddt
$ ddt <my_mpi_program>

The above commands execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example is:

$ mpirun -n 50 <my_mpi_program>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).
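For example (a sketch only; the core count matches the 5-node allocation above):

$ mpiexec.hydra -n 480 <my_mpi_program>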

Interactive Computing with Jupyter

Monitor and manage jobs

List of your submitted jobs : squeue

squeue displays information about YOUR active, pending and/or recently completed jobs; only your own jobs are shown. The command squeue is explained in detail at https://slurm.schedmd.com/squeue.html or via the manpage (man squeue).

  • squeue example on bwUniCluster 3.0 (Only your own jobs are displayed!).
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1262       cpu     wrap ka_ab123  R       8:15      1 uc3n002
              1267 dev_gpu_h     wrap ka_ab123 PD       0:00      1 (Resources)
              1265   highmem     wrap ka_ab123  R       2:41      1 uc3n084
$ squeue -l
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              1262       cpu     wrap ka_ab123  RUNNING       8:55     20:00      1 uc3n002
              1267 dev_gpu_h     wrap ka_ab123  PENDING       0:00     20:00      1 (Resources)
              1265   highmem     wrap ka_ab123  RUNNING       3:21     20:00      1 uc3n084

Detailed job information : scontrol show job

scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for one specified job. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail at https://slurm.schedmd.com/scontrol.html or via the manpage (man scontrol).

  • Display the state of all your jobs in normal mode: scontrol show job
  • Display the state of the job with <jobid> in normal mode: scontrol show job <jobid>

  • Here is an example from bwUniCluster 3.0.
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
1262       cpu     wrap ka_zs040  R       1:12      1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$ 
$ scontrol show job 1262

JobId=1262 JobName=wrap
   UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
   Priority=4246 Nice=0 Account=ka QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
   AccrueTime=2025-04-04T10:01:30
   StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
   Partition=cpu AllocNode:Sid=uc3n999:2819841
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=uc3n002
   BatchHost=uc3n002
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=2000M,node=1,billing=1
   AllocTRES=cpu=2,mem=4000M,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
   StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
   StdIn=/dev/null
   StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out


Canceling own jobs : scancel

The scancel command is used to cancel your own jobs. It is explained in detail at https://slurm.schedmd.com/scancel.html or via the manpage (man scancel). With -i, scancel asks for confirmation before cancelling; with -t <job_state_name> it cancels all of your jobs that are in the given state. The basic forms are:

$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
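For illustration, using the job id from the examples above (a sketch only; the job id is hypothetical and PENDING is one of the standard Slurm job states):

$ scancel 1262              # cancel job 1262
$ scancel -i 1262           # ask for confirmation before cancelling job 1262
$ scancel -t PENDING        # cancel all of your jobs that are still pending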

Slurm Options

Detailed Slurm usage

Best Practices

Step-by-Step example

Dos and Don'ts