<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.bwhpc.de/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=S+Braun</id>
	<title>bwHPC Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.bwhpc.de/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=S+Braun"/>
	<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/e/Special:Contributions/S_Braun"/>
	<updated>2026-04-23T10:53:32Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.17</generator>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Registration/2FA&amp;diff=15989</id>
		<title>Registration/2FA</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Registration/2FA&amp;diff=15989"/>
		<updated>2026-04-21T13:28:23Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Token Management */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Generate a Second Factor (2FA) =&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
You or your group must take care of the hardware for the second factor yourself. We do not provide hardware keys or mobile devices.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
To improve security, a &#039;&#039;&#039;2-factor authentication mechanism (2FA)&#039;&#039;&#039; is enforced for logins to bwUniCluster/bwForClusters. In addition to the service password, a second value, the &#039;&#039;&#039;second factor&#039;&#039;&#039;, has to be entered on every login.&lt;br /&gt;
&lt;br /&gt;
If you have a mobile device, you can use a software-based solution as a second factor. If you don&#039;t want to use a smartphone app, we recommend a hardware token such as a Yubikey.&lt;br /&gt;
&lt;br /&gt;
* If you have any questions about 2FA, please read the [[Registration/2FA/FAQ|FAQs]], and if your question remains unanswered, please submit a support ticket.&lt;br /&gt;
&lt;br /&gt;
* The pros and cons of the various solutions can be found on this [[Registration/2FA/ProCon|wiki page]].&lt;br /&gt;
&lt;br /&gt;
= How 2FA works on the bwHPC Clusters =&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
It is very important that the device that generates the One-Time Passwords and the device used to log into the bwUniCluster/bwForClusters are not the same.&lt;br /&gt;
Otherwise an attacker who gains access to your system can steal both the service password and the secret key of the software token application, allowing them to generate One-Time Passwords and log into the HPC system without your knowledge.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:2fa token code.jpg|right|200px|thumb|Hardware Token for TOTP]]&lt;br /&gt;
On the bwUniCluster/bwForClusters we use either six-digit, auto-generated, time-dependent &#039;&#039;&#039;one-time passwords&#039;&#039;&#039; (TOTP) or Yubico OTP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TOTPs&#039;&#039;&#039; are generated either by a dedicated hardware device (a &#039;&#039;&#039;hardware token&#039;&#039;&#039;) or by an ordinary application running on a common device (a &#039;&#039;&#039;software token&#039;&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
The token has to be synchronized with a central server before it can be used for authentication. It then generates an endless stream of six-digit values (TOTPs), each of which can be used only once and is valid only for a very short interval of time. This makes it much harder for potential attackers to access the HPC system, even if they know the regular service password.&lt;br /&gt;
&lt;br /&gt;
Typically a new TOTP value is generated every 30 seconds. Once the current TOTP value has been used successfully for a login, it is depleted and you have to wait up to 30 seconds for the next value. If you don&#039;t want to use a smartphone, we recommend a hardware token, such as a Yubikey or another TOTP-compatible device.&lt;br /&gt;
We do not recommend TOTP generators running on PCs: if the second factor is generated on the same computer on which the login takes place, it is no longer a second factor.&lt;br /&gt;
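For the curious, the 30-second TOTP mechanism described above can be sketched in a few lines of Python. This is a minimal illustration of RFC 6238 with HMAC-SHA1 (the variant most authenticator apps use); the secret shown is the RFC test key, not a real token secret:

```python
import hmac
import hashlib
import struct

def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP value using HMAC-SHA1."""
    counter = unix_time // step                       # number of 30-second intervals since the epoch
    msg = struct.pack(">Q", counter)                  # 8-byte big-endian counter (RFC 4226)
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                        # dynamic truncation: low nibble picks an offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: ASCII secret "12345678901234567890" at T = 59 s
print(totp(b"12345678901234567890", 59))  # → 287082
```

Because the code depends only on the shared secret and the current time interval, the server and your token app compute the same six digits independently, which is why the token must be time-synchronized once at registration.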
&lt;br /&gt;
[[File:Otpapp.png|right|150px|thumb|Source: https://getaegis.app]]&lt;br /&gt;
&lt;br /&gt;
The most common solution is to use a mobile device (e.g. your smartphone or tablet) as a Software Token by installing one of the following apps:&lt;br /&gt;
* 2FAS for [https://play.google.com/store/apps/details?id=com.twofasapp Android] or [https://apps.apple.com/us/app/2fa-authenticator-2fas/id1217793794 iOS] ([https://2fas.com/ Web Page] and [https://github.com/twofas GitHub], &#039;&#039;Apple and Google Cloud can be used for backups depending on the operating system.&#039;&#039;)&lt;br /&gt;
* Open Source FreeOTP ([https://github.com/freeotp GitHub]) on [https://f-droid.org/en/packages/org.fedorahosted.freeotp/ F-Droid], [https://play.google.com/store/search?q=freeotp Android] or [https://apps.apple.com/de/app/freeotp-authenticator/id872559395 iOS], with support for local backup files.&lt;br /&gt;
* Google Authenticator for [https://play.google.com/store/apps/details?id=com.google.android.apps.authenticator2 Android] or [https://apps.apple.com/de/app/google-authenticator/id388497605 iOS] (&#039;&#039;Google Cloud can be used for backups, but these backups are not encrypted and can therefore be read by Google!&#039;&#039;)&lt;br /&gt;
* Microsoft Authenticator for [https://play.google.com/store/apps/details?id=com.azure.authenticator Android] or [https://apps.apple.com/de/app/microsoft-authenticator/id983156458 iOS] ([https://www.microsoft.com/de-de/security/mobile-authenticator-app Web Page])&lt;br /&gt;
* LastPass Authenticator for [https://play.google.com/store/apps/details?id=com.lastpass.authenticator Android], [https://apps.apple.com/us/app/lastpass-authenticator/id1079110004 iOS] or [https://lastpass.com/auth/ Windows]&lt;br /&gt;
* Aegis Authenticator for [https://play.google.com/store/apps/details?id=com.beemdevelopment.aegis Android (Google Play)] or [https://f-droid.org/en/packages/com.beemdevelopment.aegis/ Android (F-Droid)] ([https://getaegis.app/ Web Page])&lt;br /&gt;
* OTP Auth for [https://apps.apple.com/app/otp-auth/id659877384 iOS]&lt;br /&gt;
* (&#039;&#039;Authy for [https://play.google.com/store/apps/details?id=com.authy.authy Android], [https://apps.apple.com/us/app/authy/id494168017 iOS], [https://authy.com/download/ Mac, Windows or Linux], requires account&#039;&#039;)&lt;br /&gt;
(&#039;&#039;These are only suggestions. You can use any application compatible with the [https://tools.ietf.org/html/rfc6238 TOTP] standard.&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
[https://www.yubico.com/resources/glossary/yubico-otp/ &#039;&#039;&#039;Yubico OTP&#039;&#039;&#039;] is also supported if you want to use your Yubikey without depending on having a six-digit code displayed.&lt;br /&gt;
&lt;br /&gt;
= Token Management =&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
* Create at least two separate tokens: &#039;&#039;&#039;FIRST&#039;&#039;&#039; set up a software/hardware TOTP token. &#039;&#039;&#039;THEN&#039;&#039;&#039; create and print a &amp;quot;backup TAN list&amp;quot;. Never create the &amp;quot;backup TAN list&amp;quot; first.&lt;br /&gt;
* If you lose access to all your tokens, you will not be able to create new tokens and support will have to reset your tokens manually.&lt;br /&gt;
* The &amp;quot;backup TAN list&amp;quot; should always be created and printed in a &#039;&#039;&#039;second step&#039;&#039;&#039;. The printout should be kept in a separate place for emergencies.&lt;br /&gt;
* Please clean up your second factors as soon as you have created new tokens. Tokens that can no longer be used (e.g. because they were never initialized, or the smartphone/Yubikey was lost) and old backup TAN lists whose TANs have all been used or for which no printout exists should be deactivated and deleted.&lt;br /&gt;
* Returning users who have already activated one or more tokens must first verify their token before they can create new tokens, see section [[Registration/2FA#Returning_Users|Returning Users]].&lt;br /&gt;
* &#039;&#039;&#039;Please disable all privacy tools, ad blockers and other add-ons when registering new tokens.&#039;&#039;&#039; These tools can prevent the registration website from generating new security tokens. If the problem persists (you cannot generate the QR code or cannot confirm it by clicking CHECK), please try again with an entirely new, unmodified web browser profile.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;bwUniCluster/bwForCluster Tokens&#039;&#039;&#039; are generally managed via the &#039;&#039;&#039;Index -&amp;gt; My Tokens&#039;&#039;&#039; menu entry on the registration pages for the clusters. Here you can register, activate, deactivate and delete tokens.&lt;br /&gt;
&lt;br /&gt;
To activate the second factor, &#039;&#039;&#039;please perform the following steps:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1. &#039;&#039;&#039;Select the registration server of the cluster&#039;&#039;&#039; for which you want to create a second factor and login to it:&amp;lt;br/&amp;gt; &amp;amp;rarr; [https://login.bwidm.de/user/twofa.xhtml Registration server for &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039;, &#039;&#039;&#039;bwForCluster JUSTUS 2&#039;&#039;&#039; and &#039;&#039;&#039;bwForCluster NEMO 2&#039;&#039;&#039;] (2FA tokens are valid for all three clusters; KIT members can reuse their existing hardware and software tokens)&amp;lt;br/&amp;gt; &amp;amp;rarr; [https://bwservices.uni-heidelberg.de/user/twofa.xhtml Registration server for &#039;&#039;&#039;bwForCluster Helix&#039;&#039;&#039;]&lt;br /&gt;
[[File:BwIDM-twofa.png|center|600px|thumb|My Tokens]]&lt;br /&gt;
&lt;br /&gt;
2. &#039;&#039;&#039;Register a new &amp;quot;[[Registration/2FA#Registering_a_new_Software_Token_using_a_Mobile_APP|Smartphone Token]]&amp;quot;&#039;&#039;&#039; or if you own a [https://www.yubico.com/ Yubikey]&#039;&#039;&#039; register a new &amp;quot;[[Registration/2FA#Registering_a_new_Yubikey_OTP_Token|Yubikey Token]]&amp;quot;&#039;&#039;&#039; or &#039;&#039;&#039;&amp;quot;[[Registration/2FA#Registering_a_new_Yubikey_OATH_TOTP_Token|Yubikey OATH TOTP Token]]&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
3. &#039;&#039;&#039;Register a new &amp;quot;[[Registration/2FA#Backup_TAN_List|TAN List]]&amp;quot; (backup TAN list)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
4. Repeat step 2 for additional tokens.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Registering a new Software Token using a Mobile APP ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Please disable all privacy tools, ad blockers and other add-ons when registering new tokens.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
1. Select the [[Registration/2FA#Token_Management|registration server]] of the cluster for which you want to create a second factor and login to it.&lt;br /&gt;
&lt;br /&gt;
2. Registering a new token starts with a click on &#039;&#039;&#039;NEW SMARTPHONE TOKEN&#039;&#039;&#039;.&lt;br /&gt;
[[File:BwIDM-token.png|center|600px|thumb|Create a new Token]]&lt;br /&gt;
&lt;br /&gt;
3. A new window opens. Click &#039;&#039;&#039;Start&#039;&#039;&#039; to generate a new &#039;&#039;&#039;QR code&#039;&#039;&#039;.&lt;br /&gt;
This may take a while.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
The QR code contains a key which has to remain secret.&lt;br /&gt;
Only use the QR code to link your software token app with bwIDM/bwServices in the next step.&lt;br /&gt;
Do not save the QR code, print it out or share it with someone else.&lt;br /&gt;
|}&lt;br /&gt;
[[File:BwIDM-qr.png|center|600px|thumb|QR Code for Mobile App]]&lt;br /&gt;
&lt;br /&gt;
4. Start the software token app on your separate device and scan the QR code.&lt;br /&gt;
The exact process is a little bit different in every app, but is usually started by pressing on a button with a plus (+) sign or an icon of a QR code.&lt;br /&gt;
&lt;br /&gt;
5. Once the QR code has been loaded into your Software Token app there should be a new entry called &#039;&#039;&#039;bwIDM&#039;&#039;&#039; (bwUniCluster, JUSTUS 2 and NEMO2) or &#039;&#039;&#039;bwServices&#039;&#039;&#039; (Helix).&lt;br /&gt;
Generate a One-Time Password by tapping this entry or selecting the appropriate button/menu item.&lt;br /&gt;
You will receive a six-digit code.&lt;br /&gt;
Enter this code into the field labeled &amp;quot;Current code:&amp;quot; in your bwIDM browser window to prove that the connection has worked and then click &#039;&#039;&#039;CHECK&#039;&#039;&#039;.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
If you do not confirm the token by entering the six-digit code in the &amp;quot;Current code:&amp;quot; field, the token will &#039;&#039;&#039;NOT&#039;&#039;&#039; be initialized!&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
6. If everything worked as expected, you will be returned to the &#039;&#039;&#039;My Tokens&#039;&#039;&#039; screen and there will be a new entry for your software token.&lt;br /&gt;
[[File:BwIDM-app.png|center|400px|thumb|Success]]&lt;br /&gt;
&lt;br /&gt;
7. Repeat the process to register additional tokens.&lt;br /&gt;
Please register at least the &amp;quot;Backup TAN list&amp;quot; in addition to the hardware/software token you plan to use regularly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Registering a new Yubikey OTP Token ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Please disable all privacy tools, ad blockers and other add-ons when registering new tokens.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[https://developers.yubico.com/OTP/OTPs_Explained.html Yubikey OTP] is even easier: you don&#039;t need a device that displays a six-digit code, nor any extra software.&lt;br /&gt;
New Yubikeys are already configured to provide Yubikey OTP in slot 1.&lt;br /&gt;
If you need to configure your Yubikey, read this [[Registration/2FA/Yubikey|documentation]].&lt;br /&gt;
&lt;br /&gt;
1. Select the [[Registration/2FA#Token_Management|registration server]] of the cluster for which you want to create a second factor and login to it.&lt;br /&gt;
&lt;br /&gt;
2. If you want to use [https://www.yubico.com/resources/glossary/yubico-otp/ Yubico OTP], you can click &#039;&#039;&#039;NEW YUBIKEY TOKEN&#039;&#039;&#039; instead.&lt;br /&gt;
[[File:BwIDM-token.png|center|600px|thumb|Generate Yubikey OTP]]&lt;br /&gt;
&lt;br /&gt;
3. Yubikey OTP is configured in slot 1 on new Yubikeys, so you only need to click in the text box and then touch the metal part of your Yubikey.&lt;br /&gt;
Please refer to this [[Registration/2FA/Yubikey|documentation]] on how to configure your Yubikey.&lt;br /&gt;
[[File:BwIDM-yubikey.png|center|400px|thumb|Yubikey OTP]]&lt;br /&gt;
&lt;br /&gt;
4. If everything worked as expected, you will be returned to the &#039;&#039;&#039;My Tokens&#039;&#039;&#039; screen and there will be a new entry for your Yubikey.&lt;br /&gt;
[[File:BwIDM-yubikey2.png|center|400px|thumb|Success]]&lt;br /&gt;
&lt;br /&gt;
5. Repeat the process to register additional tokens.&lt;br /&gt;
Please register at least the &amp;quot;Backup TAN list&amp;quot; in addition to the hardware/software token you plan to use regularly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Registering a new Yubikey OATH TOTP Token ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Please disable all privacy tools, ad blockers and other add-ons when registering new tokens.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[https://developers.yubico.com/OATH/ Yubikey OATH TOTP] generates the one-time passwords on your Yubikey, so you can use different computers and phones to display these codes.&lt;br /&gt;
Please download and install [https://developers.yubico.com/OATH/YubiKey_OATH_software.html Yubico Authenticator] for desktop (or Android/iOS) first.&lt;br /&gt;
Insert your Yubikey in your computer.&lt;br /&gt;
&amp;quot;Yubikey OTP&amp;quot; (not &amp;quot;Yubikey OATH TOTP&amp;quot;) is even easier: you don&#039;t need a device that displays the six-digit code, nor any extra software (see section [[Registration/2FA#Yubikey_OTP|Yubikey OTP]]).&lt;br /&gt;
&lt;br /&gt;
1. Select the [[Registration/2FA#Token_Management|registration server]] of the cluster for which you want to create a second factor and login to it.&lt;br /&gt;
&lt;br /&gt;
2. Registering a new token starts with a click on &#039;&#039;&#039;NEW SMARTPHONE TOKEN&#039;&#039;&#039;.&lt;br /&gt;
[[File:BwIDM-token.png|center|600px|thumb|Create a new Token]]&lt;br /&gt;
&lt;br /&gt;
3. A new window opens. Click &#039;&#039;&#039;Start&#039;&#039;&#039; to generate a new &#039;&#039;&#039;QR code&#039;&#039;&#039;.&lt;br /&gt;
This may take a while.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
The QR code contains a key which has to remain secret.&lt;br /&gt;
Only use the QR code to link your software token app with bwIDM/bwServices in the next step.&lt;br /&gt;
Do not save the QR code, print it out or share it with someone else.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
4. Start Yubico Authenticator on your computer, click the three vertical dots in the upper right corner and select &#039;&#039;&#039;Scan QR code&#039;&#039;&#039;.&lt;br /&gt;
[[File:BwIDM-yubi1.png|center|600px|thumb|QR Code and Yubico Authenticator on Linux]]&lt;br /&gt;
&lt;br /&gt;
5. Yubico Authenticator automatically translates the QR code to a new entry called &#039;&#039;&#039;bwIDM&#039;&#039;&#039; or &#039;&#039;&#039;bwServices&#039;&#039;&#039; (Helix).&lt;br /&gt;
Click &#039;&#039;&#039;Add account&#039;&#039;&#039;.&lt;br /&gt;
[[File:BwIDM-yubi2.png|center|600px|thumb|Create new TOTP on Yubico Authenticator]]&lt;br /&gt;
&lt;br /&gt;
6. You will receive a six-digit code.&lt;br /&gt;
Enter this code into the field labeled &amp;quot;Current code:&amp;quot; in your bwIDM browser window to prove that the connection has worked and then click &#039;&#039;&#039;CHECK&#039;&#039;&#039;.&lt;br /&gt;
[[File:BwIDM-yubi3.png|center|600px|thumb|Verify TOTP]]&lt;br /&gt;
&lt;br /&gt;
7. If everything worked as expected, you will be returned to the &#039;&#039;&#039;My Tokens&#039;&#039;&#039; screen and there will be a new entry for your software token.&lt;br /&gt;
[[File:BwIDM-app.png|center|400px|thumb|Success]]&lt;br /&gt;
&lt;br /&gt;
8. Repeat the process to register additional tokens.&lt;br /&gt;
Please register at least the &amp;quot;Backup TAN list&amp;quot; in addition to the hardware/software token you plan to use regularly.&lt;br /&gt;
&lt;br /&gt;
== Backup TAN List ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Passwords from the &amp;quot;Backup TAN list&amp;quot; should only be used if no other token is left.&lt;br /&gt;
Please do not use the Backup TANs for regular cluster login, because you have only a limited number of TANs.&lt;br /&gt;
Each TAN can only be used once.&lt;br /&gt;
Please disable all privacy tools, ad blockers and other add-ons when registering a new Backup TAN list.&lt;br /&gt;
|}&lt;br /&gt;
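Conceptually, a backup TAN list is just a set of single-use random codes. The following Python sketch is purely an illustration of that idea (it is not the bwIDM server's actual algorithm; always use the TAN list generated by the registration server):

```python
import secrets

def make_tan_list(count: int = 10, digits: int = 6) -> list[str]:
    """Generate `count` random numeric TANs.
    Illustration only: real backup TANs come from the bwIDM registration server."""
    return [f"{secrets.randbelow(10 ** digits):0{digits}d}" for _ in range(count)]

# Print the list once, keep the hard copy offline; each TAN is single-use.
for tan in make_tan_list():
    print(tan)
```

Unlike TOTPs, these codes are not time-based, which is why the list is finite and why each TAN is invalidated after a single login.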
&lt;br /&gt;
1. Select the [[Registration/2FA#Token_Management|registration server]] of the cluster for which you want to create a second factor and login to it.&lt;br /&gt;
&lt;br /&gt;
2. Please create at least one &amp;quot;Backup TAN list&amp;quot; by clicking &#039;&#039;&#039;CREATE NEW TAN LIST&#039;&#039;&#039;.&lt;br /&gt;
[[File:BwIDM-token.png|center|600px|thumb|Generate Backup TAN list]]&lt;br /&gt;
&lt;br /&gt;
3. Click &#039;&#039;&#039;START&#039;&#039;&#039;. You will be redirected to the &#039;&#039;&#039;My Tokens&#039;&#039;&#039; screen and there will be a new entry for your backup TANs.&lt;br /&gt;
[[File:BwIDM-tan.png|center|400px|thumb|Success]]&lt;br /&gt;
&lt;br /&gt;
4. Click &#039;&#039;&#039;SHOW TANS&#039;&#039;&#039;, print the codes and keep the printout in a separate place for emergencies.&lt;br /&gt;
[[File:JUSTUS-2-2FA-backup-TAN-list.png|center|800px|thumb|Print Backup TAN List]]&lt;br /&gt;
&lt;br /&gt;
5. Repeat the process to register additional tokens.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Deactivating a Token ==&lt;br /&gt;
&lt;br /&gt;
Click &#039;&#039;&#039;Disable&#039;&#039;&#039; next to the Token entry on the &#039;&#039;&#039;My Tokens&#039;&#039;&#039; screen.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Deleting a Token ==&lt;br /&gt;
&lt;br /&gt;
After a token has been disabled, a new button labeled &#039;&#039;&#039;Delete&#039;&#039;&#039; will appear. Click it to delete the token.&lt;br /&gt;
&lt;br /&gt;
= Returning Users =&lt;br /&gt;
&lt;br /&gt;
Returning users who have already activated one or more tokens must first verify their token before they can create new tokens or deactivate/delete old ones.&lt;br /&gt;
If you no longer have valid tokens, you will not be able to create or manage tokens. &lt;br /&gt;
In this case, read the section [[Registration/2FA#Lost_Token|Lost Token]].&lt;br /&gt;
[[File:BwIDM-totp.png|center|400px|thumb|Returning users must first verify their token.]]&lt;br /&gt;
&lt;br /&gt;
= Lost Token =&lt;br /&gt;
&lt;br /&gt;
If you change your phone, please migrate your tokens first or register your new mobile app under &amp;quot;My Tokens&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;If you no longer have any valid tokens (mobile app, hardware token, Yubikey or backup TAN list, e.g. because of a lost or broken smartphone), you can no longer access the &amp;quot;My Tokens&amp;quot; section.&lt;br /&gt;
In this case you will need to contact the [https://www.bwhpc.de/supportportal ticket system].&#039;&#039;&#039;&lt;br /&gt;
Open a ticket, include your user name and the name of the bwHPC cluster, and ask for a reset of your 2FA tokens.&lt;br /&gt;
Please note that this process may take some time and also means additional work for the operators.&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=15988</id>
		<title>BwUniCluster3.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=15988"/>
		<updated>2026-04-21T10:59:26Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Compute nodes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 3.0 =&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039; is a parallel computer with distributed memory. &lt;br /&gt;
It consists of the bwUniCluster 3.0 components procured in 2024 and also includes the additional compute nodes which were procured as an extension to the bwUniCluster 2.0 in 2022.&lt;br /&gt;
 &lt;br /&gt;
Each node of the system consists of two Intel Xeon or AMD EPYC processors, local memory, local storage, network adapters and optional accelerators (NVIDIA A100 and H100, AMD Instinct MI300A). All nodes are connected via a fast InfiniBand interconnect.&lt;br /&gt;
&lt;br /&gt;
The parallel file system (Lustre) is connected to the InfiniBand switch of the compute cluster. This provides a fast and scalable parallel file &lt;br /&gt;
system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 9.4.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system act in different roles. From an end user&#039;s point of view, the relevant groups of nodes are login nodes and compute nodes. File server nodes and administrative server nodes are not accessible to users.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
The login nodes are the only nodes directly accessible by end users. These nodes are used for interactive login, file management, program development, and interactive pre- and post-processing.&lt;br /&gt;
There are two nodes dedicated to this service, but both can be reached from a single address: &amp;lt;code&amp;gt;uc3.scc.kit.edu&amp;lt;/code&amp;gt;. A DNS round-robin alias distributes login sessions to the login nodes.&lt;br /&gt;
To prevent login nodes from being used for activities that are not permitted there and that affect the user experience of other users, &#039;&#039;&#039;long-running and/or compute-intensive tasks are periodically terminated without any prior warning&#039;&#039;&#039;. Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Systems&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
bwUniCluster 3.0 comprises two parallel file systems based on Lustre.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:uc3.png|center|800px]]&lt;br /&gt;
&lt;br /&gt;
= Compute Resources =&lt;br /&gt;
&lt;br /&gt;
== Login nodes ==&lt;br /&gt;
&lt;br /&gt;
After a successful [[BwUniCluster3.0/Login|login]], users find themselves on one of the so-called login nodes. Technically, these largely correspond to a standard CPU node, i.e. users have two AMD EPYC 9454 processors with a total of 96 cores at their disposal. Login nodes are the bridgehead for accessing computing resources.&lt;br /&gt;
Data and software are organized here, computing jobs are initiated and managed, and computing resources allocated for interactive use can also be accessed from here.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Any compute intensive job running on the login nodes will be terminated without any notice.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Compute nodes ==&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are executed automatically via a batch script, or the nodes can be used interactively. Please refer to [[BwUniCluster3.0/Running_Jobs|Running Jobs]] on how to request resources.&amp;lt;br&amp;gt;&lt;br /&gt;
The following compute node types are available:&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt;CPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Standard&#039;&#039;&#039;: Two AMD EPYC 9454 processors per node with a total of 96 physical CPU cores or 192 logical cores (Hyper-Threading) per node. These nodes were procured in 2024.&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake&#039;&#039;&#039;: Two Intel Xeon Platinum 8358 processors per node with a total of 64 physical CPU cores or 128 logical cores (Hyper-Threading) per node. These nodes were procured in 2022 as an extension to bwUniCluster 2.0.&lt;br /&gt;
* &#039;&#039;&#039;High Memory&#039;&#039;&#039;: Similar to the standard nodes, but with six times the memory.&lt;br /&gt;
&amp;lt;b&amp;gt;GPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;NVIDIA GPU x4&#039;&#039;&#039;: Similar to the standard nodes, but with larger memory and four NVIDIA H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;AMD GPU x4&#039;&#039;&#039;: AMD&#039;s accelerated processing unit (APU) MI300A with 4 CPU sockets and 4 compute units which share the same high-bandwidth memory (HBM).&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake NVIDIA GPU x4&#039;&#039;&#039;: Similar to the Ice Lake nodes, but with larger memory and four NVIDIA A100 or H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;Cascade Lake NVIDIA GPU x4&#039;&#039;&#039;: Nodes with four NVIDIA A100 GPUs.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Cascade Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Login nodes&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Availability in [[BwUniCluster3.0/Running_Jobs#Queues_on_bwUniCluster_3.0| queues]]&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 272&lt;br /&gt;
| 80&lt;br /&gt;
| 5&lt;br /&gt;
| 12&lt;br /&gt;
| 1&lt;br /&gt;
| 15&lt;br /&gt;
| 19&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD Zen 4&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6248R&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96 (4x 24)&lt;br /&gt;
| 64&lt;br /&gt;
| 48&lt;br /&gt;
| 96&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 256 GiB&lt;br /&gt;
| 384 GiB&lt;br /&gt;
| 2304 GiB&lt;br /&gt;
| 768 GiB&lt;br /&gt;
| 4x 128 GiB HBM3&lt;br /&gt;
| 512 GiB&lt;br /&gt;
| 384 GiB&lt;br /&gt;
| 384 GiB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 3.84 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 7.68 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| 1.92 TB SATA SSD&lt;br /&gt;
| 7.68 TB SATA SSD&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA H100 &lt;br /&gt;
| 4x AMD Instinct MI300A&lt;br /&gt;
| 4x NVIDIA A100 / H100 &lt;br /&gt;
| 4x NVIDIA A100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 94 GiB&lt;br /&gt;
| 4x 128 GiB HBM3 (APU, shared with CPU)&lt;br /&gt;
| 80 GiB / 94 GiB&lt;br /&gt;
| 40 GiB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR200 &lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 4x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x HDR200 &lt;br /&gt;
| IB 4x EDR&lt;br /&gt;
| IB 1x NDR200&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Hardware overview and properties&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the following file systems are available:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;$HOME&#039;&#039;&#039;&amp;lt;br&amp;gt;The HOME directory is created automatically after account activation, and the environment variable $HOME holds its name. HOME is the directory where users find themselves after login.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces&#039;&#039;&#039;&amp;lt;br&amp;gt;Users can create so-called workspaces for non-permanent data with temporary lifetime.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces on flash storage&#039;&#039;&#039;&amp;lt;br&amp;gt;A further workspace file system based on flash-only storage is available for special requirements and certain users.&lt;br /&gt;
* &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039;&amp;lt;br&amp;gt;The directory $TMPDIR is only available and visible on the local node during the runtime of a compute job. It is located on fast SSD storage devices.&lt;br /&gt;
* &#039;&#039;&#039;BeeOND&#039;&#039;&#039; (BeeGFS On-Demand)&amp;lt;br&amp;gt;On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* &#039;&#039;&#039;LSDF Online Storage&#039;&#039;&#039;&amp;lt;br&amp;gt;On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. On the login nodes, LSDF is automatically mounted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Which file system to use?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data, such as software or important results, should be stored in $HOME, but capacity restrictions (quotas) apply.&lt;br /&gt;
If you accidentally delete data in $HOME, there is a chance that we can restore it from backup.&lt;br /&gt;
Permanent data which is not needed for months or which exceeds the capacity restrictions should be moved to the LSDF Online Storage or to the archive and deleted from the file systems. Temporary data which is only needed on a single node and which does not exceed the disk space shown in Table 1 above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. during AI training,&lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used by many nodes&lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on the&lt;br /&gt;
parallel on-demand file system BeeOND. Temporary data which can be recomputed, or which is the&lt;br /&gt;
result of one job and input for another job, should be stored in workspaces. The lifetime&lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace, which can be&lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check: [[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details|File System Details]]&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The $HOME directories of bwUniCluster 3.0 users are located on the parallel file system Lustre.&lt;br /&gt;
You have access to your $HOME directory from all nodes of UC3. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$HOME|Detailed information on $HOME]]&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On UC3 workspaces should be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre, which is especially designed for parallel access and high throughput to large&lt;br /&gt;
files. It provides data transfer rates of up to 40 GB/s for both writing and reading when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On UC3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces|Detailed information on Workspaces]]&lt;br /&gt;
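Workspaces are managed with the HPC workspace command-line tools; the following is a minimal sketch using the command names common on bwHPC clusters (exact options and limits may differ, see the detailed documentation):&lt;br /&gt;

```shell
# Create a workspace named "simdata" with a lifetime of 30 days
ws_allocate simdata 30

# List your workspaces and their remaining lifetimes
ws_list

# Print the path of an existing workspace (useful in job scripts)
ws_find simdata

# Extend the lifetime before the workspace expires
ws_extend simdata 30

# Release the workspace once the data is no longer needed
ws_release simdata
```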
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system based on flash-only storage is available for special requirements and certain users.&lt;br /&gt;
If possible, this file system should be used from the Ice Lake nodes of bwUniCluster 3.0 (queue &#039;&#039;cpu_il&#039;&#039;). &lt;br /&gt;
It provides high IOPS rates and better performance for small files. The quota limits are lower than on the &lt;br /&gt;
normal workspace file system.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces_on_flash_storage|Detailed information on Workspaces on flash storage]]&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. &lt;br /&gt;
This directory should be used for temporary files being accessed from the local node. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. &lt;br /&gt;
Because the local SSD storage devices are extremely fast, performance with small files is much better than on the parallel file systems. &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$TMPDIR|Detailed information on $TMPDIR]]&lt;br /&gt;
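A common pattern is to stage data to $TMPDIR at job start and copy results back before the job ends; a sketch of a job script (program name, paths and resource requests are placeholders):&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=02:00:00

# Stage input data onto the fast local SSD of the allocated node
cp -r "$HOME/input_data" "$TMPDIR/"

# Run the computation reading from and writing to the local SSD
cd "$TMPDIR"
./my_program input_data -o results

# Copy results back before the job ends; $TMPDIR is purged afterwards
cp -r "$TMPDIR/results" "$HOME/"
```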
&lt;br /&gt;
== BeeOND (BeeGFS On-Demand) ==&lt;br /&gt;
&lt;br /&gt;
Users can request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged when the job completes.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#BeeOND_(BeeGFS_On-Demand)|Detailed information on BeeOND]]&lt;br /&gt;
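BeeOND is typically requested through a batch-job option; the following sketch assumes a Slurm constraint named BEEOND and a job-specific mount point, both of which should be verified against the detailed documentation:&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --constraint=BEEOND   # request a private BeeOND file system (constraint name assumed)

# The on-demand file system is mounted under a job-specific path,
# e.g. /mnt/odfs/$SLURM_JOB_ID (assumed); it is purged when the job completes
BEEOND_DIR=/mnt/odfs/$SLURM_JOB_ID

# Stage shared input once, then let all nodes of the job read from BeeOND
cp -r "$HOME/shared_input" "$BEEOND_DIR/"
srun ./my_parallel_program "$BEEOND_DIR/shared_input"
```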
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols and is only available for certain users.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#LSDF_Online_Storage|Detailed information on LSDF Online Storage]]&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=15987</id>
		<title>BwUniCluster3.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=15987"/>
		<updated>2026-04-21T10:58:43Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Compute nodes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 3.0 =&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039; is a parallel computer with distributed memory. &lt;br /&gt;
It consists of the bwUniCluster 3.0 components procured in 2024 and also includes the additional compute nodes which were procured as an extension to the bwUniCluster 2.0 in 2022.&lt;br /&gt;
 &lt;br /&gt;
Each node of the system consists of two Intel Xeon or AMD EPYC processors, local memory, local storage, network adapters and optional accelerators (NVIDIA A100 and H100, AMD Instinct MI300A). All nodes are connected via a fast InfiniBand interconnect.&lt;br /&gt;
&lt;br /&gt;
The parallel file system (Lustre) is connected to the InfiniBand switch of the compute cluster. This provides a fast and scalable parallel file &lt;br /&gt;
system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 9.4.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system act in different roles. From an end user&#039;s point of view, the relevant groups of nodes are login nodes and compute nodes. File server nodes and administrative server nodes are not accessible to users.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
The login nodes are the only nodes directly accessible by end users. These nodes are used for interactive login, file management, program development, and interactive pre- and post-processing.&lt;br /&gt;
There are two nodes dedicated to this service, but both can be reached through a single address: &amp;lt;code&amp;gt;uc3.scc.kit.edu&amp;lt;/code&amp;gt;. A DNS round-robin alias distributes login sessions across the login nodes.&lt;br /&gt;
To prevent login nodes from being used for activities that are not permitted there and that affect the user experience of other users, &#039;&#039;&#039;long-running and/or compute-intensive tasks are periodically terminated without any prior warning&#039;&#039;&#039;. Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
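Logging in goes through this single alias; a minimal example (the username is a placeholder in the usual format):&lt;br /&gt;

```shell
# Log in via the DNS round-robin alias; one of the login nodes is selected automatically
ssh xy_ab1234@uc3.scc.kit.edu
```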
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
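Submitting work to the compute nodes follows the usual Slurm pattern; a minimal sketch (partition name, resources and script name are examples):&lt;br /&gt;

```shell
# Submit a job script to a queue (Slurm partition), e.g. the standard CPU queue
sbatch --partition=cpu --ntasks=4 --time=01:00:00 job.sh

# Check the state of your own jobs
squeue --me
```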
&lt;br /&gt;
&#039;&#039;&#039;File Systems&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
bwUniCluster 3.0 comprises two parallel file systems based on Lustre.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:uc3.png|center|800px|Architecture of bwUniCluster 3.0]]&lt;br /&gt;
&lt;br /&gt;
= Compute Resources =&lt;br /&gt;
&lt;br /&gt;
== Login nodes ==&lt;br /&gt;
&lt;br /&gt;
After a successful [[BwUniCluster3.0/Login|login]], users find themselves on one of the so-called login nodes. Technically, these largely correspond to a standard CPU node, i.e. users have two AMD EPYC 9454 processors with a total of 96 cores at their disposal. Login nodes are the bridgehead for accessing computing resources.&lt;br /&gt;
Data and software are organized here, computing jobs are initiated and managed, and computing resources allocated for interactive use can also be accessed from here.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Any compute-intensive job running on the login nodes will be terminated without notice.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Compute nodes ==&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. Please refer to [[BwUniCluster3.0/Running_Jobs|Running Jobs]] on how to request resources.&amp;lt;br&amp;gt;&lt;br /&gt;
The following compute node types are available:&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt;CPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Standard&#039;&#039;&#039;: Two AMD EPYC 9454 processors per node with a total of 96 physical CPU cores or 192 logical cores (Hyper-Threading) per node. The nodes have been procured in 2024.&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake&#039;&#039;&#039;: Two Intel Xeon Platinum 8358 processors per node with a total of 64 physical CPU cores or 128 logical cores (Hyper-Threading) per node. The nodes have been procured in 2022 as an extension to bwUniCluster 2.0.&lt;br /&gt;
* &#039;&#039;&#039;High Memory&#039;&#039;&#039;: Similar to the standard nodes, but with six times the memory.&lt;br /&gt;
&amp;lt;b&amp;gt;GPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;NVIDIA GPU x4&#039;&#039;&#039;: Similar to the standard nodes, but with larger memory and four NVIDIA H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;AMD GPU x4&#039;&#039;&#039;: Four of AMD&#039;s MI300A accelerated processing units (APUs) per node; each APU combines CPU cores and GPU compute units which share the same high-bandwidth memory (HBM).&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake NVIDIA GPU x4&#039;&#039;&#039;: Similar to the Ice Lake nodes, but with larger memory and four NVIDIA A100 or H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;Cascade Lake NVIDIA GPU x4&#039;&#039;&#039;: Nodes with four NVIDIA A100 GPUs.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Cascade Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Login nodes&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Availability in [[BwUniCluster3.0/Running_Jobs#Queues_on_bwUniCluster_3.0| queues]]&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 272&lt;br /&gt;
| 80&lt;br /&gt;
| 4&lt;br /&gt;
| 12&lt;br /&gt;
| 1&lt;br /&gt;
| 15&lt;br /&gt;
| 19&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD Zen 4&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6248R&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96 (4x 24)&lt;br /&gt;
| 64&lt;br /&gt;
| 48&lt;br /&gt;
| 96&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 256 GiB&lt;br /&gt;
| 384 GiB&lt;br /&gt;
| 2304 GiB&lt;br /&gt;
| 768 GiB&lt;br /&gt;
| 4x 128 GiB HBM3&lt;br /&gt;
| 512 GiB&lt;br /&gt;
| 384 GiB&lt;br /&gt;
| 384 GiB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 3.84 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 7.68 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| 1.92 TB SATA SSD&lt;br /&gt;
| 7.68 TB SATA SSD&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA H100 &lt;br /&gt;
| 4x AMD Instinct MI300A&lt;br /&gt;
| 4x NVIDIA A100 / H100 &lt;br /&gt;
| 4x NVIDIA A100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 94 GiB&lt;br /&gt;
| 4x 128 GiB HBM3 (APU, shared with CPU)&lt;br /&gt;
| 80 GiB / 94 GiB&lt;br /&gt;
| 40 GiB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR200 &lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 4x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x HDR200 &lt;br /&gt;
| IB 4x EDR&lt;br /&gt;
| IB 1x NDR200&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Hardware overview and properties&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the following file systems are available:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;$HOME&#039;&#039;&#039;&amp;lt;br&amp;gt;The HOME directory is created automatically after account activation, and the environment variable $HOME holds its name. HOME is the directory where users find themselves after login.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces&#039;&#039;&#039;&amp;lt;br&amp;gt;Users can create so-called workspaces for non-permanent data with temporary lifetime.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces on flash storage&#039;&#039;&#039;&amp;lt;br&amp;gt;A further workspace file system based on flash-only storage is available for special requirements and certain users.&lt;br /&gt;
* &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039;&amp;lt;br&amp;gt;The directory $TMPDIR is only available and visible on the local node during the runtime of a compute job. It is located on fast SSD storage devices.&lt;br /&gt;
* &#039;&#039;&#039;BeeOND&#039;&#039;&#039; (BeeGFS On-Demand)&amp;lt;br&amp;gt;On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* &#039;&#039;&#039;LSDF Online Storage&#039;&#039;&#039;&amp;lt;br&amp;gt;On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. On the login nodes, LSDF is automatically mounted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Which file system to use?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data, such as software or important results, should be stored in $HOME, but capacity restrictions (quotas) apply.&lt;br /&gt;
If you accidentally delete data in $HOME, there is a chance that we can restore it from backup.&lt;br /&gt;
Permanent data which is not needed for months or which exceeds the capacity restrictions should be moved to the LSDF Online Storage or to the archive and deleted from the file systems. Temporary data which is only needed on a single node and which does not exceed the disk space shown in Table 1 above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. during AI training,&lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used by many nodes&lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on the&lt;br /&gt;
parallel on-demand file system BeeOND. Temporary data which can be recomputed, or which is the&lt;br /&gt;
result of one job and input for another job, should be stored in workspaces. The lifetime&lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace, which can be&lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check: [[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details|File System Details]]&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The $HOME directories of bwUniCluster 3.0 users are located on the parallel file system Lustre.&lt;br /&gt;
You have access to your $HOME directory from all nodes of UC3. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$HOME|Detailed information on $HOME]]&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On UC3 workspaces should be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre, which is especially designed for parallel access and high throughput to large&lt;br /&gt;
files. It provides data transfer rates of up to 40 GB/s for both writing and reading when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On UC3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces|Detailed information on Workspaces]]&lt;br /&gt;
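Workspaces are managed with the HPC workspace command-line tools; the following is a minimal sketch using the command names common on bwHPC clusters (exact options and limits may differ, see the detailed documentation):&lt;br /&gt;

```shell
# Create a workspace named "simdata" with a lifetime of 30 days
ws_allocate simdata 30

# List your workspaces and their remaining lifetimes
ws_list

# Print the path of an existing workspace (useful in job scripts)
ws_find simdata

# Extend the lifetime before the workspace expires
ws_extend simdata 30

# Release the workspace once the data is no longer needed
ws_release simdata
```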
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system based on flash-only storage is available for special requirements and certain users.&lt;br /&gt;
If possible, this file system should be used from the Ice Lake nodes of bwUniCluster 3.0 (queue &#039;&#039;cpu_il&#039;&#039;). &lt;br /&gt;
It provides high IOPS rates and better performance for small files. The quota limits are lower than on the &lt;br /&gt;
normal workspace file system.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces_on_flash_storage|Detailed information on Workspaces on flash storage]]&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. &lt;br /&gt;
This directory should be used for temporary files being accessed from the local node. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. &lt;br /&gt;
Because the local SSD storage devices are extremely fast, performance with small files is much better than on the parallel file systems. &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$TMPDIR|Detailed information on $TMPDIR]]&lt;br /&gt;
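A common pattern is to stage data to $TMPDIR at job start and copy results back before the job ends; a sketch of a job script (program name, paths and resource requests are placeholders):&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=02:00:00

# Stage input data onto the fast local SSD of the allocated node
cp -r "$HOME/input_data" "$TMPDIR/"

# Run the computation reading from and writing to the local SSD
cd "$TMPDIR"
./my_program input_data -o results

# Copy results back before the job ends; $TMPDIR is purged afterwards
cp -r "$TMPDIR/results" "$HOME/"
```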
&lt;br /&gt;
== BeeOND (BeeGFS On-Demand) ==&lt;br /&gt;
&lt;br /&gt;
Users can request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged when the job completes.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#BeeOND_(BeeGFS_On-Demand)|Detailed information on BeeOND]]&lt;br /&gt;
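BeeOND is typically requested through a batch-job option; the following sketch assumes a Slurm constraint named BEEOND and a job-specific mount point, both of which should be verified against the detailed documentation:&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --constraint=BEEOND   # request a private BeeOND file system (constraint name assumed)

# The on-demand file system is mounted under a job-specific path,
# e.g. /mnt/odfs/$SLURM_JOB_ID (assumed); it is purged when the job completes
BEEOND_DIR=/mnt/odfs/$SLURM_JOB_ID

# Stage shared input once, then let all nodes of the job read from BeeOND
cp -r "$HOME/shared_input" "$BEEOND_DIR/"
srun ./my_parallel_program "$BEEOND_DIR/shared_input"
```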
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols and is only available for certain users.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#LSDF_Online_Storage|Detailed information on LSDF Online Storage]]&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15986</id>
		<title>Development/VS Code</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15986"/>
		<updated>2026-04-21T09:32:28Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
&lt;br /&gt;
[[File:vscode.png|thumb|Visual Studio Code, Source: https://code.visualstudio.com/|450px]]&lt;br /&gt;
&lt;br /&gt;
[https://github.com/Microsoft/vscode Visual Studio Code] (VS Code) is an open-source code editor from Microsoft. According to a [https://survey.stackoverflow.co/2024/technology#1-integrated-development-environment Stack Overflow survey], it has become one of the most popular IDEs. The functionality of VS Code can easily be extended by installing extensions, which add almost arbitrary &#039;&#039;&#039;language support&#039;&#039;&#039;, &#039;&#039;&#039;debugging&#039;&#039;&#039; or &#039;&#039;&#039;remote development&#039;&#039;&#039; capabilities. You can install VS Code locally and use it for remote development.&lt;br /&gt;
&lt;br /&gt;
== Visual Studio Code  ==&lt;br /&gt;
Visual Studio Code (VS Code) is a lightweight, extensible code editor from Microsoft that supports many programming languages and features such as debugging and integrated Git. It offers a rich extension marketplace for adding language support, themes, and tools tailored to your workflow. VS Code runs on Windows, macOS, and Linux and is popular for its speed, customizability, and strong community ecosystem.&lt;br /&gt;
&lt;br /&gt;
=== Using AI agents ===&lt;br /&gt;
When deploying AI agents on the bwHPC clusters, users must exercise extreme caution and maintain full oversight of the agent&#039;s activities. You are fully responsible for all actions initiated by an agent, including any security breaches or system disruptions it may cause. It is mandatory to strictly monitor resource usage on login nodes, as these are shared resources intended only for lightweight tasks. Any agent found consuming excessive CPU or memory on a login node will be terminated immediately to ensure stability for other users. To comply with usage policies, all AI-driven workloads which generate heavy load must be submitted to the Slurm batch queues rather than running directly on the login nodes.&lt;br /&gt;
&lt;br /&gt;
=== Extension: Remote-SSH ===&lt;br /&gt;
&lt;br /&gt;
To remotely develop and debug code at HPC facilities, you can use the [https://code.visualstudio.com/docs/remote/ssh &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; extension]. The extension connects your locally installed VS Code to remote servers. In contrast to using graphical IDEs within a remote desktop session (RDP, VNC), there are no drawbacks such as laggy reactions to your input or blurry font rendering.&lt;br /&gt;
&lt;br /&gt;
==== Installation and Configuration ====&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-extensions-button.png|vscode-extensions-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
To install the Remote - SSH extension, click the Extensions button in the left side bar and enter “remote ssh” in the search field. Choose &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; from the list that appears and click &#039;&#039;&#039;Install&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
To configure remote connections, open the Remote Explorer. On Linux systems, the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; is evaluated automatically, and the targets within this file already appear in the left side bar.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-add.png|vscode-remoteexplorer-add.png|350px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If no remote SSH targets are defined within this file, you can easily add one by clicking on the + symbol. Make sure that “SSH Targets” is active in the drop-down menu of the Remote Explorer. Enter the connection details &amp;lt;code&amp;gt;&amp;amp;lt;user&amp;amp;gt;@&amp;amp;lt;server&amp;amp;gt;&amp;lt;/code&amp;gt;. You will be asked whether the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; should be modified or whether another config file should be used or created.&lt;br /&gt;
&lt;br /&gt;
A minimal entry within &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; that makes a remote target appear there could look like this:&lt;br /&gt;
&lt;br /&gt;
 $ cat ~/.ssh/config&lt;br /&gt;
 Host uc3.scc.kit.edu&lt;br /&gt;
   HostName uc3.scc.kit.edu&lt;br /&gt;
   User xy_ab1234&lt;br /&gt;
&lt;br /&gt;
==== Connect to Login Nodes ====&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to connect to a remote SSH target, open the Remote-Explorer. Right-click a target and connect in the current or a new window. TOTP and password can be entered in the corresponding input fields that open.&lt;br /&gt;
&lt;br /&gt;
You are now logged in on the remote server. As usual, you can open a project directory with the standard key binding Ctrl+k Ctrl+o. You can now edit and debug code.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention&#039;&#039;&#039;: Please remember that you are running and debugging the code on a login node. Do not perform resource-intensive tasks. Furthermore, no GPU resources are available to you.&lt;br /&gt;
&lt;br /&gt;
Extensions that are installed locally are only usable on your local machine and are not automatically installed on the remote side. However, as soon as you open the Extensions view during a remote session, VS Code offers to install your local extensions remotely.&lt;br /&gt;
&lt;br /&gt;
==== Disconnect from Login Nodes ====&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-indicator.png|images/vscode-remoteexplorer-indicator.png|200px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to end your remote session, click the green box in the lower left corner. In the input box that opens, select the “Close Remote Connection” option. If you simply close your VS Code window, some server-side components of VS Code will continue to run remotely.&lt;br /&gt;
&lt;br /&gt;
=== Access to Compute Nodes ===&lt;br /&gt;
&lt;br /&gt;
The workflow described above does not allow debugging on compute nodes that have been requested via an interactive Slurm job, for example. Debugging GPU codes is therefore also not possible, since this kind of resource is only accessible within Slurm jobs.&lt;br /&gt;
We strongly discourage using the Code Tunnel application, as it violates our access policies. In this scenario, an application running on the compute node connects to a Microsoft or GitHub server. The locally running VS Code then connects to the compute nodes via these external servers, thereby bypassing the login nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
Access the compute nodes via VS Code and the Remote-SSH plugin is only possible, if you start a temporarily running SSH service on the compute node which listens to an unprivileged port. By tunneling this port to your local computer, you can connect VS code to it.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Code-Server ==&lt;br /&gt;
&lt;br /&gt;
The application [https://github.com/cdr/code-server code-server] allows you to run the server part of VS Code on any machine; it can then be accessed from a web browser on your local computer. This enables, for example, development and debugging on compute nodes.&lt;br /&gt;
Code-server runs a web server listening on an unprivileged port. In order to connect your web browser to the remotely running code-server, you have to forward this port via an SSH tunnel.&lt;br /&gt;
&lt;br /&gt;
[[File:code-server.png|thumb|code-server.png|VS Code in web browser: code-server, Source: https://github.com/cdr/code-server&amp;quot;&amp;gt;https://github.com/cdr/code-server|400px]]&lt;br /&gt;
&lt;br /&gt;
=== Install Code-Server ===&lt;br /&gt;
&lt;br /&gt;
Code-server is pre-installed on bwUniCluster and accessible via an Lmod module:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;module load devel/code-server&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
On clusters with no code-server module, the application can easily be installed with the description available on the official [https://github.com/coder/code-server GitHub page].&lt;br /&gt;
&lt;br /&gt;
=== Start Code-Server ===&lt;br /&gt;
&lt;br /&gt;
Code-server can be run on either login nodes or compute nodes. In the example shown, an interactive job is started on a GPU partition to run code-server there.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ salloc -p accelerated --gres=gpu:4 --time=30:00 # Start interactive job with 4 GPUs&lt;br /&gt;
$ module load devel/code-server                   # Load code-server module&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
When code-server is started, it opens a web server listening on a certain port. The user has to &#039;&#039;&#039;specify the port&#039;&#039;&#039;. It can be chosen freely in the unprivileged range (above 1024). If a port is already assigned, e.g. because several users choose the same port, another port must be chosen.&lt;br /&gt;
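The port selection described above can be sketched as a short shell snippet. This is only an illustration, not part of the code-server module; the check with &amp;lt;code&amp;gt;ss&amp;lt;/code&amp;gt; assumes the iproute2 tools are installed, which is typical but not guaranteed on every cluster.

```shell
# Pick a random port in the unprivileged range (1024-11022)
PORT=$(( ( RANDOM % 9999 ) + 1024 ))

# If the port is already in use, try another one
# (assumes the ss tool is available; skip this loop otherwise)
while ss -tln 2>/dev/null | grep -q ":${PORT} "; do
    PORT=$(( ( RANDOM % 9999 ) + 1024 ))
done

echo "Using port ${PORT}"
```

The chosen value can then be passed to code-server via &amp;lt;code&amp;gt;--bind-addr 0.0.0.0:${PORT}&amp;lt;/code&amp;gt;.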
&lt;br /&gt;
By starting code-server, you are running a web server that can be accessed by anyone logged in to the cluster. To prevent other people from gaining access to your account and data, this web server is &#039;&#039;&#039;password protected&#039;&#039;&#039;. If no variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; is defined, the password in the default config file &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; is used. If you want to define your own password, you can either change it in the config file or export the variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ PASSWORD=&amp;lt;mySecret&amp;gt; \&lt;br /&gt;
    code-server \&lt;br /&gt;
      --bind-addr 0.0.0.0:8081 \&lt;br /&gt;
      --auth password  # Start code-server on port 8081&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;background:#FFCCCC; width:100%;&amp;quot;&lt;br /&gt;
| &#039;&#039;&#039;Security implications&#039;&#039;&#039;&lt;br /&gt;
Please note that by starting &amp;lt;code&amp;gt;code-server&amp;lt;/code&amp;gt; you are running a web server that can be accessed by everyone logged in on the cluster.&amp;lt;br&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;If password protection is disabled, anybody can access your account and your data.&#039;&#039;&#039;&lt;br /&gt;
* Choose a &#039;&#039;&#039;secure password&#039;&#039;&#039;!&lt;br /&gt;
* Do &#039;&#039;&#039;NOT&#039;&#039;&#039; use &amp;lt;code&amp;gt;code-server --link&amp;lt;/code&amp;gt;!&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Connect to code-server ===&lt;br /&gt;
[[File:code-server-hk.png|thumb|Code-server running on GPU node.|400px]]&lt;br /&gt;
&lt;br /&gt;
As soon as code-server is running, it can be accessed in the web browser. In order to establish the connection, a SSH tunnel from your local computer to the remote server has to be created via:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ ssh -L 8081:&amp;lt;computeNodeID&amp;gt;:8081 &amp;lt;userID&amp;gt;@uc3.scc.kit.edu&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
You need to enter the &amp;lt;code&amp;gt;computeNodeID&amp;lt;/code&amp;gt; of the node on which the interactive Slurm job is running. If you have started code-server on a login node, just enter &amp;lt;code&amp;gt;localhost&amp;lt;/code&amp;gt;. Now you can open http://127.0.0.1:8081 in your web browser. You may have to allow your browser to open an insecure (non-HTTPS) site. The login page looks as follows:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:code-server-login.png|Code-server login page.|300px]]&lt;br /&gt;
&lt;br /&gt;
Enter the password from &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; or from the &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; variable. After clicking the “Submit” button, the familiar VS Code interface will open in your browser.&lt;br /&gt;
&lt;br /&gt;
=== End code-server session ===&lt;br /&gt;
&lt;br /&gt;
If you want to temporarily log out from your code-server session, open the “Application Menu” in the left side bar and click “Log out”. To &#039;&#039;&#039;terminate&#039;&#039;&#039; the code-server session, cancel it in the interactive Slurm job by pressing Ctrl+C.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Connect to Remote Jupyter Kernel ==&lt;br /&gt;
To work with your Python scripts and notebooks within VS Code while using the resources of a compute node, you can create a batch job that launches JupyterLab and connect to it via VS Code. To do so, please follow the instructions below. Any parts of the scripts that might need adjustment are marked with the keyword &amp;quot;@param&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== Simple Use Case ===&lt;br /&gt;
The most basic steps are to set a password for JupyterLab, start a job which runs JupyterLab, get the connection details from the output log and connect to it locally. The following instructions explain these steps and provide an additional script that replaces the manual step of looking into the output file.&lt;br /&gt;
&lt;br /&gt;
# Load a python module and set a password on the cluster for JupyterLab:&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
    module load devel/miniforge&lt;br /&gt;
    jupyter notebook --generate-config&lt;br /&gt;
    jupyter notebook password&lt;br /&gt;
  &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Define a batch script to start a JupyterLab Job. Please adjust the first part according to your needs and your specific cluster.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;~/jupyterlab.slurm&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
#SBATCH --partition=cpu-single&lt;br /&gt;
#SBATCH --job-name=jupyterlab&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --cpus-per-task 1&lt;br /&gt;
#SBATCH --mail-user=my_email_address  # @param: replace with your email address&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
# @param: change this to your preferred python or conda module&lt;br /&gt;
module load devel/miniforge&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Pick a random port in the unprivileged range (1024-11022)&lt;br /&gt;
PORT=$(( ( RANDOM % 9999 ) + 1024 ))&lt;br /&gt;
&lt;br /&gt;
# Determine the node name and print the connection details before starting&lt;br /&gt;
# JupyterLab, because &amp;quot;jupyter lab&amp;quot; blocks until the job ends&lt;br /&gt;
HOSTID=$(hostname -s)&lt;br /&gt;
echo &amp;quot;Connect&amp;quot;&lt;br /&gt;
echo &amp;quot;ssh -N -L ${PORT}:${HOSTID}:${PORT} ${USER}@${hostAddress}&amp;quot;&lt;br /&gt;
echo &amp;quot;Job ${SLURM_JOB_ID} running on host ${HOSTID}.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
jupyter lab --no-browser --ip=0.0.0.0 --port=${PORT}&lt;br /&gt;
&lt;br /&gt;
returned_code=$?&lt;br /&gt;
echo &amp;quot;&amp;gt; Script completed with exit code ${returned_code}&amp;quot;&lt;br /&gt;
exit ${returned_code}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Run a wrapper script to execute the batch script and extract needed information from the slurm output file. You could save it together with other utility scripts in a &amp;quot;bin&amp;quot; directory in your home folder.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;./bin/run_jupyterlab_simple.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
# Define parameters&lt;br /&gt;
jobscript=~/jupyterlab.slurm&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch $jobscript | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Outfile name&lt;br /&gt;
slurm_out=slurm-${job_id}.out&lt;br /&gt;
&lt;br /&gt;
# Wait for the output file to appear&lt;br /&gt;
while [ ! -f &amp;quot;$slurm_out&amp;quot; ]; do&lt;br /&gt;
    sleep 2&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until the JupyterLab URL is written to the output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from output. The pattern assumes a node name with a length of 6 characters and a port with a length of 3, 4 or 5 numbers.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then &lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect with the JupyterLab kernel, please enter the following into your local commandline: &amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;; &lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would mean ending the local connection to the kernel.&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Afterwards, you can use the URL&amp;quot;&lt;br /&gt;
    echo &amp;quot;  http://127.0.0.1:${port}/lab &amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;to:&amp;quot;&lt;br /&gt;
    echo &amp;quot;- use the kernel in VSCode (&#039;Existing Jupyter Server...&#039;, enter URL, enter password, confirm &#039;127.0.0.1&#039;, choose kernel) or &amp;quot;&lt;br /&gt;
    echo &amp;quot;- open JupyterLab in your browser with the URL&amp;quot;&lt;br /&gt;
else&lt;br /&gt;
    echo &amp;quot;The needed information couldn&#039;t be found in the slurm output. Please contact your support unit if you need help with fixing this problem.&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
# rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Follow the instructions on the commandline to connect to the Jupyter kernel from your local machine or the Helix login node. More detailed instructions can be found below. &lt;br /&gt;
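The extraction step inside the wrapper script relies on bash regular-expression matching. As a standalone sketch with a made-up example URL (the node name &amp;quot;cn0001&amp;quot; and port 8888 are invented values, not real cluster details):

```shell
# Sketch of the wrapper's URL parsing; the URL below is a made-up example
url="http://cn0001:8888/lab?token=abc"
# Pattern assumes a 6-character node name and a 3-5 digit port, as in the wrapper
url_pattern="http://([a-z0-9]{6}):([0-9]{3,5})/lab"
if [[ $url =~ $url_pattern ]]; then
    hostID=${BASH_REMATCH[1]}
    port=${BASH_REMATCH[2]}
fi
echo "host=${hostID} port=${port}"   # prints: host=cn0001 port=8888
```

If your cluster uses node names of a different length, the &amp;quot;{6}&amp;quot; quantifier in the pattern must be adjusted accordingly.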
&lt;br /&gt;
==== Connect to a running job ====&lt;br /&gt;
&lt;br /&gt;
The job runs on a specific compute node and port. With this information, you can create an SSH connection to it. But first, you need to decide how you want to work with your Python code. The options are: &lt;br /&gt;
&lt;br /&gt;
# The code is placed locally on your computer. &lt;br /&gt;
# The code is placed on the cluster and you&#039;ve mounted the folder locally. (= The files on the cluster are accessible from within your local VS Code)&lt;br /&gt;
# The code is placed on the cluster and you work on the cluster via a remote connection in VS Code. &lt;br /&gt;
&lt;br /&gt;
Depending on the use case, you need to execute the ssh command in a different place: &lt;br /&gt;
&lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer and connect to the cluster.&lt;br /&gt;
&lt;br /&gt;
Then open a terminal and execute the ssh command, which is given in the commandline output of the wrapper script. If the terminal isn&#039;t already open, go to menu item &amp;quot;Terminal&amp;quot; at the top of the window and choose &amp;quot;New Terminal&amp;quot; (or &amp;quot;new -&amp;gt; command prompt&amp;quot; on Windows). &lt;br /&gt;
It is normal that the command doesn&#039;t end after you&#039;ve put in your credentials. Leave the terminal open and go on with the next step. &lt;br /&gt;
&lt;br /&gt;
To use the Jupyter kernel that is running on the compute node, you need to connect to this kernel. This is similar to connecting any other kernel: &lt;br /&gt;
&lt;br /&gt;
# Open your code file.&lt;br /&gt;
# Click &amp;quot;Select Kernel&amp;quot; in the upper right corner. &lt;br /&gt;
# Choose &amp;quot;Existing Jupyter Server...&amp;quot;.&lt;br /&gt;
# Enter the URL that was given by the wrapper script. &lt;br /&gt;
# Enter your JupyterLab password that you set in the first step of these instructions.&lt;br /&gt;
# Confirm the prefilled value &amp;quot;127.0.0.1&amp;quot; by pressing Enter.&lt;br /&gt;
# Choose one of the virtual environments that you&#039;ve created on the cluster. You should see all python environments. To see the conda environments as well, you need to [[Helix/bwVisu/JupyterLab#Python_version | register them as ipykernel]] first. &lt;br /&gt;
&lt;br /&gt;
=== Complex Use Case ===&lt;br /&gt;
If you have different use cases for JupyterLab, you can use a more flexible wrapper script, for example: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;./bin/run_jupyterlab.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Starts a jupyter kernel on a node and provides information on how to connect to it locally.&lt;br /&gt;
# If you have only one use case and therefore need only one combination of slurm settings for your jupyter jobs, then you can use the simpler script.&lt;br /&gt;
# This script supports explorative analyses by allowing to overwrite parameters via commandline.&lt;br /&gt;
# Different job configurations can be defined in advance and then used with a given short name (cpu, gpu,...).&lt;br /&gt;
&lt;br /&gt;
programname=$0&lt;br /&gt;
function help {&lt;br /&gt;
    # Print usage information&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Starts a jupyterlab kernel&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;usage example: $programname --param_set cpu&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --param_set string   name of the parameter set&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (examples: cpu, gpu)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --jobscript string   optional, path of batch script&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: ~/jupyterlab.slurm)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --slurm_out string   optional, name of slurm output file&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: slurm-&amp;lt;job_id&amp;gt;.out)&amp;quot;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# These parameters are set later in the script. Providing them on the command line overwrites the values set in the script.&lt;br /&gt;
jobscript=None&lt;br /&gt;
slurm_out=None&lt;br /&gt;
&lt;br /&gt;
# Process parameters&lt;br /&gt;
while [ $# -gt 0 ]; do&lt;br /&gt;
    if [[ $1 == &amp;quot;--help&amp;quot; ]]; then&lt;br /&gt;
        help&lt;br /&gt;
        exit 0&lt;br /&gt;
    # when given -p as parameter, use its value for the variable param_set&lt;br /&gt;
    elif [[ $1 == &amp;quot;-p&amp;quot; ]]; then&lt;br /&gt;
        param_set=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    elif [[ $1 == &amp;quot;--&amp;quot;* ]]; then&lt;br /&gt;
        v=&amp;quot;${1/--/}&amp;quot;&lt;br /&gt;
        declare &amp;quot;$v&amp;quot;=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    fi&lt;br /&gt;
    shift&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
function define_param_set(){&lt;br /&gt;
    # Define the available parameter sets for sbatch&lt;br /&gt;
    # Define different sets&lt;br /&gt;
    cpu=(--partition=cpu-single --mem=2gb)&lt;br /&gt;
    gpu=(--partition=gpu-single --mem=3gb --gres=gpu:1)&lt;br /&gt;
&lt;br /&gt;
    param_set=${1}&lt;br /&gt;
    # Indirect expansion: ${!param_set} looks up the array whose name is stored in param_set&lt;br /&gt;
    param_set=$param_set[@]&lt;br /&gt;
    param_set=(&amp;quot;${!param_set}&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
    # Add params that are the same for all sets&lt;br /&gt;
    param_set+=(--ntasks=1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# @param: jobscript, name of the slurm batch script to execute&lt;br /&gt;
if  [ &amp;quot;$jobscript&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    jobscript=~/jupyterlab.slurm&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Translate given param_set value to actual set of parameters &lt;br /&gt;
define_param_set $param_set&lt;br /&gt;
echo &amp;quot;param_set: ${param_set[*]}&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch &amp;quot;${param_set[@]}&amp;quot; &amp;quot;$jobscript&amp;quot; | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# @param: slurm_out, the filename for the slurm output file&lt;br /&gt;
if  [ &amp;quot;$slurm_out&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    slurm_out=slurm-${job_id}.out&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# Wait for the output file to appear&lt;br /&gt;
while [ ! -f &amp;quot;$slurm_out&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until the JupyterLab URL is written to the output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from output.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then &lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect with the JupyterLab kernel, please enter the following into your local commandline: &amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;; &lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
echo &amp;quot;Afterwards, you can either&amp;quot;&lt;br /&gt;
echo &amp;quot;- use the kernel in VSCode or &amp;quot;&lt;br /&gt;
echo &amp;quot;- open JupyterLab with this URL: &amp;quot;&lt;br /&gt;
echo &amp;quot;  http://127.0.0.1:${port}/lab &amp;quot;&lt;br /&gt;
echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would mean ending the local connection to the kernel.&amp;quot;&lt;br /&gt;
#rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15976</id>
		<title>Development/VS Code</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15976"/>
		<updated>2026-04-21T07:31:33Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
&lt;br /&gt;
[[File:vscode.png|thumb|Visual Studio Code, Source: https://code.visualstudio.com/|450px]]&lt;br /&gt;
&lt;br /&gt;
[https://github.com/Microsoft/vscode Visual Studio Code] (VS Code) is an open-source code editor from Microsoft. It has become one of the most popular IDEs according to a [https://survey.stackoverflow.co/2024/technology#1-integrated-development-environment Stack Overflow survey]. The functionality of VS Code can easily be extended by installing extensions, which provide almost arbitrary &#039;&#039;&#039;language support&#039;&#039;&#039;, &#039;&#039;&#039;debugging&#039;&#039;&#039; or &#039;&#039;&#039;remote development&#039;&#039;&#039; features. You can install VS Code locally and use it for remote development.&lt;br /&gt;
&lt;br /&gt;
== Visual Studio Code  ==&lt;br /&gt;
Visual Studio Code (VS Code) is a lightweight, extensible code editor from Microsoft that supports many programming languages and features such as debugging and integrated Git. It offers a rich extension marketplace to add language support, themes, and tools tailored to your workflow. VS Code runs on Windows, macOS, and Linux and is popular for its speed, customizability, and strong community ecosystem.&lt;br /&gt;
&lt;br /&gt;
=== Using AI agents ===&lt;br /&gt;
When deploying AI agents on the bwHPC clusters, users must exercise extreme caution and maintain full oversight of the agent&#039;s activities. You are fully responsible for all actions initiated by an agent, including any security breaches or system disruptions it may cause. It is mandatory to strictly monitor resource usage on login nodes, as these are shared resources intended only for lightweight tasks. Any agent found consuming excessive CPU or memory on a login node will be terminated immediately to ensure stability for other users. To comply with usage policies, all AI-driven workloads which generate heavy load must be submitted to the Slurm batch queues rather than running directly on the login nodes.&lt;br /&gt;
&lt;br /&gt;
=== Extension: Remote-SSH ===&lt;br /&gt;
&lt;br /&gt;
In order to remotely develop and debug code at HPC facilities, you can use the [https://code.visualstudio.com/docs/remote/ssh &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; extension]. The extension connects your locally installed VS Code to the remote servers. In contrast to running a graphical IDE within a remote desktop session (RDP, VNC), there are no drawbacks such as laggy input response or blurry font rendering.&lt;br /&gt;
&lt;br /&gt;
==== Installation and Configuration ====&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-extensions-button.png|vscode-extensions-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to install the Remote - SSH extension, just click the Extensions button in the left side bar and enter “remote ssh” in the search field. Choose &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; from the list that appears and click &#039;&#039;&#039;Install&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to configure remote connections, open the Remote Explorer. On Linux systems, the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; is evaluated automatically, and the targets defined in it appear in the left side bar.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-add.png|vscode-remoteexplorer-add.png|350px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If no remote SSH targets are defined in this file, you can easily add one by clicking the + symbol. Make sure that “SSH Targets” is selected in the drop-down menu of the Remote Explorer. Enter the connection details &amp;lt;code&amp;gt;&amp;amp;lt;user&amp;amp;gt;@&amp;amp;lt;server&amp;amp;gt;&amp;lt;/code&amp;gt;. You will be asked whether the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; should be modified or whether another config file should be used or created.&lt;br /&gt;
&lt;br /&gt;
A minimal entry in &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; that makes a remote target appear there could look like this:&lt;br /&gt;
&lt;br /&gt;
 $ cat ~/.ssh/config&lt;br /&gt;
 Host uc3.scc.kit.edu&lt;br /&gt;
   HostName uc3.scc.kit.edu&lt;br /&gt;
   User xy_ab1234&lt;br /&gt;
&lt;br /&gt;
==== Connect to Login Nodes ====&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to connect to a remote SSH target, open the Remote-Explorer. Right-click a target and connect in the current or a new window. TOTP and password can be entered in the corresponding input fields that open.&lt;br /&gt;
&lt;br /&gt;
You are now logged in on the remote server. As usual, you can open a project directory with the standard key binding Ctrl+k Ctrl+o. You can now edit and debug code.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention&#039;&#039;&#039;: Please remember that you are running and debugging the code on a login node. Do not perform resource-intensive tasks. Furthermore, no GPU resources are available to you.&lt;br /&gt;
&lt;br /&gt;
Extensions that are installed locally are only usable on your local machine and are not automatically installed on the remote side. However, as soon as you open the Extensions view during a remote session, VS Code offers to install your local extensions remotely.&lt;br /&gt;
&lt;br /&gt;
==== Disconnect from Login Nodes ====&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-indicator.png|images/vscode-remoteexplorer-indicator.png|200px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to end your remote session, click the green box in the lower left corner. In the input box that opens, select the “Close Remote Connection” option. If you simply close your VS Code window, some server-side components of VS Code will continue to run remotely.&lt;br /&gt;
&lt;br /&gt;
=== Access to Compute Nodes ===&lt;br /&gt;
&lt;br /&gt;
The workflow described above does not allow debugging on compute nodes that have been requested via an interactive Slurm job, for example. Debugging GPU codes is therefore also not possible, since this kind of resource is only accessible within Slurm jobs.&lt;br /&gt;
We strongly discourage using the Code Tunnel application, as it violates our access policies. In this scenario, an application running on the compute node connects to a Microsoft or GitHub server. The locally running VS Code then connects to the compute nodes via these external servers, thereby bypassing the login nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
Access the compute nodes via VS Code and the Remote-SSH plugin is only possible, if you start a temporarily running SSH service on the compute node which listens to an unprivileged port. By tunneling this port to your local computer, you can connect VS code to it.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Code-Server ==&lt;br /&gt;
&lt;br /&gt;
The application [https://github.com/cdr/code-server code-server] allows you to run the server part of VS Code on any machine; it can then be accessed from a web browser on your local computer. This enables, for example, development and debugging on compute nodes.&lt;br /&gt;
Code-server runs a web server listening on an unprivileged port. In order to connect your web browser to the remotely running code-server, you have to forward this port via an SSH tunnel.&lt;br /&gt;
&lt;br /&gt;
[[File:code-server.png|thumb|code-server.png|VS Code in web browser: code-server, Source: https://github.com/cdr/code-server&amp;quot;&amp;gt;https://github.com/cdr/code-server|400px]]&lt;br /&gt;
&lt;br /&gt;
=== Install Code-Server ===&lt;br /&gt;
&lt;br /&gt;
Code-server is pre-installed on bwUniCluster and accessible via an Lmod module:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;module load devel/code-server&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
On clusters with no code-server module, the application can easily be installed with the description available on the official [https://github.com/coder/code-server GitHub page].&lt;br /&gt;
&lt;br /&gt;
=== Start Code-Server ===&lt;br /&gt;
&lt;br /&gt;
Code-server can be run on either login nodes or compute nodes. In the example shown, an interactive job is started on a GPU partition to run code-server there.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ salloc -p accelerated --gres=gpu:4 --time=30:00 # Start interactive job with 4 GPUs&lt;br /&gt;
$ module load devel/code-server                   # Load code-server module&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
When code-server is started, it opens a web server listening on a certain port. The user has to &#039;&#039;&#039;specify the port&#039;&#039;&#039;. It can be chosen freely in the unprivileged range (above 1024). If a port is already assigned, e.g. because several users choose the same port, another port must be chosen.&lt;br /&gt;
&lt;br /&gt;
By starting code-server, you are running a web server that can be accessed by anyone logged in to the cluster. To prevent other people from gaining access to your account and data, this web server is &#039;&#039;&#039;password protected&#039;&#039;&#039;. If no variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; is defined, the password in the default config file &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; is used. If you want to define your own password, you can either change it in the config file or export the variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ PASSWORD=&amp;lt;mySecret&amp;gt; \&lt;br /&gt;
    code-server \&lt;br /&gt;
      --bind-addr 0.0.0.0:8081 \&lt;br /&gt;
      --auth password  # Start code-server on port 8081&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
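For reference, the default config file generated by code-server on first start looks roughly like this (keys as documented by code-server; the generated password will differ):&lt;br /&gt;

```yaml
# ~/.config/code-server/config.yaml (sketch of the generated defaults)
bind-addr: 127.0.0.1:8080
auth: password
password: some-randomly-generated-string  # change this line to set your own password
cert: false
```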
&lt;br /&gt;
{| style=&amp;quot;background:#FFCCCC; width:100%;&amp;quot;&lt;br /&gt;
| &#039;&#039;&#039;Security implications&#039;&#039;&#039;&lt;br /&gt;
Please note that by starting &amp;lt;code&amp;gt;code-server&amp;lt;/code&amp;gt; you are running a web server that can be accessed by everyone logged in on the cluster.&amp;lt;br&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;If password protection is disabled, anybody can access your account and your data.&#039;&#039;&#039;&lt;br /&gt;
* Choose a &#039;&#039;&#039;secure password&#039;&#039;&#039;!&lt;br /&gt;
* Do &#039;&#039;&#039;NOT&#039;&#039;&#039; use &amp;lt;code&amp;gt;code-server --link&amp;lt;/code&amp;gt;!&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Connect to code-server ===&lt;br /&gt;
[[File:code-server-hk.png|thumb|Code-server running on GPU node.|400px]]&lt;br /&gt;
&lt;br /&gt;
As soon as code-server is running, it can be accessed in the web browser. In order to establish the connection, a SSH tunnel from your local computer to the remote server has to be created via:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ ssh -L 8081:&amp;lt;computeNodeID&amp;gt;:8081 &amp;lt;userID&amp;gt;@uc3.scc.kit.edu&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
You need to enter the &amp;lt;code&amp;gt;computeNodeID&amp;lt;/code&amp;gt; of the node on which the interactive Slurm job is running. If you have started code-server on a login node, just enter &amp;lt;code&amp;gt;localhost&amp;lt;/code&amp;gt;. Now you can open http://127.0.0.1:8081 in your web browser. You may have to allow your browser to open an insecure (non-https) site. The login page looks as follows:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:code-server-login.png|Code-server login page.|300px]]&lt;br /&gt;
&lt;br /&gt;
Enter the password from &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; or from the &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; variable. After clicking the “Submit” button, the familiar VS Code interface will open in your browser.&lt;br /&gt;
&lt;br /&gt;
=== End code-server session ===&lt;br /&gt;
&lt;br /&gt;
If you want to temporarily log out from your code-server session, open the “Application Menu” in the left side bar and click on “Log out”. To &#039;&#039;&#039;terminate&#039;&#039;&#039; the code-server session, cancel the interactive Slurm job by pressing Ctrl+C.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Connect to Remote Jupyter Kernel ==&lt;br /&gt;
To work with your Python scripts and notebooks within VS Code while using the resources of a compute node, you can create a batch job that launches JupyterLab and connect to it via VS Code. To do so, please follow the instructions below. Any parts of the scripts that might need adjustment are marked with the keyword &amp;quot;@param&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== Simple Use Case ===&lt;br /&gt;
The most basic steps are to set a password for JupyterLab, start a job which runs JupyterLab, get the connection details from the output log and connect to it locally. The following instructions explain these steps and provide an additional script that replaces the manual step of looking into the output file.&lt;br /&gt;
&lt;br /&gt;
# Load a python module and set a password on the cluster for JupyterLab:&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
    module load devel/miniforge&lt;br /&gt;
    jupyter notebook --generate-config&lt;br /&gt;
    jupyter notebook password&lt;br /&gt;
  &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Define a batch script to start a JupyterLab Job. Please adjust the first part according to your needs and your specific cluster.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;~/jupyterlab.slurm&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
#SBATCH --partition=cpu-single&lt;br /&gt;
#SBATCH --job-name=jupyterlab&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --cpus-per-task 1&lt;br /&gt;
#SBATCH --mail-user=my_email_address  # @param: replace my_email_address with your email address&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
# @param: change this to your preferred python or conda module&lt;br /&gt;
module load devel/miniforge&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
PORT=$(( ( RANDOM % 9999 ) + 1024 ))&lt;br /&gt;
# Print the tunnel command before launching JupyterLab, which blocks until the job ends&lt;br /&gt;
HOSTID=${SLURMD_NODENAME}&lt;br /&gt;
echo &amp;quot;Connect&amp;quot;&lt;br /&gt;
echo &amp;quot;ssh -N -L ${PORT}:${HOSTID}:${PORT} ${USER}@${hostAddress}&amp;quot;&lt;br /&gt;
echo &amp;quot;Job ${SLURM_JOB_ID} running on host ${HOSTID}.&amp;quot;&lt;br /&gt;
jupyter lab --no-browser --ip=0.0.0.0 --port=${PORT}&lt;br /&gt;
&lt;br /&gt;
returned_code=$?&lt;br /&gt;
echo &amp;quot;&amp;gt; Script completed with exit code ${returned_code}&amp;quot;&lt;br /&gt;
exit ${returned_code}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Run a wrapper script to execute the batch script and extract needed information from the slurm output file. You could save it together with other utility scripts in a &amp;quot;bin&amp;quot; directory in your home folder.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;./bin/run_jupyterlab_simple.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
# Define parameters&lt;br /&gt;
jobscript=~/jupyterlab.slurm&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch $jobscript | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Outfile name&lt;br /&gt;
slurm_out=slurm-${job_id}.out&lt;br /&gt;
&lt;br /&gt;
# Wait for output file&lt;br /&gt;
while [ ! -f $slurm_out ]; do   &lt;br /&gt;
    sleep 2; &lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until url is written in output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from output. The pattern assumes a node name with a length of 6 characters and a port with a length of 3, 4 or 5 numbers.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then &lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect with the JupyterLab kernel, please enter the following into your local commandline: &amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;; &lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would mean ending the local connection to the kernel.&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Afterwards, you can use the URL&amp;quot;&lt;br /&gt;
    echo &amp;quot;  http://127.0.0.1:${port}/lab &amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;to:&amp;quot;&lt;br /&gt;
    echo &amp;quot;- use the kernel in VSCode (&#039;Existing Jupyter Server...&#039;, enter URL, enter password, confirm &#039;127.0.0.1&#039;, choose kernel) or &amp;quot;&lt;br /&gt;
    echo &amp;quot;- open JupyterLab in your browser with the URL&amp;quot;&lt;br /&gt;
else&lt;br /&gt;
    echo &amp;quot;The needed information couldn&#039;t be found in the slurm output. Please contact your support unit if you need help with fixing this problem.&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
# rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Follow the instructions on the commandline to connect to the Jupyter kernel from your local machine or the Helix login node. More detailed instructions can be found below. &lt;br /&gt;
&lt;br /&gt;
==== Connect to a running job ====&lt;br /&gt;
&lt;br /&gt;
The job runs on a specific compute node and port. With this information, you can create an SSH connection to it. But first you need to decide how you want to work with your Python code. The options are: &lt;br /&gt;
&lt;br /&gt;
# The code is placed locally on your computer. &lt;br /&gt;
# The code is placed on the cluster and you&#039;ve mounted the folder locally. (= The files on the cluster are accessible from within your local VS Code)&lt;br /&gt;
# The code is placed on the cluster and you work on the cluster via a remote connection in VS Code. &lt;br /&gt;
&lt;br /&gt;
Depending on the use case, you need to execute the ssh command in a different place: &lt;br /&gt;
&lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer and connect to the cluster.&lt;br /&gt;
&lt;br /&gt;
Then open a terminal and execute the ssh command, which is given in the commandline output of the wrapper script. If the terminal isn&#039;t already open, go to menu item &amp;quot;Terminal&amp;quot; at the top of the window and choose &amp;quot;New Terminal&amp;quot; (or &amp;quot;new -&amp;gt; command prompt&amp;quot; on Windows). &lt;br /&gt;
It is normal that the command doesn&#039;t end after you&#039;ve put in your credentials. Leave the terminal open and go on with the next step. &lt;br /&gt;
&lt;br /&gt;
To use the Jupyter kernel that is running on the cluster node, you need to connect to this kernel. This is similar to connecting any other kernel: &lt;br /&gt;
&lt;br /&gt;
# Open your code file.&lt;br /&gt;
# Click &amp;quot;Select Kernel&amp;quot; in the upper right corner. &lt;br /&gt;
# Choose &amp;quot;Existing Jupyter Server...&amp;quot;.&lt;br /&gt;
# Enter the URL that was given by the wrapper script. &lt;br /&gt;
# Enter your JupyterLab password that you set in the first step of these instructions.&lt;br /&gt;
# Confirm the prefilled value &amp;quot;127.0.0.1&amp;quot; by pressing Enter.&lt;br /&gt;
# Choose one of the virtual environments that you&#039;ve created on the cluster. You should see all python environments. To see the conda environments as well, you need to [[Helix/bwVisu/JupyterLab#Python_version | register them as ipykernel]] first. &lt;br /&gt;
&lt;br /&gt;
=== Complex Use Case ===&lt;br /&gt;
If you have different use cases for JupyterLab, you could use a more flexible wrapper script, for example: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;./bin/run_jupyterlab.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Starts a jupyter kernel on a node and provides information on how to connect to it locally.&lt;br /&gt;
# If you have only one use case and therefore need only one combination of slurm settings for your jupyter jobs, then you can use the simpler script.&lt;br /&gt;
# This script supports explorative analyses by allowing to overwrite parameters via commandline.&lt;br /&gt;
# Different job configurations can be defined in advance and then used with a given short name (cpu, gpu,...).&lt;br /&gt;
&lt;br /&gt;
programname=$0&lt;br /&gt;
function help {&lt;br /&gt;
    # Print the help text&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Starts a jupyterlab kernel&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;usage example: $programname --param_set cpu&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --param_set string   name of the parameter set&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (examples: cpu, gpu)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --jobscript string   optional, path of batch script&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: ~/jupyterlab.slurm)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --slurm_out string   optional, name of slurm output file&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: slurm-JOBID.out)&amp;quot;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# These parameters are set later in the script. Providing them via the command line overwrites the values set in the script.&lt;br /&gt;
jobscript=None&lt;br /&gt;
slurm_out=None&lt;br /&gt;
&lt;br /&gt;
# Process parameters&lt;br /&gt;
while [ $# -gt 0 ]; do&lt;br /&gt;
    if [[ $1 == &amp;quot;--help&amp;quot; ]]; then&lt;br /&gt;
        help&lt;br /&gt;
        exit 0&lt;br /&gt;
    # when given -p as parameter, use its value for the variable param_set&lt;br /&gt;
    elif [[ $1 == &amp;quot;-p&amp;quot; ]]; then&lt;br /&gt;
        param_set=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    elif [[ $1 == &amp;quot;--&amp;quot;* ]]; then&lt;br /&gt;
        v=&amp;quot;${1/--/}&amp;quot;&lt;br /&gt;
        declare &amp;quot;$v&amp;quot;=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    fi&lt;br /&gt;
    shift&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
function define_param_set(){&lt;br /&gt;
    # Define parameter sets for sbatch&lt;br /&gt;
    # Define different sets&lt;br /&gt;
    cpu=(--partition=cpu-single --mem=2gb)&lt;br /&gt;
    gpu=(--partition=gpu-single --mem=3gb --gres=gpu:1)&lt;br /&gt;
&lt;br /&gt;
    param_set=${1}&lt;br /&gt;
    param_set=$param_set[@] &lt;br /&gt;
    param_set=(&amp;quot;${!param_set}&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
    # Add params that are the same for all sets&lt;br /&gt;
    param_set+=(--ntasks=1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# @param: jobscript, name of the slurm batch script to execute&lt;br /&gt;
if  [ &amp;quot;$jobscript&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    jobscript=~/jupyterlab.slurm&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Translate given param_set value to actual set of parameters &lt;br /&gt;
define_param_set $param_set&lt;br /&gt;
echo &amp;quot;param_set: ${param_set[*]}&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch ${param_set[@]} $jobscript | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# @param: slurm_out, the filename for the slurm output file&lt;br /&gt;
if  [ &amp;quot;$slurm_out&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    slurm_out=slurm-${job_id}.out&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# Wait for output file&lt;br /&gt;
while [ ! -f $slurm_out ]; do   &lt;br /&gt;
    sleep 1; &lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until url is written in output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from output.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then &lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect with the JupyterLab kernel, please enter the following into your local commandline: &amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;; &lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
echo &amp;quot;Afterwards, you can either&amp;quot;&lt;br /&gt;
echo &amp;quot;- use the kernel in VSCode or &amp;quot;&lt;br /&gt;
echo &amp;quot;- open JupyterLab with this URL: &amp;quot;&lt;br /&gt;
echo &amp;quot;  http://127.0.0.1:${port}/lab &amp;quot;&lt;br /&gt;
echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would mean ending the local connection to the kernel.&amp;quot;&lt;br /&gt;
#rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15975</id>
		<title>Development/VS Code</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15975"/>
		<updated>2026-04-21T07:12:07Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
&lt;br /&gt;
[[File:vscode.png|thumb|Visual Studio Code, Source: https://code.visualstudio.com/|450px]]&lt;br /&gt;
&lt;br /&gt;
[https://github.com/Microsoft/vscode Visual Studio Code] (VS Code) is a source-code editor from Microsoft whose core is developed as open source. It has become one of the most popular IDEs according to a [https://survey.stackoverflow.co/2024/technology#1-integrated-development-environment Stack Overflow survey]. The functionality of VS Code can easily be extended by installing extensions. These extensions allow for almost arbitrary &#039;&#039;&#039;language support&#039;&#039;&#039;, &#039;&#039;&#039;debugging&#039;&#039;&#039; or &#039;&#039;&#039;remote development&#039;&#039;&#039;. You can install VS Code locally and use it for remote development.&lt;br /&gt;
&lt;br /&gt;
== Visual Studio Code  ==&lt;br /&gt;
Visual Studio Code (VS Code) is a lightweight, extensible code editor from Microsoft that supports many programming languages and features like IntelliSense, debugging, and integrated Git. It offers a rich extension marketplace to add language support, themes, and tools tailored to your workflow. VS Code runs on Windows, macOS, and Linux and is popular for its speed, customizability, and strong community ecosystem.&lt;br /&gt;
&lt;br /&gt;
=== Using AI agents ===&lt;br /&gt;
When deploying AI agents on the bwHPC clusters, users must exercise extreme caution and maintain full oversight of the agent&#039;s activities. You are fully responsible for all actions initiated by an agent, including any security breaches or system disruptions it may cause. Strictly monitor resource usage on login nodes, as these are shared resources intended only for lightweight tasks. Any agent found consuming excessive CPU or memory on a login node will be terminated immediately to ensure stability for other users. To comply with usage policies, all AI-driven workloads that generate heavy load must be submitted to the Slurm batch queues rather than run directly on the login nodes.&lt;br /&gt;
&lt;br /&gt;
=== Extension: Remote-SSH ===&lt;br /&gt;
&lt;br /&gt;
In order to remotely develop and debug code at HPC facilities, you can use the [https://code.visualstudio.com/docs/remote/ssh &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; extension]. The extension connects your locally installed VS Code to the remote servers. In contrast to using graphical IDEs within a remote desktop session (RDP, VNC), there are no drawbacks such as laggy reactions to your input or blurred rendering of fonts.&lt;br /&gt;
&lt;br /&gt;
==== Installation and Configuration ====&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-extensions-button.png|vscode-extensions-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to install the Remote - SSH extension, click on the Extensions button in the left side bar and enter “remote ssh” in the search field. Choose &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; from the list that appears and click on &#039;&#039;&#039;Install&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to configure remote connections, open the Remote-Explorer extension. On Linux systems, the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; is automatically evaluated. The targets within this file already appear in the left side bar.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-add.png|vscode-remoteexplorer-add.png|350px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If there are no remote SSH targets defined within this file, you can easily add one by clicking on the + symbol. Make sure that “SSH Targets” is selected in the drop-down menu of the Remote-Explorer. Enter the connection details &amp;lt;code&amp;gt;&amp;amp;lt;user&amp;amp;gt;@&amp;amp;lt;server&amp;amp;gt;&amp;lt;/code&amp;gt;. You will be asked whether the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; should be modified or whether another config file should be used or created.&lt;br /&gt;
&lt;br /&gt;
A minimal entry within &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; that makes a remote target appear there could look like this:&lt;br /&gt;
&lt;br /&gt;
 $ cat ~/.ssh/config&lt;br /&gt;
 Host uc3.scc.kit.edu&lt;br /&gt;
   HostName uc3.scc.kit.edu&lt;br /&gt;
   User xy_ab1234&lt;br /&gt;
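Standard OpenSSH options, for example a dedicated key file or keep-alive settings, can be added to the same entry (the values below are examples):&lt;br /&gt;

```text
Host uc3.scc.kit.edu
   HostName uc3.scc.kit.edu
   User xy_ab1234
   IdentityFile ~/.ssh/id_ed25519
   ServerAliveInterval 60
   ServerAliveCountMax 5
```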
&lt;br /&gt;
==== Connect to Login Nodes ====&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to connect to a remote SSH target, open the Remote-Explorer. Right-click a target and connect in the current or a new window. TOTP and password can be entered in the corresponding input fields that open.&lt;br /&gt;
&lt;br /&gt;
You are now logged in on the remote server. As usual, you can open a project directory with the standard key binding Ctrl+K Ctrl+O. You can now edit and debug code.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention&#039;&#039;&#039;: Please remember that you are running and debugging the code on a login node. Do not perform resource-intensive tasks. Furthermore, no GPU resources are available to you.&lt;br /&gt;
&lt;br /&gt;
Extensions, which are installed locally, are only usable on your local machine and are not automatically installed remotely. However, as soon as you open the Extensions-Explorer during a remote session, VS Code proposes to install the locally installed extensions remotely.&lt;br /&gt;
&lt;br /&gt;
==== Disconnect from Login Nodes ====&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-indicator.png|vscode-remoteexplorer-indicator.png|200px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to end your remote session, click the green box in the lower left corner. In the input box that opens, select the “Close Remote Connection” option. If you simply close your VS Code window, some server-side components of VS Code will continue to run remotely.&lt;br /&gt;
&lt;br /&gt;
=== Access to Compute Nodes ===&lt;br /&gt;
&lt;br /&gt;
The workflow described above does not allow debugging on compute nodes that have been requested via an interactive Slurm job, for example. The security settings prevent the login node from being used as a proxy jump host, so there is no direct way to connect your locally installed VS Code to the compute nodes. Debugging GPU code is therefore also not possible, since this kind of resource is only accessible within Slurm jobs. Please have a look at the overview table in the first chapter to see which solution to follow.&lt;br /&gt;
&lt;br /&gt;
== Code-Server ==&lt;br /&gt;
&lt;br /&gt;
The application [https://github.com/cdr/code-server code-server] allows you to run the server part of VS Code on any machine; it can then be accessed in the web browser. This enables, for example, development and debugging on compute nodes.&lt;br /&gt;
Code-server runs a web server on an unprivileged port. In order to connect your web browser to the remotely running code-server, you have to forward this port via an SSH tunnel.&lt;br /&gt;
&lt;br /&gt;
[[File:code-server.png|thumb|VS Code in web browser: code-server, Source: https://github.com/cdr/code-server|400px]]&lt;br /&gt;
&lt;br /&gt;
=== Install Code-Server ===&lt;br /&gt;
&lt;br /&gt;
Code-server is pre-installed on bwUniCluster and accessible via an Lmod module:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;module load devel/code-server&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
On clusters without a code-server module, the application can easily be installed following the instructions on the official [https://github.com/coder/code-server GitHub page].&lt;br /&gt;
&lt;br /&gt;
=== Start Code-Server ===&lt;br /&gt;
&lt;br /&gt;
Code-server can be run on either login nodes or compute nodes. In the example shown, an interactive job is started on a GPU partition to run code-server there.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ salloc -p accelerated --gres=gpu:4 --time=30:00 # Start interactive job with 4 GPUs&lt;br /&gt;
$ module load devel/code-server                   # Load code-server module&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
When code-server is started, it opens a web server listening on a certain port. The user has to &#039;&#039;&#039;specify the port&#039;&#039;&#039;. It can be chosen freely in the unprivileged range (above 1024). If a port is already assigned, e.g. because several users choose the same port, another port must be chosen.&lt;br /&gt;
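A free port can be picked at random before starting code-server; the following is a sketch, assuming the iproute2 tool &amp;lt;code&amp;gt;ss&amp;lt;/code&amp;gt; is available on the node:&lt;br /&gt;

```shell
# Pick a random port in the unprivileged range (1024 and above).
PORT=$(( ( RANDOM % 64511 ) + 1024 ))
# Sketch: check whether something already listens on that port (assumes "ss").
if ss -Htln | grep -q ":${PORT} "; then
  echo "port ${PORT} is already in use, pick another one"
else
  echo "candidate port: ${PORT}"
fi
```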
&lt;br /&gt;
By starting code-server, you are running a web server that can be accessed by anyone logged in to the cluster. To prevent other people from gaining access to your account and data, this web server is &#039;&#039;&#039;password protected&#039;&#039;&#039;. If no variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; is defined, the password in the default config file &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; is used. If you want to define your own password, you can either change it in the config file or export the variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ PASSWORD=&amp;lt;mySecret&amp;gt; \&lt;br /&gt;
    code-server \&lt;br /&gt;
      --bind-addr 0.0.0.0:8081 \&lt;br /&gt;
      --auth password  # Start code-server on port 8081&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
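For reference, the default config file generated by code-server on first start looks roughly like this (keys as documented by code-server; the generated password will differ):&lt;br /&gt;

```yaml
# ~/.config/code-server/config.yaml (sketch of the generated defaults)
bind-addr: 127.0.0.1:8080
auth: password
password: some-randomly-generated-string  # change this line to set your own password
cert: false
```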
&lt;br /&gt;
{| style=&amp;quot;background:#FFCCCC; width:100%;&amp;quot;&lt;br /&gt;
| &#039;&#039;&#039;Security implications&#039;&#039;&#039;&lt;br /&gt;
Please note that by starting &amp;lt;code&amp;gt;code-server&amp;lt;/code&amp;gt; you are running a web server that can be accessed by everyone logged in on the cluster.&amp;lt;br&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;If password protection is disabled, anybody can access your account and your data.&#039;&#039;&#039;&lt;br /&gt;
* Choose a &#039;&#039;&#039;secure password&#039;&#039;&#039;!&lt;br /&gt;
* Do &#039;&#039;&#039;NOT&#039;&#039;&#039; use &amp;lt;code&amp;gt;code-server --link&amp;lt;/code&amp;gt;!&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Connect to code-server ===&lt;br /&gt;
[[File:code-server-hk.png|thumb|Code-server running on GPU node.|400px]]&lt;br /&gt;
&lt;br /&gt;
As soon as code-server is running, it can be accessed in the web browser. In order to establish the connection, a SSH tunnel from your local computer to the remote server has to be created via:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ ssh -L 8081:&amp;lt;computeNodeID&amp;gt;:8081 &amp;lt;userID&amp;gt;@uc3.scc.kit.edu&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
You need to enter the &amp;lt;code&amp;gt;computeNodeID&amp;lt;/code&amp;gt; of the node on which the interactive Slurm job is running. If you have started code-server on a login node, just enter &amp;lt;code&amp;gt;localhost&amp;lt;/code&amp;gt;. Now you can open http://127.0.0.1:8081 in your web browser. You may have to allow your browser to open an insecure (non-https) site. The login page looks as follows:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:code-server-login.png|Code-server login page.|300px]]&lt;br /&gt;
&lt;br /&gt;
Enter the password from &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; or from the &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; variable. After clicking the “Submit” button, the familiar VS Code interface will open in your browser.&lt;br /&gt;
&lt;br /&gt;
=== End code-server session ===&lt;br /&gt;
&lt;br /&gt;
If you want to temporarily log out from your code-server session, open the “Application Menu” in the left side bar and click on “Log out”. To &#039;&#039;&#039;terminate&#039;&#039;&#039; the code-server session, cancel the interactive Slurm job by pressing Ctrl+C.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Connect to Remote Jupyter Kernel ==&lt;br /&gt;
To work with your Python scripts and notebooks within VS Code while using the resources of a compute node, you can create a batch job that launches JupyterLab and connect to it via VS Code. To do so, please follow the instructions below. Any parts of the scripts that might need adjustment are marked with the keyword &amp;quot;@param&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== Simple Use Case ===&lt;br /&gt;
The most basic steps are to set a password for JupyterLab, start a job which runs JupyterLab, get the connection details from the output log and connect to it locally. The following instructions explain these steps and provide an additional script that replaces the manual step of looking into the output file.&lt;br /&gt;
&lt;br /&gt;
# Load a python module and set a password on the cluster for JupyterLab:&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
    module load devel/miniforge&lt;br /&gt;
    jupyter notebook --generate-config&lt;br /&gt;
    jupyter notebook password&lt;br /&gt;
  &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Define a batch script to start a JupyterLab Job. Please adjust the first part according to your needs and your specific cluster.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;~/jupyterlab.slurm&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
#SBATCH --partition=cpu-single&lt;br /&gt;
#SBATCH --job-name=jupyterlab&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --cpus-per-task 1&lt;br /&gt;
#SBATCH --mail-user=my_email_address  # @param: replace my_email_address with your email address&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
# @param: change this to your preferred python or conda module&lt;br /&gt;
module load devel/miniforge&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
PORT=$(( ( RANDOM % 9999 ) + 1024 ))&lt;br /&gt;
# Print the tunnel command before launching JupyterLab, which blocks until the job ends&lt;br /&gt;
HOSTID=${SLURMD_NODENAME}&lt;br /&gt;
echo &amp;quot;Connect&amp;quot;&lt;br /&gt;
echo &amp;quot;ssh -N -L ${PORT}:${HOSTID}:${PORT} ${USER}@${hostAddress}&amp;quot;&lt;br /&gt;
echo &amp;quot;Job ${SLURM_JOB_ID} running on host ${HOSTID}.&amp;quot;&lt;br /&gt;
jupyter lab --no-browser --ip=0.0.0.0 --port=${PORT}&lt;br /&gt;
&lt;br /&gt;
returned_code=$?&lt;br /&gt;
echo &amp;quot;&amp;gt; Script completed with exit code ${returned_code}&amp;quot;&lt;br /&gt;
exit ${returned_code}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Run a wrapper script to execute the batch script and extract needed information from the slurm output file. You could save it together with other utility scripts in a &amp;quot;bin&amp;quot; directory in your home folder.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;./bin/run_jupyterlab_simple.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
# Define parameters&lt;br /&gt;
jobscript=~/jupyterlab.slurm&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch $jobscript | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Outfile name&lt;br /&gt;
slurm_out=slurm-${job_id}.out&lt;br /&gt;
&lt;br /&gt;
# Wait for the output file&lt;br /&gt;
while [ ! -f &amp;quot;$slurm_out&amp;quot; ]; do&lt;br /&gt;
    sleep 2&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until the URL has been written to the output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from output. The pattern assumes a node name with a length of 6 characters and a port with a length of 3, 4 or 5 numbers.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then &lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect with the JupyterLab kernel, please enter the following into your local commandline: &amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;; &lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would mean ending the local connection to the kernel.&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Afterwards, you can use the URL&amp;quot;&lt;br /&gt;
    echo &amp;quot;  http://127.0.0.1:${port}/lab &amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;to:&amp;quot;&lt;br /&gt;
    echo &amp;quot;- use the kernel in VSCode (&#039;Existing Jupyter Server...&#039;, enter URL, enter password, confirm &#039;127.0.0.1&#039;, choose kernel) or &amp;quot;&lt;br /&gt;
    echo &amp;quot;- open JupyterLab in your browser with the URL&amp;quot;&lt;br /&gt;
else&lt;br /&gt;
    echo &amp;quot;The needed information couldn&#039;t be found in the slurm output. Please contact your support unit if you need help with fixing this problem.&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
# rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Follow the instructions on the commandline to connect to the Jupyter kernel from your local machine or the Helix login node. More detailed instructions can be found below. &lt;br /&gt;
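The regular expression used in the wrapper can also be tried out on its own. The following sketch feeds it a sample line of the kind JupyterLab writes to the output file (the node name node01 and the port 8888 are made-up values) and prints the extracted host and port:&lt;br /&gt;

```shell
# Standalone demo of the URL extraction used in the wrapper script above.
# "node01" and "8888" are made-up sample values.
url='http://node01:8888/lab?token=abc'
url_pattern='http://([a-z0-9]{6}):([0-9]{3,5})/lab'
if [[ $url =~ $url_pattern ]]; then
    hostID=${BASH_REMATCH[1]}    # first capture group: node name
    port=${BASH_REMATCH[2]}      # second capture group: port
    echo "host=$hostID port=$port"   # prints: host=node01 port=8888
fi
```

If the pattern does not match your cluster&#039;s node names (e.g. names longer than 6 characters), adjust the {6} quantifier accordingly.&lt;br /&gt;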
&lt;br /&gt;
==== Connect to a running job ====&lt;br /&gt;
&lt;br /&gt;
The job runs on a specific compute node and port. With this information, you can create an SSH connection to it. But first you need to decide how you want to work with your Python code. The options are: &lt;br /&gt;
&lt;br /&gt;
# The code is placed locally on your computer. &lt;br /&gt;
# The code is placed on the cluster and you&#039;ve mounted the folder locally (i.e. the files on the cluster are accessible from within your local VS Code).&lt;br /&gt;
# The code is placed on the cluster and you work on the cluster via a remote connection in VS Code. &lt;br /&gt;
&lt;br /&gt;
Depending on the use case, you need to execute the ssh command in a different place: &lt;br /&gt;
&lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer and connect to the cluster.&lt;br /&gt;
&lt;br /&gt;
Then open a terminal and execute the ssh command given in the commandline output of the wrapper script. If no terminal is open yet, go to the menu item &amp;quot;Terminal&amp;quot; at the top of the window and choose &amp;quot;New Terminal&amp;quot; (or &amp;quot;New&amp;quot; -&amp;gt; &amp;quot;Command Prompt&amp;quot; on Windows). &lt;br /&gt;
It is normal that the command doesn&#039;t terminate after you&#039;ve entered your credentials. Leave the terminal open and continue with the next step. &lt;br /&gt;
&lt;br /&gt;
To use the Jupyter kernel that is running on the cluster node, you need to connect to this kernel. This works like connecting any other kernel: &lt;br /&gt;
&lt;br /&gt;
# Open your code file.&lt;br /&gt;
# Click &amp;quot;Select Kernel&amp;quot; in the upper right corner. &lt;br /&gt;
# Choose &amp;quot;Existing Jupyter Server...&amp;quot;.&lt;br /&gt;
# Enter the URL that was given by the wrapper script. &lt;br /&gt;
# Enter your JupyterLab password that you set in the first step of these instructions.&lt;br /&gt;
# Confirm the prefilled value &amp;quot;127.0.0.1&amp;quot; by pressing Enter.&lt;br /&gt;
# Choose one of the virtual environments that you&#039;ve created on the cluster. You should see all python environments. To see the conda environments as well, you need to [[Helix/bwVisu/JupyterLab#Python_version | register them as ipykernel]] first. &lt;br /&gt;
&lt;br /&gt;
=== Complex Use Case ===&lt;br /&gt;
If you have different use cases for JupyterLab, you can use a more flexible wrapper script, for example: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;./bin/run_jupyterlab.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Starts a jupyter kernel on a node and provides information on how to connect to it locally.&lt;br /&gt;
# If you have only one use case and therefore need only one combination of slurm settings for your jupyter jobs, then you can use the simpler script.&lt;br /&gt;
# This script supports explorative analyses by allowing to overwrite parameters via commandline.&lt;br /&gt;
# Different job configurations can be defined in advance and then used with a given short name (cpu, gpu,...).&lt;br /&gt;
&lt;br /&gt;
programname=$0&lt;br /&gt;
function help {&lt;br /&gt;
    # Print usage information&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Starts a jupyterlab kernel&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;usage example: $programname --param_set cpu&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --param_set string   name of the parameter set&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (examples: cpu, gpu)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --jobscript string   optional, path of batch script&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: ~/jupyterlab.slurm)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --slurm_out string   optional, name of slurm output file&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: slurm-&amp;lt;job_id&amp;gt;.out)&amp;quot;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# These parameters are set later in the script. Providing them via the commandline overwrites the values set in the script.&lt;br /&gt;
jobscript=None&lt;br /&gt;
slurm_out=None&lt;br /&gt;
&lt;br /&gt;
# Process parameters&lt;br /&gt;
while [ $# -gt 0 ]; do&lt;br /&gt;
    if [[ $1 == &amp;quot;--help&amp;quot; ]]; then&lt;br /&gt;
        help&lt;br /&gt;
        exit 0&lt;br /&gt;
    # when given -p as parameter, use its value for the variable param_set&lt;br /&gt;
    elif [[ $1 == &amp;quot;-p&amp;quot; ]]; then&lt;br /&gt;
        param_set=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    elif [[ $1 == &amp;quot;--&amp;quot;* ]]; then&lt;br /&gt;
        v=&amp;quot;${1/--/}&amp;quot;&lt;br /&gt;
        declare &amp;quot;$v&amp;quot;=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    fi&lt;br /&gt;
    shift&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
function define_param_set(){&lt;br /&gt;
    # Define parameter sets for sbatch&lt;br /&gt;
    # Define different sets&lt;br /&gt;
    cpu=(--partition=cpu-single --mem=2gb)&lt;br /&gt;
    gpu=(--partition=gpu-single --mem=3gb --gres=gpu:1)&lt;br /&gt;
&lt;br /&gt;
    # Resolve the set name (e.g. &amp;quot;cpu&amp;quot;) to the array of the same name via indirect expansion&lt;br /&gt;
    param_set=${1}&lt;br /&gt;
    param_set=$param_set[@]&lt;br /&gt;
    param_set=(&amp;quot;${!param_set}&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
    # Add params that are the same for all sets&lt;br /&gt;
    param_set+=(--ntasks=1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# @param: jobscript, name of the slurm batch script to execute&lt;br /&gt;
if  [ &amp;quot;$jobscript&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    jobscript=~/jupyterlab.slurm&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Translate given param_set value to actual set of parameters &lt;br /&gt;
define_param_set $param_set&lt;br /&gt;
echo &amp;quot;param_set: ${param_set[*]}&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch &amp;quot;${param_set[@]}&amp;quot; $jobscript | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# @param: slurm_out, the filename for the slurm output file&lt;br /&gt;
if  [ &amp;quot;$slurm_out&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    slurm_out=slurm-${job_id}.out&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# Wait for the output file&lt;br /&gt;
while [ ! -f &amp;quot;$slurm_out&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until the URL has been written to the output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from output.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then &lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect with the JupyterLab kernel, please enter the following into your local commandline: &amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;; &lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
echo &amp;quot;Afterwards, you can either&amp;quot;&lt;br /&gt;
echo &amp;quot;- use the kernel in VSCode or &amp;quot;&lt;br /&gt;
echo &amp;quot;- open JupyterLab with this URL: &amp;quot;&lt;br /&gt;
echo &amp;quot;  http://127.0.0.1:${port}/lab &amp;quot;&lt;br /&gt;
echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would mean ending the local connection to the kernel.&amp;quot;&lt;br /&gt;
#rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
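The generic double-dash option parsing with declare can be tested in isolation. This sketch (the argument values my.slurm and cpu are made up) shows how a pair like --jobscript my.slurm ends up in the shell variable jobscript:&lt;br /&gt;

```shell
# Standalone demo of the argument parsing used in the wrapper above.
# "my.slurm" and "cpu" are made-up sample values.
set -- --jobscript my.slurm -p cpu
while [ $# -gt 0 ]; do
    if [[ $1 == "-p" ]]; then
        param_set="$2"          # short option for param_set
        shift
    elif [[ $1 == "--"* ]]; then
        v="${1/--/}"            # strip the leading "--"
        declare "$v"="$2"       # create a variable named after the option
        shift
    fi
    shift
done
echo "jobscript=$jobscript param_set=$param_set"
```

Because declare creates the variable dynamically, any --name value pair works without being listed explicitly in the parser.&lt;br /&gt;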
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15974</id>
		<title>Development/VS Code</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15974"/>
		<updated>2026-04-21T07:04:43Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
&lt;br /&gt;
[[File:vscode.png|thumb|Visual Studio Code, Source: https://code.visualstudio.com/|450px]]&lt;br /&gt;
&lt;br /&gt;
[https://github.com/Microsoft/vscode Visual Studio Code] (VS Code) is an open-source source-code editor from Microsoft. It has become one of the most popular IDEs according to a [https://survey.stackoverflow.co/2024/technology#1-integrated-development-environment Stack Overflow survey]. The functionality of VS Code can easily be extended by installing extensions. These extensions allow for almost arbitrary &#039;&#039;&#039;language support&#039;&#039;&#039;, &#039;&#039;&#039;debugging&#039;&#039;&#039; or &#039;&#039;&#039;remote development&#039;&#039;&#039;. You can install VS Code locally and use it for remote development.&lt;br /&gt;
&lt;br /&gt;
== VS Code extension: Remote - SSH ==&lt;br /&gt;
&lt;br /&gt;
In order to remotely develop and debug code at HPC facilities, you can use the [https://code.visualstudio.com/docs/remote/ssh &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; extension]. The extension allows you to connect your locally installed VS Code to remote servers. In contrast to using graphical IDEs within a remote desktop session (RDP, VNC), there are no negative effects such as laggy reactions to your input or blurred fonts.&lt;br /&gt;
&lt;br /&gt;
=== Installation and Configuration ===&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-extensions-button.png|vscode-extensions-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to install the Remote - SSH extension, just click on the Extensions button in the left side bar and enter “remote ssh” in the search field. Choose &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; from the results list and click on &#039;&#039;&#039;Install&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to configure remote connections, open the Remote-Explorer extension. On Linux systems, the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; is automatically evaluated. The targets within this file already appear in the left side bar.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-add.png|vscode-remoteexplorer-add.png|350px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If there are no remote ssh targets defined within this file, you can easily add one by clicking on the + symbol. Make sure that “SSH Targets” is active in the drop-down menu of the Remote-Explorer. Enter the connection details &amp;lt;code&amp;gt;&amp;amp;lt;user&amp;amp;gt;@&amp;amp;lt;server&amp;amp;gt;&amp;lt;/code&amp;gt;. You will be asked whether the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; should be modified or whether another config file should be used or created.&lt;br /&gt;
&lt;br /&gt;
A minimal entry within &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; that makes a remote target appear in that list could look like this:&lt;br /&gt;
&lt;br /&gt;
 $ cat ~/.ssh/config&lt;br /&gt;
 Host uc3.scc.kit.edu&lt;br /&gt;
   HostName uc3.scc.kit.edu&lt;br /&gt;
   User xy_ab1234&lt;br /&gt;
&lt;br /&gt;
=== Connect to Login Nodes ===&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to connect to a remote SSH target, open the Remote-Explorer. Right-click a target and connect in the current or a new window. TOTP and password can be entered in the corresponding input fields that open.&lt;br /&gt;
&lt;br /&gt;
You are now logged in on the remote server. As usual, you can open a project directory with the standard key binding Ctrl+k Ctrl+o. You can now edit and debug code.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention&#039;&#039;&#039;: Please remember that you are running and debugging the code on a login node. Do not perform resource-intensive tasks. Furthermore, no GPU resources are available to you.&lt;br /&gt;
&lt;br /&gt;
Extensions that are installed locally are only usable on your local machine and are not automatically installed remotely. However, as soon as you open the Extensions-Explorer during a remote session, VS Code proposes to install the locally installed extensions remotely.&lt;br /&gt;
&lt;br /&gt;
=== Disconnect from Login Nodes ===&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-indicator.png|images/vscode-remoteexplorer-indicator.png|200px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to end your remote session, click the green box in the lower left corner. In the input box that opens, select the “Close Remote Connection” option. If you simply close your VS Code window, some server-side components of VS Code will continue to run remotely.&lt;br /&gt;
&lt;br /&gt;
=== Access to Compute Nodes ===&lt;br /&gt;
&lt;br /&gt;
The workflow described above does not allow debugging on compute nodes that have been requested via an interactive Slurm job, for example. The security settings prevent the login node from being used as a proxy jump host. So there is no direct way to connect your locally installed VS Code to the compute nodes. Debugging GPU codes is therefore also not possible, since this kind of resource is only accessible within Slurm jobs. Please have a look at the overview table in the first chapter to see which solution to follow.&lt;br /&gt;
&lt;br /&gt;
== Code-Server ==&lt;br /&gt;
&lt;br /&gt;
The application [https://github.com/cdr/code-server code-server] allows running the server part of VS Code on any machine; it can then be accessed in the web browser. This enables, for example, development and debugging on compute nodes.&lt;br /&gt;
Code-server runs a web server listening on an unprivileged port. In order to connect your web browser to the remotely running code-server, you have to forward this port via an SSH tunnel.&lt;br /&gt;
&lt;br /&gt;
[[File:code-server.png|thumb|VS Code in the web browser: code-server. Source: https://github.com/cdr/code-server|400px]]&lt;br /&gt;
&lt;br /&gt;
=== Install Code-Server ===&lt;br /&gt;
&lt;br /&gt;
Code-server is pre-installed on bwUniCluster and accessible via an Lmod module:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;module load devel/code-server&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
On clusters with no code-server module, the application can easily be installed with the description available on the official [https://github.com/coder/code-server GitHub page].&lt;br /&gt;
&lt;br /&gt;
=== Start Code-Server ===&lt;br /&gt;
&lt;br /&gt;
Code-server can be run on either login nodes or compute nodes. In the example shown, an interactive job is started on a GPU partition to run code-server there.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ salloc -p accelerated --gres=gpu:4 --time=30:00 # Start interactive job with 4 GPUs&lt;br /&gt;
$ module load devel/code-server                   # Load code-server module&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
When code-server is started, it opens a web server listening on a certain port. The user has to &#039;&#039;&#039;specify the port&#039;&#039;&#039;. It can be chosen freely in the unprivileged range (above 1024). If the port is already in use, e.g. because several users chose the same port, another port must be chosen.&lt;br /&gt;
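A port in that range can, for instance, be picked at random, in the same way the JupyterLab batch script elsewhere on this wiki does; collisions with other users are still possible, so the value may simply need to be regenerated:&lt;br /&gt;

```shell
# Pick a random port between 1024 and 11022 (within the unprivileged range).
# If code-server reports the port as already in use, just run this again.
PORT=$(( ( RANDOM % 9999 ) + 1024 ))
echo "chosen port: $PORT"
```

RANDOM % 9999 limits the result to ports up to 11022; any free unprivileged port up to 65535 would work just as well.&lt;br /&gt;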
&lt;br /&gt;
By starting code-server, you are running a web server that can be accessed by anyone logged in to the cluster. To prevent other people from gaining access to your account and data, this web server is &#039;&#039;&#039;password protected&#039;&#039;&#039;. If no variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; is defined, the password in the default config file &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; is used. If you want to define your own password, you can either change it in the config file or export the variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ PASSWORD=&amp;lt;mySecret&amp;gt; \&lt;br /&gt;
    code-server \&lt;br /&gt;
      --bind-addr 0.0.0.0:8081 \&lt;br /&gt;
      --auth password  # Start code-server on port 8081&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;background:#FFCCCC; width:100%;&amp;quot;&lt;br /&gt;
| &#039;&#039;&#039;Security implications&#039;&#039;&#039;&lt;br /&gt;
Please note that by starting &amp;lt;code&amp;gt;code-server&amp;lt;/code&amp;gt; you are running a web server that can be accessed by everyone logged in on the cluster.&amp;lt;br&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;If password protection is disabled, anybody can access your account and your data.&#039;&#039;&#039;&lt;br /&gt;
* Choose a &#039;&#039;&#039;secure password&#039;&#039;&#039;!&lt;br /&gt;
* Do &#039;&#039;&#039;NOT&#039;&#039;&#039; use &amp;lt;code&amp;gt;code-server --link&amp;lt;/code&amp;gt;!&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Connect to code-server ===&lt;br /&gt;
[[File:code-server-hk.png|thumb|Code-server running on GPU node.|400px]]&lt;br /&gt;
&lt;br /&gt;
As soon as code-server is running, it can be accessed in the web browser. In order to establish the connection, an SSH tunnel from your local computer to the remote server has to be created via:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ ssh -L 8081:&amp;lt;computeNodeID&amp;gt;:8081 &amp;lt;userID&amp;gt;@uc3.scc.kit.edu&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
You need to enter the &amp;lt;code&amp;gt;computeNodeID&amp;lt;/code&amp;gt; of the node on which the interactive Slurm job is running. If you have started code-server on a login node, just enter &amp;lt;code&amp;gt;localhost&amp;lt;/code&amp;gt;. Now you can open http://127.0.0.1:8081 in your web browser. You may have to allow your browser to open an insecure (non-https) site. The login site looks as follows:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:code-server-login.png|Code-server login page.|300px]]&lt;br /&gt;
&lt;br /&gt;
Enter the password from &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; or from the &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; variable. After clicking the “Submit” button, the familiar VS Code interface will open in your browser.&lt;br /&gt;
&lt;br /&gt;
=== End code-server session ===&lt;br /&gt;
&lt;br /&gt;
If you want to temporarily log out from your code-server session, you can open the “Application Menu” in the left side bar and click on “Log out”. To &#039;&#039;&#039;terminate&#039;&#039;&#039; the code-server session, cancel it in the interactive Slurm job by pressing Ctrl+C.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Connect to Remote Jupyter Kernel ==&lt;br /&gt;
To work with your Python scripts and notebooks within VS Code while using the resources of a compute node, you can create a batch job that launches JupyterLab and connect to it via VS Code. To do so, please follow the instructions below. Any parts of the scripts that might need adjustments are marked with the keyword &amp;quot;@param&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== Simple Use Case ===&lt;br /&gt;
The most basic steps are to set a password for JupyterLab, start a job which runs JupyterLab, get the connection details from the output log and connect to it locally. The following instructions explain these steps and provide an additional script that replaces the manual step of looking into the output file.&lt;br /&gt;
&lt;br /&gt;
# Load a python module and set a password on the cluster for JupyterLab:&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
    module load devel/miniforge&lt;br /&gt;
    jupyter notebook --generate-config&lt;br /&gt;
    jupyter notebook password&lt;br /&gt;
  &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Define a batch script to start a JupyterLab Job. Please adjust the first part according to your needs and your specific cluster.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;~/jupyterlab.slurm&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
#SBATCH --partition=cpu-single&lt;br /&gt;
#SBATCH --job-name=jupyterlab&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --cpus-per-task 1&lt;br /&gt;
#SBATCH --mail-user=&amp;lt;yourEmailAddress&amp;gt; # @param: enter your email address here&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
# @param: change this to your preferred python or conda module&lt;br /&gt;
module load devel/miniforge&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Pick a random port in the unprivileged range&lt;br /&gt;
PORT=$(( ( RANDOM % 9999 ) + 1024 ))&lt;br /&gt;
# Name of the compute node this job is running on&lt;br /&gt;
HOSTID=$(hostname)&lt;br /&gt;
echo &amp;quot;Connect&amp;quot;&lt;br /&gt;
echo &amp;quot;ssh -N -L ${PORT}:${HOSTID}:${PORT} ${USER}@$hostAddress&amp;quot;&lt;br /&gt;
echo &amp;quot;Job ${SLURM_JOB_ID} running on host ${HOSTID}.&amp;quot;&lt;br /&gt;
# jupyter lab blocks until the job ends, so it is started after the connection details have been printed&lt;br /&gt;
jupyter lab --no-browser --ip=0.0.0.0 --port=${PORT}&lt;br /&gt;
&lt;br /&gt;
returned_code=$?&lt;br /&gt;
echo &amp;quot;&amp;gt; Script completed with exit code ${returned_code}&amp;quot;&lt;br /&gt;
exit ${returned_code}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Run a wrapper script to execute the batch script and extract needed information from the slurm output file. You could save it together with other utility scripts in a &amp;quot;bin&amp;quot; directory in your home folder.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;./bin/run_jupyterlab_simple.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
# Define parameters&lt;br /&gt;
jobscript=~/jupyterlab.slurm&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch $jobscript | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Outfile name&lt;br /&gt;
slurm_out=slurm-${job_id}.out&lt;br /&gt;
&lt;br /&gt;
# Wait for the output file&lt;br /&gt;
while [ ! -f &amp;quot;$slurm_out&amp;quot; ]; do&lt;br /&gt;
    sleep 2&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until the URL has been written to the output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from output. The pattern assumes a node name with a length of 6 characters and a port with a length of 3, 4 or 5 numbers.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then &lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect with the JupyterLab kernel, please enter the following into your local commandline: &amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;; &lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would mean ending the local connection to the kernel.&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Afterwards, you can use the URL&amp;quot;&lt;br /&gt;
    echo &amp;quot;  http://127.0.0.1:${port}/lab &amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;to:&amp;quot;&lt;br /&gt;
    echo &amp;quot;- use the kernel in VSCode (&#039;Existing Jupyter Server...&#039;, enter URL, enter password, confirm &#039;127.0.0.1&#039;, choose kernel) or &amp;quot;&lt;br /&gt;
    echo &amp;quot;- open JupyterLab in your browser with the URL&amp;quot;&lt;br /&gt;
else&lt;br /&gt;
    echo &amp;quot;The needed information couldn&#039;t be found in the slurm output. Please contact your support unit if you need help with fixing this problem.&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
# rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Follow the instructions on the commandline to connect to the Jupyter kernel from your local machine or the Helix login node. More detailed instructions can be found below. &lt;br /&gt;
&lt;br /&gt;
==== Connect to a running job ====&lt;br /&gt;
&lt;br /&gt;
The job runs on a specific compute node and port. With this information, you can create an SSH connection to it. But first you need to decide how you want to work with your Python code. The options are: &lt;br /&gt;
&lt;br /&gt;
# The code is placed locally on your computer. &lt;br /&gt;
# The code is placed on the cluster and you&#039;ve mounted the folder locally (i.e. the files on the cluster are accessible from within your local VS Code).&lt;br /&gt;
# The code is placed on the cluster and you work on the cluster via a remote connection in VS Code. &lt;br /&gt;
&lt;br /&gt;
Depending on the use case, you need to execute the ssh command in a different place: &lt;br /&gt;
&lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer and connect to the cluster.&lt;br /&gt;
&lt;br /&gt;
Then open a terminal and execute the ssh command given in the commandline output of the wrapper script. If no terminal is open yet, go to the menu item &amp;quot;Terminal&amp;quot; at the top of the window and choose &amp;quot;New Terminal&amp;quot; (or &amp;quot;New&amp;quot; -&amp;gt; &amp;quot;Command Prompt&amp;quot; on Windows). &lt;br /&gt;
It is normal that the command doesn&#039;t terminate after you&#039;ve entered your credentials. Leave the terminal open and continue with the next step. &lt;br /&gt;
&lt;br /&gt;
To use the Jupyter kernel that is running on the cluster node, you need to connect to this kernel. This works like connecting any other kernel: &lt;br /&gt;
&lt;br /&gt;
# Open your code file.&lt;br /&gt;
# Click &amp;quot;Select Kernel&amp;quot; in the upper right corner. &lt;br /&gt;
# Choose &amp;quot;Existing Jupyter Server...&amp;quot;.&lt;br /&gt;
# Enter the URL that was given by the wrapper script. &lt;br /&gt;
# Enter your JupyterLab password that you set in the first step of these instructions.&lt;br /&gt;
# Confirm the prefilled value &amp;quot;127.0.0.1&amp;quot; by pressing Enter.&lt;br /&gt;
# Choose one of the virtual environments that you&#039;ve created on the cluster. You should see all python environments. To see the conda environments as well, you need to [[Helix/bwVisu/JupyterLab#Python_version | register them as ipykernel]] first. &lt;br /&gt;
&lt;br /&gt;
=== Complex Use Case ===&lt;br /&gt;
If you have different use cases for JupyterLab, you can use a more flexible wrapper script, for example: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;./bin/run_jupyterlab.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Starts a jupyter kernel on a node and provides information on how to connect to it locally.&lt;br /&gt;
# If you have only one use case and therefore need only one combination of slurm settings for your jupyter jobs, then you can use the simpler script.&lt;br /&gt;
# This script supports explorative analyses by allowing to overwrite parameters via commandline.&lt;br /&gt;
# Different job configurations can be defined in advance and then used with a given short name (cpu, gpu,...).&lt;br /&gt;
&lt;br /&gt;
programname=$0&lt;br /&gt;
function help {&lt;br /&gt;
    # Print help text&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Starts a jupyterlab kernel&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;usage example: $programname --param_set cpu&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --param_set string   name of the parameter set&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (examples: cpu, gpu)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --jobscript string   optional, path of batch script&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: ~/jupyterlab.slurm)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --slurm_out string   optional, name of slurm output file&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: slurm-&amp;lt;job_id&amp;gt;.out)&amp;quot;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# These parameters get their default values later in the script. Providing them on the command line overrides those defaults.&lt;br /&gt;
jobscript=None&lt;br /&gt;
slurm_out=None&lt;br /&gt;
&lt;br /&gt;
# Process parameters&lt;br /&gt;
while [ $# -gt 0 ]; do&lt;br /&gt;
    if [[ $1 == &amp;quot;--help&amp;quot; ]]; then&lt;br /&gt;
        help&lt;br /&gt;
        exit 0&lt;br /&gt;
    # when given -p as parameter, use its value for the variable param_set&lt;br /&gt;
    elif [[ $1 == &amp;quot;-p&amp;quot; ]]; then&lt;br /&gt;
        param_set=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    elif [[ $1 == &amp;quot;--&amp;quot;* ]]; then&lt;br /&gt;
        v=&amp;quot;${1/--/}&amp;quot;&lt;br /&gt;
        declare &amp;quot;$v&amp;quot;=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    fi&lt;br /&gt;
    shift&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
function define_param_set(){&lt;br /&gt;
    # Build the sbatch parameter array for the given set name&lt;br /&gt;
    # Define different sets&lt;br /&gt;
    cpu=(--partition=cpu-single --mem=2gb)&lt;br /&gt;
    gpu=(--partition=gpu-single --mem=3gb --gres=gpu:1)&lt;br /&gt;
&lt;br /&gt;
    param_set=${1}&lt;br /&gt;
    param_set=$param_set[@] &lt;br /&gt;
    param_set=(&amp;quot;${!param_set}&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
    # Add params that are the same for all sets&lt;br /&gt;
    param_set+=(--ntasks=1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# @param: jobscript, name of the slurm batch script to execute&lt;br /&gt;
if  [ &amp;quot;$jobscript&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    jobscript=~/jupyterlab.slurm&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Translate given param_set value to actual set of parameters &lt;br /&gt;
define_param_set $param_set&lt;br /&gt;
echo &amp;quot;param_set: ${param_set[*]}&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch ${param_set[@]} $jobscript | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# @param: slurm_out, the filename for the slurm output file&lt;br /&gt;
if  [ &amp;quot;$slurm_out&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    slurm_out=slurm-${job_id}.out&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# Wait for the output file to appear&lt;br /&gt;
while [ ! -f &amp;quot;$slurm_out&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until the URL is written to the output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from output.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then&lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect with the JupyterLab kernel, please enter the following into your local commandline: &amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;&lt;br /&gt;
    echo &amp;quot;Afterwards, you can either&amp;quot;&lt;br /&gt;
    echo &amp;quot;- use the kernel in VSCode or&amp;quot;&lt;br /&gt;
    echo &amp;quot;- open JupyterLab with this URL: &amp;quot;&lt;br /&gt;
    echo &amp;quot;  http://127.0.0.1:${port}/lab&amp;quot;&lt;br /&gt;
    echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would mean ending the local connection to the kernel.&amp;quot;&lt;br /&gt;
else&lt;br /&gt;
    echo &amp;quot;The needed information couldn&#039;t be found in the slurm output. Please contact your support unit if you need help with fixing this problem.&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
#rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15973</id>
		<title>Development/VS Code</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15973"/>
		<updated>2026-04-21T06:48:13Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
&lt;br /&gt;
[[File:vscode.png|thumb|Visual Studio Code, Source: https://code.visualstudio.com/|450px]]&lt;br /&gt;
&lt;br /&gt;
[https://github.com/Microsoft/vscode Visual Studio Code] (VS Code) is an open-source code editor from Microsoft. It has become one of the most popular IDEs according to a [https://survey.stackoverflow.co/2024/technology#1-integrated-development-environment Stack Overflow survey]. The functionality of VS Code can easily be extended by installing extensions, which provide almost arbitrary &#039;&#039;&#039;language support&#039;&#039;&#039;, &#039;&#039;&#039;debugging&#039;&#039;&#039; and &#039;&#039;&#039;remote development&#039;&#039;&#039; capabilities. You can install VS Code locally and use it for remote development.&lt;br /&gt;
&lt;br /&gt;
== VS Code extension: Remote - SSH ==&lt;br /&gt;
&lt;br /&gt;
In order to remotely develop and debug code at HPC facilities, you can use the [https://code.visualstudio.com/docs/remote/ssh &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; extension]. The extension connects your locally installed VS Code to remote servers. In contrast to running a graphical IDE inside a remote desktop session (RDP, VNC), there are no drawbacks such as laggy reactions to your input or blurred font rendering.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Installation and Configuration ===&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-extensions-button.png|vscode-extensions-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to install the Remote - SSH extension, just click on the Extensions button in the left side bar and enter “remote ssh” in the search field. Choose &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; from the results list and click on &#039;&#039;&#039;Install&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to configure remote connections, open the Remote Explorer. On Linux systems, the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; is evaluated automatically, and the targets defined in it appear in the left side bar.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-add.png|vscode-remoteexplorer-add.png|350px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If there are no remote ssh targets defined within this file, you can easily add one by clicking on the + symbol. Make sure that “SSH Targets” is active in the drop-down menu of the Remote Explorer. Enter the connection details &amp;lt;code&amp;gt;&amp;amp;lt;user&amp;amp;gt;@&amp;amp;lt;server&amp;amp;gt;&amp;lt;/code&amp;gt;. You will be asked whether the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; should be modified or whether another config file should be used or created.&lt;br /&gt;
&lt;br /&gt;
A minimal entry within &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; that makes a remote target appear there could look like this:&lt;br /&gt;
&lt;br /&gt;
 $ cat ~/.ssh/config&lt;br /&gt;
 Host uc3.scc.kit.edu&lt;br /&gt;
   HostName uc3.scc.kit.edu&lt;br /&gt;
   User xy_ab1234&lt;br /&gt;
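A slightly extended entry (a sketch; host and user are the same placeholders as above, and the keep-alive values are common choices rather than site requirements) can prevent idle connections from being dropped:

```
Host uc3.scc.kit.edu
  HostName uc3.scc.kit.edu
  User xy_ab1234
  ServerAliveInterval 60
  ServerAliveCountMax 3
```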
&lt;br /&gt;
=== Connect to Login Nodes ===&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to connect to a remote SSH target, open the Remote-Explorer. Right-click a target and connect in the current or a new window. TOTP and password can be entered in the corresponding input fields that open.&lt;br /&gt;
&lt;br /&gt;
You are now logged in on the remote server. As usual, you can open a project directory with the standard key binding Ctrl+k Ctrl+o. You can now edit and debug code.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention&#039;&#039;&#039;: Please remember that you are running and debugging the code on a login node. Do not perform resource-intensive tasks. Furthermore, no GPU resources are available to you.&lt;br /&gt;
&lt;br /&gt;
Extensions installed locally are only usable on your local machine and are not automatically installed remotely. However, as soon as you open the Extensions view during a remote session, VS Code offers to install your local extensions on the remote host.&lt;br /&gt;
&lt;br /&gt;
=== Disconnect from Login Nodes ===&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-indicator.png|images/vscode-remoteexplorer-indicator.png|200px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to end your remote session, click the green box in the lower left corner. In the input box that opens, select the “Close Remote Connection” option. If you simply close your VS Code window, some server-side components of VS Code will continue to run remotely.&lt;br /&gt;
&lt;br /&gt;
=== Access to Compute Nodes ===&lt;br /&gt;
&lt;br /&gt;
The workflow described above does not allow debugging on compute nodes that have been requested via an interactive Slurm job, for example. The security settings prevent the login node from being used as a proxy jump host. So there is no direct way to connect your locally installed VS Code to the compute nodes. Debugging GPU code is therefore also not possible, since this kind of resource is only accessible within Slurm jobs. Please have a look at the overview table below to see which solution to follow.&lt;br /&gt;
&lt;br /&gt;
== Code-Server ==&lt;br /&gt;
&lt;br /&gt;
The application [https://github.com/cdr/code-server code-server] allows you to run the server part of VS Code on any machine and access it in a web browser. This enables, for example, development and debugging on compute nodes.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:code-server.png|thumb|VS Code in web browser: code-server, Source: https://github.com/cdr/code-server|400px]]&lt;br /&gt;
&lt;br /&gt;
=== Install Code-Server ===&lt;br /&gt;
&lt;br /&gt;
From the following table you can see which instructions you need to follow to develop on a bwHPC cluster with VS Code.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;|Cluster&lt;br /&gt;
! Description&lt;br /&gt;
! Commands&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| bwUniCluster&lt;br /&gt;
| Setup with [[Development/VS_Code#code-server | Code Server]]&lt;br /&gt;
| &amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;module load devel/code-server&amp;lt;/source&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Other&lt;br /&gt;
| Setup with [[Development/VS_Code#Connect_to_Remote_Jupyter_Kernel | Jupyter kernel]] or [[Development/VS_Code#Install_Code-Server | install Code-Server]]&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
If no code-server module is provided, you can install it yourself. &lt;br /&gt;
# Download the latest release archive for your system from GitHub and unpack it.&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
    # Look up the version that you want to install: https://github.com/coder/code-server/releases&lt;br /&gt;
    VERSION=4.101.2&lt;br /&gt;
    mkdir -p ~/.local/lib ~/.local/bin&lt;br /&gt;
    curl -fL https://github.com/coder/code-server/releases/download/v$VERSION/code-server-$VERSION-linux-amd64.tar.gz \&lt;br /&gt;
    | tar -C ~/.local/lib -xz&lt;br /&gt;
  &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# You can run code-server by executing the binary via its full path, or create a symlink in &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt;, add that directory to your &amp;lt;code&amp;gt;$PATH&amp;lt;/code&amp;gt; and run it with &amp;quot;code-server&amp;quot;: &lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
    mv ~/.local/lib/code-server-$VERSION-linux-amd64 ~/.local/lib/code-server-$VERSION&lt;br /&gt;
    ln -s ~/.local/lib/code-server-$VERSION/bin/code-server ~/.local/bin/code-server&lt;br /&gt;
    # Add the following line in your ~/.bashrc&lt;br /&gt;
    export PATH=&amp;quot;$HOME/.local/bin:$PATH&amp;quot;  # note: ~ would not be expanded inside quotes&lt;br /&gt;
  &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Start Code-Server ===&lt;br /&gt;
&lt;br /&gt;
Code-server can be run on either login nodes or compute nodes. In the example shown, an interactive job is started on a GPU partition to run code-server there.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ salloc -p accelerated --gres=gpu:4 --time=30:00 # Start interactive job with 4 GPUs&lt;br /&gt;
$ module load devel/code-server                   # Load code-server module&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
When code-server is started, it opens a web server listening on a certain port. You have to &#039;&#039;&#039;specify the port&#039;&#039;&#039; yourself; it can be chosen freely in the unprivileged range (above 1024). If a port is already in use, e.g. because another user chose the same one, a different port must be selected.&lt;br /&gt;
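Picking a free port can also be automated. The following is a minimal sketch (it assumes the ss tool from iproute2 for the listening check and simply skips the check if ss is missing):

```shell
#!/bin/bash
# Sketch: pick a random unprivileged port, retrying while it is already in use.
pick_port() {
    local port
    while :; do
        port=$(( (RANDOM % 9999) + 1025 ))   # 1025..11023, all unprivileged
        # Accept the port if nothing is listening on it.
        if ! command -v ss 1>/dev/null 2>/dev/null || ! ss -tln | grep -q ":${port} "; then
            echo "${port}"
            return
        fi
    done
}
PORT=$(pick_port)
echo "Using port ${PORT}"
```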
&lt;br /&gt;
By starting code-server, you are running a web server that can be accessed by anyone logged in to the cluster. To prevent other people from gaining access to your account and data, this web server is &#039;&#039;&#039;password protected&#039;&#039;&#039;. If no variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; is defined, the password in the default config file &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; is used. If you want to define your own password, you can either change it in the config file or export the variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ PASSWORD=&amp;lt;mySecret&amp;gt; \&lt;br /&gt;
    code-server \&lt;br /&gt;
      --bind-addr 0.0.0.0:8081 \&lt;br /&gt;
      --auth password  # Start code-server on port 8081&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;background:#FFCCCC; width:100%;&amp;quot;&lt;br /&gt;
| &#039;&#039;&#039;Security implications&#039;&#039;&#039;&lt;br /&gt;
Please note that by starting &amp;lt;code&amp;gt;code-server&amp;lt;/code&amp;gt; you are running a web server that can be accessed by everyone logged in on the cluster.&amp;lt;br&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;If password protection is disabled, anybody can access your account and your data.&#039;&#039;&#039;&lt;br /&gt;
* Choose a &#039;&#039;&#039;secure password&#039;&#039;&#039;!&lt;br /&gt;
* Do &#039;&#039;&#039;NOT&#039;&#039;&#039; use &amp;lt;code&amp;gt;code-server --link&amp;lt;/code&amp;gt;!&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Connect to code-server ===&lt;br /&gt;
[[File:code-server-hk.png|thumb|Code-server running on GPU node.|400px]]&lt;br /&gt;
&lt;br /&gt;
As soon as code-server is running, it can be accessed in the web browser. In order to establish the connection, a SSH tunnel from your local computer to the remote server has to be created via:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ ssh -L 8081:&amp;lt;computeNodeID&amp;gt;:8081 &amp;lt;userID&amp;gt;@uc3.scc.kit.edu&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
You need to enter the &amp;lt;code&amp;gt;computeNodeID&amp;lt;/code&amp;gt; of the node on which the interactive Slurm job is running. If you have started code-server on a login node, just enter &amp;lt;code&amp;gt;localhost&amp;lt;/code&amp;gt;. Now you can open http://127.0.0.1:8081 in your web browser. You may have to allow your browser to open an insecure (non-https) site. The login page looks as follows:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:code-server-login.png|Code-server login page.|300px]]&lt;br /&gt;
&lt;br /&gt;
Enter the password from &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; or from the &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; variable. After clicking the “Submit” button, the familiar VS Code interface will open in your browser.&lt;br /&gt;
&lt;br /&gt;
=== End code-server session ===&lt;br /&gt;
&lt;br /&gt;
If you want to temporarily log out from your code-server session, you can open the “Application Menu” in the left side bar and click on “Log out”. To &#039;&#039;&#039;terminate&#039;&#039;&#039; the code-server session, cancel it in the interactive Slurm job by pressing Ctrl+C.&lt;br /&gt;
&lt;br /&gt;
== Connect to Remote Jupyter Kernel ==&lt;br /&gt;
To work with your Python scripts and notebooks within VS Code while using the resources of a compute node, you can create a batch job that launches JupyterLab and connect to it via VS Code. To do so, please follow the instructions below. Any parts of the scripts that might need adjustments are marked with the keyword &amp;quot;@param&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== Simple Use Case ===&lt;br /&gt;
The most basic steps are to set a password for JupyterLab, start a job which runs JupyterLab, get the connection details from the output log and connect to it locally. The following instructions explain these steps and provide an additional script that replaces the manual step of looking into the output file.&lt;br /&gt;
&lt;br /&gt;
# Load a python module and set a password on the cluster for JupyterLab:&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
    module load devel/miniforge&lt;br /&gt;
    jupyter notebook --generate-config&lt;br /&gt;
    jupyter notebook password&lt;br /&gt;
  &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Define a batch script to start a JupyterLab Job. Please adjust the first part according to your needs and your specific cluster.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;~/jupyterlab.slurm&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
#SBATCH --partition=cpu-single&lt;br /&gt;
#SBATCH --job-name=jupyterlab&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --cpus-per-task 1&lt;br /&gt;
#SBATCH --mail-user=my_email_address   # @param: replace my_email_address with your email address&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
# @param: change this to your preferred python or conda module&lt;br /&gt;
module load devel/miniforge&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
PORT=$(( ( RANDOM % 9999 ) + 1024 ))&lt;br /&gt;
HOSTID=$(hostname -s)&lt;br /&gt;
echo &amp;quot;Connect&amp;quot;&lt;br /&gt;
echo &amp;quot;ssh -N -L ${PORT}:${HOSTID}:${PORT} ${USER}@$hostAddress&amp;quot;&lt;br /&gt;
echo &amp;quot;Job ${SLURM_JOB_ID} running on host ${HOSTID}.&amp;quot;&lt;br /&gt;
# Print the connection details before starting JupyterLab, because the next command blocks until the job ends.&lt;br /&gt;
jupyter lab --no-browser --ip=0.0.0.0 --port=${PORT}&lt;br /&gt;
&lt;br /&gt;
returned_code=$?&lt;br /&gt;
echo &amp;quot;&amp;gt; Script completed with exit code ${returned_code}&amp;quot;&lt;br /&gt;
exit ${returned_code}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Run a wrapper script to execute the batch script and extract needed information from the slurm output file. You could save it together with other utility scripts in a &amp;quot;bin&amp;quot; directory in your home folder.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;./bin/run_jupyterlab_simple.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
# Define parameters&lt;br /&gt;
jobscript=~/jupyterlab.slurm&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch $jobscript | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Outfile name&lt;br /&gt;
slurm_out=slurm-${job_id}.out&lt;br /&gt;
&lt;br /&gt;
# Wait for the output file to appear&lt;br /&gt;
while [ ! -f &amp;quot;$slurm_out&amp;quot; ]; do&lt;br /&gt;
    sleep 2&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until the URL is written to the output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from output. The pattern assumes a node name with a length of 6 characters and a port with a length of 3, 4 or 5 numbers.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then &lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect with the JupyterLab kernel, please enter the following into your local commandline: &amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;; &lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would mean ending the local connection to the kernel.&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Afterwards, you can use the URL&amp;quot;&lt;br /&gt;
    echo &amp;quot;  http://127.0.0.1:${port}/lab &amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;to:&amp;quot;&lt;br /&gt;
    echo &amp;quot;- use the kernel in VSCode (&#039;Existing Jupyter Server...&#039;, enter URL, enter password, confirm &#039;127.0.0.1&#039;, choose kernel) or &amp;quot;&lt;br /&gt;
    echo &amp;quot;- open JupyterLab in your browser with the URL&amp;quot;&lt;br /&gt;
else&lt;br /&gt;
    echo &amp;quot;The needed information couldn&#039;t be found in the slurm output. Please contact your support unit if you need help with fixing this problem.&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
# rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Follow the instructions on the commandline to connect to the Jupyter kernel from your local machine or the Helix login node. More detailed instructions can be found below. &lt;br /&gt;
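The wrapper extracts the node name and port from the first URL in the Slurm log with a bash regex. Here is a standalone sketch of that step (the log line is a made-up example that matches the assumed pattern of a 6-character node name):

```shell
#!/bin/bash
# Sketch: parse host and port from a JupyterLab URL, as the wrapper does.
url="http://node01:8888/lab?token=abc"                # made-up example log line
url_pattern="http://([a-z0-9]{6}):([0-9]{3,5})/lab"   # same pattern as the wrapper
if [[ $url =~ $url_pattern ]]; then
    hostID=${BASH_REMATCH[1]}
    port=${BASH_REMATCH[2]}
fi
echo "$hostID $port"
```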
&lt;br /&gt;
==== Connect to a running job ====&lt;br /&gt;
&lt;br /&gt;
The job runs on a specific compute node and port. With this information, you can create an ssh connection to it. But first, you need to decide how you want to work with your Python code. The options are: &lt;br /&gt;
&lt;br /&gt;
# The code is placed locally on your computer. &lt;br /&gt;
# The code is placed on the cluster and you&#039;ve mounted the folder locally (i.e. the files on the cluster are accessible from within your local VS Code).&lt;br /&gt;
# The code is placed on the cluster and you work on the cluster via a remote connection in VS Code. &lt;br /&gt;
&lt;br /&gt;
Depending on the use case, you need to execute the ssh command in a different place: &lt;br /&gt;
&lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer and connect to the cluster.&lt;br /&gt;
&lt;br /&gt;
Then open a terminal and execute the ssh command given in the commandline output of the wrapper script. If no terminal is open yet, go to the &amp;quot;Terminal&amp;quot; menu at the top of the window and choose &amp;quot;New Terminal&amp;quot; (or &amp;quot;New -&amp;gt; Command Prompt&amp;quot; on Windows). &lt;br /&gt;
It is normal that the command does not return after you have entered your credentials. Leave the terminal open and continue with the next step. &lt;br /&gt;
&lt;br /&gt;
To use the Jupyter kernel that is running on the cluster node, you need to connect to this kernel. This works like connecting to any other kernel: &lt;br /&gt;
&lt;br /&gt;
# Open your code file.&lt;br /&gt;
# Click &amp;quot;Select Kernel&amp;quot; in the upper right corner. &lt;br /&gt;
# Choose &amp;quot;Existing Jupyter Server...&amp;quot;.&lt;br /&gt;
# Enter the URL that was given by the wrapper script. &lt;br /&gt;
# Enter your JupyterLab password that you set in the first step of these instructions.&lt;br /&gt;
# Confirm the prefilled value &amp;quot;127.0.0.1&amp;quot; by pressing Enter.&lt;br /&gt;
# Choose one of the virtual environments that you&#039;ve created on the cluster. You should see all Python environments. To see conda environments as well, you need to [[Helix/bwVisu/JupyterLab#Python_version | register them as ipykernel]] first. &lt;br /&gt;
&lt;br /&gt;
=== Complex Use Case ===&lt;br /&gt;
If you have different use cases for JupyterLab, you can use a more flexible wrapper script, for example: &lt;br /&gt;
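The script below maps a short set name (cpu, gpu, ...) to an array of sbatch flags via bash indirect expansion. Since that construct is easy to misread, here is a standalone sketch of just that trick (the flag values are taken from the script; the name gpu is hard-coded purely for illustration):

```shell
#!/bin/bash
# Standalone sketch of the indirect array expansion used in define_param_set:
# a short name selects one of several predefined sbatch flag arrays.
cpu=(--partition=cpu-single --mem=2gb)
gpu=(--partition=gpu-single --mem=3gb --gres=gpu:1)

name=gpu                 # would normally come from --param_set
ref="${name}[@]"         # build the indirect reference string "gpu[@]"
params=("${!ref}")       # expand it into a fresh array
params+=(--ntasks=1)     # flags shared by all sets
echo "${params[*]}"
```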
&lt;br /&gt;
&amp;lt;pre&amp;gt;./bin/run_jupyterlab.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Starts a jupyter kernel on a node and provides information on how to connect to it locally.&lt;br /&gt;
# If you have only one use case and therefore need only one combination of slurm settings for your jupyter jobs, then you can use the simpler script.&lt;br /&gt;
# This script supports exploratory analyses by allowing parameters to be overridden via the command line.&lt;br /&gt;
# Different job configurations can be defined in advance and then used with a given short name (cpu, gpu,...).&lt;br /&gt;
&lt;br /&gt;
programname=$0&lt;br /&gt;
function help {&lt;br /&gt;
    # Print help text&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Starts a jupyterlab kernel&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;usage example: $programname --param_set cpu&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --param_set string   name of the parameter set&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (examples: cpu, gpu)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --jobscript string   optional, path of batch script&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: ~/jupyterlab.slurm)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --slurm_out string   optional, name of slurm output file&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: slurm-&amp;lt;job_id&amp;gt;.out)&amp;quot;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# These parameters get their default values later in the script. Providing them on the command line overrides those defaults.&lt;br /&gt;
jobscript=None&lt;br /&gt;
slurm_out=None&lt;br /&gt;
&lt;br /&gt;
# Process parameters&lt;br /&gt;
while [ $# -gt 0 ]; do&lt;br /&gt;
    if [[ $1 == &amp;quot;--help&amp;quot; ]]; then&lt;br /&gt;
        help&lt;br /&gt;
        exit 0&lt;br /&gt;
    # when given -p as parameter, use its value for the variable param_set&lt;br /&gt;
    elif [[ $1 == &amp;quot;-p&amp;quot; ]]; then&lt;br /&gt;
        param_set=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    elif [[ $1 == &amp;quot;--&amp;quot;* ]]; then&lt;br /&gt;
        v=&amp;quot;${1/--/}&amp;quot;&lt;br /&gt;
        declare &amp;quot;$v&amp;quot;=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    fi&lt;br /&gt;
    shift&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
function define_param_set(){&lt;br /&gt;
    # Build the sbatch parameter array for the given set name&lt;br /&gt;
    # Define different sets&lt;br /&gt;
    cpu=(--partition=cpu-single --mem=2gb)&lt;br /&gt;
    gpu=(--partition=gpu-single --mem=3gb --gres=gpu:1)&lt;br /&gt;
&lt;br /&gt;
    param_set=${1}&lt;br /&gt;
    param_set=$param_set[@] &lt;br /&gt;
    param_set=(&amp;quot;${!param_set}&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
    # Add params that are the same for all sets&lt;br /&gt;
    param_set+=(--ntasks=1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# @param: jobscript, name of the slurm batch script to execute&lt;br /&gt;
if  [ &amp;quot;$jobscript&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    jobscript=~/jupyterlab.slurm&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Translate given param_set value to actual set of parameters &lt;br /&gt;
define_param_set $param_set&lt;br /&gt;
echo &amp;quot;param_set: ${param_set[*]}&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch ${param_set[@]} $jobscript | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# @param: slurm_out, the filename for the slurm output file&lt;br /&gt;
if  [ &amp;quot;$slurm_out&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    slurm_out=slurm-${job_id}.out&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# Wait for the output file to appear&lt;br /&gt;
while [ ! -f &amp;quot;$slurm_out&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until the URL is written to the output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from output.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then&lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect with the JupyterLab kernel, please enter the following into your local commandline: &amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;&lt;br /&gt;
    echo &amp;quot;Afterwards, you can either&amp;quot;&lt;br /&gt;
    echo &amp;quot;- use the kernel in VSCode or&amp;quot;&lt;br /&gt;
    echo &amp;quot;- open JupyterLab with this URL: &amp;quot;&lt;br /&gt;
    echo &amp;quot;  http://127.0.0.1:${port}/lab&amp;quot;&lt;br /&gt;
    echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would mean ending the local connection to the kernel.&amp;quot;&lt;br /&gt;
else&lt;br /&gt;
    echo &amp;quot;The needed information couldn&#039;t be found in the slurm output. Please contact your support unit if you need help with fixing this problem.&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
#rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15972</id>
		<title>Development/VS Code</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/VS_Code&amp;diff=15972"/>
		<updated>2026-04-21T06:38:30Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
&lt;br /&gt;
[[File:vscode.png|thumb|Visual Studio Code, Source: https://code.visualstudio.com/|450px]]&lt;br /&gt;
&lt;br /&gt;
[https://github.com/Microsoft/vscode Visual Studio Code] (VS Code) is an open-source code editor from Microsoft. According to a [https://survey.stackoverflow.co/2024/technology#1-integrated-development-environment Stack Overflow survey], it is one of the most popular IDEs. Its functionality can easily be extended by installing extensions, which add almost arbitrary &#039;&#039;&#039;language support&#039;&#039;&#039;, &#039;&#039;&#039;debugging&#039;&#039;&#039; or &#039;&#039;&#039;remote development&#039;&#039;&#039; capabilities. You can install VS Code locally and use it for remote development.&lt;br /&gt;
&lt;br /&gt;
== Remote - SSH ==&lt;br /&gt;
&lt;br /&gt;
In order to develop and debug code remotely at HPC facilities, you can use the [https://code.visualstudio.com/docs/remote/ssh &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; extension]. It connects your locally installed VS Code to the remote servers. In contrast to using graphical IDEs within a remote desktop session (RDP, VNC), there are no drawbacks such as laggy reactions to your input or blurred rendering of fonts.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Installation and Configuration ===&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-extensions-button.png|vscode-extensions-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to install the Remote - SSH extension, click on the Extensions button in the left side bar and enter “remote ssh” in the search field. Choose &#039;&#039;&#039;Remote - SSH&#039;&#039;&#039; from the resulting list and click on &#039;&#039;&#039;Install&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to configure remote connections, open the Remote-Explorer extension. On Linux systems, the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; is evaluated automatically, and the targets defined in it appear in the left side bar.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:vscode-remoteexplorer-add.png|vscode-remoteexplorer-add.png|350px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If no remote SSH targets are defined within this file, you can easily add one by clicking on the + symbol. Make sure that “SSH Targets” is selected in the drop-down menu of the Remote-Explorer. Enter the connection details &amp;lt;code&amp;gt;&amp;amp;lt;user&amp;amp;gt;@&amp;amp;lt;server&amp;amp;gt;&amp;lt;/code&amp;gt;. You will be asked whether the file &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; should be modified or whether another config file should be used or created.&lt;br /&gt;
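&lt;br /&gt;
Such an entry in &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; could, for example, look as follows (the host alias and the username are placeholders that must be adapted):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Example entry in ~/.ssh/config; adapt Host, HostName and User&lt;br /&gt;
Host uc3&lt;br /&gt;
    HostName uc3.scc.kit.edu&lt;br /&gt;
    User &amp;lt;prefix&amp;gt;_&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;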
&lt;br /&gt;
=== Connect to Login Nodes ===&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]]&amp;lt;br&amp;gt;&lt;br /&gt;
In order to connect to a remote SSH target, open the Remote-Explorer. Right-click a target and connect in the current or a new window. TOTP and password can be entered in the corresponding input fields that open.&lt;br /&gt;
&lt;br /&gt;
You are now logged in on the remote server. As usual, you can open a project directory with the standard key binding Ctrl+k Ctrl+o. You can now edit and debug code.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention&#039;&#039;&#039;: Please remember that you are running and debugging the code on a login node. Do not perform resource-intensive tasks. Furthermore, no GPU resources are available to you.&lt;br /&gt;
&lt;br /&gt;
Extensions that are installed locally are only usable on your local machine and are not automatically installed remotely. However, as soon as you open the Extensions-Explorer during a remote session, VS Code offers to install your locally installed extensions remotely.&lt;br /&gt;
&lt;br /&gt;
=== Disconnect from Login Nodes ===&lt;br /&gt;
&lt;br /&gt;
[[File:vscode-remoteexplorer-indicator.png|images/vscode-remoteexplorer-indicator.png|200px]]&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to end your remote session, click the green box in the lower left corner. In the input box that opens, select the “Close Remote Connection” option. If you simply close your VS Code window, some server-side components of VS Code will continue to run remotely.&lt;br /&gt;
&lt;br /&gt;
=== Access to Compute Nodes ===&lt;br /&gt;
&lt;br /&gt;
The workflow described above does not allow debugging on compute nodes that have been requested via an interactive Slurm job, for example. The security settings prevent the login node from being used as a proxy jump host, so there is no direct way to connect your locally installed VS Code to the compute nodes. Debugging GPU code is therefore also not possible, since this kind of resource is only accessible within Slurm jobs. Please have a look at the overview table in the first chapter to see which solution to follow.&lt;br /&gt;
&lt;br /&gt;
== Code-Server ==&lt;br /&gt;
&lt;br /&gt;
The application [https://github.com/cdr/code-server code-server] allows you to run the server part of VS Code on any machine; it can then be accessed in the web browser. This enables, for example, development and debugging on compute nodes.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:code-server.png|thumb|code-server.png|VS Code in web browser: code-server, Source: https://github.com/cdr/code-server|400px]]&lt;br /&gt;
&lt;br /&gt;
=== Install Code-Server ===&lt;br /&gt;
&lt;br /&gt;
From the following table you can see which instructions you need to follow to develop on a bwHPC cluster with VS Code.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;|Cluster&lt;br /&gt;
! Description&lt;br /&gt;
! Commands&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| bwUniCluster&lt;br /&gt;
| Setup with [[Development/VS_Code#code-server | Code Server]]&lt;br /&gt;
| &amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;module load devel/code-server&amp;lt;/source&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Other&lt;br /&gt;
| Setup with [[Development/VS_Code#Connect_to_Remote_Jupyter_Kernel | Jupyter kernel]] or [[Development/VS_Code#Install_Code-Server | install Code-Server]]&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
If no code-server module is provided, you can install it yourself. &lt;br /&gt;
# Download the latest release archive for your system from GitHub and unpack it.&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
    # Look up the version that you want to install: https://github.com/coder/code-server/releases&lt;br /&gt;
    VERSION=4.101.2&lt;br /&gt;
    mkdir -p ~/.local/lib ~/.local/bin&lt;br /&gt;
    curl -fL https://github.com/coder/code-server/releases/download/v$VERSION/code-server-$VERSION-linux-amd64.tar.gz \&lt;br /&gt;
    | tar -C ~/.local/lib -xz&lt;br /&gt;
  &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# You can run code-server by executing the binary via its full path, or add &amp;quot;~/.local/bin&amp;quot; to your $PATH and run it with &amp;quot;code-server&amp;quot; &lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
    mv ~/.local/lib/code-server-$VERSION-linux-amd64 ~/.local/lib/code-server-$VERSION&lt;br /&gt;
    ln -s ~/.local/lib/code-server-$VERSION/bin/code-server ~/.local/bin/code-server&lt;br /&gt;
    # Add the following line in your ~/.bashrc&lt;br /&gt;
    export PATH=&amp;quot;$HOME/.local/bin:$PATH&amp;quot;  # use $HOME, as ~ is not expanded inside quotes&lt;br /&gt;
  &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Start Code-Server ===&lt;br /&gt;
&lt;br /&gt;
Code-server can be run on either login nodes or compute nodes. In the example shown, an interactive job is started on a GPU partition to run code-server there.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ salloc -p accelerated --gres=gpu:4 --time=30:00 # Start interactive job with 4 GPUs&lt;br /&gt;
$ module load devel/code-server                   # Load code-server module&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
When code-server is started, it opens a web server listening on a certain port, which the user has to &#039;&#039;&#039;specify&#039;&#039;&#039;. The port can be chosen freely in the unprivileged range (above 1024). If a port is already in use, e.g. because several users chose the same port, another port must be chosen.&lt;br /&gt;
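&lt;br /&gt;
One way to check whether a port is already in use on the current node is the &amp;lt;code&amp;gt;ss&amp;lt;/code&amp;gt; tool (a sketch; the port number 8081 is an example):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# List listening TCP sockets and filter for the desired port; no output means the port is free&lt;br /&gt;
ss -tln | grep &#039;:8081 &#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;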
&lt;br /&gt;
By starting code-server, you are running a web server that can be accessed by anyone logged in to the cluster. To prevent other people from gaining access to your account and data, this web server is &#039;&#039;&#039;password protected&#039;&#039;&#039;. If no variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; is defined, the password in the default config file &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; is used. If you want to define your own password, you can either change it in the config file or export the variable &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt;.&lt;br /&gt;
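&lt;br /&gt;
A config file with a custom password could, for example, look as follows (the values shown are placeholders, not defaults):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;yaml&amp;quot;&amp;gt;&lt;br /&gt;
# ~/.config/code-server/config.yaml (example values)&lt;br /&gt;
bind-addr: 0.0.0.0:8081&lt;br /&gt;
auth: password&lt;br /&gt;
password: mySecret&lt;br /&gt;
cert: false&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;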
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ PASSWORD=&amp;lt;mySecret&amp;gt; \&lt;br /&gt;
    code-server \&lt;br /&gt;
      --bind-addr 0.0.0.0:8081 \&lt;br /&gt;
      --auth password  # Start code-server on port 8081&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;background:#FFCCCC; width:100%;&amp;quot;&lt;br /&gt;
| &#039;&#039;&#039;Security implications&#039;&#039;&#039;&lt;br /&gt;
Please note that by starting &amp;lt;code&amp;gt;code-server&amp;lt;/code&amp;gt; you are running a web server that can be accessed by everyone logged in on the cluster.&amp;lt;br&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;If password protection is disabled, anybody can access your account and your data.&#039;&#039;&#039;&lt;br /&gt;
* Choose a &#039;&#039;&#039;secure password&#039;&#039;&#039;!&lt;br /&gt;
* Do &#039;&#039;&#039;NOT&#039;&#039;&#039; use &amp;lt;code&amp;gt;code-server --link&amp;lt;/code&amp;gt;!&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Connect to code-server ===&lt;br /&gt;
[[File:code-server-hk.png|thumb|Code-server running on GPU node.|400px]]&lt;br /&gt;
&lt;br /&gt;
As soon as code-server is running, it can be accessed in the web browser. In order to establish the connection, a SSH tunnel from your local computer to the remote server has to be created via:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;console&amp;quot;&amp;gt;$ ssh -L 8081:&amp;lt;computeNodeID&amp;gt;:8081 &amp;lt;userID&amp;gt;@uc3.scc.kit.edu&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
You need to enter the &amp;lt;code&amp;gt;computeNodeID&amp;lt;/code&amp;gt; of the node on which the interactive Slurm job is running. If you have started code server on a login node, just enter &amp;lt;code&amp;gt;localhost&amp;lt;/code&amp;gt;. Now you can open http://127.0.0.1:8081 in your web browser. Possibly, you have to allow your browser to open an insecure (non-https) site. The login site looks as follows:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:code-server-login.png|Code-server login page.|300px]]&lt;br /&gt;
&lt;br /&gt;
Enter the password from &amp;lt;code&amp;gt;~/.config/code-server/config.yaml&amp;lt;/code&amp;gt; or from the &amp;lt;code&amp;gt;PASSWORD&amp;lt;/code&amp;gt; variable. After clicking the “Submit” button, the familiar VS Code interface will open in your browser.&lt;br /&gt;
&lt;br /&gt;
=== End code-server session ===&lt;br /&gt;
&lt;br /&gt;
If you want to temporarily log out from your code-server session, you can open the “Application Menu” in the left side bar and click on “Log out”. To &#039;&#039;&#039;terminate&#039;&#039;&#039; the code-server session, you have to cancel it in the interactive Slurm job by pressing Ctrl+C.&lt;br /&gt;
&lt;br /&gt;
== Connect to Remote Jupyter Kernel ==&lt;br /&gt;
To work with your Python scripts and notebooks within VS Code while using the resources of a compute node, you can create a batch job that launches JupyterLab and connect to it via VS Code. To do so, please follow the instructions below. Any parts of the scripts that might need adjustment are marked with the keyword &amp;quot;@param&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== Simple Use Case ===&lt;br /&gt;
The most basic steps are to set a password for JupyterLab, start a job which runs JupyterLab, get the connection details from the output log and connect to it locally. The following instructions explain these steps and provide an additional script that replaces the manual step of looking into the output file.&lt;br /&gt;
&lt;br /&gt;
# Load a python module and set a password on the cluster for JupyterLab:&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
    module load devel/miniforge&lt;br /&gt;
    jupyter notebook --generate-config&lt;br /&gt;
    jupyter notebook password&lt;br /&gt;
  &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Define a batch script to start a JupyterLab Job. Please adjust the first part according to your needs and your specific cluster.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;~/jupyterlab.slurm&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
#SBATCH --partition=cpu-single&lt;br /&gt;
#SBATCH --job-name=jupyterlab&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --cpus-per-task 1&lt;br /&gt;
#SBATCH --mail-user=my_email_address # @param: replace my_email_address with your email address&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
# @param: change this to your preferred python or conda module&lt;br /&gt;
module load devel/miniforge&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Pick a random port in the unprivileged range (1024-11022)&lt;br /&gt;
PORT=$(( ( RANDOM % 9999 ) + 1024 ))&lt;br /&gt;
HOSTID=$(hostname -s)&lt;br /&gt;
echo &amp;quot;Connect&amp;quot;&lt;br /&gt;
echo &amp;quot;ssh -N -L ${PORT}:${HOSTID}:${PORT} ${USER}@$hostAddress&amp;quot;&lt;br /&gt;
echo &amp;quot;Job ${SLURM_JOB_ID} running on host ${HOSTID}.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# jupyter lab blocks until the job ends, therefore the connection details are printed first&lt;br /&gt;
jupyter lab --no-browser --ip=0.0.0.0 --port=${PORT}&lt;br /&gt;
&lt;br /&gt;
returned_code=$?&lt;br /&gt;
echo &amp;quot;&amp;gt; Script completed with exit code ${returned_code}&amp;quot;&lt;br /&gt;
exit ${returned_code}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Run a wrapper script to execute the batch script and extract needed information from the slurm output file. You could save it together with other utility scripts in a &amp;quot;bin&amp;quot; directory in your home folder.&lt;br /&gt;
#: &amp;lt;pre&amp;gt;./bin/run_jupyterlab_simple.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
#: &amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
# Define parameters&lt;br /&gt;
jobscript=~/jupyterlab.slurm&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch $jobscript | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Outfile name&lt;br /&gt;
slurm_out=slurm-${job_id}.out&lt;br /&gt;
&lt;br /&gt;
# Wait for output file&lt;br /&gt;
while [ ! -f &amp;quot;$slurm_out&amp;quot; ]; do&lt;br /&gt;
    sleep 2&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until url is written in output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from the output. The pattern assumes a node name with a length of 6 characters and a port with 3 to 5 digits.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then &lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect with the JupyterLab kernel, please enter the following into your local commandline: &amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;; &lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would mean ending the local connection to the kernel.&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Afterwards, you can use the URL&amp;quot;&lt;br /&gt;
    echo &amp;quot;  http://127.0.0.1:${port}/lab &amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;to:&amp;quot;&lt;br /&gt;
    echo &amp;quot;- use the kernel in VSCode (&#039;Existing Jupyter Server...&#039;, enter URL, enter password, confirm &#039;127.0.0.1&#039;, choose kernel) or &amp;quot;&lt;br /&gt;
    echo &amp;quot;- open JupyterLab in your browser with the URL&amp;quot;&lt;br /&gt;
else&lt;br /&gt;
    echo &amp;quot;The needed information couldn&#039;t be found in the slurm output. Please contact your support unit if you need help with fixing this problem.&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
# rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
# Follow the instructions on the commandline to connect to the Jupyter kernel from your local machine or the Helix login node. More detailed instructions can be found below. &lt;br /&gt;
&lt;br /&gt;
==== Connect to a running job ====&lt;br /&gt;
&lt;br /&gt;
The job runs on a specific compute node and port. With this information, you can create an ssh connection to it. But first, you need to decide how you want to work with your Python code. The options are: &lt;br /&gt;
&lt;br /&gt;
# The code is placed locally on your computer. &lt;br /&gt;
# The code is placed on the cluster and you&#039;ve mounted the folder locally. (= The files on the cluster are accessible from within your local VS Code)&lt;br /&gt;
# The code is placed on the cluster and you work on the cluster via a remote connection in VS Code. &lt;br /&gt;
&lt;br /&gt;
Depending on the use case, you need to execute the ssh command in a different place: &lt;br /&gt;
&lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer. &lt;br /&gt;
# Open VS Code on your computer and connect to the cluster.&lt;br /&gt;
&lt;br /&gt;
Then open a terminal and execute the ssh command given in the command-line output of the wrapper script. If no terminal is open yet, go to the menu item &amp;quot;Terminal&amp;quot; at the top of the window and choose &amp;quot;New Terminal&amp;quot; (or &amp;quot;New -&amp;gt; Command Prompt&amp;quot; on Windows). &lt;br /&gt;
It is normal that the command doesn&#039;t end after you&#039;ve entered your credentials. Leave the terminal open and continue with the next step. &lt;br /&gt;
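&lt;br /&gt;
The printed ssh command has the following shape (the node name &amp;quot;node01&amp;quot; and port &amp;quot;8888&amp;quot; are hypothetical example values):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Forward local port 8888 to port 8888 on compute node node01 (example values)&lt;br /&gt;
ssh -N -L 8888:node01:8888 &amp;lt;username&amp;gt;@helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;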
&lt;br /&gt;
To use the Jupyter kernel that is running on the cluster node, you need to connect to this kernel. This is similar to connecting any other kernel: &lt;br /&gt;
&lt;br /&gt;
# Open your code file.&lt;br /&gt;
# Click &amp;quot;Select Kernel&amp;quot; in the upper right corner. &lt;br /&gt;
# Choose &amp;quot;Existing Jupyter Server...&amp;quot;.&lt;br /&gt;
# Enter the URL that was given by the wrapper script. &lt;br /&gt;
# Enter your JupyterLab password that you set in the first step of these instructions.&lt;br /&gt;
# Confirm the prefilled value &amp;quot;127.0.0.1&amp;quot; by pressing Enter.&lt;br /&gt;
# Choose one of the virtual environments that you&#039;ve created on the cluster. You should see all python environments. To see the conda environments as well, you need to [[Helix/bwVisu/JupyterLab#Python_version | register them as ipykernel]] first. &lt;br /&gt;
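&lt;br /&gt;
Registering a conda environment as an ipykernel can be sketched as follows (&amp;quot;myenv&amp;quot; is a hypothetical environment name):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Activate the environment and register it as a Jupyter kernel (myenv is an example name)&lt;br /&gt;
conda activate myenv&lt;br /&gt;
pip install ipykernel  # only needed if ipykernel is not yet installed&lt;br /&gt;
python -m ipykernel install --user --name myenv --display-name &amp;quot;Python (myenv)&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;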
&lt;br /&gt;
=== Complex Use Case ===&lt;br /&gt;
If you have different use cases for JupyterLab, you could use a more flexible wrapper script, for example: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;./bin/run_jupyterlab.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Starts a jupyter kernel on a node and provides information on how to connect to it locally.&lt;br /&gt;
# If you have only one use case and therefore need only one combination of slurm settings for your jupyter jobs, then you can use the simpler script.&lt;br /&gt;
# This script supports explorative analyses by allowing parameters to be overwritten via the command line.&lt;br /&gt;
# Different job configurations can be defined in advance and then used with a given short name (cpu, gpu,...).&lt;br /&gt;
&lt;br /&gt;
programname=$0&lt;br /&gt;
function help {&lt;br /&gt;
    # Print usage information&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;Starts a jupyterlab kernel&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;usage example: $programname --param_set cpu&amp;quot;&lt;br /&gt;
    echo &amp;quot;&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --param_set string   name of the parameter set&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (examples: cpu, gpu)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --jobscript string   optional, path of batch script&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: ~/jupyterlab.slurm)&amp;quot;&lt;br /&gt;
    echo &amp;quot;  --slurm_out string   optional, name of slurm output file&amp;quot;&lt;br /&gt;
    echo &amp;quot;                          (default: slurm-&amp;lt;jobid&amp;gt;.out)&amp;quot;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# These parameters are set later in the script. Providing them via the command line overwrites the values set in the script.&lt;br /&gt;
jobscript=None&lt;br /&gt;
slurm_out=None&lt;br /&gt;
&lt;br /&gt;
# Process parameters&lt;br /&gt;
while [ $# -gt 0 ]; do&lt;br /&gt;
    if [[ $1 == &amp;quot;--help&amp;quot; ]]; then&lt;br /&gt;
        help&lt;br /&gt;
        exit 0&lt;br /&gt;
    # when given -p as parameter, use its value for the variable param_set&lt;br /&gt;
    elif [[ $1 == &amp;quot;-p&amp;quot; ]]; then&lt;br /&gt;
        param_set=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    elif [[ $1 == &amp;quot;--&amp;quot;* ]]; then&lt;br /&gt;
        v=&amp;quot;${1/--/}&amp;quot;&lt;br /&gt;
        declare &amp;quot;$v&amp;quot;=&amp;quot;$2&amp;quot;&lt;br /&gt;
        shift&lt;br /&gt;
    fi&lt;br /&gt;
    shift&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
function define_param_set(){&lt;br /&gt;
    # Define parameter sets for sbatch&lt;br /&gt;
    # Define different sets&lt;br /&gt;
    cpu=(--partition=cpu-single --mem=2gb)&lt;br /&gt;
    gpu=(--partition=gpu-single --mem=3gb --gres=gpu:1)&lt;br /&gt;
&lt;br /&gt;
    param_set=${1}&lt;br /&gt;
    param_set=$param_set[@] &lt;br /&gt;
    param_set=(&amp;quot;${!param_set}&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
    # Add params that are the same for all sets&lt;br /&gt;
    param_set+=(--ntasks=1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# @param: jobscript, name of the slurm batch script to execute&lt;br /&gt;
if  [ &amp;quot;$jobscript&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    jobscript=~/jupyterlab.slurm&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# @param: cluster address for ssh connection&lt;br /&gt;
hostAddress=helix.bwservices.uni-heidelberg.de&lt;br /&gt;
&lt;br /&gt;
# Translate given param_set value to actual set of parameters &lt;br /&gt;
define_param_set $param_set&lt;br /&gt;
echo &amp;quot;param_set: ${param_set[*]}&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Run job&lt;br /&gt;
job_id=$(sbatch &amp;quot;${param_set[@]}&amp;quot; $jobscript | awk &#039;{print $4}&#039;)&lt;br /&gt;
echo &amp;quot;jobid: $job_id&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# @param: slurm_out, the filename for the slurm output file&lt;br /&gt;
if  [ &amp;quot;$slurm_out&amp;quot; = &amp;quot;None&amp;quot; ]; then&lt;br /&gt;
    slurm_out=slurm-${job_id}.out&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
# Wait for output file&lt;br /&gt;
while [ ! -f &amp;quot;$slurm_out&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Wait until url is written in output file&lt;br /&gt;
while [ -z &amp;quot;${url}&amp;quot; ]; do&lt;br /&gt;
    sleep 1&lt;br /&gt;
    url=$(grep -o &#039;http[^ ]*&#039; &amp;quot;$slurm_out&amp;quot; | head -n 1)&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
# Extract hostID and port from the output. The pattern assumes a node name of 6 characters and a port of 3 to 5 digits.&lt;br /&gt;
url_pattern=&amp;quot;http://([a-z0-9]{6}):([0-9]{3,5})/lab&amp;quot;&lt;br /&gt;
if [[ $url =~ $url_pattern ]]; then&lt;br /&gt;
    hostID=${BASH_REMATCH[1]}&lt;br /&gt;
    port=${BASH_REMATCH[2]}&lt;br /&gt;
    echo &amp;quot;To connect to the JupyterLab kernel, please enter the following into your local command line:&amp;quot;&lt;br /&gt;
    echo &amp;quot;ssh -N -L $port:$hostID:$port ${USER}@$hostAddress&amp;quot;&lt;br /&gt;
    echo &amp;quot;Afterwards, you can either&amp;quot;&lt;br /&gt;
    echo &amp;quot;- use the kernel in VS Code or&amp;quot;&lt;br /&gt;
    echo &amp;quot;- open JupyterLab with this URL:&amp;quot;&lt;br /&gt;
    echo &amp;quot;  http://127.0.0.1:${port}/lab&amp;quot;&lt;br /&gt;
    echo &amp;quot;Note: It is normal that the ssh command doesn&#039;t end after providing the credentials. Ending the command would end the local connection to the kernel.&amp;quot;&lt;br /&gt;
else&lt;br /&gt;
    echo &amp;quot;The connection details couldn&#039;t be found in the slurm output.&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
#rm $slurm_out&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Login&amp;diff=15854</id>
		<title>BwUniCluster3.0/Login</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Login&amp;diff=15854"/>
		<updated>2026-03-20T05:55:00Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Login with SSH command (Linux, Mac, Windows) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Access to bwUniCluster 3.0 is &#039;&#039;&#039;limited to IP addresses from the BelWü network&#039;&#039;&#039;.&lt;br /&gt;
All home institutions of our current users are connected to BelWü, so if you are on your campus network (e.g. in your office or on the Campus WiFi) you should be able to connect to bwUniCluster 3.0 without restrictions.&lt;br /&gt;
If you are outside one of the BelWü networks (e.g. at home), a VPN connection to the home institution or a connection to an SSH jump host at the home institution must be established first.&lt;br /&gt;
|}&lt;br /&gt;
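&lt;br /&gt;
If your home institution provides an SSH jump host, the connection can, for example, be established as follows (the jump host name is a placeholder that must be replaced by the one of your institution):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Connect via an institutional jump host (jumphost.example.edu is a placeholder)&lt;br /&gt;
ssh -J &amp;lt;user&amp;gt;@jumphost.example.edu &amp;lt;prefix&amp;gt;_&amp;lt;username&amp;gt;@uc3.scc.kit.edu&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;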
&lt;br /&gt;
The login nodes of the bwHPC clusters are the access point to the compute system, your &amp;lt;code&amp;gt;$HOME&amp;lt;/code&amp;gt; directory and your workspaces.&lt;br /&gt;
All users must log in through these nodes to submit jobs to the cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Prerequisites for successful login:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You need to have&lt;br /&gt;
# Completed the 3-step [[registration|&#039;&#039;&#039;registration&#039;&#039;&#039;]] procedure.&lt;br /&gt;
# Set a [[Registration/Password|&#039;&#039;&#039;service password&#039;&#039;&#039;]] for bwUniCluster 3.0.&lt;br /&gt;
# Set up a [[Registration/2FA|&#039;&#039;&#039;second factor&#039;&#039;&#039;]] for the time-based one-time password (TOTP).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Login to the bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
Login to the bwUniCluster 3.0 is only possible with a Secure Shell (SSH) client for which you must know your username on the cluster and the hostname of the login nodes.&lt;br /&gt;
For more general information on SSH clients, visit the [[BwUniCluster3.0/Login/Client|SSH Clients Guide]].&lt;br /&gt;
&lt;br /&gt;
== Username ==&lt;br /&gt;
&lt;br /&gt;
If you want to use the bwUniCluster 3.0 you need to add a prefix to your local username.&lt;br /&gt;
&lt;br /&gt;
For prefixes please refer to the [[Registration/Login/Username#Prefix_for_Universities|prefix table]].&lt;br /&gt;
&lt;br /&gt;
Examples:&amp;lt;br/&amp;gt;&lt;br /&gt;
* If you are a user from the University of Freiburg and your local username is &amp;lt;code&amp;gt;ab123&amp;lt;/code&amp;gt;, this combines to: &amp;lt;code&amp;gt;fr_ab123&amp;lt;/code&amp;gt;.&lt;br /&gt;
* If your KIT username is &amp;lt;code&amp;gt;ab1234&amp;lt;/code&amp;gt; and you are a user from KIT this would combine to: &amp;lt;code&amp;gt;ka_ab1234&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Hostnames ==&lt;br /&gt;
&lt;br /&gt;
The system has two login nodes.&lt;br /&gt;
The selection of the login node is done automatically.&lt;br /&gt;
If you are logging in multiple times, different sessions might run on different login nodes.&lt;br /&gt;
&lt;br /&gt;
Login to bwUniCluster 3.0:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Hostname !! Node type&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;uc3.scc.kit.edu&#039;&#039;&#039;          || login to one of the two login nodes&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;bwunicluster.scc.kit.edu&#039;&#039;&#039; || login to one of the two login nodes&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
With the launch of bwUniCluster 3.0, &#039;&#039;&#039;bwunicluster.scc.kit.edu&#039;&#039;&#039; no longer points to &#039;&#039;&#039;uc2.scc.kit.edu&#039;&#039;&#039; but to &#039;&#039;&#039;uc3.scc.kit.edu&#039;&#039;&#039;. In order to remove the warnings from your SSH client, you can delete the old host key as follows: &amp;lt;code&amp;gt;ssh-keygen -R bwunicluster.scc.kit.edu&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
For the sake of simplicity, &#039;&#039;&#039;we recommend using uc3.scc.kit.edu as the server address&#039;&#039;&#039;: &amp;lt;code&amp;gt;ssh prefix_&amp;lt;username&amp;gt;@uc3.scc.kit.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Until 06.07.2025, login to bwUniCluster 2.0 remains possible analogously via &amp;lt;code&amp;gt;ssh &amp;lt;username&amp;gt;@uc2.scc.kit.edu&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In general, you should use automatic selection to allow us to balance the load over the two login nodes.&lt;br /&gt;
If you need to connect to specific login nodes, you can use the following hostnames:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Hostname !! Node type&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;uc3-login1.scc.kit.edu&#039;&#039;&#039; || bwUniCluster 3.0 first login node&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;uc3-login2.scc.kit.edu&#039;&#039;&#039; || bwUniCluster 3.0 second login node&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Host Keys ==&lt;br /&gt;
&lt;br /&gt;
When you log in, you may receive the message &amp;lt;code&amp;gt;The authenticity of host &#039;&amp;lt;host address&amp;gt;&#039; can&#039;t be established.&amp;lt;/code&amp;gt; along with the host key fingerprint. This is intended so you can verify the authenticity of the host you are connecting to. Before you continue, you should verify that this fingerprint matches one of the following:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Algorithm !! Fingerprint (SHA256)&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;RSA&#039;&#039;&#039; || SHA256:RaE0/tqQMMBmJuDCIo3WZ38YJsz0godVyt6aUOk/E0M&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;ECDSA&#039;&#039;&#039; || SHA256:LjBYL/x86ZAlL0JdlXrCmPYXvS3DaSiMuvycojBMdwQ&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;ED25519&#039;&#039;&#039; || SHA256:5mZYEpKigwK5ibBMHRrh3WIkOtCqomJW6H7OMbPk3ec&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
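&lt;br /&gt;
You can also display the fingerprints of a host before logging in, for example with the OpenSSH tools &amp;lt;code&amp;gt;ssh-keyscan&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;ssh-keygen&amp;lt;/code&amp;gt; (assuming both are installed on your client):&lt;br /&gt;
&lt;br /&gt;
 ssh-keyscan uc3.scc.kit.edu 2&amp;gt;/dev/null | ssh-keygen -lf -&lt;br /&gt;
&lt;br /&gt;
Only continue with the login if the printed SHA256 values match the table above.&lt;br /&gt;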
&lt;br /&gt;
== Login with SSH command (Linux, Mac, Windows) ==&lt;br /&gt;
&lt;br /&gt;
Linux, macOS, other Unix-like operating systems, and Microsoft Windows come with a built-in SSH client, most likely provided by the OpenSSH project.&lt;br /&gt;
&lt;br /&gt;
For login, use one of the following ssh commands:&lt;br /&gt;
&lt;br /&gt;
 ssh -l &amp;lt;username&amp;gt; uc3.scc.kit.edu&lt;br /&gt;
 ssh &amp;lt;username&amp;gt;@bwunicluster.scc.kit.edu&lt;br /&gt;
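&lt;br /&gt;
To avoid typing the full hostname and user name on every login, you can add an entry to your client&#039;s &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; (the alias &amp;lt;code&amp;gt;uc3&amp;lt;/code&amp;gt; is only an example):&lt;br /&gt;
&lt;br /&gt;
 Host uc3&lt;br /&gt;
     HostName uc3.scc.kit.edu&lt;br /&gt;
     User &amp;lt;username&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Afterwards, &amp;lt;code&amp;gt;ssh uc3&amp;lt;/code&amp;gt; is sufficient to log in.&lt;br /&gt;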
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
To run graphical applications, you can use the &amp;lt;code&amp;gt;-X&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;-Y&amp;lt;/code&amp;gt; flag to &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 ssh -Y -l &amp;lt;username&amp;gt; bwunicluster.scc.kit.edu&lt;br /&gt;
&lt;br /&gt;
For better performance, we recommend using [[VNC]].&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Login with graphical SSH client (Windows) ==&lt;br /&gt;
&lt;br /&gt;
For Windows we suggest using  [[Data_Transfer/Graphical_Clients#MobaXterm|MobaXterm]] for login and file transfer.&lt;br /&gt;
 &lt;br /&gt;
Start &#039;&#039;MobaXterm&#039;&#039; and fill in the following fields:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Remote name              : uc3.scc.kit.edu    # or bwunicluster.scc.kit.edu&lt;br /&gt;
Specify user name        : &amp;lt;username&amp;gt;&lt;br /&gt;
Port                     : 22&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After that, click on &#039;OK&#039;. A terminal will open in which you can enter your credentials.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; When using file transfer with MobaXterm version 23.6, the following configuration change is required:&lt;br /&gt;
In the settings, in the &amp;quot;SSH&amp;quot; tab, change the &amp;quot;SSH engine&amp;quot; option from &amp;quot;&amp;lt;new&amp;gt;&amp;quot; to &amp;quot;&amp;lt;legacy&amp;gt;&amp;quot;, then restart MobaXterm.&lt;br /&gt;
&lt;br /&gt;
== Login with Jupyterhub ==&lt;br /&gt;
&lt;br /&gt;
Login takes place at:&lt;br /&gt;
* bwUniCluster 3.0: [https://uc3-jupyter.scc.kit.edu uc3-jupyter.scc.kit.edu]&lt;br /&gt;
* SDIL: [https://sdil-jupyter.scc.kit.edu sdil-jupyter.scc.kit.edu]&lt;br /&gt;
&lt;br /&gt;
More information can be found [[BwUniCluster3.0/Jupyter#Login_process|here]].&lt;br /&gt;
&lt;br /&gt;
== Login Example ==&lt;br /&gt;
&lt;br /&gt;
To log in to bwUniCluster 3.0, you must provide your [[Registration/Password|service password]].&lt;br /&gt;
Proceed as follows:&lt;br /&gt;
# Connect to a login node via SSH.&lt;br /&gt;
# The system will ask for a one-time password &amp;lt;code&amp;gt;Your OTP:&amp;lt;/code&amp;gt;. Please enter your OTP and confirm it with Enter/Return. If you do not have a second factor yet, please create one (see [[Registration/2FA]]).&lt;br /&gt;
# The system will ask you for your service password &amp;lt;code&amp;gt;Password:&amp;lt;/code&amp;gt;. Please enter it and confirm it with Enter/Return. If you do not have a service password yet or have forgotten it, please create one (see [[Registration/Password]]).&lt;br /&gt;
# You will be greeted by the cluster, followed by a shell.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[user@client ~]$ ssh ka_ab1234@uc3.scc.kit.edu&lt;br /&gt;
(ka_ab1234@uc3.scc.kit.edu) Your OTP: cccccctlljdbrjdleujigivvfnkjbucudugjjlutfbrk&lt;br /&gt;
(ka_ab1234@uc3.scc.kit.edu) Password: &lt;br /&gt;
********************************************************************************&lt;br /&gt;
*                                                                              *&lt;br /&gt;
*                   Karlsruher Institut für Technologie (KIT)                  *&lt;br /&gt;
*                                                                              *&lt;br /&gt;
*                       Scientific Computing Center (SCC)                      *&lt;br /&gt;
*                                                                              *&lt;br /&gt;
*                            _    _    _____   ____                            *&lt;br /&gt;
*                           | |  | |  / ____| |___ \                           *&lt;br /&gt;
*                           | |  | | | |        __) |                          *&lt;br /&gt;
*                           | |  | | | |       |__ &amp;lt;                           *&lt;br /&gt;
*                           | |__| | | |____   ___) |                          *&lt;br /&gt;
*                            \____/   \_____| |____/                           *&lt;br /&gt;
*                                                                              *&lt;br /&gt;
*                                                                              *&lt;br /&gt;
*                  (KITE 2.0, RHEL 9.4, Lustre 2.14.0_ddn154)                  *&lt;br /&gt;
*                                                                              *&lt;br /&gt;
*                                                                              *&lt;br /&gt;
********************************************************************************&lt;br /&gt;
Last login: Wed Feb 26 11:08:20 2025 from 2a00:1398:4:181c:2be1:437b:1c36:1337&lt;br /&gt;
&lt;br /&gt;
[ka_ab1234@uc3n990 ~]$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Troubleshooting ==&lt;br /&gt;
&lt;br /&gt;
See [[BwUniCluster3.0/FAQ#Login|bwUniCluster FAQ]].&lt;br /&gt;
&lt;br /&gt;
= Allowed Activities on Login Nodes =&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
To guarantee usability for all users of the clusters, you must not run your compute jobs on the login nodes.&lt;br /&gt;
Compute jobs must be submitted to the queuing system.&amp;lt;br/&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Any compute job running on the login nodes will be terminated without notice.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Any long-running compilation or long-running pre- or post-processing of batch jobs must also be submitted to the queuing system.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The login nodes of the bwHPC clusters are the access point to the compute system, your &amp;lt;code&amp;gt;$HOME&amp;lt;/code&amp;gt; directory, and your workspaces.&lt;br /&gt;
These nodes are shared among all users; therefore, your activities on the login nodes are primarily limited to setting up your batch jobs.&lt;br /&gt;
Permitted activities also include:&lt;br /&gt;
* &#039;&#039;&#039;short&#039;&#039;&#039; compilation of your program code and&lt;br /&gt;
* &#039;&#039;&#039;lightweight&#039;&#039;&#039; pre- and post-processing of your batch jobs.&lt;br /&gt;
&lt;br /&gt;
We advise users to use [[BwUniCluster3.0/Batch_Queues#Interactive_Jobs|interactive jobs]] for compute- and memory-intensive tasks such as compiling.&lt;br /&gt;
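&lt;br /&gt;
For example, a longer compilation can be run in an interactive job instead of on a login node. The requested resources below are only placeholders; valid queue names and limits are listed on the [[BwUniCluster3.0/Batch_Queues|batch queues]] page:&lt;br /&gt;
&lt;br /&gt;
 salloc --partition=&amp;lt;queue&amp;gt; --ntasks=1 --cpus-per-task=8 --time=00:30:00&lt;br /&gt;
 make -j 8&lt;br /&gt;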
&lt;br /&gt;
= Related Information =&lt;br /&gt;
&lt;br /&gt;
* If you want to reset your service password, consult the [[Registration/Password|Password Guide]].&lt;br /&gt;
* If you want to register a new token for the two factor authentication (2FA), consult the [[Registration/2FA|2FA Guide]].&lt;br /&gt;
* If you want to de-register, consult the [[Registration/Deregistration|De-registration Guide]].&lt;br /&gt;
* If you need an SSH key for your workflow, read [[Registration/SSH|Registering SSH Keys with your Cluster]].&lt;br /&gt;
* Configuring your shell: [[.bashrc Do&#039;s and Don&#039;ts]]&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15844</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15844"/>
		<updated>2026-03-18T16:58:35Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at Karlsruhe Institute of Technology (KIT). bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no upcoming maintenance&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | SOLVED: Service Incident Notice: bwUniCluster 3.0 Login Not Possible&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
The login issue with bwUniCluster 3.0, which had been occurring since Friday, March 13, 2026, at 10:00 p.m., has been resolved.&lt;br /&gt;
&lt;br /&gt;
The cause was a software error in the parallel file system, which has since been successfully corrected.&lt;br /&gt;
A patch developed for us by the manufacturer has been applied. However, we would like to point out that we cannot currently completely rule out the possibility that the problem may recur under certain circumstances.&lt;br /&gt;
&lt;br /&gt;
You can now log in as usual. Please check the results of your calculations and resubmit any jobs that were interrupted. &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes, with the host system of both node types consisting of classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. The user only needs to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software and policy can be looked up here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (till July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In order to move data that is still needed, user software, and user specific settings from the old HOME directory to the new HOME directory, or to new workspaces, instructions are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster3.0/Policies|Policies]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15832</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15832"/>
		<updated>2026-03-17T08:08:32Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at Karlsruhe Institute of Technology (KIT). bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no upcoming maintenance&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Service Incident Notice: bwUniCluster 3.0 Login Not Possible&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
We are currently experiencing an issue on bwUniCluster 3.0 that prevents users from logging in. The disruption is caused by a software error in the filesystem.&lt;br /&gt;
Our team is working intensively to resolve the problem, in close collaboration with the system’s manufacturer. At this time, we are unable to provide an exact estimate for when the issue will be fully resolved.&lt;br /&gt;
&lt;br /&gt;
We do not expect a long‑term outage; therefore, any workspaces that may have expired during the disruption should be easily restorable using ws_restore.&lt;br /&gt;
&amp;lt;!-- Please see the [[BwUniCluster3.0/Maintenance|maintenance]] page for more information about planned upgrades and other changes --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will inform you via the mailing list as soon as there are any updates.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes, with the host system of both node types consisting of classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. The user only needs to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software and policy can be looked up here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (till July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In order to move data that is still needed, user software, and user specific settings from the old HOME directory to the new HOME directory, or to new workspaces, instructions are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster3.0/Policies|Policies]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15831</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15831"/>
		<updated>2026-03-17T08:08:22Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at Karlsruhe Institute of Technology (KIT). bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no upcoming maintenance&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Service Incident Notice: bwUniCluster 3.0 Login Not Possible&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
We are currently experiencing an issue on bwUniCluster 3.0 that prevents users from logging in. The disruption is caused by a software error in the filesystem.&lt;br /&gt;
Our team is working intensively to resolve the problem, in close collaboration with the system’s manufacturer. At this time, we are unable to provide an exact estimate for when the issue will be fully resolved.&lt;br /&gt;
&lt;br /&gt;
We do not expect a long‑term outage; therefore, any workspaces that may have expired during the disruption should be easily restorable using ws_restore.&lt;br /&gt;
&amp;lt;!-- Please see the [[BwUniCluster3.0/Maintenance|maintenance]] page for more information about planned upgrades and other changes --&amp;gt;&lt;br /&gt;
We will inform you via the mailing list as soon as there are any updates.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes, with the host system of both node types consisting of classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. The user only needs to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software and policy can be looked up here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (till July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In order to move data that is still needed, user software, and user specific settings from the old HOME directory to the new HOME directory, or to new workspaces, instructions are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster3.0/Policies|Policies]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15830</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15830"/>
		<updated>2026-03-17T08:08:07Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at Karlsruhe Institute of Technology (KIT). The bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out the full section if there is no upcoming maintenance&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Service Incident Notice: bwUniCluster 3.0 Login Not Possible&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
We are currently experiencing an issue on bwUniCluster 3.0 that prevents users from logging in. The disruption is caused by a software error in the filesystem.&lt;br /&gt;
Our team is working intensively to resolve the problem, in close collaboration with the system’s manufacturer. At this time, we are unable to provide an exact estimate for when the issue will be fully resolved.&lt;br /&gt;
&lt;br /&gt;
We do not expect a long‑term outage; therefore, any workspaces that may have expired during the disruption should be easily restorable using ws_restore.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Please see the [[BwUniCluster3.0/Maintenance|maintenance]] page for more information about planned upgrades and other changes --&amp;gt;&lt;br /&gt;
We will inform you via the mailing list as soon as there are any updates.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out the full section if there is no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes; both node types are based on classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. You only need to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software, and policy can be found here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (until July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Instructions for moving still-needed data, user software, and user-specific settings from the old HOME directory to the new HOME directory, or to new workspaces, are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster3.0/Policies|Policies]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15829</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15829"/>
		<updated>2026-03-17T08:07:24Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at Karlsruhe Institute of Technology (KIT). The bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out the full section if there is no upcoming maintenance&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Service Incident Notice: bwUniCluster 3.0 Login Not Possible&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
We are currently experiencing an issue on bwUniCluster 3.0 that prevents users from logging in. The disruption is caused by a software error in the filesystem.&lt;br /&gt;
Our team is working intensively to resolve the problem, in close collaboration with the system’s manufacturer. At this time, we are unable to provide an exact estimate for when the issue will be fully resolved.&lt;br /&gt;
&lt;br /&gt;
We do not expect a long‑term outage; therefore, any workspaces that may have expired during the disruption should be easily restorable using ws_restore.&lt;br /&gt;
&lt;br /&gt;
We will keep you updated as soon as new information becomes available.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Please see the [[BwUniCluster3.0/Maintenance|maintenance]] page for more information about planned upgrades and other changes --&amp;gt;&lt;br /&gt;
We will inform you via the mailing list as soon as there are any updates.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out the full section if there is no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes; both node types are based on classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. You only need to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software, and policy can be found here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (until July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Instructions for moving still-needed data, user software, and user-specific settings from the old HOME directory to the new HOME directory, or to new workspaces, are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster3.0/Policies|Policies]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15735</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15735"/>
		<updated>2026-02-18T14:38:35Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at Karlsruhe Institute of Technology (KIT). The bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out the full section if there is no upcoming maintenance&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Maintenance&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Due to extensive work on the electrical installation, the HPC system bwUniCluster 3.0 and all other HPC services will be unavailable from&lt;br /&gt;
&lt;br /&gt;
09.02.2026 at 06:00 AM until 18.02.2026&lt;br /&gt;
&lt;br /&gt;
Please see the [[BwUniCluster3.0/Maintenance|maintenance]] page for more information about planned upgrades and other changes&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out the full section if there is no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes; both node types are based on classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. You only need to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software, and policy can be found here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (until July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Instructions for moving still-needed data, user software, and user-specific settings from the old HOME directory to the new HOME directory, or to new workspaces, are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster3.0/Policies|Policies]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Maintenance&amp;diff=15717</id>
		<title>BwUniCluster3.0/Maintenance</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Maintenance&amp;diff=15717"/>
		<updated>2026-02-09T08:52:54Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Maintenance records of bwUniCluster 3.0 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Maintenance records of bwUniCluster 3.0 ===&lt;br /&gt;
&#039;&#039;&#039;2026&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Minor updates to drivers and the kernel.&lt;br /&gt;
&lt;br /&gt;
=== Maintenance records of retired bwUniCluster 2.0 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2024&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2024-05]] from 21.05.2024 to 24.05.2024&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2023&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2023-03]] from 20.03.2023 to 24.03.2023&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2022&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2022-11]] from 07.11.2022 to 10.11.2022&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2022-03]] from 28.03.2022 to 31.03.2022&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2021&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2021-10]] from 11.10.2021 to 15.10.2021&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2020&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2020-10]] from 06.10.2020 to 13.10.2020&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Maintenance records of retired bwUniCluster 1.0 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2019&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster/Maintenance/2019-02]] from 02.02.2019 to 08.02.2019&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2017&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster/Maintenance/2017-05]] from 02.05.2017 to 02.05.2017&lt;br /&gt;
* [[BwUniCluster/Maintenance/2017-03]] from 20.03.2017 to 21.03.2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2016&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster/Maintenance/2016-10]] from 17.10.2016 to 21.10.2016&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Maintenance&amp;diff=15716</id>
		<title>BwUniCluster3.0/Maintenance</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Maintenance&amp;diff=15716"/>
		<updated>2026-02-09T08:52:45Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Maintenance records of bwUniCluster 3.0 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Maintenance records of bwUniCluster 3.0 ===&lt;br /&gt;
&#039;&#039;&#039;2026&#039;&#039;&#039;&lt;br /&gt;
Minor updates to drivers and the kernel.&lt;br /&gt;
&lt;br /&gt;
=== Maintenance records of retired bwUniCluster 2.0 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2024&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2024-05]] from 21.05.2024 to 24.05.2024&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2023&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2023-03]] from 20.03.2023 to 24.03.2023&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2022&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2022-11]] from 07.11.2022 to 10.11.2022&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2022-03]] from 28.03.2022 to 31.03.2022&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2021&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2021-10]] from 11.10.2021 to 15.10.2021&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2020&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2020-10]] from 06.10.2020 to 13.10.2020&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Maintenance records of retired bwUniCluster 1.0 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2019&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster/Maintenance/2019-02]] from 02.02.2019 to 08.02.2019&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2017&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster/Maintenance/2017-05]] from 02.05.2017 to 02.05.2017&lt;br /&gt;
* [[BwUniCluster/Maintenance/2017-03]] from 20.03.2017 to 21.03.2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2016&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster/Maintenance/2016-10]] from 17.10.2016 to 21.10.2016&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Maintenance&amp;diff=15715</id>
		<title>BwUniCluster3.0/Maintenance</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Maintenance&amp;diff=15715"/>
		<updated>2026-02-09T08:52:09Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Maintenance records of bwUniCluster 3.0 ===&lt;br /&gt;
&#039;&#039;&#039;2026&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Maintenance records of retired bwUniCluster 2.0 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2024&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2024-05]] from 21.05.2024 to 24.05.2024&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2023&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2023-03]] from 20.03.2023 to 24.03.2023&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2022&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2022-11]] from 07.11.2022 to 10.11.2022&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2022-03]] from 28.03.2022 to 31.03.2022&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2021&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2021-10]] from 11.10.2021 to 15.10.2021&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2020&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster2.0/Maintenance/2020-10]] from 06.10.2020 to 13.10.2020&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Maintenance records of retired bwUniCluster 1.0 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2019&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster/Maintenance/2019-02]] from 02.02.2019 to 08.02.2019&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2017&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster/Maintenance/2017-05]] from 02.05.2017 to 02.05.2017&lt;br /&gt;
* [[BwUniCluster/Maintenance/2017-03]] from 20.03.2017 to 21.03.2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2016&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [[BwUniCluster/Maintenance/2016-10]] from 17.10.2016 to 21.10.2016&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15714</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15714"/>
		<updated>2026-02-09T08:51:27Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at Karlsruhe Institute of Technology (KIT). The bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there is no upcoming maintenance&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Maintenance&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Due to extensive work on the electrical installation, the HPC system bwUniCluster 3.0 and all other HPC services will be unavailable from&lt;br /&gt;
&lt;br /&gt;
09.02.2026 at 06:00 AM until 18.02.2026&lt;br /&gt;
&lt;br /&gt;
Please see the [[BwUniCluster3.0/Maintenance|maintenance]] page for more information about planned upgrades and other changes.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there is no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes, with the host system of both node types consisting of classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. The user only needs to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software and policy can be looked up here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (till July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In order to move data that is still needed, user software, and user specific settings from the old HOME directory to the new HOME directory, or to new workspaces, instructions are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster3.0/Policies|Policies]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15713</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15713"/>
		<updated>2026-02-09T08:50:29Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at Karlsruhe Institute of Technology (KIT). The bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there is no upcoming maintenance&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Maintenance&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Due to extensive work on the electrical installation, the HPC system bwUniCluster 3.0 and all other HPC services will be unavailable from&lt;br /&gt;
&lt;br /&gt;
09.02.2026 at 06:00 AM until 18.02.2026&lt;br /&gt;
&lt;br /&gt;
Please see the [[BwUniCluster2.0/Maintenance/2024-05|maintenance]] page for more information about planned upgrades and other changes.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there is no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes, with the host system of both node types consisting of classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. The user only needs to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software and policy can be looked up here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (till July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In order to move data that is still needed, user software, and user specific settings from the old HOME directory to the new HOME directory, or to new workspaces, instructions are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster3.0/Policies|Policies]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15712</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15712"/>
		<updated>2026-02-09T08:46:28Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at Karlsruhe Institute of Technology (KIT). The bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there is no upcoming maintenance&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Next maintenance&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Due to regular maintenance work, the HPC system bwUniCluster 2.0 will not be available from &lt;br /&gt;
&lt;br /&gt;
21.05.2024 at 08:30 AM until 24.05.2024 at 3:00 PM&lt;br /&gt;
&lt;br /&gt;
Please see the [[BwUniCluster2.0/Maintenance/2024-05|maintenance]] page for more information about planned upgrades and other changes.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there is no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes, with the host system of both node types consisting of classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. The user only needs to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software and policy can be looked up here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (till July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In order to move data that is still needed, user software, and user specific settings from the old HOME directory to the new HOME directory, or to new workspaces, instructions are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster3.0/Policies|Policies]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15710</id>
		<title>BwUniCluster3.0/Running Jobs</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15710"/>
		<updated>2026-02-03T14:36:30Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Interactive Computing with Jupyter */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Purpose and function of a queuing system =&lt;br /&gt;
&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, either a batch script is executed automatically or the allocated nodes can be used interactively.&amp;lt;br&amp;gt;&lt;br /&gt;
For the general procedure, see [[Running_Calculations | Running Calculations]].&lt;br /&gt;
&lt;br /&gt;
== Job submission process ==&lt;br /&gt;
&lt;br /&gt;
bwUniCluster 3.0 uses the workload management software Slurm. Any job submission therefore has to be performed with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.&lt;br /&gt;
&lt;br /&gt;
== Slurm ==&lt;br /&gt;
&lt;br /&gt;
The HPC workload manager on bwUniCluster 3.0 is Slurm, a cluster management and job scheduling system.&lt;br /&gt;
Slurm has three key functions:&lt;br /&gt;
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work. &lt;br /&gt;
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. &lt;br /&gt;
* It arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&lt;br /&gt;
Any calculation on the compute nodes of bwUniCluster 3.0 requires the user to define it as a sequence of commands together with the required run time, number of CPU cores and amount of main memory, and to submit all of this, i.e. the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to the resource and workload management software.&lt;br /&gt;
&lt;br /&gt;
== Terms and definitions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Partitions &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm manages job queues for different &#039;&#039;&#039;partitions&#039;&#039;&#039;. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different partitions:&lt;br /&gt;
&lt;br /&gt;
* CPU-only nodes&lt;br /&gt;
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each&lt;br /&gt;
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each&lt;br /&gt;
* GPU-accelerated nodes&lt;br /&gt;
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs&lt;br /&gt;
** 4-socket node with 4x AMD Instinct accelerators&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Queues &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Job &#039;&#039;&#039;queues&#039;&#039;&#039; are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different main types of queues:&lt;br /&gt;
* Regular queues&lt;br /&gt;
** cpu: Jobs that request CPU-only nodes.&lt;br /&gt;
** gpu: Jobs that request GPU-accelerated nodes.&lt;br /&gt;
* Development queues (dev)&lt;br /&gt;
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without having to wait. They are the place to run resource-intensive tests right away without affecting other users, as would be the case on the login nodes.&lt;br /&gt;
&lt;br /&gt;
Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 &amp;lt;font color=red&amp;gt;requires at least the specification of the &#039;&#039;&#039;queue&#039;&#039;&#039; and the &#039;&#039;&#039;time&#039;&#039;&#039;&amp;lt;/font&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Jobs &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Jobs can be run non-interactively as &#039;&#039;&#039;batch jobs&#039;&#039;&#039; or as &#039;&#039;&#039;interactive jobs&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command.&lt;br /&gt;
For interactive jobs, the resources are requested with the &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and users can freely dispose of the resources now available to them.&lt;br /&gt;
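The two submission modes described above can be sketched as follows. This is a minimal, hypothetical example: the queue name, time limit and resource values are illustrative only and must be adapted to the queues and limits listed below.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --partition=cpu      # the queue (mandatory)&lt;br /&gt;
#SBATCH --time=00:10:00      # the requested wall time (mandatory)&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
# Commands executed on the allocated compute node:&lt;br /&gt;
echo &amp;quot;Running on $(hostname)&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Such a script would be submitted with &amp;lt;code&amp;gt;sbatch jobscript.sh&amp;lt;/code&amp;gt;; an equivalent interactive allocation would be requested with &amp;lt;code&amp;gt;salloc --partition=cpu --time=00:10:00 --ntasks=1&amp;lt;/code&amp;gt;.&lt;br /&gt;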
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Please remember:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Heavy computations are not allowed on the login nodes&#039;&#039;&#039;.&amp;lt;br&amp;gt;Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
* &#039;&#039;&#039;Development queues&#039;&#039;&#039; are meant for &#039;&#039;&#039;development tasks&#039;&#039;&#039;.&amp;lt;br&amp;gt;Do not misuse these queues for regular, short-running jobs or chain jobs! Only one job may run at a time, and at most 3 jobs may be queued.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Queues on bwUniCluster 3.0 = &lt;br /&gt;
== Policy ==&lt;br /&gt;
&lt;br /&gt;
The computing time is provided in accordance with the &#039;&#039;&#039;fair share policy&#039;&#039;&#039;, which takes into account the individual investment shares of the respective universities and the resources already used by their members. Furthermore, the following throttling policy is active: the &#039;&#039;&#039;maximum number of physical cores&#039;&#039;&#039; in use at any given time is &#039;&#039;&#039;1920 per user&#039;&#039;&#039; (aggregated over all running jobs). This corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and to maximize the number of users who can access computing time at the same time.&lt;br /&gt;
&lt;br /&gt;
== Regular Queues ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node-Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
| mem-per-cpu=12090mb&lt;br /&gt;
| mem=380001mb&lt;br /&gt;
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
| mem-per-gpu=128200mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=48:00:00, nodes=9 (A100) / nodes=5 (H100), mem=510000mb, ntasks-per-node=64, (threads-per-core=2)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Regular Queues&lt;br /&gt;
&lt;br /&gt;
== Short Queues ==&lt;br /&gt;
&amp;lt;p style=&amp;quot;color:red;&amp;quot;&amp;gt;&amp;lt;b&amp;gt;Queues with a maximum runtime of 30 minutes.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=94000mb&amp;lt;br/&amp;gt;cpus-per-gpu=12&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)&lt;br /&gt;
|}&lt;br /&gt;
Table 2: Short Queues&lt;br /&gt;
&lt;br /&gt;
== Development Queues ==&lt;br /&gt;
These queues are intended only for development tasks, i.e. debugging or performance optimization.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_a100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16 &lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 3: Development Queues&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The default resources of a queue define the number of tasks and the amount of memory if these are not explicitly given with the sbatch command. The resource options &#039;&#039;--time&#039;&#039;, &#039;&#039;--ntasks&#039;&#039;, &#039;&#039;--nodes&#039;&#039;, &#039;&#039;--mem&#039;&#039; and &#039;&#039;--mem-per-cpu&#039;&#039; are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].&lt;br /&gt;
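Since the queue and the time must always be specified, a minimal batch submission could look like the following sketch (the queue name, time limit in minutes and script name are illustrative only):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch --partition=cpu --time=10 my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
All other resources then fall back to the queue defaults listed above.&lt;br /&gt;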
&lt;br /&gt;
== Check available resources: sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates downtime, reservations, and node state information in determining the available backfill window. On bwUniCluster 3.0, the plain sinfo command can only be used by the administrator.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The following command displays which resources are available for immediate use in each partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle &lt;br /&gt;
Partition dev_cpu                 :      1 nodes idle&lt;br /&gt;
Partition cpu                     :      1 nodes idle&lt;br /&gt;
Partition highmem                 :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_h100            :      0 nodes idle&lt;br /&gt;
Partition gpu_h100                :      0 nodes idle&lt;br /&gt;
Partition gpu_mi300               :      0 nodes idle&lt;br /&gt;
Partition dev_cpu_il              :      7 nodes idle&lt;br /&gt;
Partition cpu_il                  :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_a100_il         :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_il             :      0 nodes idle&lt;br /&gt;
Partition gpu_h100_il             :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_short          :      0 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Running Jobs =&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
Important Slurm commands for non-administrators working on bwUniCluster 3.0.&lt;br /&gt;
{| width=850px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Batch Jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
Batch jobs are submitted with the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. The main purpose of &#039;&#039;&#039;sbatch&#039;&#039;&#039; is to specify the resources that are needed to run the job. &#039;&#039;&#039;sbatch&#039;&#039;&#039; will then queue the batch job. However, when a batch job starts depends on the availability of the requested resources and the fair share value.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be used on the command line or in your job script. Different defaults for some of these options are set depending on the queue and can be found [[BwUniCluster3.0/Slurm | here]].&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! style=&amp;quot;width:8%&amp;quot;| Command line&lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;| Script&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t, --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N, --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n, --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count of tasks per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c, --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --gres=gpu:&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --gres=gpu:&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of GPUs required per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in megabytes per node. (You should omit setting this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum memory required per allocated CPU, in megabytes. (You should omit setting this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --exclusive&lt;br /&gt;
| #SBATCH --exclusive &lt;br /&gt;
| The job allocates all CPUs and GPUs on the nodes; it will not share the nodes with other running jobs.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J, --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A, --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Charge resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The project group a job is accounted on is shown behind &amp;quot;Account=&amp;quot; in the output of &amp;quot;scontrol show job&amp;quot;.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p, --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Job constraint LSDF filesystems.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Job constraint BeeOND filesystem.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
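A job script typically combines several of these options in its header. The following sketch is illustrative only; the partition, resource values and program name are placeholders:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --partition=cpu&lt;br /&gt;
#SBATCH --time=01:00:00&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --mem-per-cpu=2000mb&lt;br /&gt;
#SBATCH --job-name=example_job&lt;br /&gt;
#SBATCH --output=example_job-%j.out&lt;br /&gt;
&lt;br /&gt;
# commands executed on the allocated compute node&lt;br /&gt;
./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Such a script would then be submitted with &amp;lt;code&amp;gt;sbatch example_job.sh&amp;lt;/code&amp;gt;.&lt;br /&gt;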
&lt;br /&gt;
== Interactive Jobs: salloc ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 you are only allowed to run short jobs (&amp;lt;&amp;lt; 1 hour) with low memory requirements (&amp;lt;&amp;lt; 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs requesting more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, to run a serial application on a compute node that requires 5000 MByte of memory, with the interactive run limited to 2 hours, execute the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -n 1 -t 120 --mem=5000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will then get one core on a compute node within the partition &amp;quot;cpu&amp;quot;. After executing this command, &#039;&#039;&#039;DO NOT CLOSE&#039;&#039;&#039; your current terminal session; wait until the queueing system Slurm has granted you the requested resources. You will be logged in automatically on the allocated node. To run a serial program on the granted core, simply type the name of the executable:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./&amp;lt;my_serial_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please be aware that in this example your serial job must finish within 2 hours, otherwise it will be killed by the system during runtime.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours, with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ xterm&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
An interactive parallel application can run on one or several compute nodes (e.g. 5 nodes with 96 cores each) and usually requires an amount of memory in GByte (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 such nodes can be allocated with the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00  --mem=50gb&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.&lt;br /&gt;
If you want access to another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and run the following commands to&lt;br /&gt;
connect first to the running interactive job and then to a specific node:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --jobid=XXXXXXXX --pty /bin/bash&lt;br /&gt;
$ srun --nodelist=uc3nXXX --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the jobid and the nodelist can be shown.&lt;br /&gt;
&lt;br /&gt;
If you want to run MPI programs, you can do so by simply typing mpirun &amp;lt;program_name&amp;gt;; your program will then run on all 480 cores. A very simple example of starting a parallel job is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also start the debugger ddt with the commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module add devel/ddt&lt;br /&gt;
$ ddt &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The above commands execute the parallel program &amp;lt;my_mpi_program&amp;gt; on all available cores. You can also start parallel programs on a subset of cores, for example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 50 &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are using Intel MPI, you must start &amp;lt;my_mpi_program&amp;gt; with the command mpiexec.hydra (instead of mpirun).&lt;br /&gt;
&lt;br /&gt;
== Monitor and manage jobs ==&lt;br /&gt;
&lt;br /&gt;
=== List of your submitted jobs : squeue ===&lt;br /&gt;
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via the manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;squeue&#039;&#039; example on bwUniCluster 3.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  R       8:15      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123 PD       0:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  R       2:41      1 uc3n084&lt;br /&gt;
$ squeue -l&lt;br /&gt;
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  RUNNING       8:55     20:00      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123  PENDING       0:00     20:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  RUNNING       3:21     20:00      1 uc3n084&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If resources are not immediately available, add &amp;lt;code&amp;gt;--start&amp;lt;/code&amp;gt; to show a job&#039;s expected start time:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sh&amp;quot;&amp;gt;squeue --start&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Detailed job information : scontrol show job ===&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for a specified job. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via the manpage (man scontrol).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is an example from bwUniCluster 3.0:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
1262       cpu     wrap ka_zs040  R       1:12      1 uc3n002&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my running job with jobid 1262&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 1262&lt;br /&gt;
&lt;br /&gt;
JobId=1262 JobName=wrap&lt;br /&gt;
   UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=4246 Nice=0 Account=ka QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30&lt;br /&gt;
   AccrueTime=2025-04-04T10:01:30&lt;br /&gt;
   StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main&lt;br /&gt;
   Partition=cpu AllocNode:Sid=uc3n999:2819841&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc3n002&lt;br /&gt;
   BatchHost=uc3n002&lt;br /&gt;
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=2000M,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=2,mem=4000M,node=1,billing=2&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=(null)&lt;br /&gt;
   WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402&lt;br /&gt;
   StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Each request to the Slurm workload manager generates load. &amp;lt;p style=&amp;quot;color:red;&amp;quot;&amp;gt;&amp;lt;b&amp;gt;Therefore, do not use &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; with a simple &amp;lt;code&amp;gt;watch&amp;lt;/code&amp;gt;.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt; The smallest allowed polling interval is &amp;lt;b&amp;gt;30 seconds&amp;lt;/b&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
Any violation of this rule will result in the task being terminated without notice.&lt;br /&gt;
&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
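For example, a single job can be canceled by its job ID, or all of your pending jobs can be canceled at once via the state filter (the job ID below is illustrative):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel 1262&lt;br /&gt;
$ scancel -t PENDING&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;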
&lt;br /&gt;
= Slurm Options =&lt;br /&gt;
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]&lt;br /&gt;
&lt;br /&gt;
= Best Practices =&lt;br /&gt;
&lt;br /&gt;
== Step-by-Step example==&lt;br /&gt;
&lt;br /&gt;
== Dos and Don&#039;ts ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;| Do not run squeue or other Slurm commands in loops or via &amp;quot;watch&amp;quot;, so as not to saturate the Slurm daemon with RPC requests.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15709</id>
		<title>BwUniCluster3.0/Running Jobs</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15709"/>
		<updated>2026-02-03T14:35:52Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Monitor and manage jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Purpose and function of a queuing system =&lt;br /&gt;
&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively.&amp;lt;br&amp;gt;&lt;br /&gt;
For the general procedure, see [[Running_Calculations | Running Calculations]].&lt;br /&gt;
&lt;br /&gt;
== Job submission process ==&lt;br /&gt;
&lt;br /&gt;
bwUniCluster 3.0 uses the workload manager Slurm. Any job submission therefore has to be performed with Slurm commands. Slurm queues and runs user jobs based on fair share policies.&lt;br /&gt;
&lt;br /&gt;
== Slurm ==&lt;br /&gt;
&lt;br /&gt;
The HPC workload manager on bwUniCluster 3.0 is Slurm, a cluster management and job scheduling system. Slurm has three key functions:&lt;br /&gt;
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work. &lt;br /&gt;
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. &lt;br /&gt;
* It arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&lt;br /&gt;
Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and amount of main memory, and to submit all of this, i.e. the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to the resource and workload managing software.&lt;br /&gt;
&lt;br /&gt;
== Terms and definitions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Partitions &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm manages job queues for different &#039;&#039;&#039;partitions&#039;&#039;&#039;. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different partitions:&lt;br /&gt;
&lt;br /&gt;
* CPU-only nodes&lt;br /&gt;
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each&lt;br /&gt;
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each&lt;br /&gt;
* GPU-accelerated nodes&lt;br /&gt;
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs&lt;br /&gt;
** 4-socket node with 4x AMD Instinct accelerator&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Queues &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Job &#039;&#039;&#039;queues&#039;&#039;&#039; are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different main types of queues:&lt;br /&gt;
* Regular queues&lt;br /&gt;
** cpu: Jobs that request CPU-only nodes.&lt;br /&gt;
** gpu: Jobs that request GPU-accelerated nodes.&lt;br /&gt;
* Development queues (dev)&lt;br /&gt;
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to give users immediate access to compute resources without long waiting times. They are the place for short bursts of heavy computation that would otherwise affect other users on the login nodes.&lt;br /&gt;
&lt;br /&gt;
Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. A request for compute resources on bwUniCluster 3.0 &amp;lt;font color=red&amp;gt;requires at least the specification of the &#039;&#039;&#039;queue&#039;&#039;&#039; and the &#039;&#039;&#039;time&#039;&#039;&#039;&amp;lt;/font&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Jobs &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Jobs can be run non-interactively as &#039;&#039;&#039;batch jobs&#039;&#039;&#039; or interactively as &#039;&#039;&#039;interactive jobs&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This script is queued and executed as soon as the requested compute resources are available and allocated. Jobs are enqueued with the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command.&lt;br /&gt;
For interactive jobs, the resources are requested with the &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command. As soon as the computing resources are available and allocated, a command line prompt is opened on a compute node and the user can work interactively with the allocated resources.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Please remember:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Heavy computations are not allowed on the login nodes&#039;&#039;&#039;.&amp;lt;br&amp;gt;Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
* &#039;&#039;&#039;Development queues&#039;&#039;&#039; are meant for &#039;&#039;&#039;development tasks&#039;&#039;&#039;.&amp;lt;br&amp;gt;Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is allowed, and the maximum number of queued jobs is limited to 3.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Queues on bwUniCluster 3.0 = &lt;br /&gt;
== Policy ==&lt;br /&gt;
&lt;br /&gt;
Computing time is provided in accordance with the &#039;&#039;&#039;fair share policy&#039;&#039;&#039;, which takes into account the investment share of the respective university and the resources already used by its members. In addition, the following throttling policy is active: the &#039;&#039;&#039;maximum number of physical cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920 per user&#039;&#039;&#039; (aggregated over all running jobs). This corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and to maximize the number of users who can access computing time at the same time.&lt;br /&gt;
&lt;br /&gt;
== Regular Queues ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node-Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
| mem-per-cpu=12090mb&lt;br /&gt;
| mem=380001mb&lt;br /&gt;
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
| mem-per-gpu=128200mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=48:00:00, nodes=9 (A100) / nodes=5 (H100), mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 1: Regular Queues&lt;br /&gt;
&lt;br /&gt;
== Short Queues ==&lt;br /&gt;
&amp;lt;p style=&amp;quot;color:red;&amp;quot;&amp;gt;&amp;lt;b&amp;gt;Queues with a maximum runtime of 30 minutes.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt; &lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=94000mb&amp;lt;br/&amp;gt;cpus-per-gpu=12&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)&lt;br /&gt;
|}&lt;br /&gt;
Table 2: Short Queues&lt;br /&gt;
&lt;br /&gt;
== Development Queues ==&lt;br /&gt;
Only for development, i.e. debugging or performance optimization.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_a100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16 &lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 3: Development Queues&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The default resources of a queue define the number of tasks and the amount of memory if they are not explicitly given with the sbatch command. The resource options &#039;&#039;--time&#039;&#039;, &#039;&#039;--ntasks&#039;&#039;, &#039;&#039;--nodes&#039;&#039;, &#039;&#039;--mem&#039;&#039; and &#039;&#039;--mem-per-cpu&#039;&#039; are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].&lt;br /&gt;
&lt;br /&gt;
== Check available resources: sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo displays partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information when determining the available backfill window. On bwUniCluster 3.0 the plain sinfo command is reserved for administrators.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC provides a special script (sinfo_t_idle) that shows how many nodes are idle and available for immediate use on the system. Users can use this information to submit jobs that fit these free resources and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The following command displays what resources are available for immediate use for the whole partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle &lt;br /&gt;
Partition dev_cpu                 :      1 nodes idle&lt;br /&gt;
Partition cpu                     :      1 nodes idle&lt;br /&gt;
Partition highmem                 :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_h100            :      0 nodes idle&lt;br /&gt;
Partition gpu_h100                :      0 nodes idle&lt;br /&gt;
Partition gpu_mi300               :      0 nodes idle&lt;br /&gt;
Partition dev_cpu_il              :      7 nodes idle&lt;br /&gt;
Partition cpu_il                  :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_a100_il         :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_il             :      0 nodes idle&lt;br /&gt;
Partition gpu_h100_il             :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_short          :      0 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Running Jobs =&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
Important Slurm commands for non-administrators working on bwUniCluster 3.0.&lt;br /&gt;
{| width=850px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Batch Jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
Batch jobs are submitted with the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. Its main purpose is to specify the resources that are needed to run the job; &#039;&#039;&#039;sbatch&#039;&#039;&#039; then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair share value.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be used on the command line or in your job script. Different defaults for some of these options are set per queue and can be found [[BwUniCluster3.0/Slurm | here]].&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! style=&amp;quot;width:8%&amp;quot;| Command line&lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;| Script&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t, --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N, --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n, --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count of tasks per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c, --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --gres=gpu:&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --gres=gpu:&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of GPUs required per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in MegaByte per node. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --exclusive&lt;br /&gt;
| #SBATCH --exclusive &lt;br /&gt;
| The job allocates all CPUs and GPUs on the nodes. It will not share the nodes with other running jobs.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J, --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A, --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The project group a job is accounted on is shown behind &amp;quot;Account=&amp;quot; in the output of &amp;quot;scontrol show job&amp;quot;. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p, --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Job constraint LSDF filesystems.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Job constraint BeeOND filesystem.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
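As an illustration, several of the options above can be combined in one job script. The following sketch assumes a single-node, multi-threaded program; the names &amp;lt;code&amp;gt;myjob&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;myprog&amp;lt;/code&amp;gt; are placeholders:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --partition=cpu&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --cpus-per-task=16&lt;br /&gt;
#SBATCH --job-name=myjob&lt;br /&gt;
#SBATCH --output=myjob-%j.out&lt;br /&gt;
&lt;br /&gt;
# run the (placeholder) program with 16 threads&lt;br /&gt;
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}&lt;br /&gt;
./myprog&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Such a script would be submitted with &amp;lt;code&amp;gt;sbatch myjob.sh&amp;lt;/code&amp;gt;.&lt;br /&gt;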
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Interactive Jobs: salloc ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 you are only allowed to run short jobs (&amp;lt;&amp;lt; 1 hour) with small memory requirements (&amp;lt;&amp;lt; 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, the following command has to be executed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -n 1 -t 120 --mem=5000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then you will get one core on a compute node within the partition &amp;quot;cpu&amp;quot;. After executing this command, &#039;&#039;&#039;DO NOT CLOSE&#039;&#039;&#039; your current terminal session; wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core. To run a serial program on the granted core, simply type the name of the executable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./&amp;lt;my_serial_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please be aware that in this example your serial job must run for less than 2 hours, otherwise it will be killed by the system during runtime. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ xterm&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that once the walltime limit has been reached, the resources - i.e. the compute node - will automatically be revoked.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
An interactive parallel application running on one or on many compute nodes (e.g. 5 nodes with 96 cores each) usually requires an amount of memory in GByte (e.g. 50 GByte) and a maximum time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00  --mem=50gb&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.&lt;br /&gt;
If you want to have access to another node, you must open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to&lt;br /&gt;
connect first to the running interactive job and then to a specific node:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --jobid=XXXXXXXX --pty /bin/bash&lt;br /&gt;
$ srun --nodelist=uc3nXXX --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the jobid and the nodelist can be shown.&lt;br /&gt;
&lt;br /&gt;
If you want to run MPI programs, you can do so by simply typing mpirun &amp;lt;program_name&amp;gt;. Your program will then run on all 480 cores. A very simple example of starting a parallel job is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also start the debugger ddt with the commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module add devel/ddt&lt;br /&gt;
$ ddt &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The above commands will execute the parallel program &amp;lt;my_mpi_program&amp;gt; on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 50 &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are using Intel MPI, you must start &amp;lt;my_mpi_program&amp;gt; with the command mpiexec.hydra (instead of mpirun).&lt;br /&gt;
&lt;br /&gt;
== Interactive Computing with Jupyter ==&lt;br /&gt;
&lt;br /&gt;
== Monitor and manage jobs ==&lt;br /&gt;
&lt;br /&gt;
=== List of your submitted jobs : squeue ===&lt;br /&gt;
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via the manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;squeue&#039;&#039; example on bwUniCluster 3.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  R       8:15      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123 PD       0:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  R       2:41      1 uc3n084&lt;br /&gt;
$ squeue -l&lt;br /&gt;
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  RUNNING       8:55     20:00      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123  PENDING       0:00     20:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  RUNNING       3:21     20:00      1 uc3n084&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If resources are not immediately available, add &amp;lt;code&amp;gt;--start&amp;lt;/code&amp;gt; to show the expected start time of pending jobs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sh&amp;quot;&amp;gt;squeue --start&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Detailed job information : scontrol show job ===&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for one specified job. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via the manpage (man scontrol). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is an example from bwUniCluster 3.0:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
1262       cpu     wrap ka_zs040  R       1:12      1 uc3n002&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my job with jobid 1262&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 1262&lt;br /&gt;
&lt;br /&gt;
JobId=1262 JobName=wrap&lt;br /&gt;
   UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=4246 Nice=0 Account=ka QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30&lt;br /&gt;
   AccrueTime=2025-04-04T10:01:30&lt;br /&gt;
   StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main&lt;br /&gt;
   Partition=cpu AllocNode:Sid=uc3n999:2819841&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc3n002&lt;br /&gt;
   BatchHost=uc3n002&lt;br /&gt;
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=2000M,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=2,mem=4000M,node=1,billing=2&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=(null)&lt;br /&gt;
   WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402&lt;br /&gt;
   StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Each request to the Slurm workload manager generates load. &amp;lt;p style=&amp;quot;color:red;&amp;quot;&amp;gt;&amp;lt;b&amp;gt;Therefore, do not use &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; in a simple &amp;lt;code&amp;gt;watch&amp;lt;/code&amp;gt; loop.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt; The smallest allowed polling interval is &amp;lt;b&amp;gt;30 seconds&amp;lt;/b&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
Any violation of this rule will result in the task being terminated without notice.&lt;br /&gt;
&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
The scancel command cancels jobs. It is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via the manpage (man scancel). The syntax is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
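&lt;br /&gt;
For example, assuming a job with job ID 1262 as shown by squeue:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel 1262                  # cancel the job with job ID 1262&lt;br /&gt;
$ scancel -t PENDING -u $USER   # cancel all of your pending jobs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;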
&lt;br /&gt;
= Slurm Options =&lt;br /&gt;
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]&lt;br /&gt;
&lt;br /&gt;
= Best Practices =&lt;br /&gt;
&lt;br /&gt;
== Step-by-Step example==&lt;br /&gt;
&lt;br /&gt;
== Dos and Don&#039;ts ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;| Do not run squeue or other Slurm commands in loops or via &amp;quot;watch&amp;quot;, so as not to saturate the Slurm daemon with RPC requests.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15672</id>
		<title>BwUniCluster3.0/Running Jobs</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15672"/>
		<updated>2026-01-07T13:52:09Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Batch Jobs: sbatch */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Purpose and function of a queuing system =&lt;br /&gt;
&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are either executed automatically via a batch script or the resources can be used interactively.&amp;lt;br&amp;gt;&lt;br /&gt;
For the general procedure, see [[Running_Calculations | Running Calculations]].&lt;br /&gt;
&lt;br /&gt;
== Job submission process ==&lt;br /&gt;
&lt;br /&gt;
bwUniCluster 3.0 runs the workload management software Slurm. Any job submission by the user is therefore performed with commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.&lt;br /&gt;
&lt;br /&gt;
== Slurm ==&lt;br /&gt;
&lt;br /&gt;
The HPC workload manager on bwUniCluster 3.0 is Slurm.&lt;br /&gt;
Slurm is a cluster management and job scheduling system with three key functions. &lt;br /&gt;
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work. &lt;br /&gt;
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. &lt;br /&gt;
* It arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&lt;br /&gt;
Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to the resource and workload managing software.&lt;br /&gt;
&lt;br /&gt;
== Terms and definitions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Partitions &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm manages job queues for different &#039;&#039;&#039;partitions&#039;&#039;&#039;. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different partitions:&lt;br /&gt;
&lt;br /&gt;
* CPU-only nodes&lt;br /&gt;
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each&lt;br /&gt;
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each&lt;br /&gt;
* GPU-accelerated nodes&lt;br /&gt;
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs&lt;br /&gt;
** 4-socket node with 4x AMD Instinct accelerator&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Queues &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Job &#039;&#039;&#039;queues&#039;&#039;&#039; are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different main types of queues:&lt;br /&gt;
* Regular queues&lt;br /&gt;
** cpu: Jobs that request CPU-only nodes.&lt;br /&gt;
** gpu: Jobs that request GPU-accelerated nodes.&lt;br /&gt;
* Development queues (dev)&lt;br /&gt;
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to give users immediate access to compute resources without long waiting times. This is the place for short bursts of heavy computation that would otherwise affect other users, as would be the case on the login nodes.&lt;br /&gt;
&lt;br /&gt;
Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 &amp;lt;font color=red&amp;gt;requires at least the specification of the &#039;&#039;&#039;queue&#039;&#039;&#039; and the &#039;&#039;&#039;time&#039;&#039;&#039;&amp;lt;/font&amp;gt;.&lt;br /&gt;
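For example, a minimal submission specifying only the mandatory queue and wall time could look like this (&amp;lt;code&amp;gt;my_job.sh&amp;lt;/code&amp;gt; is a placeholder for your own batch script):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch --partition=cpu --time=00:10:00 my_job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;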
&lt;br /&gt;
&#039;&#039;&#039; Jobs &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Jobs can be run non-interactively as &#039;&#039;&#039;batch jobs&#039;&#039;&#039; or as &#039;&#039;&#039;interactive jobs&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This script is queued and executed as soon as the requested compute resources are available and allocated. Batch jobs are enqueued with the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command.&lt;br /&gt;
For interactive jobs, the resources are requested with the &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Please remember:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Heavy computations are not allowed on the login nodes&#039;&#039;&#039;.&amp;lt;br&amp;gt;Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
* &#039;&#039;&#039;Development queues&#039;&#039;&#039; are meant for &#039;&#039;&#039;development tasks&#039;&#039;&#039;.&amp;lt;br&amp;gt;Do not misuse these queues for regular, short-running jobs or chain jobs! Only one job may run at a time, and the maximum queue length is 3.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Queues on bwUniCluster 3.0 = &lt;br /&gt;
== Policy ==&lt;br /&gt;
&lt;br /&gt;
Computing time is provided in accordance with the &#039;&#039;&#039;fair share policy&#039;&#039;&#039;, which takes into account the investment share of the respective university and the resources already used by its members. In addition, the following throttling policy is active: the &#039;&#039;&#039;maximum number of physical cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920 per user&#039;&#039;&#039; (aggregated over all running jobs). This corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.&lt;br /&gt;
&lt;br /&gt;
== Regular Queues ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node-Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
| mem-per-cpu=12090mb&lt;br /&gt;
| mem=380001mb&lt;br /&gt;
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
| mem-per-gpu=128200mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=48:00:00, nodes=9(A100)/nodes=5(H100) , mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 1: Regular Queues&lt;br /&gt;
&lt;br /&gt;
== Short Queues ==&lt;br /&gt;
&amp;lt;p style=&amp;quot;color:red; &amp;quot;&amp;gt;&amp;lt;b&amp;gt;Queues with a short runtime of 30 minutes.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt; &lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=94000mb&amp;lt;br/&amp;gt;cpus-per-gpu=12&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)&lt;br /&gt;
|}&lt;br /&gt;
Table 2: Short Queues&lt;br /&gt;
&lt;br /&gt;
== Development Queues ==&lt;br /&gt;
These queues are only for development tasks, i.e. debugging or performance optimization.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_a100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&amp;lt;br/&amp;gt;&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16 &lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 3: Development Queues&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The default resources of a queue define the number of tasks and the amount of memory if these are not given explicitly with the sbatch command. The resource options &#039;&#039;--time&#039;&#039;, &#039;&#039;--ntasks&#039;&#039;, &#039;&#039;--nodes&#039;&#039;, &#039;&#039;--mem&#039;&#039; and &#039;&#039;--mem-per-cpu&#039;&#039; are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].&lt;br /&gt;
&lt;br /&gt;
== Check available resources: sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo displays partition and node information for a system running Slurm. It incorporates downtime, reservations, and node state information when determining the available backfill window. The sinfo command itself can only be used by administrators.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC therefore provides a special script (sinfo_t_idle) to find out how many nodes are available for immediate use on the system. Users can use this information to submit jobs that fit the idle resources and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The following command displays what resources are available for immediate use for the whole partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle &lt;br /&gt;
Partition dev_cpu                 :      1 nodes idle&lt;br /&gt;
Partition cpu                     :      1 nodes idle&lt;br /&gt;
Partition highmem                 :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_h100            :      0 nodes idle&lt;br /&gt;
Partition gpu_h100                :      0 nodes idle&lt;br /&gt;
Partition gpu_mi300               :      0 nodes idle&lt;br /&gt;
Partition dev_cpu_il              :      7 nodes idle&lt;br /&gt;
Partition cpu_il                  :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_a100_il         :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_il             :      0 nodes idle&lt;br /&gt;
Partition gpu_h100_il             :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_short          :      0 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Running Jobs =&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
The most important Slurm commands for non-administrator users of bwUniCluster 3.0:&lt;br /&gt;
{| width=850px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Batch Jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
Batch jobs are submitted with the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. The main purpose of &#039;&#039;&#039;sbatch&#039;&#039;&#039; is to specify the resources that are needed to run the job. &#039;&#039;&#039;sbatch&#039;&#039;&#039; will then queue the batch job. However, when the batch job starts depends on the availability of the requested resources and on the fair-share value.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be used on the command line or in your job script. Different defaults for some of these options are set depending on the queue and can be found [[BwUniCluster3.0/Slurm | here]].&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! style=&amp;quot;width:8%&amp;quot;| Command line&lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;| Script&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t, --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N, --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n, --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count of tasks per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c, --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --gres=gpu:&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --gres=gpu:&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of GPUs required per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in MegaByte per node. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --exclusive&lt;br /&gt;
| #SBATCH --exclusive &lt;br /&gt;
| The job allocates all CPUs and GPUs on its nodes and does not share the nodes with other running jobs.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J, --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A, --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of &amp;quot;scontrol show job&amp;quot; shows the project group the job is accounted on behind &amp;quot;Account=&amp;quot;. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p, --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Job constraint LSDF filesystems.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Job constraint BeeOND filesystem.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
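Several of these options are typically combined in the header of a batch script. A minimal sketch (queue, resource values and file names are illustrative and must be adapted to your job):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --partition=cpu&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=01:00:00&lt;br /&gt;
#SBATCH --mem-per-cpu=2000mb&lt;br /&gt;
#SBATCH --job-name=my_job&lt;br /&gt;
#SBATCH --output=my_job_%j.out&lt;br /&gt;
&lt;br /&gt;
./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script is then submitted with &amp;lt;code&amp;gt;sbatch my_job.sh&amp;lt;/code&amp;gt;; options given on the command line override the values set in the script.&lt;br /&gt;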
&lt;br /&gt;
== Interactive Jobs: salloc ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 you are only allowed to run short jobs (&amp;lt;&amp;lt; 1 hour) with low memory requirements (&amp;lt;&amp;lt; 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs requesting more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, the following command has to be executed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -n 1 -t 120 --mem=5000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will then get one core on a compute node within the partition &amp;quot;cpu&amp;quot;. After executing this command, &#039;&#039;&#039;DO NOT CLOSE&#039;&#039;&#039; your current terminal session; wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core. To run a serial program on it, simply type the name of the executable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./&amp;lt;my_serial_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please be aware that in this example your serial job must finish within 2 hours, otherwise it will be killed by the system during runtime. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can now also open a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours, with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ xterm&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
An interactive parallel application running on one or more compute nodes (e.g. here 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte per node) and a maximum time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00  --mem=50gb&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run parallel jobs on 480 cores with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.&lt;br /&gt;
If you want access to another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to&lt;br /&gt;
connect to the running interactive job and then to a specific node:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --jobid=XXXXXXXX --pty /bin/bash&lt;br /&gt;
$ srun --nodelist=uc3nXXX --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
you can display the jobid and the nodelist.&lt;br /&gt;
&lt;br /&gt;
If you want to run MPI programs, simply type mpirun &amp;lt;program_name&amp;gt;; your program will then run on all 480 cores. A very simple example of starting a parallel job is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also start the debugger ddt by the commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module add devel/ddt&lt;br /&gt;
$ ddt &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The above commands execute the parallel program &amp;lt;my_mpi_program&amp;gt; on all available cores. You can also start parallel programs on a subset of the cores, for example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 50 &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are using Intel MPI, you must start &amp;lt;my_mpi_program&amp;gt; with the command mpiexec.hydra (instead of mpirun).&lt;br /&gt;
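For example (&amp;lt;my_mpi_program&amp;gt; is again a placeholder for your own executable):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpiexec.hydra -n 50 &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;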
&lt;br /&gt;
== Interactive Computing with Jupyter ==&lt;br /&gt;
&lt;br /&gt;
== Monitor and manage jobs ==&lt;br /&gt;
&lt;br /&gt;
=== List of your submitted jobs : squeue ===&lt;br /&gt;
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via the manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;squeue&#039;&#039; example on bwUniCluster 3.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  R       8:15      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123 PD       0:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  R       2:41      1 uc3n084&lt;br /&gt;
$ squeue -l&lt;br /&gt;
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  RUNNING       8:55     20:00      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123  PENDING       0:00     20:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  RUNNING       3:21     20:00      1 uc3n084&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Detailed job information : scontrol show job ===&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for a specified one. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via the manpage (man scontrol). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is an example from bwUniCluster 3.0:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
1262       cpu     wrap ka_zs040  R       1:12      1 uc3n002&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my running job with jobid 1262&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 1262&lt;br /&gt;
&lt;br /&gt;
JobId=1262 JobName=wrap&lt;br /&gt;
   UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=4246 Nice=0 Account=ka QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30&lt;br /&gt;
   AccrueTime=2025-04-04T10:01:30&lt;br /&gt;
   StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main&lt;br /&gt;
   Partition=cpu AllocNode:Sid=uc3n999:2819841&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc3n002&lt;br /&gt;
   BatchHost=uc3n002&lt;br /&gt;
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=2000M,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=2,mem=4000M,node=1,billing=2&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=(null)&lt;br /&gt;
   WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402&lt;br /&gt;
   StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Each request to the Slurm workload manager generates load. &amp;lt;p style=&amp;quot;color:red;&amp;quot;&amp;gt;&amp;lt;b&amp;gt;Therefore, do not use &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; with a simple &amp;lt;code&amp;gt;watch&amp;lt;/code&amp;gt;.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt; The smallest allowed polling interval is &amp;lt;b&amp;gt;30 seconds&amp;lt;/b&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
Any violation of this rule will result in the task being terminated without notice.&lt;br /&gt;
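If you need to monitor your jobs repeatedly, choose an interval of at least 30 seconds, for example:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ watch -n 60 squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;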
&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
The scancel command is used to cancel jobs. It is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via the manpage (man scancel). The syntax is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
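For example, to cancel the job with jobid 1262, or all of your jobs in the PENDING state (the jobid is illustrative):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel 1262&lt;br /&gt;
$ scancel -t PENDING&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;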
&lt;br /&gt;
= Slurm Options =&lt;br /&gt;
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]&lt;br /&gt;
&lt;br /&gt;
= Best Practices =&lt;br /&gt;
&lt;br /&gt;
== Step-by-Step example==&lt;br /&gt;
&lt;br /&gt;
== Dos and Don&#039;ts ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;| Do not run squeue or other Slurm commands in loops or via &amp;quot;watch&amp;quot;, so as not to saturate the Slurm daemon with RPC requests.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Policies&amp;diff=15560</id>
		<title>BwUniCluster3.0/Policies</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Policies&amp;diff=15560"/>
		<updated>2025-12-02T08:49:50Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Policies =&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;File system quotas&#039;&#039;&#039;&lt;br /&gt;
** HOME: &#039;&#039;&#039;500GB&#039;&#039;&#039;, &#039;&#039;&#039;5 million files (inodes)&#039;&#039;&#039;&lt;br /&gt;
** Workspace: &#039;&#039;&#039;40TB&#039;&#039;&#039;, &#039;&#039;&#039;20 million files (inodes)&#039;&#039;&#039;&lt;br /&gt;
** Throttling Policies: The &#039;&#039;&#039;maximum number of cores&#039;&#039;&#039; in use at any given time is 1920 per user (aggregated over all running jobs).&lt;br /&gt;
* &#039;&#039;&#039;Username and HOME directory for KIT users&#039;&#039;&#039;&lt;br /&gt;
** Like everyone else, KIT users&#039; usernames now have the two-character prefix of their home location: &#039;&#039;&#039;&amp;lt;code&amp;gt;ka_&amp;lt;/code&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
** The HOME directory for user &#039;&#039;ab1234&#039;&#039; would be: &#039;&#039;&#039;&amp;lt;code&amp;gt;/home/ka/ka_OE/ka_ab1234&amp;lt;/code&amp;gt;&#039;&#039;&#039; (OE: organizational unit)&lt;br /&gt;
** Login with SSH: &#039;&#039;&#039;&amp;lt;code&amp;gt;ssh ka_ab1234@uc3.scc.kit.edu&amp;lt;/code&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Access for KIT students&#039;&#039;&#039;&lt;br /&gt;
** KIT students can be granted access with their regular u-student account in the context of a lecture (cf. https://www.scc.kit.edu/servicedesk/formulare.php &amp;amp;rarr; Application Form for Students accounts on bwUniCluster).&lt;br /&gt;
** The account is only enabled &#039;&#039;&#039;during the lecture period&#039;&#039;&#039;. After the end of the semester, the accounts are deprovisioned and the user data is deleted.&lt;br /&gt;
** A guest and partner account (GuP) is required for all other projects of KIT students on bwUniCluster 3.0.&lt;br /&gt;
* &#039;&#039;&#039;Allowed Activities on Login Nodes&#039;&#039;&#039;&lt;br /&gt;
** To guarantee usability for all users of the clusters, you must not run compute jobs on the login nodes.&lt;br /&gt;
** Compute-intensive jobs must be submitted to the queuing system.&amp;lt;br/&amp;gt;&lt;br /&gt;
** &#039;&#039;&#039;Any compute job running on the login nodes will be terminated without any notice.&#039;&#039;&#039;&lt;br /&gt;
** Any long-running compilation or any long-running pre- or post-processing of batch jobs must also be submitted to the queuing system.&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Policies&amp;diff=15554</id>
		<title>BwUniCluster3.0/Policies</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Policies&amp;diff=15554"/>
		<updated>2025-12-02T08:40:12Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Policies =&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;File system quotas&#039;&#039;&#039;&lt;br /&gt;
** HOME: &#039;&#039;&#039;500GB&#039;&#039;&#039;, &#039;&#039;&#039;5 million files (inodes)&#039;&#039;&#039;&lt;br /&gt;
** Workspace: &#039;&#039;&#039;40TB&#039;&#039;&#039;, &#039;&#039;&#039;20 million files (inodes)&#039;&#039;&#039;&lt;br /&gt;
** Throttling Policies: The &#039;&#039;&#039;maximum number of cores&#039;&#039;&#039; in use at any given time is 1920 per user (aggregated over all running jobs).&lt;br /&gt;
* &#039;&#039;&#039;Username and HOME directory for KIT users&#039;&#039;&#039;&lt;br /&gt;
** Like everyone else, KIT users&#039; usernames now have the two-character prefix of their home location: &#039;&#039;&#039;&amp;lt;code&amp;gt;ka_&amp;lt;/code&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
** The HOME directory for user &#039;&#039;ab1234&#039;&#039; would be: &#039;&#039;&#039;&amp;lt;code&amp;gt;/home/ka/ka_OE/ka_ab1234&amp;lt;/code&amp;gt;&#039;&#039;&#039; (OE: organizational unit)&lt;br /&gt;
** Login with SSH: &#039;&#039;&#039;&amp;lt;code&amp;gt;ssh ka_ab1234@uc3.scc.kit.edu&amp;lt;/code&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Access for KIT students&#039;&#039;&#039;&lt;br /&gt;
** KIT students can be granted access with their regular u-student account in the context of a lecture (cf. https://www.scc.kit.edu/servicedesk/formulare.php &amp;amp;rarr; Application Form for Students accounts on bwUniCluster).&lt;br /&gt;
** The account is only enabled &#039;&#039;&#039;during the lecture period&#039;&#039;&#039;. After the end of the semester, the accounts are deprovisioned and the user data is deleted.&lt;br /&gt;
** A guest and partner account (GuP) is required for all other projects of KIT students on bwUniCluster 3.0.&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15552</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15552"/>
		<updated>2025-12-02T08:38:19Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at the Karlsruhe Institute of Technology (KIT). The bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no upcoming maintenance&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Next maintenance&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Due to regular maintenance work, the HPC system bwUniCluster 2.0 will not be available from &lt;br /&gt;
&lt;br /&gt;
21.05.2024 at 08:30 until 24.05.2024 at 15:00&lt;br /&gt;
&lt;br /&gt;
Please see the [[BwUniCluster2.0/Maintenance/2024-05|maintenance]] page for more information about planned upgrades and other changes.&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes, with the host system of both node types consisting of classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. The user only needs to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software and policy can be looked up here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (till July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Instructions for moving data that is still needed, user software, and user-specific settings from the old HOME directory to the new HOME directory or to new workspaces are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster3.0/Policies|Policies]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Policies&amp;diff=15551</id>
		<title>BwUniCluster3.0/Policies</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Policies&amp;diff=15551"/>
		<updated>2025-12-02T08:37:26Z</updated>

		<summary type="html">&lt;p&gt;S Braun: Created page with &amp;quot;= Policies =&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Policies =&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15549</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15549"/>
		<updated>2025-12-02T08:36:27Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at the Karlsruhe Institute of Technology (KIT). The bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no upcoming maintenance&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Next maintenance&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Due to regular maintenance work, the HPC system bwUniCluster 2.0 will not be available from &lt;br /&gt;
&lt;br /&gt;
21.05.2024 at 08:30 until 24.05.2024 at 15:00&lt;br /&gt;
&lt;br /&gt;
Please see the [[BwUniCluster2.0/Maintenance/2024-05|maintenance]] page for more information about planned upgrades and other changes.&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes, with the host system of both node types consisting of classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. The user only needs to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software and policy can be looked up here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (till July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Instructions for moving data that is still needed, user software, and user-specific settings from the old HOME directory to the new HOME directory or to new workspaces are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster3.0/Policies|Policies]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15547</id>
		<title>BwUniCluster3.0</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0&amp;diff=15547"/>
		<updated>2025-12-02T08:35:01Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## Picture of bwUniCluster - right side  ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## About bwUniCluster                    ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0+KIT-GFA-HPC 3&#039;&#039;&#039; is the joint high-performance computer system of Baden-Württemberg&#039;s Universities and Universities of Applied Sciences for &#039;&#039;&#039;general purpose and teaching&#039;&#039;&#039; and is located at the Scientific Computing Center (SCC) at the Karlsruhe Institute of Technology (KIT). The bwUniCluster 3.0 complements the four bwForClusters and their dedicated scientific areas.&lt;br /&gt;
[[File:DSCF6485_rectangled_perspective.jpg|center|600px|frameless|alt=bwUniCluster3.0 |upright=1| bwUniCluster 3.0 ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Maintenance Section     ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no upcoming maintenance&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Next maintenance&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Due to regular maintenance work, the HPC system bwUniCluster 2.0 will not be available from &lt;br /&gt;
&lt;br /&gt;
21.05.2024 at 08:30 until 24.05.2024 at 15:00&lt;br /&gt;
&lt;br /&gt;
Please see the [[BwUniCluster2.0/Maintenance/2024-05|maintenance]] page for more information about planned upgrades and other changes.&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: News section            ##&lt;br /&gt;
###########################################&lt;br /&gt;
## Comment out full section if there no news&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
{| style=&amp;quot;  background:#FEF4AB; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFE856; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Transition bwUniCluster 2.0 &amp;amp;rarr; bwUniCluster 3.0&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&lt;br /&gt;
The HPC cluster bwUniCluster 3.0 is the successor of bwUniCluster 2.0. It features accelerated and CPU-only nodes, with the host system of both node types consisting of classic x86 processor architectures.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To ensure that you can use the new system successfully and set up your working environment with ease, the following points should be noted.&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. The user only needs to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu .&lt;br /&gt;
&lt;br /&gt;
== Changes ==&lt;br /&gt;
&lt;br /&gt;
Hardware, software and the operating system have been updated and adapted to the latest standards. We would like to draw your attention in particular to the changes in policy, which must also be taken into account.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Changes to hardware, software and policy can be looked up here: [[BwUniCluster3.0/Data_Migration_Guide#Summary_of_changes|Summary of Changes]]&lt;br /&gt;
&lt;br /&gt;
== Migration ==&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system. &#039;&#039;&#039;There is no automatic migration of user data!&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The file systems of the old system and the login nodes will remain in operation for a period of &#039;&#039;&#039;3 months&#039;&#039;&#039; after the new system goes live (till July 6, 2025).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Instructions for moving data that is still needed, user software, and user-specific settings from the old HOME directory to the new HOME directory or to new workspaces are provided here: [[BwUniCluster3.0/Data_Migration_Guide#Migration_of_Data|Data Migration Guide]]&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Training/Support section##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Training &amp;amp; Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [[BwUniCluster3.0/Getting_Started|Getting Started]]&lt;br /&gt;
* [https://training.bwhpc.de E-Learning Courses]&lt;br /&gt;
* [[BwUniCluster3.0/Support|Support]]&lt;br /&gt;
* [[BwUniCluster3.0/FAQ|FAQ]]&lt;br /&gt;
* Send [[Feedback|Feedback]] about Wiki pages&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: User Documentation      ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | User Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]]&lt;br /&gt;
* [[BwUniCluster3.0/Login|Login]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Client|SSH Clients]]&lt;br /&gt;
** [[BwUniCluster3.0/Login/Data_Transfer|Data Transfer]]&lt;br /&gt;
* [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]]&lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#Compute_resources|Compute Resources]] &lt;br /&gt;
** [[BwUniCluster3.0/Hardware_and_Architecture#File_Systems|File Systems]] &lt;br /&gt;
* [[BwUniCluster3.0/Software|Cluster Specific Software]]&lt;br /&gt;
** [[BwUniCluster3.0/Containers|Using Containers]]&lt;br /&gt;
* [[BwUniCluster3.0/Running_Jobs|Running Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Batch_Jobs:_sbatch|Running Batch Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Running_Jobs#Interactive_Jobs:_salloc|Running Interactive Jobs]]&lt;br /&gt;
** [[BwUniCluster3.0/Jupyter|Interactive Computing with Jupyter]]&lt;br /&gt;
* [[BwUniCluster3.0/Maintenance|Operational Changes]]&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
###########################################&lt;br /&gt;
## bwUniCluster: Acknowledgement         ##&lt;br /&gt;
###########################################&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Cluster Funding&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* Please [[BwUniCluster3.0/Acknowledgement|acknowledge]] bwUniCluster 3.0 in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Registration/SSH&amp;diff=15384</id>
		<title>Registration/SSH</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Registration/SSH&amp;diff=15384"/>
		<updated>2025-11-07T13:21:13Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Minimum requirements for SSH Keys */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
This process is only necessary for the bwUniCluster and the bwForClusters Helix and NEMO2.&lt;br /&gt;
On the other clusters, SSH keys can still be copied to the &amp;lt;code&amp;gt;authorized_keys&amp;lt;/code&amp;gt; file.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Registering SSH Keys with your Cluster =&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Interactive SSH Keys are not valid indefinitely, but only for a few hours after the last 2-factor login.&lt;br /&gt;
They have to be &amp;quot;unlocked&amp;quot; by entering the OTP and service password.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;SSH Keys&#039;&#039;&#039; are a mechanism for logging into a computer system without having to enter a password. Instead of authenticating yourself with something you know (a password), you prove your identity by showing the server something you have (a cryptographic key).&lt;br /&gt;
&lt;br /&gt;
The usual process is the following:&lt;br /&gt;
&lt;br /&gt;
* The user generates a pair of SSH Keys, a private key and a public key, on their local system. The private key never leaves the local system.&lt;br /&gt;
&lt;br /&gt;
* The user then logs into the remote system using the remote system password and adds the public key to the file &amp;lt;code&amp;gt;~/.ssh/authorized_keys&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* All following logins will no longer require the entry of the remote system password because the local system can prove to the remote system that it has a private key matching the public key on file.&lt;br /&gt;
&lt;br /&gt;
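The first step above, generating a key pair on the local system, typically looks like this (a sketch; the file name and comment are examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -C &amp;quot;user@laptop&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
When prompted, choose a strong passphrase for the private key.&lt;br /&gt;
&lt;br /&gt;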
While SSH Keys have many advantages, the concept also has &#039;&#039;&#039;a number of issues&#039;&#039;&#039; which make it hard to handle them securely:&lt;br /&gt;
&lt;br /&gt;
* The private key on the local system is supposed to be protected by a strong passphrase. There is no possibility for the server to check if this is the case. Many users do not use a strong passphrase or do not use any passphrase at all. If such a private key is stolen, an attacker can immediately use it to access the remote system.&lt;br /&gt;
&lt;br /&gt;
* There is no concept of validity. Users are not forced to regularly generate new SSH Key pairs and replace the old ones. Often the same key pair is used for many years and the users have no overview of how many systems they have stored their SSH Keys on.&lt;br /&gt;
&lt;br /&gt;
* SSH Keys can be restricted so they can only be used to execute specific commands on the server, or to log in from specified IP addresses. Most users do not do this.&lt;br /&gt;
&lt;br /&gt;
To fix these issues &#039;&#039;&#039;it is no longer possible to self-manage your SSH Keys by adding them to the ~/.ssh/authorized_keys file&#039;&#039;&#039; on bwUniCluster/bwForCluster.&lt;br /&gt;
SSH Keys have to be managed through bwIDM/bwServices instead.&lt;br /&gt;
Existing authorized_keys files are ignored.&lt;br /&gt;
&lt;br /&gt;
== Minimum requirements for SSH Keys ==&lt;br /&gt;
&lt;br /&gt;
Algorithms and key sizes:&lt;br /&gt;
&lt;br /&gt;
* 2048 bits or more for RSA&lt;br /&gt;
* 521 bits for ECDSA&lt;br /&gt;
* 256 bits (default) for ED25519&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Please set a strong passphrase for your private keys.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
ECDSA-SK and ED25519-SK keys (for use with U2F/FIDO Hardware Tokens like Yubikeys) can currently only be used on NEMO2 and bwUniCluster 3.0.&lt;br /&gt;
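&lt;br /&gt;
A key pair meeting these requirements can be generated on your local system with &amp;lt;code&amp;gt;ssh-keygen&amp;lt;/code&amp;gt;. For example, for ED25519:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ssh-keygen -t ed25519 -f ~/.ssh/&amp;lt;filename&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or, for a 4096-bit RSA key:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/&amp;lt;filename&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
You will be prompted for a passphrase; please choose a strong one. The public key is written to &amp;lt;code&amp;gt;~/.ssh/&amp;lt;filename&amp;gt;.pub&amp;lt;/code&amp;gt;.&lt;br /&gt;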
&lt;br /&gt;
= Adding a new SSH Key =&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
* Newly added keys are valid for 180 days. After that, they are revoked and placed on a &amp;quot;revocation list&amp;quot; so that they cannot be reused.&lt;br /&gt;
* Copy only the contents of your public ssh key file to bwIDM/bwServices. The file ends with &amp;lt;code&amp;gt;.pub&amp;lt;/code&amp;gt; ( e.g. &amp;lt;code&amp;gt;~/.ssh/&amp;lt;filename&amp;gt;.pub&amp;lt;/code&amp;gt;).&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;SSH keys&#039;&#039;&#039; are generally managed via the &#039;&#039;&#039;My SSH Pubkeys&#039;&#039;&#039; menu entry on the registration pages for the clusters.&lt;br /&gt;
Here you can add and revoke SSH keys. To add an SSH key, please follow these steps:&lt;br /&gt;
&lt;br /&gt;
1. &#039;&#039;&#039;Select the cluster&#039;&#039;&#039; for which you want to add the SSH key:&amp;lt;/br&amp;gt; &amp;amp;rarr; [https://login.bwidm.de/user/ssh-keys.xhtml &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039;]&amp;lt;/br&amp;gt; &amp;amp;rarr; [https://bwservices.uni-heidelberg.de/user/ssh-keys.xhtml &#039;&#039;&#039;bwForCluster Helix&#039;&#039;&#039;]&amp;lt;/br&amp;gt; &amp;amp;rarr; [https://login.bwidm.de/user/ssh-keys.xhtml &#039;&#039;&#039;bwForCluster NEMO 2&#039;&#039;&#039;]&lt;br /&gt;
[[File:BwIDM-twofa.png|center|600px|thumb|My SSH Pubkeys.]]&lt;br /&gt;
&lt;br /&gt;
2. Click the &#039;&#039;&#039;Add SSH Key&#039;&#039;&#039; or &#039;&#039;&#039;SSH Key Hochladen&#039;&#039;&#039; button.&lt;br /&gt;
[[File:Bwunicluster 2.0 access ssh keys empty.png|center|400px|thumb|Add new SSH key.]]&lt;br /&gt;
&lt;br /&gt;
3. A new window will appear.&lt;br /&gt;
Enter a name for the key and paste your SSH public key (file &amp;lt;code&amp;gt;~/.ssh/&amp;lt;filename&amp;gt;.pub&amp;lt;/code&amp;gt;) into the box labelled &amp;quot;SSH Key:&amp;quot;.&lt;br /&gt;
Click on the button labelled &#039;&#039;&#039;Add&#039;&#039;&#039; or &#039;&#039;&#039;Hinzufügen&#039;&#039;&#039;.&lt;br /&gt;
[[File:Ssh-key.png|center|600px|thumb|Add new SSH key.]]&lt;br /&gt;
&lt;br /&gt;
4. If everything worked fine, your new key will show up in the user interface:&lt;br /&gt;
[[File:Ssh-success.png|center|800px|thumb|New SSH key added.]]&lt;br /&gt;
&lt;br /&gt;
Once you have added SSH keys to the system, you can bind them to one or more services, either for interactive logins (&#039;&#039;&#039;Interactive key&#039;&#039;&#039;) or for automatic logins (&#039;&#039;&#039;Command key&#039;&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Registering an Interactive Key ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Interactive SSH Keys are not valid all the time, but only for a few hours after the last 2-factor login.&lt;br /&gt;
They have to be &amp;quot;unlocked&amp;quot; by entering the OTP and service password.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Interactive Keys&#039;&#039;&#039; can be used to log into a system for interactive use.&lt;br /&gt;
Perform the following steps to register an interactive key:&lt;br /&gt;
&lt;br /&gt;
1. [[Registration/SSH#Adding_a_new_SSH_Key|&#039;&#039;&#039;Add a new interactive SSH key&#039;&#039;&#039;]] if you have not already done so.&lt;br /&gt;
&lt;br /&gt;
2. Select &#039;&#039;&#039;Registered services/Registrierte Dienste&#039;&#039;&#039; from the top menu and click &#039;&#039;&#039;Set SSH Key/SSH Key setzen&#039;&#039;&#039; for the cluster for which you want to use the SSH key.&lt;br /&gt;
[[File:BwIDM-registered.png|center|600px|thumb|Select Cluster for which you want to use the SSH key.]]&lt;br /&gt;
&lt;br /&gt;
3. The upper block displays the SSH keys currently registered for the service.&lt;br /&gt;
The bottom block displays all the public SSH keys associated with your account.&lt;br /&gt;
Find the SSH key you want to use and click &#039;&#039;&#039;Add/Hinzufügen&#039;&#039;&#039;.&lt;br /&gt;
[[File:Ssh-service-int.png|center|800px|thumb|Add SSH key to service.]]&lt;br /&gt;
&lt;br /&gt;
4. A new window appears.&lt;br /&gt;
Select &#039;&#039;&#039;Interactive&#039;&#039;&#039; as the usage type, enter an optional comment and click &#039;&#039;&#039;Add/Hinzufügen&#039;&#039;&#039;.&lt;br /&gt;
[[File:Ssh-int.png|center|600px|thumb|Add interactive SSH key to service.]]&lt;br /&gt;
&lt;br /&gt;
5. Your SSH key is now registered for interactive use with this service.&lt;br /&gt;
[[File:Ssh-service.png|center|800px|thumb|SSH key is now registered for interactive use.]]&lt;br /&gt;
&lt;br /&gt;
=== SSH Interactive Key valid after successful Login ===&lt;br /&gt;
&lt;br /&gt;
Interactive SSH Keys are not valid all the time, but only for a few hours after the last 2-factor login.&lt;br /&gt;
They have to be &amp;quot;unlocked&amp;quot; by entering the OTP and service password.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! style=&amp;quot;width:50%&amp;quot;| Cluster&lt;br /&gt;
! style=&amp;quot;width:50%&amp;quot;| Interactive SSH Key Validity&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| bwUniCluster 3.0&lt;br /&gt;
| 8h&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| bwForCluster Helix&lt;br /&gt;
| 12h&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| bwForCluster NEMO 2&lt;br /&gt;
| 12h&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
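&lt;br /&gt;
In practice this means: log in once with your service password and OTP, for example via&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ssh &amp;lt;username&amp;gt;@&amp;lt;login-node&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
(replace the placeholders &amp;lt;username&amp;gt; and &amp;lt;login-node&amp;gt; with your account name and the login node of your cluster). Within the validity window shown above, subsequent logins with the registered interactive key succeed without entering password and OTP again.&lt;br /&gt;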
&lt;br /&gt;
== Registering a Command Key ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
SSH command keys are always valid and do not need to be unlocked with a 2-factor login.&lt;br /&gt;
This makes these keys extremely valuable to a potential attacker and poses a security risk.&lt;br /&gt;
Therefore, additional restrictions apply to these keys:&lt;br /&gt;
* They must be limited to a single command to be executed.&lt;br /&gt;
* They must be limited to a single IP address (e.g., the workflow server) or a small number of IP addresses (e.g., the institution&#039;s subnet).&lt;br /&gt;
* They must be reviewed and approved by a cluster administrator before they can be used.&lt;br /&gt;
* Validity is reduced to one month.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Command Keys&#039;&#039;&#039; can be used for automatic workflows.&lt;br /&gt;
If you want to use rsync, please read the [[Registration/SSH/rrsync|rrsync wiki]].&lt;br /&gt;
&lt;br /&gt;
Perform the following steps to register a &amp;quot;Command key&amp;quot; (in this example we use rrsync):&lt;br /&gt;
&lt;br /&gt;
1. [[Registration/SSH#Adding_a_new_SSH_Key|&#039;&#039;&#039;Add a new &amp;quot;SSH key&amp;quot;&#039;&#039;&#039;]] if you have not already done so.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
2. Select &#039;&#039;&#039;Registered services/Registrierte Dienste&#039;&#039;&#039; from the top menu and click &#039;&#039;&#039;Set SSH Key/SSH Key setzen&#039;&#039;&#039; for the cluster for which you want to use the SSH key.&lt;br /&gt;
[[File:BwIDM-registered.png|center|600px|thumb|Select Cluster for which you want to use the SSH key.]]&lt;br /&gt;
&lt;br /&gt;
3. The upper block displays the SSH keys currently registered for the service.&lt;br /&gt;
The bottom block displays all the public SSH keys associated with your account.&lt;br /&gt;
Find the SSH key you want to use and click &#039;&#039;&#039;Add/Hinzufügen&#039;&#039;&#039;.&lt;br /&gt;
[[File:Ssh-service-com.png|center|800px|thumb|Add SSH key to service.]]&lt;br /&gt;
&lt;br /&gt;
4. A new window appears.&lt;br /&gt;
Select &#039;&#039;&#039;Command&#039;&#039;&#039; as the usage type.&lt;br /&gt;
Type the full command with the full path, including all parameters, in the &#039;&#039;&#039;Command&#039;&#039;&#039; text box.&lt;br /&gt;
Specify a network address, list, or range in the &#039;&#039;&#039;From&#039;&#039;&#039; text field (see [https://man.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man8/sshd.8#from=_pattern-list_ man 8 sshd] for more info).&lt;br /&gt;
Please also provide a comment to speed up the approval process.&lt;br /&gt;
Click &#039;&#039;&#039;Add/Hinzufügen&#039;&#039;&#039;.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! | Example&lt;br /&gt;
|-&lt;br /&gt;
| If you want to register a command key to be able to transfer data automatically, please use a string like the following in the &#039;&#039;&#039;Command&#039;&#039;&#039; text field (please verify the path on the cluster first):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/usr[/local]/bin/rrsync -ro /home/aa/aa_bb/aa_abc1/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
[[File:Ssh-com.png|center|600px|thumb|Add command SSH key to service.]]&lt;br /&gt;
&lt;br /&gt;
5. After the key has been added, it will be marked as &#039;&#039;&#039;Pending&#039;&#039;&#039;:&lt;br /&gt;
You will receive an e-mail as soon as the key has been approved and can be used.&lt;br /&gt;
[[File:Ssh-service.png|center|800px|thumb|SSH key is pending approval.]]&lt;br /&gt;
&lt;br /&gt;
== Revoke/Delete SSH Key ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Revoked keys are locked and can no longer be used.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;SSH keys&#039;&#039;&#039; are generally managed via the &#039;&#039;&#039;My SSH Pubkeys&#039;&#039;&#039; menu entry on the registration pages for the clusters.&lt;br /&gt;
Here you can add and revoke SSH keys. To revoke/delete an SSH key, please follow these steps:&lt;br /&gt;
&lt;br /&gt;
1. &#039;&#039;&#039;Select the cluster&#039;&#039;&#039; for which you want to delete the SSH key:&amp;lt;/br&amp;gt; &amp;amp;rarr; [https://login.bwidm.de/user/ssh-keys.xhtml &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039;]&amp;lt;/br&amp;gt; &amp;amp;rarr; [https://bwservices.uni-heidelberg.de/user/ssh-keys.xhtml &#039;&#039;&#039;bwForCluster Helix&#039;&#039;&#039;]&amp;lt;/br&amp;gt; &amp;amp;rarr; [https://login.bwidm.de/user/ssh-keys.xhtml &#039;&#039;&#039;bwForCluster NEMO 2&#039;&#039;&#039;]&lt;br /&gt;
[[File:BwIDM-twofa.png|center|600px|thumb|My SSH Pubkeys.]]&lt;br /&gt;
&lt;br /&gt;
2. Click &#039;&#039;&#039;REVOKE/ZURÜCKZIEHEN&#039;&#039;&#039; next to the SSH key you want to revoke.&lt;br /&gt;
[[File:Ssh-success.png|center|800px|thumb|Revoke SSH key.]]&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15364</id>
		<title>BwUniCluster3.0/Running Jobs</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15364"/>
		<updated>2025-10-24T08:07:44Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Purpose and function of a queuing system =&lt;br /&gt;
&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are either executed automatically via a batch script or the nodes can be used interactively.&amp;lt;br&amp;gt;&lt;br /&gt;
For the general procedure, see [[Running_Calculations | Running Calculations]].&lt;br /&gt;
&lt;br /&gt;
== Job submission process ==&lt;br /&gt;
&lt;br /&gt;
bwUniCluster 3.0 uses the workload manager Slurm. All user job submissions must therefore be made with Slurm commands. Slurm queues and runs user jobs based on fair-share policies.&lt;br /&gt;
&lt;br /&gt;
== Slurm ==&lt;br /&gt;
&lt;br /&gt;
The HPC workload manager on bwUniCluster 3.0 is Slurm, a cluster management and job scheduling system. Slurm has three key functions: &lt;br /&gt;
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work. &lt;br /&gt;
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. &lt;br /&gt;
* It arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&lt;br /&gt;
Any calculation on the compute nodes of bwUniCluster 3.0 requires the user to define it as a sequence of commands, specify the required run time, number of CPU cores and main memory, and submit all of this, i.e. the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to the resource and workload management software.&lt;br /&gt;
&lt;br /&gt;
== Terms and definitions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Partitions &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm manages job queues for different &#039;&#039;&#039;partitions&#039;&#039;&#039;. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different partitions:&lt;br /&gt;
&lt;br /&gt;
* CPU-only nodes&lt;br /&gt;
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each&lt;br /&gt;
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each&lt;br /&gt;
* GPU-accelerated nodes&lt;br /&gt;
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs&lt;br /&gt;
** 4-socket node with 4x AMD Instinct accelerator&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Queues &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Job &#039;&#039;&#039;queues&#039;&#039;&#039; are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different main types of queues:&lt;br /&gt;
* Regular queues&lt;br /&gt;
** cpu: Jobs that request CPU-only nodes.&lt;br /&gt;
** gpu: Jobs that request GPU-accelerated nodes.&lt;br /&gt;
* Development queues (dev)&lt;br /&gt;
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to give users immediate access to compute resources without long waiting times. This is the place for short, heavy test computations that would otherwise affect other users if run on the login nodes.&lt;br /&gt;
&lt;br /&gt;
Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 &amp;lt;font color=red&amp;gt;requires at least the specification of the &#039;&#039;&#039;queue&#039;&#039;&#039; and the &#039;&#039;&#039;time&#039;&#039;&#039;&amp;lt;/font&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Jobs &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Jobs can be run non-interactively as &#039;&#039;&#039;batch jobs&#039;&#039;&#039; or as &#039;&#039;&#039;interactive jobs&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This script is queued and executed as soon as the requested compute resources are available and allocated. Jobs are enqueued with the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command.&lt;br /&gt;
For interactive jobs, the resources are requested with the &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.&lt;br /&gt;
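&lt;br /&gt;
For example, a short interactive session could be requested as follows (queue name and resources are illustrative; adjust them to your needs):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc --partition=dev_cpu --time=00:30:00 --ntasks=4&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;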
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Please remember:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Heavy computations are not allowed on the login nodes&#039;&#039;&#039;.&amp;lt;br&amp;gt;Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
* &#039;&#039;&#039;Development queues&#039;&#039;&#039; are meant for &#039;&#039;&#039;development tasks&#039;&#039;&#039;.&amp;lt;br&amp;gt;Do not misuse this queue for regular, short-running jobs or chain jobs! Only one job may run at a time, and at most 3 jobs may be queued.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Queues on bwUniCluster 3.0 = &lt;br /&gt;
== Policy ==&lt;br /&gt;
&lt;br /&gt;
The computing time is provided in accordance with the &#039;&#039;&#039;fair share policy&#039;&#039;&#039;. The individual investment shares of the respective university and the resources already used by its members are taken into account. Furthermore, the following throttling policy is also active: the &#039;&#039;&#039;maximum number of physical cores&#039;&#039;&#039; in use at any given time by running jobs is &#039;&#039;&#039;1920 per user&#039;&#039;&#039; (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and to maximize the number of users who can access computing time at the same time.&lt;br /&gt;
&lt;br /&gt;
== Regular Queues ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node-Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
| mem-per-cpu=12090mb&lt;br /&gt;
| mem=380001mb&lt;br /&gt;
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
| mem-per-gpu=128200mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=48:00:00, nodes=9(A100)/nodes=5(H100) , mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 1: Regular Queues&lt;br /&gt;
&lt;br /&gt;
== Short Queues ==&lt;br /&gt;
&amp;lt;p style=&amp;quot;color:red; &amp;quot;&amp;gt;&amp;lt;b&amp;gt;Queues with a short runtime of 30 minutes.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt; &lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=94000mb&amp;lt;br/&amp;gt;cpus-per-gpu=12&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)&lt;br /&gt;
|}&lt;br /&gt;
Table 2: Short Queues&lt;br /&gt;
&lt;br /&gt;
== Development Queues ==&lt;br /&gt;
Only for development, i.e. debugging or performance optimization ...&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_a100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&amp;lt;br/&amp;gt;&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16 &lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 3: Development Queues&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The default resources of a queue define the number of tasks and the amount of memory if these are not explicitly given with the sbatch command. The resource options &#039;&#039;--time&#039;&#039;, &#039;&#039;--ntasks&#039;&#039;, &#039;&#039;--nodes&#039;&#039;, &#039;&#039;--mem&#039;&#039; and &#039;&#039;--mem-per-cpu&#039;&#039; are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].&lt;br /&gt;
&lt;br /&gt;
== Check available resources: sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The following command displays what resources are available for immediate use in each partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle &lt;br /&gt;
Partition dev_cpu                 :      1 nodes idle&lt;br /&gt;
Partition cpu                     :      1 nodes idle&lt;br /&gt;
Partition highmem                 :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_h100            :      0 nodes idle&lt;br /&gt;
Partition gpu_h100                :      0 nodes idle&lt;br /&gt;
Partition gpu_mi300               :      0 nodes idle&lt;br /&gt;
Partition dev_cpu_il              :      7 nodes idle&lt;br /&gt;
Partition cpu_il                  :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_a100_il         :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_il             :      0 nodes idle&lt;br /&gt;
Partition gpu_h100_il             :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_short          :      0 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Running Jobs =&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
Important Slurm commands for non-administrators working on bwUniCluster 3.0.&lt;br /&gt;
{| width=850px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
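&lt;br /&gt;
Typical everyday use of these commands looks like this (the job ID 123456 is only an example):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue -u $USER              # list your own jobs&lt;br /&gt;
$ squeue -u $USER --start      # show estimated start times&lt;br /&gt;
$ scontrol show job 123456     # detailed information on one job&lt;br /&gt;
$ scancel 123456               # cancel this job&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;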
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Batch Jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
Batch jobs are submitted by using the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. The main purpose of the &#039;&#039;&#039;sbatch&#039;&#039;&#039; command is to specify the resources that are needed to run the job. &#039;&#039;&#039;sbatch&#039;&#039;&#039; will then queue the batch job. However, the start of a batch job depends on the availability of the requested resources and the fair-share value.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].&lt;br /&gt;
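&lt;br /&gt;
A minimal job script combining the most common options might look like this (the program name is illustrative; remember that queue and time must always be specified, the queue via the Slurm &amp;lt;code&amp;gt;--partition&amp;lt;/code&amp;gt; option):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --partition=cpu&lt;br /&gt;
#SBATCH --time=01:00:00&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --job-name=example&lt;br /&gt;
#SBATCH --output=example_%j.out&lt;br /&gt;
./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Submit it with &amp;lt;code&amp;gt;sbatch jobscript.sh&amp;lt;/code&amp;gt;.&lt;br /&gt;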
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! style=&amp;quot;width:8%&amp;quot;| Command line&lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;| Script&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t, --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N, --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n, --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count of tasks per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c, --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in MegaByte per node. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --exclusive&lt;br /&gt;
| #SBATCH --exclusive &lt;br /&gt;
| The job allocates all CPUs and GPUs on its nodes and does not share them with other running jobs.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J, --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A, --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command &amp;quot;scontrol show job&amp;quot; shows the project group the job is accounted on behind &amp;quot;Account=&amp;quot;. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p, --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Constrain the job to nodes with access to the LSDF filesystems.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Request an on-demand BeeOND filesystem for the job.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
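The options in the table above are typically combined in a single job script. Below is a minimal sketch of such a script; the partition name, resource values and file names are illustrative assumptions, not site defaults, and the payload is a plain `echo` so the script also runs outside Slurm (`#SBATCH` lines are ordinary shell comments).

```shell
#!/bin/bash
# Minimal batch-script sketch (illustrative values, not site defaults).
#SBATCH --partition=cpu           # queue for the resource allocation
#SBATCH --time=00:20:00           # wall clock time limit
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks-per-node=4       # tasks per node
#SBATCH --job-name=demo           # job name
#SBATCH --output=demo-%j.out      # stdout file (%j expands to the job id)

# Slurm reads the #SBATCH comments above; the shell itself ignores them.
host=$(hostname)
echo "demo job running on ${host}"
```

Submit the script with `sbatch demo.sh`.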
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Interactive Jobs: salloc ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 you are only allowed to run short jobs (&amp;lt;&amp;lt; 1 hour) with small memory requirements (&amp;lt;&amp;lt; 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs requesting more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For a serial application on a compute node that requires 5000 MByte of memory, with the interactive run limited to 2 hours, the following command has to be executed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -n 1 -t 120 --mem=5000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will then get one core on a compute node within the partition &amp;quot;cpu&amp;quot;. After executing this command, &#039;&#039;&#039;DO NOT CLOSE&#039;&#039;&#039; your current terminal session; wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core, simply type the name of the executable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./&amp;lt;my_serial_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please be aware that your serial job must finish within 2 hours in this example; otherwise it will be killed by the system. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ xterm&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that once the walltime limit has been reached, the resources, i.e. the compute node, are automatically revoked.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
An interactive parallel application running on one or several compute nodes (e.g. 5 nodes with 96 cores each) usually requires an amount of memory per node (e.g. 50 GByte) and a maximum time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00  --mem=50gb&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.&lt;br /&gt;
If you want to have access to another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to&lt;br /&gt;
connect to the running interactive job and then to a specific node:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --jobid=XXXXXXXX --pty /bin/bash&lt;br /&gt;
$ srun --nodelist=uc3nXXX --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the jobid and the nodelist of your jobs are displayed.&lt;br /&gt;
&lt;br /&gt;
If you want to run MPI programs, you can do so by simply typing mpirun &amp;lt;program_name&amp;gt;. Your program will then run on all 480 cores. A very simple example for starting a parallel job is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also start the debugger DDT with the commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module add devel/ddt&lt;br /&gt;
$ ddt &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The above commands execute the parallel program &amp;lt;my_mpi_program&amp;gt; on all available cores. You can also start parallel programs on a subset of cores, for example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 50 &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are using Intel MPI, you must start &amp;lt;my_mpi_program&amp;gt; with the command mpiexec.hydra (instead of mpirun).&lt;br /&gt;
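The choice between mpirun and mpiexec.hydra can be scripted. The snippet below is a hedged sketch, not cluster-provided tooling: it assumes that mpiexec.hydra is only found in PATH when an Intel MPI module is loaded, and falls back to mpirun otherwise.

```shell
#!/bin/bash
# Pick the MPI launcher: mpiexec.hydra for Intel MPI, mpirun otherwise.
# Assumption: mpiexec.hydra is in PATH only when an Intel MPI module is loaded.
if command -v mpiexec.hydra >/dev/null 2>&1; then
    launcher=mpiexec.hydra
else
    launcher=mpirun
fi
echo "launcher: ${launcher}"
# The actual start would then be: "$launcher" ./my_mpi_program
```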
&lt;br /&gt;
== Interactive Computing with Jupyter ==&lt;br /&gt;
&lt;br /&gt;
== Monitor and manage jobs ==&lt;br /&gt;
&lt;br /&gt;
=== List of your submitted jobs : squeue ===&lt;br /&gt;
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail at https://slurm.schedmd.com/squeue.html or via the manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;squeue&#039;&#039; example on bwUniCluster 3.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  R       8:15      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123 PD       0:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  R       2:41      1 uc3n084&lt;br /&gt;
$ squeue -l&lt;br /&gt;
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  RUNNING       8:55     20:00      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123  PENDING       0:00     20:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  RUNNING       3:21     20:00      1 uc3n084&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Detailed job information : scontrol show job ===&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for a single specified job. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail at https://slurm.schedmd.com/scontrol.html or via the manpage (man scontrol). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is an example from bwUniCluster 3.0:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
1262       cpu     wrap ka_zs040  R       1:12      1 uc3n002&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my running job with jobid 1262&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 1262&lt;br /&gt;
&lt;br /&gt;
JobId=1262 JobName=wrap&lt;br /&gt;
   UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=4246 Nice=0 Account=ka QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30&lt;br /&gt;
   AccrueTime=2025-04-04T10:01:30&lt;br /&gt;
   StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main&lt;br /&gt;
   Partition=cpu AllocNode:Sid=uc3n999:2819841&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc3n002&lt;br /&gt;
   BatchHost=uc3n002&lt;br /&gt;
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=2000M,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=2,mem=4000M,node=1,billing=2&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=(null)&lt;br /&gt;
   WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402&lt;br /&gt;
   StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Each request to the Slurm workload manager generates load. &amp;lt;p style=&amp;quot;color:red;&amp;quot;&amp;gt;&amp;lt;b&amp;gt;Therefore, do not use &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; with a simple &amp;lt;code&amp;gt;watch&amp;lt;/code&amp;gt;.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt; The smallest allowed time interval is &amp;lt;b&amp;gt;30 seconds&amp;lt;/b&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
Any violation of this rule will result in the task being terminated without notice.&lt;br /&gt;
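Instead of `watch squeue`, the job state can be polled in a loop that honours the 30-second minimum interval. This is a minimal sketch under stated assumptions: it is meant to run on a login node where squeue is available, and the job id is passed as the first argument.

```shell
#!/bin/bash
# Poll a job at the minimum allowed interval of 30 seconds (never faster).
# Sketch only: assumes squeue is available; pass the job id as $1.
jobid=${1:-}
interval=30
while [ -n "$jobid" ] && squeue -h -j "$jobid" 2>/dev/null | grep -q .; do
    sleep "$interval"
done
echo "job ${jobid:-<none>} is no longer in the queue"
```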
&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
The scancel command is used to cancel jobs. The command scancel is explained in detail at https://slurm.schedmd.com/scancel.html or via the manpage (man scancel). The syntax is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Slurm Options =&lt;br /&gt;
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]&lt;br /&gt;
&lt;br /&gt;
= Best Practices =&lt;br /&gt;
&lt;br /&gt;
== Step-by-Step example==&lt;br /&gt;
&lt;br /&gt;
== Dos and Don&#039;ts ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;| Do not run squeue and other Slurm commands in loops or with &amp;quot;watch&amp;quot;, so as not to saturate the Slurm daemon with RPC requests.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15363</id>
		<title>BwUniCluster3.0/Running Jobs</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15363"/>
		<updated>2025-10-24T08:06:40Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Detailed job information : scontrol show job */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Purpose and function of a queuing system =&lt;br /&gt;
&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are either executed automatically via a batch script or the resources can be accessed interactively.&amp;lt;br&amp;gt;&lt;br /&gt;
General procedure: Hint to [[Running_Calculations | Running Calculations]]&lt;br /&gt;
&lt;br /&gt;
== Job submission process ==&lt;br /&gt;
&lt;br /&gt;
bwUniCluster 3.0 uses the workload management software Slurm. Therefore, any job submission by the user has to be performed with commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.&lt;br /&gt;
&lt;br /&gt;
== Slurm ==&lt;br /&gt;
&lt;br /&gt;
The HPC workload manager on bwUniCluster 3.0 is Slurm.&lt;br /&gt;
Slurm is a cluster management and job scheduling system with three key functions. &lt;br /&gt;
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work. &lt;br /&gt;
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. &lt;br /&gt;
* It arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&lt;br /&gt;
Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to the resource and workload management software.&lt;br /&gt;
&lt;br /&gt;
== Terms and definitions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Partitions &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm manages job queues for different &#039;&#039;&#039;partitions&#039;&#039;&#039;. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different partitions:&lt;br /&gt;
&lt;br /&gt;
* CPU-only nodes&lt;br /&gt;
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each&lt;br /&gt;
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each&lt;br /&gt;
* GPU-accelerated nodes&lt;br /&gt;
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs&lt;br /&gt;
** 4-socket node with 4x AMD Instinct accelerator&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Queues &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Job &#039;&#039;&#039;queues&#039;&#039;&#039; are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different main types of queues:&lt;br /&gt;
* Regular queues&lt;br /&gt;
** cpu: Jobs that request CPU-only nodes.&lt;br /&gt;
** gpu: Jobs that request GPU-accelerated nodes.&lt;br /&gt;
* Development queues (dev)&lt;br /&gt;
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place for short, immediate heavy computations that would otherwise affect other users on the login nodes.&lt;br /&gt;
&lt;br /&gt;
Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 &amp;lt;font color=red&amp;gt;requires at least the specification of the &#039;&#039;&#039;queue&#039;&#039;&#039; and the &#039;&#039;&#039;time&#039;&#039;&#039;&amp;lt;/font&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Jobs &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Jobs can be run non-interactively as &#039;&#039;&#039;batch jobs&#039;&#039;&#039; or as &#039;&#039;&#039;interactive jobs&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command.&lt;br /&gt;
For interactive jobs, the resources are requested with the &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Please remember:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Heavy computations are not allowed on the login nodes&#039;&#039;&#039;.&amp;lt;br&amp;gt;Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
* &#039;&#039;&#039;Development queues&#039;&#039;&#039; are meant for &#039;&#039;&#039;development tasks&#039;&#039;&#039;.&amp;lt;br&amp;gt;Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is allowed, and at most 3 jobs may be queued.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Queues on bwUniCluster 3.0 = &lt;br /&gt;
== Policy ==&lt;br /&gt;
&lt;br /&gt;
The computing time is provided in accordance with the &#039;&#039;&#039;fair share policy&#039;&#039;&#039;. The individual investment shares of the respective university and the resources already used by its members are taken into account. Furthermore, the following throttling policy is also active: The &#039;&#039;&#039;maximum amount of physical cores&#039;&#039;&#039; used at any given time from jobs running is &#039;&#039;&#039;1920 per user&#039;&#039;&#039; (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.&lt;br /&gt;
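The stated core cap translates into whole nodes as follows; a quick arithmetic check of the numbers quoted above, assuming 64 physical cores per Ice Lake node and 96 per standard node as given elsewhere on this page:

```shell
# Sanity check of the per-user throttling policy quoted above:
# 1920 cores = 30 Ice Lake nodes (64 cores each) = 20 standard nodes (96 cores each).
cap=1920
ice_lake_nodes=$(( cap / 64 ))
standard_nodes=$(( cap / 96 ))
echo "cap ${cap} cores = ${ice_lake_nodes} Ice Lake nodes = ${standard_nodes} standard nodes"
```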
&lt;br /&gt;
== Regular Queues ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node-Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
| mem-per-cpu=12090mb&lt;br /&gt;
| mem=380001mb&lt;br /&gt;
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
| mem-per-gpu=128200mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=48:00:00, nodes=9 (A100) / nodes=5 (H100), mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 1: Regular Queues&lt;br /&gt;
&lt;br /&gt;
== Short Queues ==&lt;br /&gt;
&amp;lt;p style=&amp;quot;color:red; &amp;quot;&amp;gt;&amp;lt;b&amp;gt;Queues with a short runtime of 30 minutes.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt; &lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=94000mb&amp;lt;br/&amp;gt;cpus-per-gpu=12&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)&lt;br /&gt;
|}&lt;br /&gt;
Table 2: Short Queues&lt;br /&gt;
&lt;br /&gt;
== Development Queues ==&lt;br /&gt;
These queues are only for development tasks, i.e. debugging or performance optimization.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_a100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&amp;lt;br/&amp;gt;&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16 &lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 3: Development Queues&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The default resources of a queue define the number of tasks and the memory if these are not explicitly given with the sbatch command. The resource options &#039;&#039;--time&#039;&#039;, &#039;&#039;--ntasks&#039;&#039;, &#039;&#039;--nodes&#039;&#039;, &#039;&#039;--mem&#039;&#039; and &#039;&#039;--mem-per-cpu&#039;&#039; are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].&lt;br /&gt;
&lt;br /&gt;
== Check available resources: sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC provides a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. Users can use this information to submit jobs that fit the idle resources and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The following command displays what resources are available for immediate use for the whole partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle &lt;br /&gt;
Partition dev_cpu                 :      1 nodes idle&lt;br /&gt;
Partition cpu                     :      1 nodes idle&lt;br /&gt;
Partition highmem                 :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_h100            :      0 nodes idle&lt;br /&gt;
Partition gpu_h100                :      0 nodes idle&lt;br /&gt;
Partition gpu_mi300               :      0 nodes idle&lt;br /&gt;
Partition dev_cpu_il              :      7 nodes idle&lt;br /&gt;
Partition cpu_il                  :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_a100_il         :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_il             :      0 nodes idle&lt;br /&gt;
Partition gpu_h100_il             :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_short          :      0 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Running Jobs =&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
Important Slurm commands for non-administrators working on bwUniCluster 3.0.&lt;br /&gt;
{| width=850px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Batch Jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
Batch jobs are submitted with the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. The main purpose of the &#039;&#039;&#039;sbatch&#039;&#039;&#039; command is to specify the resources that are needed to run the job. &#039;&#039;&#039;sbatch&#039;&#039;&#039; will then queue the batch job. However, the start of a batch job depends on the availability of the requested resources and on the fair-share value.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].&lt;br /&gt;
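A small illustration of the precedence between the two ways of passing options. This is a sketch under an assumption worth verifying with `man sbatch` on the cluster: a value given on the sbatch command line overrides the corresponding #SBATCH line inside the script; `demo.sh` is a hypothetical script.

```shell
# Sketch of option precedence: a value passed on the sbatch command line
# overrides the corresponding #SBATCH line inside the job script.
# (demo.sh is a hypothetical script containing "#SBATCH --time=00:10:00".)
script_time="00:10:00"   # value written in the script
cli_time="00:30:00"      # value given as: sbatch --time=00:30:00 demo.sh
effective_time=$cli_time # the command-line value wins
echo "effective wall clock limit: ${effective_time}"
```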
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! style=&amp;quot;width:8%&amp;quot;| Command line&lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;| Script&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t, --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N, --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n, --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count of tasks per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c, --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in MegaByte per node. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --exclusive&lt;br /&gt;
| #SBATCH --exclusive &lt;br /&gt;
| The job allocates all CPUs and GPUs on the nodes and will not share the nodes with other running jobs.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J, --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A, --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The project group a job is accounted on is shown after &amp;quot;Account=&amp;quot; in the output of &amp;quot;scontrol show job&amp;quot;. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p, --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Job constraint LSDF filesystems.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Job constraint BeeOND filesystem.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
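The options above are usually collected at the top of the job script. A minimal sketch of such a script (the partition, resource values, job name and payload are placeholders, not recommendations):&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --partition=cpu          # queue name; see the queue tables for valid choices
#SBATCH --ntasks=1               # a single task
#SBATCH --time=00:20:00          # wall clock limit of 20 minutes
#SBATCH --job-name=my_job        # hypothetical job name
#SBATCH --output=my_job_%j.out   # %j is replaced by the job id
#SBATCH --mail-type=END,FAIL     # mail on job end or failure

# Stand-in payload; replace with the real program call.
echo "Job ${SLURM_JOB_ID:-unset} running with ${SLURM_NTASKS:-1} task(s)"
```

Such a script would be submitted with sbatch my_job.sh; options given on the command line override the corresponding #SBATCH lines in the script.&lt;br /&gt;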
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Interactive Jobs: salloc ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 you are only allowed to run short jobs (&amp;lt;&amp;lt; 1 hour) with low memory requirements (&amp;lt;&amp;lt; 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that request more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For a serial application running on a compute node that requires 5000 MByte of memory, with the interactive run limited to 2 hours, the following command has to be executed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -n 1 -t 120 --mem=5000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will then get one core on a compute node within the partition &amp;quot;cpu&amp;quot;. After executing this command &#039;&#039;&#039;DO NOT CLOSE&#039;&#039;&#039; your current terminal session; wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./&amp;lt;my_serial_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please be aware that in this example your serial job must finish within 2 hours, otherwise it will be killed by the system during runtime. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ xterm&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
An interactive parallel application running on one or several compute nodes (e.g. here 5 nodes with 96 cores each) usually requires an amount of memory per node in GByte (e.g. 50 GByte) and a maximum time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00  --mem=50gb&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.&lt;br /&gt;
If you want to have access to another node, you have to open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to&lt;br /&gt;
connect to the running interactive job and then to a specific node:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --jobid=XXXXXXXX --pty /bin/bash&lt;br /&gt;
$ srun --nodelist=uc3nXXX --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the job ID and the node list can be displayed.&lt;br /&gt;
&lt;br /&gt;
If you want to run MPI programs, you can do so by simply typing mpirun &amp;lt;program_name&amp;gt;. Your program will then run on all 480 cores. A very simple example of starting a parallel job is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also start the debugger ddt with the commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module add devel/ddt&lt;br /&gt;
$ ddt &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The above commands will execute the parallel program &amp;lt;my_mpi_program&amp;gt; on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 50 &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are using Intel MPI, you must start &amp;lt;my_mpi_program&amp;gt; with the command mpiexec.hydra (instead of mpirun).&lt;br /&gt;
&lt;br /&gt;
== Interactive Computing with Jupyter ==&lt;br /&gt;
&lt;br /&gt;
== Monitor and manage jobs ==&lt;br /&gt;
&lt;br /&gt;
=== List of your submitted jobs : squeue ===&lt;br /&gt;
Displays information about YOUR active, pending and/or recently completed jobs; only your own jobs are shown. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;squeue&#039;&#039; example on bwUniCluster 3.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  R       8:15      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123 PD       0:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  R       2:41      1 uc3n084&lt;br /&gt;
$ squeue -l&lt;br /&gt;
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  RUNNING       8:55     20:00      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123  PENDING       0:00     20:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  RUNNING       3:21     20:00      1 uc3n084&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Detailed job information : scontrol show job ===&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for a specified one. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is an example from bwUniCluster 3.0:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
1262       cpu     wrap ka_zs040  R       1:12      1 uc3n002&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my job with jobid 1262&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 1262&lt;br /&gt;
&lt;br /&gt;
JobId=1262 JobName=wrap&lt;br /&gt;
   UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=4246 Nice=0 Account=ka QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30&lt;br /&gt;
   AccrueTime=2025-04-04T10:01:30&lt;br /&gt;
   StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main&lt;br /&gt;
   Partition=cpu AllocNode:Sid=uc3n999:2819841&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc3n002&lt;br /&gt;
   BatchHost=uc3n002&lt;br /&gt;
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=2000M,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=2,mem=4000M,node=1,billing=2&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=(null)&lt;br /&gt;
   WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402&lt;br /&gt;
   StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Each request to the Slurm workload manager generates a load. &amp;lt;p style=&amp;quot;color:red;&amp;quot;&amp;gt; Therefore, do not use &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; with a simple &amp;lt;code&amp;gt;watch&amp;lt;/code&amp;gt;.&amp;lt;/p&amp;gt; The smallest allowed time interval is &amp;lt;b&amp;gt;30 seconds&amp;lt;/b&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
Any violation of this rule will result in the task being terminated without notice.&lt;br /&gt;
&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Slurm Options =&lt;br /&gt;
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]&lt;br /&gt;
&lt;br /&gt;
= Best Practices =&lt;br /&gt;
&lt;br /&gt;
== Step-by-Step example==&lt;br /&gt;
&lt;br /&gt;
== Dos and Don&#039;ts ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;| Do not run squeue and other Slurm commands in loops or with &amp;quot;watch&amp;quot;, so as not to saturate the Slurm daemon with RPC requests.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15362</id>
		<title>BwUniCluster3.0/Running Jobs</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15362"/>
		<updated>2025-10-24T08:05:14Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Detailed job information : scontrol show job */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Purpose and function of a queuing system =&lt;br /&gt;
&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are either executed automatically via a batch script or the resources can be used interactively.&amp;lt;br&amp;gt;&lt;br /&gt;
General procedure: Hint to [[Running_Calculations | Running Calculations]]&lt;br /&gt;
&lt;br /&gt;
== Job submission process ==&lt;br /&gt;
&lt;br /&gt;
bwUniCluster 3.0 uses the workload manager Slurm. Therefore, any job submission by the user has to be performed with commands of the Slurm software. Slurm queues and runs user jobs based on fair-share policies.&lt;br /&gt;
&lt;br /&gt;
== Slurm ==&lt;br /&gt;
&lt;br /&gt;
The HPC workload manager on bwUniCluster 3.0 is Slurm.&lt;br /&gt;
Slurm is a cluster management and job scheduling system. It has three key functions: &lt;br /&gt;
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work. &lt;br /&gt;
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. &lt;br /&gt;
* It arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&lt;br /&gt;
Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to a resource and workload managing software.&lt;br /&gt;
&lt;br /&gt;
== Terms and definitions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Partitions &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm manages job queues for different &#039;&#039;&#039;partitions&#039;&#039;&#039;. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different partitions:&lt;br /&gt;
&lt;br /&gt;
* CPU-only nodes&lt;br /&gt;
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each&lt;br /&gt;
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each&lt;br /&gt;
* GPU-accelerated nodes&lt;br /&gt;
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs&lt;br /&gt;
** 4-socket node with 4x AMD Instinct accelerator&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Queues &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Job &#039;&#039;&#039;queues&#039;&#039;&#039; are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different main types of queues:&lt;br /&gt;
* Regular queues&lt;br /&gt;
** cpu: Jobs that request CPU-only nodes.&lt;br /&gt;
** gpu: Jobs that request GPU-accelerated nodes.&lt;br /&gt;
* Development queues (dev)&lt;br /&gt;
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to give users immediate access to compute resources without having to wait. They are the place for short, heavy test computations that would otherwise affect other users on the login nodes.&lt;br /&gt;
&lt;br /&gt;
Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 &amp;lt;font color=red&amp;gt;requires at least the specification of the &#039;&#039;&#039;queue&#039;&#039;&#039; and the &#039;&#039;&#039;time&#039;&#039;&#039;&amp;lt;/font&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Jobs &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Jobs can be run non-interactively as &#039;&#039;&#039;batch jobs&#039;&#039;&#039; or as &#039;&#039;&#039;interactive jobs&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command.&lt;br /&gt;
For interactive jobs, the resources are requested with the &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Please remember:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Heavy computations are not allowed on the login nodes&#039;&#039;&#039;.&amp;lt;br&amp;gt;Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
* &#039;&#039;&#039;Development queues&#039;&#039;&#039; are meant for &#039;&#039;&#039;development tasks&#039;&#039;&#039;.&amp;lt;br&amp;gt;Do not misuse these queues for regular, short-running jobs or chain jobs! Only one job may run at a time, and the maximum queue length is limited to 3.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Queues on bwUniCluster 3.0 = &lt;br /&gt;
== Policy ==&lt;br /&gt;
&lt;br /&gt;
The computing time is provided in accordance with the &#039;&#039;&#039;fair share policy&#039;&#039;&#039;. The individual investment shares of the respective university and the resources already used by its members are taken into account. Furthermore, the following throttling policy is also active: The &#039;&#039;&#039;maximum amount of physical cores&#039;&#039;&#039; used at any given time from jobs running is &#039;&#039;&#039;1920 per user&#039;&#039;&#039; (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.&lt;br /&gt;
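The node counts in this throttling rule follow from the core counts per node given above (an Ice Lake node has 2 x 32 = 64 cores, a standard node 2 x 48 = 96); a quick shell-arithmetic check:&lt;br /&gt;

```shell
# maximum physical cores per user, aggregated over all running jobs
max_cores=1920

echo "Ice Lake nodes:  $((max_cores / 64))"   # prints 30
echo "Standard nodes:  $((max_cores / 96))"   # prints 20
```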
&lt;br /&gt;
== Regular Queues ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node-Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
| mem-per-cpu=12090mb&lt;br /&gt;
| mem=380001mb&lt;br /&gt;
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
| mem-per-gpu=128200mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=48:00:00, nodes=9(A100)/nodes=5(H100) , mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 1: Regular Queues&lt;br /&gt;
&lt;br /&gt;
== Short Queues ==&lt;br /&gt;
&amp;lt;p style=&amp;quot;color:red; &amp;quot;&amp;gt;&amp;lt;b&amp;gt;Queues with a short runtime of 30 minutes.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt; &lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=94000mb&amp;lt;br/&amp;gt;cpus-per-gpu=12&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)&lt;br /&gt;
|}&lt;br /&gt;
Table 2: Short Queues&lt;br /&gt;
&lt;br /&gt;
== Development Queues ==&lt;br /&gt;
Only for development, i.e. debugging or performance optimization ...&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_a100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&amp;lt;br/&amp;gt;&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16 &lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 3: Development Queues&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Default resources of a queue class define the number of tasks and the memory if these are not explicitly given with the sbatch command. The resource options &#039;&#039;--time&#039;&#039;, &#039;&#039;--ntasks&#039;&#039;, &#039;&#039;--nodes&#039;&#039;, &#039;&#039;--mem&#039;&#039; and &#039;&#039;--mem-per-cpu&#039;&#039; are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].&lt;br /&gt;
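As an illustration of how these defaults combine, consider a hypothetical request of 4 tasks on the cpu queue without an explicit --mem: the default mem-per-cpu of 2000mb (Table 1) then implies the following memory limit:&lt;br /&gt;

```shell
mem_per_cpu=2000   # MB, default of the cpu queue (Table 1)
ntasks=4           # hypothetical request: sbatch --ntasks=4 without --mem
echo "implied memory limit: $((ntasks * mem_per_cpu)) MB"   # prints 8000 MB
```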
&lt;br /&gt;
== Check available resources: sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo is used to view partition and node information on a system running Slurm. It incorporates downtime, reservations and node state information when determining the available backfill window. The sinfo command can only be used by administrators.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The following command displays what resources are available for immediate use for the whole partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle &lt;br /&gt;
Partition dev_cpu                 :      1 nodes idle&lt;br /&gt;
Partition cpu                     :      1 nodes idle&lt;br /&gt;
Partition highmem                 :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_h100            :      0 nodes idle&lt;br /&gt;
Partition gpu_h100                :      0 nodes idle&lt;br /&gt;
Partition gpu_mi300               :      0 nodes idle&lt;br /&gt;
Partition dev_cpu_il              :      7 nodes idle&lt;br /&gt;
Partition cpu_il                  :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_a100_il         :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_il             :      0 nodes idle&lt;br /&gt;
Partition gpu_h100_il             :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_short          :      0 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Running Jobs =&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
Important Slurm commands for non-administrators working on bwUniCluster 3.0.&lt;br /&gt;
{| width=850px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Batch Jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
Batch jobs are submitted with the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. The main purpose of the &#039;&#039;&#039;sbatch&#039;&#039;&#039; command is to specify the resources that are needed to run the job. &#039;&#039;&#039;sbatch&#039;&#039;&#039; will then queue the batch job. However, the start of a batch job depends on the availability of the requested resources and the fair-share value.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be used on the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! style=&amp;quot;width:8%&amp;quot;| Command line&lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;| Script&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t, --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N, --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n, --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count of tasks per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c, --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in MegaByte per node. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --exclusive&lt;br /&gt;
| #SBATCH --exclusive &lt;br /&gt;
| The job allocates all CPUs and GPUs on the nodes. It will not share the nodes with other running jobs.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J, --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A, --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. With the command &amp;quot;scontrol show job&amp;quot; the project group the job is accounted on is shown after &amp;quot;Account=&amp;quot;. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p, --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Job constraint LSDF filesystems.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Job constraint BeeOND filesystem.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
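&lt;br /&gt;
For illustration, a minimal batch script combining some of the options above could look as follows (queue name, resource values and the program name are only placeholders and must be adapted):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --partition=cpu&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=96&lt;br /&gt;
#SBATCH --job-name=my_job&lt;br /&gt;
#SBATCH --output=my_job_%j.out&lt;br /&gt;
./&amp;lt;my_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script is then submitted with:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch my_job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;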
&lt;br /&gt;
== Interactive Jobs: salloc ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 you are only allowed to run short jobs (&amp;lt;&amp;lt; 1 hour) with small memory requirements (&amp;lt;&amp;lt; 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs requesting more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For a serial application running on a compute node that requires 5000 MByte of memory, with the interactive run limited to 2 hours, the following command has to be executed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -n 1 -t 120 --mem=5000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will then get one core on a compute node within the partition &amp;quot;cpu&amp;quot;. After executing this command, &#039;&#039;&#039;DO NOT CLOSE&#039;&#039;&#039; your current terminal session; wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core. To run a serial program on the granted core you only have to type the name of the executable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./&amp;lt;my_serial_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please be aware that in this example your serial job must finish within 2 hours; otherwise it will be killed during runtime by the system. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ xterm&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
An interactive parallel application can run on one or several compute nodes (e.g. here 5 nodes with 96 cores each) and usually requires a certain amount of memory per node (e.g. 50 GByte) and a maximum time (e.g. 1 hour). For example, 5 nodes can be allocated by the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00  --mem=50gb&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.&lt;br /&gt;
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to&lt;br /&gt;
connect to the running interactive job and then to a specific node:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --jobid=XXXXXXXX --pty /bin/bash&lt;br /&gt;
$ srun --nodelist=uc3nXXX --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the jobid and the nodelist can be shown.&lt;br /&gt;
&lt;br /&gt;
If you want to run MPI programs, you can do so by simply typing mpirun &amp;lt;program_name&amp;gt;; your program will then run on all 480 cores. A very simple example for starting a parallel job:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also start the debugger DDT with the commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module add devel/ddt&lt;br /&gt;
$ ddt &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The above commands will execute the parallel program &amp;lt;my_mpi_program&amp;gt; on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 50 &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are using Intel MPI you must start &amp;lt;my_mpi_program&amp;gt; with the command mpiexec.hydra instead of mpirun.&lt;br /&gt;
&lt;br /&gt;
== Interactive Computing with Jupyter ==&lt;br /&gt;
&lt;br /&gt;
== Monitor and manage jobs ==&lt;br /&gt;
&lt;br /&gt;
=== List of your submitted jobs : squeue ===&lt;br /&gt;
squeue displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via the manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;squeue&#039;&#039; example on bwUniCluster 3.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  R       8:15      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123 PD       0:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  R       2:41      1 uc3n084&lt;br /&gt;
$ squeue -l&lt;br /&gt;
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  RUNNING       8:55     20:00      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123  PENDING       0:00     20:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  RUNNING       3:21     20:00      1 uc3n084&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Detailed job information : scontrol show job ===&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for a specified job. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via the manpage (man scontrol). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is an example from bwUniCluster 3.0:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
1262       cpu     wrap ka_zs040  R       1:12      1 uc3n002&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my pending job with jobid 1262&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 1262&lt;br /&gt;
&lt;br /&gt;
JobId=1262 JobName=wrap&lt;br /&gt;
   UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=4246 Nice=0 Account=ka QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30&lt;br /&gt;
   AccrueTime=2025-04-04T10:01:30&lt;br /&gt;
   StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main&lt;br /&gt;
   Partition=cpu AllocNode:Sid=uc3n999:2819841&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc3n002&lt;br /&gt;
   BatchHost=uc3n002&lt;br /&gt;
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=2000M,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=2,mem=4000M,node=1,billing=2&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=(null)&lt;br /&gt;
   WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402&lt;br /&gt;
   StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Each request to the Slurm workload manager generates load on the scheduler. Therefore, do not run &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; with a simple &amp;lt;code&amp;gt;watch&amp;lt;/code&amp;gt;. The smallest allowed time interval is &amp;lt;b&amp;gt;30 seconds&amp;lt;/b&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
Any violation of this rule will result in the task being terminated without notice.&lt;br /&gt;
&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
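&lt;br /&gt;
For example, a single job can be canceled by its job ID, or all of your pending jobs can be canceled at once by their state (the job ID 1262 is only a placeholder):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel 1262&lt;br /&gt;
$ scancel -t PENDING -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;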
&lt;br /&gt;
= Slurm Options =&lt;br /&gt;
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]&lt;br /&gt;
&lt;br /&gt;
= Best Practices =&lt;br /&gt;
&lt;br /&gt;
== Step-by-Step example==&lt;br /&gt;
&lt;br /&gt;
== Dos and Don&#039;ts ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;| Do not run squeue or other Slurm commands in loops or via &amp;quot;watch&amp;quot;, so as not to saturate the Slurm daemon with RPC requests.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15302</id>
		<title>BwUniCluster3.0/Running Jobs</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15302"/>
		<updated>2025-09-23T15:08:16Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Dos and Don&amp;#039;ts */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Purpose and function of a queuing system =&lt;br /&gt;
&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are either executed automatically via a batch script or the resources can be used interactively.&amp;lt;br&amp;gt;&lt;br /&gt;
For the general procedure, see [[Running_Calculations | Running Calculations]].&lt;br /&gt;
&lt;br /&gt;
== Job submission process ==&lt;br /&gt;
&lt;br /&gt;
bwUniCluster 3.0 uses the workload management software Slurm. Therefore any job submission by the user has to be performed via commands of the Slurm software. Slurm queues and runs user jobs based on fair-share policies.&lt;br /&gt;
&lt;br /&gt;
== Slurm ==&lt;br /&gt;
&lt;br /&gt;
The HPC workload manager on bwUniCluster 3.0 is Slurm.&lt;br /&gt;
Slurm is a cluster management and job scheduling system. Slurm has three key functions. &lt;br /&gt;
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work. &lt;br /&gt;
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. &lt;br /&gt;
* It arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&lt;br /&gt;
Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to the resource and workload management software.&lt;br /&gt;
&lt;br /&gt;
== Terms and definitions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Partitions &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm manages job queues for different &#039;&#039;&#039;partitions&#039;&#039;&#039;. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different partitions:&lt;br /&gt;
&lt;br /&gt;
* CPU-only nodes&lt;br /&gt;
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each&lt;br /&gt;
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each&lt;br /&gt;
* GPU-accelerated nodes&lt;br /&gt;
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs&lt;br /&gt;
** 4-socket node with 4x AMD Instinct accelerator&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Queues &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Job &#039;&#039;&#039;queues&#039;&#039;&#039; are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different main types of queues:&lt;br /&gt;
* Regular queues&lt;br /&gt;
** cpu: Jobs that request CPU-only nodes.&lt;br /&gt;
** gpu: Jobs that request GPU-accelerated nodes.&lt;br /&gt;
* Development queues (dev)&lt;br /&gt;
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place for immediate, compute-intensive tests that would otherwise affect other users if run on the login nodes.&lt;br /&gt;
&lt;br /&gt;
Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 &amp;lt;font color=red&amp;gt;requires at least the specification of the &#039;&#039;&#039;queue&#039;&#039;&#039; and the &#039;&#039;&#039;time&#039;&#039;&#039;&amp;lt;/font&amp;gt;.&lt;br /&gt;
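&lt;br /&gt;
For example, a submission that satisfies these minimum requirements could look like this (queue name, time limit and script name are only placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p cpu -t 01:00:00 my_job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;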
&lt;br /&gt;
&#039;&#039;&#039; Jobs &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Jobs can be run non-interactively as &#039;&#039;&#039;batch jobs&#039;&#039;&#039; or as &#039;&#039;&#039;interactive jobs&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command.&lt;br /&gt;
For interactive jobs, the resources are requested with the &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Please remember:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Heavy computations are not allowed on the login nodes&#039;&#039;&#039;.&amp;lt;br&amp;gt;Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
* &#039;&#039;&#039;Development queues&#039;&#039;&#039; are meant for &#039;&#039;&#039;development tasks&#039;&#039;&#039;.&amp;lt;br&amp;gt;Do not misuse these queues for regular, short-running jobs or chain jobs! Only one running job at a time is allowed, and the maximum queue length is limited to 3.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Queues on bwUniCluster 3.0 = &lt;br /&gt;
== Policy ==&lt;br /&gt;
&lt;br /&gt;
The computing time is provided in accordance with the &#039;&#039;&#039;fair share policy&#039;&#039;&#039;. The individual investment shares of the respective university and the resources already used by its members are taken into account. Furthermore, the following throttling policy is also active: The &#039;&#039;&#039;maximum amount of physical cores&#039;&#039;&#039; used at any given time from jobs running is &#039;&#039;&#039;1920 per user&#039;&#039;&#039; (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.&lt;br /&gt;
&lt;br /&gt;
== Regular Queues ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node-Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
| mem-per-cpu=12090mb&lt;br /&gt;
| mem=380001mb&lt;br /&gt;
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
| mem-per-gpu=128200mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=48:00:00, nodes=9(A100)/nodes=5(H100) , mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 1: Regular Queues&lt;br /&gt;
&lt;br /&gt;
== Short Queues ==&lt;br /&gt;
&amp;lt;p style=&amp;quot;color:red; &amp;quot;&amp;gt;&amp;lt;b&amp;gt;Queues with a short maximum runtime of 30 minutes.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt; &lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=94000mb&amp;lt;br/&amp;gt;cpus-per-gpu=12&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)&lt;br /&gt;
|}&lt;br /&gt;
Table 2: Short Queues&lt;br /&gt;
&lt;br /&gt;
== Development Queues ==&lt;br /&gt;
Only for development, i.e. debugging or performance optimization ...&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_a100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&amp;lt;br/&amp;gt;&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16 &lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 3: Development Queues&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The default resources of a queue class define the number of tasks and the memory if these are not explicitly given with the sbatch command. The resource options &#039;&#039;--time&#039;&#039;, &#039;&#039;--ntasks&#039;&#039;, &#039;&#039;--nodes&#039;&#039;, &#039;&#039;--mem&#039;&#039; and &#039;&#039;--mem-per-cpu&#039;&#039; are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].&lt;br /&gt;
&lt;br /&gt;
== Check available resources: sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. Users can use this information to submit jobs that fit into these idle resources and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The following command displays what resources are available for immediate use for the whole partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle &lt;br /&gt;
Partition dev_cpu                 :      1 nodes idle&lt;br /&gt;
Partition cpu                     :      1 nodes idle&lt;br /&gt;
Partition highmem                 :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_h100            :      0 nodes idle&lt;br /&gt;
Partition gpu_h100                :      0 nodes idle&lt;br /&gt;
Partition gpu_mi300               :      0 nodes idle&lt;br /&gt;
Partition dev_cpu_il              :      7 nodes idle&lt;br /&gt;
Partition cpu_il                  :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_a100_il         :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_il             :      0 nodes idle&lt;br /&gt;
Partition gpu_h100_il             :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_short          :      0 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Running Jobs =&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
Important Slurm commands for non-administrators working on bwUniCluster 3.0.&lt;br /&gt;
{| width=850px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive job [[https://slurm.schedmd.com/salloc.html salloc]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Batch Jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
Batch jobs are submitted with the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. Its main purpose is to specify the resources that are needed to run the job. &#039;&#039;&#039;sbatch&#039;&#039;&#039; will then queue the batch job; when the job actually starts depends on the availability of the requested resources and on your fair-share priority.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be used on the command line or in your job script. Different defaults for some of these options are set depending on the queue and can be found [[BwUniCluster3.0/Slurm | here]].&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! style=&amp;quot;width:8%&amp;quot;| Command line&lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;| Script&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t, --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N, --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n, --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count of tasks per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c, --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in megabytes per node. (You should normally omit this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum memory required per allocated CPU. (You should normally omit this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --exclusive&lt;br /&gt;
| #SBATCH --exclusive &lt;br /&gt;
| The job allocates all CPUs and GPUs on the nodes. It will not share the nodes with other running jobs.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J, --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A, --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command &amp;quot;scontrol show job&amp;quot; shows the project group the job is accounted on behind &amp;quot;Account=&amp;quot;. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p, --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Request the LSDF file systems for the job.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Request a BeeOND on-demand file system for the job.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
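The options above are typically combined in a job script. The following is a minimal sketch; partition, resource values and the program name are placeholders you must adapt to your own job:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --partition=cpu&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=96&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --job-name=myjob&lt;br /&gt;
#SBATCH --output=myjob-%j.out&lt;br /&gt;
&lt;br /&gt;
# load required modules here, then start the program&lt;br /&gt;
mpirun ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Submit the script with &amp;quot;sbatch myjob.sh&amp;quot;; on success, sbatch prints the job ID.&lt;br /&gt;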
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Interactive Jobs: salloc ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 you are only allowed to run short jobs (&amp;lt;&amp;lt; 1 hour) with small memory requirements (&amp;lt;&amp;lt; 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that request more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For example, for a serial application on a compute node that requires 5000 MByte of memory, with the interactive run limited to 2 hours, the following command has to be executed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -n 1 -t 120 --mem=5000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will then get one core on a compute node within the partition &amp;quot;cpu&amp;quot;. After executing this command, &#039;&#039;&#039;DO NOT CLOSE&#039;&#039;&#039; your current terminal session; wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core, simply type the name of the executable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./&amp;lt;my_serial_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please be aware that in this example your serial job must finish within 2 hours; otherwise it will be killed by the system during runtime. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can also start a graphical X11 terminal connected to the dedicated resource, which in this example is available for 2 hours. Start it with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ xterm&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that once the walltime limit has been reached, the resources - i.e. the compute node - will automatically be revoked.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
An interactive parallel application running on one or several compute nodes (e.g. 5 nodes with 96 cores each) usually also requires an amount of memory in GByte (e.g. 50 GByte) and a maximum runtime (e.g. 1 hour). For example, 5 such nodes can be allocated with the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00  --mem=50gb&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.&lt;br /&gt;
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to connect&lt;br /&gt;
first to the running interactive job and then to a specific node:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --jobid=XXXXXXXX --pty /bin/bash&lt;br /&gt;
$ srun --nodelist=uc3nXXX --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
you can display the job ID and the node list.&lt;br /&gt;
&lt;br /&gt;
If you want to run MPI programs, simply type mpirun &amp;lt;program_name&amp;gt;. Your program will then run on 480 cores. A very simple example of starting a parallel job:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also start the debugger ddt with the following commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module add devel/ddt&lt;br /&gt;
$ ddt &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The above commands execute the parallel program &amp;lt;my_mpi_program&amp;gt; on all available cores. You can also start parallel programs on a subset of the cores, for example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 50 &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are using Intel MPI, you must start &amp;lt;my_mpi_program&amp;gt; with the command mpiexec.hydra (instead of mpirun).&lt;br /&gt;
&lt;br /&gt;
== Interactive Computing with Jupyter ==&lt;br /&gt;
&lt;br /&gt;
== Monitor and manage jobs ==&lt;br /&gt;
&lt;br /&gt;
=== List of your submitted jobs : squeue ===&lt;br /&gt;
Displays information about YOUR active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via its manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;squeue&#039;&#039; example on bwUniCluster 3.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  R       8:15      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123 PD       0:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  R       2:41      1 uc3n084&lt;br /&gt;
$ squeue -l&lt;br /&gt;
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  RUNNING       8:55     20:00      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123  PENDING       0:00     20:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  RUNNING       3:21     20:00      1 uc3n084&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
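For pending jobs, squeue --start shows the scheduler&#039;s estimated start time. A sketch of such a call (the job ID and times below are illustrative, not real output):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue --start&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES NODELIST(REASON)&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123 PD 2025-04-04T11:30:00      1 (Resources)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;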
&lt;br /&gt;
=== Detailed job information : scontrol show job ===&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for a specified one. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via its manpage (man scontrol). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Here is an example from bwUniCluster 3.0.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
1262       cpu     wrap ka_zs040  R       1:12      1 uc3n002&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my job with jobid 1262&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 1262&lt;br /&gt;
&lt;br /&gt;
JobId=1262 JobName=wrap&lt;br /&gt;
   UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=4246 Nice=0 Account=ka QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30&lt;br /&gt;
   AccrueTime=2025-04-04T10:01:30&lt;br /&gt;
   StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main&lt;br /&gt;
   Partition=cpu AllocNode:Sid=uc3n999:2819841&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc3n002&lt;br /&gt;
   BatchHost=uc3n002&lt;br /&gt;
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=2000M,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=2,mem=4000M,node=1,billing=2&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=(null)&lt;br /&gt;
   WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402&lt;br /&gt;
   StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
The scancel command cancels jobs. It is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via its manpage (man scancel). The syntax is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
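For example, you can cancel a single job, ask for confirmation first, or cancel all of your pending jobs at once (the job ID below is a placeholder):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel 1267                   # cancel job 1267&lt;br /&gt;
$ scancel -i 1267                # ask for confirmation before canceling&lt;br /&gt;
$ scancel -t PENDING -u $USER    # cancel all of your pending jobs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;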
&lt;br /&gt;
= Slurm Options =&lt;br /&gt;
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]&lt;br /&gt;
&lt;br /&gt;
= Best Practices =&lt;br /&gt;
&lt;br /&gt;
== Step-by-Step example==&lt;br /&gt;
&lt;br /&gt;
== Dos and Don&#039;ts ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;| Do not run squeue or other Slurm commands in loops or via &amp;quot;watch&amp;quot;, so as not to saturate the Slurm daemon with RPC requests.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=15265</id>
		<title>BwUniCluster3.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=15265"/>
		<updated>2025-09-02T09:45:21Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Compute nodes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 3.0 =&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039; is a parallel computer with distributed memory. &lt;br /&gt;
It consists of the bwUniCluster 3.0 components procured in 2024 and also includes the additional compute nodes which were procured as an extension to the bwUniCluster 2.0 in 2022.&lt;br /&gt;
 &lt;br /&gt;
Each node of the system consists of two Intel Xeon or AMD EPYC processors, local memory, local storage, network adapters and optional accelerators (NVIDIA A100 and H100, AMD Instinct MI300A). All nodes are connected via a fast InfiniBand interconnect.&lt;br /&gt;
&lt;br /&gt;
The parallel file system (Lustre) is connected to the InfiniBand switch of the compute cluster. This provides a fast and scalable parallel file &lt;br /&gt;
system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 9.4.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system act in different roles. From an end user&#039;s point of view, the different groups of nodes are login nodes and compute nodes. File server nodes and administrative server nodes are not accessible to users.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
The login nodes are the only nodes directly accessible by end users. These nodes are used for interactive login, file management, program development, and interactive pre- and post-processing.&lt;br /&gt;
There are two nodes dedicated to this service, and both can be reached via a single address: &amp;lt;code&amp;gt;uc3.scc.kit.edu&amp;lt;/code&amp;gt;. A DNS round-robin alias distributes login sessions to the login nodes.&lt;br /&gt;
To prevent login nodes from being used for activities that are not permitted there and that affect the user experience of other users, &#039;&#039;&#039;long-running and/or compute-intensive tasks are periodically terminated without any prior warning&#039;&#039;&#039;. Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Systems&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
bwUniCluster 3.0 comprises two parallel file systems based on Lustre.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:uc3.png|Optionen|center|Überschrift|800px]]&lt;br /&gt;
&lt;br /&gt;
= Compute Resources =&lt;br /&gt;
&lt;br /&gt;
== Login nodes ==&lt;br /&gt;
&lt;br /&gt;
After a successful [[BwUniCluster3.0/Login|login]], users find themselves on one of the so-called login nodes. Technically, these largely correspond to a standard CPU node, i.e. users have two AMD EPYC 9454 processors with a total of 96 cores at their disposal. The login nodes are the bridgehead for accessing computing resources.&lt;br /&gt;
Data and software are organized here, computing jobs are initiated and managed, and computing resources allocated for interactive use can also be accessed from here.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Any compute intensive job running on the login nodes will be terminated without any notice.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Compute nodes ==&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. Please refer to [[BwUniCluster3.0/Running_Jobs|Running Jobs]] on how to request resources.&amp;lt;br&amp;gt;&lt;br /&gt;
The following compute node types are available:&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt;CPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Standard&#039;&#039;&#039;: Two AMD EPYC 9454 processors per node with a total of 96 physical CPU cores or 192 logical cores (Hyper-Threading) per node. These nodes were procured in 2024.&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake&#039;&#039;&#039;: Two Intel Xeon Platinum 8358 processors per node with a total of 64 physical CPU cores or 128 logical cores (Hyper-Threading) per node. These nodes were procured in 2022 as an extension to bwUniCluster 2.0.&lt;br /&gt;
* &#039;&#039;&#039;High Memory&#039;&#039;&#039;: Similar to the standard nodes, but with six times larger memory.&lt;br /&gt;
&amp;lt;b&amp;gt;GPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;NVIDIA GPU x4&#039;&#039;&#039;: Similar to the standard nodes, but with larger memory and four NVIDIA H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;AMD GPU x4&#039;&#039;&#039;: AMD&#039;s accelerated processing unit (APU) MI300A with 4 CPU sockets and 4 compute units which share the same high-bandwidth memory (HBM).&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake NVIDIA GPU x4&#039;&#039;&#039;: Similar to the Ice Lake nodes, but with larger memory and four NVIDIA A100 or H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;Cascade Lake NVIDIA GPU x4&#039;&#039;&#039;: Nodes with four NVIDIA A100 GPUs.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Cascade Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Login nodes&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Availability in [[BwUniCluster3.0/Running_Jobs#Queues_on_bwUniCluster_3.0| queues]]&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 272&lt;br /&gt;
| 70&lt;br /&gt;
| 4&lt;br /&gt;
| 12&lt;br /&gt;
| 1&lt;br /&gt;
| 15&lt;br /&gt;
| 19&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD Zen 4&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6248R&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96 (4x 24)&lt;br /&gt;
| 64&lt;br /&gt;
| 48&lt;br /&gt;
| 96&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 2.3 TB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 4x 128 GB HBM3&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 3.84 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 7.68 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| 1.92 TB SATA SSD&lt;br /&gt;
| 7.68 TB SATA SSD&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA H100 &lt;br /&gt;
| 4x AMD Instinct MI300A&lt;br /&gt;
| 4x NVIDIA A100 / H100 &lt;br /&gt;
| 4x NVIDIA A100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 94 GB&lt;br /&gt;
| APU&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| 40 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR200 &lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 4x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x HDR200 &lt;br /&gt;
| IB 4x EDR&lt;br /&gt;
| IB 1x NDR200&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Hardware overview and properties&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the following file systems are available:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;$HOME&#039;&#039;&#039;&amp;lt;br&amp;gt;The HOME directory is created automatically after account activation, and the environment variable $HOME holds its name. HOME is the place where users find themselves after login.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces&#039;&#039;&#039;&amp;lt;br&amp;gt;Users can create so-called workspaces for non-permanent data with temporary lifetime.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces on flash storage&#039;&#039;&#039;&amp;lt;br&amp;gt;A further workspace file system based on flash-only storage is available for special requirements and certain users.&lt;br /&gt;
* &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039;&amp;lt;br&amp;gt;The directory $TMPDIR is only available and visible on the local node during the runtime of a compute job. It is located on fast SSD storage devices.&lt;br /&gt;
* &#039;&#039;&#039;BeeOND&#039;&#039;&#039; (BeeGFS On-Demand)&amp;lt;br&amp;gt;On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* &#039;&#039;&#039;LSDF Online Storage&#039;&#039;&#039;&amp;lt;br&amp;gt;On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. On the login nodes, LSDF is automatically mounted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Which file system to use?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored in $HOME but capacity restrictions (quotas) apply.&lt;br /&gt;
In case you accidentally deleted data on $HOME there is a chance that we can restore it from backup.&lt;br /&gt;
Permanent data which is not needed for months or exceeds the capacity restrictions should be sent to the LSDF Online Storage or to the archive and deleted from the file systems. Temporary data which is only needed on a single node and which does not exceed the disk space shown in Table 1 above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system BeeOND. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check: [[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details|File System Details]]&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The $HOME directories of bwUniCluster 3.0 users are located on the parallel file system Lustre.&lt;br /&gt;
You have access to your $HOME directory from all nodes of UC3. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$HOME|Detailed information on $HOME]]&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On UC3 workspaces should be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On UC3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces|Detailed information on Workspaces]]&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system based on flash-only storage is available for special requirements and certain users.&lt;br /&gt;
If possible, this file system should be used from the Ice Lake nodes of bwUniCluster 3.0 (queue &#039;&#039;cpu_il&#039;&#039;). &lt;br /&gt;
It provides high IOPS rates and better performance for small files. The quota limits are lower than on the &lt;br /&gt;
normal workspace file system.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces_on_flash_storage|Detailed information on Workspaces on flash storage]]&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. &lt;br /&gt;
This directory should be used for temporary files being accessed from the local node. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. &lt;br /&gt;
Because of the extremely fast local SSD storage devices, performance with small files is much better than on the parallel file systems. &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$TMPDIR|Detailed information on $TMPDIR]]&lt;br /&gt;
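The copy-in, compute, copy-out pattern described above can be sketched in plain shell. The input file and the tr step below are stand-ins for real data and programs, and outside a batch job $TMPDIR may be unset, so the sketch falls back to /tmp:

```shell
# Sketch of the $TMPDIR copy-in / compute / copy-out pattern.
# "input.dat" and the tr step stand in for real input data and programs.
SCRATCH="${TMPDIR:-/tmp}/tmpdir_demo_$$"   # fall back to /tmp outside batch jobs
mkdir -p "$SCRATCH"
echo "sample data" > "$SCRATCH/input.dat"        # stage input onto fast local storage
cat "$SCRATCH/input.dat" | tr 'a-z' 'A-Z' > "$SCRATCH/result.out"  # compute locally
RESULT=$(cat "$SCRATCH/result.out")              # results to copy back before job end
echo "$RESULT"
```

In a real job, the last step would copy the result files back to $HOME or a workspace before the job ends, since $TMPDIR is purged afterwards.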
&lt;br /&gt;
== BeeOND (BeeGFS On-Demand) ==&lt;br /&gt;
&lt;br /&gt;
Users have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged when your job completes.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#BeeOND_(BeeGFS_On-Demand)|Detailed information on BeeOND]]&lt;br /&gt;
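A hypothetical job-script sketch of the BeeOND workflow follows. The request mechanism (here an assumed Slurm constraint name) and the mount path are cluster-specific placeholders; check the linked BeeOND details page for the actual values.

```shell
#!/bin/bash
# Hypothetical sketch: using a per-job BeeOND file system.
# The constraint name and mount path below are ASSUMPTIONS for illustration;
# the real values are cluster-specific (see the BeeOND details page).
#SBATCH --nodes=4
#SBATCH --time=02:00:00
# #SBATCH --constraint=BEEOND            # assumed flag name; verify for your cluster

BEEOND_DIR="/mnt/odfs/$SLURM_JOB_ID"     # assumed per-job mount path
# cp -r "$HOME/input" "$BEEOND_DIR/"     # stage data into the per-job file system
# ... run the parallel job against $BEEOND_DIR ...
# cp -r "$BEEOND_DIR/output" "$HOME/"    # save results: BeeOND is purged at job end
```

The crucial point is the last step: because the file system is purged when the job completes, results must be copied out before the job script exits.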
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols and is only available for certain users.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#LSDF_Online_Storage|Detailed information on LSDF Online Storage]]&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Talk:Development/Python&amp;diff=15241</id>
		<title>Talk:Development/Python</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Talk:Development/Python&amp;diff=15241"/>
		<updated>2025-08-21T11:29:08Z</updated>

		<summary type="html">&lt;p&gt;S Braun: Created page with &amp;quot;Samuel: Wir sollten uv mit in die Tools Liste aufnehmen.&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Samuel: Wir sollten uv mit in die Tools Liste aufnehmen.&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Workspace&amp;diff=15209</id>
		<title>Workspace</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Workspace&amp;diff=15209"/>
		<updated>2025-08-15T11:07:02Z</updated>

		<summary type="html">&lt;p&gt;S Braun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Workspace tools&#039;&#039;&#039; provide temporary scratch space, so-called &#039;&#039;&#039;workspaces&#039;&#039;&#039;, for your calculations on a central file storage. They are meant to keep data for a limited time – but usually longer than the time of a single job run. &lt;br /&gt;
&lt;br /&gt;
== No Backup ==&lt;br /&gt;
&lt;br /&gt;
Workspaces are not meant for permanent storage, hence data in workspaces is not backed up and may be lost in case of problems on the storage system. Please copy/move important results to $HOME or some disks outside the cluster.&lt;br /&gt;
&lt;br /&gt;
== Create workspace ==&lt;br /&gt;
To create a workspace you need to state the &#039;&#039;name&#039;&#039; of your workspace and its &#039;&#039;lifetime&#039;&#039; in days. A maximum value for &#039;&#039;lifetime&#039;&#039; and a maximum number of renewals are defined on each cluster. Execution of:&lt;br /&gt;
&lt;br /&gt;
   $ ws_allocate mySpace 30&lt;br /&gt;
&lt;br /&gt;
e.g. returns:&lt;br /&gt;
 &lt;br /&gt;
   Workspace created. Duration is 720 hours. &lt;br /&gt;
   Further extensions available: 3&lt;br /&gt;
   /work/workspace/scratch/username-mySpace-0&lt;br /&gt;
&lt;br /&gt;
For more information read the program&#039;s help, i.e. &#039;&#039;$ ws_allocate -h&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== List all your workspaces ==&lt;br /&gt;
To list all your workspaces, execute:&lt;br /&gt;
&lt;br /&gt;
   $ ws_list&lt;br /&gt;
&lt;br /&gt;
which will return:&lt;br /&gt;
* Workspace ID&lt;br /&gt;
* Workspace location&lt;br /&gt;
* available extensions&lt;br /&gt;
* creation date and remaining time&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Find workspace location ==&lt;br /&gt;
&lt;br /&gt;
Workspace location/path can be prompted for any workspace &#039;&#039;ID&#039;&#039; using &#039;&#039;&#039;ws_find&#039;&#039;&#039;, in case of workspace &#039;&#039;mySpace&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
   $ ws_find mySpace&lt;br /&gt;
&lt;br /&gt;
returns the one-liner:&lt;br /&gt;
&lt;br /&gt;
   /work/workspace/scratch/username-mySpace-0&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
== Extend lifetime of your workspace ==&lt;br /&gt;
&lt;br /&gt;
Any workspace&#039;s lifetime can only be extended a cluster-specific number of times. There are several commands to extend a workspace&#039;s lifetime:&lt;br /&gt;
#&amp;lt;pre&amp;gt;$ ws_extend mySpace 40&amp;lt;/pre&amp;gt; which extends workspace ID &#039;&#039;mySpace&#039;&#039; by &#039;&#039;40&#039;&#039; days from now,&lt;br /&gt;
#&amp;lt;pre&amp;gt;$ ws_extend mySpace&amp;lt;/pre&amp;gt; which extends workspace ID &#039;&#039;mySpace&#039;&#039; by the number of days used previously,&lt;br /&gt;
#&amp;lt;pre&amp;gt;$ ws_allocate -x mySpace 40&amp;lt;/pre&amp;gt; which extends workspace ID &#039;&#039;mySpace&#039;&#039; by &#039;&#039;40&#039;&#039; days from now.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting Permissions for Sharing Files ==&lt;br /&gt;
The examples will assume you want to change the directory in $DIR. If you want to share a workspace, DIR could be set with &amp;lt;code&amp;gt;DIR=$(ws_find my_workspace)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Regular Unix Permissions ===&lt;br /&gt;
&lt;br /&gt;
Making workspaces world readable/writable using standard unix access rights with &amp;lt;tt&amp;gt;chmod&amp;lt;/tt&amp;gt; is only feasible if you are in a research group and you and your co-workers share a common  (&amp;quot;bwXXXXX&amp;quot;) unix group. It is strongly discouraged to make files readable or even writable to everyone or to large common groups. &lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
!style=&amp;quot;width:45%&amp;quot; | Command&lt;br /&gt;
!style=&amp;quot;width:55%&amp;quot; | Action&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;chgrp -R bw16e001 &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;chmod -R g+rX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Set group ownership and grant read access to group for files in workspace via unix rights to the group &amp;quot;bw16e001&amp;quot; (has to be re-done if files are added)&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;chgrp -R bw16e001 &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt; &lt;br /&gt;
&amp;lt;tt&amp;gt;chmod -R g+rswX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Set group ownership and grant read/write access to group for files in workspace via unix rights (has to be re-done if files are added). Group will be inherited by new files, but rights for the group will have to be re-set with chmod for every new file&lt;br /&gt;
|- &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Options used:&lt;br /&gt;
* -R: recursive&lt;br /&gt;
* g+rwx&lt;br /&gt;
** g: group&lt;br /&gt;
** + add permissions (- to remove)&lt;br /&gt;
** rwx: read, write, execute&lt;br /&gt;
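The group read-access recipe from the table above can be tried out on any throwaway directory. In this sketch, your current primary group (from &#039;&#039;id -gn&#039;&#039;) stands in for a shared project group like &amp;quot;bw16e001&amp;quot;:

```shell
# Demonstrate the group read-access recipe on a throwaway directory.
# "$(id -gn)" (your current primary group) stands in for a shared group
# such as bw16e001 from the table above.
DIR="${TMPDIR:-/tmp}/ws_perm_demo_$$"
mkdir -p "$DIR/sub"
touch "$DIR/sub/data.txt"
chgrp -R "$(id -gn)" "$DIR"     # hand the whole tree to the shared group
chmod -R g+rX "$DIR"            # group gets read on files, enter on directories
PERMS=$(stat -c '%A' "$DIR/sub/data.txt")
echo "$PERMS"                   # group read bit (5th character) is now set
```

Note that capital X in g+rX grants execute only on directories and on files that already have an execute bit, which is why it is preferred over x for recursive changes.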
&lt;br /&gt;
=== &amp;quot;ACL&amp;quot;s: Access Control Lists ===&lt;br /&gt;
ACLs  allow a much more detailed distribution of permissions but are a bit more complicated and not visible in detail via &amp;quot;ls&amp;quot;. They have the additional advantage that you can set a &amp;quot;default&amp;quot; ACL for a directory, (with a &amp;lt;tt&amp;gt;-d&amp;lt;/tt&amp;gt; flag or a &amp;lt;tt&amp;gt;d:&amp;lt;/tt&amp;gt; prefix) which will cause all newly created files to inherit the ACLs from the directory. Regular unix permissions only have limited support (only group ownership, not access rights) for this via the suid bit.&lt;br /&gt;
&lt;br /&gt;
Best practices with respect to ACL usage:&lt;br /&gt;
# Take into account that ACLs take precedence over standard unix access rights&lt;br /&gt;
# The owner of a workspace is responsible for its content and management&lt;br /&gt;
&lt;br /&gt;
Please note that &amp;lt;tt&amp;gt;ls&amp;lt;/tt&amp;gt; (list directory contents) indicates ACLs on directories and files only when run in long format (&amp;lt;tt&amp;gt;ls -l&amp;lt;/tt&amp;gt;), as a &amp;quot;+&amp;quot; sign after the standard unix access rights. &lt;br /&gt;
&lt;br /&gt;
Examples with regard to &amp;quot;my_workspace&amp;quot;:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
!style=&amp;quot;width:45%&amp;quot; | Command&lt;br /&gt;
!style=&amp;quot;width:55%&amp;quot; | Action&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;getfacl &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|List access rights on $DIR&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;setfacl -Rm u:fr_xy1:rX,d:u:fr_xy1:rX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Grant user &amp;quot;fr_xy1&amp;quot; read-only access to $DIR&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;setfacl -Rm u:fr_me0000:rwX,d:u:fr_me0000:rwX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;setfacl -Rm u:fr_xy1:rwX,d:u:fr_xy1:rwX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Grant your own user &amp;quot;fr_me0000&amp;quot; and &amp;quot;fr_xy1&amp;quot; inheritable read and write access to $DIR, so you can also read/write files put into the workspace by a coworker&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;setfacl -Rm g:bw16e001:rX,d:g:bw16e001:rX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Grant group (Rechenvorhaben) &amp;quot;bw16e001&amp;quot; read-only access to $DIR&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;setfacl -Rb &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Remove all ACL rights. Standard Unix access rights apply again.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Options used:&lt;br /&gt;
* -R: recursive&lt;br /&gt;
* -m: modify&lt;br /&gt;
* u:username:rwX: &amp;quot;u:&amp;quot; means the following name is a user; rwX: read, write, eXecute (capital X sets execute only where an execute bit is already set)&lt;br /&gt;
&lt;br /&gt;
== Delete a Workspace ==&lt;br /&gt;
&lt;br /&gt;
   $ ws_release mySpace # Manually erase your workspace mySpace&lt;br /&gt;
&lt;br /&gt;
Note: workspaces are kept for some time after release. To immediately delete and free space e.g. for quota reasons, delete the files with rm before release.&lt;br /&gt;
&lt;br /&gt;
Newer versions of workspace tools have a --delete-data flag that immediately deletes data. Note that deleted data from workspaces is permanently lost.&lt;br /&gt;
&lt;br /&gt;
== Restore an Expired Workspace ==&lt;br /&gt;
&lt;br /&gt;
For a certain (system-specific) grace time following workspace expiration, a workspace can be restored by performing the following steps:&lt;br /&gt;
&lt;br /&gt;
(1) Display restorable workspaces.&lt;br /&gt;
 ws_restore -l&lt;br /&gt;
&lt;br /&gt;
(2) Create a new workspace as the target for the restore:&lt;br /&gt;
 ws_allocate restored 60&lt;br /&gt;
&lt;br /&gt;
(3) Restore:&lt;br /&gt;
 ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; restored&lt;br /&gt;
&lt;br /&gt;
The expired workspace has to be specified using the &#039;&#039;&#039;full name&#039;&#039;&#039;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
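The full-name versus short-name distinction can be sketched with plain shell parameter expansion. The sample name below is hypothetical; ws_restore -l prints the actual full names:

```shell
# Sketch: anatomy of a full workspace name as printed by ws_restore -l.
# The sample name is hypothetical; real names come from ws_restore -l.
FULL="username-mySpace-1755252422"   # full name: username prefix + timestamp suffix
SHORT="${FULL#*-}"                   # strip the username prefix
SHORT="${SHORT%-*}"                  # strip the timestamp suffix
echo "$SHORT"                        # the short name, as shown by ws_list
```

So the expired source workspace is given as FULL, while the freshly allocated target is given by its short name only.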
If the workspace is not visible/restorable, it has been &#039;&#039;&#039;permanently deleted&#039;&#039;&#039; and cannot be restored, not even by us. Please always remember that workspaces are intended solely for temporary work data, and there is no backup of data in the workspaces.&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Workspace&amp;diff=15208</id>
		<title>Workspace</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Workspace&amp;diff=15208"/>
		<updated>2025-08-15T11:06:29Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Create workspace */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Workspace tools&#039;&#039;&#039; provide temporary scratch space, so-called &#039;&#039;&#039;workspaces&#039;&#039;&#039;, for your calculations on a central file storage. They are meant to keep data for a limited time – but usually longer than the time of a single job run. &lt;br /&gt;
&lt;br /&gt;
== No Backup ==&lt;br /&gt;
&lt;br /&gt;
Workspaces are not meant for permanent storage, hence data in workspaces is not backed up and may be lost in case of problems on the storage system. Please copy/move important results to $HOME or some disks outside the cluster.&lt;br /&gt;
&lt;br /&gt;
== Create workspace ==&lt;br /&gt;
To create a workspace you need to state the &#039;&#039;name&#039;&#039; of your workspace and its &#039;&#039;lifetime&#039;&#039; in days. A maximum value for &#039;&#039;lifetime&#039;&#039; and a maximum number of renewals are defined on each cluster. Execution of:&lt;br /&gt;
&lt;br /&gt;
   $ ws_allocate mySpace 30&lt;br /&gt;
&lt;br /&gt;
e.g. returns:&lt;br /&gt;
 &lt;br /&gt;
   Workspace created. Duration is 720 hours. &lt;br /&gt;
   Further extensions available: 3&lt;br /&gt;
   /work/workspace/scratch/username-mySpace-0&lt;br /&gt;
&lt;br /&gt;
For more information read the program&#039;s help, i.e. &#039;&#039;$ ws_allocate -h&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== List all your workspaces ==&lt;br /&gt;
To list all your workspaces, execute:&lt;br /&gt;
&lt;br /&gt;
   $ ws_list&lt;br /&gt;
&lt;br /&gt;
which will return:&lt;br /&gt;
* Workspace ID&lt;br /&gt;
* Workspace location&lt;br /&gt;
* available extensions&lt;br /&gt;
* creation date and remaining time&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Find workspace location ==&lt;br /&gt;
&lt;br /&gt;
Workspace location/path can be prompted for any workspace &#039;&#039;ID&#039;&#039; using &#039;&#039;&#039;ws_find&#039;&#039;&#039;, in case of workspace &#039;&#039;blah&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
   $ ws_find blah&lt;br /&gt;
&lt;br /&gt;
returns the one-liner:&lt;br /&gt;
&lt;br /&gt;
   /work/workspace/scratch/username-blah-0&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
== Extend lifetime of your workspace ==&lt;br /&gt;
&lt;br /&gt;
Any workspace&#039;s lifetime can only be extended a cluster-specific number of times. There are several commands to extend a workspace&#039;s lifetime:&lt;br /&gt;
#&amp;lt;pre&amp;gt;$ ws_extend blah 40&amp;lt;/pre&amp;gt; which extends workspace ID &#039;&#039;blah&#039;&#039; by &#039;&#039;40&#039;&#039; days from now,&lt;br /&gt;
#&amp;lt;pre&amp;gt;$ ws_extend blah&amp;lt;/pre&amp;gt; which extends workspace ID &#039;&#039;blah&#039;&#039; by the number of days used previously,&lt;br /&gt;
#&amp;lt;pre&amp;gt;$ ws_allocate -x blah 40&amp;lt;/pre&amp;gt; which extends workspace ID &#039;&#039;blah&#039;&#039; by &#039;&#039;40&#039;&#039; days from now.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting Permissions for Sharing Files ==&lt;br /&gt;
The examples will assume you want to change the directory in $DIR. If you want to share a workspace, DIR could be set with &amp;lt;code&amp;gt;DIR=$(ws_find my_workspace)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Regular Unix Permissions ===&lt;br /&gt;
&lt;br /&gt;
Making workspaces world readable/writable using standard unix access rights with &amp;lt;tt&amp;gt;chmod&amp;lt;/tt&amp;gt; is only feasible if you are in a research group and you and your co-workers share a common  (&amp;quot;bwXXXXX&amp;quot;) unix group. It is strongly discouraged to make files readable or even writable to everyone or to large common groups. &lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
!style=&amp;quot;width:45%&amp;quot; | Command&lt;br /&gt;
!style=&amp;quot;width:55%&amp;quot; | Action&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;chgrp -R bw16e001 &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;chmod -R g+rX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Set group ownership and grant read access to group for files in workspace via unix rights to the group &amp;quot;bw16e001&amp;quot; (has to be re-done if files are added)&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;chgrp -R bw16e001 &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt; &lt;br /&gt;
&amp;lt;tt&amp;gt;chmod -R g+rswX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Set group ownership and grant read/write access to group for files in workspace via unix rights (has to be re-done if files are added). Group will be inherited by new files, but rights for the group will have to be re-set with chmod for every new file&lt;br /&gt;
|- &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Options used:&lt;br /&gt;
* -R: recursive&lt;br /&gt;
* g+rwx&lt;br /&gt;
** g: group&lt;br /&gt;
** + add permissions (- to remove)&lt;br /&gt;
** rwx: read, write, execute&lt;br /&gt;
&lt;br /&gt;
=== &amp;quot;ACL&amp;quot;s: Access Control Lists ===&lt;br /&gt;
ACLs  allow a much more detailed distribution of permissions but are a bit more complicated and not visible in detail via &amp;quot;ls&amp;quot;. They have the additional advantage that you can set a &amp;quot;default&amp;quot; ACL for a directory, (with a &amp;lt;tt&amp;gt;-d&amp;lt;/tt&amp;gt; flag or a &amp;lt;tt&amp;gt;d:&amp;lt;/tt&amp;gt; prefix) which will cause all newly created files to inherit the ACLs from the directory. Regular unix permissions only have limited support (only group ownership, not access rights) for this via the suid bit.&lt;br /&gt;
&lt;br /&gt;
Best practices with respect to ACL usage:&lt;br /&gt;
# Take into account that ACLs take precedence over standard unix access rights&lt;br /&gt;
# The owner of a workspace is responsible for its content and management&lt;br /&gt;
&lt;br /&gt;
Please note that &amp;lt;tt&amp;gt;ls&amp;lt;/tt&amp;gt; (list directory contents) indicates ACLs on directories and files only when run in long format (&amp;lt;tt&amp;gt;ls -l&amp;lt;/tt&amp;gt;), as a &amp;quot;+&amp;quot; sign after the standard unix access rights. &lt;br /&gt;
&lt;br /&gt;
Examples with regard to &amp;quot;my_workspace&amp;quot;:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
!style=&amp;quot;width:45%&amp;quot; | Command&lt;br /&gt;
!style=&amp;quot;width:55%&amp;quot; | Action&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;getfacl &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|List access rights on $DIR&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;setfacl -Rm u:fr_xy1:rX,d:u:fr_xy1:rX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Grant user &amp;quot;fr_xy1&amp;quot; read-only access to $DIR&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;setfacl -Rm u:fr_me0000:rwX,d:u:fr_me0000:rwX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;setfacl -Rm u:fr_xy1:rwX,d:u:fr_xy1:rwX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Grant your own user &amp;quot;fr_me0000&amp;quot; and &amp;quot;fr_xy1&amp;quot; inheritable read and write access to $DIR, so you can also read/write files put into the workspace by a coworker&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;setfacl -Rm g:bw16e001:rX,d:g:bw16e001:rX &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Grant group (Rechenvorhaben) &amp;quot;bw16e001&amp;quot; read-only access to $DIR&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;tt&amp;gt;setfacl -Rb &amp;quot;$DIR&amp;quot;&amp;lt;/tt&amp;gt;&lt;br /&gt;
|Remove all ACL rights. Standard Unix access rights apply again.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Options used:&lt;br /&gt;
* -R: recursive&lt;br /&gt;
* -m: modify&lt;br /&gt;
* u:username:rwX: &amp;quot;u:&amp;quot; means the following name is a user; rwX: read, write, eXecute (capital X sets execute only where an execute bit is already set)&lt;br /&gt;
&lt;br /&gt;
== Delete a Workspace ==&lt;br /&gt;
&lt;br /&gt;
   $ ws_release blah # Manually erase your workspace blah&lt;br /&gt;
&lt;br /&gt;
Note: workspaces are kept for some time after release. To immediately delete and free space e.g. for quota reasons, delete the files with rm before release.&lt;br /&gt;
&lt;br /&gt;
Newer versions of workspace tools have a --delete-data flag that immediately deletes data. Note that deleted data from workspaces is permanently lost.&lt;br /&gt;
&lt;br /&gt;
== Restore an Expired Workspace ==&lt;br /&gt;
&lt;br /&gt;
For a certain (system-specific) grace time following workspace expiration, a workspace can be restored by performing the following steps:&lt;br /&gt;
&lt;br /&gt;
(1) Display restorable workspaces.&lt;br /&gt;
 ws_restore -l&lt;br /&gt;
&lt;br /&gt;
(2) Create a new workspace as the target for the restore:&lt;br /&gt;
 ws_allocate restored 60&lt;br /&gt;
&lt;br /&gt;
(3) Restore:&lt;br /&gt;
 ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; restored&lt;br /&gt;
&lt;br /&gt;
The expired workspace has to be specified using the &#039;&#039;&#039;full name&#039;&#039;&#039;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
If the workspace is not visible/restorable, it has been &#039;&#039;&#039;permanently deleted&#039;&#039;&#039; and cannot be restored, not even by us. Please always remember that workspaces are intended solely for temporary work data, and there is no backup of data in the workspaces.&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Software/Start_vnc_desktop&amp;diff=15207</id>
		<title>BwUniCluster2.0/Software/Start vnc desktop</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Software/Start_vnc_desktop&amp;diff=15207"/>
		<updated>2025-08-14T11:14:15Z</updated>

		<summary type="html">&lt;p&gt;S Braun: S Braun moved page BwUniCluster2.0/Software/Start vnc desktop to BwUniCluster3.0/Software/Start vnc desktop&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[BwUniCluster3.0/Software/Start vnc desktop]]&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Software/Start_vnc_desktop&amp;diff=15206</id>
		<title>BwUniCluster3.0/Software/Start vnc desktop</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Software/Start_vnc_desktop&amp;diff=15206"/>
		<updated>2025-08-14T11:14:15Z</updated>

		<summary type="html">&lt;p&gt;S Braun: S Braun moved page BwUniCluster2.0/Software/Start vnc desktop to BwUniCluster3.0/Software/Start vnc desktop&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The Linux 3D graphics stack is based on &#039;&#039;X11&#039;&#039; and &#039;&#039;OpenGL&#039;&#039;. This has some&lt;br /&gt;
drawbacks in conjunction with remote visualization:&lt;br /&gt;
&lt;br /&gt;
* Rendering takes place on the client, not the cluster&lt;br /&gt;
* Whole 3D model must be transferred via network to the client&lt;br /&gt;
* Some OpenGL extensions are not supported when using indirect / client side rendering instead of direct / hardware based rendering&lt;br /&gt;
* Many round trips in the X11 protocol negatively influence interactivity&lt;br /&gt;
* X11 is not available on non-Linux platforms&lt;br /&gt;
* Compatibility problems between client and cluster can occur&lt;br /&gt;
&lt;br /&gt;
To avoid these drawbacks,  &amp;lt;code&amp;gt;start_vnc_desktop&amp;lt;/code&amp;gt; is provided.&lt;br /&gt;
It combines the three open source  products [http://www.turbovnc.org/ TurboVNC], [http://www.virtualgl.org/ VirtualGL] and [http://openswr.org/ OpenSWR].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Virtual Network Computing (VNC)&#039;&#039; is a graphical desktop sharing system.&lt;br /&gt;
VNC is platform-independent - there are clients and servers for many&lt;br /&gt;
GUI-based operating systems. The VNC server is the program on the&lt;br /&gt;
machine that shares its screen. The VNC client (or viewer) is the&lt;br /&gt;
program that watches, controls, and interacts with the server. For more&lt;br /&gt;
details see: [https://en.wikipedia.org/wiki/VNC Wikipedia]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;VirtualGL&#039;&#039; redirects the 3D rendering commands from Linux OpenGL&lt;br /&gt;
applications to 3D accelerator hardware in the cluster. For more details&lt;br /&gt;
see: [https://en.wikipedia.org/wiki/VirtualGL Wikipedia]&lt;br /&gt;
&lt;br /&gt;
When no 3D accelerator hardware is available &#039;&#039;OpenSWR&#039;&#039;, a high&lt;br /&gt;
performance, highly scalable software rasterizer for OpenGL can carry&lt;br /&gt;
out the rendering task. For more details see:  [http://openswr.org OpenSWR]&lt;br /&gt;
&lt;br /&gt;
This script takes a two-step approach to start a VNC server in the&lt;br /&gt;
cluster environment:&lt;br /&gt;
&lt;br /&gt;
In the first step the batch system is used to allocate resources where a&lt;br /&gt;
VNC server can be started.&lt;br /&gt;
&lt;br /&gt;
In the second step the VNC server is launched on the resources granted&lt;br /&gt;
by the batch system. When the VNC server is successfully started, all&lt;br /&gt;
required login credentials and connection parameters will be reported.&lt;br /&gt;
To connect to this VNC server a VNC client installation on the local&lt;br /&gt;
desktop is required. &lt;br /&gt;
&lt;br /&gt;
= Script usage =&lt;br /&gt;
&lt;br /&gt;
* After login the script can simply be called from the command line:&amp;lt;pre&amp;gt;start_vnc_desktop&amp;lt;/pre&amp;gt;&lt;br /&gt;
* To get help on the available options use:&amp;lt;pre&amp;gt;start_vnc_desktop --help&amp;lt;/pre&amp;gt;&lt;br /&gt;
* Hardware rendering is currently only available on FH2 and bwUniCluster; it can be requested with:&amp;lt;pre&amp;gt;start_vnc_desktop --hw-rendering&amp;lt;/pre&amp;gt;&lt;br /&gt;
* Software rendering is available on all clusters; it can be requested with: &amp;lt;pre&amp;gt;start_vnc_desktop --sw-rendering&amp;lt;/pre&amp;gt;&lt;br /&gt;
* There is only a limited number of nodes with hardware rendering support, software rendering runs on all nodes.&lt;br /&gt;
* For large 3D data sets the software renderer may be faster.&lt;br /&gt;
* If neither &amp;lt;code&amp;gt;--hw-rendering&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;--sw-rendering&amp;lt;/code&amp;gt; is selected no 3D rendering support is available.&lt;br /&gt;
&lt;br /&gt;
= VNC client =&lt;br /&gt;
&lt;br /&gt;
In general every VNC client can be used to connect to the VNC server.&lt;br /&gt;
However, for best performance and compatibility the use of the&lt;br /&gt;
[http://www.turbovnc.org/ TurboVNC] client is recommended.&lt;br /&gt;
Below you find the necessary steps for different client operating systems.&lt;br /&gt;
&lt;br /&gt;
; Debian, Ubuntu:&lt;br /&gt;
* Download: [https://sourceforge.net/projects/turbovnc/files Download Site] -&amp;gt; latest version -&amp;gt; turbovnc_&amp;lt;VERSION&amp;gt;_amd64.deb&lt;br /&gt;
* Install: &amp;lt;pre&amp;gt; sudo apt-get install ./turbovnc_&amp;lt;VERSION&amp;gt;_amd64.deb&amp;lt;/pre&amp;gt;&lt;br /&gt;
* Execute: &amp;lt;pre&amp;gt;/opt/TurboVNC/bin/vncviewer&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
; Red Hat Enterprise Linux, Fedora:&lt;br /&gt;
* Download: [https://sourceforge.net/projects/turbovnc/files Download Site]  -&amp;gt; latest version -&amp;gt; turbovnc-&amp;lt;VERSION&amp;gt;.x86_64.rpm&lt;br /&gt;
* Install: &amp;lt;pre&amp;gt;sudo yum install ./turbovnc-&amp;lt;VERSION&amp;gt;.x86_64.rpm&amp;lt;/pre&amp;gt;&lt;br /&gt;
* Execute: &amp;lt;pre&amp;gt;/opt/TurboVNC/bin/vncviewer&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
; SUSE Linux Enterprise, openSUSE:&lt;br /&gt;
* Download [https://sourceforge.net/projects/turbovnc/files Download Site]  -&amp;gt; latest version -&amp;gt; turbovnc-&amp;lt;VERSION&amp;gt;.x86_64.rpm&lt;br /&gt;
* Install: &amp;lt;pre&amp;gt;sudo zypper install ./turbovnc-&amp;lt;VERSION&amp;gt;.x86_64.rpm&amp;lt;/pre&amp;gt;&lt;br /&gt;
* Execute: &amp;lt;pre&amp;gt;/opt/TurboVNC/bin/vncviewer&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
; ArchLinux:&lt;br /&gt;
* Download: Can be installed from the AUR&lt;br /&gt;
* Install: &amp;lt;pre&amp;gt;pacaur -S turbovnc&amp;lt;/pre&amp;gt;&lt;br /&gt;
* Execute: &amp;lt;pre&amp;gt;vncviewer&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
; Windows:&lt;br /&gt;
* Download: [https://sourceforge.net/projects/turbovnc/files Download Site] -&amp;gt; latest version -&amp;gt; TurboVNC64-&amp;lt;VERSION&amp;gt;.exe for 64-bit, TurboVNC-&amp;lt;VERSION&amp;gt;.exe for 32-bit&lt;br /&gt;
* Install: Double click on TurboVNC64-&amp;lt;VERSION&amp;gt;.exe / TurboVNC-&amp;lt;VERSION&amp;gt;.exe. Install in the default directory (or choose a different one, if preferred)&lt;br /&gt;
* Execute:  Java TurboVNCviewer (vncviewer-javaw.bat in installation directory)&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Software/Python_Dask&amp;diff=15205</id>
		<title>BwUniCluster2.0/Software/Python Dask</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Software/Python_Dask&amp;diff=15205"/>
		<updated>2025-08-14T11:13:43Z</updated>

		<summary type="html">&lt;p&gt;S Braun: S Braun moved page BwUniCluster2.0/Software/Python Dask to BwUniCluster3.0/Software/Python Dask&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[BwUniCluster3.0/Software/Python Dask]]&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Software/Python_Dask&amp;diff=15204</id>
		<title>BwUniCluster3.0/Software/Python Dask</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Software/Python_Dask&amp;diff=15204"/>
		<updated>2025-08-14T11:13:43Z</updated>

		<summary type="html">&lt;p&gt;S Braun: S Braun moved page BwUniCluster2.0/Software/Python Dask to BwUniCluster3.0/Software/Python Dask&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!--{| style=&amp;quot;border-style: solid; border-width: 1px&amp;quot;&lt;br /&gt;
! Navigation: [[BwHPC_Best_Practices_Repository|bwHPC BPR]] / [[BwUniCluster_User_Guide|bwUniCluster]] &lt;br /&gt;
|}--&amp;gt;&lt;br /&gt;
This guide explains how to use Python Dask and dask-jobqueue on bwUniCluster2.0.&lt;br /&gt;
&lt;br /&gt;
== Installation and Usage ==&lt;br /&gt;
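Dask and dask-jobqueue are typically installed into a personal Python environment on the cluster. The following commands are only a sketch; the module name &#039;devel/python&#039; and the environment path are assumptions and may differ on your system:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load devel/python      # module name is an assumption, check with &#039;module avail&#039;&lt;br /&gt;
$ python3 -m venv $HOME/dask-env&lt;br /&gt;
$ source $HOME/dask-env/bin/activate&lt;br /&gt;
$ pip install dask distributed dask-jobqueue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;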
Please have a look at our [https://github.com/hpcraink/workshop-parallel-jupyter Workshop] on how to use Dask on bwUniCluster2.0 (2_Grundlagen: Environment erstellen and 6_Dask). This is currently only available in German.&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Software/OpenFoam&amp;diff=15203</id>
		<title>BwUniCluster2.0/Software/OpenFoam</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Software/OpenFoam&amp;diff=15203"/>
		<updated>2025-08-14T11:13:06Z</updated>

		<summary type="html">&lt;p&gt;S Braun: S Braun moved page BwUniCluster2.0/Software/OpenFoam to BwUniCluster3.0/Software/OpenFoam&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[BwUniCluster3.0/Software/OpenFoam]]&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Software/OpenFoam&amp;diff=15202</id>
		<title>BwUniCluster3.0/Software/OpenFoam</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Software/OpenFoam&amp;diff=15202"/>
		<updated>2025-08-14T11:13:06Z</updated>

		<summary type="html">&lt;p&gt;S Braun: S Braun moved page BwUniCluster2.0/Software/OpenFoam to BwUniCluster3.0/Software/OpenFoam&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Softwarepage|cae/openfoam}}&lt;br /&gt;
&lt;br /&gt;
{| width=600px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Description !! Content&lt;br /&gt;
|-&lt;br /&gt;
| module load&lt;br /&gt;
| cae/openfoam&lt;br /&gt;
|-&lt;br /&gt;
| License&lt;br /&gt;
| [https://www.openfoam.org/licence.php GNU General Public Licence]&lt;br /&gt;
|-&lt;br /&gt;
| Citing&lt;br /&gt;
| n/a&lt;br /&gt;
|-&lt;br /&gt;
| Links&lt;br /&gt;
| [https://www.openfoam.org/ Homepage] &amp;amp;#124; [https://www.openfoam.org/docs/ Documentation]&lt;br /&gt;
|-&lt;br /&gt;
| Graphical Interface&lt;br /&gt;
| No&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
= Description =&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;OpenFOAM&#039;&#039;&#039; (Open-source Field Operation And Manipulation) is a free, open-source CFD software package with an extensive range of features to solve anything from complex fluid flows involving chemical reactions, turbulence and heat transfer, to solid dynamics and electromagnetics.&lt;br /&gt;
&lt;br /&gt;
= Adding OpenFOAM to Your Environment =&lt;br /&gt;
&lt;br /&gt;
After loading the desired module, activate the OpenFOAM applications by typing&lt;br /&gt;
&amp;lt;pre&amp;gt;$ source $FOAM_INIT&amp;lt;/pre&amp;gt;&lt;br /&gt;
or simply&lt;br /&gt;
&amp;lt;pre&amp;gt;$ foamInit&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Parallel run with OpenFOAM  =&lt;br /&gt;
For better performance when running OpenFOAM jobs in parallel on bwUniCluster, it is recommended to keep the decomposed data in local folders on each node.  &lt;br /&gt;
&lt;br /&gt;
For this purpose you may use the *HPC scripts, which copy your data to the node-specific folders after running decomposePar, and copy it back to the case folder before running reconstructPar.&lt;br /&gt;
&lt;br /&gt;
Don&#039;t forget to allocate enough wall time for the decomposition and reconstruction of your cases, as the data is processed directly on the nodes and may be lost if the job is cancelled before it is copied back into the case folder.&lt;br /&gt;
&lt;br /&gt;
The following commands will do that for you: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ decomposeParHPC&lt;br /&gt;
$ reconstructParHPC&lt;br /&gt;
$ reconstructParMeshHPC&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
instead of:&lt;br /&gt;
&amp;lt;pre&amp;gt;$ decomposePar&lt;br /&gt;
$ reconstructPar&lt;br /&gt;
$ reconstructParMesh&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, if you want to run&amp;lt;span style=&amp;quot;background:#edeae2;margin:10px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;snappyHexMesh&amp;lt;/span&amp;gt;in parallel, you may use the following commands:&lt;br /&gt;
&amp;lt;pre&amp;gt;$ decomposeParHPC&lt;br /&gt;
$ mpirun --bind-to core --map-by core -report-bindings snappyHexMesh -overwrite -parallel&lt;br /&gt;
$ reconstructParMeshHPC -constant&amp;lt;/pre&amp;gt;&lt;br /&gt;
instead of:&lt;br /&gt;
&amp;lt;pre&amp;gt;$ decomposePar&lt;br /&gt;
$ mpirun --bind-to core --map-by core -report-bindings snappyHexMesh -overwrite -parallel&lt;br /&gt;
$ reconstructParMesh -constant&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For jobs running on multiple nodes, OpenFOAM needs passwordless communication between the nodes in order to copy data into the local folders.&lt;br /&gt;
&lt;br /&gt;
Running ssh-keygen once will allow your nodes to communicate freely with each other via ssh. &lt;br /&gt;
&lt;br /&gt;
Do this once (if you haven&#039;t already done it in the past):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ssh-keygen&lt;br /&gt;
$ cat $HOME/.ssh/id_rsa.pub &amp;gt;&amp;gt; $HOME/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Building an OpenFOAM batch file for parallel processing =&lt;br /&gt;
== General information == &lt;br /&gt;
Before running OpenFOAM jobs in parallel, it is necessary to decompose the geometry domain into a number of segments equal to the number of processors (or threads) you intend to use. &lt;br /&gt;
&lt;br /&gt;
That means, for example, if you want to run a case on 8 processors, you first have to decompose the mesh into 8 segments. Then you start the solver in &#039;&#039;parallel&#039;&#039;, letting &#039;&#039;OpenFOAM&#039;&#039; run the calculations concurrently on these segments, with each processor responsible for one segment of the mesh and exchanging data with the other processors in between. &lt;br /&gt;
&lt;br /&gt;
There is, of course, a mechanism that properly connects the calculations, so you don&#039;t lose data or generate wrong results. &lt;br /&gt;
&lt;br /&gt;
The decomposition and segment-building process is handled by the&amp;lt;span style=&amp;quot;background:#edeae2;margin:10px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;decomposePar&amp;lt;/span&amp;gt;utility. &lt;br /&gt;
&lt;br /&gt;
The number of subdomains into which the geometry will be decomposed is specified in &amp;quot;&#039;&#039;system/decomposeParDict&#039;&#039;&amp;quot;, as well as the decomposition method to use. &lt;br /&gt;
&lt;br /&gt;
The automatic decomposition method is &amp;quot;&#039;&#039;scotch&#039;&#039;&amp;quot;. It partitions the mesh, collecting as many cells as possible per processor and trying to avoid empty segments or segments with too few cells. If you want your mesh to be divided in a different way, for example by specifying the number of segments in the x, y or z direction, you can use the &amp;quot;simple&amp;quot; or &amp;quot;hierarchical&amp;quot; methods. &lt;br /&gt;
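For illustration, the relevant entries of a &amp;quot;&#039;&#039;system/decomposeParDict&#039;&#039;&amp;quot; for 8 subdomains might look like this (sketch only; the usual FoamFile header is omitted):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
numberOfSubdomains 8;       // must match the number of MPI tasks&lt;br /&gt;
method             scotch;  // or: simple, hierarchical&lt;br /&gt;
&lt;br /&gt;
// for the &amp;quot;simple&amp;quot; method, additionally specify the cuts per direction, e.g.:&lt;br /&gt;
// simpleCoeffs { n (2 2 2); delta 0.001; }&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;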
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Wrapper script generation == &lt;br /&gt;
&#039;&#039;&#039;Attention:&#039;&#039;&#039; The &amp;lt;span style=&amp;quot;background:#edeae2;margin:10px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;openfoam&amp;lt;/span&amp;gt; module automatically loads the &amp;lt;span style=&amp;quot;background:#edeae2;margin:10px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;openmpi&amp;lt;/span&amp;gt; module required for parallel runs. Do &#039;&#039;&#039;NOT&#039;&#039;&#039; load another MPI version, as it may conflict with the loaded &amp;lt;span style=&amp;quot;background:#edeae2;margin:10px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;openfoam&amp;lt;/span&amp;gt; version. &lt;br /&gt;
&lt;br /&gt;
A batch script called &#039;&#039;job_openfoam.sh&#039;&#039; that runs the &#039;&#039;icoFoam&#039;&#039; solver with OpenFOAM version 8 on 80 processors on the &#039;&#039;multiple&#039;&#039; partition with a total wall clock time of 4 hours looks like this: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--b)--&amp;gt; &lt;br /&gt;
{| style=&amp;quot;width: 100%; border:1px solid #d0cfcc; background:#f2f7ff;border-spacing: 5px;&amp;quot;&lt;br /&gt;
| style=&amp;quot;width:280px; white-space:nowrap; color:#000;&amp;quot; |&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Allocate nodes&lt;br /&gt;
#SBATCH --nodes=2&lt;br /&gt;
# Number of tasks per node&lt;br /&gt;
#SBATCH --ntasks-per-node=40&lt;br /&gt;
# Queue class https://wiki.bwhpc.de/e/BwUniCluster_2.0_Batch_Queues&lt;br /&gt;
#SBATCH --partition=multiple&lt;br /&gt;
# Maximum job run time&lt;br /&gt;
#SBATCH --time=4:00:00&lt;br /&gt;
# Give the job a reasonable name&lt;br /&gt;
#SBATCH --job-name=openfoam&lt;br /&gt;
# File name for standard output (%j will be replaced by job id)&lt;br /&gt;
#SBATCH --output=logs-%j.out&lt;br /&gt;
# File name for error output&lt;br /&gt;
#SBATCH --error=logs-%j.err&lt;br /&gt;
&lt;br /&gt;
# User defined variables&lt;br /&gt;
FOAM_VERSION=&amp;quot;8&amp;quot;&lt;br /&gt;
EXECUTABLE=&amp;quot;icoFoam&amp;quot;&lt;br /&gt;
MPIRUN_OPTIONS=&amp;quot;--bind-to core --map-by core --report-bindings&amp;quot;&lt;br /&gt;
&lt;br /&gt;
module load cae/openfoam/${FOAM_VERSION}&lt;br /&gt;
foamInit&lt;br /&gt;
&lt;br /&gt;
# remove decomposePar if you already decomposed your case beforehand &lt;br /&gt;
decomposeParHPC &amp;amp;&amp;amp;&lt;br /&gt;
&lt;br /&gt;
# starting the solver in parallel. Name of the solver is given in the &amp;quot;EXECUTABLE&amp;quot; variable&lt;br /&gt;
mpirun ${MPIRUN_OPTIONS} ${EXECUTABLE} -parallel &amp;amp;&amp;amp;&lt;br /&gt;
&lt;br /&gt;
reconstructParHPC&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Attention:&#039;&#039;&#039; The script above will run a parallel OpenFOAM job with the pre-installed OpenMPI. If you are using an OpenFOAM version which comes with pre-installed Intel MPI (such as&amp;lt;span style=&amp;quot;background:#edeae2;margin:10px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;cae/openfoam/v1712-impi&amp;lt;/span&amp;gt;) you will have to modify the batch script to take full advantage of Intel MPI for parallel calculations. For details see:  &lt;br /&gt;
* [[Batch_Jobs_-_bwUniCluster_Features|Batch Jobs Features]]&lt;br /&gt;
&lt;br /&gt;
= Using I/O and reducing the amount of data and files =&lt;br /&gt;
In OpenFOAM, you can control which variables or fields are written at specific times. For example, for post-processing purposes, you might need only a subset of variables. In order to control which files will be written, there is a function object called &amp;quot;writeObjects&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
An example controlDict file may look like this: At the top of the file (entry &amp;quot;writeControl&amp;quot;) you specify that ALL fields (variables) required for restarting are saved every 12 wall-clock hours. Then, additionally, at the bottom of the controlDict in the &amp;quot;functions&amp;quot; block, you can add a function object of type &amp;quot;writeObjects&amp;quot;. With this function object, you can control the output of specific fields independent of the entry at the top of the file: &lt;br /&gt;
&amp;lt;!--b)--&amp;gt; &lt;br /&gt;
{| style=&amp;quot;width: 100%; border:1px solid #d0cfcc; background:#f2f7ff;border-spacing: 5px;&amp;quot;&lt;br /&gt;
| style=&amp;quot;width:280px; white-space:nowrap; color:#000;&amp;quot; |&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
/*--------------------------------*- C++ -*----------------------------------*\&lt;br /&gt;
| =========                 |                                                 |&lt;br /&gt;
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |&lt;br /&gt;
|  \\    /   O peration     | Version:  4.1.x                                 |&lt;br /&gt;
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |&lt;br /&gt;
|    \\/     M anipulation  |                                                 |&lt;br /&gt;
\*---------------------------------------------------------------------------*/&lt;br /&gt;
FoamFile&lt;br /&gt;
{&lt;br /&gt;
    version     2.0;&lt;br /&gt;
    format      ascii;&lt;br /&gt;
    class       dictionary;&lt;br /&gt;
    location    &amp;quot;system&amp;quot;;&lt;br /&gt;
    object      controlDict;&lt;br /&gt;
}&lt;br /&gt;
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //&lt;br /&gt;
&lt;br /&gt;
startFrom       latestTime;&lt;br /&gt;
startTime       0;&lt;br /&gt;
stopAt  	endTime;&lt;br /&gt;
endTime         1e2;&lt;br /&gt;
deltaT          1e-5;&lt;br /&gt;
&lt;br /&gt;
writeControl    clockTime;&lt;br /&gt;
writeInterval   43200; // write ALL fields necessary to restart your simulation &lt;br /&gt;
                       // every 43200 wall-clock seconds = 12 hours of real time&lt;br /&gt;
&lt;br /&gt;
purgeWrite      0;&lt;br /&gt;
writeFormat     binary;&lt;br /&gt;
writePrecision  10;&lt;br /&gt;
writeCompression off;&lt;br /&gt;
timeFormat      general;&lt;br /&gt;
timePrecision   10;&lt;br /&gt;
runTimeModifiable false;&lt;br /&gt;
&lt;br /&gt;
functions&lt;br /&gt;
{&lt;br /&gt;
    writeFields // name of the function object&lt;br /&gt;
    {&lt;br /&gt;
        type writeObjects;&lt;br /&gt;
        libs ( &amp;quot;libutilityFunctionObjects.so&amp;quot; );&lt;br /&gt;
&lt;br /&gt;
        objects&lt;br /&gt;
        (&lt;br /&gt;
	    T U rho // list of fields/variables to be written&lt;br /&gt;
        );&lt;br /&gt;
&lt;br /&gt;
        // E.g. write every 1e-5 seconds of simulation time only the specified fields&lt;br /&gt;
        writeControl runTime;&lt;br /&gt;
        writeInterval 1e-5; // write every 1e-5 seconds&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also define multiple function objects in order to write different subsets of fields at different times. In addition, you can use wildcards in the list of fields. For example, in order to write out all fields starting with &amp;quot;RR_&amp;quot;, you can add&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;quot;RR_.*&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to the list of objects. You can get a list of valid field names by writing &amp;quot;banana&amp;quot; in the field list; during the solver run, all valid field names are then printed.&lt;br /&gt;
The output time can be changed, too. Instead of writing at specific times in the simulation, you can also write after a certain number of time steps or depending on the wall clock time:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;// write every 100th simulation time step&lt;br /&gt;
writeControl timeStep;&lt;br /&gt;
writeInterval 100;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;// every 3600 seconds of real wall clock time&lt;br /&gt;
writeControl clockTime;&lt;br /&gt;
writeInterval 3600; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you use OpenFOAM before version 4.0 or 1606, the type of function object is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
type writeRegisteredObject; // (instead of type writeObjects) &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If you use OpenFOAM before version 3.0, you have to load the library with&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
functionObjectLibs (&amp;quot;libIOFunctionObjects.so&amp;quot;); // (instead of libs ( &amp;quot;libutilityFunctionObjects.so&amp;quot; )) &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and exchange the entry &amp;quot;writeControl&amp;quot; with &amp;quot;outputControl&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
= OpenFOAM and ParaView on bwUniCluster=&lt;br /&gt;
ParaView is not directly linked to the OpenFOAM installation on the cluster. Therefore, to visualize OpenFOAM cases with ParaView, they have to be opened manually from within the corresponding ParaView module.  &lt;br /&gt;
&lt;br /&gt;
1. Load the ParaView module. For example: &lt;br /&gt;
&amp;lt;pre&amp;gt;$ module load cae/paraview/5.9&amp;lt;/pre&amp;gt;&lt;br /&gt;
2. Create a dummy &#039;*.openfoam&#039; file in the OpenFOAM case folder:&lt;br /&gt;
&amp;lt;pre&amp;gt;$ cd &amp;lt;case_folder_path&amp;gt;&lt;br /&gt;
$ touch &amp;lt;case_name&amp;gt;.openfoam&amp;lt;/pre&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;NOTICE:&#039;&#039;&#039; the name of the dummy file should be the same as the name of the OpenFOAM case folder, with &#039;.openfoam&#039; extension.&lt;br /&gt;
&lt;br /&gt;
3. Open ParaView:&lt;br /&gt;
Running ParaView on the bwUniCluster requires a VNC session.&lt;br /&gt;
On the cluster run: &lt;br /&gt;
&amp;lt;pre&amp;gt;$ start_vnc_desktop --hw-rendering &amp;lt;/pre&amp;gt;&lt;br /&gt;
Start your VNC client on your desktop PC.&lt;br /&gt;
&#039;&#039;&#039;NOTICE:&#039;&#039;&#039; Information on remote visualization on the KIT HPC systems is available at: https://wiki.bwhpc.de/e/BwUniCluster2.0/Software/Start_vnc_desktop&lt;br /&gt;
&lt;br /&gt;
4. In Paraview go to &#039;File&#039; -&amp;gt; &#039;Open&#039;, or press Ctrl+O. Choose to show &#039;All files (*)&#039;, and open your &amp;lt;case_name&amp;gt;.openfoam file. In the pop-up window select OpenFOAM, and press &#039;Ok&#039;.&lt;br /&gt;
&lt;br /&gt;
5. That&#039;s it! Enjoy ParaView and OpenFOAM.&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Software/Matlab&amp;diff=15201</id>
		<title>BwUniCluster2.0/Software/Matlab</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Software/Matlab&amp;diff=15201"/>
		<updated>2025-08-14T11:11:33Z</updated>

		<summary type="html">&lt;p&gt;S Braun: S Braun moved page BwUniCluster2.0/Software/Matlab to BwUniCluster3.0/Software/Matlab&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[BwUniCluster3.0/Software/Matlab]]&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Software/Matlab&amp;diff=15200</id>
		<title>BwUniCluster3.0/Software/Matlab</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Software/Matlab&amp;diff=15200"/>
		<updated>2025-08-14T11:11:33Z</updated>

		<summary type="html">&lt;p&gt;S Braun: S Braun moved page BwUniCluster2.0/Software/Matlab to BwUniCluster3.0/Software/Matlab&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Softwarepage|math/matlab}}&lt;br /&gt;
&lt;br /&gt;
{| width=600px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Description !! Content&lt;br /&gt;
|-&lt;br /&gt;
| module load&lt;br /&gt;
| math/matlab&lt;br /&gt;
|-&lt;br /&gt;
| License&lt;br /&gt;
| [https://de.mathworks.com/pricing-licensing/index.html?intendeduse=edu&amp;amp;prodcode=ML Academic License/Commercial]&lt;br /&gt;
|-&lt;br /&gt;
| Citing&lt;br /&gt;
| n/a&lt;br /&gt;
|-&lt;br /&gt;
| Links&lt;br /&gt;
| [https://de.mathworks.com/products/matlab/ MATLAB Homepage] &amp;amp;#124; [https://de.mathworks.com/index.html?s_tid=gn_logo MathWorks Homepage] &amp;amp;#124; [https://de.mathworks.com/support/?s_tid=gn_supp Support and more]&lt;br /&gt;
|-&lt;br /&gt;
| Graphical Interface&lt;br /&gt;
| No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Description =&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MATLAB&#039;&#039;&#039; (MATrix LABoratory) is a high-level programming language and interactive computing environment for numerical calculation and data visualization.&lt;br /&gt;
&lt;br /&gt;
= Loading MATLAB =&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
It is not advisable to invoke an interactive MATLAB session on a login node of the cluster. Such sessions will be terminated automatically.&lt;br /&gt;
The recommended way to run a long-duration interactive MATLAB session is to submit an interactive job and start MATLAB from within the dedicated compute node assigned to you by the queueing system (consult the specific cluster users guide on how to submit interactive jobs).&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
An interactive MATLAB session with graphical user interface (GUI) can be started with the command (requires X11 forwarding enabled for your ssh login):&lt;br /&gt;
&amp;lt;pre&amp;gt;$ matlab&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since graphics rendering can be very slow on remote connections, the preferable way is to run the MATLAB command line interface without GUI:&lt;br /&gt;
&amp;lt;pre&amp;gt;$ matlab -nodisplay&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following command will execute a MATLAB script or function named &amp;quot;example&amp;quot; &#039;&#039;&#039;on a single thread&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;$ matlab -nodisplay -singleCompThread -r example &amp;gt; result.out 2&amp;gt;&amp;amp;1&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output of this session will be redirected to the file result.out. The option &amp;lt;span style=&amp;quot;background:#edeae2;margin:2px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;-r&amp;lt;/span&amp;gt; executes the MATLAB statement non-interactively. The option &amp;lt;span style=&amp;quot;background:#edeae2;margin:2px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;-singleCompThread&amp;lt;/span&amp;gt; limits MATLAB to a single computational thread. Most of the time, running MATLAB in single-threaded mode will meet your needs. But if you have mathematically intensive computations that benefit from the built-in multithreading provided by MATLAB&#039;s BLAS and FFT implementation, then you can experiment with running in multi-threaded mode by omitting this option (see section 4.1 - Implicit Threading).&lt;br /&gt;
&lt;br /&gt;
As with all processes that require more than a few minutes to run, non-trivial MATLAB jobs must be submitted to the cluster queuing system. Example batch scripts are available in the directory pointed to by the environment variable &amp;lt;span style=&amp;quot;background:#edeae2;margin:2px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;$MATLAB_EXA_DIR&amp;lt;/span&amp;gt;.&lt;br /&gt;
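For example, a minimal batch script for a single-threaded MATLAB run might look like the following sketch (the script name &#039;&#039;example&#039;&#039; and the resource values are placeholders to be adjusted):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --mem=4gb&lt;br /&gt;
&lt;br /&gt;
module load math/matlab&lt;br /&gt;
matlab -nodisplay -singleCompThread -r example &amp;gt; result.out 2&amp;gt;&amp;amp;1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;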
&lt;br /&gt;
= Parallel Computing Using MATLAB =&lt;br /&gt;
&lt;br /&gt;
Parallelization of MATLAB jobs is realized via the built-in multithreading provided by MATLAB&#039;s BLAS and FFT implementation and the parallel computing functionality of MATLAB&#039;s Parallel Computing Toolbox (PCT). The MATLAB Parallel/Distributed Computing Server is not available on the bwHPC-Clusters.&lt;br /&gt;
&lt;br /&gt;
== Implicit Threading ==&lt;br /&gt;
&lt;br /&gt;
A large number of built-in MATLAB functions may utilize multiple cores automatically without any code modifications required. This is referred to as implicit multithreading and must be strictly distinguished from explicit parallelism provided by the Parallel Computing Toolbox (PCT) which requires specific commands in your code in order to create threads.&lt;br /&gt;
&lt;br /&gt;
Implicit threading particularly takes place for linear algebra operations (such as the solution to a linear system A\b or matrix products A*B) and FFT operations. Many other high-level MATLAB functions do also benefit from multithreading capabilities of their underlying routines. However, the user can still enforce single-threaded mode by adding the command line option &amp;lt;span style=&amp;quot;background:#edeae2;margin:2px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;-singleCompThread&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Whenever implicit threading takes place, MATLAB detects the total number of cores on a machine and by default makes use of all of them. This has important implications for MATLAB jobs in HPC environments with a shared-node job scheduling policy (i.e. with multiple users sharing one compute node). Due to this behaviour, a MATLAB job may occupy more compute resources than assigned by the queueing system of the cluster, thereby taking these resources away from all other users with jobs running on the same node, including your own jobs.&lt;br /&gt;
&lt;br /&gt;
Therefore, when running in multi-threaded mode, the user must always intervene so that MATLAB does not allocate all cores of the machine (unless the queueing system granted them). The number of threads must be controlled from within the code by means of the &amp;lt;span style=&amp;quot;background:#edeae2;margin:2px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;maxNumCompThreads(N)&amp;lt;/span&amp;gt; function (which is slated for deprecation) or, alternatively, with the &amp;lt;span style=&amp;quot;background:#edeae2;margin:2px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;feature(&#039;numThreads&#039;, N)&amp;lt;/span&amp;gt; function (which is currently undocumented).&lt;br /&gt;
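As a sketch, the thread count can be tied to the cores granted by Slurm from within your script (this assumes your job requested cores via --cpus-per-task, so that SLURM_CPUS_PER_TASK is set):&lt;br /&gt;
&lt;br /&gt;
{{bwFrameA|&lt;br /&gt;
&amp;lt;source lang=&amp;quot;Matlab&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
% limit the number of computational threads to the cores&lt;br /&gt;
% granted by Slurm (SLURM_CPUS_PER_TASK must be set)&lt;br /&gt;
N = str2num(getenv(&#039;SLURM_CPUS_PER_TASK&#039;));&lt;br /&gt;
maxNumCompThreads(N);&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;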
&lt;br /&gt;
== Using the Parallel Computing Toolbox (PCT) ==&lt;br /&gt;
&lt;br /&gt;
By using the PCT one can make explicit use of several cores on multicore processors to parallelize MATLAB applications without MPI programming. Under MATLAB version 8.4 and earlier, this toolbox provides 12 workers (MATLAB computational engines) to execute applications locally on a single multicore node. Under MATLAB version 8.5 and later, the number of workers available is equal to the number of cores on a single node (up to a maximum of 512).&lt;br /&gt;
&lt;br /&gt;
If multiple PCT jobs are running at the same time, they all write temporary MATLAB job information to the same location. This race condition can cause one or more of the parallel MATLAB jobs to fail to use the parallel functionality of the toolbox.&lt;br /&gt;
&lt;br /&gt;
To solve this issue, each MATLAB job should explicitly set a unique location where these files are created. This can be accomplished by the following snippet of code added to your MATLAB script.&lt;br /&gt;
&lt;br /&gt;
{{bwFrameA|&lt;br /&gt;
&amp;lt;source lang=&amp;quot;Matlab&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
% create a local cluster object&lt;br /&gt;
pc = parcluster(&#039;local&#039;)&lt;br /&gt;
&lt;br /&gt;
% get the number of dedicated cores from environment&lt;br /&gt;
pc.NumWorkers = str2num(getenv(&#039;SLURM_NPROCS&#039;))&lt;br /&gt;
&lt;br /&gt;
% explicitly set the JobStorageLocation to the tmp directory that is unique to each cluster job (and is on local, fast scratch)&lt;br /&gt;
parpool_tmpdir = [getenv(&#039;TMP&#039;),&#039;/.matlab/local_cluster_jobs/slurm_jobID_&#039;,getenv(&#039;SLURM_JOB_ID&#039;)]&lt;br /&gt;
mkdir(parpool_tmpdir)&lt;br /&gt;
pc.JobStorageLocation = parpool_tmpdir&lt;br /&gt;
&lt;br /&gt;
% start the parallel pool&lt;br /&gt;
parpool(pc)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
Note: The code snippet also sets the correct number of parallel workers in MATLAB according to the total number of processes dedicated to the job given by the environment variable &amp;lt;span style=&amp;quot;background:#edeae2;margin:2px;padding:1px;border:1px dotted #808080&amp;quot;&amp;gt;$SLURM_NPROCS&amp;lt;/span&amp;gt; in the job submission file.&lt;br /&gt;
&lt;br /&gt;
= General Performance Tips for MATLAB =&lt;br /&gt;
&lt;br /&gt;
MATLAB data structures (arrays or matrices) are dynamic in size, i.e. MATLAB will automatically resize the structure on demand. Although this seems to be convenient, MATLAB continually needs to allocate a new chunk of memory and copy over the data to the new block of memory as the array or matrix grows in a loop. This may take a significant amount of extra time during execution of the program.&lt;br /&gt;
&lt;br /&gt;
Code performance can often be drastically improved by preallocating memory for the final expected size of the array or matrix before actually starting the processing loop. In order to preallocate an array of strings, you can use MATLAB&#039;s built-in cell function. In order to preallocate an array or matrix of numbers, you can use MATLAB&#039;s built-in zeros function.&lt;br /&gt;
&lt;br /&gt;
The performance benefit of preallocation is illustrated with the following example code.&lt;br /&gt;
&lt;br /&gt;
{{bwFrameA|&lt;br /&gt;
&amp;lt;source lang=&amp;quot;Matlab&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
% prealloc.m&lt;br /&gt;
&lt;br /&gt;
clear all;&lt;br /&gt;
&lt;br /&gt;
num=10000000;&lt;br /&gt;
&lt;br /&gt;
disp(&#039;Without preallocation:&#039;)&lt;br /&gt;
tic&lt;br /&gt;
for i=1:num&lt;br /&gt;
    a(i)=i;&lt;br /&gt;
end&lt;br /&gt;
toc&lt;br /&gt;
&lt;br /&gt;
disp(&#039;With preallocation:&#039;)&lt;br /&gt;
tic&lt;br /&gt;
b=zeros(1,num);&lt;br /&gt;
for i=1:num&lt;br /&gt;
    b(i)=i;&lt;br /&gt;
end&lt;br /&gt;
toc&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
On a compute node, the result may look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Without preallocation:&lt;br /&gt;
Elapsed time is 2.879446 seconds.&lt;br /&gt;
With preallocation:&lt;br /&gt;
Elapsed time is 0.097557 seconds.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that the code runs almost 30 times faster with preallocation.&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15177</id>
		<title>BwUniCluster3.0/Running Jobs</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15177"/>
		<updated>2025-07-25T04:40:24Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Short Queues */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Purpose and function of a queuing system =&lt;br /&gt;
&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are either executed automatically via a batch script or the resources can be used interactively.&amp;lt;br&amp;gt;&lt;br /&gt;
General procedure: see [[Running_Calculations | Running Calculations]]&lt;br /&gt;
&lt;br /&gt;
== Job submission process ==&lt;br /&gt;
&lt;br /&gt;
bwUniCluster 3.0 uses the workload management software Slurm. Therefore, any job submission by the user must be performed via Slurm commands. Slurm queues and runs user jobs based on fair-sharing policies.&lt;br /&gt;
&lt;br /&gt;
== Slurm ==&lt;br /&gt;
&lt;br /&gt;
The HPC workload manager on bwUniCluster 3.0 is Slurm.&lt;br /&gt;
Slurm is a cluster management and job scheduling system. Slurm has three key functions. &lt;br /&gt;
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work. &lt;br /&gt;
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. &lt;br /&gt;
* It arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&lt;br /&gt;
Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and amount of main memory, and to submit all of this, i.e. the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to the resource and workload management software.&lt;br /&gt;
&lt;br /&gt;
== Terms and definitions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Partitions &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm manages job queues for different &#039;&#039;&#039;partitions&#039;&#039;&#039;. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different partitions:&lt;br /&gt;
&lt;br /&gt;
* CPU-only nodes&lt;br /&gt;
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each&lt;br /&gt;
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each&lt;br /&gt;
* GPU-accelerated nodes&lt;br /&gt;
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs&lt;br /&gt;
** 4-socket node with 4x AMD Instinct accelerator&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Queues &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Job &#039;&#039;&#039;queues&#039;&#039;&#039; are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different main types of queues:&lt;br /&gt;
* Regular queues&lt;br /&gt;
** cpu: Jobs that request CPU-only nodes.&lt;br /&gt;
** gpu: Jobs that request GPU-accelerated nodes.&lt;br /&gt;
* Development queues (dev)&lt;br /&gt;
** Short, usually interactive jobs used for developing, compiling and testing code and workflows. The intention behind development queues is to give users immediate access to compute resources without having to wait. They are the place for short bursts of heavy compute that would disturb other users if run on the login nodes.&lt;br /&gt;
&lt;br /&gt;
Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit within the limits imposed by the queues. A request for compute resources on bwUniCluster 3.0 &amp;lt;font color=red&amp;gt;requires at least the specification of the &#039;&#039;&#039;queue&#039;&#039;&#039; and the &#039;&#039;&#039;time&#039;&#039;&#039;&amp;lt;/font&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Jobs &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Jobs can be run non-interactively as &#039;&#039;&#039;batch jobs&#039;&#039;&#039; or as &#039;&#039;&#039;interactive jobs&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command.&lt;br /&gt;
For interactive jobs, the resources are requested with the &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command. As soon as the computing resources are available and allocated, a command-line prompt is returned on a compute node and the user is free to use the allocated resources.&lt;br /&gt;
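A minimal sketch of such a Bash script for &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; (the queue name, resource values and payload are illustrative assumptions, not site defaults):&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --partition=cpu       # queue name (illustrative assumption)
#SBATCH --ntasks=1            # one task
#SBATCH --time=00:10:00       # wall clock time limit (always required)
#SBATCH --mem-per-cpu=2000mb  # memory per core (illustrative assumption)

# Payload: any sequence of commands; a placeholder here.
# SLURM_NTASKS is set by Slurm inside a job; outside a job it is
# unset, so it defaults to 1 below.
MSG="job payload ran with ${SLURM_NTASKS:-1} task(s)"
echo "$MSG"
```

Saved as e.g. jobscript.sh, the script would be enqueued with &amp;lt;code&amp;gt;sbatch jobscript.sh&amp;lt;/code&amp;gt;; the #SBATCH lines are ordinary comments to Bash and are interpreted only by Slurm.&lt;br /&gt;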
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Please remember:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Heavy computations are not allowed on the login nodes&#039;&#039;&#039;.&amp;lt;br&amp;gt;Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
* &#039;&#039;&#039;Development queues&#039;&#039;&#039; are meant for &#039;&#039;&#039;development tasks&#039;&#039;&#039;.&amp;lt;br&amp;gt;Do not misuse these queues for regular, short-running jobs or chain jobs! Only one job may run at a time, and the maximum queue length is limited to 3.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Queues on bwUniCluster 3.0 = &lt;br /&gt;
== Policy ==&lt;br /&gt;
&lt;br /&gt;
The computing time is provided in accordance with the &#039;&#039;&#039;fair share policy&#039;&#039;&#039;: the individual investment shares of the respective universities and the resources already used by their members are taken into account. Furthermore, the following throttling policy is active: the &#039;&#039;&#039;maximum number of physical cores&#039;&#039;&#039; in use at any given time is &#039;&#039;&#039;1920 per user&#039;&#039;&#039; (aggregated over all running jobs). This corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and to maximize the number of users who can access computing time at the same time.&lt;br /&gt;
&lt;br /&gt;
== Regular Queues ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node-Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
| mem-per-cpu=12090mb&lt;br /&gt;
| mem=380001mb&lt;br /&gt;
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
| mem-per-gpu=128200mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=48:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 1: Regular Queues&lt;br /&gt;
&lt;br /&gt;
== Short Queues ==&lt;br /&gt;
&amp;lt;p style=&amp;quot;color:red; &amp;quot;&amp;gt;&amp;lt;b&amp;gt;Queues with a short runtime of 30 minutes.&amp;lt;/b&amp;gt;&amp;lt;/p&amp;gt; &lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=94000mb&amp;lt;br/&amp;gt;cpus-per-gpu=12&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)&lt;br /&gt;
|}&lt;br /&gt;
Table 2: Short Queues&lt;br /&gt;
&lt;br /&gt;
== Development Queues ==&lt;br /&gt;
These queues are intended only for development tasks, i.e. debugging or performance optimization.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_a100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&amp;lt;br/&amp;gt;&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16 &lt;br /&gt;
| gres=gpu:1&lt;br /&gt;
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 3: Development Queues&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The default resources of a queue define the number of tasks and the amount of memory if these are not explicitly given with the sbatch command. The resource options &#039;&#039;--time&#039;&#039;, &#039;&#039;--ntasks&#039;&#039;, &#039;&#039;--nodes&#039;&#039;, &#039;&#039;--mem&#039;&#039; and &#039;&#039;--mem-per-cpu&#039;&#039; are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].&lt;br /&gt;
&lt;br /&gt;
== Check available resources: sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC provides a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. Users can use this information to submit jobs that fit the idle resources and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The following command displays what resources are available for immediate use for the whole partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle &lt;br /&gt;
Partition dev_cpu                 :      1 nodes idle&lt;br /&gt;
Partition cpu                     :      1 nodes idle&lt;br /&gt;
Partition highmem                 :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_h100            :      0 nodes idle&lt;br /&gt;
Partition gpu_h100                :      0 nodes idle&lt;br /&gt;
Partition gpu_mi300               :      0 nodes idle&lt;br /&gt;
Partition dev_cpu_il              :      7 nodes idle&lt;br /&gt;
Partition cpu_il                  :      2 nodes idle&lt;br /&gt;
Partition dev_gpu_a100_il         :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_il             :      0 nodes idle&lt;br /&gt;
Partition gpu_h100_il             :      1 nodes idle&lt;br /&gt;
Partition gpu_a100_short          :      0 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Running Jobs =&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
Important Slurm commands for non-administrators working on bwUniCluster 3.0.&lt;br /&gt;
{| width=850px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Batch Jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
Batch jobs are submitted using the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. The main purpose of the &#039;&#039;&#039;sbatch&#039;&#039;&#039; command is to specify the resources that are needed to run the job; &#039;&#039;&#039;sbatch&#039;&#039;&#039; will then queue the batch job. However, when a batch job starts depends on the availability of the requested resources and on the fair-share value.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be given on the command line or in your job script. Different defaults for some of these options are set depending on the queue and can be found [[BwUniCluster3.0/Slurm | here]].&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! style=&amp;quot;width:8%&amp;quot;| Command line&lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;| Script&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t, --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N, --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n, --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count of tasks per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c, --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in MegaByte per node. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J, --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A, --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The project group a job is accounted on is shown behind &amp;quot;Account=&amp;quot; in the output of &amp;quot;scontrol show job&amp;quot;. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p, --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Job constraint LSDF filesystems.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Job constraint BeeOND filesystem.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
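The options above can be combined in the header of a job script. A sketch (queue name, resource values and file name are illustrative assumptions):&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --partition=cpu              # queue name (illustrative assumption)
#SBATCH --nodes=1                    # one node
#SBATCH --ntasks-per-node=4          # four tasks on that node
#SBATCH --time=01:00:00              # 1 hour wall clock limit
#SBATCH --job-name=example_job       # name shown by squeue
#SBATCH --output=example_job-%j.out  # %j expands to the job id

# SLURM_NTASKS_PER_NODE is set by Slurm inside a job; outside a job
# it is unset, so default to the requested 4 here.
NTASKS=${SLURM_NTASKS_PER_NODE:-4}
echo "requested $NTASKS tasks per node"
```

Command-line options take precedence over the corresponding #SBATCH lines, so the same script can be reused with modified resources, e.g. &amp;lt;code&amp;gt;sbatch --time=02:00:00 jobscript.sh&amp;lt;/code&amp;gt;.&lt;br /&gt;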
&lt;br /&gt;
== Interactive Jobs: salloc ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 you are only allowed to run short jobs (&amp;lt;&amp;lt; 1 hour) with low memory requirements (&amp;lt;&amp;lt; 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs requesting more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For a serial application on a compute node that requires 5000 MByte of memory, with the interactive run limited to 2 hours, execute the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -n 1 -t 120 --mem=5000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will then get one core on a compute node within the partition &amp;quot;cpu&amp;quot;. After executing this command, &#039;&#039;&#039;DO NOT CLOSE&#039;&#039;&#039; your current terminal session; wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core, simply type the name of the executable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./&amp;lt;my_serial_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can also start a graphical X11 terminal connecting you to the dedicated resource, which is available for 2 hours. Start it with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ xterm&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
An interactive parallel application may run on one compute node or on many compute nodes (e.g. 5 nodes with 96 cores each) and usually requires a certain amount of memory (e.g. 50 GByte) and a maximum time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00  --mem=50gb&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run parallel jobs on 480 cores with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.&lt;br /&gt;
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to&lt;br /&gt;
connect to the running interactive job and then to a specific node:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --jobid=XXXXXXXX --pty /bin/bash&lt;br /&gt;
$ srun --nodelist=uc3nXXX --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the jobid and the nodelist can be shown.&lt;br /&gt;
&lt;br /&gt;
If you want to run MPI programs, you can do so by simply typing mpirun &amp;lt;program_name&amp;gt;. Your program will then run on all 480 cores. A very simple example of starting a parallel job:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also start the debugger ddt by the commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module add devel/ddt&lt;br /&gt;
$ ddt &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The above commands will execute the parallel program &amp;lt;my_mpi_program&amp;gt; on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 50 &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are using Intel MPI, you must start &amp;lt;my_mpi_program&amp;gt; with the command mpiexec.hydra (instead of mpirun).&lt;br /&gt;
&lt;br /&gt;
== Interactive Computing with Jupyter ==&lt;br /&gt;
&lt;br /&gt;
== Monitor and manage jobs ==&lt;br /&gt;
&lt;br /&gt;
=== List of your submitted jobs : squeue ===&lt;br /&gt;
Displays information about YOUR active, pending and/or recently completed jobs. The command squeue is explained in detail at https://slurm.schedmd.com/squeue.html or via the manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;squeue&#039;&#039; example on bwUniCluster 3.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  R       8:15      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123 PD       0:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  R       2:41      1 uc3n084&lt;br /&gt;
$ squeue -l&lt;br /&gt;
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  RUNNING       8:55     20:00      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123  PENDING       0:00     20:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  RUNNING       3:21     20:00      1 uc3n084&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Detailed job information : scontrol show job ===&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for one specified job. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail at https://slurm.schedmd.com/scontrol.html or via the manpage (man scontrol). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Here is an example from bwUniCluster 3.0.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
1262       cpu     wrap ka_zs040  R       1:12      1 uc3n002&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my running job with jobid 1262&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 1262&lt;br /&gt;
&lt;br /&gt;
JobId=1262 JobName=wrap&lt;br /&gt;
   UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=4246 Nice=0 Account=ka QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30&lt;br /&gt;
   AccrueTime=2025-04-04T10:01:30&lt;br /&gt;
   StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main&lt;br /&gt;
   Partition=cpu AllocNode:Sid=uc3n999:2819841&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc3n002&lt;br /&gt;
   BatchHost=uc3n002&lt;br /&gt;
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=2000M,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=2,mem=4000M,node=1,billing=2&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=(null)&lt;br /&gt;
   WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402&lt;br /&gt;
   StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
The scancel command is used to cancel jobs. It is explained in detail at https://slurm.schedmd.com/scancel.html or via the manpage (man scancel). The command is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Slurm Options =&lt;br /&gt;
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]&lt;br /&gt;
&lt;br /&gt;
= Best Practices =&lt;br /&gt;
&lt;br /&gt;
== Step-by-Step example==&lt;br /&gt;
&lt;br /&gt;
== Dos and Don&#039;ts ==&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15162</id>
		<title>BwUniCluster3.0/Running Jobs</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Running_Jobs&amp;diff=15162"/>
		<updated>2025-07-21T15:18:20Z</updated>

		<summary type="html">&lt;p&gt;S Braun: /* Queues on bwUniCluster 3.0 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Purpose and function of a queuing system =&lt;br /&gt;
&lt;br /&gt;
All compute activities on bwUniCluster 3.0 must be performed on the compute nodes, which become available only by requesting the corresponding resources via the queuing system. As soon as the requested resources are allocated, either a batch script is executed automatically or the nodes can be used interactively.&amp;lt;br&amp;gt;&lt;br /&gt;
For the general procedure, see [[Running_Calculations | Running Calculations]].&lt;br /&gt;
&lt;br /&gt;
== Job submission process ==&lt;br /&gt;
&lt;br /&gt;
bwUniCluster 3.0 uses the workload management software Slurm. Any job submission therefore has to be performed via Slurm commands. Slurm queues and runs user jobs based on fair-share policies.&lt;br /&gt;
&lt;br /&gt;
== Slurm ==&lt;br /&gt;
&lt;br /&gt;
The HPC workload manager on bwUniCluster 3.0 is Slurm.&lt;br /&gt;
Slurm is a cluster management and job scheduling system with three key functions: &lt;br /&gt;
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work. &lt;br /&gt;
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. &lt;br /&gt;
* It arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&lt;br /&gt;
Any calculation on the compute nodes of bwUniCluster 3.0 requires the user to define it as a sequence of commands, together with the required run time, number of CPU cores and amount of main memory, and to submit all of this, i.e. the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to the resource and workload management software.&lt;br /&gt;
&lt;br /&gt;
== Terms and definitions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Partitions &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm manages job queues for different &#039;&#039;&#039;partitions&#039;&#039;&#039;. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different partitions:&lt;br /&gt;
&lt;br /&gt;
* CPU-only nodes&lt;br /&gt;
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each&lt;br /&gt;
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each&lt;br /&gt;
* GPU-accelerated nodes&lt;br /&gt;
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs&lt;br /&gt;
** 4-socket node with 4x AMD Instinct accelerators&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; Queues &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Job &#039;&#039;&#039;queues&#039;&#039;&#039; are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 there are different main types of queues:&lt;br /&gt;
* Regular queues&lt;br /&gt;
** cpu: Jobs that request CPU-only nodes.&lt;br /&gt;
** gpu: Jobs that request GPU-accelerated nodes.&lt;br /&gt;
* Development queues (dev)&lt;br /&gt;
** Short, usually interactive jobs used for developing, compiling and testing code and workflows. The intention behind development queues is to give users immediate access to compute resources without long waiting times. This is the place for short but heavy computations that would otherwise disturb other users on the login nodes.&lt;br /&gt;
&lt;br /&gt;
Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 &amp;lt;font color=red&amp;gt;requires at least the specification of the &#039;&#039;&#039;queue&#039;&#039;&#039; and the &#039;&#039;&#039;time&#039;&#039;&#039;&amp;lt;/font&amp;gt;.&lt;br /&gt;
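&lt;br /&gt;
For example, a minimal submission that specifies only the mandatory queue and time (the script name &amp;lt;code&amp;gt;job.sh&amp;lt;/code&amp;gt; is a placeholder) could look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p cpu -t 01:00:00 job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;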
&lt;br /&gt;
&#039;&#039;&#039; Jobs &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Jobs can be run non-interactively as &#039;&#039;&#039;batch jobs&#039;&#039;&#039; or as &#039;&#039;&#039;interactive jobs&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This script is queued and executed as soon as the requested compute resources are available and allocated. Jobs are enqueued with the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command.&lt;br /&gt;
For interactive jobs, the resources are requested with the &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command. As soon as the computing resources are available and allocated, a command line prompt is opened on a compute node and the user can work freely with the allocated resources.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Please remember:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Heavy computations are not allowed on the login nodes&#039;&#039;&#039;.&amp;lt;br&amp;gt;Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
* &#039;&#039;&#039;Development queues&#039;&#039;&#039; are meant for &#039;&#039;&#039;development tasks&#039;&#039;&#039;.&amp;lt;br&amp;gt;Do not misuse this queue for regular, short-running jobs or chain jobs! Only one job may run at a time, and at most 3 jobs may be queued.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Queues on bwUniCluster 3.0 = &lt;br /&gt;
== Policy ==&lt;br /&gt;
&lt;br /&gt;
The computing time is provided in accordance with the &#039;&#039;&#039;fair share policy&#039;&#039;&#039;. The individual investment shares of the respective university and the resources already used by its members are taken into account. Furthermore, the following throttling policy is active: the &#039;&#039;&#039;maximum number of physical cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920 per user&#039;&#039;&#039; (aggregated over all running jobs). This corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and to maximize the number of users who can access computing time at the same time.&lt;br /&gt;
&lt;br /&gt;
== Regular Queues ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node-Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
| mem-per-cpu=12090mb&lt;br /&gt;
| mem=380001mb&lt;br /&gt;
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
| mem-per-gpu=128200mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| &lt;br /&gt;
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16&lt;br /&gt;
| &lt;br /&gt;
| time=48:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 1: Regular Queues&lt;br /&gt;
&lt;br /&gt;
== Short Queues ==&lt;br /&gt;
Queues with a short maximum runtime of 30 minutes.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_short&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)&lt;br /&gt;
|}&lt;br /&gt;
Table 2: Short Queues&lt;br /&gt;
&lt;br /&gt;
== Development Queues ==&lt;br /&gt;
Only for development, i.e. debugging or performance optimization ...&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:5%&amp;quot;| Queue&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:23%&amp;quot;| Default Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Minimal Resources&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Maximum Resources&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
| mem-per-cpu=2000mb&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
| mem-per-gpu=193300mb&amp;lt;br/&amp;gt;cpus-per-gpu=24&lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;dev_gpu_a100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&amp;lt;br/&amp;gt;&lt;br /&gt;
| mem-per-gpu=127500mb&amp;lt;br/&amp;gt;cpus-per-gpu=16 &lt;br /&gt;
| &lt;br /&gt;
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) &lt;br /&gt;
|}&lt;br /&gt;
Table 3: Development Queues&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The default resources of a queue define the number of tasks and the memory if these are not explicitly given with the sbatch command. The resource options &#039;&#039;--time&#039;&#039;, &#039;&#039;--ntasks&#039;&#039;, &#039;&#039;--nodes&#039;&#039;, &#039;&#039;--mem&#039;&#039; and &#039;&#039;--mem-per-cpu&#039;&#039; are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].&lt;br /&gt;
&lt;br /&gt;
== Check available resources: sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo is used to view partition and node information on a system running Slurm. It incorporates down time, reservations, and node state information when determining the available backfill window.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The following command displays what resources are available for immediate use for the whole partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle&lt;br /&gt;
Partition dev_cpu                 :      2 nodes idle&lt;br /&gt;
Partition cpu                     :     68 nodes idle&lt;br /&gt;
Partition highmem                 :      4 nodes idle&lt;br /&gt;
Partition dev_gpu_h100            :      0 nodes idle&lt;br /&gt;
Partition gpu_h100                :     11 nodes idle&lt;br /&gt;
Partition gpu_mi300               :      1 nodes idle&lt;br /&gt;
Partition dev_cpu_il              :      0 nodes idle&lt;br /&gt;
Partition cpu_il                  :      0 nodes idle&lt;br /&gt;
Partition dev_gpu_a100_il         :      0 nodes idle&lt;br /&gt;
Partition gpu_a100_il             :      0 nodes idle&lt;br /&gt;
Partition gpu_h100_il             :      0 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Running Jobs =&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
Important Slurm commands for non-administrators working on bwUniCluster 3.0.&lt;br /&gt;
{| width=850px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Batch Jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
Batch jobs are submitted with the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. The main purpose of &#039;&#039;&#039;sbatch&#039;&#039;&#039; is to specify the resources that are needed to run the job. &#039;&#039;&#039;sbatch&#039;&#039;&#039; will then queue the batch job. However, when the batch job starts depends on the availability of the requested resources and on the fair-share value.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be used on the command line or in your job script. Different defaults for some of these options are set depending on the queue and can be found [[BwUniCluster3.0/Slurm | here]].&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! style=&amp;quot;width:8%&amp;quot;| Command line&lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;| Script&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t, --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N, --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n, --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count of tasks per node.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c, --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in MegaByte per node. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J, --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A, --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command &amp;quot;scontrol show job&amp;quot; the project group the job is accounted on can be seen behind &amp;quot;Account=&amp;quot;. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p, --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Job constraint LSDF filesystems.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C, --constraint=&#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Job constraint BeeOND filesystem.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
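A minimal job script combining several of the options above might look as follows (all resource values are illustrative only):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --partition=cpu&lt;br /&gt;
#SBATCH --time=01:00:00&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --mem-per-cpu=2000mb&lt;br /&gt;
#SBATCH --job-name=my_job&lt;br /&gt;
#SBATCH --output=my_job.out&lt;br /&gt;
&lt;br /&gt;
./&amp;lt;my_serial_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Such a script would be submitted with &amp;lt;code&amp;gt;sbatch job.sh&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;job.sh&amp;lt;/code&amp;gt; is the name of the script file.&lt;br /&gt;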
&lt;br /&gt;
== Interactive Jobs: salloc ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 you are only allowed to run short jobs (&amp;lt;&amp;lt; 1 hour) with low memory requirements (&amp;lt;&amp;lt; 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that request more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For a serial application running on a compute node that requires 5000 MByte of memory, with the interactive run limited to 2 hours, the following command has to be executed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -n 1 -t 120 --mem=5000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will then get one core on a compute node within the partition &amp;quot;cpu&amp;quot;. After executing this command, &#039;&#039;&#039;DO NOT CLOSE&#039;&#039;&#039; your current terminal session; wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core. To run a serial program on this core, simply type the name of the executable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./&amp;lt;my_serial_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ xterm&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
An interactive parallel application running on one or several compute nodes (here e.g. 5 nodes with 96 cores each) usually requires an amount of memory in GByte (e.g. 50 GByte per node) and a maximum time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00  --mem=50gb&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.&lt;br /&gt;
If you want to have access to another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to&lt;br /&gt;
connect to the running interactive job and then to a specific node:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --jobid=XXXXXXXX --pty /bin/bash&lt;br /&gt;
$ srun --nodelist=uc3nXXX --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the jobid and the nodelist can be shown.&lt;br /&gt;
&lt;br /&gt;
If you want to run MPI programs, you can do so simply by typing mpirun &amp;lt;program_name&amp;gt;. Your program will then run on all 480 allocated cores. A very simple example of starting a parallel job:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also start the debugger ddt with the following commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module add devel/ddt&lt;br /&gt;
$ ddt &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The above commands execute the parallel program &amp;lt;my_mpi_program&amp;gt; on all available cores. You can also start parallel programs on a subset of the cores, for example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 50 &amp;lt;my_mpi_program&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are using Intel MPI, you must start &amp;lt;my_mpi_program&amp;gt; with the command mpiexec.hydra (instead of mpirun).&lt;br /&gt;
&lt;br /&gt;
== Interactive Computing with Jupyter ==&lt;br /&gt;
&lt;br /&gt;
== Monitor and manage jobs ==&lt;br /&gt;
&lt;br /&gt;
=== List of your submitted jobs : squeue ===&lt;br /&gt;
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via the manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;squeue&#039;&#039; example on bwUniCluster 3.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  R       8:15      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123 PD       0:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  R       2:41      1 uc3n084&lt;br /&gt;
$ squeue -l&lt;br /&gt;
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)&lt;br /&gt;
              1262       cpu     wrap ka_ab123  RUNNING       8:55     20:00      1 uc3n002&lt;br /&gt;
              1267 dev_gpu_h     wrap ka_ab123  PENDING       0:00     20:00      1 (Resources)&lt;br /&gt;
              1265   highmem     wrap ka_ab123  RUNNING       3:21     20:00      1 uc3n084&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
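The estimated start time of pending jobs can be queried with the &amp;lt;code&amp;gt;--start&amp;lt;/code&amp;gt; option:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue --start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;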
&lt;br /&gt;
=== Detailed job information : scontrol show job ===&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for a specified job. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via the manpage (man scontrol). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Here is an example from bwUniCluster 3.0.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
1262       cpu     wrap ka_zs040  R       1:12      1 uc3n002&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my running job with jobid 1262&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 1262&lt;br /&gt;
&lt;br /&gt;
JobId=1262 JobName=wrap&lt;br /&gt;
   UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=4246 Nice=0 Account=ka QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30&lt;br /&gt;
   AccrueTime=2025-04-04T10:01:30&lt;br /&gt;
   StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main&lt;br /&gt;
   Partition=cpu AllocNode:Sid=uc3n999:2819841&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc3n002&lt;br /&gt;
   BatchHost=uc3n002&lt;br /&gt;
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=2000M,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=2,mem=4000M,node=1,billing=2&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=(null)&lt;br /&gt;
   WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402&lt;br /&gt;
   StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Slurm Options =&lt;br /&gt;
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]&lt;br /&gt;
&lt;br /&gt;
= Best Practices =&lt;br /&gt;
&lt;br /&gt;
== Step-by-Step example==&lt;br /&gt;
&lt;br /&gt;
== Dos and Don&#039;ts ==&lt;/div&gt;</summary>
		<author><name>S Braun</name></author>
	</entry>
</feed>