bwHPC Wiki - User contributions [en]

Registration/bwUniCluster/Entitlement

2026-03-02T13:01:28Z

P Schuhmacher:

<div style="border: 3px solid #e2e3e5; padding: 15px; background-color: #ffffff; margin: 10px 0;">
The bwUniCluster entitlement (see [https://www.bwidm.de/attribute.php#Berechtigung eduPersonEntitlement]) issued by a university assures the operator of the bwUniCluster that the compute activities of that university's members comply with the German Foreign Trade Act (Außenwirtschaftsgesetz - AWG) and the German Foreign Trade Regulations (Außenwirtschaftsverordnung - AWV).
</div>

= Step A: bwUniCluster Entitlement =

<div style="border: 3px solid #0066cc; padding: 15px; background-color: #e7f3ff; margin: 10px 0;">
The entitlement is called '''bwUniCluster''' and each university assigns the entitlement '''only''' to its own members.
</div>

To register for the bwUniCluster you need the '''bwUniCluster entitlement''' issued by your university.

If you are not sure if you already have an entitlement, please check it first with the [[#Check_your_Entitlements|'''Check your Entitlements''']] guide below.
If you need the entitlement, please follow the link for your institution or contact your local service desk if no information is provided:
* [https://www.hs-esslingen.de/informatik-und-informationstechnik/forschung-labore/forschung/laufende-projekte/bwhpc-s5 Hochschule Esslingen]
* [[Registration/bwIDM-Entitlements-Uni-Freiburg|Universität Freiburg]]
* [https://heiservices.uni-heidelberg.de/entitlement Universität Heidelberg] (access only within Uni Heidelberg network)
* [https://kim.uni-hohenheim.de/bwhpc-account Universität Hohenheim]
* [https://www.scc.kit.edu/downloads/ISM/SD-HPC-Formulare/Accessform_bwUniCluster3_v1_DE_EN_2026.pdf Karlsruhe Institute of Technology (KIT)]
* [https://www.kim.uni-konstanz.de/en/services/research-and-teaching/high-performance-computing/access-to-bwunicluster Universität Konstanz]
* [[BWUniCluster_User_Access_Members_Uni_Mannheim|Universität Mannheim]]
* [https://www.hlrs.de/apply-for-computing-time/bw-uni-cluster Universität Stuttgart]
* [https://uni-tuebingen.de/de/155157 Universität Tübingen]
* [[BWUniCluster_User_Access_Members_Uni_Ulm|Universität Ulm]]
* [[Registration/HAW|HAW BW e.V.]] and Duale Hochschule Baden-Württemberg: Please contact your local service desk. In case of questions contact [mailto:hpc-at-haw@hs-esslingen.de hpc-at-haw@hs-esslingen.de]

<div style="border: 3px solid #ffc107; padding: 15px; background-color: #fff3cd; margin: 10px 0;">
'''Entitlement Synchronization:'''

After your university assigns the entitlement, it takes some time for it to synchronize across the system.
* '''Check your entitlements first''' (see below) before contacting support
* '''If the entitlement does not appear within 24 hours,''' contact your local service desk
</div>

== Check your Entitlements ==

To check whether you already have the entitlement, log in to '''https://login.bwidm.de/user/index.xhtml'''.
To see the list of your entitlements, first select the '''Shibboleth''' tab at the top.
If the list below <code><nowiki>urn:oid:1.3.6.1.4.1.5923.1.1.1.7</nowiki></code> contains
<pre>http://bwidm.de/entitlement/bwUniCluster</pre>
you already have the entitlement and can skip step A.

<div style="border: 3px solid #6c757d; padding: 15px; background-color: #e2e3e5; margin: 10px 0;">
'''Note:''' <code><nowiki>http://bwidm.de/entitlement/bwUniCluster</nowiki></code> is an attribute and not a link!

See [https://www.bwidm.de/dienste.php bwUniCluster und bwForCluster] for more information about needed attributes for this service.
</div>
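
The check described above can also be sketched as a small shell function. This is only an illustration: it assumes that you have copied the values listed under the attribute into a text stream, one value per line, and the function name <code>has_bwunicluster_entitlement</code> is hypothetical.

<syntaxhighlight lang="bash">
#!/bin/bash
# Sketch: look for the bwUniCluster entitlement in a list of attribute values.
# Assumption: the values were copied by hand from the Shibboleth tab, one per line.

ENTITLEMENT="http://bwidm.de/entitlement/bwUniCluster"

# Reads attribute values from stdin; succeeds if one line matches exactly.
has_bwunicluster_entitlement() {
    grep -qxF "$ENTITLEMENT"
}

# Example with made-up input
if printf '%s\n' "http://bwidm.de/entitlement/bwUniCluster" | has_bwunicluster_entitlement; then
    echo "Entitlement found, step A can be skipped."
else
    echo "Entitlement missing, follow the instructions for your institution."
fi
</syntaxhighlight>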

[[File:BwIDM-idp.png|center|600px|thumb|Verify Entitlement.]]

----

[[Registration/bwUniCluster/Service | Go to step B]]


Development/VS Code

2025-10-29T14:14:44Z

P Schuhmacher: /* Connect to code-server */

== Overview ==

[[File:vscode.png|thumb|Visual Studio Code, Source: https://code.visualstudio.com/|450px]]

[https://github.com/Microsoft/vscode Visual Studio Code] (VS Code) is an open-source code editor from Microsoft. It is one of the most popular IDEs according to a [https://survey.stackoverflow.co/2024/technology#1-integrated-development-environment Stack Overflow survey]. Its functionality can easily be extended by installing extensions, which add '''language support''', '''debugging''' or '''remote development''' for almost any use case. You can install VS Code locally and use it for remote development. The following table shows which instructions to follow to develop on a bwHPC cluster with VS Code.

{| class="wikitable"
|-
!scope="column"|Cluster
! Description
! Commands
|-
!scope="column"| bwUniCluster
| Setup with [[Development/VS_Code#code-server | Code Server]]
| <source lang="bash">module load devel/code-server</source>
|-
!scope="column"| Other
| Setup with [[Development/VS_Code#Connect_to_Remote_Jupyter_Kernel | Jupyter kernel]] or [[Development/VS_Code#Install_Code-Server | install Code-Server]]
| -
|-
|}

== Remote - SSH ==

In order to remotely develop and debug code at HPC facilities, you can use the [https://code.visualstudio.com/docs/remote/ssh '''Remote - SSH''' extension]. The extension connects your locally installed VS Code to the remote servers. In contrast to graphical IDEs running in a remote desktop session (RDP, VNC), there are no negative effects such as laggy reactions to your input or blurred fonts.

=== Installation and Configuration ===

[[File:vscode-extensions-button.png|vscode-extensions-button.png|30px]] 
In order to install the Remote - SSH extension, click on the Extensions (Erweiterungen) button in the left side bar and enter “remote ssh” in the search field. Choose '''Remote - SSH''' from the list of results and click on '''Install'''.
 
[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]] 
In order to configure remote connections, open the Remote-Explorer extension. On Linux systems, the file <code>~/.ssh/config</code> is evaluated automatically, and the targets defined in it already appear in the left side bar.
 
[[File:vscode-remoteexplorer-add.png|vscode-remoteexplorer-add.png|350px]] 
If no remote SSH targets are defined in this file, you can add one by clicking on the + symbol. Make sure that “SSH Targets” is active in the drop-down menu of the Remote-Explorer. Enter the connection details <code><user>@<server></code>. You will be asked whether the file <code>~/.ssh/config</code> should be modified or whether another config file should be used or created.
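
As an alternative to the + symbol, a target can be written to the config file by hand. The following is a minimal sketch: the helper name <code>add_ssh_target</code>, the alias <code>uc3</code> and the user ID are placeholders chosen for illustration, and the demonstration writes to a temporary file rather than to your real <code>~/.ssh/config</code>.

<syntaxhighlight lang="bash">
#!/bin/bash
# Sketch: append a Host entry to an SSH config file.
# Assumptions: the alias "uc3" and the user ID are placeholders.

add_ssh_target() {
    local alias_name=$1 host=$2 user=$3 config=${4:-$HOME/.ssh/config}
    mkdir -p "$(dirname "$config")"
    # Do nothing if the alias already exists
    if [ -f "$config" ] && grep -qE "^Host[[:space:]]+${alias_name}\$" "$config"; then
        return 0
    fi
    printf 'Host %s\n    HostName %s\n    User %s\n' "$alias_name" "$host" "$user" >> "$config"
}

# Demonstration on a temporary file instead of the real ~/.ssh/config
demo_config=$(mktemp)
add_ssh_target uc3 uc3.scc.kit.edu "<userID>" "$demo_config"
cat "$demo_config"
rm -f "$demo_config"
</syntaxhighlight>

Targets written this way should show up in the Remote-Explorer in the same way as targets added through the dialog.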

=== Connect to Login Nodes ===

[[File:vscode-remoteexplorer-button.png|vscode-remoteexplorer-button.png|30px]] 
In order to connect to a remote SSH target, open the Remote-Explorer. Right-click a target and connect in the current or a new window. TOTP and password can be entered in the corresponding input fields that open.

You are now logged in on the remote server. As usual, you can open a project directory with the standard key binding Ctrl+k Ctrl+o. You can now edit and debug code.

'''Attention''': Please remember that you are running and debugging the code on a login node. Do not perform resource-intensive tasks. Furthermore, no GPU resources are available to you.

Extensions, which are installed locally, are only usable on your local machine and are not automatically installed remotely. However, as soon as you open the Extensions-Explorer during a remote session, VS Code proposes to install the locally installed extensions remotely.

=== Disconnect from Login Nodes ===

[[File:vscode-remoteexplorer-indicator.png|images/vscode-remoteexplorer-indicator.png|200px]] 
If you want to end your remote session, click the green box in the lower left corner. In the input box that opens, select the “Close Remote Connection” option. If you simply close your VS Code window, some server-side components of VS Code will continue to run remotely.

=== Access to Compute Nodes ===

The workflow described above does not allow debugging on compute nodes, for example nodes requested via an interactive Slurm job. The security settings prevent the login node from being used as a proxy jump host, so there is no direct way to connect your locally installed VS Code to the compute nodes. Debugging GPU code is therefore not possible either, since GPUs are only accessible within Slurm jobs. Please have a look at the overview table in the first chapter to see which solution to follow.

== Code-Server ==

The application [https://github.com/cdr/code-server code-server] runs the server part of VS Code on any machine and makes it accessible in a web browser. This enables, for example, development and debugging on compute nodes.

[[File:code-server.png|thumb|code-server.png|VS Code in web browser: code-server, Source: https://github.com/cdr/code-server|400px]]

=== Install Code-Server ===

If no code-server module is provided, you can install it yourself.
# Download the latest release archive for your system from GitHub and unpack it.
#: <syntaxhighlight lang="bash">
# Look up the version that you want to install: https://github.com/coder/code-server/releases
VERSION=4.101.2
mkdir -p ~/.local/lib ~/.local/bin
curl -fL https://github.com/coder/code-server/releases/download/v$VERSION/code-server-$VERSION-linux-amd64.tar.gz \
| tar -C ~/.local/lib -xz
</syntaxhighlight>
# You can run code-server by executing "./bin/code-server" or add ./bin/code-server to your $PATH and run it with "code-server"
#: <syntaxhighlight lang="bash">
mv ~/.local/lib/code-server-$VERSION-linux-amd64 ~/.local/lib/code-server-$VERSION
ln -s ~/.local/lib/code-server-$VERSION/bin/code-server ~/.local/bin/code-server
# Add the following line to your ~/.bashrc ($HOME is used because "~" is not expanded inside double quotes)
export PATH="$HOME/.local/bin:$PATH"
</syntaxhighlight>

=== Start Code-Server ===

Code-server can be run on either login nodes or compute nodes. In the example shown, an interactive job is started on a GPU partition to run code-server there.

<syntaxhighlight lang="console">$ salloc -p accelerated --gres=gpu:4 --time=30:00 # Start interactive job with 4 GPUs for 30 minutes
$ module load devel/code-server # Load code-server module</syntaxhighlight>
When code-server is started, it opens a web server listening on a certain port. The user has to '''specify the port'''. It can be chosen freely in the unprivileged range (above 1024). If a port is already assigned, e.g. because several users choose the same port, another port must be chosen.
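
One way to pick such a port is sketched below. It is an illustration rather than a site-provided tool: the function names are hypothetical, it assumes a bash built with <code>/dev/tcp</code> support, and it only checks the node it runs on. Another user can still grab the port between the check and the start of code-server, in which case you simply choose again.

<syntaxhighlight lang="bash">
#!/bin/bash
# Sketch: pick a random unprivileged port that nothing is listening on locally.
# Assumption: bash was built with /dev/tcp support (the default on most systems).

port_in_use() {
    # A successful connect means some process already listens on the port.
    (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

pick_free_port() {
    local port
    while :; do
        port=$(( RANDOM % 30000 + 1025 ))   # 1025..31024
        port_in_use "$port" || { echo "$port"; return 0; }
    done
}

PORT=$(pick_free_port)
echo "code-server could be started with --bind-addr 0.0.0.0:${PORT}"
</syntaxhighlight>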

By starting code-server, you are running a web server that can be accessed by anyone logged in to the cluster. To prevent other people from gaining access to your account and data, this web server is '''password protected'''. If no variable <code>PASSWORD</code> is defined, the password in the default config file <code>~/.config/code-server/config.yaml</code> is used. If you want to define your own password, you can either change it in the config file or export the variable <code>PASSWORD</code>.

<syntaxhighlight lang="console">$ PASSWORD=<mySecret> \
code-server \
--bind-addr 0.0.0.0:8081 \
--auth password # Start code-server on port 8081</syntaxhighlight>

{| style="background:#FFCCCC; width:100%;"
| '''Security implications'''
Please note that by starting <code>code-server</code> you are running a web server that can be accessed by everyone logged in on the cluster. 
* '''If password protection is disabled, anybody can access your account and your data.'''
* Choose a '''secure password'''!
* Do '''NOT''' use <code>code-server --link</code>!
|}
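
A random password can be generated instead of inventing one. The following is a minimal sketch, assuming that <code>/dev/urandom</code> and the coreutils <code>base64</code> tool are available; the function name <code>generate_password</code> is hypothetical.

<syntaxhighlight lang="bash">
#!/bin/bash
# Sketch: generate a random password and hand it to code-server via PASSWORD.
# Assumption: /dev/urandom and the coreutils base64 tool are available.

generate_password() {
    # 24 random bytes encode to exactly 32 base64 characters
    head -c 24 /dev/urandom | base64 | tr -d '\n'
}

PASSWORD=$(generate_password)
export PASSWORD
echo "Generated a password with ${#PASSWORD} characters."
# Afterwards: code-server --bind-addr 0.0.0.0:8081 --auth password
</syntaxhighlight>

Print the value with <code>echo "$PASSWORD"</code> when you need to type it into the login page, and avoid writing it to files that other users can read.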

=== Connect to code-server ===
[[File:code-server-hk.png|thumb|Code-server running on GPU node.|400px]]

As soon as code-server is running, it can be accessed in the web browser. To establish the connection, create an SSH tunnel from your local computer to the remote server:

<syntaxhighlight lang="console">$ ssh -L 8081:<computeNodeID>:8081 <userID>@uc3.scc.kit.edu</syntaxhighlight>
You need to enter the <code>computeNodeID</code> of the node on which the interactive Slurm job is running. If you have started code-server on a login node, just enter <code>localhost</code>. Now you can open http://127.0.0.1:8081 in your web browser. Your browser may ask you to confirm opening an insecure (non-HTTPS) site. The login page looks as follows:

[[File:code-server-login.png|Code-server login page.|300px]]

Enter the password from <code>~/.config/code-server/config.yaml</code> or from the <code>PASSWORD</code> variable. After clicking the “Submit” button, the familiar VS Code interface will open in your browser.
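
The tunnel command from this section can be assembled from its parts, which makes it easier to swap the node or the port. This is a sketch only: the function name <code>tunnel_command</code>, the node name <code>node1234</code> and the user ID <code>ab1234</code> are placeholders.

<syntaxhighlight lang="bash">
#!/bin/bash
# Sketch: assemble the SSH tunnel command for a code-server session.
# Assumptions: node name and user ID below are placeholders.

tunnel_command() {
    local node=${1:-localhost} port=${2:-8081} user=$3 login=${4:-uc3.scc.kit.edu}
    echo "ssh -L ${port}:${node}:${port} ${user}@${login}"
}

# code-server on a compute node (hypothetical node name)
tunnel_command node1234 8081 ab1234
# code-server on the login node itself
tunnel_command localhost 8081 ab1234
</syntaxhighlight>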

=== End code-server session ===

If you want to temporarily log out from your code-server session, open the “Application Menu” in the left side bar and click on “Log out”. To '''terminate''' the code-server session, cancel it in the interactive Slurm job by pressing Ctrl+C.

== Connect to Remote Jupyter Kernel ==
To work with your Python scripts and notebooks in VS Code while using the resources of a compute node, you can create a batch job that launches JupyterLab and connect to it from VS Code. To do so, please follow the instructions below. Any parts of the scripts that might need adjustments are marked with the keyword "@param".

=== Simple Use Case ===
The most basic steps are to set a password for JupyterLab, start a job which runs JupyterLab, get the connection details from the output log and connect to it locally. The following instructions explain these steps and provide an additional script that replaces the manual step of looking into the output file.

# Load a python module and set a password on the cluster for JupyterLab:
#: <syntaxhighlight lang="bash">
module load devel/miniforge
jupyter notebook --generate-config
jupyter notebook password
</syntaxhighlight>
# Define a batch script to start a JupyterLab Job. Please adjust the first part according to your needs and your specific cluster.
#: <pre>~/jupyterlab.slurm</pre>
#: <syntaxhighlight lang="bash">
#!/bin/bash

#SBATCH --partition=cpu-single
#SBATCH --job-name=jupyterlab
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task 1
# @param: replace the placeholder with your e-mail address (sbatch does not expand shell aliases or variables in #SBATCH lines)
#SBATCH --mail-user=<yourEmailAddress>
#SBATCH --mail-type=ALL

# @param: change this to your preferred python or conda module
module load devel/miniforge

# @param: cluster address for ssh connection
hostAddress=helix.bwservices.uni-heidelberg.de

PORT=$(( ( RANDOM % 9999 ) + 1024 ))
# Look up the node of this job only, so other jobs named "jupyterlab" do not interfere
HOSTID=$(squeue -h -j "${SLURM_JOB_ID}" -o "%N")
echo "Job ${SLURM_JOB_ID} running on host ${HOSTID}."
echo "Connect"
echo "ssh -N -L ${PORT}:${HOSTID}:${PORT} ${USER}@${hostAddress}"

# jupyter lab blocks until the server stops, so the connection details are printed first
jupyter lab --no-browser --ip=0.0.0.0 --port=${PORT}
returned_code=$?
echo "> Script completed with exit code ${returned_code}"
exit ${returned_code}
</syntaxhighlight>
# Run a wrapper script to execute the batch script and extract needed information from the slurm output file. You could save it together with other utility scripts in a "bin" directory in your home folder.
#: <pre>./bin/run_jupyterlab_simple.sh</pre>
#: <syntaxhighlight lang="bash">
#!/bin/bash

# Define parameters
jobscript=~/jupyterlab.slurm
hostAddress=helix.bwservices.uni-heidelberg.de

# Run job
job_id=$(sbatch $jobscript | awk '{print $4}')
echo "jobid: $job_id"

# Outfile name
slurm_out=slurm-${job_id}.out

# Wait for output file
while [ ! -f "$slurm_out" ]; do
sleep 2;
done

# Wait until url is written in output file
while [ -z "${url}" ]; do
sleep 1;
url=$(grep -o 'http[^ ]*' $slurm_out | head -n 1);
done

# Extract hostID and port from output. The pattern assumes a node name with a length of 6 characters and a port with a length of 3, 4 or 5 numbers.
url_pattern="http://([a-z0-9]{6}):([0-9]{3,5})/lab"
if [[ $url =~ $url_pattern ]]; then
hostID=${BASH_REMATCH[1]}
port=${BASH_REMATCH[2]}
echo "To connect with the JupyterLab kernel, please enter the following into your local commandline: "
echo "ssh -N -L $port:$hostID:$port ${USER}@$hostAddress";
echo ""
echo "Note: It is normal that the ssh command doesn't end after providing the credentials. Ending the command would mean ending the local connection to the kernel."
echo ""
echo "Afterwards, you can use the URL"
echo " http://127.0.0.1:${port}/lab "
echo ""
echo "to:"
echo "- use the kernel in VSCode ('Existing Jupyter Server...', enter URL, enter password, confirm '127.0.0.1', choose kernel) or "
echo "- open JupyterLab in your browser with the URL"
else
echo "The needed information couldn't be found in the slurm output. Please contact your support unit if you need help with fixing this problem."
fi
# rm $slurm_out
</syntaxhighlight>
# Follow the instructions on the commandline to connect to the Jupyter kernel from your local machine or the Helix login node. More detailed instructions can be found below.
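
The regular expression used by the wrapper script can be tried in isolation to see what it extracts. The sketch below wraps it in a hypothetical function <code>parse_jupyter_url</code>; the sample URL is invented. If the node names on your cluster are not exactly 6 characters long, the pattern needs to be adjusted.

<syntaxhighlight lang="bash">
#!/bin/bash
# Sketch: how the wrapper's regular expression splits a JupyterLab URL.
# Assumption: the sample URL is invented; real node names may differ in length.

parse_jupyter_url() {
    local url=$1
    local url_pattern="http://([a-z0-9]{6}):([0-9]{3,5})/lab"
    if [[ $url =~ $url_pattern ]]; then
        # First capture group: node name, second capture group: port
        echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

parse_jupyter_url "http://abc123:8888/lab?token=0123abcd"
</syntaxhighlight>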

==== Connect to a running job ====

The job runs on a specific compute node and port. With this information, you can create an SSH connection to it. First, decide how you want to work with your Python code. The options are:

# The code is placed locally on your computer.
# The code is placed on the cluster and you've mounted the folder locally. (= The files on the cluster are accessible from within your local VS Code)
# The code is placed on the cluster and you work on the cluster via a remote connection in VS Code.

Depending on the use case, you need to execute the ssh command in a different place:

# Open VS Code on your computer.
# Open VS Code on your computer.
# Open VS Code on your computer and connect to the cluster.

Then open a terminal and execute the ssh command, which is given in the commandline output of the wrapper script. If the terminal isn't already open, go to menu item "Terminal" at the top of the window and choose "New Terminal" (or "new -> command prompt" on Windows).
It is normal that the command doesn't end after you've put in your credentials. Leave the terminal open and go on with the next step.

To use the Jupyter kernel that is running on the cluster node, you need to connect to it. This works like connecting to any other kernel:

# Open your code file.
# Click "Select Kernel" in the upper right corner.
# Choose "Existing Jupyter Server...".
# Enter the URL that was given by the wrapper script.
# Enter your JupyterLab password that you set in the first step of these instructions.
# Confirm the prefilled value "127.0.0.1" by pressing Enter.
# Choose one of the virtual environments that you've created on the cluster. You should see all python environments. To see the conda environments as well, you need to [[Helix/bwVisu/JupyterLab#Python_version | register them as ipykernel]] first.

=== Complex Use Case ===
If you have different use cases for JupyterLab, you could use a more flexible wrapper script, for example:

<pre>./bin/run_jupyterlab.sh</pre>

<syntaxhighlight lang="bash">
#!/bin/bash
# Starts a jupyter kernel on a node and provides information on how to connect to it locally.
# If you have only one use case and therefore need only one combination of slurm settings for your jupyter jobs, then you can use the simpler script.
# This script supports explorative analyses by allowing to overwrite parameters via commandline.
# Different job configurations can be defined in advance and then used with a given short name (cpu, gpu,...).

programname=$0
function help {
# Print the help text
echo ""
echo "Starts a jupyterlab kernel"
echo ""
echo "usage example: $programname --param_set cpu"
echo ""
echo " --param_set string name of the parameter set"
echo " (examples: cpu, gpu)"
echo " --jobscript string optional, path of batch script"
echo " (default: ~/jupyterlab.slurm)"
echo " --slurm_out string optional, name of slurm output file"
echo " (default: slurm-<job_id>.out)"
}

# These parameters are set later in the script. Providing them via commandline, overwrites their values set in the script.
jobscript=None
slurm_out=None

# Process parameters
while [ $# -gt 0 ]; do
if [[ $1 == "--help" ]]; then
help
exit 0
# when given -p as parameter, use its value for the variable param_set
elif [[ $1 == "-p" ]]; then
param_set="$2"
shift
elif [[ $1 == "--"* ]]; then
v="${1/--/}"
declare "$v"="$2"
shift
fi
shift
done

function define_param_set(){
# Define parameter sets for sbatch
# Define different sets
cpu=(--partition=cpu-single --mem=2gb)
gpu=(--partition=gpu-single --mem=3gb --gres=gpu:1)

param_set=${1}
param_set=$param_set[@]
param_set=("${!param_set}")

# Add params that are the same for all sets
param_set+=(--ntasks=1)
}

# @param: jobscript, name of the slurm batch script to execute
if [ "$jobscript" = "None" ]; then
jobscript=~/jupyterlab.slurm
fi

# @param: cluster address for ssh connection
hostAddress=helix.bwservices.uni-heidelberg.de

# Translate given param_set value to actual set of parameters
define_param_set $param_set
echo "param_set: ${param_set[*]}"

# Run job
job_id=$(sbatch ${param_set[@]} $jobscript | awk '{print $4}')
echo "jobid: $job_id"

# @param: slurm_out, the filename for the slurm output file
if [ "$slurm_out" = "None" ]; then
slurm_out=slurm-${job_id}.out
fi

# Wait for output file
while [ ! -f "$slurm_out" ]; do
sleep 1;
done

# Wait until url is written in output file
while [ -z "${url}" ]; do
sleep 1;
url=$(grep -o 'http[^ ]*' $slurm_out | head -n 1);
done

# Extract hostID and port from output.
url_pattern="http://([a-z0-9]{6}):([0-9]{3,5})/lab"
if [[ $url =~ $url_pattern ]]; then
hostID=${BASH_REMATCH[1]}
port=${BASH_REMATCH[2]}
echo "To connect with the JupyterLab kernel, please enter the following into your local commandline: "
echo "ssh -N -L $port:$hostID:$port ${USER}@$hostAddress";
fi

echo "Afterwards, you can either"
echo "- use the kernel in VSCode or "
echo "- open JupyterLab with this URL: "
echo " http://127.0.0.1:${port}/lab "
echo "Note: It is normal that the ssh command doesn't end after providing the credentials. Ending the command would mean ending the local connection to the kernel."
#rm $slurm_out
</syntaxhighlight>
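
The lookup of a parameter set by its short name relies on bash's indirect expansion, which is easy to misread. The following sketch isolates that step in a hypothetical function <code>resolve_param_set</code>; the two sets mirror the ones defined in the wrapper script and would need to be adapted to your cluster.

<syntaxhighlight lang="bash">
#!/bin/bash
# Sketch: resolve a short name to an array of sbatch options via indirect expansion.
# The sets mirror the ones in the wrapper script; adjust them to your cluster.

resolve_param_set() {
    local cpu=(--partition=cpu-single --mem=2gb)
    local gpu=(--partition=gpu-single --mem=3gb --gres=gpu:1)
    local ref="$1[@]"              # e.g. "cpu[@]"
    local resolved=("${!ref}")     # expands to the elements of the named array
    resolved+=(--ntasks=1)         # options shared by all sets
    echo "${resolved[*]}"
}

resolve_param_set cpu
resolve_param_set gpu
</syntaxhighlight>

A name without a matching array resolves to the shared options only, so a typo in the set name does not stop the job from being submitted. Checking the name before calling <code>sbatch</code> may therefore be worthwhile.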

BwUniCluster3.0/Containers

2025-09-17T12:39:59Z

P Schuhmacher: /* FAQ */

= Introduction =
To date, only a few container runtime environments integrate well with HPC environments due to security concerns and differing assumptions in some areas.

For example, native Docker environments require elevated privileges, which is not an option on shared HPC resources. Docker's "rootless mode" is currently not supported on our HPC systems either, because it lacks necessary features such as cgroups resource controls, security profiles and overlay networks, and because GPU passthrough is difficult. The required subuid (newuidmap) and subgid (newgidmap) settings may also cause security issues.

On bwUniCluster the container runtimes '''Enroot''' and '''Singularity/Apptainer''' are supported.

Further rootless container runtime environments (Podman, …) might be supported in the future, depending on how support for e.g. network interconnects, security features and HPC file systems develops.

= ENROOT =

Enroot enables you to run '''Docker containers''' on HPC systems. It is developed by NVIDIA. It is the '''recommended tool''' for using containers on bwUniCluster: it integrates well with GPU usage and has basically no impact on performance.
Enroot is available to all users by default.
[[File:docker_logo.svg|center|100px]]

== Usage ==

Excellent documentation is provided on [https://github.com/NVIDIA/enroot/blob/master/doc NVIDIA's GitHub page]. The documentation here is therefore limited to simple examples that introduce the essential functionality.

Using Docker containers with Enroot requires three steps:

* Importing an image
* Creating a container
* Starting a container

Optionally containers can also be exported and transferred.

=== Importing a container image ===

* <code>enroot import docker://alpine</code> This pulls the latest alpine image from Docker Hub (the default registry). You will obtain the file alpine.sqsh.

* <code>enroot import docker://nvcr.io#nvidia/pytorch:21.04-py3</code> This pulls the pytorch image version 21.04-py3 from [https://ngc.nvidia.com/catalog NVIDIA's NGC registry]. Please note that the NGC registry does not always contain the "latest" tag and instead requires the specification of a dedicated version. You will obtain the file nvidia+pytorch+21.04-py3.sqsh.

* <code>enroot import docker://registry.scc.kit.edu#myProject/myImage:latest</code> This pulls your latest image from the KIT registry. You will obtain the file myProject+myImage+latest.sqsh.
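
The file names in the first two examples follow a visible pattern: the registry is dropped, and "/" and ":" are replaced by "+". The sketch below reproduces this pattern in a hypothetical helper <code>sqsh_name</code>. It is inferred from the examples on this page, not taken from the Enroot sources, so treat the result as a guess and check the actual file name after the import.

<syntaxhighlight lang="bash">
#!/bin/bash
# Sketch: predict the default .sqsh file name for an image URI.
# Assumption: the rule is inferred from the examples on this page.

sqsh_name() {
    local uri=${1#docker://}   # drop the scheme
    uri=${uri#*\#}             # drop "<registry>#" if present
    uri=${uri//\//+}           # "/" -> "+"
    uri=${uri//:/+}            # ":" -> "+"
    echo "${uri}.sqsh"
}

sqsh_name docker://alpine
sqsh_name "docker://nvcr.io#nvidia/pytorch:21.04-py3"
</syntaxhighlight>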

=== Creating a container ===
Create a container named "nvidia+pytorch+21.04-py3" by unpacking the .sqsh-file.

<code>enroot create --name nvidia+pytorch+21.04-py3 nvidia+pytorch+21.04-py3.sqsh</code>

"Creating" a container means that the squashed container image is unpacked inside <code>$ENROOT_DATA_PATH/</code>. By default this variable points to <code>$HOME/.local/share/enroot/</code>.

=== Starting a container ===
* Start the container nvidia+pytorch+21.04-py3 in read-write mode (<code>--rw</code>) and run bash inside the container. <code>enroot start --rw nvidia+pytorch+21.04-py3 bash</code>

* Start container in <code>--rw</code>-mode and get root access (<code>--root</code>) inside the container. <code>enroot start --root --rw nvidia+pytorch+21.04-py3 bash</code> You can now install software with root privileges, depending on the containerized Linux distribution e.g. with <code>apt-get install … </code>, <code>apk add …</code>, <code>yum install …</code>, <code>pacman -S …</code>

* Start container and mount (<code>-m</code>) a local directory to <code>/work</code> inside the container. <code>enroot start -m <localDir>:/work --rw nvidia+pytorch+21.04-py3 bash</code>

* Start container, mount a directory and start the application <code>jupyter lab</code>. <code>enroot start -m <localDir>:/work --rw nvidia+pytorch+21.04-py3 jupyter lab</code>
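
The three steps (import, create, start) can be chained in a small script. The sketch below is an illustration under stated assumptions: the helper names <code>run</code> and <code>enroot_workflow</code> are hypothetical, and by default the script only prints the commands (<code>DRY_RUN=1</code>) so that it can be tried on a machine without Enroot.

<syntaxhighlight lang="bash">
#!/bin/bash
# Sketch: the three Enroot steps chained for one image.
# Assumption: DRY_RUN=1 (the default here) only prints the commands,
# so the sketch can be tried on a machine without Enroot.

DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

enroot_workflow() {
    local uri=$1 image_file=$2 name=$3
    run enroot import --output "$image_file" "$uri" &&
    run enroot create --name "$name" "$image_file" &&
    run enroot start --rw "$name" bash
}

enroot_workflow docker://alpine alpine.sqsh alpine
</syntaxhighlight>

Set <code>DRY_RUN=0</code> on the cluster to execute the commands instead of printing them.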

=== Exporting and transferring containers ===

If you intend to use Docker images which you built e.g. on your local desktop, and transfer them somewhere else, there are several possibilities to do so:

* <code>enroot import --output myImage.sqsh dockerd://myImage</code> Import an image from the locally running Docker daemon. Copy the .sqsh-file to bwUniCluster and unpack it with <code>enroot create</code>.

* <code>enroot export --output myImage.sqsh myImage</code> Export an existing enroot container. Copy the .sqsh-file to bwUniCluster and unpack it with <code>enroot create</code>.

* <code>enroot bundle --output myImage.run myImage.sqsh</code> Create a self extracting bundle from a container image. Copy the .run-file to bwUniCluster. You can run the self extracting image via ./myImage.run even if enroot is not installed!

=== Container management ===

You can list all containers on the system and additional information (<code>--fancy</code> parameter) with the <code>enroot list</code> command.

The unpacked images can be removed with the <code>enroot remove</code> command.

== SLURM Integration ==
Enroot allows you to run containerized applications non-interactively, including MPI- and multi-node parallelism. The necessary Slurm integration is realized via the [https://github.com/NVIDIA/pyxis Pyxis plugin].

=== Create Container via enroot ===

* <code>enroot import docker://ubuntu</code>
* <code>enroot create -n pyxis_ubuntu ubuntu.sqsh</code>
The <code>pyxis_</code> prefix in the container name is required for the Pyxis plugin to work.

=== Start via Slurm ===
Start existing Container:
*<code>salloc -p dev_single -t 00:10:00 --container-name=ubuntu --container-mounts=/etc/slurm/task_prolog:/etc/slurm/task_prolog,/scratch:/scratch,/usr/lib64/slurm:/usr/lib64/slurm,/usr/lib64/libhwloc.so:/usr/lib64/libhwloc.so,/usr/lib64/libhwloc.so.15:/usr/lib64/libhwloc.so.15</code>

Download and start Container via pyxis directly:
*<code>salloc -p dev_single -t 00:10:00 --container-image=ubuntu --container-name=ubuntu --container-mounts=/etc/slurm/task_prolog:/etc/slurm/task_prolog,/scratch:/scratch,/usr/lib64/slurm:/usr/lib64/slurm,/usr/lib64/libhwloc.so:/usr/lib64/libhwloc.so,/usr/lib64/libhwloc.so.15:/usr/lib64/libhwloc.so.15</code>
In this case an enroot container is created under <code>~/.local/share/enroot/</code>.

Note: The option <code>--container-mounts=/etc/slurm/task_prolog:/etc/slurm/task_prolog,/scratch:/scratch,/usr/lib64/slurm:/usr/lib64/slurm,/usr/lib64/libhwloc.so:/usr/lib64/libhwloc.so,/usr/lib64/libhwloc.so.15:/usr/lib64/libhwloc.so.15</code> is required for the plugin to work. The container name has to start with <code>pyxis_</code> for the plugin to find it; with the second method this prefix is added automatically. When specifying the container name in your Slurm job, omit the <code>pyxis_</code> prefix.
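Typing the long mount list by hand is error-prone. The following sketch assembles it from an array and strips the <code>pyxis_</code> prefix from a container name. The array and function names are made up for this example and are not part of enroot or Pyxis. The script only prints the resulting <code>salloc</code> line.

<source lang="bash">
#!/bin/bash
# Host paths that Pyxis needs inside the container (taken from the note above).
mounts=(
    /etc/slurm/task_prolog
    /scratch
    /usr/lib64/slurm
    /usr/lib64/libhwloc.so
    /usr/lib64/libhwloc.so.15
)

# Hypothetical helper: build the comma-separated "source:target" list,
# using the same path on the host and in the container.
container_mounts() {
    local list="" path
    for path in "${mounts[@]}"; do
        list+="${list:+,}${path}:${path}"
    done
    echo "$list"
}

# Hypothetical helper: enroot shows pyxis_ubuntu, Slurm expects ubuntu.
pyxis_name() {
    echo "${1#pyxis_}"
}

echo "salloc -p dev_single -t 00:10:00 --container-name=$(pyxis_name pyxis_ubuntu) --container-mounts=$(container_mounts)"
</source>

The printed line is the same command as in the "Start existing Container" example and can be pasted into the terminal.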

All options usable for Pyxis can be found via <code>srun --help</code> under "Options provided by plugins:".

Notable Options:
* <code>--container-mount-home</code> Mounts the home directory into the container
* <code>--container-writable</code> Makes the container filesystem writable (otherwise only the mounted home is writable)
* <code>--container-remap-root</code> Become root in your container. Allows installation of software, e.g. via apt (Ubuntu)

== FAQ ==

* ''How can I run JupyterLab in a container and connect to it?''
** Start an interactive session with or without GPUs. Notice the compute node ID the session is running on, and start a container with a running JupyterLab, e.g.: <code>salloc -p gpu_4 --time=01:00:00 --gres=gpu:1</code> <code>enroot start -m <localDir>:/work --rw nvidia+pytorch+21.04-py3 jupyter lab</code>
** Open a terminal on your desktop and create an SSH tunnel to the running JupyterLab instance on the compute node. Insert the ID of the node on which the interactive session is running: <code>ssh -L8888:<computeNodeID>:8888 <yourAccount>@uc3.scc.kit.edu</code>
** Open a web browser and open the URL [http://localhost:8888 localhost:8888]
** Enter the token, which is visible in the output of the first terminal. Copy the string behind the <code>token=</code> and paste it into the input field in the browser.

* ''Are GPUs accessible from within a running container?'' Yes. Unlike Docker, Enroot does not need further command line options to enable GPU passthrough like <code>--runtime=nvidia</code> or <code>--privileged</code>.

* ''Is there something like <code>enroot-compose</code>?'' Not to our knowledge. Enroot is mainly intended for HPC workloads, not for operating multi-container applications. However, starting and running these applications separately is possible.

* ''Can I use workspaces to store containers?'' Yes. You can define the location of configuration files and storage with environment variables. The <code>ENROOT_DATA_PATH</code> variable should be set accordingly. Please refer to [https://github.com/NVIDIA/enroot/blob/master/doc/configuration.md#runtime-configuration NVIDIA's documentation] on runtime configuration. Unfortunately, using Pyxis and images stored on workspaces requires an ugly hack, because setting <code>ENROOT_DATA_PATH</code> is ignored by Pyxis. The workaround consists in setting <code>XDG_DATA_HOME</code> to the workspace directory (cf. https://github.com/NVIDIA/pyxis/issues/46). For a workspace named ''enroottest'' this would be: <code>export XDG_DATA_HOME=$(ws_find enroottest)</code> and <code>export ENROOT_DATA_PATH=$(ws_find enroottest)/enroot</code>
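The two exports from the workspace answer can be wrapped in a small function so that a missing workspace is noticed early. This is a sketch under assumptions: the function name <code>use_enroot_storage</code> is invented for this example, and creating the <code>enroot</code> subdirectory in advance is a convenience, not a documented requirement.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: point enroot (ENROOT_DATA_PATH) and Pyxis (XDG_DATA_HOME)
# at the same storage directory, e.g. a workspace.
use_enroot_storage() {
    local base=$1
    # Refuse to continue if the directory is missing (e.g. ws_find returned nothing).
    if [ ! -d "$base" ]; then
        echo "storage directory '$base' does not exist" >&2
        return 1
    fi
    export XDG_DATA_HOME="$base"
    export ENROOT_DATA_PATH="$base/enroot"
    mkdir -p "$ENROOT_DATA_PATH"
}
</source>

For the workspace ''enroottest'' the call would be <code>use_enroot_storage "$(ws_find enroottest)"</code>.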

== Additional resources ==

Source code: [https://github.com/NVIDIA/enroot https://github.com/NVIDIA/enroot]

Documentation: [https://github.com/NVIDIA/enroot/blob/master/doc https://github.com/NVIDIA/enroot/blob/master/doc]

Additional information:
* [https://archive.fosdem.org/2020/schedule/event/containers_hpc_unprivileged/ FOSDEM 2020 talk] + [https://archive.fosdem.org/2020/schedule/event/containers_hpc_unprivileged/attachments/slides/3711/export/events/attachments/containers_hpc_unprivileged/slides/3711/containers_hpc_unprivileged.pdf slides]
* [https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf Slurm User Group Meeting 2019 talk]

= Singularity/Apptainer =
[[File:singularity_logo.svg|center|100px]]

== Usage ==

Excellent documentation is provided on the [https://sylabs.io/docs/ Documentation&Examples] page provided by Sylabs. This documentation here therefore confines itself to simple examples to get to know the essential functionalities.

Using Singularity/Apptainer usually involves two steps:

* Building a container image using <code>singularity build</code>

* Running a container image using <code>singularity run</code> or <code>singularity exec</code>

=== Building an image ===

* <code>singularity build ubuntu.sif library://ubuntu</code> This pulls the latest Ubuntu image from Singularity's [https://cloud.sylabs.io/library Container Library] and locally creates a container image file called ubuntu.sif.

* <code>singularity build alpine.sif docker://alpine</code> This pulls the latest alpine image from Dockerhub and locally creates a container image file called alpine.sif.

* <code>singularity build pytorch-21.04-p3.sif docker://nvcr.io/nvidia/pytorch:21.04-py3</code> This pulls the pytorch image with the tag 21.04-py3 from NVIDIA's NGC registry and locally creates a container image file called pytorch-21.04-p3.sif.

=== Running an image ===

* <code>singularity shell ubuntu.sif</code> Start a shell in the Ubuntu container.

* <code>singularity run alpine.sif</code> Start the container alpine.sif and run the default runscript provided by the image.

* <code>singularity exec alpine.sif /bin/ls</code> Start the container alpine.sif and run the /bin/ls command.

=== Container management ===

You can use the <code>singularity search</code> command to search for images on Singularity's [https://cloud.sylabs.io/library Container Library].

BwUniCluster3.0/Running Jobs/Slurm

2025-08-19T09:30:21Z

P Schuhmacher: /* BeeOND (BeeGFS On-Demand) */

{| style="background:#FFCCCC; width:100%; font-size:120%;"
| '''This page is work-in-progress''' 
We will revise all examples and describe gold standard ways to use the resources efficiently: OpenMP, MPI and hybrid using MPI+OpenMP
|}

= Slurm HPC Workload Manager =
== Specification ==
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
 
 
Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a single command or a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e., the '''batch job''', to a resource and workload manager. bwUniCluster 3.0 uses the workload manager Slurm, so every job is submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.
 
 

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jo/bs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

== Interactive job : salloc ==

If you want to run an interactive job, you can do so with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node and an interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc --partition=cpu --ntasks=1 --time=120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute node. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example; otherwise the job will be killed during runtime by the system.

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.

An interactive parallel application can run on one compute node or on many compute nodes. It usually needs a certain amount of memory per node (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes with 40 cores each can be allocated by the following command:

<pre>
$ salloc --partition=cpu --nodes=5 --ntasks-per-node=40 --time=01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 200 cores with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI programs, simply type mpirun <program_name>; your program will then run on all 200 cores. A very simple example for starting a parallel job:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

<div id="top"></div>

== Slurm Commands (excerpt) ==
Some of the most used Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=750px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job Submission : sbatch|sbatch]] || Submits a job and queues it in an input queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jo/bs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job or requested resources [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job Submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
 
=== sbatch Command Parameters ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script.
{| width=750px class="wikitable"
! colspan="3" | sbatch Options
|-
! Command line
! Script
! Purpose
|- style="vertical-align:top;"
| -t ''time'' or --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N ''count'' or --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n ''count'' or --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count (<= 28 and <= 40 resp.) of tasks per node. (Replaces the option ppn of MOAB.)
|-
|- style="vertical-align:top;"
| -c ''count'' or --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (Default value is 128000 and 96000 MB resp., i.e. you should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J ''name'' or --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added.
|-
|- style="vertical-align:top;"
| -A ''group-name'' or --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p ''queue-name'' or --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C ''LSDF'' or --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF Filesystems.
|-
|- style="vertical-align:top;"
| -C ''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)'' or --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND file system.
|-
|}
 

==== sbatch --partition ''queues'' ====
Queue classes define the maximum resources of each queue of the compute system, such as walltime, nodes and processes per node. Details can be found here:
* [[BwUniCluster3.0/Batch_Queues|bwUniCluster 3.0 queue settings]]
 

=== sbatch Examples ===
==== Serial Programs ====
To submit a serial job that runs the script '''job.sh''' and that requires 5000 MB of main memory and 10 minutes of wall clock time

a) execute:
<pre>
$ sbatch -p dev_cpu -n 1 -t 10:00 --mem=5000 job.sh
</pre>
or
b) add after the initial line of your script '''job.sh''' the lines (here with a high memory request):
<source lang="bash">
#SBATCH --ntasks=1
#SBATCH --time=10
#SBATCH --mem=180gb
#SBATCH --job-name=simple
</source>
and execute the modified script with the command line option ''--partition=highmem'':
<pre>
$ sbatch --partition=highmem job.sh
</pre>
Note that sbatch command line options override script options.
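The two variants above write the same limit in two ways: <code>-t 10:00</code> and <code>--time=10</code> both mean 10 minutes, because Slurm reads a bare number as minutes and <code>a:b</code> as minutes:seconds. The following sketch makes this explicit by converting the common time formats to seconds. The function <code>slurm_time_to_seconds</code> is a hypothetical helper written for this page and is not part of Slurm.

<source lang="bash">
#!/bin/bash
# Hypothetical helper (not part of Slurm): convert a Slurm time specification
# to seconds. Supported: minutes, minutes:seconds, hours:minutes:seconds
# and days-hours[:minutes[:seconds]].
slurm_time_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0
    local -a parts
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}
        IFS=: read -r hours minutes seconds <<< "${spec#*-}"
    else
        IFS=: read -r -a parts <<< "$spec"
        case ${#parts[@]} in
            1) minutes=${parts[0]} ;;
            2) minutes=${parts[0]}; seconds=${parts[1]} ;;
            3) hours=${parts[0]}; minutes=${parts[1]}; seconds=${parts[2]} ;;
            *) echo "unsupported time format: $spec" >&2; return 1 ;;
        esac
    fi
    # 10# forces base 10 so that values such as "08" are not read as octal.
    echo $(( 10#${days:-0} * 86400 + 10#${hours:-0} * 3600 + 10#${minutes:-0} * 60 + 10#${seconds:-0} ))
}

# Time limits as they appear in the examples on this page.
for t in 10 10:00 40:00 120 01:00:00; do
    echo "--time=$t -> $(slurm_time_to_seconds "$t") s"
done
</source>

Both <code>10</code> and <code>10:00</code> yield 600 seconds, while <code>01:00:00</code> yields 3600 seconds.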
 
 

==== Multithreaded Programs ====
Multithreaded programs operate faster than serial programs on CPUs with multiple cores. 
Moreover, multiple threads of one process share resources such as memory.
 
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''Hyperthreading is switched on on bwUniCluster 3.0. If you do not want to use it, set the option --threads-per-core to 1.'''
 
To submit a batch job called ''OpenMP_Test'' that runs a 96-fold threaded program ''omp_exe'' which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:
 
* generate the script '''job_omp.sh''' containing the following lines:
<source lang="bash">
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=96
#SBATCH --time=40:00
#SBATCH --threads-per-core=1
#SBATCH --mem=6000mb
#SBATCH --export=ALL,EXECUTABLE=./omp_exe
#SBATCH -J OpenMP_Test

#Usually you should set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

export OMP_NUM_THREADS=${SLURM_JOB_CPUS_PER_NODE}
echo "Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe
</source>
With the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If your program needs a modulefile to enable the OpenMP environment, load it in the script. Then execute the script '''job_omp.sh''', adding the queue class ''cpu'' as sbatch option:
<pre>
$ sbatch -p cpu job_omp.sh
</pre>
Note that sbatch command line options override script options, e.g.,
<pre>
$ sbatch --partition=cpu --mem=200 job_omp.sh
</pre>
overwrites the script setting of 6000 MByte with 200 MByte.
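The script above takes the thread count from <code>SLURM_JOB_CPUS_PER_NODE</code>. Slurm may write this variable in a compact form such as <code>96(x2)</code> when several nodes are allocated, and <code>SLURM_CPUS_PER_TASK</code> is only set when <code>--cpus-per-task</code> was requested. A more defensive way to derive <code>OMP_NUM_THREADS</code> could look like the following sketch. The function name is hypothetical, and the fallback order is one possible choice.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: derive the OpenMP thread count from the Slurm environment.
omp_threads_from_slurm() {
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        # Set by Slurm when --cpus-per-task was given.
        echo "$SLURM_CPUS_PER_TASK"
    elif [ -n "${SLURM_JOB_CPUS_PER_NODE:-}" ]; then
        # Keep only the first CPU count, e.g. "96(x2)" -> "96".
        echo "${SLURM_JOB_CPUS_PER_NODE%%[(,]*}"
    else
        # Outside of a Slurm job: fall back to a single thread.
        echo 1
    fi
}

export OMP_NUM_THREADS="$(omp_threads_from_slurm)"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
</source>

Outside of a job the snippet falls back to one thread, which matches the default mentioned above.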
 
 

==== MPI Parallel Programs ====
MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., '''MPI tasks''', run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
 
Multiple MPI tasks must be launched via '''mpirun''', e.g. 4 MPI tasks of ''my_par_program'':
<pre>
$ mpirun -n 4 my_par_program
</pre>
This command runs 4 MPI tasks of ''my_par_program'' on the node you are logged in to.
To run this command with a loaded Intel MPI, the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset first (<code>unset I_MPI_HYDRA_BOOTSTRAP</code>).

When an MPI parallel program runs in a batch job, the environment of the submitting shell, in particular the loaded modules, is also set in the batch job. If you want a defined module environment in your batch job, purge all modules before loading the ones you need.
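Purging and reloading can be done at the top of the job script. The following sketch wraps both steps in a function; the name <code>load_clean_modules</code> is made up for this example, and the modules to load are passed as arguments because the concrete module names depend on your software stack.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: start from an empty module environment,
# then load exactly the modules given as arguments.
load_clean_modules() {
    # The module command is only defined where the module system is installed.
    if ! type module >/dev/null 2>&1; then
        echo "module command not available" >&2
        return 1
    fi
    module purge
    local m
    for m in "$@"; do
        module load "$m"
    done
}
</source>

In a job script this would be called with the compiler and MPI modules of your choice, e.g. <code>load_clean_modules compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version> mpi/openmpi/<placeholder_for_mpi_version></code>.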
 
 
===== OpenMPI =====

If you want to run jobs on batch nodes, generate a wrapper script ''job_ompi.sh'' for '''OpenMPI''' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when using the module environment for OpenMPI
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/openmpi/<placeholder_for_mpi_version>
mpirun --bind-to core --map-by core -report-bindings my_par_program
</source>
'''Attention:''' Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames. Use '''ALWAYS''' the MPI options '''''--bind-to core''''' and '''''--map-by core|socket|node'''''. Please type ''mpirun --help'' for an explanation of the meaning of the different options of mpirun option ''--map-by''.
 
Considering 4 OpenMPI tasks on a single node, each requiring 2000 MByte, and running for 1 hour, execute:
<pre>
$ sbatch -p cpu -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh
</pre>
 

===== Intel MPI =====

Generate a wrapper script for '''Intel MPI''', ''job_impi.sh'' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when a defined module environment related to Intel MPI is wished
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/impi/<placeholder_for_version>
mpiexec.hydra -bootstrap slurm my_par_program
</source>
'''Attention:''' 
Do '''NOT''' add mpiexec.hydra options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpiexec.hydra about number of processes and node hostnames.
 
To launch 200 Intel MPI tasks on 5 nodes, each node requiring 80 GByte, with a run time of 5 hours, execute:
<pre>
$ sbatch --partition=cpu -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh
</pre>
 
If you want to use 128 or more nodes, you must also set the environment variable as follows: 
<code>export I_MPI_HYDRA_BRANCH_COUNT=-1</code>
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
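The 128-node rule above can be applied automatically inside the job script. The sketch below relies on the variable <code>SLURM_JOB_NUM_NODES</code>, which Slurm sets inside a job; the function name is hypothetical, and treating a missing variable as a single node is an assumption for runs outside of Slurm.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: set I_MPI_HYDRA_BRANCH_COUNT only for jobs
# with 128 or more nodes.
configure_hydra_branch_count() {
    # Outside of a Slurm job the variable is unset; assume one node then.
    local nodes="${SLURM_JOB_NUM_NODES:-1}"
    if [ "$nodes" -ge 128 ]; then
        export I_MPI_HYDRA_BRANCH_COUNT=-1
    fi
}

configure_hydra_branch_count
</source>

Call the function before <code>mpiexec.hydra</code> is started so that the variable is part of its environment.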
 
 

==== Multithreaded + MPI parallel Programs ====
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned over different nodes. '''Hyperthreading is switched on on bwUniCluster 3.0. If you do not want to use it, set the option --threads-per-core to 1.'''
 
 
===== OpenMPI with Multithreading =====
Multiple MPI tasks using '''OpenMPI''' must be launched by the MPI parallel program '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''For OpenMPI''', consider a job script ''job_ompi_omp.sh'' that runs the MPI program ''ompi_omp_program'' with 4 tasks and 28 threads per task. The program requires 3000 MByte of physical memory per thread (with 28 threads per MPI task this amounts to 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours. The script looks like:

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=28
#SBATCH --threads-per-core=1
#SBATCH --time=03:00:00
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Execute the script '''job_ompi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p cpu ./job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
* With the option ''--map-by node:PE=<value>'' (neighbored) MPI tasks will be attached to different nodes and each MPI task is bound to the first core of a node. <value> must be set to ${OMP_NUM_THREADS}.
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program.
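The numbers used in this example can be checked with shell arithmetic. The sketch below recomputes the memory per MPI task, rounds it up to whole GByte for <code>--mem</code> and derives the total core count. The variable names are chosen for this example only, and one MPI task per node is assumed, since <code>--mem</code> is a per-node value.

<source lang="bash">
#!/bin/bash
ntasks=4                 # MPI tasks (--ntasks)
threads=28               # OpenMP threads per task (--cpus-per-task)
mem_per_thread_mb=3000   # memory per thread in MByte

# Memory needed by one MPI task with all of its threads.
mem_per_task_mb=$(( threads * mem_per_thread_mb ))
# Round up to whole GByte (1 GByte = 1024 MByte) for --mem.
mem_per_task_gb=$(( (mem_per_task_mb + 1023) / 1024 ))
# Cores used by the whole job.
total_cores=$(( ntasks * threads ))

echo "memory per task: ${mem_per_task_mb} MB -> --mem=${mem_per_task_gb}gb"
echo "total cores: ${total_cores}"
</source>

This gives 84000 MByte per task, rounded up to 83 GByte, and 112 cores in total.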
 

===== Intel MPI with Multithreading =====
Multiple Intel MPI tasks must be launched by the MPI parallel program '''mpiexec.hydra'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

'''For Intel MPI''', consider a job script ''job_impi_omp.sh'' that runs the Intel MPI program ''impi_omp_program'' with 10 tasks and 96 threads per task. The program requires 96000 MByte of total physical memory per task and a total wall clock time of 1 hour. The script looks like:


<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=96
#SBATCH --threads-per-core=1
#SBATCH --time=60
#SBATCH --mem=96000
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program
#SBATCH --output="parprog_impi_omp_%j.out"

#If using more than one MPI task per node please set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,scatter prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

# Use when a defined module environment related to Intel MPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="-binding domain=omp:compact -print-rank-map -envall"
export NUM_PROCS=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}"
echo $startexe
exec $startexe
</source>
With the Intel compiler, the environment variable KMP_AFFINITY switches on the binding of threads to specific cores. If you only run one MPI task per node, please set KMP_AFFINITY=compact,1,0.
 
If you want to use 128 or more nodes, you must also set the environment variable as follows: 
export I_MPI_HYDRA_BRANCH_COUNT=-1
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
 
 
Execute the script '''job_impi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p cpu ./job_impi_omp.sh
</pre>
 
The option ''-print-rank-map'' shows the mapping of MPI tasks to nodes (not very beneficial). The option ''-binding'' binds MPI tasks (processes) to a particular processor; ''domain=omp'' means that the domain size is determined by the number of threads. If you choose 2 MPI tasks per node, you should use ''-binding "cell=unit;map=bunch"''; this binding maps one MPI process to each socket.
 
 

==== Chain jobs ====
The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem by a job chain. A job chain is a sequence of jobs where each job automatically starts its successor.
<source lang="bash">
#!/bin/bash
####################################
## simple Slurm submitter script to setup ##
## a chain of jobs using Slurm ##
####################################
## ver. : 2018-11-27, KIT, SCC

## Define maximum number of jobs via positional parameter 1, default is 5
max_nojob=${1:-5}

## Define your jobscript (e.g. "~/chain_job.sh")
chain_link_job=${PWD}/chain_job.sh

## Define type of dependency via positional parameter 2, default is 'afterok'
dep_type="${2:-afterok}"
## -> List of all dependencies:
## https://slurm.schedmd.com/sbatch.html

myloop_counter=1
## Submit loop
while [ ${myloop_counter} -le ${max_nojob} ] ; do
##
## Differ slurm_opt depending on chain link number
if [ ${myloop_counter} -eq 1 ] ; then
slurm_opt=""
else
slurm_opt="-d ${dep_type}:${jobID}"
fi
##
## Print current iteration number and sbatch command
echo "Chain job iteration = ${myloop_counter}"
echo " sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}"
## Store job ID for next iteration: sed strips the text "Submitted batch job " from the sbatch output
jobID=$(sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2>&1 | sed 's/[S,a-z]* //g')
##
## Check if an error occurred (a successful submission leaves a purely numeric job ID)
if ! [[ "${jobID}" =~ ^[0-9]+$ ]] ; then
echo " -> submission failed!" ; exit 1
else
echo " -> job number = ${jobID}"
fi
##
## Increase counter
let myloop_counter+=1
done
</source>
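The submitter script expects a job script ''chain_job.sh'', which is not shown here. The following is only a minimal sketch of such a chain link. It assumes a hypothetical application ''./my_app'' that can restart from a checkpoint file; the file naming scheme ''checkpoint_<n>.dat'' and the options ''--restart'' and ''--checkpoint'' are illustrative assumptions and have to be adapted to your own program.

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=60

# myloop_counter is exported by the submitter script; default to 1 for a standalone run
myloop_counter=${myloop_counter:-1}

# Name of the checkpoint written by the previous chain link (hypothetical naming scheme)
previous_checkpoint() {
    local link=$1
    if [ "${link}" -le 1 ]; then
        echo ""                              # first link: nothing to restart from
    else
        echo "checkpoint_$((link - 1)).dat"
    fi
}

restart_file=$(previous_checkpoint "${myloop_counter}")
echo "Chain link ${myloop_counter}: restart file = '${restart_file:-none}'"

# Hypothetical application call; replace with your own program and options
# ./my_app --restart "${restart_file}" --checkpoint "checkpoint_${myloop_counter}.dat"
</source>

With the default dependency type ''afterok'', a chain link only starts if its predecessor finished with exit code 0, so the checkpoint of the previous link should exist when the next one runs.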
 

==== GPU jobs ====

The nodes in the gpu_h100, gpu_mi300, gpu_a100_il and gpu_h100_il queues have 4 NVIDIA Ampere A100 GPUs or 4 NVIDIA Hopper H100 GPUs. Submitting a job to these queues does not by itself allocate any GPUs; you have to request them with the "--gres=gpu" parameter and specify how many GPUs your job needs, e.g. "--gres=gpu:2" requests two GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) add after the initial line of your script job.sh the line including the
information about the GPU usage: #SBATCH --gres=gpu:2
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh
</pre>
 
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Fri Apr 4 09:51:29 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 On | 00000000:06:00.0 Off | 0 |
| N/A 45C P0 70W / 415W | 27MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 On | 00000000:26:00.0 Off | 0 |
| N/A 45C P0 69W / 415W | 27MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3569 G /usr/libexec/Xorg 17MiB |
| 1 N/A N/A 3569 G /usr/libexec/Xorg 17MiB |
+-----------------------------------------------------------------------------------------+
</pre>
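Inside a job script the number of allocated GPUs can also be checked without ''nvidia-smi''. The sketch below assumes that Slurm exports the variable CUDA_VISIBLE_DEVICES as a comma-separated list of device indices for jobs submitted with "--gres=gpu"; this is the usual behaviour, but it depends on the cluster configuration. The function name ''count_gpus'' is made up for this example.

<source lang="bash">
#!/bin/bash
# Count the GPUs in a comma-separated device list such as "0,1"
count_gpus() {
    local devices=$1
    if [ -z "${devices}" ]; then
        echo 0
        return
    fi
    local IFS=','
    set -- ${devices}     # split the list at commas into positional parameters
    echo $#
}

echo "GPUs visible to this job: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
</source>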

 
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
Please run Open MPI's mpirun using the following command:
<pre>
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
</pre>
or by disabling the (older) Byte Transfer Layer (BTL) communication layer altogether:
<pre>
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
</pre>

 
 

==== LSDF Online Storage ====
On bwUniCluster 3.0 you can use the LSDF Online Storage on the HPC cluster nodes for special cases. Please request this service separately ([https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request]).
To mount the LSDF Online Storage on the compute nodes during the job runtime,
the constraint flag "LSDF" has to be set.

a) add after the initial line of your script job.sh the line including the
information about the LSDF Online Storage usage: #SBATCH --constraint=LSDF
<pre>
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=120
#SBATCH --mem=200
#SBATCH --constraint=LSDF
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh
</pre>
 
For the usage of the LSDF Online Storage
the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
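A job script can check that the storage is really available before it starts reading or writing data. This is a minimal sketch; it assumes that $LSDF points to a directory when the mount succeeded, and the function name ''lsdf_available'' is made up for this example.

<source lang="bash">
#!/bin/bash
# Succeeds only if $LSDF is set and refers to an existing directory
lsdf_available() {
    [ -n "${LSDF:-}" ] && [ -d "${LSDF}" ]
}

if lsdf_available; then
    echo "LSDF Online Storage is mounted at ${LSDF}"
else
    echo "LSDF Online Storage is not available; was --constraint=LSDF set?" >&2
fi
</source>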
 
 

==== BeeOND (BeeGFS On-Demand) ====

BeeOND instances are integrated into the prolog and epilog scripts of the cluster batch system Slurm. They can be used on exclusively allocated compute nodes during the job runtime with the constraint flag "BEEOND", "BEEOND_4MDS" or "BEEOND_MAXMDS" ([[BwUniCluster_2.0_Slurm_common_Features#sbatch_Command_Parameters|Slurm Command Parameters]]).
* BEEOND: one metadata server is started on the first node
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.
* BEEOND_MAXMDS: a metadata server for the on-demand file system is started on every node of your job

BEEOND will only work if the node is allocated exclusively, meaning that no other jobs are running on the same node. To achieve this, use the batch option "--exclusive". This is particularly important in shared partitions.

As a starting point we recommend using the "BEEOND" option. If you are unsure whether this is sufficient for you, feel free to contact the support team.
<source lang="bash">
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=BEEOND # or BEEOND_4MDS or BEEOND_MAXMDS
#SBATCH --exclusive
</source>

After your job has started you can find the private on-demand file system in '''/mnt/odfs/${SLURM_JOB_ID}''' directory. The mountpoint comes with five pre-configured directories:
<source lang="bash">
# For small files (stripe count = 1)
/mnt/odfs/${SLURM_JOB_ID}/stripe_1
# Stripe count = 4
/mnt/odfs/${SLURM_JOB_ID}/stripe_default
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_4
# Stripe count = 8, 16 or 32, use these directories for medium sized and large files or when using MPI-IO
/mnt/odfs/${SLURM_JOB_ID}/stripe_8
/mnt/odfs/${SLURM_JOB_ID}/stripe_16
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_32
</source>

If you request fewer nodes than the stripe count, the stripe count will be the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 has a stripe count of only 8.
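The rule for the effective stripe count is simply the minimum of the directory's stripe count and the number of nodes. A small sketch of this rule (the function name ''effective_stripe_count'' is made up for this example):

<source lang="bash">
#!/bin/bash
# Effective stripe count = min(stripe count of the directory, number of nodes)
effective_stripe_count() {
    local requested=$1 nodes=$2
    if [ "${nodes}" -lt "${requested}" ]; then
        echo "${nodes}"
    else
        echo "${requested}"
    fi
}

echo "stripe_16 on 8 nodes -> stripe count $(effective_stripe_count 16 8)"
</source>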

; '''Attention:''' 
:Be careful when creating large files: always use the directory with the maximum stripe count for large files.
:If you create large files, use a higher stripe count. For example, if your largest file is 1.1 Tbyte, then you have to use a stripe count larger than 2,
:otherwise the available disk space is exceeded.

The capacity of the private file system depends on the number of nodes. For each node you get 750 Gbyte.
If you request 100 nodes for your job, the private file system is 100 * 750 Gbyte ~ 75 Tbyte (approx) capacity.
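The capacity estimate can be reproduced inside a job script from the number of allocated nodes. This sketch uses the value of 750 Gbyte per node stated above; the function name ''beeond_capacity_gbyte'' is made up for this example, and the default of 100 nodes only applies when the script is run outside of a job.

<source lang="bash">
#!/bin/bash
# Approximate capacity of the private file system: 750 Gbyte per node
beeond_capacity_gbyte() {
    echo $(( $1 * 750 ))
}

nodes=${SLURM_JOB_NUM_NODES:-100}
echo "${nodes} nodes -> approx. $(beeond_capacity_gbyte "${nodes}") Gbyte"
</source>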

== Start time of job or resources : squeue --start ==
The command squeue --start can be used by any user to display the estimated start time of a job. The estimate is based on historical usage, the earliest available reservable resources, and the priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by '''any user'''.
 
 

== List of your submitted jobs : squeue ==
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by any user.
 
 

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -l, --long
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.
|}
 

=== Examples ===
''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
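If you have many jobs, a per-state summary can be easier to read than the full list. The following sketch counts job states read from standard input; on the cluster it could be fed by ''squeue -h -o "%T"'' (no header, extended job state only). The function name ''count_job_states'' is made up for this example.

<source lang="bash">
#!/bin/bash
# Summarise job states read from standard input, one state per line
count_job_states() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# On the cluster: squeue -h -o "%T" | count_job_states
# Offline demonstration with the states from the example above:
printf 'RUNNING\nPENDING\nRUNNING\n' | count_job_states
</source>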
 

== Shows free resources : sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
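The output can be filtered to find partitions where a job could start immediately. This is only a sketch: it assumes that every line has the form "Partition <name> : <count> nodes idle" with whitespace around the colon, as in the example above, and the function name ''partitions_with_idle_nodes'' is made up for this example.

<source lang="bash">
#!/bin/bash
# Print the names of partitions that report at least one idle node
partitions_with_idle_nodes() {
    awk '$1 == "Partition" && $4 + 0 > 0 { print $2 }'
}

# On the cluster: sinfo_t_idle | partitions_with_idle_nodes
# Offline demonstration with two lines from the example above:
printf 'Partition dev_cpu : 2 nodes idle\nPartition cpu_il : 0 nodes idle\n' | partitions_with_idle_nodes
</source>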
 

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 
=== Access ===
* End users can use scontrol show job to view the status of their '''own jobs''' only.
 

=== Arguments ===
{| width=750px class="wikitable"
|-
! Option !! Default !! Description !! Example
|- style="vertical-align:top;"
|- style="width:12%;"
| -d
| (n/a)
| Detailed mode
| Example: Display the state with jobid 18089884 in detailed mode. <pre>scontrol -d show job 18089884</pre>
|}
 
 

=== Scontrol show job Example ===
Here is an example from bwUniCluster 3.0.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
* Which state is the job in?
<pre>$ scontrol show job 1262 | grep -i State
JobState=RUNNING Reason=None Dependency=(null)
</pre>
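To use a single value in a script, the "key=value" pairs of the output can be split into separate lines first. A minimal sketch (the function name ''job_field'' is made up for this example); note that it only works for values that contain no spaces:

<source lang="bash">
#!/bin/bash
# Extract the value of one key from "scontrol show job" output read on standard input
job_field() {
    local key=$1
    tr ' ' '\n' | grep "^${key}=" | head -n 1 | cut -d= -f2-
}

# On the cluster: scontrol show job 1262 | job_field JobState
# Offline demonstration with the line shown above:
echo "JobState=RUNNING Reason=None Dependency=(null)" | job_field Reason
</source>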
 

== Cancel Slurm Jobs ==
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
 
 
=== Canceling own jobs : scancel ===
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

 
{| class="wikitable"
! Flag !! Default !! Description !! Example
|- style="vertical-align:top;"
| -i, --interactive
| (n/a)
| Interactive mode.
| Cancel the job 987654 interactively. <pre> scancel -i 987654 </pre>
|-
| -t, --state
| (n/a)
| Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
| Cancel all jobs in state "PENDING". <pre> scancel -t "PENDING" </pre>
|}
 

= Resource Managers =
=== Batch Job (Slurm) Variables ===
The following environment variables of Slurm are added to your environment once your job has started
(only an excerpt of the most important ones).
{| width=750px class="wikitable"
! Environment !! Brief explanation
|-
| SLURM_JOB_CPUS_PER_NODE
| Number of CPUs per node available to the job
|-
| SLURM_JOB_NODELIST
| List of nodes dedicated to the job
|-
| SLURM_JOB_NUM_NODES
| Number of nodes dedicated to the job
|-
| SLURM_MEM_PER_NODE
| Memory per node dedicated to the job
|-
| SLURM_NPROCS
| Total number of processes dedicated to the job
|-
| SLURM_CLUSTER_NAME
| Name of the cluster executing the job
|-
| SLURM_CPUS_PER_TASK
| Number of CPUs requested per task
|-
| SLURM_JOB_ACCOUNT
| Account name
|-
| SLURM_JOB_ID
| Job ID
|-
| SLURM_JOB_NAME
| Job Name
|-
| SLURM_JOB_PARTITION
| Partition/queue running the job
|-
| SLURM_JOB_UID
| User ID of the job's owner
|-
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which sbatch was invoked.
|-
| SLURM_JOB_USER
| User name of the job's owner
|-
| SLURM_RESTART_COUNT
| Number of times job has restarted
|-
| SLURM_PROCID
| Task ID (MPI rank)
|-
| SLURM_NTASKS
| The total number of tasks available for the job
|-
| SLURM_STEP_ID
| Job step ID
|-
| SLURM_STEP_NUM_TASKS
| Task count (number of MPI ranks)
|-
| SLURM_JOB_CONSTRAINT
| Job constraints
|}
See also:
* [https://slurm.schedmd.com/sbatch.html#SECTION_INPUT-ENVIRONMENT-VARIABLES Slurm input and output environment variables]
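
Printing a few of these variables at the start of a job script makes the job output easier to interpret later. A minimal sketch (the function name ''job_summary'' is made up for this example; the fallback "n/a" is only used when the script runs outside of a Slurm job):

<source lang="bash">
#!/bin/bash
# Print a one-line summary of the job environment
job_summary() {
    echo "job=${SLURM_JOB_ID:-n/a} name=${SLURM_JOB_NAME:-n/a} partition=${SLURM_JOB_PARTITION:-n/a} nodes=${SLURM_JOB_NUM_NODES:-n/a} tasks=${SLURM_NTASKS:-n/a}"
}

job_summary
</source>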
 

=== Job Exit Codes ===
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
 
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
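
The mapping into the range 0 - 255 corresponds to keeping only the lowest 8 bits of the value. The following sketch illustrates this with shell arithmetic; the function name ''to_exit_status'' is made up for this example.

<source lang="bash">
#!/bin/bash
# Keep only the lowest 8 bits of a value, as done for exit codes
to_exit_status() {
    echo $(( $1 & 255 ))
}

echo "-1  -> $(to_exit_status -1)"
echo "256 -> $(to_exit_status 256)"
</source>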
 
 
==== Displaying Exit Codes and Signals ====
SLURM displays a job's exit code in the output of the '''scontrol show job''' and the sview utility.
 
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon(:).
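
The two parts of such a value, for example the "0:0" in the field ExitCode=0:0 of the ''scontrol show job'' output, can be separated with shell parameter expansion. A minimal sketch (the function name ''parse_exit_code'' is made up for this example):

<source lang="bash">
#!/bin/bash
# Split a value of the form "<exit code>:<signal>" into its two parts
parse_exit_code() {
    local value=$1
    echo "code=${value%%:*} signal=${value##*:}"
}

parse_exit_code "0:0"
</source>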
 
 
==== Submitting Termination Signal ====
Here is an example of how to 'save' a Slurm termination signal in a typical jobscript.
<source lang="bash">
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
exit_code=$?
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
</source>
* Do not use ''''time'''' mpirun! The exit code will be the one submitted by the first (time) program.
* You do not need an '''exit $exit_code''' in the scripts.
 
 
----
[[#top|Back to top]]

BwUniCluster3.0/Running Jobs/Slurm

2025-08-19T09:29:54Z

P Schuhmacher: /* BeeOND (BeeGFS On-Demand) */

{| style="background:#FFCCCC; width:100%; font-size:120%;"
| '''This page is work-in-progress''' 
We will revise all examples and describe gold standard ways to use the resources efficiently: OpenMP, MPI and hybrid using MPI+OpenMP
|}

= Slurm HPC Workload Manager =
== Specification ==
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
 
 
Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define calculations as a sequence of commands or single command together with required run time, number of CPU cores and main memory and submit all, i.e., the '''batch job''', to a resource and workload managing software. bwUniCluster 3.0 has installed the workload managing software Slurm. Therefore any job submission by the user is to be executed by commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.
 
 

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue --start|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when the batch job actually starts depends on the availability of the requested resources and the fair sharing value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
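
Several of these options are typically combined in the header of a job script. The following is only an illustrative sketch: the job name, the resource values and the commented-out program ''./my_program'' are placeholders, and the partition ''cpu'' is taken from the examples on this page.

<source lang="bash">
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:30:00
#SBATCH --output=example_%j.out
#SBATCH --error=example_%j.err
#SBATCH --mail-type=END,FAIL

# %j in the file names above is replaced by the job ID
msg="Job ${SLURM_JOB_ID:-n/a} runs ${SLURM_NTASKS:-n/a} tasks in partition ${SLURM_JOB_PARTITION:-n/a}"
echo "${msg}"

# ./my_program   # hypothetical executable, replace with your own
</source>

Options given on the '''sbatch''' command line take precedence over the corresponding #SBATCH lines in the script.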
 

== Interactive job : salloc ==

If you want to run an interactive job, you can do so via the command salloc on a login node. Considering a serial application running on a compute node that requires 5000 MByte of memory and limiting the interactive run to 2 hours the following command has to be executed:

<pre>
$ salloc --partition=cpu --ntasks=1 --time=120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute node. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. You can start it by the command:

<pre>
$ xterm
</pre>

Note that, once the walltime limit has been reached the resources - i.e. the compute node - will automatically be revoked.

An interactive parallel application running on one or many compute nodes (here e.g. 5 nodes) with 40 cores each usually requires a certain amount of memory in GByte (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated by the following command:

<pre>
$ salloc --partition=cpu --nodes=5 --ntasks-per-node=40 --time=01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 200 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 200 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

<div id="top"></div>
= Slurm HPC Workload Manager =
== Specification ==
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
 
 
Any kind of calculation on the compute nodes of [[bwUniCluster3.0|bwUniCluster 3.0]] requires the user to define calculations as a sequence of commands or single command together with required run time, number of CPU cores and main memory and submit all, i.e., the '''batch job''', to a resource and workload managing software. bwUniCluster 3.0 has installed the workload managing software Slurm. Therefore any job submission by the user is to be executed by commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.
 
 

== Slurm Commands (excerpt) ==
Some of the most used Slurm commands for non-administrators working on bwUniCluster 2.0.
{| width=750px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job Submission : sbatch|sbatch]] || Submits a job and queues it in an input queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue --start|squeue --start]] || Returns start time of submitted job or requested resources [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job (obsoleted!) [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job Submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when the batch job actually starts depends on the availability of the requested resources and the fair sharing value.
 
 
=== sbatch Command Parameters ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script.
{| width=750px class="wikitable"
! colspan="3" | sbatch Options
|-
! Command line
! Script
! Purpose
|- style="vertical-align:top;"
| -t ''time'' or --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N ''count'' or --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n ''count'' or --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count (<= 28 and <= 40 resp.) of tasks per node. (Replaces the option ppn of MOAB.)
|-
|- style="vertical-align:top;"
| -c ''count'' or --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (Default value is 128000 and 96000 MB resp., i.e. you should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J ''name'' or --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added.
|-
|- style="vertical-align:top;"
| -A ''group-name'' or --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p ''queue-name'' or --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C ''LSDF'' or --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF Filesystems.
|-
|- style="vertical-align:top;"
| -C ''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)'' or --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND file system.
|-
|}
 

==== sbatch --partition ''queues'' ====
Queue classes define the maximum resources of each queue of the compute system, such as walltime, nodes and processes per node. Details can be found here:
* [[BwUniCluster3.0/Batch_Queues|bwUniCluster 3.0 queue settings]]
 

=== sbatch Examples ===
==== Serial Programs ====
To submit a serial job that runs the script '''job.sh''' and requires 5000 MB of main memory and 10 minutes of wall clock time:

a) execute:
<pre>
$ sbatch -p dev_cpu -n 1 -t 10:00 --mem=5000 job.sh
</pre>
or
b) add after the initial line of your script '''job.sh''' the lines (here with a high memory request):
<source lang="bash">
#SBATCH --ntasks=1
#SBATCH --time=10
#SBATCH --mem=180gb
#SBATCH --job-name=simple
</source>
and execute the modified script with the command line option ''--partition=highmem'':
<pre>
$ sbatch --partition=highmem job.sh
</pre>
Note that sbatch command line options overrule script options.
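
The two examples above express the same 10 minute limit in two notations, ''-t 10:00'' (minutes:seconds) and ''--time=10'' (minutes). The following sketch converts such time specifications to seconds so that they can be compared. The function name ''slurm_time_to_seconds'' is hypothetical and not part of Slurm; the accepted formats are the ones listed in ''man sbatch''.
<source lang="bash">
#!/bin/bash
# Hypothetical helper (not a Slurm command): convert an sbatch --time value to seconds.
# Accepted formats (see "man sbatch"): minutes, minutes:seconds, hours:minutes:seconds,
# days-hours, days-hours:minutes, days-hours:minutes:seconds
slurm_time_to_seconds() {
    local spec="$1" days=0 h=0 m=0 s=0
    if [[ "$spec" == *-* ]]; then
        days="${spec%%-*}"
        # After "days-", the fields are hours[:minutes[:seconds]]
        IFS=: read -r h m s <<< "${spec#*-}"
    else
        local a b c
        IFS=: read -r a b c <<< "$spec"
        if [[ -n "$c" ]]; then h="$a"; m="$b"; s="$c"   # hours:minutes:seconds
        elif [[ -n "$b" ]]; then m="$a"; s="$b"         # minutes:seconds
        else m="$a"                                     # minutes
        fi
    fi
    # 10# forces base 10 so that values such as "08" are not read as octal
    echo $(( 10#${days:-0} * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}

slurm_time_to_seconds 10        # minutes
slurm_time_to_seconds 10:00     # minutes:seconds
slurm_time_to_seconds 01:00:00  # hours:minutes:seconds
slurm_time_to_seconds 1-12      # days-hours
</source>
The first two calls print the same value, which shows that both notations request the same wall clock time.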
 
 

==== Multithreaded Programs ====
Multithreaded programs operate faster than serial programs on CPUs with multiple cores. 
Moreover, multiple threads of one process share resources such as memory.
 
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

'''Hyperthreading is switched on on bwUniCluster 3.0. If you do not want to use it, the option --threads-per-core must be set to 1.'''
 
To submit a batch job called ''OpenMP_Test'' that runs a 96-fold threaded program ''omp_exe'' which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:
 
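a) execute (the option ''--wrap'' wraps the command in a simple shell script, because sbatch expects a script rather than a binary):
<pre>
$ sbatch -p cpu --export=ALL,OMP_NUM_THREADS=96 -J OpenMP_Test -N 1 -c 96 --threads-per-core=1 -t 40 --mem=6000 --wrap="./omp_exe"
</pre>
or
b) generate the script '''job_omp.sh''' containing the following lines: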
<source lang="bash">
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=96
#SBATCH --time=40:00
#SBATCH --threads-per-core=1
#SBATCH --mem=6000mb
#SBATCH --export=ALL,EXECUTABLE=./omp_exe
#SBATCH -J OpenMP_Test

#Usually you should set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

export OMP_NUM_THREADS=${SLURM_JOB_CPUS_PER_NODE}
echo "Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe
</source>
With the Intel compiler, the environment variable KMP_AFFINITY switches on the binding of threads to specific cores. If necessary, load the modulefile required for your OpenMP environment in the script. Then execute the script '''job_omp.sh''', adding the queue class ''cpu'' as an sbatch option:
<pre>
$ sbatch -p cpu job_omp.sh
</pre>
Note that sbatch command line options overrule script options, e.g.,
<pre>
$ sbatch --partition=cpu --mem=200 job_omp.sh
</pre>
overrides the script setting of 6000 MByte with 200 MByte.
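
The job script above takes the thread count from a Slurm variable. As a minimal sketch, the thread count can also be chosen defensively, so that the same script falls back to the OpenMP default of one thread when it is run outside of a job. The function name ''pick_omp_threads'' is hypothetical; SLURM_CPUS_PER_TASK is only set by Slurm if ''--cpus-per-task'' was specified.
<source lang="bash">
#!/bin/bash
# Hypothetical helper: choose the OpenMP thread count for the current environment.
# Inside a job submitted with --cpus-per-task, Slurm sets SLURM_CPUS_PER_TASK;
# otherwise the variable is unset and we fall back to 1 thread.
pick_omp_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(pick_omp_threads)"
echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
</source>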
 
 

==== MPI Parallel Programs ====
MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., '''MPI tasks''', run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
 
Multiple MPI tasks must be launched via '''mpirun''', e.g. 4 MPI tasks of ''my_par_program'':
<pre>
$ mpirun -n 4 my_par_program
</pre>
This command runs 4 MPI tasks of ''my_par_program'' on the node you are logged in to.
To run this command with a loaded Intel MPI module, the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset (''$ unset I_MPI_HYDRA_BOOTSTRAP'').

When an MPI parallel program runs in a batch job, the interactive environment, particularly the loaded modules, is also set in the batch job. If you want a defined module environment in your batch job, you have to purge all modules before loading the desired modules.
 
 
===== OpenMPI =====

If you want to run jobs on batch nodes, generate a wrapper script ''job_ompi.sh'' for '''OpenMPI''' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when using the module environment for OpenMPI
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/openmpi/<placeholder_for_mpi_version>
mpirun --bind-to core --map-by core -report-bindings my_par_program
</source>
'''Attention:''' Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames. Use '''ALWAYS''' the MPI options '''''--bind-to core''''' and '''''--map-by core|socket|node'''''. Please type ''mpirun --help'' for an explanation of the meaning of the different options of mpirun option ''--map-by''.
 
Considering 4 OpenMPI tasks on a single node, each requiring 2000 MByte, and running for 1 hour, execute:
<pre>
$ sbatch -p cpu -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh
</pre>
 

===== Intel MPI =====

Generate a wrapper script for '''Intel MPI''', ''job_impi.sh'' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when a defined module environment related to Intel MPI is wished
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/impi/<placeholder_for_version>
mpiexec.hydra -bootstrap slurm my_par_program
</source>
'''Attention:''' 
Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.
 
To launch 200 Intel MPI tasks on 5 nodes, each node requiring 80 GByte, with a run time of 5 hours, execute:
<pre>
$ sbatch --partition=cpu -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh
</pre>
 
If you want to use 128 or more nodes, you must also set the environment variable as follows: 
export I_MPI_HYDRA_BRANCH_COUNT=-1
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
 
 

==== Multithreaded + MPI parallel Programs ====
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned over different nodes. '''Hyperthreading is switched on on bwUniCluster 3.0. If you do not want to use it, the option --threads-per-core must be set to 1.'''
 
 
===== OpenMPI with Multithreading =====
Multiple MPI tasks using '''OpenMPI''' must be launched by the MPI parallel program '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

'''For OpenMPI''', the following job script ''job_ompi_omp.sh'' submits a batch job that runs the MPI program ''ompi_omp_program'' with 4 tasks and 28 threads per task. It requires 3000 MByte of physical memory per thread (with 28 threads per MPI task this gives 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours:

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=28
#SBATCH --threads-per-core=1
#SBATCH --time=03:00:00
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Execute the script '''job_ompi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p cpu ./job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
* With the option ''--map-by node:PE=<value>'' (neighbored) MPI tasks will be attached to different nodes and each MPI task is bound to the first core of a node. <value> must be set to ${OMP_NUM_THREADS}.
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program.
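
The sizing of the hybrid job above (total cores, memory per MPI task, conversion from MByte to GByte) is plain integer arithmetic. A minimal sketch, with hypothetical function names and assuming 1 GByte = 1024 MByte as in the script comment:
<source lang="bash">
#!/bin/bash
# Hypothetical helpers for sizing a hybrid MPI + OpenMP job.

# Total number of cores: MPI tasks x OpenMP threads per task
total_cores() {
    local ntasks="$1" threads="$2"
    echo $(( ntasks * threads ))
}

# Memory per MPI task in MB: threads per task x memory per thread (MB)
mem_per_task_mb() {
    local threads="$1" mb_per_thread="$2"
    echo $(( threads * mb_per_thread ))
}

# Convert MB to GB (1 GB = 1024 MB), rounded up to the next whole GB
mb_to_gb_ceil() {
    local mb="$1"
    echo $(( (mb + 1023) / 1024 ))
}

# Values from the example above: 4 tasks, 28 threads, 3000 MB per thread
echo "cores:           $(total_cores 4 28)"
echo "MB per MPI task: $(mem_per_task_mb 28 3000)"
echo "GB per MPI task: $(mb_to_gb_ceil "$(mem_per_task_mb 28 3000)")"
</source>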
 

===== Intel MPI with Multithreading =====

Multiple Intel MPI tasks must be launched by the MPI parallel program '''mpiexec.hydra'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

'''For Intel MPI''', the following job script ''job_impi_omp.sh'' submits a batch job that runs the Intel MPI program ''impi_omp_program'' with 10 tasks and 96 threads per task. It requires 96000 MByte of total physical memory per task and a total wall clock time of 1 hour:


<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=96
#SBATCH --threads-per-core=1
#SBATCH --time=60
#SBATCH --mem=96000
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program
#SBATCH --output="parprog_impi_omp_%j.out"

#If using more than one MPI task per node please set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,scatter prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

# Use when a defined module environment related to Intel MPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="-binding domain=omp:compact -print-rank-map -envall"
export NUM_PROCS=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}"
echo $startexe
exec $startexe
</source>
Using Intel compiler the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If you only run one MPI task per node please set KMP_AFFINITY=compact,1,0.
 
If you want to use 128 or more nodes, you must also set the environment variable as follows: 
export I_MPI_HYDRA_BRANCH_COUNT=-1
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
 
 
Execute the script '''job_impi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p cpu ./job_impi_omp.sh
</pre>
 
The mpirun option ''-print-rank-map'' shows the bindings between MPI tasks and nodes (not very beneficial). The option ''-binding'' binds MPI tasks (processes) to a particular processor; ''domain=omp'' means that the domain size is determined by the number of threads. If you choose 2 MPI tasks per node, you should use ''-binding "cell=unit;map=bunch"''; this binding maps one MPI process to each socket.
 
 

==== Chain jobs ====
The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem by a job chain. A job chain is a sequence of jobs where each job automatically starts its successor.
<source lang="bash">
#!/bin/bash
####################################
## simple Slurm submitter script to setup ##
## a chain of jobs using Slurm ##
####################################
## ver. : 2018-11-27, KIT, SCC

## Define maximum number of jobs via positional parameter 1, default is 5
max_nojob=${1:-5}

## Define your jobscript (e.g. "~/chain_job.sh")
chain_link_job=${PWD}/chain_job.sh

## Define type of dependency via positional parameter 2, default is 'afterok'
dep_type="${2:-afterok}"
## -> List of all dependencies:
## https://slurm.schedmd.com/sbatch.html

myloop_counter=1
## Submit loop
while [ ${myloop_counter} -le ${max_nojob} ] ; do
##
## Differ slurm_opt depending on chain link number
if [ ${myloop_counter} -eq 1 ] ; then
slurm_opt=""
else
slurm_opt="-d ${dep_type}:${jobID}"
fi
##
## Print current iteration number and sbatch command
echo "Chain job iteration = ${myloop_counter}"
echo " sbatch --export=myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}"
## Store job ID for next iteration by stripping the text "Submitted batch job " from the sbatch output
jobID=$(sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2>&1 | sed 's/[S,a-z]* //g')
##
## Check if an error occurred: after a successful submission only the numeric job ID remains
if ! [[ "${jobID}" =~ ^[0-9]+$ ]] ; then
echo " -> submission failed!" ; exit 1
else
echo " -> job number = ${jobID}"
fi
##
## Increase counter
let myloop_counter+=1
done
</source>
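
The submit loop above depends on extracting the job ID from the reply of sbatch, which has the form "Submitted batch job <id>". The following sketch shows this step in isolation, using bash parameter expansion and a numeric check. The function name ''extract_job_id'' is hypothetical, and the error text in the second call is only an illustrative example.
<source lang="bash">
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from the text printed by sbatch.
# Prints the ID and returns 0 on success; prints nothing and returns 1 otherwise.
extract_job_id() {
    local reply="$1"
    local id="${reply##* }"          # keep the last blank-separated word
    if [[ "$id" =~ ^[0-9]+$ ]]; then
        echo "$id"
    else
        return 1
    fi
}

# Illustrative replies (the error text is only an example)
extract_job_id "Submitted batch job 1262"
extract_job_id "sbatch: error: invalid partition specified: foo" || echo "submission failed"
</source>
Alternatively, the sbatch option ''--parsable'' prints only the job ID (followed by a semicolon and the cluster name, if one is present), which avoids parsing the text altogether.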
 

==== GPU jobs ====

The nodes in the gpu_h100, gpu_mi300, gpu_a100_il and gpu_h100_il queues have 4 NVIDIA Ampere A100 GPUs or 4 NVIDIA Hopper H100 GPUs. Submitting a job to these queues is not enough to allocate one or more GPUs; you have to request them with the "--gres=gpu" parameter and specify how many GPUs your job needs, e.g. "--gres=gpu:2" will request two GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) add after the initial line of your script job.sh the line including the
information about the GPU usage: #SBATCH --gres=gpu:2
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh
</pre>
 
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Fri Apr 4 09:51:29 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 On | 00000000:06:00.0 Off | 0 |
| N/A 45C P0 70W / 415W | 27MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 On | 00000000:26:00.0 Off | 0 |
| N/A 45C P0 69W / 415W | 27MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3569 G /usr/libexec/Xorg 17MiB |
| 1 N/A N/A 3569 G /usr/libexec/Xorg 17MiB |
+-----------------------------------------------------------------------------------------+
</pre>
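
Inside a job script, the number of allocated GPUs can also be checked without ''nvidia-smi''. The sketch below assumes that Slurm exports the environment variable CUDA_VISIBLE_DEVICES (e.g. "0,1") for jobs submitted with ''--gres=gpu:<n>''; whether this is the case depends on the cluster configuration. The function name ''count_visible_gpus'' is hypothetical.
<source lang="bash">
#!/bin/bash
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
# Assumption: Slurm exports CUDA_VISIBLE_DEVICES (e.g. "0,1") for jobs using --gres=gpu:<n>.
count_visible_gpus() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    if [[ -z "$devices" ]]; then
        echo 0
        return
    fi
    local -a list
    IFS=, read -r -a list <<< "$devices"
    echo "${#list[@]}"
}

echo "GPUs visible to this shell: $(count_visible_gpus)"
</source>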

 
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
Please run Open MPI's mpirun using the following command:
<pre>
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
</pre>
or disable the (older) communication layer Byte Transfer Layer (short BTL) altogether:
<pre>
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
</pre>

 
 

==== LSDF Online Storage ====
On bwUniCluster 3.0 you can use the LSDF Online Storage on the HPC cluster nodes for special cases. Please request this service separately ([https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request]).
To mount the LSDF Online Storage on the compute nodes during the job runtime, the constraint flag "LSDF" has to be set.

a) add after the initial line of your script job.sh the line including the
information about the LSDF Online Storage usage: #SBATCH --constraint=LSDF
<pre>
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=120
#SBATCH --mem=200
#SBATCH --constraint=LSDF
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh
</pre>
 
For the usage of the LSDF Online Storage
the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
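
Because these variables are only meaningful when the storage is mounted, a job script can check them before using them. A minimal sketch with the hypothetical function name ''check_lsdf_env'':
<source lang="bash">
#!/bin/bash
# Hypothetical helper: verify that the LSDF environment variables are set before using them.
# Returns 0 if all are non-empty, otherwise prints the missing names to stderr and returns 1.
check_lsdf_env() {
    local name missing=0
    for name in LSDF LSDFPROJECTS LSDFHOME; do
        if [[ -z "${!name:-}" ]]; then
            echo "missing: ${name}" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_lsdf_env 2>/dev/null; then
    echo "LSDF Online Storage variables are set"
else
    echo "LSDF variables not set (job not started with --constraint=LSDF?)"
fi
</source>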
 
 

====BeeOND (BeeGFS On-Demand)====

BeeOND instances are integrated into the prolog and epilog scripts of the cluster batch system Slurm. BeeOND can be used on exclusively allocated compute nodes during the job runtime with the constraint flag "BEEOND", "BEEOND_4MDS" or "BEEOND_MAXMDS" ([[BwUniCluster_2.0_Slurm_common_Features#sbatch_Command_Parameters|Slurm Command Parameters]]):
* BEEOND: one metadata server is started on the first node
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.
* BEEOND_MAXMDS: on every node of your job a metadata server for the on-demand file system is started

BEEOND will only work if the node is allocated exclusively, meaning that no other jobs are running on the same node. To achieve this, use the batch option "--exclusive". This is particularly important in shared partitions.

As starting point we recommend using the "BEEOND" option. If you are unsure if this is sufficient for you feel free to contact the support team.
<source lang="bash">
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=BEEOND # or BEEOND_4MDS or BEEOND_MAXMDS
#SBATCH --exclusive
</source>

After your job has started you can find the private on-demand file system in '''/mnt/odfs/${SLURM_JOB_ID}''' directory. The mountpoint comes with five pre-configured directories:
<source lang="bash">
# For small files (stripe count = 1)
/mnt/odfs/${SLURM_JOB_ID}/stripe_1
# Stripe count = 4
/mnt/odfs/${SLURM_JOB_ID}/stripe_default
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_4
# Stripe count = 8, 16 or 32, use these directories for medium sized and large files or when using MPI-IO
/mnt/odfs/${SLURM_JOB_ID}/stripe_8
/mnt/odfs/${SLURM_JOB_ID}/stripe_16
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_32
</source>

If you request fewer nodes than the stripe count, the stripe count will be the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 has a stripe count of only 8.

; '''Attention:''' 
:Be careful when creating large files: always use the directory with the maximum stripe count for large files.
:If you create large files, use a higher stripe count. For example, if your largest file is 1.1 TByte, then you have to use a stripe count larger than 2,
:otherwise the available disk space is exceeded.

The capacity of the private file system depends on the number of nodes. For each node you get 750 GByte.
If you request 100 nodes for your job, the private file system has a capacity of approximately 100 * 750 GByte = 75 TByte.
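
The two rules above (capacity grows with the number of nodes, and the stripe count is limited by the number of nodes) can be sketched as follows. The function names are hypothetical; the value of 750 GByte per node is taken from the text above.
<source lang="bash">
#!/bin/bash
# Hypothetical helpers for planning a BeeOND job; 750 GB per node is taken from the text above.
GB_PER_NODE=750

# Capacity of the private file system in GB for a given number of nodes
beeond_capacity_gb() {
    local nodes="$1"
    echo $(( nodes * GB_PER_NODE ))
}

# Effective stripe count of a stripe_<n> directory: limited by the number of nodes
effective_stripe_count() {
    local requested="$1" nodes="$2"
    if (( requested > nodes )); then
        echo "$nodes"
    else
        echo "$requested"
    fi
}

echo "capacity with 100 nodes: $(beeond_capacity_gb 100) GB"
echo "stripe_16 on 8 nodes:    stripe count $(effective_stripe_count 16 8)"
</source>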

== Start time of job or resources : squeue --start ==
This command can be used by any user to display the estimated start time of a job. The estimate is based on a number of analysis types, such as historical usage, earliest available reservable resources, and priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by '''any user'''.
 
 

== List of your submitted jobs : squeue ==
squeue displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by any user.
 
 

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -l, --long
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.
|}
 

=== Examples ===
''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
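
Counting running and pending jobs from this output can be automated with standard tools. The sketch below works on a copy of the sample output above; the function name ''count_jobs_in_state'' is hypothetical, and it assumes the default column layout shown above with job names that do not contain blanks.
<source lang="bash">
#!/bin/bash
# Hypothetical helper: count the jobs in a given state (column ST) in squeue output read from stdin.
count_jobs_in_state() {
    local state="$1"
    # NR > 1 skips the header line; column 5 is the state in the layout shown above
    awk -v st="$state" 'NR > 1 && $5 == st { n++ } END { print n + 0 }'
}

# Sample output copied from the example above
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084'

echo "running: $(echo "$sample" | count_jobs_in_state R)"
echo "pending: $(echo "$sample" | count_jobs_in_state PD)"
</source>
On the cluster, the sample text would be replaced by the output of ''squeue'', e.g. ''squeue | count_jobs_in_state R''.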
 

== Shows free resources : sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 
=== Access ===
* End users can use scontrol show job to view the status of their '''own jobs''' only.
 

=== Arguments ===
{| width=750px class="wikitable"
|-
! Option !! Default !! Description !! Example
|- style="vertical-align:top;"
|- style="width:12%;"
| -d
| (n/a)
| Detailed mode
| Example: Display the state with jobid 18089884 in detailed mode. <pre>scontrol -d show job 18089884</pre>
|}
 
 

=== Scontrol show job Example ===
Here is an example from bwUniCluster 3.0.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
* In which state is the job?
<pre>$ scontrol show job 1262 | grep -i State
JobState=RUNNING Reason=None Dependency=(null)
</pre>
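
To read a single field instead of a whole line, the "Key=Value" tokens can be split up first. The following sketch works on a few lines copied from the example output above; the function name ''get_job_field'' is hypothetical, and the approach assumes that the value itself contains no blanks.
<source lang="bash">
#!/bin/bash
# Hypothetical helper: read one "Key=Value" field from "scontrol show job" output on stdin.
get_job_field() {
    local key="$1"
    # Split the output into one token per line, then print the value of the first match
    tr -s ' ' '\n' | awk -F= -v k="$key" '$1 == k { print substr($0, length(k) + 2); exit }'
}

# Sample lines copied from the example output above
sample='JobId=1262 JobName=wrap
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

echo "$sample" | get_job_field JobState
echo "$sample" | get_job_field TimeLimit
</source>
On the cluster, the sample text would be replaced by the real output, e.g. ''scontrol show job 1262 | get_job_field JobState''.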
 

== Cancel Slurm Jobs ==
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
 
 
=== Canceling own jobs : scancel ===
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

 
{| class="wikitable"
! Flag !! Default !! Description !! Example
|- style="vertical-align:top;"
| -i, --interactive
| (n/a)
| Interactive mode.
| Cancel the job 987654 interactively. <pre> scancel -i 987654 </pre>
|-
| -t, --state
| (n/a)
| Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
| Cancel all jobs in state "PENDING". <pre> scancel -t "PENDING" </pre>
|}
 

= Resource Managers =
=== Batch Job (Slurm) Variables ===
The following environment variables of Slurm are added to your environment once your job has started
(only an excerpt of the most important ones).
{| width=750px class="wikitable"
! Environment !! Brief explanation
|-
| SLURM_JOB_CPUS_PER_NODE
| Number of CPUs per node available to the job
|-
| SLURM_JOB_NODELIST
| List of nodes dedicated to the job
|-
| SLURM_JOB_NUM_NODES
| Number of nodes dedicated to the job
|-
| SLURM_MEM_PER_NODE
| Memory per node dedicated to the job
|-
| SLURM_NPROCS
| Total number of processes dedicated to the job
|-
| SLURM_CLUSTER_NAME
| Name of the cluster executing the job
|-
| SLURM_CPUS_PER_TASK
| Number of CPUs requested per task
|-
| SLURM_JOB_ACCOUNT
| Account name
|-
| SLURM_JOB_ID
| Job ID
|-
| SLURM_JOB_NAME
| Job Name
|-
| SLURM_JOB_PARTITION
| Partition/queue running the job
|-
| SLURM_JOB_UID
| User ID of the job's owner
|-
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which sbatch was invoked.
|-
| SLURM_JOB_USER
| User name of the job's owner
|-
| SLURM_RESTART_COUNT
| Number of times job has restarted
|-
| SLURM_PROCID
| Task ID (MPI rank)
|-
| SLURM_NTASKS
| The total number of tasks available for the job
|-
| SLURM_STEP_ID
| Job step ID
|-
| SLURM_STEP_NUM_TASKS
| Task count (number of MPI ranks)
|-
| SLURM_JOB_CONSTRAINT
| Job constraints
|}
See also:
* [https://slurm.schedmd.com/sbatch.html#SECTION_INPUT-ENVIRONMENT-VARIABLES Slurm input and output environment variables]
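
As a minimal sketch of how these variables can be used, the following function prints a one-line job summary at the start of a job script. The function name ''job_summary'' is hypothetical; outside of a job the variables are unset, so placeholder values are printed instead.
<source lang="bash">
#!/bin/bash
# Hypothetical helper: print a one-line summary from the Slurm job variables listed above.
# Outside of a job the variables are unset, so placeholders are printed instead.
job_summary() {
    echo "job ${SLURM_JOB_ID:-n/a} (${SLURM_JOB_NAME:-n/a}) in partition ${SLURM_JOB_PARTITION:-n/a}:" \
         "${SLURM_NTASKS:-1} task(s) on ${SLURM_JOB_NUM_NODES:-1} node(s)"
}

job_summary
</source>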
 

=== Job Exit Codes ===
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
 
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
 
 
==== Displaying Exit Codes and Signals ====
SLURM displays a job's exit code in the output of the '''scontrol show job''' and the sview utility.
 
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon(:).
 
 
==== Submitting Termination Signal ====
The following example shows how to save the exit code of the main program in a typical job script, so that a termination can be reported.
<source lang="bash">
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
# Save the exit code immediately after the command; any later command overwrites $?
exit_code=$?
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
</source>
* Do not use ''''time'''' mpirun! The exit code will be the one submitted by the first (time) program.
* You do not need an '''exit $exit_code''' in the scripts.
 
 
----
[[#top|Back to top]]

BwUniCluster3.0/Running Jobs

2025-08-19T09:21:20Z

P Schuhmacher: /* Batch Jobs: sbatch */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. 
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. All jobs must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. It has three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any calculation on the compute nodes of bwUniCluster 3.0 requires the user to define it as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place to run compute-intensive work immediately without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

The computing time is provided in accordance with the '''fair share policy'''. The individual investment shares of the respective university and the resources already used by its members are taken into account. Furthermore, the following throttling policy is also active: The '''maximum amount of physical cores''' used at any given time from jobs running is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.
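
The node counts quoted above follow from dividing the per-user core limit by the number of physical cores per node (2 x 32 = 64 on Ice Lake nodes, 2 x 48 = 96 on standard nodes). A small sketch of this arithmetic; the names <code>CORE_CAP</code> and <code>max_nodes_for_cap</code> are illustrative only:

<source lang="bash">
#!/bin/bash
# Per-user limit of physical cores over all running jobs (value from the policy above).
CORE_CAP=1920

# Hypothetical helper: number of full nodes that fit under the core limit.
max_nodes_for_cap() {
    local cores_per_node=$1
    echo $(( CORE_CAP / cores_per_node ))
}

max_nodes_for_cap 64   # Ice Lake nodes: prints 30
max_nodes_for_cap 96   # standard nodes: prints 20
</source>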

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=12090mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
| gres=gpu:1
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-gpu=128200mb cpus-per-gpu=24
| gres=gpu:1
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
| gres=gpu:1
| time=48:00:00, nodes=9(A100)/nodes=5(H100) , mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Short Queues ==
Queues with a short runtime of 30 minutes.
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>gpu_a100_short</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=94000mb cpus-per-gpu=12
| gres=gpu:1
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)
|}
Table 2: Short Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
| gres=gpu:1
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
| gres=gpu:1
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 3: Development Queues

Default resources of a queue class define the number of tasks and the memory if they are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 1 nodes idle
Partition cpu : 1 nodes idle
Partition highmem : 2 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 0 nodes idle
Partition gpu_mi300 : 0 nodes idle
Partition dev_cpu_il : 7 nodes idle
Partition cpu_il : 2 nodes idle
Partition dev_gpu_a100_il : 1 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 1 nodes idle
Partition gpu_a100_short : 0 nodes idle
</pre>
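
When scripting around this output, the partitions that currently have idle nodes can be filtered with awk. This is a sketch that assumes the line format shown above (<code>Partition <name> : <n> nodes idle</code>); the function name <code>idle_partitions</code> is hypothetical.

<source lang="bash">
#!/bin/bash
# Read sinfo_t_idle output on stdin and print "<partition> <idle nodes>"
# for every partition with at least one idle node.
# Assumes the fourth whitespace-separated field is the idle node count.
idle_partitions() {
    awk '$1 == "Partition" && ($4 + 0) > 0 { print $2, $4 }'
}

# Usage on the cluster (sinfo_t_idle is only available there):
#   sinfo_t_idle | idle_partitions
</source>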
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job starts, however, depends on the availability of the requested resources and on the fair sharing value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be given on the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --exclusive
| #SBATCH --exclusive
| The job allocates all CPUs and GPUs on the nodes. It will not share the node with other running jobs
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted to after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
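
A minimal job script combining several of the options above might look as follows. It is a sketch, not a site-endorsed template: the job name, the output file name and the resource values are placeholders chosen for illustration, and the payload (the hypothetical function <code>report_job</code>) only echoes a few Slurm variables. In the output file name, <code>%j</code> is replaced by the job ID.

<source lang="bash">
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=example
#SBATCH --output=example-%j.out

# Placeholder payload: replace with the real commands of your job.
# The ${VAR:-...} fallbacks only matter when the script runs outside of Slurm.
report_job() {
    echo "Job ${SLURM_JOB_ID:-n/a} runs in partition ${SLURM_JOB_PARTITION:-n/a}"
    echo "Submitted from ${SLURM_SUBMIT_DIR:-$PWD}"
}

report_job
</source>

Saved as <code>example.sh</code> (a hypothetical file name), the script would be submitted with <code>sbatch example.sh</code>. Options given on the command line take precedence over the <code>#SBATCH</code> lines in the script.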
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.

An interactive parallel application running on one or several compute nodes (here, e.g., 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
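
For scripted monitoring, the default output can be summarized with awk. This is a sketch under the assumption that the column layout matches the example above (state code in the fifth column); <code>count_job_states</code> is a hypothetical helper name.

<source lang="bash">
#!/bin/bash
# Read default squeue output on stdin and print the number of jobs per state code.
# The header line (NR == 1) and empty lines are skipped.
count_job_states() {
    awk 'NR > 1 && NF { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on the cluster:
#   squeue | count_job_states
</source>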

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
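
Single fields such as <code>JobState</code> or <code>ExitCode</code> can be pulled out of this output in a script. The following is a sketch that assumes the space-separated <code>Key=Value</code> layout shown above; <code>job_field</code> is a hypothetical helper and does not handle values that contain spaces.

<source lang="bash">
#!/bin/bash
# Print the value of one Key=Value field from "scontrol show job" output on stdin.
# Each space-separated token is put on its own line, then the first token whose
# key matches exactly is printed without the "Key=" prefix.
job_field() {
    local key=$1
    tr ' ' '\n' | awk -F= -v k="$key" '$1 == k { print substr($0, length(k) + 2); exit }'
}

# Usage on the cluster:
#   scontrol show job 1262 | job_field JobState
</source>

For the example job above, <code>job_field ExitCode</code> would yield <code>0:0</code>, i.e. the exit code followed by the signal number.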
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Batch Queues

2025-08-13T12:37:00Z

P Schuhmacher: /* Regular Queues */

{|style="background:#FEF4AB; width:100%;"
|style="padding:5px; background:#FEF4AB; text-align:left"|
This page is work in progress.
|}

== Partitions, Queues and Jobs==

=== Partitions ===
Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

=== Queues ===

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request nodes without GPUs. Normal or very high memory capacity.
** gpu: Jobs that request GPU accelerators on one or more than one node.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to computer resources without having to wait. This is the place to realize instantaneous heavy compute without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

=== Jobs ===

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster2.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

== Batch Jobs: sbatch ==

=== Regular Queues ===
{| class="wikitable"
|-
! style="width:5%"| queue
! style="width:13%"| node
! style="width:23%"| default resources
! style="width:13%"| minimal resources
! style="width:13%"| maximum resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=80, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=70, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=48:00:00, nodes=9(A100)/nodes=5(H100), mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

=== Development Queues ===
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| queue
! style="width:13%"| node
! style="width:23%"| default resources
! style="width:13%"| minimal resources
! style="width:13%"| maximum resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

Default resources of a queue class define the time, number of tasks and memory if they are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_2.0_Slurm_common_Features|here]].

=== Queue class examples ===

To run your batch job on one of the cpu nodes, please use:

<pre>
$ sbatch --partition=dev_cpu <job_script>
</pre>
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.

An interactive parallel application running on one compute node or on many compute nodes (e.g. here 5 nodes) with 40 cores each requires usually an amount of memory in GByte (e.g. 50 GByte) and a maximum time (e.g. 1 hour). E.g. 5 nodes can be allocated by the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=40 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 200 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 200 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

BwUniCluster3.0/Batch Queues

2025-08-13T12:36:25Z

P Schuhmacher: /* Regular Queues */

{|style="background:#FEF4AB; width:100%;"
|style="padding:5px; background:#FEF4AB; text-align:left"|
This page is work in progress.
|}

== Partitions, Queues and Jobs==

=== Partitions ===
Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

=== Queues ===

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request nodes without GPUs. Normal or very high memory capacity.
** gpu: Jobs that request GPU accelerators on one or more than one node.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without having to wait. They are the place for short bursts of heavy computation that would disturb other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

=== Jobs ===

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

== Batch Jobs: sbatch ==

=== Regular Queues ===
{| class="wikitable"
|-
! style="width:5%"| queue
! style="width:13%"| node
! style="width:23%"| default resources
! style="width:13%"| minimal resources
! style="width:13%"| maximum resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=80, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=70, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9(A100)/nodes=5(H100), mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

=== Development Queues ===
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| queue
! style="width:13%"| node
! style="width:23%"| default resources
! style="width:13%"| minimal resources
! style="width:13%"| maximum resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the time, the number of tasks and the memory if these are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_2.0_Slurm_common_Features|here]].

=== Queue class examples ===

To run your batch job on one of the cpu nodes, please use:

<pre>
$ sbatch --partition=dev_cpu
</pre>
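
Because a request needs at least the queue and the time, a small wrapper can check that both are present before anything is submitted. The following is only a sketch: the function name <code>print_sbatch_cmd</code> and the script name <code>my_job.sh</code> are hypothetical, and the function merely prints the command line instead of calling sbatch.

```shell
# Hypothetical helper: prints an sbatch command line, it does not submit anything.
# Usage: print_sbatch_cmd <queue> <time> <job_script>
print_sbatch_cmd() {
    if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
        echo "usage: print_sbatch_cmd <queue> <time> <job_script>" >&2
        return 1
    fi
    printf 'sbatch --partition=%s --time=%s %s\n' "$1" "$2" "$3"
}

# 20 minutes on the development queue of the standard CPU nodes
cmd=$(print_sbatch_cmd dev_cpu 20 my_job.sh)
echo "$cmd"
```

Once the printed line looks right, it can be copied to the prompt on a login node.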
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After executing this command, '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example; otherwise the system will kill the job during runtime.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, are revoked automatically.

An interactive parallel application running on one or more compute nodes (here, e.g., 5 nodes with 40 cores each) usually needs a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=40 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 200 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 200 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

BwUniCluster3.0/Running Jobs

2025-08-13T12:35:24Z

P Schuhmacher: /* Regular Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. 
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Therefore, every job has to be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions.
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without having to wait. They are the place for short bursts of heavy computation that would disturb other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

The computing time is provided in accordance with the '''fair share policy'''. The individual investment shares of the respective university and the resources already used by its members are taken into account. In addition, the following throttling policy is active: The '''maximum number of physical cores''' used at any given time by running jobs is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.
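
The node counts follow directly from the cores per node listed under Partitions (2 x 32 on Ice Lake nodes, 2 x 48 on standard nodes). A short check of the arithmetic, with variable names chosen only for this illustration:

```shell
core_limit=1920         # per-user limit on physical cores in running jobs
cores_per_il_node=64    # Ice Lake node: 2 sockets x 32 cores
cores_per_std_node=96   # standard node: 2 sockets x 48 cores

il_nodes=$((core_limit / cores_per_il_node))
std_nodes=$((core_limit / cores_per_std_node))
echo "Ice Lake: $il_nodes nodes, standard: $std_nodes nodes"
```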

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=12090mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
| gres=gpu:1
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-gpu=128200mb cpus-per-gpu=24
| gres=gpu:1
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
| gres=gpu:1
| time=48:00:00, nodes=9(A100)/nodes=5(H100) , mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Short Queues ==
Queues with a short runtime of 30 minutes.
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>gpu_a100_short</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=94000mb cpus-per-gpu=12
| gres=gpu:1
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)
|}
Table 2: Short Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
| gres=gpu:1
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
| gres=gpu:1
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 3: Development Queues

The default resources of a queue class define the number of tasks and the memory if these are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].
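
As a rough illustration of how a <code>mem-per-cpu</code> default turns into a job's memory, the sketch below multiplies the number of allocated CPUs by the default of the queue. It assumes the usual Slurm behaviour that the default applies per allocated CPU; the function name <code>default_mem_mb</code> is hypothetical.

```shell
# Assumption: default job memory = allocated CPUs x mem-per-cpu default (in MB).
default_mem_mb() {
    echo $(( $1 * $2 ))
}

mem_small=$(default_mem_mb 2 2000)     # 2 allocated CPUs in queue cpu
mem_il_node=$(default_mem_mb 64 2000)  # 64 CPUs in queue cpu_il
echo "2 CPUs: ${mem_small} MB, 64 CPUs: ${mem_il_node} MB"
```

The first value agrees with the <code>scontrol show job</code> example further down this page, which reports cpu=2 and mem=4000M for a job that did not request memory explicitly.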

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 1 nodes idle
Partition cpu : 1 nodes idle
Partition highmem : 2 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 0 nodes idle
Partition gpu_mi300 : 0 nodes idle
Partition dev_cpu_il : 7 nodes idle
Partition cpu_il : 2 nodes idle
Partition dev_gpu_a100_il : 1 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 1 nodes idle
Partition gpu_a100_short : 0 nodes idle
</pre>
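
The output is easy to filter when looking for a queue with enough free nodes. The sketch below works on an excerpt of the sample output above so that it also runs without access to the cluster; on a login node the output of <code>sinfo_t_idle</code> could be piped into the same awk filter. It assumes the line layout shown above.

```shell
# Excerpt of the sample output above; replace with real sinfo_t_idle output.
sample='Partition dev_cpu : 1 nodes idle
Partition cpu : 1 nodes idle
Partition highmem : 2 nodes idle
Partition gpu_h100 : 0 nodes idle
Partition dev_cpu_il : 7 nodes idle
Partition cpu_il : 2 nodes idle'

# Print every partition that has at least "min" idle nodes.
idle=$(printf '%s\n' "$sample" |
    awk -v min=2 '$1 == "Partition" && $4 + 0 >= min { print $2 }')
echo "$idle"
```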
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job actually starts, however, depends on the availability of the requested resources and the fair sharing value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
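
To show how the script form of these options fits together, the following sketch writes a minimal job script to a temporary directory. The file name <code>my_job.sh</code>, the job name <code>my_test</code> and the program <code>my_mpi_program</code> are placeholders, and the resource values are only an example that stays within the limits of the <code>cpu</code> queue.

```shell
workdir=$(mktemp -d)

# Write the job script; the quoted 'EOF' keeps the content literal.
cat > "$workdir/my_job.sh" <<'EOF'
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=96
#SBATCH --job-name=my_test
#SBATCH --output=my_test.out

mpirun ./my_mpi_program
EOF

echo "Submit on a login node with: sbatch $workdir/my_job.sh"
```

Options given on the sbatch command line take precedence over the <code>#SBATCH</code> lines in the script.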
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example; otherwise the system will kill the job during runtime.
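
The time limit can be written as plain minutes (<code>-t 120</code>) or as hours:minutes:seconds (<code>-t 01:00:00</code>); both notations are used on this page. The helper below is a sketch that converts these two notations to minutes so that a limit can be compared with the queue maximum. The name <code>to_minutes</code> is hypothetical, and other notations accepted by Slurm (for example with a leading number of days) are deliberately not handled.

```shell
# Convert a time limit given as "minutes" or "HH:MM:SS" to whole minutes.
to_minutes() {
    echo "$1" | awk -F: '
        NF == 1 { print $1 + 0 }                           # plain minutes
        NF == 3 { print $1 * 60 + $2 + ($3 > 0 ? 1 : 0) }  # round seconds up
    '
}

serial_limit=$(to_minutes 120)          # as in: salloc ... -t 120
parallel_limit=$(to_minutes 01:00:00)   # as in: salloc ... -t 01:00:00
echo "$serial_limit min, $parallel_limit min"
```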

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, are revoked automatically.

An interactive parallel application running on one or more compute nodes (here, e.g., 5 nodes with 96 cores each) usually needs a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).
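
The number of MPI ranks in the examples above follows from the allocation: nodes times tasks per node. Inside a job, Slurm exports this information in environment variables such as <code>SLURM_JOB_NUM_NODES</code> and <code>SLURM_NTASKS_PER_NODE</code>. The sketch below reads them and falls back to the values of the salloc example (5 nodes, 96 tasks per node) when it is run outside of a job; the variable names on the left-hand side are chosen only for this illustration.

```shell
# Values set by Slurm inside an allocation; the defaults after ":-" mirror
# the salloc example above and are only used outside of a job.
nodes=${SLURM_JOB_NUM_NODES:-5}
tasks_per_node=${SLURM_NTASKS_PER_NODE:-96}

total_ranks=$((nodes * tasks_per_node))
half_ranks=$((total_ranks / 2))

echo "all cores:  mpirun <my_mpi_program>   ($total_ranks ranks)"
echo "half of it: mpirun -n $half_ranks <my_mpi_program>"
```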

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
Displays information about YOUR active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
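
The job ID and node name needed for <code>srun --jobid</code> and <code>srun --nodelist</code> can be picked out of this output with awk. The sketch below works on the sample output above so that it also runs without the cluster; it assumes the default column layout shown here and job names without spaces.

```shell
# Sample squeue output from above; on the cluster, pipe squeue itself.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084'

# Skip the header line and print job ID and node list of running jobs only.
running=$(printf '%s\n' "$sample" | awk 'NR > 1 && $5 == "R" { print $1, $8 }')
echo "$running"
```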

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
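
Since the output consists of <code>KEY=value</code> pairs, single fields are easy to extract in a script. The sketch below uses a few lines of the sample output above; the function name <code>job_field</code> is hypothetical, and on the cluster the output of <code>scontrol show job <jobid></code> would take the place of the sample text.

```shell
# A few lines of the sample output above.
sample='JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002'

# Print the value of one KEY=value field (exact match on the key).
job_field() {
    printf '%s\n' "$sample" | tr ' ' '\n' |
        awk -F= -v key="$1" '$1 == key { print $2 }'
}

state=$(job_field JobState)
limit=$(job_field TimeLimit)
node=$(job_field NodeList)
echo "state=$state limit=$limit node=$node"
```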
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-08-13T12:35:08Z

P Schuhmacher: /* Regular Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. 
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Therefore, every job has to be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions.
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without having to wait. They are the place for short bursts of heavy computation that would disturb other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse these queues for regular, short-running jobs or chain jobs! Only one running job at a time is allowed, and the maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

Computing time is provided in accordance with the '''fair share policy''', which takes into account the individual investment share of the respective university and the resources already used by its members. In addition, the following throttling policy is active: the '''maximum number of physical cores''' used at any given time by running jobs is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.
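
The relation between the per-user core limit and the two node counts can be checked with shell arithmetic. The following is a minimal sketch; the variable names are illustrative, and the cores per node (64 on Ice Lake, 96 on standard nodes) are taken from the ntasks-per-node values in Table 1.

```shell
#!/bin/bash
# Per-user limit on physical cores across all running jobs (from the policy).
max_cores=1920

# Cores per node, taken from ntasks-per-node in Table 1.
cores_il=64      # Ice Lake nodes (queue cpu_il)
cores_std=96     # standard nodes (queue cpu)

# Integer division gives the number of full nodes that fit into the limit.
nodes_il=$(( max_cores / cores_il ))
nodes_std=$(( max_cores / cores_std ))

echo "Ice Lake nodes: ${nodes_il}, standard nodes: ${nodes_std}"
```

Because the limit is aggregated over all running jobs, several smaller jobs count against the same budget as one large job.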

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=12090mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
| gres=gpu:1
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-gpu=128200mb cpus-per-gpu=24
| gres=gpu:1
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
| gres=gpu:1
| time=48:00:00, nodes=9 (A100) or nodes=5 (H100), mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Short Queues ==
Queues with a short runtime of 30 minutes.
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>gpu_a100_short</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=94000mb cpus-per-gpu=12
| gres=gpu:1
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)
|}
Table 2: Short Queues

== Development Queues ==
These queues are intended only for development work such as debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
| gres=gpu:1
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
| gres=gpu:1
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 3: Development Queues

The default resources of a queue define the number of tasks and the amount of memory used when these are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) that reports how many nodes are available for immediate use on the system. Users can use this information to submit jobs that fit these free resources and thus obtain quick job turnaround times.
 
 

* The following command displays the resources that are available for immediate use in each partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 1 nodes idle
Partition cpu : 1 nodes idle
Partition highmem : 2 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 0 nodes idle
Partition gpu_mi300 : 0 nodes idle
Partition dev_cpu_il : 7 nodes idle
Partition cpu_il : 2 nodes idle
Partition dev_gpu_a100_il : 1 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 1 nodes idle
Partition gpu_a100_short : 0 nodes idle
</pre>
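
The output of sinfo_t_idle can be filtered to show only the partitions that currently have idle nodes. The following is a minimal sketch using awk. It assumes the output format shown above (partition name in the second field, idle node count in the fourth) and works on a hard-coded sample instead of a live call. The function name <code>idle_partitions</code> is illustrative and not part of the SCC script.

```shell
#!/bin/bash
# Print "partition count" for every partition with at least one idle node.
# Assumes lines of the form: "Partition <name> : <n> nodes idle".
idle_partitions() {
    awk '$1 == "Partition" && $4 > 0 { print $2, $4 }'
}

# Sample input copied from the output above. On the cluster you would
# pipe the real command instead: sinfo_t_idle | idle_partitions
sample='Partition dev_cpu : 1 nodes idle
Partition gpu_h100 : 0 nodes idle
Partition dev_cpu_il : 7 nodes idle'

result=$(printf '%s\n' "$sample" | idle_partitions)
echo "$result"
```

The idle counts are only a snapshot; another job may claim the nodes before your submission is scheduled.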
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair share value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be given on the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory per node in megabytes. (You should omit this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum memory required per allocated CPU, in megabytes. (You should omit this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted to after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
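
The options from the table can be combined in a job script. The following is a minimal sketch, not a site-recommended template: the job name, the output file name and the resource values are placeholders. It sets the queue and the time, which are the minimum required specification, and leaves out the memory options as advised in the table. The <code>${VAR:-default}</code> fallbacks exist only so that the script can also be run outside of Slurm for a quick check.

```shell
#!/bin/bash
# Queue (see Table 3) and wall clock limit of 10 minutes.
#SBATCH --partition=dev_cpu
#SBATCH --time=00:10:00
# One task on one node.
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Placeholder job name; %j in the file name is replaced by the job ID.
#SBATCH --job-name=hello_job
#SBATCH --output=hello_job_%j.out

# SLURM_JOB_ID and SLURM_JOB_PARTITION are set by Slurm inside a job.
job_id="${SLURM_JOB_ID:-none}"
partition="${SLURM_JOB_PARTITION:-none}"

message="Job ${job_id} in partition ${partition} on host ${HOSTNAME:-unknown}"
echo "$message"
```

Saved under a name of your choice, for example the hypothetical <code>hello_job.sh</code>, the script would be submitted with <code>sbatch hello_job.sh</code>. Options given on the sbatch command line take precedence over the <code>#SBATCH</code> lines in the script.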
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

You will then get one core on a compute node within the partition "cpu". After executing this command, '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core, you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish in less than 2 hours in this example; otherwise the system will kill the job during runtime.
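
A bare number passed to <code>-t</code> is interpreted by Slurm as minutes, which is why <code>-t 120</code> in the example above corresponds to 2 hours. The following sketch converts minutes into the equivalent hours:minutes:seconds form; the helper name <code>minutes_to_hms</code> is illustrative.

```shell
#!/bin/bash
# Convert a number of minutes into the HH:MM:SS form accepted by --time.
minutes_to_hms() {
    local minutes="$1"
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

walltime=$(minutes_to_hms 120)

# Only prints the equivalent salloc call; it does not request resources.
echo "salloc -p cpu -n 1 -t ${walltime} --mem=5000"
```

Both spellings request the same limit; the explicit form is easier to read in longer job scripts.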

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, are revoked automatically.

An interactive parallel application can run on one compute node or on several compute nodes with 96 cores each. It usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

To run an MPI program, simply type mpirun <program_name>; the program then runs on all 480 cores. A very simple example of starting a parallel job is:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
squeue displays information about your own active, pending and/or recently completed jobs. The command is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
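
For scripting, the squeue listing can be filtered by job state. The sketch below extracts the IDs of pending jobs from the default output format shown above (state code in the fifth column). It runs on a copy of the sample listing; the function name <code>pending_job_ids</code> is illustrative, and the column positions only hold as long as job names contain no spaces.

```shell
#!/bin/bash
# Print the job IDs of all pending jobs (state code "PD").
# Assumes the default squeue columns: JOBID PARTITION NAME USER ST ...
pending_job_ids() {
    awk 'NR > 1 && $5 == "PD" { print $1 }'
}

# Sample listing copied from above. On the cluster: squeue | pending_job_ids
listing='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084'

pending=$(printf '%s\n' "$listing" | pending_job_ids)
echo "Pending jobs: ${pending}"
```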

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
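
Single fields can be pulled out of the key=value output of scontrol show job. The following is a minimal sketch that works on an excerpt of the output above. The helper name <code>job_field</code> is illustrative, and the approach assumes that the requested value contains no spaces.

```shell
#!/bin/bash
# Print the value of one "Key=Value" field from scontrol show job output.
# Usage on the cluster: scontrol show job <jobid> | job_field JobState
job_field() {
    local key="$1"
    # Put every token on its own line, then match the key exactly.
    tr -s ' ' '\n' |
        awk -F= -v k="$key" '$1 == k { print substr($0, length(k) + 2); exit }'
}

# Excerpt copied from the example output above.
excerpt='JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

state=$(printf '%s\n' "$excerpt" | job_field JobState)
limit=$(printf '%s\n' "$excerpt" | job_field TimeLimit)
echo "state=${state} limit=${limit}"
```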
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. It is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or in the manpage (man scancel). Its syntax is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-08-13T12:34:33Z

P Schuhmacher: /* Regular Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script, or the resources can be accessed interactively.
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Every job submission is therefore made with Slurm commands. Slurm queues and runs user jobs based on fair share policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Every calculation on the compute nodes of bwUniCluster 3.0 must be defined as a sequence of commands, together with the required run time, number of CPU cores, and main memory. This complete definition, the '''batch job''', is then submitted to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without having to wait. They are the right place for short bursts of heavy computation, which would disturb other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse these queues for regular, short-running jobs or chain jobs! Only one running job at a time is allowed, and the maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

Computing time is provided in accordance with the '''fair share policy''', which takes into account the individual investment share of the respective university and the resources already used by its members. In addition, the following throttling policy is active: the '''maximum number of physical cores''' used at any given time by running jobs is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=12090mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
| gres=gpu:1
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-gpu=128200mb cpus-per-gpu=24
| gres=gpu:1
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
| gres=gpu:1
| time=48:00:00, nodes=9 (A100) or nodes=5 (H100), mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Short Queues ==
Queues with a short runtime of 30 minutes.
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>gpu_a100_short</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=94000mb cpus-per-gpu=12
| gres=gpu:1
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)
|}
Table 2: Short Queues

== Development Queues ==
These queues are intended only for development work such as debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
| gres=gpu:1
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
| gres=gpu:1
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 3: Development Queues

The default resources of a queue define the number of tasks and the amount of memory used when these are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].
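
How a per-CPU default turns into a memory request can be estimated with a small calculation. The sketch below assumes the usual Slurm behaviour that the mem-per-cpu default is multiplied by the number of requested CPUs; the function name <code>default_mem_mb</code> is illustrative. The memory actually allocated can be higher, for example when whole cores with two hardware threads are assigned to a job.

```shell
#!/bin/bash
# Estimated default memory request (in MB) when neither --mem nor
# --mem-per-cpu is given: requested CPUs times the queue's mem-per-cpu.
default_mem_mb() {
    local ntasks="$1" cpus_per_task="$2" mem_per_cpu_mb="$3"
    echo $(( ntasks * cpus_per_task * mem_per_cpu_mb ))
}

# 96 single-CPU tasks in the cpu queue (mem-per-cpu=2000mb, Table 1).
mem_cpu_queue=$(default_mem_mb 96 1 2000)
echo "Estimated default request: ${mem_cpu_queue} MB"
```

For this example the estimate stays below the per-node maximum of 380000 MB listed for the cpu queue in Table 1.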

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) that reports how many nodes are available for immediate use on the system. Users can use this information to submit jobs that fit these free resources and thus obtain quick job turnaround times.
 
 

* The following command displays the resources that are available for immediate use in each partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 1 nodes idle
Partition cpu : 1 nodes idle
Partition highmem : 2 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 0 nodes idle
Partition gpu_mi300 : 0 nodes idle
Partition dev_cpu_il : 7 nodes idle
Partition cpu_il : 2 nodes idle
Partition dev_gpu_a100_il : 1 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 1 nodes idle
Partition gpu_a100_short : 0 nodes idle
</pre>
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair share value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be given on the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory per node in megabytes. (You should omit this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum memory required per allocated CPU, in megabytes. (You should omit this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted to after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

You will then get one core on a compute node within the partition "cpu". After executing this command, '''DO NOT CLOSE''' your current terminal session, but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core, you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish in less than 2 hours in this example; otherwise the system will kill the job at runtime.

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are automatically revoked.

An interactive parallel application running on one or many compute nodes (here, e.g., 5 nodes) with 96 cores each usually requires an amount of memory in GByte (e.g. 50 GByte) and a maximum time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI programs, simply type mpirun <program_name>. Your program will then run on 480 cores. A very simple example of starting a parallel job:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
Displays information about YOUR active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
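Because squeue prints one job per line, its output is easy to summarise with standard tools. The following is a minimal sketch that counts jobs per state code; the function names are hypothetical, and the sample text is copied from the listing above rather than produced by a live squeue call:

```shell
#!/bin/bash
# Sample squeue output taken from the example above (not a live query).
sample_squeue_output() {
    cat <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
EOF
}

# Count jobs whose state column (5th field) equals the given state code.
# The header line (NR == 1) is skipped.
count_jobs_in_state() {
    awk -v state="$1" 'NR > 1 && $5 == state { n++ } END { print n + 0 }'
}

sample_squeue_output | count_jobs_in_state R
sample_squeue_output | count_jobs_in_state PD
```

On the cluster, the same filter could presumably be fed from the real command, e.g. <code>squeue | count_jobs_in_state PD</code>.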

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-08-13T12:33:22Z

P Schuhmacher: /* Regular Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. 
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Every job must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place to run heavy computations immediately without affecting other users, as would happen on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

The computing time is provided in accordance with the '''fair share policy'''. The individual investment shares of the respective university and the resources already used by its members are taken into account. In addition, the following throttling policy is active: the '''maximum number of physical cores''' used by running jobs at any given time is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.
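
The node counts quoted above follow from the per-node core counts given in the queue tables (64 tasks per node on Ice Lake, 96 on the standard nodes). A small sketch of that arithmetic; the function name <code>nodes_for_core_limit</code> is hypothetical and only used for illustration:

```shell
#!/bin/bash
# Hypothetical helper: number of full nodes that fit under a per-user core limit.
nodes_for_core_limit() {
    local core_limit=$1 cores_per_node=$2
    echo $(( core_limit / cores_per_node ))
}

nodes_for_core_limit 1920 64   # Ice Lake nodes (2 x 32 cores)
nodes_for_core_limit 1920 96   # standard nodes (2 x 48 cores)
```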

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=12090mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
| gres=gpu:1
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-gpu=128200mb cpus-per-gpu=24
| gres=gpu:1
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
| gres=gpu:1
| time=48:00:00, nodes=5, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues
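
Before submitting, it can help to check a request against the node limits in Table 1. The following sketch hard-codes the maximum node counts from the table; the function names are hypothetical, and the values would need updating whenever the queue limits change:

```shell
#!/bin/bash
# Maximum node count per regular queue, copied from Table 1.
max_nodes_for_queue() {
    case "$1" in
        cpu_il)                  echo 30 ;;
        cpu)                     echo 20 ;;
        highmem)                 echo 4 ;;
        gpu_h100)                echo 12 ;;
        gpu_mi300)               echo 1 ;;
        gpu_a100_il|gpu_h100_il) echo 5 ;;
        *)                       return 1 ;;  # unknown queue
    esac
}

# Succeeds if the requested node count is within the queue's limit.
# Returns 2 for an unknown queue.
request_fits() {
    local queue=$1 nodes=$2 max
    max=$(max_nodes_for_queue "$queue") || return 2
    [ "$nodes" -le "$max" ]
}

if request_fits cpu 5; then
    echo "request fits"
else
    echo "too many nodes for this queue"
fi
```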

== Short Queues ==
Queues with a short runtime of 30 minutes.
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>gpu_a100_short</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=94000mb cpus-per-gpu=12
| gres=gpu:1
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)
|}
Table 2: Short Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
| gres=gpu:1
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
| gres=gpu:1
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 3: Development Queues

Default resources of a queue class define the number of tasks and the memory if these are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 1 nodes idle
Partition cpu : 1 nodes idle
Partition highmem : 2 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 0 nodes idle
Partition gpu_mi300 : 0 nodes idle
Partition dev_cpu_il : 7 nodes idle
Partition cpu_il : 2 nodes idle
Partition dev_gpu_a100_il : 1 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 1 nodes idle
Partition gpu_a100_short : 0 nodes idle
</pre>
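The output of sinfo_t_idle can be filtered to list only the partitions that currently have idle nodes. A minimal sketch, assuming the line format shown above; the sample text is a shortened copy of that listing and the function names are hypothetical:

```shell
#!/bin/bash
# Shortened copy of the sinfo_t_idle listing above (not a live query).
sample_idle_output() {
    cat <<'EOF'
Partition dev_cpu : 1 nodes idle
Partition cpu : 1 nodes idle
Partition highmem : 2 nodes idle
Partition gpu_h100 : 0 nodes idle
EOF
}

# Print the names of partitions with at least one idle node.
# Fields: $1 = "Partition", $2 = name, $3 = ":", $4 = idle node count.
partitions_with_idle_nodes() {
    awk '$1 == "Partition" && $4 > 0 { print $2 }'
}

sample_idle_output | partitions_with_idle_nodes
```

On the cluster, the filter could presumably be applied directly: <code>sinfo_t_idle | partitions_with_idle_nodes</code>.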
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job actually starts, however, depends on the availability of the requested resources and the fair sharing value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory per node in megabytes. (You should omit this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum memory required per allocated CPU, in megabytes. (You should omit this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The project group a job is accounted on appears after "Account=" in the output of "scontrol show job".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
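The script column of the table is typically combined into a job script. The following is a minimal sketch, assuming a single-task job in the <code>cpu</code> queue; the helper name, job name, output file name and the <code>echo</code> payload are placeholders and not site recommendations:

```shell
#!/bin/bash
# Write a minimal job script to the given path.
# All option values are illustrative placeholders.
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=example
#SBATCH --output=example-%j.out

echo "Running on $(hostname)"
EOF
}

write_job_script job.sh
# Submit on a login node with: sbatch job.sh
```

The queue and the time limit are always given here, since a request needs at least these two. In <code>--output</code>, Slurm replaces <code>%j</code> with the job ID.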
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

You will then get one core on a compute node within the partition "cpu". After executing this command, '''DO NOT CLOSE''' your current terminal session, but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core, you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish in less than 2 hours in this example; otherwise the system will kill the job at runtime.
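
The 2-hour limit comes from <code>-t 120</code>: Slurm interprets a bare number given to <code>-t</code> as minutes. A small sketch of the conversion to the <code>HH:MM:SS</code> form used elsewhere on this page; the function name is hypothetical:

```shell
#!/bin/bash
# Convert a time limit given in whole minutes to HH:MM:SS.
minutes_to_hms() {
    local minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

minutes_to_hms 120   # the limit used in the salloc example
minutes_to_hms 30    # the limit of the development queues
```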

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are automatically revoked.

An interactive parallel application running on one or many compute nodes (here, e.g., 5 nodes) with 96 cores each usually requires an amount of memory in GByte (e.g. 50 GByte) and a maximum time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI programs, simply type mpirun <program_name>. Your program will then run on 480 cores. A very simple example of starting a parallel job:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
Displays information about YOUR active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
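Individual fields can be pulled out of this Key=Value output, for example the project group after "Account=". A minimal sketch with a hypothetical helper <code>scontrol_field</code>; the sample line is copied from the output above instead of calling scontrol:

```shell
#!/bin/bash
# One line of scontrol output, copied from the example above.
sample_scontrol_line() {
    echo "Priority=4246 Nice=0 Account=ka QOS=normal"
}

# Print the value of a Key=Value field from scontrol-style output.
# Limitation: values that themselves contain "=" (e.g. ReqTRES) are
# cut off at the second "=".
scontrol_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}

sample_scontrol_line | scontrol_field Account
```

On the cluster, the helper could presumably be used as <code>scontrol show job 1262 | scontrol_field Account</code>.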
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-07-22T06:57:56Z

P Schuhmacher: /* Short Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. 
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Every job must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place to run heavy computations immediately without affecting other users, as would happen on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

Computing time is provided in accordance with the '''fair share policy''', which takes into account the investment share of each university and the resources its members have already used. In addition, a throttling policy is active: the '''maximum number of physical cores''' used by running jobs at any given time is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and to maximize the number of users who can access computing time at the same time.
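
The node counts quoted in the policy follow from dividing the per-user core limit by the number of cores per node. A minimal sketch of that arithmetic in Bash, assuming 64 cores per Ice Lake node and 96 cores per standard node as given in the partition description; the function name <code>max_nodes</code> is only illustrative:

```shell
#!/bin/bash
# Per-user limit on physical cores across all running jobs (from the policy).
CORE_LIMIT=1920

# max_nodes CORES_PER_NODE: whole nodes that fit under the core limit.
max_nodes() {
    local cores_per_node=$1
    echo $(( CORE_LIMIT / cores_per_node ))
}

echo "Ice Lake (64 cores/node): $(max_nodes 64) nodes"
echo "Standard (96 cores/node): $(max_nodes 96) nodes"
```

The same division can be used to estimate how many nodes of any other partition a single user can occupy at once.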

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=12090mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-gpu=128200mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=48:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Short Queues ==
Queues with a short runtime of 30 minutes.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>gpu_a100_short</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=94000mb cpus-per-gpu=12
|
| time=30, nodes=12, mem=376000mb, ntasks-per-node=48, (threads-per-core=2)
|}
Table 2: Short Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 3: Development Queues

If a job does not set them explicitly with the sbatch command, the default resources of a queue determine its number of tasks and its memory. The sbatch options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC provides a special script (sinfo_t_idle) that reports how many nodes are available for immediate use in each partition. Users can use this information to submit jobs that fit the idle resources and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
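
The output of sinfo_t_idle can be filtered in a script, for example to list only the partitions that currently have idle nodes. The following is a sketch under the assumption that every line has exactly the layout shown in the sample output (<code>Partition &lt;name&gt; : &lt;count&gt; nodes idle</code>); the function name <code>idle_partitions</code> is hypothetical and not part of the SCC script:

```shell
#!/bin/bash
# idle_partitions: read sinfo_t_idle-style lines on stdin and print the
# names of partitions that report at least one idle node.
# Assumed line format (taken from the sample output):
#   Partition <name> : <count> nodes idle
idle_partitions() {
    awk '$1 == "Partition" && $4 + 0 > 0 { print $2 }'
}

# Sample input copied from the example; on the cluster you would pipe the
# real command instead:  sinfo_t_idle | idle_partitions
idle_partitions <<'EOF'
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
EOF
```

If the real output attaches the colon to the partition name or uses a different column order, the field numbers in the awk expression have to be adapted.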
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job; '''sbatch''' then queues the batch job. When the batch job actually starts depends on the availability of the requested resources and on the fair share value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be given on the command line or in your job script. The defaults for some of these options depend on the queue and are listed [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in megabytes per node. (You should omit this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum memory required per allocated CPU. (You should omit this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted to after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
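
As an illustration of how the script form of these options is used, the following sketch writes a minimal job script and shows how it would be submitted. The file name <code>hello_job.sh</code>, the job name and the chosen limits are assumptions made for this example; <code>%j</code> in the output file name is replaced by the job ID.

```shell
#!/bin/bash
# Write a minimal job script. The #SBATCH lines are ordinary comments to
# Bash but are read by sbatch as if given on the command line.
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=dev_cpu
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --job-name=hello
#SBATCH --output=hello_%j.out

echo "Job ${SLURM_JOB_ID:-(not under Slurm)} running on $(hostname)"
EOF

# Submit from a login node with:
#   sbatch hello_job.sh
```

Options given on the sbatch command line take precedence over the corresponding <code>#SBATCH</code> lines, so the same script can be reused with a different queue or time limit.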
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (much less than 1 hour) with low memory requirements (much less than 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that runs on a compute node and requires 5000 MByte of memory, with the interactive run limited to 2 hours, the following command has to be executed:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish in less than 2 hours in this example; otherwise the system will kill it during runtime.
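
In the salloc example, <code>-t 120</code> is interpreted as 120 minutes, because Slurm reads a bare number for the time limit as minutes; the form hours:minutes:seconds is accepted as well. The following helper is a small sketch that converts between the two notations; the function name <code>minutes_to_hms</code> is made up for this illustration:

```shell
#!/bin/bash
# minutes_to_hms MINUTES: print the equivalent hours:minutes:seconds string,
# a form that the time option also accepts.
minutes_to_hms() {
    local minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

minutes_to_hms 120   # the 2-hour limit of the salloc example
minutes_to_hms 30    # the maximum of the development queues
```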

You can now also start a graphical X11 terminal that is connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are revoked automatically.

An interactive parallel application can run on one compute node or on many compute nodes (here, for example, 5 nodes with 96 cores each). It usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores (5 nodes with 96 cores each), with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want access to another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

The jobid and the nodelist can be displayed with the command:

<pre>
$ squeue
</pre>

To run MPI programs, simply type mpirun <program_name>; your program then runs on all 480 cores. A very simple example of starting a parallel job is:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI, you must start <my_mpi_program> with the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
squeue displays information about your own active, pending, and/or recently completed jobs. The command is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the man page (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
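
The squeue output is easy to post-process in a script. As a sketch, the following function counts your jobs per state; it assumes the default column layout shown in the example (state in the fifth column) and job names without spaces. The function name <code>count_states</code> is hypothetical:

```shell
#!/bin/bash
# count_states: read default squeue output on stdin and print how many
# jobs are in each state (column 5, "ST"), skipping the header line.
count_states() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the example; on the cluster:
#   squeue | count_states
count_states <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
EOF
```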

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for one specified job. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or in the man page (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
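
Because the output consists of <code>Key=Value</code> pairs, single fields can be extracted in a script. The following is a sketch that assumes values without embedded spaces (true for the fields used here); the function name <code>job_field</code> is made up for this illustration:

```shell
#!/bin/bash
# job_field KEY: read "scontrol show job" output on stdin and print the
# value of the first KEY=value pair found.
job_field() {
    local key=$1
    tr ' ' '\n' | awk -F= -v k="$key" '$1 == k { print substr($0, length(k) + 2); exit }'
}

# Sample lines copied from the output; on the cluster:
#   scontrol show job <jobid> | job_field JobState
sample='JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

echo "$sample" | job_field JobState
echo "$sample" | job_field TimeLimit
```

Printing everything after the first <code>=</code> keeps values such as <code>cpu=1,mem=2000M</code> intact, which themselves contain equals signs.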
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. It is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or in the man page (man scancel). The general syntax is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>
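
A common use of the state filter is to cancel all of your own pending jobs at once. The following wrapper is only a sketch: the function name <code>cancel_pending</code> and the <code>DRY_RUN</code> switch are assumptions of this example, while <code>--state</code>, <code>--user</code> and <code>--partition</code> are regular scancel options. By default the command is printed instead of executed, so it can be checked first.

```shell
#!/bin/bash
# cancel_pending [PARTITION]: cancel all of your own pending jobs,
# optionally restricted to one partition. With DRY_RUN=1 (the default
# here) the command is only printed, not executed.
cancel_pending() {
    partition=$1
    set -- scancel --state=PENDING --user="${USER:-$(id -un)}"
    if [ -n "$partition" ]; then
        set -- "$@" --partition="$partition"
    fi
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show what would be executed for the dev_cpu queue.
cancel_pending dev_cpu
```

Setting <code>DRY_RUN=0</code> before the call runs the printed command for real.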

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-06-26T06:54:30Z

P Schuhmacher: /* Regular Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. 
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

Slurm is installed on bwUniCluster 3.0 as the workload management software. All jobs must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair share policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores, and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs used for developing, compiling, and testing code and workflows. Development queues are intended to give users immediate access to compute resources without a long wait. They are the right place for short bursts of heavy computation that would otherwise disturb other users on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.
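
Since every request needs at least a queue and a time, a job script can be checked for both before it is submitted. The following is a rough sketch with a hypothetical helper <code>has_queue_and_time</code>; it only looks for <code>#SBATCH</code> lines that start with the respective option and does not see options passed on the command line or several options on one line.

```shell
#!/bin/bash
# has_queue_and_time FILE: succeed if the job script FILE contains #SBATCH
# directives for both the queue (-p/--partition) and the time (-t/--time).
has_queue_and_time() {
    grep -Eq '^#SBATCH[[:space:]]+(-p|--partition)([[:space:]=]|$)' "$1" &&
    grep -Eq '^#SBATCH[[:space:]]+(-t|--time)([[:space:]=]|$)' "$1"
}

# Small demonstration with a throwaway script.
demo=$(mktemp)
printf '#!/bin/bash\n#SBATCH --partition=cpu\n#SBATCH --time=30\n' > "$demo"
if has_queue_and_time "$demo"; then
    echo "queue and time are both set"
else
    echo "queue or time is missing"
fi
rm -f "$demo"
```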

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

Computing time is provided in accordance with the '''fair share policy''', which takes into account the investment share of each university and the resources its members have already used. In addition, a throttling policy is active: the '''maximum number of physical cores''' used by running jobs at any given time is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and to maximize the number of users who can access computing time at the same time.

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=12090mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-gpu=128200mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=48:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

If a job does not set them explicitly with the sbatch command, the default resources of a queue determine its number of tasks and its memory. The sbatch options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].
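
To see what the default memory setting means in practice, the per-CPU default from the tables can be multiplied by the number of CPUs a job is allocated. The following is a sketch of that calculation, assuming that only the queue default applies and no memory option is given; the function name <code>default_mem_mb</code> is illustrative:

```shell
#!/bin/bash
# default_mem_mb CPUS MEM_PER_CPU_MB: memory a job would get when only the
# queue's default mem-per-cpu applies (no --mem / --mem-per-cpu given).
default_mem_mb() {
    local cpus=$1 mem_per_cpu=$2
    echo $(( cpus * mem_per_cpu ))
}

# 8 CPUs in the cpu queue (default mem-per-cpu=2000mb):
echo "cpu, 8 CPUs:     $(default_mem_mb 8 2000) MB"
# 8 CPUs in the highmem queue (default mem-per-cpu=12090mb):
echo "highmem, 8 CPUs: $(default_mem_mb 8 12090) MB"
```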

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC provides a special script (sinfo_t_idle) that reports how many nodes are available for immediate use in each partition. Users can use this information to submit jobs that fit the idle resources and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job; '''sbatch''' then queues the batch job. When the batch job actually starts depends on the availability of the requested resources and on the fair share value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be given on the command line or in your job script. The defaults for some of these options depend on the queue and are listed [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in megabytes per node. (You should omit this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum memory required per allocated CPU. (You should omit this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted to after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (much less than 1 hour) with low memory requirements (much less than 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that runs on a compute node and requires 5000 MByte of memory, with the interactive run limited to 2 hours, the following command has to be executed:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish within 2 hours in this example; otherwise the system will kill the job during runtime.
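The examples on this page write the time limit in two notations: <code>-t 120</code> is read by Slurm as a number of minutes, while <code>-t 01:00:00</code> uses hours:minutes:seconds. The following minimal sketch converts whole minutes into the longer notation. The helper name <code>to_hms</code> is an assumption made for illustration and is not part of Slurm.

```shell
#!/bin/bash
# to_hms: hypothetical helper, converts whole minutes into HH:MM:SS
to_hms() {
    local minutes=$1
    printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}

to_hms 120   # time limit of the serial example, prints 02:00:00
to_hms 30    # prints 00:30:00
```

Both notations describe the same limit, so <code>-t 120</code> and <code>-t 02:00:00</code> should request the same walltime.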

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, are automatically revoked.

An interactive parallel application running on one or more compute nodes (here, e.g., 5 nodes) with 96 cores each usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>
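The figure of 480 cores follows directly from the allocation request. As a small sketch of the arithmetic, using the values of the <code>salloc</code> example above (the variable names are chosen here for illustration only):

```shell
#!/bin/bash
nodes=5              # -N 5
tasks_per_node=96    # --ntasks-per-node=96
mem_per_node_gb=50   # --mem=50gb is requested per node

echo "total tasks:  $((nodes * tasks_per_node))"
echo "total memory: $((nodes * mem_per_node_gb)) GByte"
```

Because <code>--mem</code> is given per node, the allocation as a whole covers 250 GByte of memory, spread evenly over the 5 nodes.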

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
Displays information about YOUR active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
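The column layout of this output lends itself to simple filtering. The following sketch picks out the pending jobs with <code>awk</code>. It works on a copy of the sample output above instead of a live <code>squeue</code> call, and it assumes the default column order shown there.

```shell
#!/bin/bash
# Sample squeue output copied from the example above (stands in for a live call)
squeue_output='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084'

# Skip the header line, then print job ID and reason of every pending (PD) job
pending=$(printf '%s\n' "$squeue_output" | awk 'NR > 1 && $5 == "PD" { print $1, $8 }')
echo "$pending"
```

On the cluster itself, squeue's own state filter (<code>squeue -t PENDING</code>) should give a comparable selection without any post-processing.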

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
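Since the output consists of <code>Key=Value</code> pairs, single fields can be extracted with standard tools. The sketch below does this for a shortened copy of the output above. The helper name <code>get_field</code> is hypothetical, and the approach assumes that the values of interest contain no spaces.

```shell
#!/bin/bash
# Excerpt of the scontrol output shown above (stands in for a live call)
scontrol_output='JobId=1262 JobName=wrap
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

# get_field: hypothetical helper, prints the value of one Key=Value pair
get_field() {
    printf '%s\n' "$scontrol_output" | tr ' ' '\n' | sed -n "s/^$1=//p"
}

echo "account: $(get_field Account)"
echo "state:   $(get_field JobState)"
echo "limit:   $(get_field TimeLimit)"
```

The <code>Account</code> field is the one to check if you want to know which project group a job is accounted on.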
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Login

2025-04-09T11:08:50Z

P Schuhmacher: /* Login with SSH command (Linux, Mac, Windows) */

{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
Access to bwUniCluster 3.0 is '''limited to IP addresses from the BelWü network'''.
All home institutions of our current users are connected to BelWü, so if you are on your campus network (e.g. in your office or on the Campus WiFi) you should be able to connect to bwUniCluster 3.0 without restrictions.
If you are outside one of the BelWü networks (e.g. at home), a VPN connection to the home institution or a connection to an SSH jump host at the home institution must be established first.
|}

The login nodes of the bwHPC clusters are the access point to the compute system, your <code>$HOME</code> directory and your workspaces.
All users must log in through these nodes to submit jobs to the cluster.

'''Prerequisites for successful login:'''

You need to have
# Completed the 3-step [[registration|'''registration''']] procedure.
# Set a [[Registration/Password|'''service password''']] for bwUniCluster 3.0.
# Set up a [[Registration/2FA|'''second factor''']] for the time-based one-time password (TOTP).

= Login to the bwUniCluster =

Login to the bwUniCluster 3.0 is only possible with a Secure Shell (SSH) client for which you must know your username on the cluster and the hostname of the login nodes.
For more general information on SSH clients, visit the [[BwUniCluster3.0/Login/Client|SSH Clients Guide]].

== Username ==

If you want to use the bwUniCluster 3.0 you need to add a prefix to your local username.

For prefixes please refer to the [[Registration/Login/Username#Prefix_for_Universities|prefix table]].

Examples: 
* If your local username for the University is <code>ab123</code> and you are a user from the University of Freiburg this would combine to: <code>fr_ab123</code>.
* If your KIT username is <code>ab1234</code> and you are a user from KIT this would combine to: <code>ka_ab1234</code>.
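The naming rule from the examples above can be sketched in a few lines of shell. The helper name <code>cluster_username</code> is an assumption made for illustration; the prefixes themselves must be taken from the prefix table.

```shell
#!/bin/bash
# cluster_username: hypothetical helper, joins site prefix and local username
cluster_username() {
    local prefix=$1 local_name=$2
    printf '%s_%s\n' "$prefix" "$local_name"
}

cluster_username fr ab123    # University of Freiburg example, prints fr_ab123
cluster_username ka ab1234   # KIT example, prints ka_ab1234

# The result is the user part of the SSH login (printed here, not executed)
echo "ssh $(cluster_username ka ab1234)@uc3.scc.kit.edu"
```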

== Hostnames ==

The system has two login nodes.
The selection of the login node is done automatically.
If you are logging in multiple times, different sessions might run on different login nodes.

Login to bwUniCluster 3.0:

{| class="wikitable"
! Hostname !! Node type
|-
| '''bwunicluster.scc.kit.edu''' || login to one of the two login nodes
|-
| '''uc3.scc.kit.edu''' || login to one of the two login nodes
|-
|}

In general, you should use automatic selection to allow us to balance the load over the two login nodes.
If you need to connect to specific login nodes, you can use the following hostnames:

{| class="wikitable"
! Hostname !! Node type
|-
| '''uc3-login1.scc.kit.edu''' || bwUniCluster 3.0 first login node
|-
| '''uc3-login2.scc.kit.edu''' || bwUniCluster 3.0 second login node
|-
|}

== Host Keys ==

When you log in, you may receive the message <code>The authenticity of host '<host address>' can't be established.</code> along with the host key fingerprint. This is intended so that you can verify the authenticity of the host you are connecting to. Before you continue, verify that this fingerprint matches one of the following:

{| class="wikitable"
! Algorithm !! Fingerprint (SHA256)
|-
| '''RSA''' || SHA256:RaE0/tqQMMBmJuDCIo3WZ38YJsz0godVyt6aUOk/E0M
|-
| '''ECDSA''' || SHA256:LjBYL/x86ZAlL0JdlXrCmPYXvS3DaSiMuvycojBMdwQ
|-
| '''ED25519''' || SHA256:5mZYEpKigwK5ibBMHRrh3WIkOtCqomJW6H7OMbPk3ec
|-
|}
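Comparing a long fingerprint by eye is error-prone, so the check can also be scripted. The following sketch compares a fingerprint, as printed in the SSH prompt, against the three values from the table above. The helper name <code>check_fingerprint</code> is hypothetical.

```shell
#!/bin/bash
# Fingerprints from the table above
known_fingerprints='SHA256:RaE0/tqQMMBmJuDCIo3WZ38YJsz0godVyt6aUOk/E0M
SHA256:LjBYL/x86ZAlL0JdlXrCmPYXvS3DaSiMuvycojBMdwQ
SHA256:5mZYEpKigwK5ibBMHRrh3WIkOtCqomJW6H7OMbPk3ec'

# check_fingerprint: hypothetical helper, succeeds only on an exact match
check_fingerprint() {
    printf '%s\n' "$known_fingerprints" | grep -Fxq -- "$1"
}

if check_fingerprint 'SHA256:5mZYEpKigwK5ibBMHRrh3WIkOtCqomJW6H7OMbPk3ec'; then
    echo "fingerprint matches, safe to continue"
else
    echo "fingerprint unknown, do not connect"
fi
```

The options <code>-F</code> and <code>-x</code> make <code>grep</code> treat the fingerprint as a fixed string that must match a whole line, so a fingerprint that differs in a single character is rejected.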

== Login with SSH command (Linux, Mac, Windows) ==

Linux, Mac OS, other Unix-like operating systems and Microsoft Windows come with a built-in SSH client, most likely provided by the OpenSSH project.

For login use one of the following ssh commands:

 ssh <username>@uc3.scc.kit.edu
 ssh <username>@bwunicluster.scc.kit.edu



== Login with graphical SSH client (Windows) ==

For Windows we suggest using [[Data_Transfer/Graphical_Clients#MobaXterm|MobaXterm]] for login and file transfer.

Start ''MobaXterm'', fill in the following fields:
<pre>
Remote name : uc3.scc.kit.edu # or bwunicluster.scc.kit.edu
Specify user name : <username>
Port : 22
</pre>

Then click 'OK'. A terminal opens in which you can enter your credentials.

'''Note:''' When using file transfer with MobaXterm version 23.6, the following configuration change is required:
In the settings, on the tab "SSH", change the option "SSH engine" from "<new>" to "<legacy>". Then restart MobaXterm.

== Login with Jupyterhub ==

Login takes place at:
* bwUniCluster 3.0: [https://uc3-jupyter.scc.kit.edu uc3-jupyter.scc.kit.edu]
* SDIL: [https://sdil-jupyter.scc.kit.edu sdil-jupyter.scc.kit.edu]

More information can be found [[BwUniCluster3.0/Jupyter#Login_process|here]].

== Login Example ==

To log in to bwUniCluster 3.0, you must provide a one-time password (OTP) and your [[Registration/Password|service password]].
Proceed as follows:
# Connect to a login node with SSH.
# The system will ask for a one-time password <code>Your OTP:</code>. Please enter your OTP and confirm it with Enter/Return. If you do not have a second factor yet, please create one (see [[Registration/2FA]]).
# The system will ask you for your service password <code>Password:</code>. Please enter it and confirm it with Enter/Return. If you do not have a service password yet or have forgotten it, please create one (see [[Registration/Password]]).
# You will be greeted by the cluster, followed by a shell.

<pre>
[user@client ~]$ ssh ka_ab1234@uc3.scc.kit.edu
(ka_ab1234@uc3.scc.kit.edu) Your OTP: cccccctlljdbrjdleujigivvfnkjbucudugjjlutfbrk
(ka_ab1234@uc3.scc.kit.edu) Password:
********************************************************************************
* *
* Karlsruher Institut für Technologie (KIT) *
* *
* Scientific Computing Center (SCC) *
* *
* _ _ _____ ____ *
* | | | | / ____| |___ \ *
* | | | | | | __) | *
* | | | | | | |__ < *
* | |__| | | |____ ___) | *
* \____/ \_____| |____/ *
* *
* *
* (KITE 2.0, RHEL 9.4, Lustre 2.14.0_ddn154) *
* *
* *
********************************************************************************
Last login: Wed Feb 26 11:08:20 2025 from 2a00:1398:4:181c:2be1:437b:1c36:1337

[ka_ab1234@uc3n990 ~]$
</pre>

== Troubleshooting ==

See [[BwUniCluster3.0/FAQ#Login|bwUniCluster FAQ]].

= Allowed Activities on Login Nodes =

{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#ffa500; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#ffa500; text-align:left"|
To guarantee usability for all the users of clusters you must not run your compute jobs on the login nodes.
Compute jobs must be submitted to the queuing system. 
'''Any compute job running on the login nodes will be terminated without any notice.''' 
Any long-running compilation or any long-running pre- or post-processing of batch jobs must also be submitted to the queuing system.
|}

The login nodes of the bwHPC clusters are the access point to the compute system, your <code>$HOME</code> directory and your workspaces.
These nodes are shared among all users; therefore, your activities on the login nodes are limited primarily to setting up your batch jobs.
Your activities may also include:
* '''short''' compilation of your program code and
* '''lightweight''' pre- and post-processing of your batch jobs.

We advise users to use [[BwUniCluster3.0/Batch_Queues#Interactive_Jobs|interactive jobs]] for compute and memory intensive tasks like compiling.

= Related Information =

* If you want to reset your service password, consult the [[Registration/Password|Password Guide]].
* If you want to register a new token for the two factor authentication (2FA), consult the [[Registration/2FA|2FA Guide]].
* If you want to de-register, consult the [[Registration/Deregistration|De-registration Guide]].
* If you need an SSH key for your workflow, read [[Registration/SSH|Registering SSH Keys with your Cluster]].
* Configuring your shell: [[.bashrc Do's and Don'ts]]

BwUniCluster3.0/Login

2025-04-09T11:08:42Z

P Schuhmacher: /* Hostnames */


BwUniCluster3.0/Running Jobs

2025-04-08T10:54:23Z

P Schuhmacher: /* Policy */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script, or the resources can be accessed interactively.
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Any job submission by the user is therefore carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions.
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to computer resources without having to wait. This is the place to realize instantaneous heavy compute without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse these queues for regular, short-running jobs or chain jobs! Only one running job at a time is allowed, and the maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

The computing time is provided in accordance with the '''fair share policy'''. The individual investment shares of the respective university and the resources already used by its members are taken into account. Furthermore, the following throttling policy is active: The '''maximum number of physical cores''' used by running jobs at any given time is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.
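The node counts in the policy follow from the core counts per node given in the partition overview (2 x 32 cores on Ice Lake nodes, 2 x 48 cores on standard nodes). A short sketch of the calculation, in which the helper name <code>nodes_for_limit</code> is hypothetical:

```shell
#!/bin/bash
max_cores=1920   # per-user limit stated in the policy

# nodes_for_limit: hypothetical helper, number of full nodes under the limit
nodes_for_limit() {
    local cores_per_node=$1
    echo $((max_cores / cores_per_node))
}

echo "cpu_il (64 cores per node): $(nodes_for_limit 64) nodes"
echo "cpu    (96 cores per node): $(nodes_for_limit 96) nodes"
```

The limit is aggregated over all running jobs of a user, so several smaller jobs count towards the same 1920 cores.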

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=12090mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-gpu=128200mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the number of tasks and the memory that are used if they are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
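To pick a partition for an immediate start, the output can be filtered for partitions that currently have idle nodes. The sketch below works on a shortened copy of the sample output above instead of a live call and assumes the line format shown there.

```shell
#!/bin/bash
# Shortened copy of the sinfo_t_idle output above (stands in for a live call)
sinfo_output='Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition cpu_il : 0 nodes idle'

# Field 2 is the partition name, field 4 the number of idle nodes
free_partitions=$(printf '%s\n' "$sinfo_output" | awk '$4 > 0 { print $2 }')
echo "$free_partitions"
```

For this sample, the partitions <code>dev_cpu</code>, <code>cpu</code>, <code>highmem</code> and <code>gpu_h100</code> remain, while those with 0 idle nodes are dropped.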
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job actually starts, however, depends on the availability of the requested resources and the fair sharing value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be given on the command line or in your job script. The defaults for some of these options depend on the queue and can be found [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
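
The options in the table above are typically combined in a job script. The following is a minimal sketch, assuming a serial job on the <code>cpu</code> queue; the job name, the output file name and the echoed message are placeholders chosen for illustration, not site defaults.

```shell
#!/bin/bash
#SBATCH --partition=cpu          # queue to submit to
#SBATCH --time=00:20:00          # wall clock time limit (hh:mm:ss)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=example       # placeholder name
#SBATCH --output=example-%j.out  # %j is replaced by the job ID

# SLURM_JOB_ID is set by Slurm inside a job; the fallback only
# matters when the script is run by hand outside the queueing system.
job_id="${SLURM_JOB_ID:-none}"
message="Job ${job_id} started on $(uname -n)"
echo "${message}"
```

Saved as <code>example.sh</code> (a hypothetical file name), the script would be submitted with <code>sbatch example.sh</code>. Options given on the command line take precedence over the <code>#SBATCH</code> lines in the script.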
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish within 2 hours in this example; otherwise the system will kill it while it is running.
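
A plain number passed to <code>-t</code> is interpreted by Slurm as minutes, so <code>-t 120</code> and <code>-t 02:00:00</code> request the same limit. The helper below is only an illustrative sketch of that conversion; the function name <code>minutes_to_hms</code> is made up for this example and is not part of Slurm.

```shell
#!/bin/bash
# Convert a time limit given in minutes into hh:mm:ss.
minutes_to_hms() {
    local minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

limit=$(minutes_to_hms 120)
echo "-t 120 is the same limit as -t ${limit}"
```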

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are revoked automatically.

An interactive parallel application that runs on one or more compute nodes (here 5 nodes with 96 cores each) usually needs a memory request in GByte (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores (5 nodes with 96 cores each) with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.
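
To pick the job ID and node list out of the <code>squeue</code> output without reading it by eye, the columns can be filtered with <code>awk</code>. The sketch below works on sample output copied from this page rather than on a live queue, and it assumes that job names contain no spaces, so that the state is always the fifth column.

```shell
#!/bin/bash
# Sample squeue output copied from this page; on the cluster,
# replace the here-document with the output of `squeue`.
sample_output=$(cat <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
EOF
)

# Skip the header, keep running jobs (state R), print job ID and node list.
running=$(printf '%s\n' "$sample_output" | awk 'NR > 1 && $5 == "R" { print $1, $8 }')
echo "$running"
```

The two values printed per line are the ones that go into <code>--jobid=</code> and <code>--nodelist=</code> in the <code>srun</code> commands above.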

If you want to run MPI programs, simply type mpirun <program_name>. Your program will then run on all 480 cores. A very simple example of starting a parallel job:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
squeue displays information about your own active, pending and/or recently completed jobs; jobs of other users are not shown. The command is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in its manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
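
Because <code>scontrol show job</code> prints whitespace-separated <code>key=value</code> pairs, single fields can be extracted with standard text tools. The following sketch runs on an excerpt of the output shown above; the helper name <code>get_field</code> is hypothetical, and the approach assumes that the values of interest contain no spaces.

```shell
#!/bin/bash
# Excerpt of the scontrol output shown above; on the cluster, pipe
# `scontrol show job <jobid>` into the same filter instead.
sample_output=$(cat <<'EOF'
JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
Partition=cpu AllocNode:Sid=uc3n999:2819841
EOF
)

# Print the value of one key: put every pair on its own line,
# then strip the "key=" prefix from the matching line.
get_field() {
    printf '%s\n' "$sample_output" | tr ' ' '\n' | sed -n "s/^$1=//p"
}

job_state=$(get_field JobState)
time_limit=$(get_field TimeLimit)
echo "state=${job_state} limit=${time_limit}"
```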
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-08T10:53:15Z

P Schuhmacher: /* Development Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script, or the resources can be accessed interactively.
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

The workload manager Slurm is installed on bwUniCluster 3.0. Any job submission by the user therefore has to be carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place for short bursts of heavy computation that would otherwise affect other users on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

The computing time is provided in accordance with the '''fair share policy'''. The individual investment shares of the respective university and the resources already used by its members are taken into account. Furthermore, the following throttling policy is also active: The '''maximum amount of cores''' used at any given time from jobs running is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.
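
The node counts quoted for the throttling limit follow from the cores per node: 64 on Ice Lake nodes (2 processors with 32 cores each) and 96 on standard nodes (2 processors with 48 cores each). A minimal sketch of the arithmetic, with variable names chosen only for illustration:

```shell
#!/bin/bash
max_cores_per_user=1920
cores_per_node_il=64    # Ice Lake node: 2 x 32 cores
cores_per_node_std=96   # standard node: 2 x 48 cores

nodes_il=$(( max_cores_per_user / cores_per_node_il ))
nodes_std=$(( max_cores_per_user / cores_per_node_std ))
echo "Ice Lake: ${nodes_il} nodes, standard: ${nodes_std} nodes"
```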

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=12090mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-gpu=128200mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Development queues are intended only for development work, i.e. debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the number of tasks and the memory that are used if they are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].
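
As a rough illustration of how a per-CPU default turns into a job's memory, the following sketch multiplies the default mem-per-cpu=2000mb of the CPU queues by a CPU count. The count of 4 is a hypothetical example, and the number of CPUs that Slurm actually allocates can be higher than the number requested, for instance when whole cores with two hardware threads are assigned.

```shell
#!/bin/bash
mem_per_cpu_mb=2000   # default mem-per-cpu of the cpu queues (Tables 1 and 2)
cpus=4                # hypothetical number of allocated CPUs

default_mem_mb=$(( mem_per_cpu_mb * cpus ))
echo "Default memory for ${cpus} CPUs: ${default_mem_mb} MB"
```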

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
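
To list only the partitions that currently have idle nodes, the output of <code>sinfo_t_idle</code> can be filtered with <code>awk</code>. The sketch below uses the sample output from above instead of calling the script, and it assumes the line layout shown there (partition name in the second field, node count in the fourth).

```shell
#!/bin/bash
# Sample sinfo_t_idle output copied from this page; on the cluster,
# replace the here-document with the output of `sinfo_t_idle`.
sample_output=$(cat <<'EOF'
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
EOF
)

# Keep lines whose idle node count (field 4) is greater than zero.
idle_partitions=$(printf '%s\n' "$sample_output" | awk '$1 == "Partition" && $4 > 0 { print $2 }')
echo "$idle_partitions"
```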
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources that are needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair-share value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be given on the command line or in your job script. The defaults for some of these options depend on the queue and can be found [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish within 2 hours in this example; otherwise the system will kill it while it is running.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are revoked automatically.

An interactive parallel application that runs on one or more compute nodes (here 5 nodes with 96 cores each) usually needs a memory request in GByte (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores (5 nodes with 96 cores each) with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI programs, simply type mpirun <program_name>. Your program will then run on all 480 cores. A very simple example of starting a parallel job:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
squeue displays information about your own active, pending and/or recently completed jobs; jobs of other users are not shown. The command is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in its manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-08T10:52:17Z

P Schuhmacher: /* Regular Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script, or the resources can be accessed interactively.
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

The workload manager Slurm is installed on bwUniCluster 3.0. Any job submission by the user therefore has to be carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without having to wait. This is the place for short bursts of heavy computation that would affect other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

Computing time is provided in accordance with the '''fair share policy''', which takes into account the investment share of the respective university and the resources already used by its members. In addition, the following throttling policy is active: the '''maximum number of cores''' used at any given time by running jobs is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.
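
The relation between the core limit and the node counts can be checked with plain shell arithmetic. This is only an illustration: the cores-per-node values are taken from the node descriptions above (2 x 32 cores on Ice Lake nodes, 2 x 48 cores on standard nodes), and the variable names are made up for this sketch.

```shell
# Cores per node, from the node descriptions above (2 x 32 and 2 x 48 cores).
cores_per_node_il=$((2 * 32))
cores_per_node_std=$((2 * 48))

# Number of full nodes that fit under the per-user limit of 1920 cores.
max_cores=1920
nodes_il=$((max_cores / cores_per_node_il))
nodes_std=$((max_cores / cores_per_node_std))

echo "Ice Lake: $nodes_il nodes, standard: $nodes_std nodes"
```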

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=2000mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=12090mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-gpu=193300mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-gpu=128200mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues
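
Before submitting, a request can be compared with the node limits in Table 1. The following is a minimal sketch, not a tool provided on the cluster: the function names <code>max_nodes_for_queue</code> and <code>fits_queue</code> are hypothetical, and the limits are copied by hand from Table 1, so they have to be updated whenever the table changes.

```shell
# Hypothetical helper: look up the node limit of a regular queue.
# The numbers are copied from Table 1 and may be changed by the operators.
max_nodes_for_queue() {
    case "$1" in
        cpu_il)                  echo 30 ;;
        cpu)                     echo 20 ;;
        highmem)                 echo 4 ;;
        gpu_h100)                echo 12 ;;
        gpu_mi300)               echo 1 ;;
        gpu_a100_il|gpu_h100_il) echo 9 ;;
        *)                       return 1 ;;   # unknown queue name
    esac
}

# Hypothetical check: does a request for N nodes fit into the queue?
# Returns 0 if it fits, 1 if it exceeds the limit, 2 for an unknown queue.
fits_queue() {
    limit="$(max_nodes_for_queue "$1")" || return 2
    [ "$2" -le "$limit" ]
}

fits_queue cpu 16 && echo "16 nodes fit into cpu"
fits_queue highmem 8 || echo "8 nodes exceed the highmem limit"
```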

== Development Queues ==
Development queues are intended only for development, i.e. debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the number of tasks and the memory that are used if they are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) that reports how many nodes are idle and therefore available for immediate use. Users can use this information to submit jobs that fit the idle resources and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
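
The output can be filtered to list only the partitions that currently have idle nodes. The sketch below assumes that the output keeps the format shown above (partition name in the second field, node count in the fourth). To stay self-contained it works on a copy of part of the sample output; on the cluster the variable would be filled with <code>idle_report="$(sinfo_t_idle)"</code> instead.

```shell
# Sample output copied from the listing above; on the cluster use
# idle_report="$(sinfo_t_idle)" instead of this fixed text.
idle_report='Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle'

# Print the names of partitions that have at least one idle node.
# Field 2 is the partition name, field 4 the idle-node count.
idle_partitions="$(printf '%s\n' "$idle_report" |
    awk '$1 == "Partition" && $4 > 0 { print $2 }')"

printf '%s\n' "$idle_partitions"
```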
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job actually starts, however, depends on the availability of the requested resources and on the fair share value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be given on the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
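
A minimal job script that combines several of the options above might look as follows. It is a sketch, not a site-recommended template: the job name, the output file name and the resource values are placeholders chosen for illustration. Because the <code>#SBATCH</code> lines are ordinary comments for Bash, the script can also be run directly as a dry run outside of Slurm.

```shell
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --job-name=example_job
#SBATCH --output=example_job_%j.out

# Slurm sets SLURM_JOB_ID and SLURM_NTASKS inside a job; the fallbacks
# only matter when the script is run outside of Slurm for a dry run.
job_id="${SLURM_JOB_ID:-none}"
ntasks="${SLURM_NTASKS:-1}"

echo "Job $job_id started on $(uname -n) with $ntasks task(s)"
```

Saved under a name such as <code>example_job.sh</code> (a hypothetical file name), the script would be submitted with <code>sbatch example_job.sh</code>. Options given on the sbatch command line take precedence over the <code>#SBATCH</code> lines in the script.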
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with small memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish within 2 hours in this example; otherwise the system will kill it during runtime.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.

An interactive parallel application running on one or more compute nodes (here 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
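
The default squeue listing can be post-processed with standard tools, for example to pick out pending jobs. The sketch below assumes the default column layout shown above (job ID in the first column, state in the fifth) and works on a copy of the sample output; on the cluster the variable would be filled with <code>squeue_output="$(squeue)"</code>.

```shell
# Sample output copied from the listing above; on the cluster use
# squeue_output="$(squeue)" instead of this fixed text.
squeue_output='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084'

# Skip the header line (NR > 1) and inspect the state column (field 5).
pending_ids="$(printf '%s\n' "$squeue_output" |
    awk 'NR > 1 && $5 == "PD" { print $1 }')"
running_count="$(printf '%s\n' "$squeue_output" |
    awk 'NR > 1 && $5 == "R" { n++ } END { print n + 0 }')"

echo "running: $running_count, pending job IDs: $pending_ids"
```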

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
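
Single values can be extracted from the <code>Key=Value</code> pairs of this output. The helper <code>get_field</code> below is hypothetical and only a sketch: it assumes that pairs are separated by spaces or line breaks and that the value itself contains no spaces. It works on a few lines copied from the sample output; on the cluster the variable would be filled with <code>job_info="$(scontrol show job <jobid>)"</code>.

```shell
# A few lines copied from the sample output above; on the cluster use
# job_info="$(scontrol show job <jobid>)" instead of this fixed text.
job_info='JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

# Hypothetical helper: print the value of the first pair whose key is $1.
# Reads scontrol output on stdin; values containing spaces are not handled.
get_field() {
    tr ' ' '\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

job_state="$(printf '%s\n' "$job_info" | get_field JobState)"
time_limit="$(printf '%s\n' "$job_info" | get_field TimeLimit)"

echo "state: $job_state, time limit: $time_limit"
```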
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-08T10:48:30Z

P Schuhmacher: /* Development Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. 
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

The workload manager Slurm is installed on bwUniCluster 3.0. All jobs must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm, a cluster management and job scheduling system. Slurm has three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any calculation on the compute nodes of bwUniCluster 3.0 requires the user to define it as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without having to wait. This is the place for short bursts of heavy computation that would affect other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

Computing time is provided in accordance with the '''fair share policy''', which takes into account the investment share of the respective university and the resources already used by its members. In addition, the following throttling policy is active: the '''maximum number of cores''' used at any given time by running jobs is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Development queues are intended only for development, i.e. debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the number of tasks and the memory that are used if they are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) that reports how many nodes are idle and therefore available for immediate use. Users can use this information to submit jobs that fit the idle resources and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job actually starts, however, depends on the availability of the requested resources and on the fair share value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be given on the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory, with the interactive run limited to 2 hours (-t 120, given in minutes), execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish in less than 2 hours in this example; otherwise the system will kill it during runtime.

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are revoked automatically.

An interactive parallel application running on one or more compute nodes (here, 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte per node) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores (5 nodes with 96 cores each) with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
The squeue command displays information about your own active, pending and/or recently completed jobs. It is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-08T10:47:47Z

P Schuhmacher: /* Development Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. 
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

The workload manager Slurm is installed on bwUniCluster 3.0. Therefore, all jobs must be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions.
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to computer resources without having to wait. This is the place to realize instantaneous heavy compute without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

The computing time is provided in accordance with the '''fair share policy'''. The individual investment shares of the respective university and the resources already used by its members are taken into account. In addition, the following throttling policy is active: The '''maximum number of cores''' in use at any given time is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.
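
The node counts quoted for the core limit follow from the cores per node (64 on Ice Lake nodes, 96 on standard nodes). The following is only a small sketch of that arithmetic; the function name <code>max_nodes_under_cap</code> is made up for illustration and is not a tool installed on the cluster.

```shell
#!/bin/bash
# Per-user cap on concurrently used cores, as stated in the policy above.
CORE_CAP=1920

# Hypothetical helper: print how many full nodes fit under the cap
# for a given number of cores per node.
max_nodes_under_cap() {
    local cores_per_node=$1
    echo $(( CORE_CAP / cores_per_node ))
}

max_nodes_under_cap 64   # Ice Lake nodes (2 x 32 cores)
max_nodes_under_cap 96   # standard nodes (2 x 48 cores)
```

The two calls print 30 and 20, matching the node limits of the <code>cpu_il</code> and <code>cpu</code> queues.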

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Development queues are intended only for development work, i.e. debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the number of tasks and the memory if these are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job actually starts, however, depends on the availability of the requested resources and on the fair sharing value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be used on the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The project group a job is accounted on is shown after "Account=" in the output of "scontrol show job".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
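
As an illustration of how the script form of these options fits together, here is a minimal job script sketch. The file name <code>example_job.sh</code>, the job name and the output file name are hypothetical; the queue <code>cpu</code> and the 20 minute limit are taken from the examples on this page and should be adapted to your job.

```shell
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=example_job
#SBATCH --output=example_job_%j.out

# Slurm sets SLURM_JOB_ID and SLURM_NTASKS inside a job. The fallbacks
# only make the script runnable outside Slurm for a quick check.
JOB_ID="${SLURM_JOB_ID:-none}"
NTASKS="${SLURM_NTASKS:-1}"

# Replace this line with the actual commands of your calculation.
echo "Job ${JOB_ID} running ${NTASKS} task(s) on $(uname -n)"
```

Such a script would be submitted with <code>sbatch example_job.sh</code>. Slurm replaces <code>%j</code> in the output file name with the job ID. Options given on the command line take precedence over the <code>#SBATCH</code> lines in the script.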
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory, with the interactive run limited to 2 hours (-t 120, given in minutes), execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish in less than 2 hours in this example; otherwise the system will kill it during runtime.

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are revoked automatically.
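
Slurm interprets a bare integer given to <code>-t</code> as minutes, so <code>-t 120</code> and <code>-t 02:00:00</code> request the same walltime. The helper below is only an illustrative sketch of that conversion; the name <code>minutes_to_walltime</code> is made up and not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: convert a walltime in minutes to the
# hours:minutes:seconds form that Slurm also accepts.
minutes_to_walltime() {
    local minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

minutes_to_walltime 120   # the 2 hour salloc example above
minutes_to_walltime 30    # the limit of the development queues
```

The two calls print 02:00:00 and 00:30:00.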

An interactive parallel application running on one or more compute nodes (here, 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte per node) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores (5 nodes with 96 cores each) with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
The squeue command displays information about your own active, pending and/or recently completed jobs. It is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
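
If only the job ID and node of running jobs are of interest, the default squeue output can be filtered with standard tools. The following sketch assumes the default column layout shown above and job names without spaces; the function name <code>running_jobs</code> is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print "JOBID NODELIST" for every job in state R.
# Reads the default squeue column layout from stdin and skips the header.
running_jobs() {
    awk 'NR > 1 && $5 == "R" { print $1, $8 }'
}

# Sample input copied from the example above.
# On the cluster one would use: squeue | running_jobs
running_jobs <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
EOF
```

For the sample input this prints the two running jobs, 1262 on uc3n002 and 1265 on uc3n084, and omits the pending job.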

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
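
The output of scontrol show job consists of KEY=value pairs, so a single value such as the account or the time limit can be picked out with standard tools. This is only a sketch: the function name <code>job_field</code> is made up, and it assumes that the value of the requested field contains no spaces.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one KEY=value field from
# "scontrol show job" output read on stdin.
job_field() {
    local key=$1
    # Put every KEY=value pair on its own line, then select the requested key.
    tr ' ' '\n' | grep "^${key}=" | head -n 1 | cut -d= -f2-
}

# Two lines copied from the example output above.
# On the cluster one would use: scontrol show job <jobid> | job_field Account
printf 'Priority=4246 Nice=0 Account=ka QOS=normal\nRunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A\n' | job_field Account
```

For the sample lines this prints ka, the project group the job is accounted on.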
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Login

2025-04-08T08:55:25Z

P Schuhmacher: /* Login with SSH command (Linux, Mac, Windows) */

{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
Access to bwUniCluster 3.0 is '''limited to IP addresses from the BelWü network'''.
All home institutions of our current users are connected to BelWü, so if you are on your campus network (e.g. in your office or on the Campus WiFi) you should be able to connect to bwUniCluster 3.0 without restrictions.
If you are outside one of the BelWü networks (e.g. at home), a VPN connection to the home institution or a connection to an SSH jump host at the home institution must be established first.
|}

The login nodes of the bwHPC clusters are the access point to the compute system, your <code>$HOME</code> directory and your workspaces.
All users must log in through these nodes to submit jobs to the cluster.

'''Prerequisites for successful login:'''

You need to have
# Completed the 3-step [[registration|'''registration''']] procedure.
# Set a [[Registration/Password|'''service password''']] for bwUniCluster 3.0.
# Set up a [[Registration/2FA|'''second factor''']] for the time-based one-time password (TOTP).

= Login to the bwUniCluster =

Login to the bwUniCluster 3.0 is only possible with a Secure Shell (SSH) client for which you must know your username on the cluster and the hostname of the login nodes.
For more general information on SSH clients, visit the [[BwUniCluster3.0/Login/Client|SSH Clients Guide]].

== Username ==

If you want to use the bwUniCluster 3.0 you need to add a prefix to your local username.

For prefixes please refer to the [[Registration/Login/Username#Prefix_for_Universities|prefix table]].

Examples: 
* If your local username for the University is <code>ab123</code> and you are a user from the University of Freiburg this would combine to: <code>fr_ab123</code>.
* If your KIT username is <code>ab1234</code> and you are a user from KIT this would combine to: <code>ka_ab1234</code>.
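
The rule in the examples above is simply prefix, underscore, local username. A minimal sketch, with the function name <code>cluster_username</code> made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: combine the institution prefix and the local
# username into the cluster username.
cluster_username() {
    local prefix=$1 local_name=$2
    echo "${prefix}_${local_name}"
}

cluster_username fr ab123    # University of Freiburg example
cluster_username ka ab1234   # KIT example
```

The two calls print fr_ab123 and ka_ab1234.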

== Hostnames ==

The system has two login nodes.
The selection of the login node is done automatically.
If you are logging in multiple times, different sessions might run on different login nodes.

Login to bwUniCluster 3.0:

{| class="wikitable"
! Hostname !! Node type

|-
| '''uc3.scc.kit.edu''' || login to one of the two login nodes
|-
|}

In general, you should use automatic selection to allow us to balance the load over the two login nodes.
If you need to connect to specific login nodes, you can use the following hostnames:

{| class="wikitable"
! Hostname !! Node type
|-
| '''uc3-login1.scc.kit.edu''' || bwUniCluster 3.0 first login node
|-
| '''uc3-login2.scc.kit.edu''' || bwUniCluster 3.0 second login node
|-
|}

== Host Keys ==

When you log in, you may receive the message <code>The authenticity of host '<host address>' can't be established.</code> along with the host key fingerprint. This is intended so you can verify the authenticity of the host you are connecting to. Before you continue, you should verify that this fingerprint matches one of the following:

{| class="wikitable"
! Algorithm !! Fingerprint (SHA256)
|-
| '''RSA''' || SHA256:RaE0/tqQMMBmJuDCIo3WZ38YJsz0godVyt6aUOk/E0M
|-
| '''ECDSA''' || SHA256:LjBYL/x86ZAlL0JdlXrCmPYXvS3DaSiMuvycojBMdwQ
|-
| '''ED25519''' || SHA256:5mZYEpKigwK5ibBMHRrh3WIkOtCqomJW6H7OMbPk3ec
|-
|}
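
The comparison can also be scripted, which avoids mistakes when checking the long strings by eye. The sketch below only compares a fingerprint string against the three values from the table above; the function name <code>is_known_fingerprint</code> is made up for illustration.

```shell
#!/bin/bash
# Fingerprints copied from the table above, one per line.
KNOWN_FINGERPRINTS="SHA256:RaE0/tqQMMBmJuDCIo3WZ38YJsz0godVyt6aUOk/E0M
SHA256:LjBYL/x86ZAlL0JdlXrCmPYXvS3DaSiMuvycojBMdwQ
SHA256:5mZYEpKigwK5ibBMHRrh3WIkOtCqomJW6H7OMbPk3ec"

# Hypothetical helper: succeed only if the given fingerprint exactly
# matches one of the published lines.
is_known_fingerprint() {
    printf '%s\n' "$KNOWN_FINGERPRINTS" | grep -qxF -- "$1"
}

# Paste the fingerprint shown by your SSH client as the argument.
if is_known_fingerprint "SHA256:5mZYEpKigwK5ibBMHRrh3WIkOtCqomJW6H7OMbPk3ec"; then
    echo "fingerprint matches"
else
    echo "fingerprint does NOT match, do not connect"
fi
```

The match is exact and covers the whole line, so a truncated or altered fingerprint is rejected.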

== Login with SSH command (Linux, Mac, Windows) ==

Linux, Mac OS, other Unix-like operating systems and Microsoft Windows come with a built-in SSH client, most likely provided by the OpenSSH project.

For login use one of the following ssh commands:

 ssh <username>@uc3.scc.kit.edu
 ssh -l <username> uc3.scc.kit.edu



== Login with graphical SSH client (Windows) ==

For Windows we suggest using [[Data_Transfer/Graphical_Clients#MobaXterm|MobaXterm]] for login and file transfer.

Start ''MobaXterm'', fill in the following fields:
<pre>
Remote name : uc3.scc.kit.edu # or bwunicluster.scc.kit.edu
Specify user name : <username>
Port : 22
</pre>

After that click on 'ok'. Then a terminal will be opened and there you can enter your credentials.

'''Note:''' When using File transfer with MobaXterm version 23.6 the following configuration change has to be made:
In the settings, in the tab "SSH", change the option "SSH engine" from "<new>" to "<legacy>". Then restart MobaXterm.

== Login with Jupyterhub ==

Login takes place at:
* bwUniCluster 3.0: [https://uc3-jupyter.scc.kit.edu uc3-jupyter.scc.kit.edu]
* SDIL: [https://sdil-jupyter.scc.kit.edu sdil-jupyter.scc.kit.edu]

More information can be found [[BwUniCluster3.0/Jupyter#Login_process|here]].

== Login Example ==

To log in to bwUniCluster 3.0, you must provide your [[Registration/Password|service password]].
Proceed as follows:
# Connect to a login node via SSH.
# The system will ask for a one-time password <code>Your OTP:</code>. Please enter your OTP and confirm it with Enter/Return. If you do not have a second factor yet, please create one (see [[Registration/2FA]]).
# The system will ask you for your service password <code>Password:</code>. Please enter it and confirm it with Enter/Return. If you do not have a service password yet or have forgotten it, please create one (see [[Registration/Password]]).
# You will be greeted by the cluster, followed by a shell.

<pre>
[user@client ~]$ ssh ka_ab1234@uc3.scc.kit.edu
(ka_ab1234@uc3.scc.kit.edu) Your OTP: cccccctlljdbrjdleujigivvfnkjbucudugjjlutfbrk
(ka_ab1234@uc3.scc.kit.edu) Password:
********************************************************************************
* *
* Karlsruher Institut für Technologie (KIT) *
* *
* Scientific Computing Center (SCC) *
* *
* _ _ _____ ____ *
* | | | | / ____| |___ \ *
* | | | | | | __) | *
* | | | | | | |__ < *
* | |__| | | |____ ___) | *
* \____/ \_____| |____/ *
* *
* *
* (KITE 2.0, RHEL 9.4, Lustre 2.14.0_ddn154) *
* *
* *
********************************************************************************
Last login: Wed Feb 26 11:08:20 2025 from 2a00:1398:4:181c:2be1:437b:1c36:1337

[ka_ab1234@uc3n990 ~]$
</pre>

== Troubleshooting ==

See [[BwUniCluster3.0/FAQ#Login|bwUniCluster FAQ]].

= Allowed Activities on Login Nodes =

{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#ffa500; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#ffa500; text-align:left"|
To guarantee usability for all users of the cluster, you must not run your compute jobs on the login nodes.
Compute jobs must be submitted to the queuing system. 
'''Any compute job running on the login nodes will be terminated without any notice.''' 
Any long-running compilation or any long-running pre- or post-processing of batch jobs must also be submitted to the queuing system.
|}

The login nodes of the bwHPC clusters are the access point to the compute system, your <code>$HOME</code> directory and your workspaces.
These nodes are shared by all users; therefore, your activities on the login nodes are limited primarily to setting up your batch jobs.
Your activities may also include:
* '''short''' compilation of your program code and
* '''lightweight''' pre- and post-processing of your batch jobs.

We advise users to use [[BwUniCluster3.0/Batch_Queues#Interactive_Jobs|interactive jobs]] for compute and memory intensive tasks like compiling.

= Related Information =

* If you want to reset your service password, consult the [[Registration/Password|Password Guide]].
* If you want to register a new token for the two factor authentication (2FA), consult the [[Registration/2FA|2FA Guide]].
* If you want to de-register, consult the [[Registration/Deregistration|De-registration Guide]].
* If you need an SSH key for your workflow, read [[Registration/SSH|Registering SSH Keys with your Cluster]].
* Configuring your shell: [[.bashrc Do's and Don'ts]]

BwUniCluster3.0/Login

2025-04-08T08:54:20Z

P Schuhmacher: /* Hostnames */

{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
Access to bwUniCluster 3.0 is '''limited to IP addresses from the BelWü network'''.
All home institutions of our current users are connected to BelWü, so if you are on your campus network (e.g. in your office or on the Campus WiFi) you should be able to connect to bwUniCluster 3.0 without restrictions.
If you are outside one of the BelWü networks (e.g. at home), a VPN connection to the home institution or a connection to an SSH jump host at the home institution must be established first.
|}
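
For the jump host case, OpenSSH offers the <code>-J</code> option. The following sketch only prints the resulting command line; <code>jump_command</code> is a hypothetical helper and <code>jumphost.example.org</code> is a placeholder, so ask your home institution for the real host name.

```shell
# Hypothetical helper that only prints the command, it does not connect.
# $1: account on the jump host (user@host), $2: cluster username
jump_command() {
    printf 'ssh -J %s %s@uc3.scc.kit.edu\n' "$1" "$2"
}

# Placeholder jump host; replace with the one of your home institution.
jump_command ab1234@jumphost.example.org ka_ab1234
```

Running the printed command first authenticates against the jump host and then forwards the connection to the cluster login node.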

The login nodes of the bwHPC clusters are the access point to the compute system, your <code>$HOME</code> directory and your workspaces.
All users must log in through these nodes to submit jobs to the cluster.

'''Prerequisites for successful login:'''

You need to have
# Completed the 3-step [[registration|'''registration''']] procedure.
# Set a [[Registration/Password|'''service password''']] for bwUniCluster 3.0.
# Set up a [[Registration/2FA|'''second factor''']] for the time-based one-time password (TOTP).

= Login to the bwUniCluster =

Login to the bwUniCluster 3.0 is only possible with a Secure Shell (SSH) client for which you must know your username on the cluster and the hostname of the login nodes.
For more general information on SSH clients, visit the [[BwUniCluster3.0/Login/Client|SSH Clients Guide]].

== Username ==

If you want to use the bwUniCluster 3.0 you need to add a prefix to your local username.

For prefixes please refer to the [[Registration/Login/Username#Prefix_for_Universities|prefix table]].

Examples: 
* If your local username for the University is <code>ab123</code> and you are a user from the University of Freiburg this would combine to: <code>fr_ab123</code>.
* If your KIT username is <code>ab1234</code> and you are a user from KIT this would combine to: <code>ka_ab1234</code>.
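
The two examples above can be reproduced with a small shell function. This is only an illustration; <code>cluster_username</code> is a hypothetical helper and the prefixes must be taken from the prefix table.

```shell
# Hypothetical helper: join the institution prefix and the local username.
# $1: prefix from the prefix table (e.g. fr, ka), $2: local username
cluster_username() {
    printf '%s_%s\n' "$1" "$2"
}

cluster_username fr ab123    # University of Freiburg example
cluster_username ka ab1234   # KIT example
```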

== Hostnames ==

The system has two login nodes.
The selection of the login node is done automatically.
If you are logging in multiple times, different sessions might run on different login nodes.

Login to bwUniCluster 3.0:

{| class="wikitable"
! Hostname !! Node type

|-
| '''uc3.scc.kit.edu''' || login to one of the two login nodes
|-
|}

In general, you should use automatic selection to allow us to balance the load over the two login nodes.
If you need to connect to specific login nodes, you can use the following hostnames:

{| class="wikitable"
! Hostname !! Node type
|-
| '''uc3-login1.scc.kit.edu''' || bwUniCluster 3.0 first login node
|-
| '''uc3-login2.scc.kit.edu''' || bwUniCluster 3.0 second login node
|-
|}

== Host Keys ==

When you log in, you may receive the message <code>The authenticity of host '<host address>' can't be established.</code> along with the host key fingerprint. This lets you verify the authenticity of the host you are connecting to. Before you continue, verify that this fingerprint matches one of the following:

{| class="wikitable"
! Algorithm !! Fingerprint (SHA256)
|-
| '''RSA''' || SHA256:RaE0/tqQMMBmJuDCIo3WZ38YJsz0godVyt6aUOk/E0M
|-
| '''ECDSA''' || SHA256:LjBYL/x86ZAlL0JdlXrCmPYXvS3DaSiMuvycojBMdwQ
|-
| '''ED25519''' || SHA256:5mZYEpKigwK5ibBMHRrh3WIkOtCqomJW6H7OMbPk3ec
|-
|}

== Login with SSH command (Linux, Mac, Windows) ==

Linux, Mac OS, other Unix-like operating systems and Microsoft Windows come with a built-in SSH client, most likely provided by the OpenSSH project.

For login use one of the following ssh commands:

ssh <username>@uc3.scc.kit.edu
ssh <username>@bwunicluster.scc.kit.edu



== Login with graphical SSH client (Windows) ==

For Windows we suggest using [[Data_Transfer/Graphical_Clients#MobaXterm|MobaXterm]] for login and file transfer.

Start ''MobaXterm'', fill in the following fields:
<pre>
Remote name : uc3.scc.kit.edu # or bwunicluster.scc.kit.edu
Specify user name : <username>
Port : 22
</pre>

After that, click 'OK'. A terminal opens in which you can enter your credentials.

'''Note:''' When using file transfer with MobaXterm version 23.6, the following configuration change is required:
In the settings, in the tab "SSH", change the option "SSH engine" from "<new>" to "<legacy>". Then restart MobaXterm.

== Login with Jupyterhub ==

Login takes place at:
* bwUniCluster 3.0: [https://uc3-jupyter.scc.kit.edu uc3-jupyter.scc.kit.edu]
* SDIL: [https://sdil-jupyter.scc.kit.edu sdil-jupyter.scc.kit.edu]

More information can be found [[BwUniCluster3.0/Jupyter#Login_process|here]].

== Login Example ==

To log in to bwUniCluster 3.0, you must provide your [[Registration/Password|service password]].
Proceed as follows:
# Connect to a login node via SSH.
# The system will ask for a one-time password <code>Your OTP:</code>. Please enter your OTP and confirm it with Enter/Return. If you do not have a second factor yet, please create one (see [[Registration/2FA]]).
# The system will ask you for your service password <code>Password:</code>. Please enter it and confirm it with Enter/Return. If you do not have a service password yet or have forgotten it, please create one (see [[Registration/Password]]).
# You will be greeted by the cluster, followed by a shell.

<pre>
[user@client ~]$ ssh ka_ab1234@uc3.scc.kit.edu
(ka_ab1234@uc3.scc.kit.edu) Your OTP: cccccctlljdbrjdleujigivvfnkjbucudugjjlutfbrk
(ka_ab1234@uc3.scc.kit.edu) Password:
********************************************************************************
* *
* Karlsruher Institut für Technologie (KIT) *
* *
* Scientific Computing Center (SCC) *
* *
* _ _ _____ ____ *
* | | | | / ____| |___ \ *
* | | | | | | __) | *
* | | | | | | |__ < *
* | |__| | | |____ ___) | *
* \____/ \_____| |____/ *
* *
* *
* (KITE 2.0, RHEL 9.4, Lustre 2.14.0_ddn154) *
* *
* *
********************************************************************************
Last login: Wed Feb 26 11:08:20 2025 from 2a00:1398:4:181c:2be1:437b:1c36:1337

[ka_ab1234@uc3n990 ~]$
</pre>

== Troubleshooting ==

See [[BwUniCluster3.0/FAQ#Login|bwUniCluster FAQ]].

= Allowed Activities on Login Nodes =

{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#ffa500; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#ffa500; text-align:left"|
To guarantee usability for all users of the cluster, you must not run your compute jobs on the login nodes.
Compute jobs must be submitted to the queuing system. 
'''Any compute job running on the login nodes will be terminated without any notice.''' 
Any long-running compilation or any long-running pre- or post-processing of batch jobs must also be submitted to the queuing system.
|}

The login nodes of the bwHPC clusters are the access point to the compute system, your <code>$HOME</code> directory and your workspaces.
These nodes are shared by all users; therefore, your activities on the login nodes are limited primarily to setting up your batch jobs.
Your activities may also include:
* '''short''' compilation of your program code and
* '''lightweight''' pre- and post-processing of your batch jobs.

We advise users to use [[BwUniCluster3.0/Batch_Queues#Interactive_Jobs|interactive jobs]] for compute and memory intensive tasks like compiling.

= Related Information =

* If you want to reset your service password, consult the [[Registration/Password|Password Guide]].
* If you want to register a new token for the two factor authentication (2FA), consult the [[Registration/2FA|2FA Guide]].
* If you want to de-register, consult the [[Registration/Deregistration|De-registration Guide]].
* If you need an SSH key for your workflow, read [[Registration/SSH|Registering SSH Keys with your Cluster]].
* Configuring your shell: [[.bashrc Do's and Don'ts]]

BwUniCluster3.0/Running Jobs

2025-04-07T06:11:39Z

P Schuhmacher: /* Development Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. 
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Any job must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any calculation on the compute nodes of bwUniCluster 3.0 requires the user to define it as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place to run heavy computations right away without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

The computing time is provided in accordance with the '''fair share policy'''. The individual investment shares of the respective university and the resources already used by its members are taken into account. In addition, the following throttling policy is active: the '''maximum number of cores''' used at any given time by running jobs is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.
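
The node counts follow directly from the core limit and the cores per node given in this page (2 x 32 = 64 on Ice Lake nodes, 2 x 48 = 96 on standard nodes). A small sketch of the arithmetic, with <code>nodes_for_limit</code> as a hypothetical helper name:

```shell
# Per-user limit on concurrently used cores, as stated in the policy.
max_cores=1920

# Hypothetical helper: number of full nodes that fit into the limit.
# $1: physical cores per node
nodes_for_limit() {
    echo $(( max_cores / $1 ))
}

echo "cpu_il (64 cores per node): $(nodes_for_limit 64) nodes"
echo "cpu    (96 cores per node): $(nodes_for_limit 96) nodes"
```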

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the number of tasks and the memory used if they are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
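
To pick a queue quickly, the output can be filtered for partitions that currently have idle nodes. This is a sketch that assumes the line format shown above (<code>Partition <name> : <n> nodes idle</code>); <code>idle_partitions</code> is a hypothetical helper, not part of the SCC script.

```shell
# Hypothetical filter: reads sinfo_t_idle-style text on stdin and prints
# "<partition> <idle nodes>" for every partition with at least one idle node.
idle_partitions() {
    awk '$1 == "Partition" && ($4 + 0) > 0 { print $2, $4 }'
}

# Sample input taken from the output above.
printf '%s\n' \
    'Partition dev_cpu : 2 nodes idle' \
    'Partition cpu : 68 nodes idle' \
    'Partition cpu_il : 0 nodes idle' | idle_partitions
```

On a login node the filter would be used as <code>sinfo_t_idle | idle_partitions</code>.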
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' then queues the batch job. When the batch job starts depends on the availability of the requested resources and on the fair sharing value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be used on the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
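
As an illustration of the script column, the following sketch writes a minimal job script to a temporary file. The job name, output file name and the command inside the script are arbitrary choices for this example; only the queue and the time are required on bwUniCluster 3.0.

```shell
# Write a minimal job script to a temporary file (file name is arbitrary).
job_script="$(mktemp)"
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --time=00:20:00
#SBATCH --ntasks=1
#SBATCH --job-name=hello
#SBATCH --output=hello-%j.out

# Payload of the job: replace with your own commands.
echo "Running on $(hostname)"
EOF

# On a login node the script would then be submitted with:
#   sbatch "$job_script"
# Here we only list the directives that were written.
grep '^#SBATCH' "$job_script"
```

In <code>--output</code>, Slurm replaces <code>%j</code> with the job ID, so every run writes to its own file.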
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job by running the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run for less than 2 hours in this example; otherwise the system will kill the job during runtime.
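
A bare number passed to <code>-t</code> is interpreted by Slurm as minutes, so <code>-t 120</code> and <code>-t 02:00:00</code> request the same limit. The conversion can be sketched as follows; <code>minutes_to_hms</code> is a hypothetical helper for illustration only.

```shell
# Hypothetical helper: convert whole minutes to Slurm's HH:MM:SS notation.
# $1: time limit in minutes
minutes_to_hms() {
    printf '%02d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

minutes_to_hms 120   # limit used in the salloc example above
minutes_to_hms 30    # maximum time of the development queues
```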

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. You can start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.

An interactive parallel application running on one or more compute nodes usually needs a certain amount of memory per node (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes with 96 cores each can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI programs, simply type mpirun <program_name>. Your program will then run on 480 cores. A very simple example for starting a parallel job is:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
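
The default output can be filtered with standard tools, for example to list only pending jobs and the reason they are waiting. This sketch assumes the default column layout shown above and job names without spaces; <code>pending_jobs</code> is a hypothetical helper.

```shell
# Hypothetical filter over default squeue output with the columns
# JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON).
# Prints "<jobid> <reason>" for every job in state PD (pending).
pending_jobs() {
    awk '$5 == "PD" { print $1, $8 }'
}

# Sample input taken from the squeue example above.
printf '%s\n' \
    'JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)' \
    '1262 cpu wrap ka_ab123 R 8:15 1 uc3n002' \
    '1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)' | pending_jobs
```

On a login node the filter would be used as <code>squeue | pending_jobs</code>.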

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
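
Because the output consists of <code>key=value</code> pairs, single fields are easy to extract in scripts. The following sketch assumes values without spaces; <code>job_field</code> is a hypothetical helper, not a Slurm command.

```shell
# Hypothetical helper: reads "scontrol show job" text on stdin and prints
# the value of the first key=value pair whose key equals $1.
job_field() {
    tr ' ' '\n' | awk -F= -v key="$1" \
        '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Sample lines taken from the output above.
sample='JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

printf '%s\n' "$sample" | job_field JobState
printf '%s\n' "$sample" | job_field TimeLimit
```

On a login node the helper would be used as <code>scontrol show job <jobid> | job_field JobState</code>.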
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example ==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-07T06:10:23Z

P Schuhmacher: /* Regular Queues */

= Purpose and function of a queuing system =

All compute activities on bwUniCluster 3.0 must be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, the job either runs automatically from a batch script or gives you interactive access to the nodes.
For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. All jobs must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm, a cluster management and job scheduling system. Slurm has three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any calculation on the compute nodes of bwUniCluster 3.0 requires the user to define it as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e., the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place to run heavy computations right away without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''. 
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the <code>sbatch</code> command.
For interactive jobs, the resources are requested with the <code>salloc</code> command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now allocated to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse these queues for regular, short-running jobs or chain jobs! Only one running job at a time is allowed, and the maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

The computing time is provided in accordance with the '''fair share policy'''. The individual investment shares of the respective university and the resources already used by its members are taken into account. In addition, the following throttling policy is active: the '''maximum number of cores''' in use at any given time is '''1920 per user''' (aggregated over all running jobs). This number corresponds to 30 nodes on the Ice Lake partition or 20 nodes on the standard partition. The aim is to minimize waiting times and maximize the number of users who can access computing time at the same time.
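
The node counts quoted above follow from dividing the per-user core limit by the cores per node. A minimal sketch of that arithmetic, assuming the cores per node equal the <code>ntasks-per-node</code> maxima listed in Table 1 (the variable names are illustrative only):

```shell
# Per-user core limit quoted in the policy above.
max_cores=1920

# Cores per node, taken from the ntasks-per-node maxima in Table 1.
cores_il=64     # cpu_il (Ice Lake)
cores_std=96    # cpu (standard)

# Number of full nodes a single user can occupy at once.
nodes_il=$((max_cores / cores_il))
nodes_std=$((max_cores / cores_std))

echo "cpu_il: $nodes_il full nodes"    # cpu_il: 30 full nodes
echo "cpu: $nodes_std full nodes"      # cpu: 20 full nodes
```

Because the limit is aggregated over all running jobs, several smaller jobs count against the same 1920 cores.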

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
| mem=380001mb
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the number of tasks and the memory that are used when these are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster3.0/Running_Jobs/Slurm|here]].
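
As a rough illustration of how a default applies, the following sketch estimates the memory a job receives when neither ''--mem'' nor ''--mem-per-cpu'' is set. It assumes that the queue's default mem-per-cpu is simply multiplied by the number of allocated CPUs; the function names are hypothetical helpers, not Slurm commands, and the values are copied from Tables 1 and 2 for the CPU queues only.

```shell
# Default mem-per-cpu in MB, copied from Tables 1 and 2 (CPU queues only).
default_mem_per_cpu() {
    case "$1" in
        cpu_il|dev_cpu_il) echo 1950 ;;
        cpu|dev_cpu|highmem|dev_highmem) echo 1125 ;;
        *) echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# Estimated memory in MB for a job that sets no memory option.
# Assumption: memory = default mem-per-cpu x number of allocated CPUs.
estimate_default_mem() {
    per_cpu=$(default_mem_per_cpu "$1") || return 1
    echo $(( per_cpu * $2 ))
}

estimate_default_mem cpu 96      # 108000
estimate_default_mem cpu_il 64   # 124800
```

Note that the number of allocated CPUs can be larger than the number of tasks, since the queues list threads-per-core=2; <code>scontrol show job</code> reports the actual allocation under <code>AllocTRES</code>.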

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
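
To pick a queue quickly, the output can be filtered for partitions that currently have idle nodes. The sketch below assumes the line format shown above ("Partition <name> : <n> nodes idle"); <code>idle_partitions</code> is a hypothetical helper, and the sample text stands in for the real command so the example also runs off the cluster.

```shell
# Print "<partition> <idle nodes>" for every partition with idle nodes.
# Expects lines of the form: Partition <name> : <n> nodes idle
idle_partitions() {
    awk '$1 == "Partition" && $4 + 0 > 0 { print $2, $4 }'
}

# Sample lines taken from the output above.
# On the cluster, use instead: sinfo_t_idle | idle_partitions
sample='Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition cpu_il : 0 nodes idle'

printf '%s\n' "$sample" | idle_partitions
```

For the sample this lists dev_cpu, cpu, highmem and gpu_h100 and skips the partitions with 0 idle nodes.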
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job actually starts, however, depends on the availability of the requested resources and on the fair share value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [[BwUniCluster3.0/Slurm | here]]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is charged to after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
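
The options in the table can be combined in a job script as sketched below. This is a minimal example, not a recommended template: the queue, time limit, job name and file names are illustrative values only, and <code>%j</code> in the output file name is the sbatch placeholder for the job ID.

```shell
# Write a minimal job script; all values are illustrative.
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=dev_cpu
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=hello
#SBATCH --output=hello-%j.out

echo "Running on $(hostname)"
EOF

# Submit only where Slurm is installed; elsewhere just show the script.
if command -v sbatch >/dev/null 2>&1; then
    sbatch example_job.sh
else
    cat example_job.sh
fi
```

Options given on the sbatch command line take precedence over the <code>#SBATCH</code> lines in the script, so the same script can be reused with, for example, a different <code>--time</code>.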
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.
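
A bare number given to <code>-t</code> is interpreted by Slurm as minutes, so <code>-t 120</code> above means 2 hours. The following sketch converts such a minute value into the equivalent hours:minutes:seconds form; <code>minutes_to_hms</code> is a hypothetical helper for illustration, not part of Slurm.

```shell
# Convert a time limit in minutes to HH:MM:SS.
minutes_to_hms() {
    printf '%02d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

minutes_to_hms 120   # 02:00:00, the limit used in the salloc example
minutes_to_hms 30    # 00:30:00, the limit of the development queues
```

Both forms are accepted by <code>--time</code>, so <code>-t 120</code> and <code>-t 02:00:00</code> request the same limit.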

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are automatically revoked.

An interactive parallel application can run on one or on many compute nodes (here, e.g., 5 nodes with 96 cores each) and usually needs a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
squeue displays information about your own active, pending and/or recently completed jobs. The command is explained in detail at https://slurm.schedmd.com/squeue.html or in the man page (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
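
For a quick overview, the default squeue output can be summarized per job state. The sketch below assumes the default column layout shown above, with the state in the fifth column, and job names without spaces; <code>count_states</code> is a hypothetical helper, and the sample text stands in for the real command.

```shell
# Count jobs per state (5th column of the default squeue output).
count_states() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample taken from the output above.
# On the cluster, use instead: squeue | count_states
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084'

printf '%s\n' "$sample" | count_states
```

For the sample this prints one pending job (PD 1) and two running jobs (R 2).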

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, inspect my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
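
Since the output consists of <code>Key=value</code> pairs, single fields are easy to extract in scripts. The sketch below assumes values without embedded spaces, as in the example above; <code>job_field</code> is a hypothetical helper, and the sample text stands in for the real command.

```shell
# Print the value of one Key=value field from "scontrol show job" output.
job_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# A few lines taken from the output above.
# On the cluster, use instead: scontrol show job <jobid> | job_field JobState
sample='JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

printf '%s\n' "$sample" | job_field JobState    # RUNNING
printf '%s\n' "$sample" | job_field TimeLimit   # 00:20:00
```

Fields whose values contain spaces would be cut at the first space by this simple approach.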
 
=== Canceling own jobs : scancel ===
The scancel command cancels jobs. It is explained in detail at https://slurm.schedmd.com/scancel.html or in the man page (man scancel). Its basic usage is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>
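
The second form is often combined with <code>-u</code> to restrict it to your own jobs. The following sketch wraps this in a hypothetical helper, <code>cancel_by_state</code>, which by default only prints the scancel call it would run; the <code>DRY_RUN</code> variable is an assumption of this example and not a Slurm feature.

```shell
# Cancel all of the current user's jobs in a given state, e.g. PENDING.
# By default only the command is printed; set DRY_RUN=0 to really cancel.
cancel_by_state() {
    state=$1
    user=${USER:-$(id -un)}
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "scancel -t $state -u $user"
    else
        scancel -t "$state" -u "$user"
    fi
}

cancel_by_state PENDING
```

Printing the command first gives a chance to check the state name before any job is removed from the queue.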

= Slurm Options =
[[BwUniCluster3.0/Running_Jobs/Slurm | Detailed Slurm usage]]

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:25:24Z

P Schuhmacher: /* Purpose and function of a queuing system */

= Purpose and function of a queuing system =

For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. All jobs must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm, a cluster management and job scheduling system. Slurm has three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any calculation on the compute nodes of bwUniCluster 3.0 requires the user to define it as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e., the '''batch job''', to the resource and workload manager.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place to run heavy computations right away without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now allocated to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse these queues for regular, short-running jobs or chain jobs! Only one running job at a time is allowed, and the maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the time, the number of tasks and the memory that are used when these are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job actually starts, however, depends on the availability of the requested resources and on the fair share value.
 
 

* The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>

'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The project group a job is accounted to is shown after "Account=" in the output of "scontrol show job".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
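As an illustration of how the options in the table combine, a minimal job script might look like the sketch below. The job name, the output file name and all resource values are placeholders chosen for this example, not site recommendations; <code>%j</code> is the Slurm file name pattern for the job ID. The script only prints a status line, so it can also be run by hand outside of Slurm to check it.

```bash
#!/bin/bash
# Queue and wall clock limit (hh:mm:ss)
#SBATCH --partition=dev_cpu
#SBATCH --time=00:10:00
# One task on one node
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Hypothetical job name and output file; %j is replaced by the job ID
#SBATCH --job-name=example_job
#SBATCH --output=example_job-%j.out

# Slurm sets these variables inside a job. The fallbacks only apply
# when the script is run by hand outside of Slurm.
job_id="${SLURM_JOB_ID:-none}"
ntasks="${SLURM_NTASKS:-1}"

echo "Job ${job_id} runs ${ntasks} task(s) on $(uname -n)"

# Replace the echo above with the actual program call.
```

Saved as, for example, example_job.sh, the script would be submitted with "sbatch example_job.sh". Options given on the command line take precedence over the #SBATCH lines in the script.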
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with small memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that request more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish in less than 2 hours in this example; otherwise the system will kill the job while it is running.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are automatically revoked.
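The examples on this page give the time limit both as plain minutes (<code>-t 120</code>) and as <code>hh:mm:ss</code> (<code>-t 01:00:00</code>). The Slurm documentation lists the accepted formats as "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". The helper below is a small sketch, not part of Slurm, that converts such a value to seconds so that a limit can be checked before submitting; the function name is made up for this example.

```bash
# Convert a Slurm time specification to seconds.
# Hypothetical helper for illustration; relies on awk being available.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '
    {
        days = 0; h = 0; m = 0; s = 0; spec = $0
        if (index(spec, "-") > 0) {
            # days-hours[:minutes[:seconds]]
            split(spec, dp, "-"); days = dp[1]; spec = dp[2]
            n = split(spec, f, ":")
            h = f[1]
            if (n >= 2) m = f[2]
            if (n >= 3) s = f[3]
        } else {
            n = split(spec, f, ":")
            if (n == 3)      { h = f[1]; m = f[2]; s = f[3] }  # hours:minutes:seconds
            else if (n == 2) { m = f[1]; s = f[2] }            # minutes:seconds
            else             { m = f[1] }                      # minutes
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 120        # prints 7200 (the 2 hours of "-t 120")
slurm_time_to_seconds 01:00:00   # prints 3600
slurm_time_to_seconds 1-12       # prints 129600 (1 day and 12 hours)
```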

An interactive parallel application running on one or on many compute nodes (here, for example, 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores (5 nodes with 96 tasks each) with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI programs, simply type mpirun <program_name>. Your program will then run on 480 cores. A very simple example for starting a parallel job is:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the man page (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
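For scripting it can be handy to reduce this listing to the running jobs and the nodes they occupy. The following sketch does this by filtering the default ''squeue'' output with awk. It assumes the column layout shown above (state in column 5, node list in column 8) and job names without spaces; the function name is made up for this example.

```bash
# Print "<jobid> <nodelist>" for every running job in default squeue output.
# Hypothetical helper: assumes ST is column 5 and NODELIST is column 8.
running_jobs() {
    awk 'NR > 1 && $5 == "R" { print $1, $8 }'
}

# Demonstration with the sample output from this page:
running_jobs <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
EOF
# prints:
# 1262 uc3n002
# 1265 uc3n084
```

On the cluster the function would be used as "squeue | running_jobs". A similar result should be obtainable directly from squeue with its --states and --format options, see the man page.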

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:24:38Z

P Schuhmacher: /* Batch Jobs: sbatch */

BwUniCluster3.0/Running Jobs

2025-04-04T15:21:05Z

P Schuhmacher: /* Monitor and manage jobs */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Any job submission by a user must therefore be carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place to run heavy computations right away without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the time, number of tasks and memory that apply if they are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
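To pick a queue automatically, the listing can be filtered for partitions that currently have idle nodes. The sketch below assumes that sinfo_t_idle prints exactly the line format shown above ("Partition <name> : <count> nodes idle"); the function name is made up for this example.

```bash
# Print the names of all partitions that report at least one idle node.
# Hypothetical helper: assumes the sinfo_t_idle line format shown above.
idle_partitions() {
    awk '$1 == "Partition" && $4 > 0 { print $2 }'
}

# Demonstration with part of the sample output from this page:
idle_partitions <<'EOF'
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition cpu_il : 0 nodes idle
EOF
# prints:
# dev_cpu
# cpu
# gpu_mi300
```

On the cluster the function would be used as "sinfo_t_idle | idle_partitions".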
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. Its main purpose is to specify the resources that are needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair-share value.
 
 

To run your batch job on one of the cpu nodes, please use:

<pre>
$ sbatch --partition=dev_cpu <job_script>
or
$ sbatch -p dev_cpu <job_script>
</pre>
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The project group a job is accounted to is shown after "Account=" in the output of "scontrol show job".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with small memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that request more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish in less than 2 hours in this example; otherwise the system will kill the job while it is running.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are automatically revoked.

An interactive parallel application running on one or on many compute nodes (here, for example, 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores (5 nodes with 96 tasks each) with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI programs, simply type mpirun <program_name>. Your program will then run on 480 cores. A very simple example for starting a parallel job is:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).
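The 5-node request from the interactive example can also be written as a batch script. The following is a sketch under the assumption that the same option values are appropriate for a batch run; the program name in the last comment is a placeholder. The script only computes and prints the resulting number of MPI tasks, which makes the arithmetic behind the 480 cores explicit.

```bash
#!/bin/bash
# Batch counterpart of the interactive 5-node request (sketch).
#SBATCH --partition=cpu
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=96
#SBATCH --time=01:00:00
#SBATCH --mem=50gb

# Inside a job Slurm exports these variables. The fallbacks mirror the
# directives above so the arithmetic can be checked outside of Slurm.
nodes="${SLURM_JOB_NUM_NODES:-5}"
tasks_per_node="${SLURM_NTASKS_PER_NODE:-96}"
total_tasks=$(( nodes * tasks_per_node ))

echo "Requested ${nodes} nodes x ${tasks_per_node} tasks = ${total_tasks} MPI tasks"

# Placeholder for the real launch line (with Intel MPI: mpiexec.hydra).
# mpirun ./my_mpi_program
```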

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the man page (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
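The output of scontrol show job consists of whitespace-separated key=value pairs, so single fields can be picked out with standard tools. The sketch below shows one way to do this; the function name is made up for this example, and the approach assumes that the value of interest contains no spaces (which holds for fields such as JobState or TimeLimit, but not necessarily for paths or commands).

```bash
# Print the value of one key from "scontrol show job" output read on stdin.
# Hypothetical helper: assumes values without embedded whitespace.
job_field() {
    # Put every key=value pair on its own line, then match the key exactly.
    tr -s ' \t' '\n\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2) }'
}

# Demonstration with a few lines of the sample output from this page:
job_field TimeLimit <<'EOF'
JobId=1262 JobName=wrap
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
EOF
# prints: 00:20:00
```

On the cluster the function would be used as "scontrol show job <jobid> | job_field JobState".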
 
=== Canceling own jobs : scancel ===
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel). The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:18:39Z

P Schuhmacher: /* Slurm Commands (excerpt) */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Any job submission by a user must therefore be carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place to run heavy computations right away without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the time, number of tasks and memory that are used if they are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].
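As a rough illustration of how the defaults scale, the sketch below multiplies the default mem-per-cpu values from Tables 1 and 2 by a CPU count. The function names are hypothetical helpers, not cluster tools, and the sketch assumes that the default memory is simply mem-per-cpu times the number of allocated CPUs. With threads-per-core=2 the scheduler may count hardware threads as CPUs, so the real allocation can differ.

```shell
#!/usr/bin/env bash
# Default mem-per-cpu values (in MB) copied from Tables 1 and 2.
default_mem_per_cpu_mb() {
    case "$1" in
        cpu_il|dev_cpu_il) echo 1950 ;;
        cpu|highmem|gpu_h100|gpu_mi300|dev_cpu|dev_highmem|dev_gpu_h100) echo 1125 ;;
        *) echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: default memory (MB) for a job with a given CPU count.
default_job_mem_mb() {
    local queue=$1 ncpus=$2 per_cpu
    per_cpu=$(default_mem_per_cpu_mb "$queue") || return 1
    echo $(( per_cpu * ncpus ))
}

default_job_mem_mb cpu 96      # 1125 * 96
default_job_mem_mb cpu_il 64   # 1950 * 64
```

If a job needs more than this, the memory has to be requested explicitly within the maximum resources of the queue.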

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many nodes are idle and available for immediate use on the system. Users can use this information to submit jobs that fit into the free resources and thus obtain quick job turnaround times.
 
 

* The following command displays, for each partition, how many nodes are available for immediate use.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
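The output format shown above is easy to filter. The following sketch lists only the partitions that currently have idle nodes. The function name idle_partitions is a hypothetical helper, and the sketch assumes that the output keeps the "Partition NAME : N nodes idle" layout shown above.

```shell
#!/usr/bin/env bash
# Print "<partition> <idle nodes>" for every partition with at least one idle node.
# Reads sinfo_t_idle-style lines ("Partition NAME : N nodes idle") from stdin.
idle_partitions() {
    awk '$1 == "Partition" && $4 > 0 { print $2, $4 }'
}

# Sample input copied from the output above. On the cluster one would pipe
# the real command instead:  sinfo_t_idle | idle_partitions
idle_partitions <<'EOF'
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition cpu_il : 0 nodes idle
EOF
```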
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. Its main purpose is to specify the resources that are needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
 

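To submit your batch job to a specific queue, for example the development queue dev_cpu, use the following command, where <job_script> stands for the name of your job script:

<pre>
$ sbatch --partition=dev_cpu <job_script>
or
$ sbatch -p dev_cpu <job_script>
</pre>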
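A minimal job script could look like the sketch below. The job name, the output file name and the helper function report_job are hypothetical choices for illustration; the queue and the time are the two settings that a request needs at least. The fallbacks in report_job are only there so that the script can also be run outside of Slurm as a quick check.

```shell
#!/bin/bash
# Queue (see Table 2) and wall time, below the 30-minute development limit.
#SBATCH --partition=dev_cpu
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
# Hypothetical job name; %j in the output file name is replaced by the job ID.
#SBATCH --job-name=hello
#SBATCH --output=hello_%j.out

# Slurm sets SLURM_JOB_ID and SLURM_JOB_NUM_NODES inside a job.
report_job() {
    echo "job=${SLURM_JOB_ID:-unknown} nodes=${SLURM_JOB_NUM_NODES:-1}"
}

report_job
echo "Running on $(uname -n)"
```

Saved under a name such as hello.sh (an assumed file name), the script would be submitted with sbatch hello.sh; the queue does not have to be repeated on the command line because it is set in the script.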
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used on the command line or in your job script. Queue-specific defaults for some of these options are listed [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish within 2 hours in this example; otherwise the system will kill the job during runtime.
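A bare number given to -t is read by Slurm as minutes, so -t 120 in the salloc example means 2 hours. The small sketch below converts minutes into the hours:minutes:seconds notation that Slurm also accepts; minutes_to_hms is a hypothetical helper, not a Slurm tool.

```shell
#!/usr/bin/env bash
# Convert a wall time in minutes to HH:MM:SS.
minutes_to_hms() {
    local minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

minutes_to_hms 120   # the salloc example: -t 120
minutes_to_hms 30    # the limit of the development queues: time=30
```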

You can now also start a graphical X11 terminal that is connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.

An interactive parallel application running on one or on many compute nodes (here, e.g., 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>
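The size of this request follows from simple arithmetic, because --ntasks-per-node and --mem both apply per node. The sketch below works out the totals; allocation_totals is a hypothetical helper used only for illustration.

```shell
#!/usr/bin/env bash
# Total number of tasks and total memory (GB) of a multi-node request.
allocation_totals() {
    local nodes=$1 tasks_per_node=$2 mem_per_node_gb=$3
    echo "tasks=$(( nodes * tasks_per_node )) mem_gb=$(( nodes * mem_per_node_gb ))"
}

# Values from the salloc command above: -N 5 --ntasks-per-node=96 --mem=50gb
allocation_totals 5 96 50
```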

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in its manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
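The default squeue output can be filtered with standard tools. The sketch below prints the job ID and the node list of all running jobs. It assumes the default column layout shown above and job names without spaces; running_jobs is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Print "<jobid> <nodelist>" for every job in state R (running).
# Expects default squeue output on stdin; the header line is skipped.
running_jobs() {
    awk 'NR > 1 && $5 == "R" { print $1, $8 }'
}

# Sample input copied from the example above. On the cluster one would
# pipe the real command instead:  squeue | running_jobs
running_jobs <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
EOF
```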

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
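Single fields can be picked out of this key=value output with standard tools. The sketch below is a hypothetical helper, job_field, that returns the value of one field. It assumes simple field names without special characters (so not, for example, CPUs/Task) and values without spaces.

```shell
#!/usr/bin/env bash
# Print the value of one field from "scontrol show job" output read on stdin.
# $1 = field name, e.g. JobState or TimeLimit
job_field() {
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# A few lines copied from the example output above.
sample='JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

printf '%s\n' "$sample" | job_field JobState
printf '%s\n' "$sample" | job_field TimeLimit
```

On the cluster the same helper could be fed directly, for example with scontrol show job 1262 | job_field JobState.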
 

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:18:17Z

P Schuhmacher: /* Check available resources */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Therefore every job must be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* First, it allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* Third, it arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators
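The core counts of the 2-socket nodes follow directly from the numbers in this list. The sketch below is only a worked calculation; cores_per_node is a hypothetical helper. The results match the ntasks-per-node maxima of the cpu_il and cpu queues in Table 1.

```shell
#!/usr/bin/env bash
# Physical cores of a node = sockets * cores per socket.
cores_per_node() {
    local sockets=$1 cores_per_socket=$2
    echo $(( sockets * cores_per_socket ))
}

cores_per_node 2 32   # Intel Ice Lake nodes
cores_per_node 2 48   # AMD nodes
```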

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without having to wait. They are the right place for short bursts of heavy computation that would disturb other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This script is queued and executed as soon as the requested compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development queue or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the time, number of tasks and memory that are used if they are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].

== Check available resources: sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many nodes are idle and available for immediate use on the system. Users can use this information to submit jobs that fit into the free resources and thus obtain quick job turnaround times.
 
 

* The following command displays, for each partition, how many nodes are available for immediate use.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Check available resources: sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. Its main purpose is to specify the resources that are needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
 

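To submit your batch job to a specific queue, for example the development queue dev_cpu, use the following command, where <job_script> stands for the name of your job script:

<pre>
$ sbatch --partition=dev_cpu <job_script>
or
$ sbatch -p dev_cpu <job_script>
</pre>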
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used on the command line or in your job script. Queue-specific defaults for some of these options are listed [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
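Options in the "Script" column are read by sbatch from the #SBATCH lines of the job script. To review what a script requests before submitting it, these lines can be listed with standard tools. The sketch below assumes a hypothetical job script written to a temporary file; list_sbatch_directives is a helper name chosen for illustration.

```shell
#!/usr/bin/env bash
# Print the options given in "#SBATCH" lines of a job script, one per line.
list_sbatch_directives() {
    sed -n 's/^#SBATCH[[:space:]]\{1,\}//p' "$1"
}

# Hypothetical job script, written to a temporary file for illustration.
job_script=$(mktemp)
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --time=02:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=96
echo "payload goes here"
EOF

list_sbatch_directives "$job_script"
rm -f "$job_script"
```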
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish within 2 hours in this example; otherwise the system will kill the job during runtime.

You can now also start a graphical X11 terminal that is connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.

An interactive parallel application running on one or on many compute nodes (here, e.g., 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
squeue displays information about your own active, pending and/or recently completed jobs. The command is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
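When only one or two values from this output are needed, a small filter can help. The following is a sketch only: the helper name <code>job_field</code> is hypothetical, and it assumes that every field has the form KEY=VALUE without spaces inside the value, as in the example above.

```shell
# Print the value of one KEY=VALUE field from "scontrol show job" output
# read on standard input. Only the first match is printed.
# Assumption: values contain no whitespace.
job_field() {
  awk -v key="$1" '{
    for (i = 1; i <= NF; i++) {
      n = index($i, "=")
      if (n > 0 && substr($i, 1, n - 1) == key) {
        print substr($i, n + 1)
        exit
      }
    }
  }'
}

# Demonstration on three lines taken from the example output
# (on the cluster: scontrol show job <jobid> | job_field JobState).
job_field JobState <<'EOF'
JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
EOF
```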
 

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:17:42Z

P Schuhmacher: /* Slurm Commands (excerpt) */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Any job submission by the user must therefore be carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any calculation on the compute nodes of bwUniCluster 3.0 requires the user to define it as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to computer resources without having to wait. This is the place to realize instantaneous heavy compute without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the time, number of tasks and memory if these are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].
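To illustrate how a requested wall time relates to the limits in the tables above, here is a small sketch that converts a time value to minutes and compares it with a queue limit. The helper names <code>to_minutes</code> and <code>fits_limit</code> are hypothetical; only the two notations used in the tables (plain minutes and HH:MM:SS) are handled, and any other Slurm time format is rejected.

```shell
# Convert a wall time given as plain minutes ("30") or as HH:MM:SS
# ("72:00:00") to whole minutes, rounding leftover seconds up.
# Other notations (e.g. with days) are rejected with exit status 1.
to_minutes() {
  printf '%s\n' "$1" | awk -F: '
    $0 !~ /^[0-9:]+$/ { exit 1 }
    NF == 3 { print $1 * 60 + $2 + ($3 > 0 ? 1 : 0); exit }
    NF == 1 { print $1 + 0; exit }
    { exit 1 }'
}

# Succeed if the requested time ($1) does not exceed the limit in minutes ($2).
fits_limit() {
  minutes=$(to_minutes "$1") || return 2
  [ "$minutes" -le "$2" ]
}

# A 20-minute request fits the 30-minute limit of the development queues.
if fits_limit 00:20:00 30; then
  echo "00:20:00 fits into a development queue (limit 30 minutes)"
fi
```

The 72-hour limit of the regular queues corresponds to 4320 minutes in this representation.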

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#List of your submitted jobs : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when the batch job starts depends on the availability of the requested resources and the fair sharing value.
 
 

To run your batch job on one of the cpu nodes, please use:

<pre>
$ sbatch --partition=dev_cpu
or
$ sbatch -p dev_cpu
</pre>
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
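The options in the table can be combined in a job script. The following is a minimal sketch, not an official template: the file name, job name, output file name and the requested values are illustrative assumptions chosen to stay within the limits of the <code>dev_cpu</code> queue, and the memory options are left out as recommended in the table.

```shell
# Write a small example job script to a temporary location
# (the path and all requested values are assumptions for illustration).
job_script="${TMPDIR:-/tmp}/example_job.sh"

cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --partition=dev_cpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --job-name=example
#SBATCH --output=example-%j.out

# The actual work of the job goes here.
echo "Running on $(hostname)"
EOF

# On a login node the script would be submitted with:
#   sbatch "$job_script"
echo "Job script written to $job_script"
```

Options given on the sbatch command line can be used in place of the corresponding <code>#SBATCH</code> lines, as listed in the first two columns of the table.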
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that require more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For a serial application that runs on a compute node, requires 5000 MByte of memory and is limited to an interactive run of 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

In this example your serial job must finish within 2 hours; otherwise the system kills it at runtime.

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are revoked automatically.

An interactive parallel application that runs on one or more compute nodes (here 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
To access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

To run MPI programs, simply type mpirun <program_name>; the program then runs on all 480 cores. A very simple example of starting a parallel job:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores, for example:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI, you must start <my_mpi_program> with the command mpiexec.hydra (instead of mpirun).
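The core counts used in this section follow from simple arithmetic: 5 nodes with 96 tasks each give 480 cores. The sketch below spells this out and checks whether a subset size passed to <code>mpirun -n</code> fits into the allocation. The helper names <code>total_tasks</code> and <code>valid_subset</code> are hypothetical and only serve as an illustration.

```shell
# Number of tasks in an allocation: nodes ($1) times tasks per node ($2).
total_tasks() {
  echo $(( $1 * $2 ))
}

# Succeed if a subset size ($1) lies between 1 and the allocated tasks ($2).
valid_subset() {
  [ "$1" -ge 1 ] && [ "$1" -le "$2" ]
}

# Values from the salloc example: 5 nodes, 96 tasks per node.
allocated=$(total_tasks 5 96)
if valid_subset 50 "$allocated"; then
  echo "mpirun -n 50 fits into the $allocated allocated cores"
fi
```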

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
squeue displays information about your own active, pending and/or recently completed jobs. The command is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:15:48Z

P Schuhmacher: /* Detailed job information : scontrol show job */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Any job submission by the user must therefore be carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any calculation on the compute nodes of bwUniCluster 3.0 requires the user to define it as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to computer resources without having to wait. This is the place to realize instantaneous heavy compute without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the time, number of tasks and memory if these are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
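To pick a queue with free nodes automatically, the listing can be filtered. This is a sketch under the assumption that sinfo_t_idle always prints lines of the form "Partition <name> : <count> nodes idle" as in the example above; the helper name <code>idle_partitions</code> is hypothetical.

```shell
# Print the names of all partitions that report at least one idle node.
# Input: sinfo_t_idle output on standard input
# (assumed format: "Partition <name> : <count> nodes idle").
idle_partitions() {
  awk '$1 == "Partition" && $4 > 0 { print $2 }'
}

# Demonstration on part of the example output
# (on the cluster: sinfo_t_idle | idle_partitions).
idle_partitions <<'EOF'
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
EOF
```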
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when the batch job starts depends on the availability of the requested resources and the fair sharing value.
 
 

To submit your batch job to the development queue for CPU nodes (<code>dev_cpu</code>), please use:

<pre>
$ sbatch --partition=dev_cpu <job_script>
or
$ sbatch -p dev_cpu <job_script>
</pre>
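
As an illustration, a job script for the development queue might look like the sketch below. The file name <code>job.sh</code>, the job name <code>hello</code>, the 10-minute time limit and the <code>echo</code> payload are assumptions chosen for this example, not site defaults. The snippet only writes the file; it does not submit anything.

```shell
#!/bin/bash
# Sketch: write a minimal job script to a file and show how it would be submitted.
# job.sh, the job name and the echo payload are illustrative assumptions.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=dev_cpu
#SBATCH --time=10
#SBATCH --ntasks=1
#SBATCH --job-name=hello
#SBATCH --output=hello-%j.out

echo "Running on $(hostname)"
EOF

# On a login node the script would then be submitted with:
#   sbatch job.sh
echo "Wrote job.sh with $(grep -c '^#SBATCH' job.sh) #SBATCH directives"
```

Options given as <code>#SBATCH</code> lines can be overridden on the command line, e.g. <code>sbatch --time=20 job.sh</code>.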
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be given on the command line or in your job script. The defaults for some of these options depend on the queue and are listed [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint for using the LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND'' (alternatively ''BEEOND_4MDS'' or ''BEEOND_MAXMDS'')
| #SBATCH --constraint=BEEOND (alternatively BEEOND_4MDS or BEEOND_MAXMDS)
| Job constraint for using the BeeOND filesystem.
|-
|}
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (well under 1 hour) with low memory requirements (well under 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node and an interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish in less than 2 hours in this example; otherwise the system will kill the job while it is running.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.

An interactive parallel application running on one or more compute nodes (here, e.g., 5 nodes) with 96 cores each usually requires a certain amount of memory in GByte (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores (5 nodes x 96 cores) with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.
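
If only the job ID and the node list are needed, the squeue output can be filtered. The following is a minimal sketch that assumes the default column layout of the squeue example on this page and job names without spaces. The helper name <code>job_nodes</code> and the embedded sample text are illustrative; on the cluster you would pipe the real output of <code>squeue</code> into the function.

```shell
#!/bin/bash
# Sketch: pull job ID and node list of running jobs out of squeue-style output.
# job_nodes is a hypothetical helper; the sample is copied from the squeue example.
job_nodes() {
    # Skip the header line; for jobs in state R print
    # column 1 (JOBID) and column 8 (NODELIST(REASON)).
    awk 'NR > 1 && $5 == "R" { print $1, $8 }'
}

sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084'

printf '%s\n' "$sample" | job_nodes
```

Pending jobs are skipped because they have no node assigned yet; their last column holds the reason instead.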

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
squeue displays information about YOUR active, pending and/or recently completed jobs. It lists all of your own active and pending jobs. The command is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:15:35Z

P Schuhmacher: /* List of your submitted jobs : squeue */

= Purpose and function of a queuing system =

General procedure: see [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 uses the workload management software Slurm. Any job submission by the user must therefore be done with Slurm commands. Slurm queues and runs user jobs based on fair-share policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload management software.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place for short bursts of heavy computation that would affect other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit within the limits imposed by the queues. A request for compute resources on bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Development queues are intended only for development work, i.e. debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the time, number of tasks and memory that are used if they are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
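
To pick a queue with free nodes, the output can be filtered for partitions that report at least one idle node. The following is a minimal sketch that assumes the "Partition <name> : <n> nodes idle" line format shown above. The helper name <code>idle_partitions</code> and the embedded sample lines are illustrative; on the cluster you would pipe the output of <code>sinfo_t_idle</code> into the function.

```shell
#!/bin/bash
# Sketch: list partitions that currently report at least one idle node.
# idle_partitions is a hypothetical helper; the sample is copied from the output above.
idle_partitions() {
    # Fields: $1="Partition"  $2=<name>  $3=":"  $4=<idle node count>
    awk '$1 == "Partition" && $4 + 0 > 0 { print $2, $4 }'
}

sample='Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle'

printf '%s\n' "$sample" | idle_partitions
```

Idle nodes only indicate that a job may start quickly; the job still has to fit within the limits of the chosen queue.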
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources needed to run the job; '''sbatch''' then places the batch job in the queue. When the job actually starts depends on the availability of the requested resources and on your fair-share value.
 
 

To submit your batch job to the development queue for CPU nodes (<code>dev_cpu</code>), please use:

<pre>
$ sbatch --partition=dev_cpu <job_script>
or
$ sbatch -p dev_cpu <job_script>
</pre>
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be given on the command line or in your job script. The defaults for some of these options depend on the queue and are listed [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint for using the LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND'' (alternatively ''BEEOND_4MDS'' or ''BEEOND_MAXMDS'')
| #SBATCH --constraint=BEEOND (alternatively BEEOND_4MDS or BEEOND_MAXMDS)
| Job constraint for using the BeeOND filesystem.
|-
|}
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (well under 1 hour) with low memory requirements (well under 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node and an interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish in less than 2 hours in this example; otherwise the system will kill the job while it is running.
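
Slurm accepts time limits in several notations: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds. A bare number such as <code>-t 120</code> therefore means 120 minutes. The following sketch converts such a string to minutes to make the notations comparable. The function name <code>slurm_minutes</code> is hypothetical, and rounding leftover seconds up to a full minute is a choice made for this example only.

```shell
#!/bin/bash
# Sketch: convert a Slurm time limit string to minutes.
# slurm_minutes is a hypothetical helper; leftover seconds are rounded up.
slurm_minutes() {
    local spec=$1 days=0 hours=0 mins=0 secs=0
    local -a f
    if [[ $spec == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        IFS=: read -r -a f <<< "${spec#*-}"
        hours=${f[0]:-0}; mins=${f[1]:-0}; secs=${f[2]:-0}
    else
        IFS=: read -r -a f <<< "$spec"
        case ${#f[@]} in
            1) mins=${f[0]} ;;                                # minutes
            2) mins=${f[0]}; secs=${f[1]} ;;                  # minutes:seconds
            3) hours=${f[0]}; mins=${f[1]}; secs=${f[2]} ;;   # hours:minutes:seconds
        esac
    fi
    # 10# forces base 10 so that values like "08" are not read as octal.
    echo $(( (10#$days * 24 + 10#$hours) * 60 + 10#$mins + (10#$secs + 59) / 60 ))
}

slurm_minutes 120        # the value used in the salloc example
slurm_minutes 01:00:00
slurm_minutes 72:00:00
```
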

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.

An interactive parallel application running on one or more compute nodes (here, e.g., 5 nodes) with 96 cores each usually requires a certain amount of memory in GByte (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores (5 nodes x 96 cores) with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to access another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

=== List of your submitted jobs : squeue ===
squeue displays information about YOUR active, pending and/or recently completed jobs. It lists all of your own active and pending jobs. The command is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or in the manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>

=== Detailed job information : scontrol show job ===
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
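
Because the output consists of KEY=VALUE tokens, single fields such as the account or the job state can be extracted with a small filter. The following is a minimal sketch; the helper name <code>job_field</code> is hypothetical and the sample lines are taken from the example above. It assumes that the value itself contains no whitespace.

```shell
#!/bin/bash
# Sketch: extract one field from `scontrol show job` style output.
# job_field is a hypothetical helper; it reads the output on stdin.
job_field() {
    # $1 = field name, e.g. Account or JobState.
    # Prints the value of the first KEY=VALUE token whose KEY matches.
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")
            if (n > 0 && substr($i, 1, n - 1) == key) {
                print substr($i, n + 1)
                exit
            }
        }
    }'
}

sample='JobId=1262 JobName=wrap
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

printf '%s\n' "$sample" | job_field Account
printf '%s\n' "$sample" | job_field JobState
```

On the cluster the same filter could be used as <code>scontrol show job <jobid> | job_field Account</code>.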
 

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:15:08Z

P Schuhmacher: /* Monitor and manage jobs */

= Purpose and function of a queuing system =

General procedure: see [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 uses the workload management software Slurm. Any job submission by the user must therefore be done with Slurm commands. Slurm queues and runs user jobs based on fair-share policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload management software.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place for short bursts of heavy computation that would affect other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit within the limits imposed by the queues. A request for compute resources on bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the time, number of tasks and memory that are used if they are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) that shows how many nodes are available for immediate use in each partition. Users can use this information to submit jobs that fit into the free resources and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
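Because the output has one line per partition, it can be filtered for partitions that currently have idle nodes. The following is a minimal sketch, assuming the output format shown above (<code>Partition <name> : <n> nodes idle</code>). The function name <code>idle_partitions</code> is hypothetical, and the sample lines are copied from the listing above instead of being read from the real command.

```shell
#!/bin/sh
# Print the partitions that report at least one idle node.
# Reads text in the format "Partition <name> : <n> nodes idle" from stdin.
idle_partitions() {
    awk '$1 == "Partition" && $4 > 0 { print $2, $4 }'
}

# Sample input copied from the listing above; on the cluster one would
# pipe the real output instead:  sinfo_t_idle | idle_partitions
idle_partitions <<'EOF'
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
EOF
```

With the sample input, only <code>dev_cpu</code>, <code>cpu</code>, <code>highmem</code> and <code>gpu_h100</code> are printed, each followed by its idle node count.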
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources that are needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair-share value.
 
 

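To run your batch job on one of the CPU nodes of the development queue, use one of the following equivalent forms, where <job_script> stands for your job script:

<pre>
$ sbatch --partition=dev_cpu <job_script>
$ sbatch -p dev_cpu <job_script>
</pre>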
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted to after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
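Several of these options are typically combined in one job script. The sketch below is a minimal example under stated assumptions: the file name <code>hello_job.sh</code>, the job name and the resource values are placeholders chosen for illustration, and the <code>dev_cpu</code> queue is used only because the requested 10 minutes fit its limit in Table 2. In the output file name, <code>%j</code> is replaced by Slurm with the job ID.

```shell
#!/bin/sh
# Write a small job script to disk. The "#SBATCH" lines are ordinary
# comments for the shell but are read by sbatch as options.
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=dev_cpu
#SBATCH --time=10
#SBATCH --ntasks=1
#SBATCH --job-name=hello
#SBATCH --output=hello_%j.out

# The actual work of the job starts here.
echo "Hello from $(uname -n)"
EOF

# On a login node the script would then be submitted with:
#   sbatch hello_job.sh
```

Options given on the command line take precedence over the <code>#SBATCH</code> lines in the script, so <code>sbatch --time=5 hello_job.sh</code> would shorten the time limit without editing the file.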
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job by running the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish within 2 hours in this example; otherwise the system will kill it during runtime.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are automatically revoked.

An interactive parallel application running on one or more compute nodes with 96 cores each usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.
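To avoid copying these values by hand, the job ID and node list of running jobs can be picked out of the squeue output. This is a sketch, assuming squeue's default column layout (JOBID first, state in the fifth column, node list last) and job names without spaces. The helper name <code>running_jobs</code> is hypothetical, and the sample lines are illustrative rather than real output.

```shell
#!/bin/sh
# Print "<jobid> <nodelist>" for every job in state R (running).
# Expects squeue's default columns on stdin:
# JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
running_jobs() {
    awk 'NR > 1 && $5 == "R" { print $1, $NF }'
}

# Illustrative input; on the cluster use:  squeue | running_jobs
running_jobs <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
EOF
```

The first printed value can then be passed to <code>srun --jobid=</code>, the second to <code>srun --nodelist=</code>.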

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==

== List of your submitted jobs : squeue ==
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 

* ''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
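Because the output consists of <code>Key=Value</code> pairs, single fields are easy to extract in scripts. The following is a minimal sketch, assuming values without embedded spaces. The function name <code>job_field</code> is hypothetical, and the sample lines are taken from the output above.

```shell
#!/bin/sh
# Print the value of one Key=Value field from "scontrol show job" output.
# Usage: scontrol show job <jobid> | job_field JobState
job_field() {
    # Put every Key=Value pair on its own line, then match the key exactly.
    tr -s ' \t' '\n\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Sample lines copied from the output above.
job_field JobState <<'EOF'
JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
EOF
```

Using <code>substr</code> instead of the second awk field keeps values that themselves contain an equals sign, such as the <code>ReqTRES</code> list, intact.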
 

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:12:32Z

P Schuhmacher: /* Slurm Commands (excerpt) */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. All jobs are therefore submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any calculation on the compute nodes of bwUniCluster 3.0 must be defined as a sequence of commands together with the required run time, number of CPU cores and main memory. This bundle, the '''batch job''', is then submitted to the resource and workload manager.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without having to wait. They are the right place for short bursts of heavy computation, which would affect other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the time, number of tasks and memory that are used if they are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].
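As a worked example of how the defaults translate into a memory request: if no memory option is given, the job is assumed here to receive the queue's default <code>mem-per-cpu</code> multiplied by the number of allocated CPUs. The sketch below only reproduces this arithmetic with the values from Tables 1 and 2. The function name <code>default_mem_mb</code> is hypothetical, rounding up to whole cores (with <code>threads-per-core=2</code>) is ignored, and the queues whose default is given as <code>mem-per-gpu</code> are not covered.

```shell
#!/bin/sh
# Estimate the default memory (in MB) of a job from the queue's default
# mem-per-cpu value (Tables 1 and 2) and the number of CPUs.
# Usage: default_mem_mb <queue> <cpus>
default_mem_mb() {
    case "$1" in
        cpu_il|dev_cpu_il)
            per_cpu=1950 ;;
        cpu|dev_cpu|highmem|dev_highmem|gpu_h100|dev_gpu_h100|gpu_mi300)
            per_cpu=1125 ;;
        *)
            echo "no mem-per-cpu default known for queue: $1" >&2
            return 1 ;;
    esac
    echo $(( per_cpu * $2 ))
}

default_mem_mb cpu 96      # 96 CPUs in the cpu queue: prints 108000
default_mem_mb cpu_il 64   # 64 CPUs in the cpu_il queue: prints 124800
```

Comparing the result with the maximum <code>mem</code> value of the queue shows how much headroom is left for an explicit memory request.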

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) that shows how many nodes are available for immediate use in each partition. Users can use this information to submit jobs that fit into the free resources and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Monitor and manage jobs |scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources that are needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair-share value.
 
 

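To run your batch job on one of the CPU nodes of the development queue, use one of the following equivalent forms, where <job_script> stands for your job script:

<pre>
$ sbatch --partition=dev_cpu <job_script>
$ sbatch -p dev_cpu <job_script>
</pre>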
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted to after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job by running the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish within 2 hours in this example; otherwise the system will kill it during runtime.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are automatically revoked.
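The two time notations used on this page (<code>-t 120</code> for minutes, <code>-t 01:00:00</code> for hours:minutes:seconds) can be compared with a queue limit before requesting resources. This is a sketch, assuming only these two notations need to be handled (Slurm accepts further forms, such as days-hours). The function names <code>to_minutes</code> and <code>fits_limit</code> are hypothetical.

```shell
#!/bin/sh
# Convert a time given as "minutes" or "hours:minutes:seconds" to whole
# minutes, rounding any remaining seconds up to the next minute.
to_minutes() {
    echo "$1" | awk -F: '
        NF == 1 { print $1 + 0 }
        NF == 3 { print $1 * 60 + $2 + ($3 > 0 ? 1 : 0) }'
}

# Succeed if the requested time does not exceed the limit.
# Usage: fits_limit <requested> <limit>
fits_limit() {
    [ "$(to_minutes "$1")" -le "$(to_minutes "$2")" ]
}

to_minutes 120         # prints 120
to_minutes 01:00:00    # prints 60
fits_limit 120 72:00:00 && echo "120 minutes fit into a regular queue"
fits_limit 120 30 || echo "120 minutes exceed the 30-minute development limit"
```

The limits used in the last two lines are the maximum <code>time</code> values of the regular and development queues from Tables 1 and 2.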

An interactive parallel application running on one or more compute nodes with 96 cores each usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want access to another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

The job ID and the node list can be displayed with the command:

<pre>
$ squeue
</pre>
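To fill in the placeholders of the srun commands, the job ID and node list can be picked out of the squeue table with a small filter. This is only a sketch: the helper name running_jobs is hypothetical, it assumes squeue's default column layout as shown in the monitoring example on this page, and it assumes job names without spaces.

```shell
# Hypothetical helper: read squeue's default table on stdin and print
# "JOBID NODELIST" for every running job (state column ST equals "R").
running_jobs() {
    awk 'NR > 1 && $5 == "R" { print $1, $8 }'
}

# Sample text in squeue's default column layout; on the cluster you would
# pipe the real command instead:  squeue | running_jobs
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002'

printf '%s\n' "$sample" | running_jobs
```

For the sample row this prints the job ID 1262 followed by the node uc3n002.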

To run an MPI program, simply type mpirun <program_name>; your program then runs on all 480 cores. A very simple example of starting a parallel job:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI, you must start <my_mpi_program> with the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
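Single values can be extracted from this Key=Value output in scripts. The following is a minimal sketch, not an official tool: the helper name job_field is hypothetical, and it only handles values without spaces. The sample text is an excerpt of the output above.

```shell
# Hypothetical helper: print the value of one Key=Value field from
# "scontrol show job" output read on stdin.
job_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")
            if (n > 0 && substr($i, 1, n - 1) == key) {
                print substr($i, n + 1)
                exit
            }
        }
    }'
}

# Excerpt of the example output; on the cluster you would run e.g.
#   scontrol show job 1262 | job_field JobState
sample='JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
NodeList=uc3n002'

printf '%s\n' "$sample" | job_field JobState    # prints RUNNING
printf '%s\n' "$sample" | job_field TimeLimit   # prints 00:20:00
```

Keys are matched exactly, so asking for NodeList does not return the value of ReqNodeList.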
 

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:10:05Z

P Schuhmacher: /* Slurm Commands (excerpt) */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 has installed the workload managing software Slurm. Therefore any job submission by the user is to be executed by commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

HPC Workload Manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions:
* First, it allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* Third, it arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload managing software.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place for short bursts of heavy computation that would affect other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the time, number of tasks and memory if these are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].
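Before submitting, a request can be checked against the time limits of Table 1 and Table 2. The following is a minimal sketch under two assumptions: the helper names are hypothetical, and time=30 in Table 2 is read as 30 minutes, which is how Slurm interprets a bare number.

```shell
# Hypothetical helper: maximum walltime in minutes for a queue, using the
# limits listed in Table 1 (72 hours) and Table 2 (assumed 30 minutes).
queue_max_minutes() {
    case "$1" in
        dev_*) echo 30 ;;
        cpu|cpu_il|highmem|gpu_h100|gpu_mi300|gpu_a100_il|gpu_h100_il)
            echo $(( 72 * 60 )) ;;
        *) return 1 ;;   # queue not listed in the tables
    esac
}

# Succeeds (exit status 0) if the requested minutes fit into the queue.
fits_queue() {
    [ "$2" -le "$(queue_max_minutes "$1")" ]
}

fits_queue dev_cpu 20 && echo "20 min fits dev_cpu"
fits_queue dev_cpu 120 || echo "120 min exceeds dev_cpu; use cpu instead"
```

The same pattern could be extended to the node and memory limits of the tables.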

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many nodes are available for immediate use on the system. Users are expected to use this information to submit jobs that fit these resources and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
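The output can be filtered to list only the partitions that have idle nodes right now. This is a sketch that assumes the line format shown above; the helper name idle_partitions is hypothetical.

```shell
# Hypothetical helper: read sinfo_t_idle output on stdin and print the
# names of partitions that currently have at least one idle node.
idle_partitions() {
    awk '$1 == "Partition" && $4 > 0 { print $2 }'
}

# Excerpt of the example output; on the cluster you would run
#   sinfo_t_idle | idle_partitions
sample='Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition cpu_il : 0 nodes idle'

printf '%s\n' "$sample" | idle_partitions
```

For the excerpt this lists dev_cpu, cpu and gpu_h100, one per line.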
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job starts, however, depends on the availability of the requested resources and the fair sharing value.
 
 

To run your batch job in the development queue for CPU nodes, please use (<job_script> stands for the name of your job script):

<pre>
$ sbatch --partition=dev_cpu <job_script>
or
$ sbatch -p dev_cpu <job_script>
</pre>
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used on the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
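The script form of these options can be illustrated with a small job script. This is a sketch, not a site-provided template: the job name, the output file name and the script body are assumptions. Following the note in the table, no memory option is set.

```shell
#!/bin/bash
# Queue: development queue for CPU nodes (see Table 2).
#SBATCH --partition=dev_cpu
# Wall clock time limit of 10 minutes.
#SBATCH --time=00:10:00
# One task is enough for a serial program.
#SBATCH --ntasks=1
# Hypothetical job name and output file; %j is replaced by the job ID.
#SBATCH --job-name=hello
#SBATCH --output=hello-%j.out

# SLURM_JOB_ID is set by Slurm inside a job; the fallback keeps the
# script runnable outside the batch system for a quick syntax check.
job_id="${SLURM_JOB_ID:-none}"
echo "Job ${job_id} running on $(uname -n)"
```

Saved under a name such as hello.sh (hypothetical), the script would be submitted with sbatch hello.sh. Options given on the command line take precedence over the #SBATCH lines in the script.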
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you may only run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. To run longer jobs or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc, issued on a login node. For example, a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, is requested as follows:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

You will then get one core on a compute node within the partition "cpu". After executing this command, '''DO NOT CLOSE''' your current terminal session; wait until the queueing system Slurm has granted you the requested resources on the compute system. You will then be logged in automatically on the granted core. To run a serial program on the granted core, you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must finish in less than 2 hours in this example; otherwise the system will kill it at runtime.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) are revoked automatically.

An interactive parallel application running on one or many compute nodes (here, e.g., 5 nodes with 96 cores each) usually requires a certain amount of memory in GByte (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>
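The size of this request follows from simple arithmetic: the task count is nodes times tasks per node, and because --mem is given per node (see the sbatch options table), the total memory is nodes times the per-node value. A minimal sketch, with a hypothetical helper name:

```shell
# Hypothetical helper: summarise a multi-node request.
# Arguments: nodes, tasks per node, memory per node in GByte.
allocation_summary() {
    echo "$(( $1 * $2 )) tasks, $(( $1 * $3 )) GByte in total"
}

allocation_summary 5 96 50   # prints: 480 tasks, 250 GByte in total
```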

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want access to another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

The job ID and the node list can be displayed with the command:

<pre>
$ squeue
</pre>

To run an MPI program, simply type mpirun <program_name>; your program then runs on all 480 cores. A very simple example of starting a parallel job:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI, you must start <my_mpi_program> with the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:09:25Z

P Schuhmacher: /* Running Jobs */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 has installed the workload managing software Slurm. Therefore any job submission by the user is to be executed by commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

HPC Workload Manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions:
* First, it allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* Third, it arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload managing software.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place for short bursts of heavy computation that would affect other users if run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the time, number of tasks and memory if these are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].
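When no memory option is given, the default memory of a job follows from the mem-per-cpu default in the tables multiplied by the number of allocated CPUs. The sketch below assumes that one CPU is counted per task; the helper name is hypothetical. The actual allocation can be larger when Slurm rounds up to whole cores with two hardware threads each (threads-per-core=2).

```shell
# Hypothetical helper: default memory in MB when neither --mem nor
# --mem-per-cpu is given, assuming one CPU is counted per task.
# Arguments: number of tasks, mem-per-cpu default of the queue in MB.
default_mem_mb() {
    echo $(( $1 * $2 ))
}

default_mem_mb 96 1125    # 96 tasks in queue "cpu": 108000
default_mem_mb 64 1950    # 64 tasks in queue "cpu_il": 124800
```

Both values stay below the per-node maximum listed for the respective queue in Table 1.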

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many nodes are available for immediate use on the system. Users are expected to use this information to submit jobs that fit these resources and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

= Running Jobs =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Batch Jobs: sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive Jobs: salloc|salloc]] || Requests resources for an interactive job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job starts, however, depends on the availability of the requested resources and the fair sharing value.
 
 

To run your batch job in the development queue for CPU nodes, please use (<job_script> stands for the name of your job script):

<pre>
$ sbatch --partition=dev_cpu <job_script>
or
$ sbatch -p dev_cpu <job_script>
</pre>
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used on the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
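The options in the table can be combined in a job script. The following is a minimal sketch, not an official template: the file name <code>job.sh</code>, the job name, the output file pattern and the program <code>./my_program</code> are placeholders, and the queue and time values are chosen only for illustration. The script is written with a here-document so that the example can be tried without Slurm.

```shell
#!/bin/bash
# Write a minimal job script to job.sh (the file name is a placeholder).
cat > job.sh <<'EOF'
#!/bin/bash
# Queue and wall clock time limit (10 minutes)
#SBATCH --partition=dev_cpu
#SBATCH --time=00:10:00
# One task on one node
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Placeholder job name; %j in the file name is replaced by the job ID
#SBATCH --job-name=example
#SBATCH --output=example_%j.out

# Placeholder for your own program
./my_program
EOF

# List the directives that sbatch would read from the script.
grep '^#SBATCH' job.sh
```

On the cluster the script would then be submitted with <code>sbatch job.sh</code>. Options given on the command line take precedence over the <code>#SBATCH</code> lines in the script.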
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.
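The value given to <code>-t</code> in the salloc example is a plain number, which Slurm interprets as minutes; the same limit can also be written as hours:minutes:seconds. The helper below is an illustrative sketch (the function name <code>minutes_to_hms</code> is made up for this example) that converts a minute count into that form.

```shell
#!/bin/bash
# minutes_to_hms MINUTES: convert a plain minute count, as accepted by
# "-t", into the equivalent hours:minutes:seconds form.
minutes_to_hms() {
    printf '%02d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# The 2-hour limit used in the salloc example
minutes_to_hms 120
# A 30-minute limit
minutes_to_hms 30
```

Accordingly, <code>-t 02:00:00</code> requests the same time limit as <code>-t 120</code>.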

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that, once the walltime limit has been reached the resources - i.e. the compute node - will automatically be revoked.

An interactive parallel application that runs on one or more compute nodes (here 5 nodes with 96 cores each) usually needs a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.
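The job ID and the node list can also be picked out of the squeue output with standard tools. The sketch below works on a copy of the sample squeue output from this page so that it runs without Slurm; the variable names are illustrative, and the column positions assume the default squeue layout and a job name without spaces.

```shell
#!/bin/bash
# Sample squeue output, copied from the example on this page.
squeue_output='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002'

# Skip the header line; print job ID (column 1) and node list (column 8)
# for running jobs (state "R" in column 5).
running=$(printf '%s\n' "$squeue_output" | awk 'NR > 1 && $5 == "R" { print $1, $8 }')
echo "$running"
```

On the cluster, the same awk filter can be applied directly to the output of <code>squeue</code>.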

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
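Because scontrol prints its information as <code>Name=value</code> pairs, single fields are easy to extract in a script. The following sketch uses a few lines copied from the output above instead of calling scontrol, so it can be tried anywhere; the helper name <code>get_field</code> is an assumption made for this example.

```shell
#!/bin/bash
# A few lines of "scontrol show job" output, copied from the example above.
job_info='JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

# get_field NAME: print the value of the first NAME=value pair.
get_field() {
    printf '%s\n' "$job_info" | tr ' ' '\n' |
        awk -F= -v key="$1" '$1 == key { print $2; exit }'
}

state=$(get_field JobState)
limit=$(get_field TimeLimit)
echo "state=$state limit=$limit"
```

On the cluster, <code>job_info</code> could be filled with <code>$(scontrol show job <jobid>)</code> instead of the sample text.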
 

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:09:01Z

P Schuhmacher: /* Slurm Options */

= Purpose and function of a queuing system =

For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. All jobs must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to computer resources without having to wait. This is the place to realize instantaneous heavy compute without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Batch jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues
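A request that exceeds the limits of a queue cannot be scheduled there, so it can be useful to check a request before submitting it. The following sketch hard-codes the node limits from Table 1; the function names <code>max_nodes</code> and <code>fits_queue</code> are made up for this example, and only the number of nodes and the run time in whole hours are checked.

```shell
#!/bin/bash
# max_nodes QUEUE: print the node limit of a regular queue (values from Table 1).
max_nodes() {
    case "$1" in
        cpu_il)    echo 30 ;;
        cpu)       echo 20 ;;
        highmem)   echo 4 ;;
        gpu_h100)  echo 12 ;;
        gpu_mi300) echo 1 ;;
        gpu_a100_il|gpu_h100_il) echo 9 ;;
        *)         return 1 ;;
    esac
}

# fits_queue QUEUE NODES HOURS: succeed if the request is within the limits.
# All regular queues in Table 1 allow at most 72 hours.
fits_queue() {
    local limit
    limit=$(max_nodes "$1") || return 1
    [ "$2" -le "$limit" ] && [ "$3" -le 72 ]
}

fits_queue cpu 5 1      && echo "cpu: 5 nodes for 1 hour fits"
fits_queue highmem 8 24 || echo "highmem: 8 nodes exceeds the limit"
```

The limits may change over time, so the table remains the authoritative source for the current values.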

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the time, the number of tasks and the memory that apply if they are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].
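When only a per-CPU memory default applies, the memory of a job follows from the number of allocated CPUs. The sketch below is a rough estimate under that assumption; the function name <code>default_mem_mb</code> is illustrative, and the number of CPUs that Slurm actually allocates can be larger than the number of tasks requested, as the <code>scontrol show job</code> example on this page shows (1 CPU requested, 2 allocated).

```shell
#!/bin/bash
# default_mem_mb CPUS MEM_PER_CPU_MB: memory a job gets when only a
# per-CPU default applies (allocated CPUs times memory per CPU).
default_mem_mb() {
    echo $(( $1 * $2 ))
}

# Values from the scontrol example on this page: 2 allocated CPUs, 2000 MB each.
default_mem_mb 2 2000
# Queue "cpu", assuming 96 allocated CPUs and the default of 1125 MB per CPU.
default_mem_mb 96 1125
```

The first call reproduces the <code>mem=4000M</code> shown under <code>AllocTRES</code> in the scontrol example.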

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
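The output of <code>sinfo_t_idle</code> can be filtered to list only the partitions that currently have idle nodes. The sketch below runs on a shortened copy of the sample output above rather than on the real command; it assumes that the output keeps the column layout shown here, and the variable names are illustrative.

```shell
#!/bin/bash
# Sample sinfo_t_idle output, shortened from the example above.
idle_output='Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle'

# Print the names of partitions that report at least one idle node.
# Column 2 is the partition name, column 4 the number of idle nodes.
free_partitions=$(printf '%s\n' "$idle_output" |
    awk '$1 == "Partition" && $4 > 0 { print $2 }')
echo "$free_partitions"
```

On the cluster, the awk filter can be applied directly: <code>sinfo_t_idle | awk '$1 == "Partition" && $4 > 0 { print $2 }'</code>.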
 

= Running Jobs =
== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources that the job needs; '''sbatch''' then places the batch job in the queue. When the job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
 

To run your batch job on one of the cpu nodes, please use:

<pre>
$ sbatch --partition=dev_cpu <job_script>
or
$ sbatch -p dev_cpu <job_script>
</pre>
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that, once the walltime limit has been reached the resources - i.e. the compute node - will automatically be revoked.

An interactive parallel application that runs on one or more compute nodes (here 5 nodes with 96 cores each) usually needs a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.
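The size of the multi-node allocation requested above follows from simple arithmetic, which is easy to reproduce when adapting the command to other node counts. The sketch uses the values from the salloc example; the variable names are illustrative.

```shell
#!/bin/bash
# Values from the salloc example: -N 5 --ntasks-per-node=96 --mem=50gb
nodes=5
tasks_per_node=96
mem_per_node_gb=50

total_tasks=$(( nodes * tasks_per_node ))
# --mem is a per-node value, so the total grows with the node count.
total_mem_gb=$(( nodes * mem_per_node_gb ))
echo "tasks: $total_tasks, memory over all nodes: ${total_mem_gb} GByte"
```

This reproduces the 480 cores mentioned above and shows that the job holds 250 GByte of memory across its 5 nodes.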

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 

= Slurm Options =

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:08:08Z

P Schuhmacher: /* Batch Jobs: sbatch */

= Purpose and function of a queuing system =

For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. All jobs must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to computer resources without having to wait. This is the place to realize instantaneous heavy compute without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Batch jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the time, the number of tasks and the memory that apply if they are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
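The output above can be filtered to show only partitions that currently have idle nodes. The following is a minimal sketch, assuming the output format shown above; <code>idle_partitions</code> is a hypothetical helper name and not part of the SCC script.

```shell
#!/bin/bash
# Sample text copied from the sinfo_t_idle output above; on the cluster you
# would pipe the real command instead: sinfo_t_idle | idle_partitions
sample_output='Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle'

# Print "partition idle_nodes" for every partition with at least one idle node.
# idle_partitions is a hypothetical helper name.
idle_partitions() {
    awk '$1 == "Partition" && $4 > 0 { print $2, $4 }'
}

printf '%s\n' "$sample_output" | idle_partitions
```

Partitions that report 0 idle nodes are dropped, so the remaining lines are candidates for a quick job start.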
 

= Running Jobs =
== Batch Jobs: sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' then queues the batch job. When the batch job actually starts depends on the availability of the requested resources and on the fair-share value.
 
 

To run your batch job on one of the cpu nodes, please use:

<pre>
$ sbatch --partition=dev_cpu
or
$ sbatch -p dev_cpu
</pre>
 

=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted to after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
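Several of the options from the table can be combined in one job script. The following is a minimal sketch: the job name, the file names and the requested time are illustrative assumptions, not recommended settings. The block only writes the script and checks its syntax; the actual submission is shown as a comment.

```shell
#!/bin/bash
# Write an illustrative job script; all values below are example assumptions.
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=dev_cpu
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=example
#SBATCH --output=example_%j.out
#SBATCH --error=example_%j.err

echo "Running on $(hostname)"
EOF

# Check the script for shell syntax errors without executing it.
bash -n example_job.sh && echo "syntax ok"

# On a login node the script would then be submitted with:
#   sbatch example_job.sh
```

In the file name patterns, Slurm replaces <code>%j</code> with the job ID, so each run writes to its own output and error file.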
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that request more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example; otherwise the job will be killed by the system during runtime.
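A plain number passed to <code>-t</code>, as in <code>-t 120</code> above, is interpreted by Slurm as minutes; the same limit can also be written as hours:minutes:seconds. The following sketch converts between the two notations. <code>minutes_to_hms</code> is a hypothetical helper name, not a Slurm command.

```shell
#!/bin/bash
# Convert a plain minute count (as in "-t 120") to hours:minutes:seconds.
# minutes_to_hms is a hypothetical helper name, not a Slurm command.
minutes_to_hms() {
    local minutes=$1
    printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}

minutes_to_hms 120   # prints 02:00:00
minutes_to_hms 30    # prints 00:30:00
```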

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. You can start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) will automatically be revoked.

An interactive parallel application running on one or on many compute nodes (here, e.g., 5 nodes) with 96 cores each usually requires a certain amount of memory (e.g. 50 GByte per node) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
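Individual fields can be picked out of this output, for example to check only the job state in a script. The following is a minimal sketch that works on a few lines copied from the example above; <code>job_field</code> is a hypothetical helper name, and the approach assumes that the requested value contains no spaces.

```shell
#!/bin/bash
# A few lines copied from the example output above; on the cluster you would
# pipe the real command instead: scontrol show job 1262 | job_field JobState
sample='JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
NodeList=uc3n002'

# Print the value of one Key=Value field (job_field is a hypothetical helper).
# Splits the output into one Key=Value pair per line, then matches the key.
job_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

printf '%s\n' "$sample" | job_field JobState    # prints RUNNING
printf '%s\n' "$sample" | job_field TimeLimit   # prints 00:20:00
```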
 

= Slurm Options =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' then queues the batch job. When the batch job actually starts depends on the availability of the requested resources and on the fair-share value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted to after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

= Best Practices =

== Step-by-Step example ==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:05:15Z

P Schuhmacher: /* Monitor and manage jobs */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Therefore, every job must be submitted by the user with Slurm commands. Slurm queues and runs user jobs based on fair-share policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions:
* First, it allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* Finally, it arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.

== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerators

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place for short bursts of heavy computation that would affect other users if they were run on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues
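Because requests must fit into the limits of the chosen queue, it can help to check the node count before submitting. The following is a minimal sketch that hard-codes the node limits from Table 1; <code>max_nodes</code> and <code>check_nodes</code> are hypothetical helper names, and the values have to be updated by hand if the table changes.

```shell
#!/bin/bash
# Maximum node count per regular queue, copied from Table 1.
# max_nodes and check_nodes are hypothetical helper names.
max_nodes() {
    case "$1" in
        cpu_il)                  echo 30 ;;
        cpu)                     echo 20 ;;
        highmem)                 echo 4 ;;
        gpu_h100)                echo 12 ;;
        gpu_mi300)               echo 1 ;;
        gpu_a100_il|gpu_h100_il) echo 9 ;;
        *)                       return 1 ;;
    esac
}

# Succeed only if the requested node count fits into the queue's limit.
check_nodes() {
    local queue=$1 nodes=$2 limit
    limit=$(max_nodes "$queue") || { echo "unknown queue: $queue" >&2; return 2; }
    [ "$nodes" -le "$limit" ]
}

check_nodes cpu 5 && echo "5 nodes fit into queue cpu"
check_nodes highmem 8 || echo "8 nodes exceed the limit of queue highmem"
```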

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue define the time, number of tasks and memory that are used if they are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

= Running Jobs =
== Batch Jobs: sbatch ==

To run your batch job on one of the cpu nodes, please use:

<pre>
$ sbatch --partition=dev_cpu
or
$ sbatch -p dev_cpu
</pre>
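After a successful submission, sbatch prints a confirmation line that contains the job ID, which is needed later for monitoring. The following sketch extracts the ID from such a line; <code>job_id_from_sbatch</code> is a hypothetical helper name and <code>job.sh</code> stands for your own job script.

```shell
#!/bin/bash
# sbatch confirms a submission with a line of the form "Submitted batch job <id>".
# job_id_from_sbatch is a hypothetical helper that extracts the numeric ID.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Demonstration with a canned line; on the cluster you would use something like:
#   jobid=$(sbatch -p dev_cpu job.sh | job_id_from_sbatch)
echo "Submitted batch job 1262" | job_id_from_sbatch   # prints 1262
```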
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that request more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example; otherwise the job will be killed by the system during runtime.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. You can start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources (i.e. the compute node) will automatically be revoked.

An interactive parallel application running on one or on many compute nodes (here, e.g., 5 nodes) with 96 cores each usually requires a certain amount of memory (e.g. 50 GByte per node) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>
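The totals requested by the command above follow from simple multiplication, since <code>--mem</code> is specified per node. A short sketch of the arithmetic, using only the values from the salloc call:

```shell
#!/bin/bash
# Values taken from the salloc command above.
nodes=5
tasks_per_node=96
mem_per_node_gb=50

# --ntasks-per-node and --mem both apply per node, so totals scale with -N.
total_tasks=$((nodes * tasks_per_node))
total_mem_gb=$((nodes * mem_per_node_gb))

echo "total tasks: $total_tasks"              # prints "total tasks: 480"
echo "total memory: ${total_mem_gb} GByte"    # prints "total memory: 250 GByte"
```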

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.
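To pick the job ID and the node list out of the squeue output in a script, the columns can be selected by position. The following is a minimal sketch, assuming the default squeue column layout and job names without spaces; <code>jobs_and_nodes</code> is a hypothetical helper name.

```shell
#!/bin/bash
# Sample in the default squeue layout; on the cluster you would pipe the real
# command instead: squeue | jobs_and_nodes
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002'

# Print "jobid nodelist" for every running job (state R), skipping the header.
# jobs_and_nodes is a hypothetical helper name.
jobs_and_nodes() {
    awk 'NR > 1 && $5 == "R" { print $1, $8 }'
}

printf '%s\n' "$sample" | jobs_and_nodes   # prints "1262 uc3n002"
```

The two values printed are the ones to substitute for the placeholders in the <code>srun --jobid</code> and <code>srun --nodelist</code> commands.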

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

* Here is an example from bwUniCluster 3.0.
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 

= Slurm Options =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' then queues the batch job. When the batch job actually starts depends on the availability of the requested resources and on the fair-share value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:03:24Z

P Schuhmacher: /* Check available resources */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Every job must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any calculation on the compute nodes of bwUniCluster 3.0 must be defined as a sequence of commands together with the required run time, number of CPU cores and main memory. This bundle, the '''batch job''', is then submitted to the resource and workload manager.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without having to wait. Use them for short bursts of heavy computation that would otherwise affect other users on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, e.g. debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define time, number of tasks and memory if they are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].
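
As a worked example of how the table values relate, the following sketch multiplies the default ''mem-per-cpu'' of the <code>cpu_il</code> queue by the maximum ''ntasks-per-node'' and the ''threads-per-core'' value from Table 1. The result equals the maximum ''mem'' listed for that queue. That Slurm counts every hardware thread as one CPU on these nodes is an assumption made for this illustration, not a statement from the table.

```shell
# Values taken from the cpu_il row of Table 1.
mem_per_cpu_mb=1950
tasks_per_node=64
threads_per_core=2

# Assumption: each hardware thread counts as one CPU for memory accounting.
node_mem_mb=$(( mem_per_cpu_mb * tasks_per_node * threads_per_core ))
echo "cpu_il: ${node_mem_mb} MB per node"
```

The computed 249600 MB matches the <code>mem=249600mb</code> maximum of <code>cpu_il</code>.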

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 

* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
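
The output format shown above is easy to filter. The following minimal sketch lists only the partitions that currently report idle nodes. It works on a few sample lines copied from the listing above; on the cluster the real output of sinfo_t_idle could presumably be piped into the same awk command. The variable names are illustrative.

```shell
# Sample lines copied from the listing above.
sample='Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition cpu_il : 0 nodes idle'

# Field 2 is the partition name, field 4 the number of idle nodes.
idle_partitions=$(printf '%s\n' "$sample" | awk '$1 == "Partition" && $4 > 0 { print $2 }')
echo "$idle_partitions"
```

For the sample lines this prints dev_cpu, cpu, highmem and gpu_h100, one per line.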
 

= Running Jobs =
== Batch Jobs: sbatch ==

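To run your batch job on one of the CPU nodes, for example in the development queue <code>dev_cpu</code>, pass the queue name and your job script to sbatch:

<pre>
$ sbatch --partition=dev_cpu <job_script>
or
$ sbatch -p dev_cpu <job_script>
</pre>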
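
A job script could look like the minimal sketch below. The job name and the output file name are illustrative assumptions; the #SBATCH options themselves are the ones listed in the sbatch options table on this page. Because the #SBATCH lines are ordinary comments for Bash, the script can also be run directly for a quick test outside of Slurm.

```shell
#!/bin/bash
# Queue, wall clock time limit and size of the job.
#SBATCH --partition=dev_cpu
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Hypothetical job name and output file; %j is replaced by the job ID.
#SBATCH --job-name=hello
#SBATCH --output=hello_%j.out

# SLURM_JOB_ID is set by Slurm inside a job; the fallback value lets the
# script run outside of Slurm as well.
job_id="${SLURM_JOB_ID:-none}"
message="Job ${job_id} running on $(hostname)"
echo "$message"
```

Saved under a name of your choice, the script is then submitted with sbatch followed by that file name.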
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, are automatically revoked.

An interactive parallel application running on one or more compute nodes (here 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). Five nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

=== Example ===
Here is an example from bwUniCluster 3.0.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
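
Individual fields can be pulled out of this key=value output with standard text tools. The following minimal sketch works on two lines copied from the example above; on the cluster the same pipeline could presumably be fed from scontrol show job <jobid> instead. The helper name get_field is hypothetical.

```shell
# Two lines copied from the scontrol example above.
sample='JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

# get_field KEY: print the value of KEY from key=value pairs read on stdin.
get_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}

state=$(printf '%s\n' "$sample" | get_field JobState)
limit=$(printf '%s\n' "$sample" | get_field TimeLimit)
echo "state=${state} timelimit=${limit}"
```

The sketch splits at the first equals sign only, so it is not suited for fields whose value itself contains an equals sign, such as ReqTRES.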
 

= Slurm Options =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be given on the command line or in your job script. Some defaults depend on the queue and are listed [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
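
The options from the table can be combined freely on the command line. According to the sbatch manual, options given on the command line take precedence over #SBATCH lines in the job script, which is convenient for one-off changes. The following sketch assembles such a command; the script name job.sh and the job name are hypothetical, and the command is only executed if sbatch and the script are actually present.

```shell
# Hypothetical script name; replace it with your own job script.
job_script="job.sh"

# Options taken from the table above, given on the command line.
cmd="sbatch --partition=cpu --time=02:00:00 --nodes=1 --ntasks=4 --job-name=test_run ${job_script}"

if command -v sbatch >/dev/null 2>&1 && [ -f "$job_script" ]; then
    # On the cluster, with an existing script: submit the job.
    $cmd
else
    # Elsewhere: only show what would be submitted.
    echo "Would run: $cmd"
fi
```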
 

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:02:46Z

P Schuhmacher: /* Scontrol show job Example */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Every job must therefore be submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system with three key functions:
* It allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* It arbitrates contention for resources by managing a queue of pending work.

Any calculation on the compute nodes of bwUniCluster 3.0 must be defined as a sequence of commands together with the required run time, number of CPU cores and main memory. This bundle, the '''batch job''', is then submitted to the resource and workload manager.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs used for developing, compiling and testing code and workflows. Development queues are intended to give users immediate access to compute resources without having to wait. Use them for short bursts of heavy computation that would otherwise affect other users on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, e.g. debugging or performance optimization.

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define time, number of tasks and memory if they are not given explicitly with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

= Running Jobs =
== Batch Jobs: sbatch ==

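To run your batch job on one of the CPU nodes, for example in the development queue <code>dev_cpu</code>, pass the queue name and your job script to sbatch:

<pre>
$ sbatch --partition=dev_cpu <job_script>
or
$ sbatch -p dev_cpu <job_script>
</pre>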
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.
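
The value 120 passed to -t in the salloc command above is a number of minutes; Slurm also accepts the form hours:minutes:seconds, which is used elsewhere on this page. The following small sketch converts a minute count into that form. The variable names are illustrative only.

```shell
# Time limit in minutes, as used in the salloc example above.
minutes=120

# Convert to the hours:minutes:seconds form accepted by --time.
timelimit=$(printf '%02d:%02d:00' $(( minutes / 60 )) $(( minutes % 60 )))
echo "-t ${minutes} is equivalent to --time=${timelimit}"
```

For 120 minutes the result is 02:00:00, i.e. the 2 hours mentioned above.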

You can now also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, are automatically revoked.

An interactive parallel application running on one or more compute nodes (here 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). Five nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.
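
The two srun commands can be filled in from the squeue output. The following minimal sketch does this for a single sample line in the default squeue column layout (JOBID, PARTITION, NAME, USER, ST, TIME, NODES, NODELIST); the sample values are taken from the scontrol example on this page. For a job spanning several nodes the last column holds a node range from which one node name would have to be chosen by hand.

```shell
# Sample squeue line; on the cluster the line could presumably be obtained
# with: squeue -u "$USER" -h
line='1262 cpu wrap ka_zs040 R 1:12 1 uc3n002'

# Column 1 is the job ID, column 8 the node list.
jobid=$(echo "$line" | awk '{ print $1 }')
nodelist=$(echo "$line" | awk '{ print $8 }')

# Print the commands instead of running them, so the sketch is safe to try.
echo "srun --jobid=${jobid} --pty /bin/bash"
echo "srun --nodelist=${nodelist} --pty /bin/bash"
```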

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

=== Example ===
Here is an example from bwUniCluster 3.0.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
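The output of scontrol show job consists of whitespace-separated <code>Key=Value</code> pairs, so single fields can be picked out in a script. The sketch below assumes that format and that the value contains no spaces; the function name <code>job_field</code> is hypothetical, and the sample text is copied from the example above.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one Key=Value field read from stdin.
job_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")
            # Compare the full key name so that e.g. "Time" does not match "TimeLimit"
            if (n > 0 && substr($i, 1, n - 1) == key) {
                print substr($i, n + 1)
                exit
            }
        }
    }'
}

# Sample lines taken from the scontrol output shown above
sample='JobId=1262 JobName=wrap
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

printf '%s\n' "$sample" | job_field JobState    # prints RUNNING
printf '%s\n' "$sample" | job_field TimeLimit   # prints 00:20:00
```

On the cluster the same function could be fed directly, e.g. <code>scontrol show job 1262 | job_field JobState</code>.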
 

= Slurm Options =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when the batch job starts depends on the availability of the requested resources and on the fair sharing value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
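A minimal job script that combines several of the options from the table might look like the sketch below. The partition, time limit, job name and output file name are assumptions chosen for illustration, and <code>job_summary</code> is a hypothetical helper. The memory options are left out, as the table recommends. The <code>SLURM_*</code> variables are set by Slurm inside a job, so the fallbacks only apply when the script is run outside of Slurm.

```shell
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --time=00:20:00
#SBATCH --ntasks=1
#SBATCH --job-name=example
#SBATCH --output=example-%j.out

# Hypothetical helper: one-line summary of the running job.
job_summary() {
    echo "job ${SLURM_JOB_ID:-none} on ${SLURM_JOB_NODELIST:-localhost} with ${SLURM_NTASKS:-1} task(s)"
}

job_summary
# The actual program would be started here, e.g. ./<my_serial_program>
```

In the output file name, <code>%j</code> is replaced by the job ID. The script would be submitted with <code>sbatch</code> followed by the script file name.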
 

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T15:00:44Z

P Schuhmacher: /* Monitor and manage jobs */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

The workload manager Slurm is installed on bwUniCluster 3.0. Therefore, every job submission by the user must be carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions:
* First, it allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* Finally, it arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place to run short, compute-intensive tasks without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the time, number of tasks and memory if these are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].
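Before submitting, a request can be compared against the maximum node count and wall time listed in Tables 1 and 2. The sketch below hard-codes those two limits per queue; the function names <code>queue_limits</code> and <code>fits_queue</code> are hypothetical. It reads <code>time=30</code> in the development queues as 30 minutes, which is how Slurm interprets a bare number, and <code>time=72:00:00</code> as 4320 minutes.

```shell
#!/bin/bash
# Hypothetical lookup: prints "<max nodes> <max minutes>" for a queue (values from Tables 1 and 2).
queue_limits() {
    case $1 in
        cpu_il)                  echo "30 4320" ;;
        cpu)                     echo "20 4320" ;;
        highmem)                 echo "4 4320"  ;;
        gpu_h100)                echo "12 4320" ;;
        gpu_mi300)               echo "1 4320"  ;;
        gpu_a100_il|gpu_h100_il) echo "9 4320"  ;;
        dev_cpu_il)              echo "8 30"    ;;
        dev_cpu|dev_highmem|dev_gpu_h100|dev_gpu_a100_il) echo "1 30" ;;
        *) return 1 ;;   # unknown queue
    esac
}

# Returns 0 if the request fits, 1 if it exceeds a limit, 2 if the queue is unknown.
fits_queue() {
    local queue=$1 nodes=$2 minutes=$3 limits max_nodes max_minutes
    limits=$(queue_limits "$queue") || return 2
    max_nodes=${limits% *}
    max_minutes=${limits#* }
    [ "$nodes" -le "$max_nodes" ] && [ "$minutes" -le "$max_minutes" ]
}

fits_queue cpu 5 60     && echo "cpu: 5 nodes for 60 minutes fits"
fits_queue dev_cpu 2 20 || echo "dev_cpu: 2 nodes exceeds the limit"
```

Slurm enforces the real limits itself; a local check like this only helps to catch an oversized request before it is submitted.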

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many nodes are available for immediate use on the system. Users are expected to use this information to submit jobs that fit into these resources and thus obtain quick job turnaround times.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
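To use this output in a script, the partitions that currently have idle nodes can be filtered out. The sketch below assumes the line format shown above (<code>Partition <name> : <count> nodes idle</code>); the function name <code>idle_partitions</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical filter: print "<partition> <idle nodes>" for partitions with at least one idle node.
idle_partitions() {
    awk '$1 == "Partition" && $4 + 0 > 0 { print $2, $4 }'
}

# Sample lines taken from the sinfo_t_idle output shown above
printf '%s\n' \
    'Partition dev_cpu : 2 nodes idle' \
    'Partition cpu : 68 nodes idle' \
    'Partition cpu_il : 0 nodes idle' | idle_partitions
```

On the cluster the filter would be used as <code>sinfo_t_idle | idle_partitions</code>.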
 

= Running Jobs =
== Batch Jobs: sbatch ==

To run your batch job on one of the cpu nodes, please use:

<pre>
$ sbatch --partition=dev_cpu
or
$ sbatch -p dev_cpu
</pre>
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (well under 1 hour) with low memory requirements (well under 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.
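The value passed to <code>-t</code> in the salloc example is a number of minutes, so <code>-t 120</code> corresponds to the 2 hours mentioned above. The sketch below converts minutes into the <code>HH:MM:SS</code> form that is also accepted by <code>-t</code>; the function name <code>minutes_to_walltime</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: convert whole minutes into HH:MM:SS.
minutes_to_walltime() {
    local minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

minutes_to_walltime 120   # prints 02:00:00
minutes_to_walltime 30    # prints 00:30:00
```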

You can now also start a graphical X11 terminal that connects to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that, once the walltime limit has been reached the resources - i.e. the compute node - will automatically be revoked.

An interactive parallel application running on one or more compute nodes (here 5 nodes with 96 cores each) usually requires a certain amount of memory (e.g. 50 GByte per node) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 

=== Scontrol show job Example ===
Here is an example from bwUniCluster 3.0.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 

= Slurm Options =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when the batch job starts depends on the availability of the requested resources and on the fair sharing value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted on after "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T14:53:06Z

P Schuhmacher: /* Check available resources */

= Purpose and function of a queuing system =

General procedure: Hint to [[Running_Calculations | Running Calculations]]

== Job submission process ==

The workload manager Slurm is installed on bwUniCluster 3.0. Therefore, every job submission by the user must be carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions:
* First, it allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* Finally, it arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a sequence of commands together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to the resource and workload manager.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place to run short, compute-intensive tasks without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the compute resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the resources now available to them.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the time, number of tasks and memory that apply if they are not explicitly given with the sbatch command. The resource options ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].
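
As a worked example of how a default applies: if neither ''--mem'' nor ''--mem-per-cpu'' is given, the memory of a job can be estimated as the number of allocated CPUs times the default mem-per-cpu of the queue. The following is a minimal sketch in plain shell arithmetic. The function name <code>default_mem_mb</code> is hypothetical, and the number of allocated CPUs is an assumption you have to supply yourself (with hyperthreading, one core counts as two CPUs).

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): estimate the default memory in MB
# of a job from the number of allocated CPUs and the queue's mem-per-cpu.
default_mem_mb() {
    local cpus=$1 mem_per_cpu_mb=$2
    echo $(( cpus * mem_per_cpu_mb ))
}

# 4 CPUs in queue cpu (default mem-per-cpu=1125mb)
default_mem_mb 4 1125

# 128 CPUs (64 cores x 2 threads) in queue cpu_il (default mem-per-cpu=1950mb)
default_mem_mb 128 1950
```

The second call yields 249600 MB, which equals the mem maximum listed for <code>cpu_il</code> in Table 1.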

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
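
The output format shown above can be filtered to find the partitions that currently have idle nodes. The following is a minimal sketch that assumes the line layout of the example output ("Partition <name> : <count> nodes idle"); the function name <code>idle_partitions</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read sinfo_t_idle output on stdin and print the names
# of all partitions that report at least one idle node.
# Assumed line layout: "Partition <name> : <count> nodes idle"
idle_partitions() {
    awk '$1 == "Partition" && $4 > 0 { print $2 }'
}

# Usage on the cluster (not executed here):
#   sinfo_t_idle | idle_partitions
```

Applied to the example output above, this would list dev_cpu, cpu, highmem, gpu_h100 and gpu_mi300.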
 

= Running Jobs =
== Batch Jobs: sbatch ==

To run your batch job on one of the cpu nodes, please use:

<pre>
$ sbatch --partition=dev_cpu
or
$ sbatch -p dev_cpu
</pre>
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.
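
The value <code>-t 120</code> above is interpreted as minutes. According to the sbatch/salloc manual pages, Slurm accepts the time formats "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". The following sketch converts such a value into seconds to make the difference between, for example, <code>120</code> and <code>10:00</code> explicit. The function name <code>slurm_time_to_seconds</code> is hypothetical and it performs no input validation.

```shell
#!/bin/bash
# Hypothetical helper: convert a Slurm time specification into seconds.
# Supported formats: minutes, minutes:seconds, hours:minutes:seconds,
# days-hours, days-hours:minutes, days-hours:minutes:seconds
slurm_time_to_seconds() {
    local spec=$1 days=0 rest h=0 m=0 s=0
    if [[ $spec == *-* ]]; then
        # "days-..." form: fields after the dash are hours[:minutes[:seconds]]
        days=${spec%%-*}
        rest=${spec#*-}
        IFS=: read -r h m s <<< "$rest"
    else
        local f1 f2 f3
        IFS=: read -r f1 f2 f3 <<< "$spec"
        if [[ -n $f3 ]]; then
            h=$f1; m=$f2; s=$f3      # hours:minutes:seconds
        elif [[ -n $f2 ]]; then
            m=$f1; s=$f2             # minutes:seconds
        else
            m=$f1                    # minutes
        fi
    fi
    # 10# forces base 10 so that values such as "08" are not read as octal
    echo $(( 10#${days:-0} * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}

slurm_time_to_seconds 120        # 120 minutes
slurm_time_to_seconds 01:00:00   # 1 hour
```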

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.

An interactive parallel application may run on one compute node or on many compute nodes (here, e.g., 5 nodes) with 96 cores each. It usually requires a certain amount of memory in GByte (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.
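
To pick the job ID and node list of running jobs out of the squeue output, a small filter can be used. The following sketch assumes the default squeue column order (JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)) and job names without whitespace; the function name <code>running_jobs</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read default squeue output on stdin and print
# "<jobid> <nodelist>" for every job in state R (running).
# Assumes the default column order
#   JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
# and job names that contain no whitespace.
running_jobs() {
    awk 'NR > 1 && $5 == "R" { print $1, $8 }'
}

# Usage on the cluster (not executed here):
#   squeue | running_jobs
```

The two printed fields are the values to insert for <code>--jobid</code> and <code>--nodelist</code> in the srun commands above. Alternatively, the squeue option <code>-o</code> with the fields <code>%i</code> and <code>%N</code> prints job ID and node list directly.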

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==
= Slurm Options =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when the batch job actually starts depends on the availability of the requested resources and the fair sharing value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
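
Several of the options from the table can be combined in the header of a job script. The following is a minimal sketch, not an official template: the helper name <code>write_job_script</code>, the file names and all option values are placeholders chosen for illustration. The pattern <code>%j</code> in the output file names is replaced by Slurm with the job ID.

```shell
#!/bin/bash
# Hypothetical helper: write a minimal job script that uses several of the
# sbatch options from the table. All values are illustrative placeholders.
write_job_script() {
    local file=$1
    cat > "$file" <<'EOF'
#!/bin/bash
#SBATCH --partition=dev_cpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=10
#SBATCH --job-name=example
#SBATCH --output=example_%j.out
#SBATCH --error=example_%j.err

echo "Running on $(hostname)"
EOF
}

# Usage (not executed here):
#   write_job_script job_example.sh
#   sbatch job_example.sh
```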
 

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs/Slurm

2025-04-04T14:52:44Z

P Schuhmacher: /* Shows free resources : sinfo_t_idle */

{|style="background:#FEF4AB; width:100%;"
|style="padding:5px; background:#FEF4AB; text-align:left"|
This page is work in progress.
|}

= Slurm HPC Workload Manager =
== Specification ==
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
 
 
Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define calculations as a sequence of commands or single command together with required run time, number of CPU cores and main memory and submit all, i.e., the '''batch job''', to a resource and workload managing software. bwUniCluster 3.0 has installed the workload managing software Slurm. Therefore any job submission by the user is to be executed by commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.
 
 

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when the batch job actually starts depends on the availability of the requested resources and the fair sharing value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

== Interactive job : salloc ==

If you want to run an interactive job, you can do so with the command salloc on a login node. For example, for a serial application that requires 5000 MByte of memory on a compute node, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc --partition=cpu --ntasks=1 --time=120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute node. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.

You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.

An interactive parallel application may run on one compute node or on many compute nodes (here, e.g., 5 nodes), using 40 cores on each node. It usually requires a certain amount of memory in GByte (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc --partition=cpu --nodes=5 --ntasks-per-node=40 --time=01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 200 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 200 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

<div id="top"></div>
= Slurm HPC Workload Manager =
== Specification ==
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
 
 
Any kind of calculation on the compute nodes of [[bwUniCluster3.0|bwUniCluster 3.0]] requires the user to define calculations as a sequence of commands or single command together with required run time, number of CPU cores and main memory and submit all, i.e., the '''batch job''', to a resource and workload managing software. bwUniCluster 3.0 has installed the workload managing software Slurm. Therefore any job submission by the user is to be executed by commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.
 
 

== Slurm Commands (excerpt) ==
Some of the most used Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=750px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job Submission : sbatch|sbatch]] || Submits a job and queues it in an input queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job or requested resources [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job Submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when the batch job actually starts depends on the availability of the requested resources and the fair sharing value.
 
 
=== sbatch Command Parameters ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script.
{| width=750px class="wikitable"
! colspan="3" | sbatch Options
|-
! Command line
! Script
! Purpose
|- style="vertical-align:top;"
| -t ''time'' or --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N ''count'' or --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n ''count'' or --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node; the upper limit depends on the queue. (Replaces the option ppn of MOAB.)
|-
|- style="vertical-align:top;"
| -c ''count'' or --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (Default value is 128000 and 96000 MB resp., i.e. you should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J ''name'' or --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added.
|-
|- style="vertical-align:top;"
| -A ''group-name'' or --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p ''queue-name'' or --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C ''LSDF'' or --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF Filesystems.
|-
|- style="vertical-align:top;"
| -C ''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)'' or --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND file system.
|-
|}
 

==== sbatch --partition ''queues'' ====
Queue classes define the maximum resources of the compute system, such as walltime, nodes and processes per node, for each queue. Details can be found here:
* [[BwUniCluster3.0/Batch_Queues|bwUniCluster 3.0 queue settings]]
 

=== sbatch Examples ===
==== Serial Programs ====
To submit a serial job that runs the script '''job.sh''' and that requires 5000 MB of main memory and 10 minutes of wall clock time

a) execute:
<pre>
$ sbatch -p dev_cpu -n 1 -t 10:00 --mem=5000 job.sh
</pre>
or
b) add after the initial line of your script '''job.sh''' the lines (here with a high memory request):
<source lang="bash">
#SBATCH --ntasks=1
#SBATCH --time=10
#SBATCH --mem=180gb
#SBATCH --job-name=simple
</source>
and execute the modified script with the command line option ''--partition=highmem'':
<pre>
$ sbatch --partition=highmem job.sh
</pre>
Note that sbatch command line options overrule script options.
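
Because command line options overrule script options, it can help to check which options a job script already sets before adding more on the command line. The following is a minimal sketch; the function name <code>list_sbatch_directives</code> is hypothetical. As a simplification it prints every line that starts with <code>#SBATCH</code>, whereas sbatch itself only evaluates such lines up to the first executable command in the script.

```shell
#!/bin/bash
# Hypothetical helper: print the options set via "#SBATCH" lines in a job
# script, one per line, without the "#SBATCH" prefix.
# Simplification: all "#SBATCH" lines are printed, including any that appear
# after the first executable command (sbatch would ignore those).
list_sbatch_directives() {
    sed -n 's/^#SBATCH[[:space:]]\{1,\}//p' "$1"
}

# Usage (not executed here):
#   list_sbatch_directives job.sh
```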
 
 

==== Multithreaded Programs ====
Multithreaded programs operate faster than serial programs on CPUs with multiple cores. 
Moreover, multiple threads of one process share resources such as memory.
 
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''Hyperthreading is switched on on bwUniCluster 3.0. If you do not want to use it, set the option --threads-per-core to 1.'''
 
To submit a batch job called ''OpenMP_Test'' that runs a 96-fold threaded program ''omp_exe'' which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:
 
a) execute:
<pre>
$ sbatch -p cpu --export=ALL,OMP_NUM_THREADS=96 -J OpenMP_Test -N 1 -c 96 --threads-per-core=1 -t 40 --mem=6000 --wrap="./omp_exe"
</pre>
or

b) generate the script '''job_omp.sh''' containing the following lines:
<source lang="bash">
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=96
#SBATCH --time=40:00
#SBATCH --threads-per-core=1
#SBATCH --mem=6000mb
#SBATCH --export=ALL,EXECUTABLE=./omp_exe
#SBATCH -J OpenMP_Test

#Usually you should set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

export OMP_NUM_THREADS=${SLURM_JOB_CPUS_PER_NODE}
echo "Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe
</source>
With the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If necessary, add a module load line for the required modulefile to the script to enable the OpenMP environment. Then execute the script '''job_omp.sh''', adding the queue class ''cpu'' as sbatch option:
<pre>
$ sbatch -p cpu job_omp.sh
</pre>
Note that sbatch command line options overrule script options, e.g.,
<pre>
$ sbatch --partition=cpu --mem=200 job_omp.sh
</pre>
overrides the script setting of 6000 MByte with 200 MByte.
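
As an alternative way to derive the thread count inside a job script, the Slurm variable SLURM_CPUS_PER_TASK can be used; Slurm sets it when ''--cpus-per-task'' is specified. This is a sketch under that assumption, not the setting prescribed by the script above. The function name <code>omp_threads</code> is hypothetical, and the fallback to 1 mirrors the OpenMP default mentioned in this section.

```shell
#!/bin/bash
# Hypothetical helper: choose the OpenMP thread count from the Slurm
# environment. SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is
# given; fall back to 1 thread when it is not set.
omp_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS=$(omp_threads)
echo "Using ${OMP_NUM_THREADS} OpenMP threads"
```

Unlike SLURM_JOB_CPUS_PER_NODE, which can hold a compound value such as <code>96(x2)</code> for jobs spanning several nodes, SLURM_CPUS_PER_TASK is always a plain number.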
 
 

==== MPI Parallel Programs ====
MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., '''MPI tasks''', run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
 
Multiple MPI tasks must be launched via '''mpirun''', e.g. 4 MPI tasks of ''my_par_program'':
<pre>
$ mpirun -n 4 my_par_program
</pre>
This command runs 4 MPI tasks of ''my_par_program'' on the node you are logged in to.
To run this command with a loaded Intel MPI, the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset first (<code>unset I_MPI_HYDRA_BOOTSTRAP</code>).

When MPI parallel programs are run in a batch job, the interactive environment, in particular the loaded modules, is also set in the batch job. If you want a defined module environment in your batch job, purge all modules before loading the desired ones.
 
 
===== OpenMPI =====

If you want to run jobs on batch nodes, generate a wrapper script ''job_ompi.sh'' for '''OpenMPI''' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when using the module environment for OpenMPI
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/openmpi/<placeholder_for_mpi_version>
mpirun --bind-to core --map-by core -report-bindings my_par_program
</source>
'''Attention:''' Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about the number of processes and the node hostnames. '''ALWAYS''' use the MPI options '''''--bind-to core''''' and '''''--map-by core|socket|node'''''. Type ''mpirun --help'' for an explanation of the different values of the mpirun option ''--map-by''.
 
To run 4 OpenMPI tasks on a single node, each requiring 2000 MByte, for 1 hour, execute:
<pre>
$ sbatch -p cpu -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh
</pre>
 

===== Intel MPI =====

Generate a wrapper script ''job_impi.sh'' for '''Intel MPI''' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when a defined module environment related to Intel MPI is wished
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/impi/<placeholder_for_version>
mpiexec.hydra -bootstrap slurm my_par_program
</source>
'''Attention:''' 
Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.
 
To launch 200 Intel MPI tasks on 5 nodes (40 tasks per node), with 80 GByte of memory per node and a run time of 5 hours, execute:
<pre>
$ sbatch --partition=cpu -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh
</pre>
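
The examples on this page give the wall time both as plain minutes (''-t 300'', ''--time=60'') and as ''hours:minutes:seconds'' (''--time=01:00:00''). The following minimal sketch shows the conversion; ''minutes_to_hms'' is a hypothetical helper written for this illustration and is not part of Slurm.

```bash
#!/bin/bash
# minutes_to_hms is a hypothetical helper for illustration, not a Slurm tool.
# It converts a wall time given in minutes into hours:minutes:seconds.
minutes_to_hms() {
    local minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

minutes_to_hms 300   # the 5 hour Intel MPI example (-t 300)
minutes_to_hms 60    # a 1 hour job (--time=60)
```

Both notations request the same limit, so ''-t 300'' and ''-t 05:00:00'' are interchangeable.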
 
If you want to use 128 or more nodes, you must also set the following environment variable:
<code>export I_MPI_HYDRA_BRANCH_COUNT=-1</code>
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
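
The rule for large jobs can be sketched as a small function for the job script. This is an illustration only: ''set_impi_branch_count'' is a hypothetical helper, and the fallback to 1 node is an assumption so that the snippet also runs outside of a Slurm job.

```bash
#!/bin/bash
# set_impi_branch_count is a hypothetical helper, not part of Intel MPI or Slurm.
# It exports I_MPI_HYDRA_BRANCH_COUNT=-1 only for jobs with 128 or more nodes.
set_impi_branch_count() {
    # Node count: first argument, else SLURM_JOB_NUM_NODES, else 1 (assumption).
    local num_nodes=${1:-${SLURM_JOB_NUM_NODES:-1}}
    if [ "${num_nodes}" -ge 128 ]; then
        export I_MPI_HYDRA_BRANCH_COUNT=-1
    fi
}

set_impi_branch_count
echo "I_MPI_HYDRA_BRANCH_COUNT=${I_MPI_HYDRA_BRANCH_COUNT:-unset}"
```

Called without an argument inside a job, the function reads the node count from SLURM_JOB_NUM_NODES.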
 
 

==== Multithreaded + MPI parallel Programs ====
Multithreaded + MPI parallel programs run faster than serial programs on systems with multiple CPUs and multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory, but they can be distributed over different nodes. '''Hyperthreading is enabled on bwUniCluster 3.0; if you do not want to use it, set the option --threads-per-core to 1.'''
 
 
===== OpenMPI with Multithreading =====
Multiple MPI tasks using '''OpenMPI''' must be launched with the MPI launcher '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default, this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''For OpenMPI''', the following job script ''job_ompi_omp.sh'' runs the MPI program ''ompi_omp_program'' with 4 tasks and 28 threads per task. Each thread requires 3000 MByte of physical memory, i.e. 28 * 3000 MByte = 84000 MByte per MPI task, and the total wall clock time is 3 hours:

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=28
#SBATCH --threads-per-core=1
#SBATCH --time=03:00:00
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up to 83 GB
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is desired
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
# Arithmetic expansion is needed to actually compute the product
export NUM_CORES=$(( SLURM_NTASKS * OMP_NUM_THREADS ))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Execute the script '''job_ompi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p cpu ./job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
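* With the option ''--map-by node:PE=<value>'', neighboring MPI tasks are placed on different nodes and each MPI task is bound to the first core of a node. <value> must be set to ${OMP_NUM_THREADS}. The script above uses ''--map-by socket:PE=<value>'', which distributes the MPI tasks over sockets instead.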
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program.
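
The memory request in the script above follows from simple arithmetic: 28 threads times 3000 MByte gives 84000 MByte per MPI task, which is rounded up to whole GByte. A minimal sketch of this calculation, using a hypothetical helper ''mem_per_task_gb'' that is not part of Slurm:

```bash
#!/bin/bash
# mem_per_task_gb is a hypothetical helper for illustration only.
# Arguments: threads per MPI task, memory per thread in MByte.
# Prints the memory per MPI task in GByte, rounded up (1 GByte = 1024 MByte).
mem_per_task_gb() {
    local threads=$1 mb_per_thread=$2
    local total_mb=$(( threads * mb_per_thread ))
    echo $(( (total_mb + 1023) / 1024 ))
}

mem_per_task_gb 28 3000   # values from the OpenMPI example above
```

For the example values the function prints 83, which matches the ''--mem=83gb'' line of the job script.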
 

===== Intel MPI with Multithreading =====

Multiple Intel MPI tasks must be launched with the MPI launcher '''mpiexec.hydra'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default, this variable is set to 1 (OMP_NUM_THREADS=1).

'''For Intel MPI''', the following job script ''job_impi_omp.sh'' runs the Intel MPI program ''impi_omp_program'' with 10 tasks and 96 threads per task. It requires 96000 MByte of total physical memory per task and a total wall clock time of 1 hour:


<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=96
#SBATCH --threads-per-core=1
#SBATCH --time=60
#SBATCH --mem=96000
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program
#SBATCH --output="parprog_impi_omp_%j.out"

#If using more than one MPI task per node please set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,scatter prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

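# Use when a defined module environment related to Intel MPI is desired
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# No nested double quotes: the value is split into words when ${MPIRUN_OPTIONS} is expanded
export MPIRUN_OPTIONS="-binding domain=omp:compact -print-rank-map -envall"
# Arithmetic expansion computes the total number of cores
export NUM_PROCS=$(( SLURM_NTASKS * OMP_NUM_THREADS ))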
echo "${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}"
echo $startexe
exec $startexe
</source>
With the Intel compiler, the environment variable KMP_AFFINITY switches on the binding of threads to specific cores. If you only run one MPI task per node please set KMP_AFFINITY=compact,1,0.
 
If you want to use 128 or more nodes, you must also set the following environment variable:
<code>export I_MPI_HYDRA_BRANCH_COUNT=-1</code>
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
 
 
Execute the script '''job_impi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p cpu ./job_impi_omp.sh
</pre>
 
The option ''-print-rank-map'' shows the mapping of MPI tasks to nodes (of limited use). The option ''-binding'' binds MPI tasks (processes) to particular processors; ''domain=omp'' means that the domain size is determined by the number of threads. If you choose 2 MPI tasks per node, use ''-binding "cell=unit;map=bunch"''; this binding maps one MPI process to each socket.
 
 

==== Chain jobs ====
The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem by a job chain. A job chain is a sequence of jobs where each job automatically starts its successor.
<source lang="bash">
#!/bin/bash
####################################
## simple Slurm submitter script to setup ##
## a chain of jobs using Slurm ##
####################################
## ver. : 2018-11-27, KIT, SCC

## Define maximum number of jobs via positional parameter 1, default is 5
max_nojob=${1:-5}

## Define your jobscript (e.g. "~/chain_job.sh")
chain_link_job=${PWD}/chain_job.sh

## Define type of dependency via positional parameter 2, default is 'afterok'
dep_type="${2:-afterok}"
## -> List of all dependencies:
## https://slurm.schedmd.com/sbatch.html

myloop_counter=1
## Submit loop
while [ ${myloop_counter} -le ${max_nojob} ] ; do
##
## Differ slurm_opt depending on chain link number
if [ ${myloop_counter} -eq 1 ] ; then
slurm_opt=""
else
slurm_opt="-d ${dep_type}:${jobID}"
fi
##
## Print current iteration number and sbatch command
echo "Chain job iteration = ${myloop_counter}"
echo " sbatch --export=myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}"
## Store job ID for next iteration: sed strips the leading text "Submitted batch job " from the sbatch output
jobID=$(sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2>&1 | sed 's/[S,a-z]* //g')
##
## Check if ERROR occurred
if [[ "${jobID}" =~ "ERROR" ]] ; then
echo " -> submission failed!" ; exit 1
else
echo " -> job number = ${jobID}"
fi
##
## Increase counter
let myloop_counter+=1
done
</source>
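
As a variation, the job ID can be read without text filtering by using the sbatch option ''--parsable'', which prints only the job ID (followed by '';cluster'' if a cluster name is present). The following is a minimal sketch under that assumption, not a replacement for the script above; ''submit_chain'' is a hypothetical function name, and the partition and job script are passed as arguments.

```bash
#!/bin/bash
# submit_chain is a hypothetical helper for illustration, not part of Slurm.
# Usage: submit_chain <partition> <jobscript> [max_jobs] [dependency_type]
submit_chain() {
    local partition=$1 script=$2 max_jobs=${3:-5} dep_type=${4:-afterok}
    local job_id="" i
    local dep_opt
    for (( i = 1; i <= max_jobs; i++ )); do
        # The first chain link has no dependency; later links wait for their predecessor.
        dep_opt=()
        if [ -n "${job_id}" ]; then
            dep_opt=(--dependency="${dep_type}:${job_id}")
        fi
        # The exit status of sbatch tells us whether the submission worked.
        if ! job_id=$(sbatch --parsable -p "${partition}" \
                --export=ALL,myloop_counter="${i}" "${dep_opt[@]}" "${script}"); then
            echo "submission of chain link ${i} failed" >&2
            return 1
        fi
        job_id=${job_id%%;*}   # drop an optional ";cluster" suffix
        echo "chain link ${i}: job ${job_id}"
    done
}

# Example call (commented out, it would submit real jobs):
# submit_chain cpu "${PWD}/chain_job.sh" 5 afterok
```

Checking the exit status of sbatch avoids searching its output for an error string.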
 

==== GPU jobs ====

The nodes in the gpu_a100_il, gpu_h100 and gpu_h100_il queues have 4 NVIDIA Ampere A100 or 4 NVIDIA Hopper H100 GPUs; the node in the gpu_mi300 queue has 4 AMD Instinct accelerators. Submitting a job to these queues is not enough to allocate one or more GPUs; you have to request them with the "--gres=gpu" parameter and specify how many GPUs your job needs, e.g. "--gres=gpu:2" requests two GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) Add the line with the GPU request, #SBATCH --gres=gpu:2, after the first line of your script job.sh:
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh
</pre>
 
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Fri Apr 4 09:51:29 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 On | 00000000:06:00.0 Off | 0 |
| N/A 45C P0 70W / 415W | 27MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 On | 00000000:26:00.0 Off | 0 |
| N/A 45C P0 69W / 415W | 27MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3569 G /usr/libexec/Xorg 17MiB |
| 1 N/A N/A 3569 G /usr/libexec/Xorg 17MiB |
+-----------------------------------------------------------------------------------------+
</pre>

 
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
Please run Open MPI's mpirun using the following command:
<pre>
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
</pre>
or by disabling the (older) openib component of the Byte Transfer Layer (BTL) communication layer altogether:
<pre>
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
</pre>

 
 

==== LSDF Online Storage ====
On bwUniCluster 3.0 you can, for special cases, use the LSDF Online Storage on the HPC cluster nodes. Please request this service separately ([https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request]).
To mount the LSDF Online Storage on the compute nodes during the job runtime, the constraint flag "LSDF" has to be set.

a) Add the line that requests the LSDF Online Storage, #SBATCH --constraint=LSDF, after the first line of your script job.sh:
<pre>
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=120
#SBATCH --mem=200
#SBATCH --constraint=LSDF
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh
</pre>
 
For the usage of the LSDF Online Storage
the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
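
A job script can check that the storage is actually available before using it. The following is a minimal sketch; ''lsdf_ready'' is a hypothetical helper written for this illustration, and it only assumes that the variables named above contain directory paths.

```bash
#!/bin/bash
# lsdf_ready is a hypothetical helper, not provided by the cluster.
# It succeeds only if the named variable is set and points to an existing directory.
lsdf_ready() {
    local var_name=$1
    local path=${!var_name:-}   # indirect expansion: value of the variable named in $1
    [ -n "${path}" ] && [ -d "${path}" ]
}

if lsdf_ready LSDFPROJECTS; then
    echo "LSDF project storage available at ${LSDFPROJECTS}"
else
    echo "LSDFPROJECTS is not available; was --constraint=LSDF requested?" >&2
fi
```

The same check works for $LSDF and $LSDFHOME by passing the respective variable name.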
 
 

====BeeOND (BeeGFS On-Demand)====

BeeOND instances are integrated into the prolog and epilog scripts of the cluster batch system Slurm. They can be used on exclusively allocated compute nodes during the job runtime with the constraint flag "BEEOND", "BEEOND_4MDS" or "BEEOND_MAXMDS" ([[BwUniCluster_2.0_Slurm_common_Features#sbatch_Command_Parameters|Slurm Command Parameters]]).
* BEEOND: one metadata server is started on the first node
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.
* BEEOND_MAXMDS: a metadata server for the on-demand file system is started on every node of your job

As a starting point we recommend using the "BEEOND" option. If you are unsure whether this is sufficient for you, feel free to contact the support team.
<source lang="bash">
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=BEEOND # or BEEOND_4MDS or BEEOND_MAXMDS
</source>

After your job has started you can find the private on-demand file system in '''/mnt/odfs/${SLURM_JOB_ID}''' directory. The mountpoint comes with five pre-configured directories:
<source lang="bash">
# For small files (stripe count = 1)
/mnt/odfs/${SLURM_JOB_ID}/stripe_1
# Stripe count = 4
/mnt/odfs/${SLURM_JOB_ID}/stripe_default
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_4
# Stripe count = 8, 16 or 32: use these directories for medium-sized and large files or when using MPI-IO
/mnt/odfs/${SLURM_JOB_ID}/stripe_8
/mnt/odfs/${SLURM_JOB_ID}/stripe_16
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_32
</source>

If you request fewer nodes than the stripe count, the stripe count is reduced to the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 only has a stripe count of 8.

; '''Attention:''' 
:Be careful when creating large files: always use the directory with the maximum stripe count for large files.
:In general, the larger the file, the higher the stripe count should be. For example, if your largest file is 1.1 TByte, you have to use a stripe count larger than 2,
:otherwise the available disk space is exceeded.

The capacity of the private file system depends on the number of nodes. For each node you get 750 GByte.
If you request 100 nodes for your job, the private file system has a capacity of 100 * 750 GByte = 75000 GByte, i.e. approximately 75 TByte.
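
The two rules above (750 GByte per node, and a stripe count that cannot exceed the number of nodes) can be sketched as follows. The function names are hypothetical and serve only as an illustration of the arithmetic.

```bash
#!/bin/bash
# Hypothetical helpers for illustration; they are not part of BeeOND or Slurm.

# Capacity of the private file system in GByte: 750 GByte per node.
beeond_capacity_gb() {
    local nodes=$1
    echo $(( nodes * 750 ))
}

# Effective stripe count: the requested count, limited by the number of nodes.
effective_stripe_count() {
    local requested=$1 nodes=$2
    if [ "${nodes}" -lt "${requested}" ]; then
        echo "${nodes}"
    else
        echo "${requested}"
    fi
}

beeond_capacity_gb 100        # 100 nodes
effective_stripe_count 16 8   # directory stripe_16 in a job with 8 nodes
```

For 100 nodes the first call prints 75000 (GByte); the second call prints 8, matching the stripe_16 example above.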

== Start time of job or resources : squeue --start ==
The command ''squeue --start'' can be used by any user to display the estimated start time of a job. The estimate is based on a number of analysis types: historical usage, earliest available reservable resources, and priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by '''any user'''.
 
 

== List of your submitted jobs : squeue ==
The command ''squeue'' displays information about YOUR active, pending and/or recently completed jobs. It is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by any user.
 
 

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -l, --long
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.
|}
 

=== Examples ===
''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
 

== Shows free resources : sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
 

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 
=== Access ===
* End users can use scontrol show job to view the status of their '''own jobs''' only.
 

=== Arguments ===
{| width=750px class="wikitable"
|-
! Option !! Default !! Description !! Example
|- style="vertical-align:top;"
|- style="width:12%;"
| -d
| (n/a)
| Detailed mode
| Example: Display the state with jobid 18089884 in detailed mode. <pre>scontrol -d show job 18089884</pre>
|}
 
 

=== Scontrol show job Example ===
Here is an example from bwUniCluster 3.0.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
* Which state is the job in?
<pre>$ scontrol show job 1262 | grep -i State
JobState=RUNNING Reason=None Dependency=(null)
</pre>
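
To extract a single value instead of a whole line, the ''key=value'' pairs can be split up first. The following is a minimal sketch; ''job_field'' is a hypothetical helper, and it assumes the output format shown in the example above. Values that contain spaces are cut at the first space.

```bash
#!/bin/bash
# job_field is a hypothetical helper for illustration, not part of Slurm.
# It reads "scontrol show job" style text on stdin and prints the value of one key.
job_field() {
    local key=$1
    # Put every key=value pair on its own line, then pick the requested key.
    tr ' ' '\n' | awk -F= -v k="${key}" '$1 == k { print substr($0, length(k) + 2); exit }'
}

# Self-contained demonstration with two lines copied from the example output:
printf '%s\n' \
    'JobState=RUNNING Reason=None Dependency=(null)' \
    'RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A' | job_field TimeLimit
```

On the cluster, the function would be used as ''scontrol show job <jobid> | job_field JobState''.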
 

== Cancel Slurm Jobs ==
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
 
 
=== Canceling own jobs : scancel ===
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

 
{| class="wikitable"
! Flag !! Default !! Description !! Example
|- style="vertical-align:top;"
| -i, --interactive
| (n/a)
| Interactive mode.
| Cancel the job 987654 interactively. <pre> scancel -i 987654 </pre>
|-
| -t, --state
| (n/a)
| Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
| Cancel all jobs in state "PENDING". <pre> scancel -t "PENDING" </pre>
|}
 

= Resource Managers =
=== Batch Job (Slurm) Variables ===
The following environment variables of Slurm are added to your environment once your job has started
(only an excerpt of the most important ones).
{| width=750px class="wikitable"
! Environment !! Brief explanation
|-
| SLURM_JOB_CPUS_PER_NODE
| Number of CPUs per node dedicated to the job
|-
| SLURM_JOB_NODELIST
| List of nodes dedicated to the job
|-
| SLURM_JOB_NUM_NODES
| Number of nodes dedicated to the job
|-
| SLURM_MEM_PER_NODE
| Memory per node dedicated to the job
|-
| SLURM_NPROCS
| Total number of processes dedicated to the job
|-
| SLURM_CLUSTER_NAME
| Name of the cluster executing the job
|-
| SLURM_CPUS_PER_TASK
| Number of CPUs requested per task
|-
| SLURM_JOB_ACCOUNT
| Account name
|-
| SLURM_JOB_ID
| Job ID
|-
| SLURM_JOB_NAME
| Job Name
|-
| SLURM_JOB_PARTITION
| Partition/queue running the job
|-
| SLURM_JOB_UID
| User ID of the job's owner
|-
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which sbatch was invoked.
|-
| SLURM_JOB_USER
| User name of the job's owner
|-
| SLURM_RESTART_COUNT
| Number of times job has restarted
|-
| SLURM_PROCID
| Task ID (MPI rank)
|-
| SLURM_NTASKS
| The total number of tasks available for the job
|-
| SLURM_STEP_ID
| Job step ID
|-
| SLURM_STEP_NUM_TASKS
| Task count (number of MPI ranks)
|-
| SLURM_JOB_CONSTRAINT
| Job constraints
|}
See also:
* [https://slurm.schedmd.com/sbatch.html#SECTION_INPUT-ENVIRONMENT-VARIABLES Slurm input and output environment variables]
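
A job script can log some of these variables at start-up, which helps when reading the output file later. The following is a minimal sketch; ''print_job_summary'' is a hypothetical function name, and the fallback text ''n/a'' is an assumption so that the snippet also runs outside of a Slurm job.

```bash
#!/bin/bash
# print_job_summary is a hypothetical helper for illustration.
# Outside of a Slurm job the variables are unset and "n/a" is printed instead.
print_job_summary() {
    echo "job ${SLURM_JOB_ID:-n/a} (${SLURM_JOB_NAME:-n/a}) in partition ${SLURM_JOB_PARTITION:-n/a}"
    echo "nodes: ${SLURM_JOB_NUM_NODES:-n/a} (${SLURM_JOB_NODELIST:-n/a}), tasks: ${SLURM_NTASKS:-n/a}"
    echo "submitted from: ${SLURM_SUBMIT_DIR:-n/a}"
}

print_job_summary
```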
 

=== Job Exit Codes ===
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
 
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
 
 
==== Displaying Exit Codes and Signals ====
SLURM displays a job's exit code in the output of the '''scontrol show job''' and the sview utility.
 
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon(:).
 
 
==== Submitting Termination Signal ====
Here is an example of how to 'save' a Slurm termination signal in a typical job script.
<source lang="bash">
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
exit_code=$?   # read the exit code immediately after mpirun
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
</source>
* Do not use ''''time'''' mpirun! The exit code will be the one submitted by the first (time) program.
* You do not need an '''exit $exit_code''' in the scripts.
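
The pattern can also be wrapped in a function. The following is a minimal sketch with a hypothetical helper ''run_and_report''; it relies on the shell convention that a command terminated by signal N is reported with the exit status 128 + N.

```bash
#!/bin/bash
# run_and_report is a hypothetical helper for illustration.
# It runs the given command, reports how it ended and returns its exit code.
run_and_report() {
    "$@"
    local exit_code=$?   # must be read directly after the command
    if [ "${exit_code}" -eq 0 ]; then
        echo "all clean..."
    elif [ "${exit_code}" -gt 128 ]; then
        echo "terminated by signal $(( exit_code - 128 ))"
    else
        echo "finished with exit code ${exit_code}"
    fi
    return "${exit_code}"
}

run_and_report true
```

In a job script the call would wrap the actual program start, e.g. the mpirun line.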
 
 
----

BwUniCluster3.0/Running Jobs

2025-04-04T14:51:17Z

P Schuhmacher: /* Check available resources */

= Purpose and function of a queuing system =

For the general procedure, see [[Running_Calculations | Running Calculations]].

== Job submission process ==

bwUniCluster 3.0 uses the workload manager Slurm. Therefore, every job must be submitted with the commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.

== Slurm ==

The HPC workload manager on bwUniCluster 3.0 is Slurm.
Slurm is a cluster management and job scheduling system. Slurm has three key functions:
* First, it allocates access to resources (compute cores on nodes) to users for some duration of time so they can perform work.
* Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
* Third, it arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define calculations as a sequence of commands together with required run time, number of CPU cores and main memory and submit all, i.e., the '''batch job''', to a resource and workload managing software.
== Terms and definitions ==

''' Partitions '''

Slurm manages job queues for different '''partitions'''. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.

On bwUniCluster 3.0 there are different partitions:

* CPU-only nodes
** 2-socket nodes, consisting of 2 Intel Ice Lake processors with 32 cores each or 2 AMD processors with 48 cores each
** 2-socket nodes with very high RAM capacity, consisting of 2 AMD processors with 48 cores each
* GPU-accelerated nodes
** 2-socket nodes with 4x NVIDIA A100 or 4x NVIDIA H100 GPUs
** 4-socket node with 4x AMD Instinct accelerator

''' Queues '''

Job '''queues''' are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).

On bwUniCluster 3.0 there are different main types of queues:
* Regular queues
** cpu: Jobs that request CPU-only nodes.
** gpu: Jobs that request GPU-accelerated nodes.
* Development queues (dev)
** Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to computer resources without having to wait. This is the place to realize instantaneous heavy compute without affecting other users, as would be the case on the login nodes.

Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 3.0 requires at least the specification of the '''queue''' and the '''time'''.

''' Jobs '''

Jobs can be run non-interactively as '''batch jobs''' or as '''interactive jobs'''.
Submitting a batch job means that all steps of a compute project are defined in a Bash script. This Bash script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the ''sbatch'' command.

For interactive jobs, the resources are requested with the ''salloc'' command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
'''Please remember:'''
* '''Heavy computations are not allowed on the login nodes'''. Use a development or a regular job queue instead! Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
* '''Development queues''' are meant for '''development tasks'''. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is enabled. Maximum queue length is reduced to 3.
|}

= Queues on bwUniCluster 3.0 =
== Policy ==

== Regular Queues ==
{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node-Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=72:00:00, nodes=30, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=20, mem=380000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=72:00:00, nodes=4, mem=2300000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=12, mem=760000mb, ntasks-per-node=96, (threads-per-core=2)
|-
| <code>gpu_mi300</code>
| GPU node AMD GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=72:00:00, nodes=1, mem=510000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>gpu_a100_il</code>/<code>gpu_h100_il</code>
| GPU nodes Ice Lake NVIDIA GPU x4
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=72:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 1: Regular Queues
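
A request can be checked against the node limits of Table 1 before submission. The following is a minimal sketch; the function names are hypothetical, and the limits are copied from the table above, so they have to be updated whenever the queue configuration changes.

```bash
#!/bin/bash
# Hypothetical helpers for illustration; the values are copied from Table 1.

# Print the maximum number of nodes for a regular queue, fail for unknown queues.
max_nodes_for_queue() {
    case "$1" in
        cpu_il)                  echo 30 ;;
        cpu)                     echo 20 ;;
        highmem)                 echo 4 ;;
        gpu_h100)                echo 12 ;;
        gpu_mi300)               echo 1 ;;
        gpu_a100_il|gpu_h100_il) echo 9 ;;
        *)                       return 1 ;;
    esac
}

# Return 0 if the node request fits, 1 if it is too large, 2 for an unknown queue.
nodes_within_limit() {
    local queue=$1 nodes=$2 max
    max=$(max_nodes_for_queue "${queue}") || return 2
    [ "${nodes}" -le "${max}" ]
}

if nodes_within_limit cpu 20; then
    echo "20 nodes fit into queue cpu"
fi
```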

== Development Queues ==
Only for development, i.e. debugging or performance optimization ...

{| class="wikitable"
|-
! style="width:5%"| Queue
! style="width:13%"| Node Type
! style="width:23%"| Default Resources
! style="width:13%"| Minimal Resources
! style="width:13%"| Maximum Resources
|-
| <code>dev_cpu_il</code>
| CPU nodes Ice Lake
| mem-per-cpu=1950mb
|
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|-
| <code>dev_cpu</code>
| CPU nodes Standard
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_highmem</code>
| CPU nodes High Memory
| mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_h100</code>
| GPU nodes NVIDIA GPU x4
| mem-per-cpu=1125mb cpus-per-gpu=24
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|-
| <code>dev_gpu_a100_il</code>
| GPU nodes NVIDIA GPU x4 
| mem-per-gpu=127500mb cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|}
Table 2: Development Queues

The default resources of a queue class define the time, the number of tasks and the memory if they are not explicitly given with the sbatch command. The resource list acronyms ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_3.0_Slurm_common_Features|here]].

== Check available resources ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Access ===
By default, this command can be used by any user or administrator.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
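* In the above example, jobs can start immediately only in the partitions that report idle nodes (dev_cpu, cpu, highmem, gpu_h100 and gpu_mi300); the other partitions currently have no idle nodes.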
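
The output format shown above lends itself to simple filtering. The following is a minimal sketch, assuming every line has the form <code>Partition <name> : <count> nodes idle</code>; the function name <code>idle_partitions</code> is hypothetical and not part of the SCC script.

```shell
#!/bin/bash
# Print the names of partitions that report at least one idle node.
# Assumed input format: "Partition <name> : <count> nodes idle"
idle_partitions() {
    awk '$1 == "Partition" && $4 > 0 { print $2 }'
}

# Sample input copied from the example output above; on the cluster one
# would pipe the real command instead: sinfo_t_idle | idle_partitions
sample_output='Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle'

printf '%s\n' "$sample_output" | idle_partitions
```

For the sample input this lists dev_cpu, cpu, highmem and gpu_h100, one per line.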
 

= Running Jobs =
== Batch Jobs: sbatch ==

To run your batch job on one of the cpu nodes, please use:

<pre>
$ sbatch --partition=dev_cpu <job_script>
or
$ sbatch -p dev_cpu <job_script>
</pre>
 

== Interactive Jobs: salloc ==

On bwUniCluster 3.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that request more than 8 GByte of memory, you must allocate resources for so-called interactive jobs with the command salloc on a login node. For a serial application that runs on a compute node, requires 5000 MByte of memory and is limited to 2 hours, execute the following command:

<pre>
$ salloc -p cpu -n 1 -t 120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.
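
A plain number given to <code>-t</code> is interpreted by Slurm as minutes, so <code>-t 120</code> in the example above means 2 hours. The following sketch converts minutes into the equivalent hours:minutes:seconds notation; the helper name <code>minutes_to_hms</code> is hypothetical.

```shell
#!/bin/bash
# Convert a number of minutes into HH:MM:SS, one of the time formats
# accepted by the --time option.
minutes_to_hms() {
    local minutes=$1
    printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}

minutes_to_hms 120   # the 2-hour limit from the salloc example
```

The result can be passed to salloc or sbatch in place of the plain minute count.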

You can also start now a graphical X11-terminal connecting you to the dedicated resource that is available for 2 hours. You can start it by the command:

<pre>
$ xterm
</pre>

Note that, once the walltime limit has been reached the resources - i.e. the compute node - will automatically be revoked.

An interactive parallel application running on one or several compute nodes (here e.g. 5 nodes) with 96 cores each usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc -p cpu -N 5 --ntasks-per-node=96 -t 01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 480 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.
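
To pick the node list of one particular job out of the squeue output, a small filter can help. This is a sketch only: the function name <code>nodelist_of_job</code> is hypothetical, the job IDs and node names in the sample are made up, and the input is assumed to consist of lines with the job ID followed by the node list.

```shell
#!/bin/bash
# Look up the node list of one job in "jobid nodelist" lines read from stdin.
nodelist_of_job() {
    local jobid=$1
    awk -v id="$jobid" '$1 == id { print $2 }'
}

# Sample lines with made-up job IDs and node names; on the cluster they
# could come from: squeue -u "$USER" -h -o "%i %N"
sample='1234567 uc3n[001-005]
1234568 uc3n042'

printf '%s\n' "$sample" | nodelist_of_job 1234567
```

The squeue options used in the comment are <code>-h</code> (no header) and <code>-o</code> with the format fields <code>%i</code> (job ID) and <code>%N</code> (node list).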

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 480 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

== Interactive Computing with Jupyter ==

== Monitor and manage jobs ==
= Slurm Options =

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jo/bs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, starting of batch job depends on the availability of the requested resources and the fair sharing value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
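
Several of the options from the table above can be combined in the header of a job script. The following is a minimal sketch, not a recommended configuration: the job name, the file names and the chosen values are placeholders, and the fallback values after <code>:-</code> are only there so that the script also runs outside of Slurm.

```shell
#!/bin/bash
#SBATCH --partition=dev_cpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=10
#SBATCH --job-name=options_demo
#SBATCH --output=options_demo_%j.out
#SBATCH --error=options_demo_%j.err
#SBATCH --mail-type=END,FAIL

# Slurm sets SLURM_* variables inside a job; the fallbacks after ":-" only
# serve to keep this sketch runnable outside of Slurm as well.
job_id="${SLURM_JOB_ID:-not_in_a_job}"
job_name="${SLURM_JOB_NAME:-options_demo}"
ntasks="${SLURM_NTASKS:-1}"

summary="job ${job_name} (id ${job_id}) runs ${ntasks} task(s) on ${HOSTNAME:-unknown}"
echo "${summary}"
```

Saved under a hypothetical name such as <code>options_demo.sh</code>, the script would be submitted with <code>sbatch options_demo.sh</code>; options given on the command line take precedence over the <code>#SBATCH</code> lines.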
 

= Best Practices =

== Step-by-Step example==

== Dos and Don'ts ==

BwUniCluster3.0/Running Jobs

2025-04-04T14:44:27Z

P Schuhmacher:

BwUniCluster3.0/Running Jobs/Slurm

2025-04-04T08:12:22Z

P Schuhmacher: /* Examples */

{|style="background:#FEF4AB; width:100%;"
|style="padding:5px; background:#FEF4AB; text-align:left"|
This page is work in progress.
|}

= Slurm HPC Workload Manager =
== Specification ==
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
 
 
Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define calculations as a sequence of commands or single command together with required run time, number of CPU cores and main memory and submit all, i.e., the '''batch job''', to a resource and workload managing software. bwUniCluster 3.0 has installed the workload managing software Slurm. Therefore any job submission by the user is to be executed by commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.
 
 

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jo/bs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, starting of batch job depends on the availability of the requested resources and the fair sharing value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
 

== Interactive job : salloc ==

If you want to run an interactive job, you can do so via the command salloc on a login node. Considering a serial application running on a compute node that requires 5000 MByte of memory and limiting the interactive run to 2 hours the following command has to be executed:

<pre>
$ salloc --partition=cpu --ntasks=1 --time=120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute node. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.

You can also start now a graphical X11-terminal connecting you to the dedicated resource that is available for 2 hours. You can start it by the command:

<pre>
$ xterm
</pre>

Note that, once the walltime limit has been reached the resources - i.e. the compute node - will automatically be revoked.

An interactive parallel application running on one or several compute nodes (here e.g. 5 nodes) with 40 cores each usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc --partition=cpu --nodes=5 --ntasks-per-node=40 --time=01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 200 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 200 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

<div id="top"></div>
= Slurm HPC Workload Manager =
== Specification ==
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
 
 
Any kind of calculation on the compute nodes of [[bwUniCluster3.0|bwUniCluster 3.0]] requires the user to define calculations as a sequence of commands or single command together with required run time, number of CPU cores and main memory and submit all, i.e., the '''batch job''', to a resource and workload managing software. bwUniCluster 3.0 has installed the workload managing software Slurm. Therefore any job submission by the user is to be executed by commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.
 
 

== Slurm Commands (excerpt) ==
Some of the most used Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=750px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job Submission : sbatch|sbatch]] || Submits a job and queues it in an input queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jo/bs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job or requested resources [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job Submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, starting of batch job depends on the availability of the requested resources and the fair sharing value.
 
 
=== sbatch Command Parameters ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script.
{| width=750px class="wikitable"
! colspan="3" | sbatch Options
|-
! Command line
! Script
! Purpose
|- style="vertical-align:top;"
| -t ''time'' or --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N ''count'' or --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n ''count'' or --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count (<= 28 and <= 40 resp.) of tasks per node. (Replaces the option ppn of MOAB.)
|-
|- style="vertical-align:top;"
| -c ''count'' or --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (Default value is 128000 and 96000 MB resp., i.e. you should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J ''name'' or --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added.
|-
|- style="vertical-align:top;"
| -A ''group-name'' or --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p ''queue-name'' or --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C ''LSDF'' or --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF Filesystems.
|-
|- style="vertical-align:top;"
| -C ''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)'' or --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND file system.
|-
|}
 

==== sbatch --partition ''queues'' ====
Queue classes define the maximum resources of each queue of the compute system, such as walltime, number of nodes and processes per node. Details can be found here:
* [[BwUniCluster3.0/Batch_Queues|bwUniCluster 3.0 queue settings]]
 

=== sbatch Examples ===
==== Serial Programs ====
To submit a serial job that runs the script '''job.sh''' and that requires 5000 MB of main memory and 10 minutes of wall clock time

a) execute:
<pre>
$ sbatch -p dev_cpu -n 1 -t 10:00 --mem=5000 job.sh
</pre>
or
b) add after the initial line of your script '''job.sh''' the lines (here with a high memory request):
<source lang="bash">
#SBATCH --ntasks=1
#SBATCH --time=10
#SBATCH --mem=180gb
#SBATCH --job-name=simple
</source>
and execute the modified script with the command line option ''--partition=highmem'':
<pre>
$ sbatch --partition=highmem job.sh
</pre>
Note that sbatch command line options override script options.
 
 

==== Multithreaded Programs ====
Multithreaded programs operate faster than serial programs on CPUs with multiple cores. 
Moreover, multiple threads of one process share resources such as memory.
 
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) a number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''Hyperthreading is switched on on bwUniCluster 3.0, the option --threads-per-core must be set to 1, if you do not want to use it.'''
 
To submit a batch job called ''OpenMP_Test'' that runs a 96-fold threaded program ''omp_exe'' which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:
 
a) execute:
<pre>
$ sbatch -p cpu --export=ALL,OMP_NUM_THREADS=96 -J OpenMP_Test -N 1 -c 96 --threads-per-core=1 -t 40 --mem=6000 --wrap="./omp_exe"
</pre>
or

b) generate the script '''job_omp.sh''' containing the following lines:
<source lang="bash">
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=96
#SBATCH --time=40:00
#SBATCH --threads-per-core=1
#SBATCH --mem=6000mb
#SBATCH --export=ALL,EXECUTABLE=./omp_exe
#SBATCH -J OpenMP_Test

#Usually you should set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

export OMP_NUM_THREADS=${SLURM_JOB_CPUS_PER_NODE}
echo "Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe
</source>
With the Intel compiler, the environment variable KMP_AFFINITY switches on the binding of threads to specific cores. If necessary, load the required modulefile in the script to enable the OpenMP environment. Then execute the script '''job_omp.sh''', adding the queue class ''cpu'' as sbatch option:
<pre>
$ sbatch -p cpu job_omp.sh
</pre>
Note that sbatch command line options override script options, e.g.,
<pre>
$ sbatch --partition=cpu --mem=200 job_omp.sh
</pre>
overrides the script setting of 6000 MByte with 200 MByte.
 
 

==== MPI Parallel Programs ====
MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., '''MPI tasks''', run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
 
Multiple MPI tasks must be launched via '''mpirun''', e.g. 4 MPI tasks of ''my_par_program'':
<pre>
$ mpirun -n 4 my_par_program
</pre>
This command runs 4 MPI tasks of ''my_par_program'' on the node you are logged in.
To run this command with a loaded Intel MPI module, the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset beforehand (<code>$ unset I_MPI_HYDRA_BOOTSTRAP</code>).

When an MPI parallel program runs in a batch job, the interactive environment of the submitting shell, in particular the loaded modules, is also set in the batch job. If you want a defined module environment in your batch job, you have to purge all modules before loading the desired ones.
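
The purge-then-load pattern described above could look as follows at the top of a job script. This is a sketch under assumptions: the module names <code>compiler/gnu</code> and <code>mpi/openmpi</code> are hypothetical defaults that must be replaced with modules available on the system, and the check for the <code>module</code> command is only there so that the fragment does not fail where no module system is installed.

```shell
#!/bin/bash
# Hypothetical module names; replace them with modules that exist on the system.
COMPILER_MODULE="${COMPILER_MODULE:-compiler/gnu}"
MPI_MODULE="${MPI_MODULE:-mpi/openmpi}"

if command -v module >/dev/null 2>&1; then
    module purge                                        # drop inherited modules
    module load "${COMPILER_MODULE}" "${MPI_MODULE}"    # load a defined set
    module list                                         # record it in the job output
    modules_reset=yes
else
    echo "module command not available; environment left unchanged" >&2
    modules_reset=no
fi
```

Listing the loaded modules in the job output makes it easier to reproduce a run later.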
 
 
===== OpenMPI =====

If you want to run jobs on batch nodes, generate a wrapper script ''job_ompi.sh'' for '''OpenMPI''' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when using the module environment for OpenMPI
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/openmpi/<placeholder_for_mpi_version>
mpirun --bind-to core --map-by core -report-bindings my_par_program
</source>
'''Attention:''' Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames. Use '''ALWAYS''' the MPI options '''''--bind-to core''''' and '''''--map-by core|socket|node'''''. Please type ''mpirun --help'' for an explanation of the meaning of the different options of mpirun option ''--map-by''.
 
Considering 4 OpenMPI tasks on a single node, each requiring 2000 MByte, and running for 1 hour, execute:
<pre>
$ sbatch -p cpu -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh
</pre>
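
The memory requested by the command above can be checked with a short calculation: 4 tasks with one CPU each and 2000 MByte per CPU. The sketch below ignores any site-specific rounding of allocations (for example to whole cores when hyperthreading is enabled); the function name <code>total_mem_mb</code> is hypothetical.

```shell
#!/bin/bash
# Total memory in MB implied by --ntasks, --cpus-per-task and --mem-per-cpu.
total_mem_mb() {
    local ntasks=$1 cpus_per_task=$2 mem_per_cpu_mb=$3
    echo $((ntasks * cpus_per_task * mem_per_cpu_mb))
}

# Values from the sbatch command above: -n 4, one CPU per task, --mem-per-cpu=2000
total_mem_mb 4 1 2000
```

For these values the script prints 8000, i.e. 8000 MByte on the single node.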
 

===== Intel MPI =====

Generate a wrapper script for '''Intel MPI''', ''job_impi.sh'' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when a defined module environment related to Intel MPI is wished
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/impi/<placeholder_for_version>
mpiexec.hydra -bootstrap slurm my_par_program
</source>
'''Attention:''' 
Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.
 
Launching and running 200 Intel MPI tasks on 5 nodes, each requiring 80 GByte, and running for 5 hours, execute:
<pre>
$ sbatch --partition=cpu -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh
</pre>
 
If you want to use 128 or more nodes, you must also set the following environment variable:
<pre>
export I_MPI_HYDRA_BRANCH_COUNT=-1
</pre>
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
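
The node-count rule above can be applied automatically inside a job script. The following sketch assumes that the number of allocated nodes is read from the Slurm variable <code>SLURM_JOB_NUM_NODES</code>; the function name <code>configure_impi_for_nodes</code> is hypothetical.

```shell
#!/bin/bash
# Set I_MPI_HYDRA_BRANCH_COUNT only for jobs with 128 or more nodes.
configure_impi_for_nodes() {
    local num_nodes=$1
    if [ "$num_nodes" -ge 128 ]; then
        export I_MPI_HYDRA_BRANCH_COUNT=-1
    fi
}

# Outside of a Slurm job the variable is unset, so fall back to 1 node.
configure_impi_for_nodes "${SLURM_JOB_NUM_NODES:-1}"
echo "I_MPI_HYDRA_BRANCH_COUNT=${I_MPI_HYDRA_BRANCH_COUNT:-unset}"
```

The call would be placed in the wrapper script before the <code>mpiexec.hydra</code> line.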
 
 

==== Multithreaded + MPI parallel Programs ====
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes. '''Hyperthreading is switched on on bwUniCluster 3.0, the option --threads-per-core must be set to 1, if you do not want to use it.'''
 
 
===== OpenMPI with Multithreading =====
Multiple MPI tasks using '''OpenMPI''' must be launched by the MPI parallel program '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''For OpenMPI''', a job script ''job_ompi_omp.sh'' that runs the MPI program ''ompi_omp_program'' with 4 tasks and 28 threads per task, requiring 3000 MByte of physical memory per thread (with 28 threads per MPI task this amounts to 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours, looks like:

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=28
#SBATCH --threads-per-core=1
#SBATCH --time=03:00:00
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up to 83 GB
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Submit the script '''job_ompi_omp.sh''' with the command sbatch:
<pre>
$ sbatch -p cpu ./job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'', MPI tasks and OpenMP threads are bound to physical cores.
* With the option ''--map-by node:PE=<value>'', neighboring MPI tasks are placed on different nodes and each MPI task is bound to the first core of a node. <value> must be set to ${OMP_NUM_THREADS}.
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun options '''--bind-to core''' and '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program.
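The memory arithmetic for the hybrid job can be checked with a few lines of shell. This is a minimal sketch; the function names ''mem_per_task_mb'' and ''mb_to_gb_ceil'' are hypothetical helpers introduced here for illustration only.

```bash
#!/bin/bash
# Memory needed by one MPI task: threads per task times memory per thread (MByte).
mem_per_task_mb() {
    local threads="$1" mb_per_thread="$2"
    echo $(( threads * mb_per_thread ))
}

# Convert MByte to GByte (1 GByte = 1024 MByte), rounding up to a whole number.
mb_to_gb_ceil() {
    local mb="$1"
    echo $(( (mb + 1023) / 1024 ))
}

task_mb=$(mem_per_task_mb 28 3000)
echo "per task: ${task_mb} MByte, request --mem=$(mb_to_gb_ceil "${task_mb}")gb"
```

Because ''--mem'' is a per-node value, the computed figure matches the request only when one MPI task runs per node; with several tasks per node the value has to be multiplied accordingly.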
 

===== Intel MPI with Multithreading =====

Multiple Intel MPI tasks must be launched by the MPI parallel program '''mpiexec.hydra'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

'''For Intel MPI''', a job script ''job_impi_omp.sh'' that runs the Intel MPI program ''impi_omp_program'' with 10 tasks and 96 threads per task, requiring 96000 MByte of total physical memory per task and a total wall clock time of 1 hour, looks like:


<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=96
#SBATCH --threads-per-core=1
#SBATCH --time=60
#SBATCH --mem=96000
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program
#SBATCH --output="parprog_impi_omp_%j.out"

#If using more than one MPI task per node please set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,scatter prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

# Use when a defined module environment related to Intel MPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="-binding domain=omp:compact -print-rank-map -envall"
export NUM_PROCS=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}"
echo $startexe
exec $startexe
</source>
With the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If you only run one MPI task per node, please set KMP_AFFINITY=compact,1,0.
 
If you want to use 128 or more nodes, you must also set the environment variable as follows: 
export I_MPI_HYDRA_BRANCH_COUNT=-1
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
 
 
Execute the script '''job_impi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p cpu ./job_impi_omp.sh
</pre>
 
The mpiexec.hydra option ''-print-rank-map'' shows the mapping between MPI tasks and nodes (not very beneficial). The option ''-binding'' binds MPI tasks (processes) to particular processors; ''domain=omp'' means that the domain size is determined by the number of threads. If you choose 2 MPI tasks per node, you should use ''-binding "cell=unit;map=bunch"''; this binding maps one MPI process to each socket.
 
 

==== Chain jobs ====
The CPU time requirements of many applications exceed the limits of the job classes. In such cases we recommend using a job chain. A job chain is a sequence of jobs where each job automatically starts its successor.
<source lang="bash">
#!/bin/bash
####################################
## simple Slurm submitter script to setup ##
## a chain of jobs using Slurm ##
####################################
## ver. : 2018-11-27, KIT, SCC

## Define maximum number of jobs via positional parameter 1, default is 5
max_nojob=${1:-5}

## Define your jobscript (e.g. "~/chain_job.sh")
chain_link_job=${PWD}/chain_job.sh

## Define type of dependency via positional parameter 2, default is 'afterok'
dep_type="${2:-afterok}"
## -> List of all dependencies:
## https://slurm.schedmd.com/sbatch.html

myloop_counter=1
## Submit loop
while [ ${myloop_counter} -le ${max_nojob} ] ; do
   ##
   ## Differ slurm_opt depending on chain link number
   if [ ${myloop_counter} -eq 1 ] ; then
      slurm_opt=""
   else
      slurm_opt="-d ${dep_type}:${jobID}"
   fi
   ##
   ## Print current iteration number and sbatch command
   echo "Chain job iteration = ${myloop_counter}"
   echo "   sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}"
   ## Store job ID for next iteration by stripping the leading words
   ## ("Submitted batch job ") from the output of the sbatch command
   jobID=$(sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2>&1 | sed 's/[S,a-z]* //g')
   ##
   ## Check if ERROR occurred
   if [[ "${jobID}" =~ "ERROR" ]] ; then
      echo "   -> submission failed!" ; exit 1
   else
      echo "   -> job number = ${jobID}"
   fi
   ##
   ## Increase counter
   let myloop_counter+=1
done
</source>
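The two building blocks of the submitter script, extracting the job ID and building the dependency option, can be tried out without submitting anything. This is a minimal sketch: the function names ''extract_job_id'' and ''dependency_opt'' are hypothetical, and it assumes that sbatch reports a successful submission as "Submitted batch job <id>".

```bash
#!/bin/bash
# Return the job ID from sbatch's "Submitted batch job <id>" line
# by keeping only the last whitespace-separated word.
extract_job_id() {
    local sbatch_output="$1"
    echo "${sbatch_output##* }"
}

# Build the dependency option: empty for the first chain link,
# "-d <type>:<previous job ID>" for all later links.
dependency_opt() {
    local counter="$1" dep_type="$2" prev_id="$3"
    if [ "${counter}" -gt 1 ]; then
        echo "-d ${dep_type}:${prev_id}"
    fi
}

prev=$(extract_job_id "Submitted batch job 1262")
echo "link 2 would be submitted with: $(dependency_opt 2 afterok "${prev}")"
```

With the dependency type ''afterok'', each link only starts if its predecessor finished successfully, so a failing job stops the rest of the chain.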
 

==== GPU jobs ====

The nodes in the gpu_h100, gpu_mi300, gpu_a100_il and gpu_h100_il queues have 4 NVIDIA Ampere A100 GPUs or 4 NVIDIA Hopper H100 GPUs. Submitting a job to these queues is not enough to allocate one or more GPUs; you have to request them with the "--gres=gpu" parameter and specify how many GPUs your job needs, e.g. "--gres=gpu:2" requests two GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) Add the line with the GPU request, #SBATCH --gres=gpu:2, after the initial line of your script job.sh:
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh
</pre>
 
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Fri Apr 4 09:51:29 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 On | 00000000:06:00.0 Off | 0 |
| N/A 45C P0 70W / 415W | 27MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 On | 00000000:26:00.0 Off | 0 |
| N/A 45C P0 69W / 415W | 27MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3569 G /usr/libexec/Xorg 17MiB |
| 1 N/A N/A 3569 G /usr/libexec/Xorg 17MiB |
+-----------------------------------------------------------------------------------------+
</pre>
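Inside a batch script it can be useful to check how many GPUs the job actually received before starting the application. The following is a minimal sketch. It assumes that the batch system exports the allocated devices as a comma-separated list in CUDA_VISIBLE_DEVICES, which depends on the cluster configuration, and the function name ''count_visible_gpus'' is hypothetical.

```bash
#!/bin/bash
# Count the entries in the comma-separated CUDA_VISIBLE_DEVICES list.
# Assumption: the batch system exports this variable for GPU jobs.
count_visible_gpus() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "${devices}" ]; then
        echo 0
        return
    fi
    local IFS=','
    # Unquoted on purpose: split the list at the commas.
    set -- ${devices}
    echo "$#"
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```

For a job submitted with "--gres=gpu:2" the expected count under this assumption is 2; a count of 0 suggests that the "--gres" request is missing.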

 
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
Please run Open MPI's mpirun using the following command:
<pre>
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
</pre>
or, disabling the (older) openib component of the Byte Transfer Layer (BTL) altogether:
<pre>
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
</pre>

 
 

==== LSDF Online Storage ====
On bwUniCluster 3.0 you can use the LSDF Online Storage on the HPC cluster nodes for special cases. Please request this service separately ([https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request]).
To mount the LSDF Online Storage on the compute nodes during the job runtime, the constraint flag "LSDF" has to be set.

a) Add the line with the LSDF Online Storage request, #SBATCH --constraint=LSDF, after the initial line of your script job.sh:
<pre>
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=120
#SBATCH --mem=200
#SBATCH --constraint=LSDF
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh
</pre>
 
For the usage of the LSDF Online Storage
the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
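A job script can verify that the storage is really mounted before it reads or writes data there. This is a minimal sketch. It assumes that $LSDF holds the path of the mounted directory, and the function name ''lsdf_available'' is hypothetical.

```bash
#!/bin/bash
# Succeeds only if $LSDF is set and points to an existing directory.
# Assumption: $LSDF contains the mount path of the LSDF Online Storage.
lsdf_available() {
    [ -n "${LSDF:-}" ] && [ -d "${LSDF}" ]
}

if lsdf_available; then
    echo "LSDF Online Storage is mounted at ${LSDF}"
else
    echo "LSDF Online Storage is not available; was --constraint=LSDF set?" >&2
fi
```

Failing early with a clear message is usually preferable to an application error in the middle of a long run.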
 
 

====BeeOND (BeeGFS On-Demand)====

BeeOND instances are integrated into the prolog and epilog scripts of the cluster batch system Slurm. BeeOND can be used on exclusively allocated compute nodes during the job runtime by setting the constraint flag "BEEOND", "BEEOND_4MDS" or "BEEOND_MAXMDS" ([[BwUniCluster_2.0_Slurm_common_Features#sbatch_Command_Parameters|Slurm Command Parameters]]):
* BEEOND: one metadata server is started on the first node.
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.
* BEEOND_MAXMDS: a metadata server for the on-demand file system is started on every node of your job.

As a starting point we recommend using the "BEEOND" option. If you are unsure whether this is sufficient for you, feel free to contact the support team.
<source lang="bash">
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=BEEOND # or BEEOND_4MDS or BEEOND_MAXMDS
</source>

After your job has started you can find the private on-demand file system in '''/mnt/odfs/${SLURM_JOB_ID}''' directory. The mountpoint comes with five pre-configured directories:
<source lang="bash">
# For small files (stripe count = 1)
/mnt/odfs/${SLURM_JOB_ID}/stripe_1
# Stripe count = 4
/mnt/odfs/${SLURM_JOB_ID}/stripe_default
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_4
# Stripe count = 8, 16 or 32, use these directories for medium sized and large files or when using MPI-IO
/mnt/odfs/${SLURM_JOB_ID}/stripe_8
/mnt/odfs/${SLURM_JOB_ID}/stripe_16
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_32
</source>

If you request fewer nodes than the stripe count, the stripe count will be the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 has a stripe count of only 8.

; '''Attention:''' 
:Be careful when creating large files: always use the directory with the maximum stripe count for large files.
:If you create large files, use a higher stripe count. For example, if your largest file is 1.1 TByte, then you have to use a stripe count larger than 2,
:otherwise the available disk space is exceeded.

The capacity of the private file system depends on the number of nodes. For each node you get 750 Gbyte.
If you request 100 nodes for your job, the private file system has a capacity of approximately 100 * 750 Gbyte = 75 Tbyte.
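The two rules above, the stripe count being capped at the number of nodes and the capacity of 750 Gbyte per node, can be expressed as a short calculation. This is a minimal sketch; the function names ''effective_stripe_count'' and ''beeond_capacity_gb'' are hypothetical helpers for illustration.

```bash
#!/bin/bash
# Effective stripe count: the requested count, capped at the number of nodes.
effective_stripe_count() {
    local requested="$1" nodes="$2"
    if [ "${nodes}" -lt "${requested}" ]; then
        echo "${nodes}"
    else
        echo "${requested}"
    fi
}

# Capacity of the private file system in Gbyte (750 Gbyte per node).
beeond_capacity_gb() {
    local nodes="$1"
    echo $(( nodes * 750 ))
}

echo "stripe_16 on 8 nodes: stripe count $(effective_stripe_count 16 8)"
echo "100 nodes: $(beeond_capacity_gb 100) Gbyte"
```

Running the sketch reproduces the two examples from the text: a stripe count of 8 for stripe_16 on 8 nodes, and 75000 Gbyte for 100 nodes.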

== Start time of job or resources : squeue --start ==
Any user can run this command to display the estimated start time of a job. The estimate is based on a number of analysis types: historical usage, earliest available reservable resources, and priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by '''any user'''.
 
 

== List of your submitted jobs : squeue ==
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by any user.
 
 

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -l, --long
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.
|}
 

=== Examples ===
''squeue'' example on bwUniCluster 3.0 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
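Counting running and pending jobs can be automated by filtering on the state column. This is a minimal sketch that assumes the default column layout shown above, with the state in the fifth column and job names without spaces; the function name ''count_jobs_in_state'' is hypothetical.

```bash
#!/bin/bash
# Count jobs in a given state (column 5, "ST") in squeue output read from stdin.
# The header line is skipped.
count_jobs_in_state() {
    local state="$1"
    awk -v st="${state}" 'NR > 1 && $5 == st { n++ } END { print n + 0 }'
}

# Sample input copied from the example above.
# On the cluster the input would come from: squeue | count_jobs_in_state R
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084'

echo "running: $(echo "${sample}" | count_jobs_in_state R)"
echo "pending: $(echo "${sample}" | count_jobs_in_state PD)"
```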
 

== Shows free resources : sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Access ===
By default, this command can be used by any user or administrator.
 
 
=== Example ===
* The following command displays which resources are available for immediate use in each partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
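* In the above example, jobs can start immediately only in the partitions that report idle nodes (dev_cpu, cpu, highmem, gpu_h100 and gpu_mi300), provided they fit into the idle nodes.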
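The partitions that currently have idle nodes can be filtered out of the ''sinfo_t_idle'' output with awk. This is a minimal sketch. It assumes that the output keeps the form "Partition <name> : <count> nodes idle" shown in the example, and the function name ''partitions_with_idle_nodes'' is hypothetical.

```bash
#!/bin/bash
# Print the names of partitions that report at least one idle node.
# Expects lines of the form "Partition <name> : <count> nodes idle" on stdin.
partitions_with_idle_nodes() {
    awk '$1 == "Partition" && $4 + 0 > 0 { print $2 }'
}

# Sample input taken from the example above.
# On the cluster the input would come from: sinfo_t_idle | partitions_with_idle_nodes
sample='Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition dev_gpu_h100 : 0 nodes idle'

echo "${sample}" | partitions_with_idle_nodes
```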
 

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 
=== Access ===
* End users can use scontrol show job to view the status of their '''own jobs''' only.
 

=== Arguments ===
{| width=750px class="wikitable"
|-
! Option !! Default !! Description !! Example
|- style="vertical-align:top;"
|- style="width:12%;"
| -d
| (n/a)
| Detailed mode
| Example: Display the state of the job with jobid 18089884 in detailed mode. <pre>scontrol -d show job 18089884</pre>
|}
 
 

=== Scontrol show job Example ===
Here is an example from bwUniCluster 3.0.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
* Which state is the job in?
<pre>$ scontrol show job 1262 | grep -i State
JobState=RUNNING Reason=None Dependency=(null)
</pre>
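When only the value of a single field is needed, for example in a script, the KEY=VALUE pairs can be split apart. This is a minimal sketch that assumes values without spaces, as in the output shown above; the function name ''job_field'' is hypothetical.

```bash
#!/bin/bash
# Print the value of one KEY=VALUE pair from "scontrol show job" output on stdin.
# Only the first match is printed; values containing spaces are not supported.
job_field() {
    local key="$1"
    tr -s ' ' '\n' | awk -F= -v k="${key}" \
        '!found && $1 == k { sub(/^[^=]*=/, ""); print; found = 1 }'
}

# Sample input taken from the example above.
# On the cluster the input would come from: scontrol show job <jobid> | job_field JobState
sample=' JobState=RUNNING Reason=None Dependency=(null)
 RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A'

echo "state: $(echo "${sample}" | job_field JobState)"
echo "limit: $(echo "${sample}" | job_field TimeLimit)"
```

Unlike ''grep -i State'', the key is matched exactly, so a request for JobState does not also return other fields whose name contains "State".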
 

== Cancel Slurm Jobs ==
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
 
 
=== Canceling own jobs : scancel ===
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

 
{| class="wikitable"
! Flag !! Default !! Description !! Example
|- style="vertical-align:top;"
| -i, --interactive
| (n/a)
| Interactive mode.
| Cancel the job 987654 interactively. <pre> scancel -i 987654 </pre>
|-
| -t, --state
| (n/a)
| Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
| Cancel all jobs in state "PENDING". <pre> scancel -t "PENDING" </pre>
|}
 

= Resource Managers =
=== Batch Job (Slurm) Variables ===
The following environment variables of Slurm are added to your environment once your job has started
(only an excerpt of the most important ones).
{| width=750px class="wikitable"
! Environment !! Brief explanation
|-
| SLURM_JOB_CPUS_PER_NODE
| Number of CPUs per node dedicated to the job
|-
| SLURM_JOB_NODELIST
| List of nodes dedicated to the job
|-
| SLURM_JOB_NUM_NODES
| Number of nodes dedicated to the job
|-
| SLURM_MEM_PER_NODE
| Memory per node dedicated to the job
|-
| SLURM_NPROCS
| Total number of processes dedicated to the job
|-
| SLURM_CLUSTER_NAME
| Name of the cluster executing the job
|-
| SLURM_CPUS_PER_TASK
| Number of CPUs requested per task
|-
| SLURM_JOB_ACCOUNT
| Account name
|-
| SLURM_JOB_ID
| Job ID
|-
| SLURM_JOB_NAME
| Job Name
|-
| SLURM_JOB_PARTITION
| Partition/queue running the job
|-
| SLURM_JOB_UID
| User ID of the job's owner
|-
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which sbatch was invoked.
|-
| SLURM_JOB_USER
| User name of the job's owner
|-
| SLURM_RESTART_COUNT
| Number of times job has restarted
|-
| SLURM_PROCID
| Task ID (MPI rank)
|-
| SLURM_NTASKS
| The total number of tasks available for the job
|-
| SLURM_STEP_ID
| Job step ID
|-
| SLURM_STEP_NUM_TASKS
| Task count (number of MPI ranks)
|-
| SLURM_JOB_CONSTRAINT
| Job constraints
|}
See also:
* [https://slurm.schedmd.com/sbatch.html#SECTION_INPUT-ENVIRONMENT-VARIABLES Slurm input and output environment variables]
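A job script can log some of these variables at startup, which helps when reading the output file later. This is a minimal sketch; the function name ''job_summary'' is hypothetical, and the placeholder "n/a" is printed for variables that are not set, for example when the script is run outside of Slurm.

```bash
#!/bin/bash
# Print a one-line summary of the current job from Slurm's environment variables.
# Outside a job the variables are unset, so placeholders are shown instead.
job_summary() {
    echo "job ${SLURM_JOB_ID:-n/a} (${SLURM_JOB_NAME:-n/a}) in partition ${SLURM_JOB_PARTITION:-n/a}: ${SLURM_NTASKS:-n/a} tasks on ${SLURM_JOB_NUM_NODES:-n/a} nodes"
}

job_summary
```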
 

=== Job Exit Codes ===
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
 
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
 
 
==== Displaying Exit Codes and Signals ====
SLURM displays a job's exit code in the output of '''scontrol show job''' and of the sview utility.
 
When a signal was responsible for a job's or step's termination, the signal number will be displayed after the exit code, separated by a colon (:).
 
 
==== Submitting Termination Signal ====
Here is an example of how to 'save' a Slurm termination signal in a typical job script.
<source lang="bash">
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
exit_code=$?
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
</source>
* Do not use ''''time'''' mpirun! The exit code will be the one submitted by the first (time) program.
* You do not need an '''exit $exit_code''' in the scripts.
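A saved exit code can also be turned into a readable message. This is a minimal sketch that relies on the common shell convention that exit codes above 128 mean the program was terminated by signal number (code - 128); the function name ''describe_exit_code'' is hypothetical, and ''false'' merely stands in for the real executable.

```bash
#!/bin/bash
# Describe an exit code using the common shell convention that values
# above 128 mean "terminated by signal (code - 128)".
describe_exit_code() {
    local code="$1"
    if [ "${code}" -eq 0 ]; then
        echo "success"
    elif [ "${code}" -gt 128 ]; then
        echo "terminated by signal $(( code - 128 ))"
    else
        echo "failed with exit code ${code}"
    fi
}

# "false" stands in for the real executable, e.g. the mpirun call.
exit_code=0
false || exit_code=$?
echo "executable: $(describe_exit_code "${exit_code}")"
```

The exit code is read immediately after the command, before any other command can overwrite $?.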
 
 
----
[[#top|Back to top]]

BwUniCluster3.0/Running Jobs/Slurm

2025-04-04T08:12:09Z

P Schuhmacher: /* Examples */

{|style="background:#FEF4AB; width:100%;"
|style="padding:5px; background:#FEF4AB; text-align:left"|
This page is work in progress.
|}

= Slurm HPC Workload Manager =
== Specification ==
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
 
 
Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a single command or a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e., the '''batch job''', to a resource and workload managing software. bwUniCluster 3.0 uses the workload managing software Slurm. Therefore any job submission by the user has to be done with commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.
 
 

== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive Job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue --start|squeue --start]] || Returns start time of submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when a batch job starts depends on the availability of the requested resources and the fair sharing value.
 
 
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. Different defaults for some of these options are set based on the queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here].

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND filesystem.
|-
|}
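Several of the options from the table can be combined in the header of a job script. The following is a minimal sketch, not an official template: the job name "demo", the file names and the resource values are placeholders chosen for illustration, and the function name ''job_banner'' is hypothetical.

```bash
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:30:00
#SBATCH --job-name=demo
#SBATCH --output=demo_%j.out
#SBATCH --error=demo_%j.err
#SBATCH --mail-type=END,FAIL

# Outside of Slurm the #SBATCH lines are plain comments and the variables
# below are unset, so the defaults after ":-" are printed instead.
job_banner() {
    echo "job ${SLURM_JOB_NAME:-demo} with ${SLURM_NTASKS:-4} tasks"
}

job_banner
```

Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so the same script can be reused with a different partition or time limit.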
 

== Interactive job : salloc ==

If you want to run an interactive job, you can do so via the command salloc on a login node. For a serial application that runs on a compute node and requires 5000 MByte of memory, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc --partition=cpu --ntasks=1 --time=120 --mem=5000
</pre>

Then you will get one core on a compute node within the partition "cpu". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute node. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.

You can also start now a graphical X11-terminal connecting you to the dedicated resource that is available for 2 hours. You can start it by the command:

<pre>
$ xterm
</pre>

Note that, once the walltime limit has been reached the resources - i.e. the compute node - will automatically be revoked.

An interactive parallel application running on one or many compute nodes (e.g. here 5 nodes) with 40 cores each usually requires a certain amount of memory (e.g. 50 GByte) and a maximum time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc --partition=cpu --nodes=5 --ntasks-per-node=40 --time=01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 200 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node, you have to open a new terminal, connect it also to bwUniCluster 3.0 and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

With the command:

<pre>
$ squeue
</pre>

the jobid and the nodelist can be shown.

If you want to run MPI-programs, you can do it by simply typing mpirun <program_name>. Then your program will be run on 200 cores. A very simple example for starting a parallel job can be:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt by the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).

<div id="top"></div>
= Slurm HPC Workload Manager =
== Specification ==
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
 
 
Any kind of calculation on the compute nodes of [[bwUniCluster3.0|bwUniCluster 3.0]] requires the user to define the calculation as a single command or a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to a resource and workload managing software. bwUniCluster 3.0 uses the workload manager Slurm, so every job submission is carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.
 
 

== Slurm Commands (excerpt) ==
Some of the most used Slurm commands for non-administrators working on bwUniCluster 3.0:
{| width=750px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job Submission : sbatch|sbatch]] || Submits a job and queues it in an input queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job or requested resources [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job Submission : sbatch ==
Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
 
=== sbatch Command Parameters ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script.
{| width=750px class="wikitable"
! colspan="3" | sbatch Options
|-
! Command line
! Script
! Purpose
|- style="vertical-align:top;"
| -t ''time'' or --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N ''count'' or --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n ''count'' or --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count (<= 28 and <= 40 resp.) of tasks per node. (Replaces the option ppn of MOAB.)
|-
|- style="vertical-align:top;"
| -c ''count'' or --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (Default value is 128000 and 96000 MB resp., i.e. you should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J ''name'' or --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added.
|-
|- style="vertical-align:top;"
| -A ''group-name'' or --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p ''queue-name'' or --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C ''LSDF'' or --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF Filesystems.
|-
|- style="vertical-align:top;"
| -C ''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)'' or --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Job constraint BeeOND file system.
|-
|}
 

==== sbatch --partition ''queues'' ====
Queue classes define the maximum resources, such as walltime, nodes and processes per node, for each queue of the compute system. Details can be found here:
* [[BwUniCluster3.0/Batch_Queues|bwUniCluster 3.0 queue settings]]
 

=== sbatch Examples ===
==== Serial Programs ====
To submit a serial job that runs the script '''job.sh''' and that requires 5000 MB of main memory and 10 minutes of wall clock time

a) execute:
<pre>
$ sbatch -p dev_cpu -n 1 -t 10:00 --mem=5000 job.sh
</pre>
or
b) add after the initial line of your script '''job.sh''' the lines (here with a high memory request):
<source lang="bash">
#SBATCH --ntasks=1
#SBATCH --time=10
#SBATCH --mem=180gb
#SBATCH --job-name=simple
</source>
and execute the modified script with the command line option ''--partition=highmem'':
<pre>
$ sbatch --partition=highmem job.sh
</pre>
Note that sbatch command line options override script options.
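
The examples above write the wall clock limit in different ways (''-t 10:00'' and ''--time=10'' both mean 10 minutes). As a rough reading aid, the following sketch converts a Slurm time specification into seconds. The function name ''slurm_time_to_seconds'' is made up for this illustration and is not part of Slurm; the accepted formats are the ones listed in the sbatch manpage for ''--time''.
<source lang="bash">
#!/bin/bash
# Convert a Slurm time specification to seconds.
# Accepted forms (see "man sbatch", option --time):
#   minutes, minutes:seconds, hours:minutes:seconds,
#   days-hours, days-hours:minutes, days-hours:minutes:seconds
slurm_time_to_seconds() {
    local spec=$1 days=0 h=0 m=0 s=0
    local -a parts
    if [[ $spec == *-* ]]; then
        # Everything before the dash is the number of days
        days=${spec%%-*}
        IFS=: read -r h m s <<< "${spec#*-}"
    else
        IFS=: read -r -a parts <<< "$spec"
        case ${#parts[@]} in
            1) m=${parts[0]} ;;
            2) m=${parts[0]}; s=${parts[1]} ;;
            3) h=${parts[0]}; m=${parts[1]}; s=${parts[2]} ;;
            *) echo "unsupported time format: $spec" >&2; return 1 ;;
        esac
    fi
    # 10# forces base 10 so that values like "08" are not read as octal
    echo $(( 10#${days:-0} * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}

slurm_time_to_seconds 10        # as in --time=10        (10 minutes)
slurm_time_to_seconds 10:00     # as in -t 10:00         (10 minutes)
slurm_time_to_seconds 01:00:00  # as in --time=01:00:00  (1 hour)
</source>
All three calls describe limits of 10 minutes, 10 minutes and 1 hour respectively; a bare number is always interpreted as minutes, never as seconds.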
 
 

==== Multithreaded Programs ====
Multithreaded programs operate faster than serial programs on CPUs with multiple cores. 
Moreover, multiple threads of one process share resources such as memory.
 
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) a number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
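
A common pattern, sketched here under the assumption that the job requests its cores via ''--cpus-per-task'', is to derive the thread count from the Slurm variable SLURM_CPUS_PER_TASK instead of hard-coding it. The helper name ''set_omp_threads'' is hypothetical.
<source lang="bash">
#!/bin/bash
# Derive the OpenMP thread count from the Slurm allocation.
# SLURM_CPUS_PER_TASK is only set if --cpus-per-task was requested,
# so fall back to the default of 1 thread otherwise.
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

set_omp_threads
echo "Using ${OMP_NUM_THREADS} OpenMP thread(s)"
</source>
With this approach the thread count follows the resource request automatically, so changing ''--cpus-per-task'' on the command line does not require editing the script.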
 
'''Hyperthreading is switched on on bwUniCluster 3.0. If you do not want to use it, set the option --threads-per-core to 1.'''
 
To submit a batch job called ''OpenMP_Test'' that runs a 96-fold threaded program ''omp_exe'' which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:
 
a) execute (sbatch only accepts batch scripts, so the executable is passed via ''--wrap''):
<pre>
$ sbatch -p cpu --export=ALL,OMP_NUM_THREADS=96 -J OpenMP_Test -N 1 -c 96 --threads-per-core=1 -t 40 --mem=6000 --wrap="./omp_exe"
</pre>
or
b) generate the script '''job_omp.sh''' containing the following lines:
<source lang="bash">
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=96
#SBATCH --time=40:00
#SBATCH --threads-per-core=1
#SBATCH --mem=6000mb
#SBATCH --export=ALL,EXECUTABLE=./omp_exe
#SBATCH -J OpenMP_Test

#Usually you should set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

export OMP_NUM_THREADS=${SLURM_JOB_CPUS_PER_NODE}
echo "Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe
</source>
With the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If necessary, add a ''module load'' line for the modulefile that provides your OpenMP environment. Then execute the script '''job_omp.sh''', adding the queue class ''cpu'' as sbatch option:
<pre>
$ sbatch -p cpu job_omp.sh
</pre>
Note that sbatch command line options override script options, e.g.,
<pre>
$ sbatch --partition=cpu --mem=200 job_omp.sh
</pre>
overrides the script setting of 6000 MByte with 200 MByte.
 
 

==== MPI Parallel Programs ====
MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., '''MPI tasks''', run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
 
Multiple MPI tasks must be launched via '''mpirun''', e.g. 4 MPI tasks of ''my_par_program'':
<pre>
$ mpirun -n 4 my_par_program
</pre>
This command runs 4 MPI tasks of ''my_par_program'' on the node you are logged in.
To run this command with a loaded Intel MPI module, the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset first (''unset I_MPI_HYDRA_BOOTSTRAP'').

When MPI parallel programs run in a batch job, the interactive environment, in particular the set of loaded modules, is carried over into the batch job. If you want a well-defined module environment in your batch job, purge all modules before loading the desired ones.
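
The purge-then-load sequence can be wrapped in a small helper, sketched below. ''load_clean_env'' is a hypothetical name chosen for this example; it relies only on the standard ''module purge'' and ''module load'' commands of the module system.
<source lang="bash">
#!/bin/bash
# Start from an empty module environment, then load exactly the
# modules passed as arguments (hypothetical helper, not provided
# by the cluster).
load_clean_env() {
    module purge
    local m
    for m in "$@"; do
        module load "$m"
    done
}

# Usage inside a job script, with the module names you actually need:
#   load_clean_env compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version> \
#                  mpi/openmpi/<placeholder_for_mpi_version>
</source>
This way the job no longer depends on whatever modules happened to be loaded in the shell from which sbatch was called.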
 
 
===== OpenMPI =====

If you want to run jobs on batch nodes, generate a wrapper script ''job_ompi.sh'' for '''OpenMPI''' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when using the module environment for OpenMPI
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/openmpi/<placeholder_for_mpi_version>
mpirun --bind-to core --map-by core -report-bindings my_par_program
</source>
'''Attention:''' Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames. Use '''ALWAYS''' the MPI options '''''--bind-to core''''' and '''''--map-by core|socket|node'''''. Please type ''mpirun --help'' for an explanation of the meaning of the different options of mpirun option ''--map-by''.
 
Considering 4 OpenMPI tasks on a single node, each requiring 2000 MByte, and running for 1 hour, execute:
<pre>
$ sbatch -p cpu -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh
</pre>
 

===== Intel MPI =====

Generate a wrapper script for '''Intel MPI''', ''job_impi.sh'' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when a defined module environment related to Intel MPI is wished
module load compiler/<placeholder_for_compiler>/<placeholder_for_compiler_version>
module load mpi/impi/<placeholder_for_version>
mpiexec.hydra -bootstrap slurm my_par_program
</source>
'''Attention:''' 
Do '''NOT''' add mpiexec.hydra options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpiexec.hydra about the number of processes and the node hostnames.
 
To launch 200 Intel MPI tasks on 5 nodes, with 80 GByte of memory per node and a run time of 5 hours, execute:
<pre>
$ sbatch --partition=cpu -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh
</pre>
 
If you want to use 128 or more nodes, you must also set the environment variable as follows: 
export I_MPI_HYDRA_BRANCH_COUNT=-1
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
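
The 128-node rule above can be applied automatically inside a job script. The following is a minimal sketch that assumes SLURM_JOB_NUM_NODES is set by Slurm at job start; the function name ''configure_hydra_branching'' is hypothetical.
<source lang="bash">
#!/bin/bash
# Set I_MPI_HYDRA_BRANCH_COUNT=-1 only for jobs with 128 or more nodes.
# Outside of a Slurm job SLURM_JOB_NUM_NODES is unset, so assume 1 node.
configure_hydra_branching() {
    if [ "${SLURM_JOB_NUM_NODES:-1}" -ge 128 ]; then
        export I_MPI_HYDRA_BRANCH_COUNT=-1
    fi
}

configure_hydra_branching
</source>
Call the function before ''mpiexec.hydra'' so that the variable is part of the environment of the launcher.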
 
 

==== Multithreaded + MPI parallel Programs ====
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned over different nodes. '''Hyperthreading is switched on on bwUniCluster 3.0. If you do not want to use it, set the option --threads-per-core to 1.'''
 
 
===== OpenMPI with Multithreading =====
Multiple MPI tasks using '''OpenMPI''' must be launched by the MPI parallel program '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''For OpenMPI''', a job script ''job_ompi_omp.sh'' that runs an MPI program with 4 tasks, each a 28-fold threaded instance of ''ompi_omp_program'', requiring 3000 MByte of physical memory per thread (with 28 threads per MPI task this amounts to 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours looks like:

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=28
#SBATCH --threads-per-core=1
#SBATCH --time=03:00:00
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Execute the script '''job_ompi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p cpu ./job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
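* With the option ''--map-by node:PE=<value>'', neighboring MPI tasks are attached to different nodes and each MPI task is bound to the first core of a node; the script above uses ''--map-by socket:PE=<value>'' instead, which distributes neighboring MPI tasks across sockets. In either case <value> must be set to ${OMP_NUM_THREADS}.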
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program.
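
The memory arithmetic used for the ''--mem'' request in the script above (28 threads times 3000 MByte, converted to GByte and rounded up) can be reproduced with shell arithmetic. The two helper names are made up for this sketch.
<source lang="bash">
#!/bin/bash
# Memory per MPI task in MB: threads per task times MB per thread
mem_per_task_mb() {
    local threads=$1 mb_per_thread=$2
    echo $(( threads * mb_per_thread ))
}

# Convert MB to GB (1 GB = 1024 MB), rounding up to a whole number
mb_to_gb_ceil() {
    echo $(( ($1 + 1023) / 1024 ))
}

task_mb=$(mem_per_task_mb 28 3000)
echo "memory per task: ${task_mb} MB"
echo "request: --mem=$(mb_to_gb_ceil "${task_mb}")gb"
</source>
Rounding up rather than down avoids requesting slightly less memory than the program needs.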
 

===== Intel MPI with Multithreading =====
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes.

Multiple Intel MPI tasks must be launched by the MPI parallel program '''mpiexec.hydra'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

'''For Intel MPI''', a job script ''job_impi_omp.sh'' that runs an Intel MPI program with 10 tasks, each a 96-fold threaded instance of ''impi_omp_program'', requiring 96000 MByte of total physical memory per task and a total wall clock time of 1 hour looks like:


<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=96
#SBATCH --threads-per-core=1
#SBATCH --time=60
#SBATCH --mem=96000
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program
#SBATCH --output="parprog_impi_omp_%j.out"

#If using more than one MPI task per node please set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,scatter prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

# Use when a defined module environment related to Intel MPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="-binding domain=omp:compact -print-rank-map -envall"
export NUM_PROCS=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}"
echo $startexe
exec $startexe
</source>
Using Intel compiler the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If you only run one MPI task per node please set KMP_AFFINITY=compact,1,0.
 
If you want to use 128 or more nodes, you must also set the environment variable as follows: 
export I_MPI_HYDRA_BRANCH_COUNT=-1
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
 
 
Execute the script '''job_impi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p cpu ./job_impi_omp.sh
</pre>
 
The mpiexec.hydra option ''-print-rank-map'' shows the bindings between MPI tasks and nodes (not very beneficial). The option ''-binding'' binds MPI tasks (processes) to a particular processor; ''domain=omp'' means that the domain size is determined by the number of threads. If you choose 2 MPI tasks per node, you should use ''-binding "cell=unit;map=bunch"''; this binding maps one MPI process to each socket.
 
 

==== Chain jobs ====
The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem by a job chain. A job chain is a sequence of jobs where each job automatically starts its successor.
<source lang="bash">
#!/bin/bash
####################################
## simple Slurm submitter script to setup ##
## a chain of jobs using Slurm ##
####################################
## ver. : 2018-11-27, KIT, SCC

## Define maximum number of jobs via positional parameter 1, default is 5
max_nojob=${1:-5}

## Define your jobscript (e.g. "~/chain_job.sh")
chain_link_job=${PWD}/chain_job.sh

## Define type of dependency via positional parameter 2, default is 'afterok'
dep_type="${2:-afterok}"
## -> List of all dependencies:
## https://slurm.schedmd.com/sbatch.html

myloop_counter=1
## Submit loop
while [ ${myloop_counter} -le ${max_nojob} ] ; do
   ##
   ## Differ slurm_opt depending on chain link number
   if [ ${myloop_counter} -eq 1 ] ; then
      slurm_opt=""
   else
      slurm_opt="-d ${dep_type}:${jobID}"
   fi
   ##
   ## Print current iteration number and sbatch command
   echo "Chain job iteration = ${myloop_counter}"
   echo "   sbatch --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}"
   ## Store job ID for next iteration: reduce the sbatch output "Submitted batch job <id>" to the bare ID
   jobID=$(sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2>&1 | sed 's/[S,a-z]* //g')
   ##
   ## Check if an error occurred: a successful submission yields a purely numeric job ID
   if ! [[ "${jobID}" =~ ^[0-9]+$ ]] ; then
      echo "   -> submission failed!" ; exit 1
   else
      echo "   -> job number = ${jobID}"
   fi
   ##
   ## Increase counter
   let myloop_counter+=1
done
</source>
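The submitter script refers to a job script ''chain_job.sh'' that is not shown here. A minimal sketch of such a chain link could look as follows; the state file, its name and the function ''run_link'' are assumptions made purely for illustration, and a real application would use its own checkpoint and restart mechanism instead.
<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=10

# myloop_counter is exported by the submitter script; default to 1
# so that the script can also be run stand-alone.
link=${myloop_counter:-1}

# Hypothetical state file through which one chain link hands its
# progress to the next one.
state_file=${STATE_FILE:-${SLURM_SUBMIT_DIR:-$PWD}/chain_state.txt}

run_link() {
    local previous=0
    if [ -f "$state_file" ]; then
        previous=$(cat "$state_file")
    fi
    echo "link ${link}: resuming after ${previous} completed step(s)"
    # ... the actual computation of this chain link would go here ...
    echo $(( previous + 1 )) > "$state_file"
}

run_link
</source>
With the default dependency type ''afterok'', a link only starts if its predecessor finished successfully, so the state file always reflects a completed step.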
 

==== GPU jobs ====

The nodes in the gpu_h100, gpu_mi300, gpu_a100_il and gpu_h100_il queues have 4 NVIDIA Ampere A100 GPUs or 4 NVIDIA Hopper H100 GPUs. Submitting a job to one of these queues does not by itself allocate any GPUs; you have to request them with the "--gres=gpu" parameter and specify how many GPUs your job needs, e.g. "--gres=gpu:2" requests two GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) add after the initial line of your script job.sh the line including the
information about the GPU usage: #SBATCH --gres=gpu:2
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh
</pre>
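
Inside a job script it can be useful to verify how many GPUs were actually granted. The sketch below assumes that Slurm exports CUDA_VISIBLE_DEVICES as a comma-separated list for jobs requesting NVIDIA GPUs via ''--gres''; this is common but depends on the cluster configuration. The function name ''count_allocated_gpus'' is hypothetical.
<source lang="bash">
#!/bin/bash
# Count the GPUs visible to the job by counting the comma-separated
# entries in CUDA_VISIBLE_DEVICES (assumed to be set by Slurm).
count_allocated_gpus() {
    local devices=${CUDA_VISIBLE_DEVICES:-}
    local -a ids
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    IFS=, read -r -a ids <<< "$devices"
    echo "${#ids[@]}"
}

echo "GPUs visible to this job: $(count_allocated_gpus)"
</source>
A job submitted with ''--gres=gpu:2'' would be expected to report two GPUs under this assumption.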
 
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Fri Apr 4 09:51:29 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 On | 00000000:06:00.0 Off | 0 |
| N/A 45C P0 70W / 415W | 27MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 On | 00000000:26:00.0 Off | 0 |
| N/A 45C P0 69W / 415W | 27MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3569 G /usr/libexec/Xorg 17MiB |
| 1 N/A N/A 3569 G /usr/libexec/Xorg 17MiB |
+-----------------------------------------------------------------------------------------+
</pre>

 
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
Please run Open MPI's mpirun using the following command:
<pre>
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
</pre>
or, disabling the (older) communication layer Byte Transfer Layer (BTL for short) altogether:
<pre>
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
</pre>

 
 

==== LSDF Online Storage ====
On bwUniCluster 3.0 you can use the LSDF Online Storage on the HPC cluster nodes for special cases. Please request this service separately ([https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request]).
To mount the LSDF Online Storage on the compute nodes during the job runtime,
the constraint flag "LSDF" has to be set.

a) add after the initial line of your script job.sh the line including the
information about the LSDF Online Storage usage: #SBATCH --constraint=LSDF
<pre>
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=120
#SBATCH --mem=200
#SBATCH --constraint=LSDF
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh
</pre>
 
For the usage of the LSDF Online Storage
the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
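
A job script can check that the storage is really available before using it. The following sketch assumes that $LSDF points to a directory once the storage is mounted; the function name ''require_lsdf'' is made up for this example.
<source lang="bash">
#!/bin/bash
# Fail early if the LSDF Online Storage is not available in this job.
# Assumption: $LSDF holds the path of a mounted directory.
require_lsdf() {
    if [ -z "${LSDF:-}" ] || [ ! -d "${LSDF}" ]; then
        echo "LSDF Online Storage not available; was --constraint=LSDF set?" >&2
        return 1
    fi
}

# Usage in a job script:
#   require_lsdf || exit 1
#   ls "${LSDF}"
</source>
Failing early avoids a job that runs to the end of its walltime only to discover that its input data was never reachable.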
 
 

====BeeOND (BeeGFS On-Demand)====

BeeOND instances are integrated into the prolog and epilog scripts of the cluster batch system Slurm. BeeOND can be used on the exclusive compute nodes during the job runtime with the constraint flag "BEEOND", "BEEOND_4MDS" or "BEEOND_MAXMDS" ([[BwUniCluster_2.0_Slurm_common_Features#sbatch_Command_Parameters|Slurm Command Parameters]]).
* BEEOND: one metadata server is started on the first node
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.
* BEEOND_MAXMDS: on every node of your job a metadata server for the on_demand file system is started

As a starting point we recommend using the "BEEOND" option. If you are unsure whether this is sufficient for you, feel free to contact the support team.
<source lang="bash">
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=BEEOND # or BEEOND_4MDS or BEEOND_MAXMDS
</source>

After your job has started you can find the private on-demand file system in '''/mnt/odfs/${SLURM_JOB_ID}''' directory. The mountpoint comes with five pre-configured directories:
<source lang="bash">
# For small files (stripe count = 1)
/mnt/odfs/${SLURM_JOB_ID}/stripe_1
# Stripe count = 4
/mnt/odfs/${SLURM_JOB_ID}/stripe_default
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_4
# Stripe count = 8, 16 or 32, use these directories for medium sized and large files or when using MPI-IO
/mnt/odfs/${SLURM_JOB_ID}/stripe_8
/mnt/odfs/${SLURM_JOB_ID}/stripe_16
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_32
</source>

If you request fewer nodes than the stripe count, the stripe count is reduced to the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 has a stripe count of only 8.

; '''Attention:''' 
:Be careful when creating large files: always use the directory with the maximum stripe count for large files.
:If you create large files, use a higher stripe count. For example, if your largest file is 1.1 TByte, then you have to use a stripe count larger than 2,
:otherwise the available disk space is exceeded.

The capacity of the private file system depends on the number of nodes. For each node you get 750 GByte.
If you request 100 nodes for your job, the private file system has a capacity of 100 * 750 GByte, i.e. approximately 75 TByte.
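
The two rules above (750 GByte per node, and a stripe count that never exceeds the number of nodes) can be written down as simple shell arithmetic. Both function names are hypothetical and serve only to illustrate the calculation.
<source lang="bash">
#!/bin/bash
# Capacity of the private BeeOND file system in GByte:
# 750 GByte per node, as stated above.
beeond_capacity_gb() {
    local nodes=$1
    echo $(( nodes * 750 ))
}

# Effective stripe count of a stripe_N directory:
# the smaller of the requested stripe count and the number of nodes.
effective_stripe_count() {
    local requested=$1 nodes=$2
    echo $(( requested < nodes ? requested : nodes ))
}

echo "100 nodes: $(beeond_capacity_gb 100) GByte"
echo "stripe_16 on 8 nodes: stripe count $(effective_stripe_count 16 8)"
</source>
Inside a job, the number of nodes can be taken from SLURM_JOB_NUM_NODES.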

== Start time of job or resources : squeue --start ==
The command can be used by any user to display the estimated start time of a job. The estimate is based on several types of analysis: historical usage, earliest available reservable resources, and priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
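
As a small convenience, the call can be wrapped in a shell function that restricts the output to your own jobs. This is only a sketch: ''show_start_times'' is a made-up name, and the SQUEUE variable exists solely so that the command can be replaced for testing. The options ''--start'', ''--user'' and ''--jobs'' are regular squeue options.
<source lang="bash">
#!/bin/bash
# Show the estimated start times of the calling user's pending jobs.
# Additional squeue options (e.g. --jobs=<jobid>) are passed through.
show_start_times() {
    "${SQUEUE:-squeue}" --start --user="${USER:-$(id -un)}" "$@"
}

# Usage on a login node:
#   show_start_times
#   show_start_times --jobs=<jobid>
</source>
Keep in mind that the reported start times are estimates and can shift as other jobs finish early or new jobs with higher priority arrive.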
 
 
=== Access ===
By default, this command can be run by '''any user'''.
 
 

== List of your submitted jobs : squeue ==
Displays information about your own active, pending and/or recently completed jobs; jobs of other users are not shown. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by any user.
 
 

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -l, --long
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.
|}
 

=== Examples ===
''squeue'' example on bwUniCluster 3.0 (only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 R 8:15 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PD 0:00 1 (Resources)
1265 highmem wrap ka_ab123 R 2:41 1 uc3n084
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1262 cpu wrap ka_ab123 RUNNING 8:55 20:00 1 uc3n002
1267 dev_gpu_h wrap ka_ab123 PENDING 0:00 20:00 1 (Resources)
1265 highmem wrap ka_ab123 RUNNING 3:21 20:00 1 uc3n084

</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
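
To get the number of running and pending jobs at a glance, the default ''squeue'' output shown above can be summarised with awk. The sketch assumes the default column layout (state in the fifth column) and job names without spaces; ''count_job_states'' is a hypothetical helper name.
<source lang="bash">
#!/bin/bash
# Count jobs per state from default squeue output read on stdin.
# NR > 1 skips the header line; column 5 is the ST (state) column.
count_job_states() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on a login node:
#   squeue | count_job_states
</source>
For the example output above this would list one pending (PD) and two running (R) jobs.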
 

== Shows free resources : sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Access ===
By default, sinfo_t_idle can be used by any user or administrator.
 
 
=== Example ===
* The following command displays which resources are available for immediate use in each partition.
<pre>$ sinfo_t_idle
Partition dev_cpu : 2 nodes idle
Partition cpu : 68 nodes idle
Partition highmem : 4 nodes idle
Partition dev_gpu_h100 : 0 nodes idle
Partition gpu_h100 : 11 nodes idle
Partition gpu_mi300 : 1 nodes idle
Partition dev_cpu_il : 0 nodes idle
Partition cpu_il : 0 nodes idle
Partition dev_gpu_a100_il : 0 nodes idle
Partition gpu_a100_il : 0 nodes idle
Partition gpu_h100_il : 0 nodes idle
</pre>
* In the above example, jobs can start immediately only in the partitions that report idle nodes (dev_cpu, cpu, highmem, gpu_h100 and gpu_mi300).
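
The output format shown above is easy to filter. As a sketch that assumes exactly this line layout ("Partition <name> : <count> nodes idle"), the following function prints only the partitions with at least one idle node; the name ''idle_partitions'' is made up for this example.
<source lang="bash">
#!/bin/bash
# Read sinfo_t_idle output on stdin and print the names of all
# partitions that currently have at least one idle node.
# Fields: $1="Partition" $2=<name> $3=":" $4=<count>
idle_partitions() {
    awk '$1 == "Partition" && ($4 + 0) > 0 { print $2 }'
}

# Usage on a login node:
#   sinfo_t_idle | idle_partitions
</source>
The result can serve as a quick hint which queue to pick for a short test job.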
 

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 
=== Access ===
* End users can use scontrol show job to view the status of their '''own jobs''' only.
 

=== Arguments ===
{| width=750px class="wikitable"
|-
! Option !! Default !! Description !! Example
|- style="vertical-align:top;"
|- style="width:12%;"
| -d
| (n/a)
| Detailed mode
| Example: Display the state of the job with jobid 18089884 in detailed mode. <pre>scontrol -d show job 18089884</pre>
|}
 
 

=== Scontrol show job Example ===
Here is an example from bwUniCluster 3.0.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1262 cpu wrap ka_zs040 R 1:12 1 uc3n002

$
$ # now, see what's up with my running job with jobid 1262
$
$ scontrol show job 1262

JobId=1262 JobName=wrap
UserId=ka_zs0402(241992) GroupId=ka_scc(12345) MCS_label=N/A
Priority=4246 Nice=0 Account=ka QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:37 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2025-04-04T10:01:30 EligibleTime=2025-04-04T10:01:30
AccrueTime=2025-04-04T10:01:30
StartTime=2025-04-04T10:01:31 EndTime=2025-04-04T10:21:31 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-04T10:01:31 Scheduler=Main
Partition=cpu AllocNode:Sid=uc3n999:2819841
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc3n002
BatchHost=uc3n002
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=2000M,node=1,billing=1
AllocTRES=cpu=2,mem=4000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/pfs/data6/home/ka/ka_scc/ka_zs0402
StdErr=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out
StdIn=/dev/null
StdOut=/pfs/data6/home/ka/ka_scc/ka_zs0402/slurm-1262.out

</pre>
 
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
* Which state is the job in?
<pre>$ scontrol show job 1262 | grep -i State
JobState=RUNNING Reason=None Dependency=(null)
</pre>
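Instead of a whole line, a single field can be extracted as well. The following sketch splits the KEY=VALUE pairs of the ''scontrol show job'' output onto separate lines and prints the value of the requested key. The function name ''job_field'' is hypothetical, and values that contain spaces are cut at the first space.
<source lang="bash">
#!/bin/bash
# Print the value of one KEY=VALUE field from "scontrol show job"
# output read on stdin. Only the first match is printed.
job_field() {
    local key=$1
    tr ' ' '\n' | sed -n "s/^${key}=//p" | head -n 1
}

# Usage on a login node:
#   scontrol show job <jobid> | job_field JobState
#   scontrol show job <jobid> | job_field TimeLimit
</source>
This is handy in scripts that need to wait for a job or react to its state.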
 

== Cancel Slurm Jobs ==
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
 
 
=== Canceling own jobs : scancel ===
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

 
{| class="wikitable"
! Flag !! Default !! Description !! Example
|- style="vertical-align:top;"
| -i, --interactive
| (n/a)
| Interactive mode.
| Cancel the job 987654 interactively. <pre> scancel -i 987654 </pre>
|-
| -t, --state
| (n/a)
| Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
| Cancel all jobs in state "PENDING". <pre> scancel -t "PENDING" </pre>
|}
 

= Resource Managers =
=== Batch Job (Slurm) Variables ===
Slurm adds the following environment variables to the environment of your job once the job has started
(only an excerpt of the most important ones is listed).
{| width=750px class="wikitable"
! Environment variable !! Brief explanation
|-
| SLURM_JOB_CPUS_PER_NODE
| Number of CPUs per node allocated to the job
|-
| SLURM_JOB_NODELIST
| List of nodes dedicated to the job
|-
| SLURM_JOB_NUM_NODES
| Number of nodes dedicated to the job
|-
| SLURM_MEM_PER_NODE
| Memory per node dedicated to the job
|-
| SLURM_NPROCS
| Total number of processes dedicated to the job
|-
| SLURM_CLUSTER_NAME
| Name of the cluster executing the job
|-
| SLURM_CPUS_PER_TASK
| Number of CPUs requested per task
|-
| SLURM_JOB_ACCOUNT
| Account name
|-
| SLURM_JOB_ID
| Job ID
|-
| SLURM_JOB_NAME
| Job Name
|-
| SLURM_JOB_PARTITION
| Partition/queue running the job
|-
| SLURM_JOB_UID
| User ID of the job's owner
|-
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which sbatch was invoked.
|-
| SLURM_JOB_USER
| User name of the job's owner
|-
| SLURM_RESTART_COUNT
| Number of times job has restarted
|-
| SLURM_PROCID
| Task ID (MPI rank)
|-
| SLURM_NTASKS
| The total number of tasks available for the job
|-
| SLURM_STEP_ID
| Job step ID
|-
| SLURM_STEP_NUM_TASKS
| Task count (number of MPI ranks)
|-
| SLURM_JOB_CONSTRAINT
| Job constraints
|}
See also:
* [https://slurm.schedmd.com/sbatch.html#SECTION_INPUT-ENVIRONMENT-VARIABLES Slurm input and output environment variables]
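
As an illustration, a job script could log some of these variables at startup. This is only a sketch: the function name <code>print_job_summary</code> is hypothetical, and the fallback values ("n/a", and 1 for SLURM_CPUS_PER_TASK, which is assumed to be unset when no CPUs per task were requested) are choices made for this example so that it also runs outside of a Slurm job.

```shell
# print_job_summary: log the most important Slurm variables of the current job.
# Hypothetical helper; "${VAR:-n/a}" substitutes a fallback if VAR is unset.
print_job_summary() {
    echo "Job ${SLURM_JOB_ID:-n/a} (${SLURM_JOB_NAME:-n/a}) on partition ${SLURM_JOB_PARTITION:-n/a}"
    echo "Nodes: ${SLURM_JOB_NUM_NODES:-n/a} [${SLURM_JOB_NODELIST:-n/a}]"
    # Assumption: fall back to 1 CPU per task if the variable is not set.
    echo "Tasks: ${SLURM_NTASKS:-n/a}, CPUs per task: ${SLURM_CPUS_PER_TASK:-1}"
    echo "Submitted from: ${SLURM_SUBMIT_DIR:-n/a}"
}

print_job_summary
```

Placing such a call near the top of a job script makes it easier to relate an output file to the resources the job actually received.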
 

=== Job Exit Codes ===
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
 
The exit code is an 8-bit unsigned number ranging from 0 to 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the range 0 to 255.
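
The mapping into the 8-bit range can be reproduced with shell arithmetic. The following sketch is for illustration only; the helper name <code>to_exit_status</code> is hypothetical, and it simply keeps the lowest 8 bits of the given number.

```shell
# to_exit_status N: map an arbitrary integer to its 8-bit unsigned value.
# Hypothetical helper for illustration only.
to_exit_status() {
    echo $(( $1 & 255 ))   # keep only the lowest 8 bits
}

to_exit_status -1    # prints 255
to_exit_status 256   # prints 0
to_exit_status 300   # prints 44
```

Note that a value such as 256 maps to 0, so it is safest to keep the exit codes of your own programs within 0 to 255.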
 
 
==== Displaying Exit Codes and Signals ====
SLURM displays a job's exit code in the output of '''scontrol show job''' and in the sview utility.
 
When a signal was responsible for the termination of a job or step, the signal number is displayed after the exit code, separated by a colon (:).
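
The two parts of such a value (for example the 0:0 in ExitCode=0:0) can be separated with shell parameter expansion. This is a minimal sketch; the helper name <code>split_exit_code</code> is hypothetical, and it expects the text after "ExitCode=" as its only argument.

```shell
# split_exit_code "CODE:SIGNAL": print both parts of an ExitCode value.
# Hypothetical helper; the argument is the text after "ExitCode=".
split_exit_code() {
    code=${1%%:*}      # text before the colon: exit code
    signal=${1##*:}    # text after the colon: signal number (0 = no signal)
    echo "exit code: ${code}, signal: ${signal}"
}

split_exit_code "0:0"    # prints "exit code: 0, signal: 0"
split_exit_code "0:9"    # prints "exit code: 0, signal: 9"
```

The second call corresponds to a job that was terminated by signal 9 (SIGKILL).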
 
 
==== Submitting Termination Signal ====
Here is an example of how to 'save' the exit code of a program in a typical jobscript.
<source lang="bash">
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
exit_code=$?   # save the exit code immediately after mpirun has finished
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
</source>
* Do not use ''''time'''' mpirun! The exit code will be the one returned by the first program (time).
* You do not need an '''exit $exit_code''' in the scripts.
 
 
----
[[#top|Back to top]]