bwHPC Wiki - User contributions [en]

BinAC2/Getting Started

2026-02-06T17:10:55Z

S Behnle:

== Purpose and Goals ==

The Getting Started guide is designed for users who are new to HPC systems in general and to BinAC 2 specifically. After reading this guide, you should have a basic understanding of how to use BinAC 2 for your research.

Please note that this guide does not cover basic Linux command-line skills. If you're unfamiliar with commands such as listing directory contents or using a text editor, we recommend first exploring the Linux module on the [https://training.bwhpc.de bwHPC training platform].

This guide also doesn't cover every feature of the system but aims to provide a broad overview. For more detailed information about specific features, please refer to the dedicated Wiki pages on topics like the batch system, storage, and more.

Some terms in this guide may be unfamiliar. You can look them up in the [[HPC_Glossary|HPC Glossary]].

== General Workflow of Running a Calculation ==

On an '''HPC Cluster''', you do not simply log in and run your software. Instead, you write a '''Batch Script''' that contains all the commands needed to run and process your job, then submit it to a waiting queue to be executed on one of several hundred '''Compute Nodes'''.

== Get Access to the Cluster ==

Follow the registration process for the bwForCluster. → [[Registration/bwForCluster|How to Register for a bwForCluster]]

== Login to the Cluster ==

Set up your service password and 2FA token, then log in to BinAC 2. → [[BinAC2/Login|Login BinAC]]

== Using the Linux command line ==

It is expected that you have at least basic Linux and command-line knowledge before using bwForCluster BinAC 2.
There are numerous resources available online for learning fundamental concepts and commands.
Here are two:

* bwHPC Linux Training course → [https://training.bwhpc.de/ Linux course on training.bwhpc.de]
* HPC Wiki (external site) → [https://hpc-wiki.info/hpc/Introduction_to_Linux_in_HPC/The_Command_Line Introduction to the Linux command line]

Also see: [[.bashrc Do's and Don'ts]]

= File System Basics =

BinAC 2 offers several file systems for your data, each serving different needs.
These are explained here in a short and simple form. For more detailed documentation, visit: [https://wiki.bwhpc.de/e/BwForCluster_BinAC_Hardware_and_Architecture#Storage_Architecture here].

== Home File System ==

Home directories are intended for the permanent storage of frequently used files, such as source code, configuration files, executable programs, and conda environments.
The home file system is backed up daily and has a quota.
If that quota is reached, you may experience issues when working with BinAC 2.

Here are some useful command line and bash tips for accessing the Home File system.

<source lang="bash">
# For changing to your home directory, simply run:
cd

# To access files in your home directory within your job script, you can use one of these:
~/myFile # or
$HOME/myFile
</source>

== Project File System ==

BinAC 2 has a <code>project</code> file system intended for data that:
* is shared between members of a compute project
* is not actively used for computations in the near future

The data is stored on HDDs. The primary focus of <code>project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check which project you are a member of:

<syntaxhighlight>
$ id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
```
/pfs/10/project/bw16f003/
```
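
The two steps above can be combined in a small helper. This is only a minimal sketch: the function name <code>project_dirs_from_id</code> is hypothetical, and it assumes, as in the example above, that project groups are the only entries in the <code>id</code> output that start with <code>bw</code>.

```bash
# Hypothetical helper: turn the output of `id` into project directory paths.
project_dirs_from_id() {
    # Extract every group name that starts with "bw" (same pattern as above)
    # and print the matching directory below /pfs/10/project/.
    printf '%s\n' "$1" | grep -o 'bw[^)]*' | sort -u | while read -r project; do
        printf '/pfs/10/project/%s/\n' "$project"
    done
}

# Print the project directories of the current user (prints nothing if the
# user is not a member of any project group).
project_dirs_from_id "$(id)"
```

If you are a member of several projects, the helper prints one directory per project.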

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

== Work File System ==

BinAC 2 has a <code>work</code> file system on SSDs intended for data that is actively used and produced by compute jobs.
Each user creates workspaces on their own via the [[BinAC2/Hardware_and_Architecture#Work | workspace tools]].

The work file system is available at <code>/pfs/10/work</code>.

<source lang="bash">
$ ll /pfs/10/work/
total 1822
drwxr-xr-x. 3 root root 33280 Feb 12 14:56 db
drwx------. 5 tu_iioba01 tu_tu 25600 Jan 8 14:42 tu_iioba01-alphafold3
[..]
</source>

As you can see from the file permissions, the resulting workspace can only be accessed by you, not by other group members or other users.

== Scratch ==

Each compute node provides local storage, which is much faster than the <code>project</code> and <code>work</code> file systems.
When you execute a job, a dedicated temporary directory will be assigned to it on the compute node. This is often referred to as the <code>scratch</code> directory.
Programs frequently generate temporary data only needed during execution. If the program you are using offers an option for setting a temporary directory,
please configure it to use the <code>scratch</code> directory.
You can use the environment variable <code>$TMPDIR</code>, which will point to your job's <code>scratch</code> directory.
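
As a minimal sketch of this pattern inside a job script, the snippet below creates a private subdirectory in the job's <code>scratch</code> directory and removes it when the script exits. The fallback to <code>/tmp</code> and the name <code>mytool</code> are assumptions for illustration, so that the snippet also runs outside a job.

```bash
# Use the job's scratch directory; fall back to /tmp outside a job (assumption).
scratch="${TMPDIR:-/tmp}"

# Create a private working directory for intermediate files.
workdir="$(mktemp -d "${scratch}/mytool.XXXXXX")"

# Remove the directory when the script exits, whether it succeeds or fails.
trap 'rm -rf "$workdir"' EXIT

# Write temporary data to scratch instead of the project or work file system.
echo "intermediate result" > "${workdir}/step1.tmp"
ls "$workdir"
```

Results you want to keep must be copied to a workspace or project directory before the job ends.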

= Batch System Basics =

On HPC clusters like BinAC 2, you don't run analyses directly on the login node.
Instead, you write a script and submit it as a job to the batch system.
BinAC 2 uses SLURM as its batch system.
The system then schedules the job to run on one of the available compute nodes, where the actual computation takes place.

The cluster consists of compute nodes with different [[BinAC2/Hardware_and_Architecture#Compute_Nodes | hardware features]].
These hardware features are only available when submitting the jobs to the correct [[BinAC2/SLURM_Partitions | partitions]].

The getting started guide only provides very basic SLURM information.
Please read the extensive [[BinAC2/Slurm | SLURM documentation]].

== Simple Script Job ==

You will have to write job scripts in order to conduct your computations on BinAC 2.
Use your favourite text editor to create a simple job script called 'myjob.sh'.

{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
Please note that there are differences between Windows and Linux line endings.
Make sure that your editor uses Linux line endings when you are using Windows.
You can check your line endings with <code>vim -b <your script></code>. Windows line endings will be displayed as <code>^M</code>.
|}

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem=5000m
#SBATCH --job-name=simple

echo "Scratch directory: $TMPDIR"
echo "Date:"
date

echo "My job is running on node:"
hostname
uname -a

sleep 240
</source>
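
The line-ending check mentioned in the note above can also be scripted. The following is a minimal sketch, not an official tool: the helper names <code>has_crlf</code> and <code>strip_crlf</code> are hypothetical, and the in-place conversion assumes GNU <code>sed</code>.

```bash
# Carriage return character, the extra byte in Windows line endings.
cr="$(printf '\r')"

# Return success if the file contains at least one Windows line ending.
has_crlf() {
    grep -q "${cr}\$" "$1"
}

# Convert Windows line endings to Linux line endings in place
# (assumes GNU sed, as found on typical Linux systems).
strip_crlf() {
    sed -i "s/${cr}\$//" "$1"
}

# Demonstration on a throwaway file with hypothetical content.
demo="$(mktemp)"
printf '#!/bin/bash\r\nhostname\r\n' > "$demo"
if has_crlf "$demo"; then
    echo "Windows line endings found, converting"
    strip_crlf "$demo"
fi
rm -f "$demo"
```

For your own script, replace the demonstration part with a call such as <code>has_crlf myjob.sh</code>.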

== Basic SLURM commands ==

Submit the job script you wrote with <code>sbatch</code>.

<source lang="bash">
$ sbatch myjob.sh
Submitted batch job 75441
</source>

Take note of your <code>jobID</code>. The scheduler will reserve one core and 5000 MB of memory for 10 minutes on a compute node for your job.
The job should be scheduled within seconds if BinAC 2 is not fully busy.
The output will be stored in a file called <code>slurm-<JobID>.out</code>:

<source lang="bash">
$ cat slurm-75441.out
Scratch directory: /scratch/75441
Date:
Thu Feb 13 09:56:41 AM CET 2025
My job is running on node:
node1-083
Linux node1-083 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15 12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
</source>
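
If you script around <code>sbatch</code>, the job ID can be taken from the submission message. The sketch below uses a hypothetical helper <code>job_id_from_message</code> and a hard-coded message so that it runs without a scheduler. As an alternative, <code>sbatch --parsable</code> prints the job ID without the surrounding text.

```bash
# Hypothetical helper: take the job ID from the sbatch submission message.
job_id_from_message() {
    # The job ID is the last word of "Submitted batch job <JobID>".
    printf '%s\n' "${1##* }"
}

# Hard-coded for illustration; in practice: message="$(sbatch myjob.sh)"
message="Submitted batch job 75441"
jobid="$(job_id_from_message "$message")"
echo "Job output will be written to slurm-${jobid}.out"
```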

There are tons of options, details and caveats for SLURM job scripts.
Most of them are explained in the [[BinAC2/Slurm | SLURM documentation]].
If you encounter any problems, just send a mail to hpcmaster@uni-tuebingen.de.

You can get an overview of your queued and running jobs with <code>squeue</code>:

<source lang="bash">
[tu_iioba01@login01 ~]$ squeue --user=$USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
75441 compute simple tu_iioba R 0:03 1 node1-083
</source>

Let's assume you pulled a Homer and want to stop/kill/remove a running job.

<source lang="bash">
scancel <JobID>
</source>



= Software =

There are several ways in which software can be provided on BinAC 2.
If you need software that is not installed on BinAC 2, open a ticket and we can find a way to provide the software on the cluster.

== Environment Modules ==

Environment modules are the 'classic' way of providing software on clusters.
A module consists of a specific software version and can be loaded.
The module system then manipulates the PATH and other environment variables such that the software can be used.

<source lang="bash">
# Show available modules
$ module avail

# Load a module
$ module load bio/samtools/1.21

# Show the module's help
$ module help bio/samtools/1.21
</source>

A more detailed description of module environments can be found [https://wiki.bwhpc.de/e/Environment_Modules on this wiki page].

Sometimes a software package has so many dependencies, or you need such a specific combination of tools, that environment modules are not practical.
In that case, other solutions such as Conda environments or Apptainer containers (see below) can be used.

== Conda Environments ==

Conda environments are a convenient way to create custom environments on the cluster, as most scientific software is now available as Conda packages.
BinAC 2 already provides Conda via Miniforge.
You can find general documentation for using Conda [[Development/Conda | on this wiki page]].

== Apptainer (formerly Singularity) ==

Sometimes software is also available in a software container format.
Apptainer (formerly called Singularity) is installed on all BinAC 2 nodes. You can pull Apptainer containers or Docker images from registries onto BinAC 2 and use them.
You can also build new Apptainer containers on your own machine and copy them to BinAC 2.

Please note that Apptainer containers should be stored in the <code>project</code> file system.
We configured Apptainer such that containers stored in your home directory do not work.

BinAC2/Slurm

2026-01-07T08:47:56Z

S Behnle: /* GPU jobs */

= General information about Slurm =

Any calculation on the compute nodes of bwForCluster BinAC 2 must be defined as a '''batch job''': a single command or a sequence of commands together with the required run time, number of CPU cores, and main memory. The batch job is submitted to a resource and workload manager. BinAC 2 uses the workload manager Slurm, so all jobs are submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

= External Slurm documentation =

You can find the official Slurm documentation and some other material here:

* Slurm documentation: https://slurm.schedmd.com/documentation.html
* Slurm cheat sheet: https://slurm.schedmd.com/pdfs/summary.pdf
* Slurm tutorials: https://slurm.schedmd.com/tutorials.html

= SLURM terminology =

SLURM knows and mirrors the division of the cluster into '''nodes''' with several '''cores'''. When queuing '''jobs''', there are several ways of requesting resources and it is important to know which term means what in SLURM. Here are some basic SLURM terms:

;Job
: A job is a self-contained computation that may encompass multiple tasks and is given specific resources like individual CPUs/GPUs, a specific amount of RAM or entire nodes. These resources are said to have been allocated for the job.

;Task
: A task is a single run of a single process. By default, one task is run per node and one CPU is assigned per task.

;Partition
: A partition (usually called queue outside SLURM) is a waiting line in which jobs are put by users.

;Socket
: Receptacle on the motherboard for one physically packaged processor (each of which can contain one or more cores).

;Core
: A complete private set of registers, execution units, and retirement queues needed to execute programs.

;Thread
: One or more hardware contexts within a single core. Each thread has the attributes of one core and is managed and scheduled as a single logical processor by the OS.

;CPU
: A '''CPU''' in Slurm means a '''single core'''. This is different from the more common terminology, where a CPU (a microprocessor chip) consists of multiple cores. Slurm uses the term '''sockets''' when talking about CPU chips. Depending upon system configuration, a CPU can be either a '''core''' or a '''thread'''. On '''BinAC 2 Hyperthreading is activated on every machine'''. This means that the operating system and Slurm see each physical core as two logical cores.

= Slurm Commands =

{| width=750px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue
|-
| [https://slurm.schedmd.com/salloc.html salloc] || Requests resources for an interactive job
|-
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs
|-
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information
|-
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job
|-
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job
|-
|}

== Interactive Jobs ==

You can run interactive jobs for testing and developing your job scripts.
Several nodes are reserved for interactive work, so your jobs should start right away.
You can only submit one job to this partition at a time. A job can run for up to 10 hours (about one workday).

This example command gives you 16 cores and 128 GB of memory for four hours on one of the reserved nodes:

<pre>
salloc --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb
</pre>

You can also use srun to request the same resources:

<pre>
srun --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb --pty bash
</pre>

== Job Submission : sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources that are needed to run the job. '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
=== sbatch Command Parameters ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. The following table shows the syntax and provides examples for each option.

{| class="wikitable"
! colspan="5" | sbatch Options
|-
! Command line
! Job Script
! Purpose
! Example
! Default value
|- style="vertical-align:top;"
| <code>-t ''time''</code> or <code>--time=''time''</code>
| #SBATCH --time=''time''
| Wall clock time limit. 
| <code>-t 2:30:00</code> Limits run time to 2 h 30 min.<br><code>-t 2-12</code> Limits run time to 2 days and 12 hours.
| Depends on Slurm partition.
|-
|- style="vertical-align:top;"
| -N ''count'' or --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
| <code>-N 1</code> Run job on one node.<br><code>-N 2</code> Run job on two nodes (requires MPI).
|
|- style="vertical-align:top;"
| -n ''count'' or --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
| <code>-n 2</code> launch two tasks in the job.
| One task per node
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node. (Replaces the option <code>ppn</code> of MOAB.)
| <code>--ntasks-per-node=2</code> Run 2 tasks per node
| 1 task per node
|-
|- style="vertical-align:top;"
| -c ''count'' or --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
| <code>-c 2</code> Request two CPUs per (MPI-)task.
| 1 CPU per (MPI-)task
|-
|- style="vertical-align:top;"
| <code>--mem=<size>[units]</code>
| #SBATCH --mem=''value_in_MB''
| Memory in megabytes per node. <code>[units]</code> can be one of <code>[K<nowiki>|</nowiki>M<nowiki>|</nowiki>G<nowiki>|</nowiki>T]</code>.
| <code>--mem=10g</code> Request 10 GB RAM per node.<br><code>--mem=0</code> Request all memory on the node.
| Depends on Slurm configuration. It is better to specify <code>--mem</code> in every case.
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J ''name'' or --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added.
|-
|- style="vertical-align:top;"
| -A ''group-name'' or --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of <code>scontrol show job</code> shows the project group the job is accounted to after <code>Account=</code>.
|-
|- style="vertical-align:top;"
| -p ''queue-name'' or --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C ''LSDF'' or --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF Filesystems.
|-
|}
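
The time formats accepted by <code>--time</code> are easy to misread, for example <code>10:00</code> means 10 minutes, not 10 hours. The following sketch converts a time specification to seconds so you can check a value before submitting. The helper name <code>slurm_time_to_seconds</code> is hypothetical, and it only handles the forms <code>minutes</code>, <code>minutes:seconds</code>, <code>hours:minutes:seconds</code> and <code>days-hours[:minutes[:seconds]]</code>.

```bash
# Hypothetical helper: convert a Slurm time specification to seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        spec = $0; days = 0
        dash = index(spec, "-")
        if (dash > 0) {
            # Form: days-hours[:minutes[:seconds]]
            days = substr(spec, 1, dash - 1)
            n = split(substr(spec, dash + 1), f, ":")
            h = f[1]; m = (n > 1) ? f[2] : 0; s = (n > 2) ? f[3] : 0
        } else {
            n = split(spec, f, ":")
            if (n == 1)      { h = 0;    m = f[1]; s = 0 }     # minutes
            else if (n == 2) { h = 0;    m = f[1]; s = f[2] }  # minutes:seconds
            else             { h = f[1]; m = f[2]; s = f[3] }  # hours:minutes:seconds
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 10:00      # 600    (10 minutes)
slurm_time_to_seconds 2:30:00    # 9000   (2 h 30 min)
slurm_time_to_seconds 2-12       # 216000 (2 days and 12 hours)
```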
 

==== sbatch --partition ''queues'' ====
Partitions (queues) define the maximum resources a job may use, such as walltime, number of nodes, and processes per node. Details can be found here:
* [[BinAC2/SLURM_Partitions|BinAC 2 partitions]]
 

=== sbatch Examples ===

If you are coming from Moab/Torque on BinAC 1 or you are new to HPC/Slurm, the <code>sbatch</code> options may confuse you. The following examples show how to run typical workloads on BinAC 2.

You can find every file mentioned on this Wiki page on BinAC 2 at: <code>/pfs/10/project/examples</code>

==== Serial Programs ====
When you use serial programs that use only one process, you can omit most of the <code>sbatch</code> parameters, as the default values are sufficient.

The following example submits a serial job that runs the script <code>serial_job.sh</code> and requires 5000 MB of main memory and 10 minutes of wall clock time. Slurm will allocate one '''physical''' core to your job.

a) execute:
<pre>
$ sbatch -p compute -t 10:00 --mem=5000m serial_job.sh
</pre>
or
b) add after the initial line of your script '''serial_job.sh''' the lines:
<source lang="bash">
#SBATCH --time=10:00
#SBATCH --mem=5000m
#SBATCH --job-name=simple-serial-job
</source>
and execute the modified script with the command line option ''--partition=compute'':
<pre>
$ sbatch --partition=compute serial_job.sh
</pre>
Note that sbatch command line options override script options.
 
 

==== Multithreaded Programs ====

Multithreaded programs run their processes on multiple threads and share resources such as memory. 
You may use a program that includes a built-in option for multithreading (e.g., options like <code>--threads</code>). 
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable <code>OMP_NUM_THREADS</code>. By default, this variable is set to 1 (<code>OMP_NUM_THREADS=1</code>).

'''Important:''' Hyperthreading is activated on bwForCluster BinAC 2. Hyperthreading can be beneficial for some applications and codes, but it can also degrade performance in other cases. We therefore recommend running a small test job with and without hyperthreading to determine the best choice.

'''a) Program with built-in multithreading option'''

The example uses the common bioinformatics software <code>samtools</code> to demonstrate built-in multithreading.

The module <code>bio/samtools/1.21</code> provides an example jobscript that requests 4 CPUs and runs <code>samtools sort</code> with 4 threads.

<pre>
#!/bin/bash

#SBATCH --time=19:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5000m
#SBATCH --partition compute
[...]
samtools sort -@ 4 sample.bam -o sample.sorted.bam
</pre>

You can use the example jobscript with this command:

<pre>
sbatch /opt/bwhpc/common/bio/samtools/1.21/bwhpc-examples/binac2-samtools-1.21-bwhpc-examples.slurm
</pre>

'''b) OpenMP'''

We will run an example OpenMP Hello-World program. The jobscript looks like this:

<pre>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=1:00
#SBATCH --mem=5000m
#SBATCH -J OpenMP-Hello-World

export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))

echo "Executable running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"

# Run parallel Hello World
/pfs/10/project/examples/openmp_hello_world
</pre>

Submit the job to the <code>compute</code> partition and check the output (in the stdout file):

<pre>
sbatch --partition=compute /pfs/10/project/examples/openmp_hello_world.sh

Executable running on 4 cores with 4 threads
Hello from process: 0
Hello from process: 2
Hello from process: 1
Hello from process: 3
</pre>
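
The division by two in the jobscript reflects hyperthreading: Slurm counts two logical CPUs per physical core. The following is a minimal sketch of the same calculation with a lower bound of one thread. The helper name <code>threads_for_cpus</code> is hypothetical. The variable <code>SLURM_CPUS_PER_TASK</code> is set by Slurm when <code>--cpus-per-task</code> is given, and the default of 2 is an assumption so that the snippet also runs outside a job.

```bash
# Hypothetical helper: one OpenMP thread per physical core,
# i.e. half the number of logical CPUs, but never less than one.
threads_for_cpus() {
    threads=$(( $1 / 2 ))
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}

# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given;
# the default of 2 is an assumption for running the snippet outside a job.
OMP_NUM_THREADS="$(threads_for_cpus "${SLURM_CPUS_PER_TASK:-2}")"
export OMP_NUM_THREADS
echo "Using ${OMP_NUM_THREADS} OpenMP threads"
```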

==== OpenMPI ====

If you want to run MPI-jobs on batch nodes, generate a wrapper script <code>mpi_hello_world.sh</code> for '''OpenMPI''' containing the following lines:

<source lang="bash">
#!/bin/bash

#SBATCH --partition compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2000
#SBATCH --time=05:00

# Load the MPI implementation of your choice
module load mpi/openmpi/4.1-gnu-14.2

# Run your MPI program
mpirun --bind-to core --map-by core --report-bindings mpi_hello_world
</source>

'''Attention:''' Do '''NOT''' add mpirun options <code>-n <number_of_processes></code> or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.

'''ALWAYS''' use the mpirun options <code>--bind-to core</code> and <code>--map-by core|socket|node</code>.
Please type <code>man mpirun</code> for an explanation of the different values of the mpirun option <code>--map-by</code>.

The above jobscript runs four OpenMPI tasks, distributed between two nodes. Because of hyperthreading you have to set <code>--cpus-per-task=2</code>. This means each MPI-task will get one physical core. If you omit <code>--cpus-per-task=2</code> MPI will fail.

'''Attention:''' Not all compute nodes are connected via Infiniband. Tell Slurm you want Infiniband via <code>--constraint=ib</code> when submitting or add <code>#SBATCH --constraint=ib</code> to your jobscript.

<pre>
$ sbatch --constraint=ib /pfs/10/project/examples/mpi_hello_world.sh
</pre>

This will run a simple Hello World program:

<pre>
[...]
Hello world from processor node2-031, rank 3 out of 4 processors
Hello world from processor node2-031, rank 2 out of 4 processors
Hello world from processor node2-030, rank 1 out of 4 processors
Hello world from processor node2-030, rank 0 out of 4 processors

</pre>

==== Multithreaded + MPI parallel Programs ====

Multithreaded + MPI parallel programs run faster than serial programs on systems with multiple CPUs and multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned across different nodes. '''Because hyperthreading is switched on on BinAC 2, the option --cpus-per-task (-c) must be set to 2*n if you want to use n threads.'''
 
 
===== OpenMPI with Multithreading =====
Multiple MPI tasks using '''OpenMPI''' must be launched by the MPI parallel program '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default, this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''For OpenMPI''', the following job script ''job_ompi_omp.sh'' submits a batch job that runs the MPI program ''ompi_omp_program'' with 4 tasks and 28 threads per task. It requires 3000 MByte of physical memory per thread (with 28 threads per MPI task, this amounts to 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours:

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=56
#SBATCH --time=03:00:00
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up to 83 GB
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Execute the script '''job_ompi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p compute ./job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
* With the option ''--map-by node:PE=<value>'' (neighbored) MPI tasks will be attached to different nodes and each MPI task is bound to the first core of a node. <value> must be set to ${OMP_NUM_THREADS}.
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program.
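
The resource values in the job script above follow from the number of threads per MPI task. As a rough sketch of that arithmetic (the variable names are chosen for illustration only):

```bash
# Values from the example above.
threads=28               # OpenMP threads per MPI task
mem_per_thread_mb=3000   # memory per thread in MB

# Two logical CPUs per thread because of hyperthreading.
cpus_per_task=$(( 2 * threads ))

# Memory per MPI task in MB, then rounded up to whole GB (1 GB = 1024 MB).
mem_per_task_mb=$(( threads * mem_per_thread_mb ))
mem_per_task_gb=$(( (mem_per_task_mb + 1023) / 1024 ))

echo "--cpus-per-task=${cpus_per_task} --mem=${mem_per_task_gb}gb"
# prints: --cpus-per-task=56 --mem=83gb
```

Adding 1023 before the integer division rounds up, so the request is never smaller than the memory actually needed.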
 

==== GPU jobs ====

The nodes in the <code>gpu</code> queue have 2 or 4 NVIDIA A30/A100/H200 GPUs. Submitting a job to this queue is not enough to allocate one or more GPUs; you also have to request them with the <code>--gres=gpu</code> parameter. You have to specify how many GPUs your job needs, e.g. <code>--gres=gpu:a30:2</code> will request two NVIDIA A30 GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) add the line with the GPU request, <code>#SBATCH --gres=gpu:a30:2</code>, after the initial line of your script job.sh:
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:a30:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:a30:2 job.sh
</pre>
 
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Sun Mar 29 15:20:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:3A:00.0 Off | 0 |
| N/A 29C P0 39W / 300W | 9MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 8MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 14228 G /usr/bin/X 8MiB |
| 1 14228 G /usr/bin/X 8MiB |
+-----------------------------------------------------------------------------+
</pre>

Upon successful GPU resource allocation, SLURM will set the environment variable <code>CUDA_VISIBLE_DEVICES</code> appropriately. Do not change this variable!
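
Reading the variable is safe. The following sketch counts the GPUs visible to a job without modifying <code>CUDA_VISIBLE_DEVICES</code>. The helper name <code>count_visible_gpus</code> is hypothetical, and the sketch assumes the variable holds a comma-separated list of device identifiers.

```bash
# Hypothetical helper: count the entries in a comma-separated device list.
count_visible_gpus() {
    if [ -z "$1" ]; then
        echo 0
        return
    fi
    printf '%s\n' "$1" | awk -F, '{ print NF }'
}

# Read the variable without modifying it; it is empty outside a GPU job.
echo "GPUs allocated to this job: $(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```

A job script can use such a check to stop early if fewer GPUs were allocated than the program expects.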

 
In case of using OpenMPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
However, there may be warnings, e.g. when running
<pre>
$ module load compiler/gnu/10.3 mpi/openmpi devel/cuda
$ mpirun -np 2 ./mpi_cuda_app
--------------------------------------
WARNING: There are more than one active ports on host 'uc2n520', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
</pre>

Please run OpenMPI's mpirun using the following command:
<pre>
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
</pre>
or disable the (older) communication layer, the openib Byte Transfer Layer (BTL), altogether:
<pre>
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
</pre>

(Please note that CUDA, as of v12.8, is officially supported only with GCC versions up to GCC 11.)
 
 

== Start time of job or resources : squeue --start ==
This command displays the estimated start time of a job, based on several types of analysis: historical usage, earliest available reservable resources, and priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by '''any user'''.
 
 

== List of your submitted jobs : squeue ==
Displays information about your own active, pending, and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by any user.
 
 

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -l, --long
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.
|}
 

=== Examples ===
''squeue'' example on BinAC 2 (only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18088744 single CPV.sbat ab1234 PD 0:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PD 0:00 2 (Priority)
18090089 multiple CPV.sbat ab1234 R 2:27 2 uc2n[127-128]
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
18088654 single CPV.sbat ab1234 COMPLETI 4:29 2:00:00 1 uc2n374
18088785 single CPV.sbat ab1234 PENDING 0:00 2:00:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PENDING 0:00 2:00:00 2 (Priority)
18088683 single CPV.sbat ab1234 RUNNING 0:14 2:00:00 1 uc2n413
</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
 

== Shows free resources : sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Access ===
By default, this command can be used by any user or administrator.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_multiple : 8 nodes idle
Partition multiple : 332 nodes idle
Partition dev_single : 4 nodes idle
Partition single : 76 nodes idle
Partition long : 80 nodes idle
Partition fat : 5 nodes idle
Partition dev_special : 342 nodes idle
Partition special : 342 nodes idle
Partition dev_multiple_e: 7 nodes idle
Partition multiple_e : 335 nodes idle
Partition gpu_4 : 12 nodes idle
Partition gpu_8 : 6 nodes idle
</pre>
* In the above example, jobs in all partitions can run immediately.
 

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 
=== Access ===
* End users can use scontrol show job to view the status of their '''own jobs''' only.
 

=== Arguments ===
{| width=750px class="wikitable"
|-
! Option !! Default !! Description !! Example
|- style="vertical-align:top;"
|- style="width:12%;"
| -d
| (n/a)
| Detailed mode
| Example: Display the state of the job with jobid 18089884 in detailed mode. <pre>scontrol -d show job 18089884</pre>
|}
 
 

=== Scontrol show job Example ===
Here is an example from BinAC 2.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18089884 multiple CPV.sbat bq0742 R 33:44 2 uc2n[165-166]

$
$ # now, see what's up with my running job with jobid 18089884
$
$ scontrol show job 18089884

JobId=18089884 JobName=CPV.sbatch
UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A
Priority=3 Nice=0 Account=kit QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54
AccrueTime=2020-03-16T14:14:54
StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51
Partition=multiple AllocNode:Sid=uc2n995:5064
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc2n[165-166]
BatchHost=uc2n165
NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=160,mem=96320M,node=2,billing=160
Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch
WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin
StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
StdIn=/dev/null
StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
Power=
MailUser=(null) MailType=NONE
</pre>
 
You can use standard Linux pipes to filter the very detailed scontrol show job output.
* Which state is the job in?
<pre>$ scontrol show job 18089884 | grep -i State
JobState=COMPLETED Reason=None Dependency=(null)
</pre>
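Beyond grep, a single field can be extracted from the KEY=VALUE output with standard tools. The following is a minimal sketch: the helper name <code>scontrol_field</code> is hypothetical (it is not part of Slurm), and it reads text from standard input so that it can be tried on saved output without access to the cluster. It assumes that the value of the requested field contains no spaces.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): print the value of one KEY=VALUE
# field from "scontrol show job" output read on standard input.
# Assumption: the value itself contains no spaces.
scontrol_field() {
    local key=$1
    # Put every KEY=VALUE pair on its own line, keep the requested key,
    # and print everything after the first "=".
    tr -s ' ' '\n' | grep "^${key}=" | head -n 1 | cut -d= -f2-
}

# Sample line taken from the scontrol output shown above.
echo "JobState=RUNNING Reason=None Dependency=(null)" | scontrol_field JobState
```

On the cluster, the same helper could be used as <code>scontrol show job 18089884 | scontrol_field JobState</code>.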
 

== Cancel Slurm Jobs ==
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
 
 
=== Canceling own jobs : scancel ===
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

 
{| width=750px class="wikitable"
! Flag !! Default !! Description !! Example
|- style="vertical-align:top;"
| -i, --interactive
| (n/a)
| Interactive mode.
| Cancel the job 987654 interactively. <pre> scancel -i 987654 </pre>
|-
| -t, --state
| (n/a)
| Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
| Cancel all jobs in state "PENDING". <pre> scancel -t "PENDING" </pre>
|}
 

= Resource Managers =
=== Batch Job (Slurm) Variables ===
The following environment variables of Slurm are added to your environment once your job has started
(only an excerpt of the most important ones).
{| width=750px class="wikitable"
! Environment !! Brief explanation
|-
| SLURM_JOB_CPUS_PER_NODE
| Number of CPUs per node dedicated to the job
|-
| SLURM_JOB_NODELIST
| List of nodes dedicated to the job
|-
| SLURM_JOB_NUM_NODES
| Number of nodes dedicated to the job
|-
| SLURM_MEM_PER_NODE
| Memory per node dedicated to the job
|-
| SLURM_NPROCS
| Total number of processes dedicated to the job
|-
| SLURM_CLUSTER_NAME
| Name of the cluster executing the job
|-
| SLURM_CPUS_PER_TASK
| Number of CPUs requested per task
|-
| SLURM_JOB_ACCOUNT
| Account name
|-
| SLURM_JOB_ID
| Job ID
|-
| SLURM_JOB_NAME
| Job Name
|-
| SLURM_JOB_PARTITION
| Partition/queue running the job
|-
| SLURM_JOB_UID
| User ID of the job's owner
|-
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which sbatch was invoked.
|-
| SLURM_JOB_USER
| User name of the job's owner
|-
| SLURM_RESTART_COUNT
| Number of times job has restarted
|-
| SLURM_PROCID
| Task ID (MPI rank)
|-
| SLURM_NTASKS
| The total number of tasks available for the job
|-
| SLURM_STEP_ID
| Job step ID
|-
| SLURM_STEP_NUM_TASKS
| Task count (number of MPI ranks)
|-
| SLURM_JOB_CONSTRAINT
| Job constraints
|}
See also:
* [https://slurm.schedmd.com/sbatch.html#lbAI Slurm input and output environment variables]
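As an illustration of how these variables can be used, the sketch below prints a short summary at the start of a job script. The function name <code>job_summary</code> is hypothetical, and the fallback value "n/a" is an assumption added so that the snippet also runs outside of a Slurm job.

```shell
#!/bin/bash
# Sketch: summarise the job environment at the top of a job script.
# Each variable falls back to "n/a" so the function also works outside a job.
job_summary() {
    echo "Job ${SLURM_JOB_ID:-n/a} (${SLURM_JOB_NAME:-n/a}) in partition ${SLURM_JOB_PARTITION:-n/a}"
    echo "Nodes: ${SLURM_JOB_NUM_NODES:-n/a} (${SLURM_JOB_NODELIST:-n/a}), tasks: ${SLURM_NTASKS:-n/a}"
    echo "Submitted from: ${SLURM_SUBMIT_DIR:-n/a}"
}

job_summary
```

Writing such a summary to the job's output file can make it easier to match output files to jobs later on.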
 

=== Job Exit Codes ===
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
 
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
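The wrap-around into the 0 to 255 range can be reproduced with plain shell arithmetic. This is a small sketch only: the function name is hypothetical, and it illustrates the 8-bit truncation without querying Slurm.

```shell
#!/bin/bash
# Hypothetical helper: show which value ends up in the 0-255 range when a
# program returns an arbitrary integer (only the lowest 8 bits are kept).
to_unsigned_exit_code() {
    echo $(( $1 & 255 ))
}

to_unsigned_exit_code -1     # prints 255
to_unsigned_exit_code 256    # prints 0
to_unsigned_exit_code 3      # prints 3
```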
 
 
==== Displaying Exit Codes and Signals ====
SLURM displays a job's exit code in the output of the '''scontrol show job''' and the sview utility.
 
When a signal was responsible for a job's or step's termination, the signal number is displayed after the exit code, separated by a colon (:).
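The two parts of such a pair can be separated with shell parameter expansion. The following sketch uses a hypothetical helper name and assumes the pair has the form "exit_code:signal", as in the ExitCode=0:0 field of the scontrol show job output.

```shell
#!/bin/bash
# Hypothetical helper: split an "exit_code:signal" pair, as printed in the
# ExitCode field, into its two parts using plain parameter expansion.
describe_exit_code() {
    local pair=$1
    local code=${pair%%:*}     # text before the colon
    local signal=${pair##*:}   # text after the colon
    if [ "$signal" -ne 0 ]; then
        echo "terminated by signal ${signal}"
    elif [ "$code" -ne 0 ]; then
        echo "failed with exit code ${code}"
    else
        echo "completed successfully"
    fi
}

describe_exit_code "0:0"   # prints: completed successfully
describe_exit_code "1:0"   # prints: failed with exit code 1
describe_exit_code "0:9"   # prints: terminated by signal 9
```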
 
 
==== Submitting Termination Signal ====
Here is an example of how to 'save' the exit code of a program in a typical jobscript.
<source lang="bash">
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
exit_code=$?   # must be read immediately after the command of interest
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
</source>
* Do not use ''''time'''' mpirun! The exit code will be the one returned by the first (time) program.
* You do not need an '''exit $exit_code''' in the scripts.
 
 
----
[[#top|Back to top]]

BinAC2/Slurm

2026-01-07T08:46:11Z

S Behnle: /* GPU jobs */

= General information about Slurm =

Any calculation on the compute nodes of bwForCluster BinAC 2 requires you to define it as a single command or a sequence of commands, together with the required run time, number of CPU cores, and main memory, and to submit all of this, i.e., the '''batch job''', to a resource and workload manager. BinAC 2 uses the workload manager Slurm, so every job is submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

= External Slurm documentation =

You can find the official Slurm documentation and some other material here:

* Slurm documentation: https://slurm.schedmd.com/documentation.html
* Slurm cheat sheet: https://slurm.schedmd.com/pdfs/summary.pdf
* Slurm tutorials: https://slurm.schedmd.com/tutorials.html

= SLURM terminology =

SLURM knows and mirrors the division of the cluster into '''nodes''' with several '''cores'''. When queuing '''jobs''', there are several ways of requesting resources and it is important to know which term means what in SLURM. Here are some basic SLURM terms:

;Job
: A job is a self-contained computation that may encompass multiple tasks and is given specific resources like individual CPUs/GPUs, a specific amount of RAM or entire nodes. These resources are said to have been allocated for the job.

;Task
: A task is a single run of a single process. By default, one task is run per node and one CPU is assigned per task.

;Partition
: A partition (usually called queue outside SLURM) is a waiting line in which jobs are put by users.

;Socket
: Receptacle on the motherboard for one physically packaged processor (each of which can contain one or more cores).

;Core
: A complete private set of registers, execution units, and retirement queues needed to execute programs.

;Thread
: One or more hardware contexts within a single core. Each thread has the attributes of one core and is managed and scheduled as a single logical processor by the OS.

;CPU
: A '''CPU''' in Slurm means a '''single core'''. This is different from the more common terminology, where a CPU (a microprocessor chip) consists of multiple cores. Slurm uses the term '''sockets''' when talking about CPU chips. Depending upon system configuration, a CPU can be either a '''core''' or a '''thread'''. On '''BinAC 2, hyperthreading is activated on every machine'''. This means that the operating system and Slurm see each physical core as two logical cores.

= Slurm Commands =

{| width=750px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue
|-
| [https://slurm.schedmd.com/salloc.html salloc] || Requests resources for an interactive job
|-
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs
|-
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information
|-
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job
|-
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job
|-
|}

== Interactive Jobs ==

You can run interactive jobs for testing and developing your job scripts.
Several nodes are reserved for interactive work, so your jobs should start right away.
You can only submit one job to this partition at a time. A job can run for up to 10 hours (about one workday).

This example command gives you 16 cores and 128 GB of memory for four hours on one of the reserved nodes:

<pre>
salloc --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb
</pre>

You can also use srun to request the same resources:

<pre>
srun --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb --pty bash
</pre>

== Job Submission : sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the batch job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
=== sbatch Command Parameters ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. The following table shows the syntax and provides examples for each option.

{| class="wikitable"
! colspan="5" | sbatch Options
|-
! Command line
! Job Script
! Purpose
! Example
! Default value
|- style="vertical-align:top;"
| <code>-t ''time''</code> or <code>--time=''time''</code>
| #SBATCH --time=''time''
| Wall clock time limit. 
| <code>-t 2:30:00</code> Limits run time to 2h 30 min. <code>-t 2-12</code> Limits run time to 2 days and 12 hours.
| Depends on Slurm partition.
|-
|- style="vertical-align:top;"
| -N ''count'' or --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
| <code>-N 1</code> Run job on one node. <code>-N 2</code> Run job on two nodes (have to use MPI!)
|
|- style="vertical-align:top;"
| -n ''count'' or --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
| <code>-n 2</code> launch two tasks in the job.
| One task per node
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node. (Replaces the option <code>ppn</code> of MOAB.)
| <code>--ntasks-per-node=2</code> Run 2 tasks per node
| 1 task per node
|-
|- style="vertical-align:top;"
| -c ''count'' or --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
| <code>-c 2</code> Request two CPUs per (MPI-)task.
| 1 CPU per (MPI-)task
|-
|- style="vertical-align:top;"
| <code>--mem=<size>[units]</code>
| #SBATCH --mem=''value_in_MB''
| Memory per node, in megabytes by default. <code>[units]</code> can be one of <code>[K<nowiki>|</nowiki>M<nowiki>|</nowiki>G<nowiki>|</nowiki>T]</code>.
| <code>--mem=10g</code> Request 10 GB RAM per node. <code>--mem=0</code> Request all memory on node.
| Depends on Slurm configuration. It is better to specify <code>--mem</code> in every case.
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J ''name'' or --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added.
|-
|- style="vertical-align:top;"
| -A ''group-name'' or --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The project group a job is accounted to is shown after "Account=" in the output of "scontrol show job".
|-
|- style="vertical-align:top;"
| -p ''queue-name'' or --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C ''LSDF'' or --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF Filesystems.
|-
|}
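The time formats accepted by <code>--time</code> are easy to misread: <code>10:00</code> means 10 minutes, not 10 hours. The sketch below converts a time specification into seconds so that a limit can be checked before submitting. The function name is hypothetical, and the list of formats is an assumption taken from the sbatch man page (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds).

```shell
#!/bin/bash
# Hypothetical helper: convert a Slurm time specification to seconds.
# Assumed formats (from the sbatch man page): M, M:S, H:M:S, D-H, D-H:M, D-H:M:S
slurm_time_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0
    local a b c
    if [[ $spec == *-* ]]; then
        # With a day part, the remaining fields start at hours.
        days=${spec%%-*}
        IFS=: read -r hours minutes seconds <<< "${spec#*-}"
    else
        IFS=: read -r a b c <<< "$spec"
        if [[ -n $c ]]; then
            hours=$a minutes=$b seconds=$c   # H:M:S
        else
            minutes=$a seconds=$b            # M or M:S
        fi
    fi
    # "10#" forces base 10 so that values such as "08" are not read as octal.
    echo $(( 10#${days:-0} * 86400 + 10#${hours:-0} * 3600 \
           + 10#${minutes:-0} * 60 + 10#${seconds:-0} ))
}

slurm_time_to_seconds 2:30:00   # prints 9000   (2 h 30 min)
slurm_time_to_seconds 2-12      # prints 216000 (2 days 12 hours)
slurm_time_to_seconds 10:00     # prints 600    (10 minutes)
```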
 

==== sbatch --partition ''queues'' ====
Queue classes define the maximum resources, such as walltime, nodes, and processes per node, that are available in each queue of the compute system. Details can be found here:
* [[BinAC2/SLURM_Partitions|BinAC 2 partitions]]
 

=== sbatch Examples ===

If you are coming from Moab/Torque on BinAC 1 or you are new to HPC/Slurm, the <code>sbatch</code> options may confuse you. The following examples show how to run typical workloads on BinAC 2.

You can find every file mentioned on this Wiki page on BinAC 2 at: <code>/pfs/10/project/examples</code>

==== Serial Programs ====
When you use serial programs that use only one process, you can omit most of the <code>sbatch</code> parameters, as the default values are sufficient.

To submit a serial job that runs the script <code>serial_job.sh</code> and requires 5000 MB of main memory and 10 minutes of wall clock time, use one of the two options below. Slurm will allocate one '''physical''' core to your job.

a) execute:
<pre>
$ sbatch -p compute -t 10:00 --mem=5000m serial_job.sh
</pre>
or
b) add after the initial line of your script '''serial_job.sh''' the lines:
<source lang="bash">
#SBATCH --time=10:00
#SBATCH --mem=5000m
#SBATCH --job-name=simple-serial-job
</source>
and submit the modified script with the command line option ''--partition=compute'':
<pre>
$ sbatch -p compute serial_job.sh
</pre>
Note that sbatch command line options override script options.
 
 

==== Multithreaded Programs ====

Multithreaded programs run their processes on multiple threads and share resources such as memory. 
You may use a program that includes a built-in option for multithreading (e.g., options like <code>--threads</code>). 
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable <code>OMP_NUM_THREADS</code>. By default, this variable is set to 1 (<code>OMP_NUM_THREADS=1</code>).

'''Important:''' Hyperthreading is activated on bwForCluster BinAC 2. Hyperthreading can be beneficial for some applications and codes, but it can also degrade performance in other cases. We therefore recommend running a small test job with and without hyperthreading to determine the best choice.

'''a) Program with built-in multithreading option'''

This example uses the common bioinformatics software <code>samtools</code> to demonstrate built-in multithreading.

The module <code>bio/samtools/1.21</code> provides an example jobscript that requests 4 CPUs and runs <code>samtools sort</code> with 4 threads.

<pre>
#!/bin/bash

#SBATCH --time=19:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5000m
#SBATCH --partition compute
[...]
samtools sort -@ 4 sample.bam -o sample.sorted.bam
</pre>

You can use the example jobscript with this command

<pre>
sbatch /opt/bwhpc/common/bio/samtools/1.21/bwhpc-examples/binac2-samtools-1.21-bwhpc-examples.slurm
</pre>

'''b) OpenMP'''

We will run an example OpenMP Hello-World program. The jobscript looks like this:

<pre>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=1:00
#SBATCH --mem=5000m
#SBATCH -J OpenMP-Hello-World

export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))

echo "Executable running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"

# Run parallel Hello World
/pfs/10/project/examples/openmp_hello_world
</pre>

Submit the job to the <code>compute</code> partition and get the output (in the stdout-file)

<pre>
sbatch --partition=compute /pfs/10/project/examples/openmp_hello_world.sh

Executable running on 4 cores with 4 threads
Hello from process: 0
Hello from process: 2
Hello from process: 1
Hello from process: 3
</pre>

==== OpenMPI ====

If you want to run MPI-jobs on batch nodes, generate a wrapper script <code>mpi_hello_world.sh</code> for '''OpenMPI''' containing the following lines:

<source lang="bash">
#!/bin/bash

#SBATCH --partition compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2000
#SBATCH --time=05:00

# Load the MPI implementation of your choice
module load mpi/openmpi/4.1-gnu-14.2

# Run your MPI program
mpirun --bind-to core --map-by core --report-bindings mpi_hello_world
</source>

'''Attention:''' Do '''NOT''' add mpirun options <code>-n <number_of_processes></code> or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.

'''ALWAYS''' use the mpirun options <code>--bind-to core</code> and <code>--map-by core|socket|node</code>.
Type <code>man mpirun</code> for an explanation of the different values of the mpirun option <code>--map-by</code>.

The above jobscript runs four OpenMPI tasks, distributed between two nodes. Because of hyperthreading you have to set <code>--cpus-per-task=2</code>. This means each MPI-task will get one physical core. If you omit <code>--cpus-per-task=2</code> MPI will fail.

'''Attention:''' Not all compute nodes are connected via Infiniband. Tell Slurm you want Infiniband via <code>--constraint=ib</code> when submitting or add <code>#SBATCH --constraint=ib</code> to your jobscript.

<pre>
$ sbatch --constraint=ib /pfs/10/project/examples/mpi_hello_world.sh
</pre>

This will run a simple Hello World program:

<pre>
[...]
Hello world from processor node2-031, rank 3 out of 4 processors
Hello world from processor node2-031, rank 2 out of 4 processors
Hello world from processor node2-030, rank 1 out of 4 processors
Hello world from processor node2-030, rank 0 out of 4 processors

</pre>

==== Multithreaded + MPI parallel Programs ====

Multithreaded + MPI parallel programs run faster than serial programs on systems with multiple CPUs and multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spread over different nodes. '''Because hyperthreading is switched on on BinAC 2, the option --cpus-per-task (-c) must be set to 2*n if you want to use n threads.'''
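The factor of two can be written down as a pair of small conversions. This is a sketch with hypothetical helper names; it assumes exactly two logical CPUs per physical core, as stated for BinAC 2.

```shell
#!/bin/bash
# With hyperthreading, Slurm counts two logical CPUs per physical core.
# Hypothetical helpers for converting between threads and --cpus-per-task.
cpus_per_task_for_threads() {
    echo $(( $1 * 2 ))
}

threads_for_cpus_per_task() {
    echo $(( $1 / 2 ))
}

cpus_per_task_for_threads 28   # prints 56
threads_for_cpus_per_task 56   # prints 28
```

Inside a job script, the second conversion corresponds to setting <code>OMP_NUM_THREADS</code> to half of <code>SLURM_CPUS_PER_TASK</code>.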
 
 
===== OpenMPI with Multithreading =====
Multiple MPI tasks using '''OpenMPI''' must be launched by the MPI parallel program '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default, this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''For OpenMPI''', a job script ''job_ompi_omp.sh'' that runs the MPI program ''ompi_omp_program'' with 4 tasks and 28 threads per task, requiring 3000 MByte of physical memory per thread (28 threads per MPI task give 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours, looks like:

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=56
#SBATCH --time=03:00:00
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up to 83 GB
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Execute the script '''job_ompi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p compute ./job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
* With the option ''--map-by node:PE=<value>'' (neighbored) MPI tasks will be attached to different nodes and each MPI task is bound to the first core of a node. <value> must be set to ${OMP_NUM_THREADS}.
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program.
 

==== GPU jobs ====

The nodes in the <code>gpu</code> queue have 2 or 4 NVIDIA A30/A100/H200 GPUs. Submitting a job to this queue is not enough to allocate one or more GPUs; you also have to request them with the "--gres=gpu" parameter and specify how many GPUs your job needs, e.g. "--gres=gpu:a30:2" requests two NVIDIA A30 GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. Each individual GPU is always bound to a single job and is not shared between different jobs.

a) add after the initial line of your script job.sh the line including the
information about the GPU usage: #SBATCH --gres=gpu:a30:2
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:a30:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:a30:2 job.sh
</pre>
 
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Sun Mar 29 15:20:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:3A:00.0 Off | 0 |
| N/A 29C P0 39W / 300W | 9MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 8MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 14228 G /usr/bin/X 8MiB |
| 1 14228 G /usr/bin/X 8MiB |
+-----------------------------------------------------------------------------+
</pre>

Upon successful GPU resource allocation, SLURM will set the environment variable <code>CUDA_VISIBLE_DEVICES</code> appropriately. Do not change this variable!
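Reading the variable is harmless and can serve as a quick sanity check at the start of a job script. The sketch below counts the listed devices; the helper name is hypothetical, and it assumes that the variable holds a comma-separated list of device identifiers.

```shell
#!/bin/bash
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
# Assumption: the variable holds a comma-separated list of device IDs.
# The variable is only read, never modified.
count_visible_gpus() {
    local devices=${CUDA_VISIBLE_DEVICES:-}
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    local IFS=,
    # Split the comma-separated list into positional parameters and count them.
    set -- $devices
    echo $#
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```

For a job submitted with <code>--gres=gpu:a30:2</code>, one would expect this to report two devices.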

 
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
However, there may be warnings, e.g. when running
<pre>
$ module load compiler/gnu/10.3 mpi/openmpi devel/cuda
$ mpirun -np 2 ./mpi_cuda_app
--------------------------------------
WARNING: There are more than one active ports on host 'uc2n520', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
</pre>

Please run Open MPI's mpirun using the following command:
<pre>
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
</pre>
or disable the (older) Byte Transfer Layer (BTL) communication layer altogether:
<pre>
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
</pre>

(Please note that, as of v12.8, CUDA is officially supported only with GCC versions up to GCC 11.)
 
 

== Start time of job or resources : squeue --start ==
This command displays the estimated start time of a job, based on several types of analysis: historical usage, earliest available reservable resources, and priority-based backlog. The squeue command is explained in detail at https://slurm.schedmd.com/squeue.html or in its man page (man squeue).
 
 
=== Access ===
By default, this command can be run by '''any user'''.
 
 

== List of your submitted jobs : squeue ==
Displays information about your own active, pending, and/or recently completed jobs. The squeue command is explained in detail at https://slurm.schedmd.com/squeue.html or in its man page (man squeue).
 
 
=== Access ===
By default, this command can be run by any user.
 
 

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -l, --long
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.
|}
 

=== Examples ===
''squeue'' example on BinAC 2 (only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18088744 single CPV.sbat ab1234 PD 0:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PD 0:00 2 (Priority)
18090089 multiple CPV.sbat ab1234 R 2:27 2 uc2n[127-128]
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
18088654 single CPV.sbat ab1234 COMPLETI 4:29 2:00:00 1 uc2n374
18088785 single CPV.sbat ab1234 PENDING 0:00 2:00:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PENDING 0:00 2:00:00 2 (Priority)
18088683 single CPV.sbat ab1234 RUNNING 0:14 2:00:00 1 uc2n413
</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
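The counts per state can also be computed from the output instead of being read off by eye. The sketch below uses a hypothetical helper that reads default ''squeue'' output on standard input; it assumes the state is in the fifth column (ST) and that job names contain no spaces.

```shell
#!/bin/bash
# Hypothetical helper: count jobs per state from default "squeue" output
# read on standard input. Column 5 (ST) holds the state; line 1 is the header.
# Assumption: no column value (e.g. the job name) contains spaces.
count_jobs_by_state() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample taken from the squeue output shown above.
count_jobs_by_state <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18088744 single CPV.sbat ab1234 PD 0:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PD 0:00 2 (Priority)
18090089 multiple CPV.sbat ab1234 R 2:27 2 uc2n[127-128]
EOF
```

For the sample, this prints <code>PD 2</code> and <code>R 1</code>. On the cluster, the helper could be fed directly with <code>squeue | count_jobs_by_state</code>.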
 

== Shows free resources : sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Access ===
By default, this command can be used by any user or administrator.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_multiple : 8 nodes idle
Partition multiple : 332 nodes idle
Partition dev_single : 4 nodes idle
Partition single : 76 nodes idle
Partition long : 80 nodes idle
Partition fat : 5 nodes idle
Partition dev_special : 342 nodes idle
Partition special : 342 nodes idle
Partition dev_multiple_e: 7 nodes idle
Partition multiple_e : 335 nodes idle
Partition gpu_4 : 12 nodes idle
Partition gpu_8 : 6 nodes idle
</pre>
* In the above example, jobs in all partitions can be run immediately.
 

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 
=== Access ===
* End users can use scontrol show job to view the status of their '''own jobs''' only.
 

=== Arguments ===
{| width=750px class="wikitable"
|-
! Option !! Default !! Description !! Example
|- style="vertical-align:top;"
|- style="width:12%;"
| -d
| (n/a)
| Detailed mode
| Example: Display the state with jobid 18089884 in detailed mode. <pre>scontrol -d show job 18089884</pre>
|}
 
 

=== Scontrol show job Example ===
Here is an example from BinAC 2.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18089884 multiple CPV.sbat bq0742 R 33:44 2 uc2n[165-166]

$
$ # now, see what's up with my running job with jobid 18089884
$
$ scontrol show job 18089884

JobId=18089884 JobName=CPV.sbatch
UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A
Priority=3 Nice=0 Account=kit QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54
AccrueTime=2020-03-16T14:14:54
StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51
Partition=multiple AllocNode:Sid=uc2n995:5064
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc2n[165-166]
BatchHost=uc2n165
NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=160,mem=96320M,node=2,billing=160
Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch
WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin
StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
StdIn=/dev/null
StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
Power=
MailUser=(null) MailType=NONE
</pre>
 
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
* Which state is the job in?
<pre>$ scontrol show job 18089884 | grep -i State
JobState=COMPLETED Reason=None Dependency=(null)
</pre>
 

== Cancel Slurm Jobs ==
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
 
 
=== Canceling own jobs : scancel ===
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

 
{| width=750px class="wikitable"
! Flag !! Default !! Description !! Example
|- style="vertical-align:top;"
| -i, --interactive
| (n/a)
| Interactive mode.
| Cancel the job 987654 interactively. <pre> scancel -i 987654 </pre>
|-
| -t, --state
| (n/a)
| Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
| Cancel all jobs in state "PENDING". <pre> scancel -t "PENDING" </pre>
|}
 

= Resource Managers =
=== Batch Job (Slurm) Variables ===
The following environment variables of Slurm are added to your environment once your job has started
(only an excerpt of the most important ones).
{| width=750px class="wikitable"
! Environment !! Brief explanation
|-
| SLURM_JOB_CPUS_PER_NODE
| Number of CPUs per node dedicated to the job
|-
| SLURM_JOB_NODELIST
| List of nodes dedicated to the job
|-
| SLURM_JOB_NUM_NODES
| Number of nodes dedicated to the job
|-
| SLURM_MEM_PER_NODE
| Memory per node dedicated to the job
|-
| SLURM_NPROCS
| Total number of processes dedicated to the job
|-
| SLURM_CLUSTER_NAME
| Name of the cluster executing the job
|-
| SLURM_CPUS_PER_TASK
| Number of CPUs requested per task
|-
| SLURM_JOB_ACCOUNT
| Account name
|-
| SLURM_JOB_ID
| Job ID
|-
| SLURM_JOB_NAME
| Job Name
|-
| SLURM_JOB_PARTITION
| Partition/queue running the job
|-
| SLURM_JOB_UID
| User ID of the job's owner
|-
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which sbatch was invoked.
|-
| SLURM_JOB_USER
| User name of the job's owner
|-
| SLURM_RESTART_COUNT
| Number of times job has restarted
|-
| SLURM_PROCID
| Task ID (MPI rank)
|-
| SLURM_NTASKS
| The total number of tasks available for the job
|-
| SLURM_STEP_ID
| Job step ID
|-
| SLURM_STEP_NUM_TASKS
| Task count (number of MPI ranks)
|-
| SLURM_JOB_CONSTRAINT
| Job constraints
|}
See also:
* [https://slurm.schedmd.com/sbatch.html#lbAI Slurm input and output environment variables]
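
As a minimal sketch of how these variables might be used inside a job script, the helper below prints a one-line summary of the job. The function name <code>print_job_summary</code> and the fallback text <code>unset</code> are illustrative assumptions and not part of Slurm; outside a running job the variables are not defined, so the fallbacks are printed instead.

<source lang="bash">
#!/bin/bash
# Hypothetical helper (not part of Slurm): summarise the job environment.
# Each variable falls back to "unset" when the script runs outside a job.
print_job_summary() {
    echo "job=${SLURM_JOB_ID:-unset} name=${SLURM_JOB_NAME:-unset} partition=${SLURM_JOB_PARTITION:-unset} nodes=${SLURM_JOB_NUM_NODES:-unset} tasks=${SLURM_NTASKS:-unset}"
}

print_job_summary
</source>

Writing such a line at the top of the job output can make it easier to match an output file to the job that produced it.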
 

=== Job Exit Codes ===
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
 
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
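
The wrap-around into the 0 to 255 range can be illustrated with plain shell arithmetic. The helper name <code>to_unsigned_exit_code</code> is hypothetical; it only mimics the 8-bit truncation described above and is not a Slurm command.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: map any integer onto the 8-bit unsigned range 0..255,
# mimicking how a negative or oversized exit code is reported.
to_unsigned_exit_code() {
    echo $(( $1 & 255 ))
}

to_unsigned_exit_code -1     # prints 255
to_unsigned_exit_code 256    # prints 0
</source>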
 
 
==== Displaying Exit Codes and Signals ====
SLURM displays a job's exit code in the output of the '''scontrol show job''' and the sview utility.
 
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon (:).
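
The following sketch splits such an <code>exit code:signal</code> pair, as it appears in the <code>ExitCode=</code> field of '''scontrol show job''', into a readable message. The function name <code>describe_exit_code</code> and the message texts are assumptions made for illustration only.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: interpret an "<exit code>:<signal>" pair.
describe_exit_code() {
    local pair="$1"
    local code="${pair%%:*}"     # part before the colon
    local signal="${pair##*:}"   # part after the colon
    if [ "$signal" -ne 0 ]; then
        echo "terminated by signal ${signal}"
    elif [ "$code" -ne 0 ]; then
        echo "failed with exit code ${code}"
    else
        echo "completed successfully"
    fi
}

describe_exit_code "0:0"
describe_exit_code "1:0"
describe_exit_code "0:9"
</source>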
 
 
==== Submitting Termination Signal ====
Here is an example of how to 'save' a Slurm termination signal (exit code) in a typical jobscript.
<source lang="bash">
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
exit_code=$?   # read the exit code directly after the command of interest
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
</source>
* Do not use ''''time'''' mpirun! The exit code will be the one submitted by the first (time) program.
* You do not need an '''exit $exit_code''' in the scripts.
 
 
----
[[#top|Back to top]]

BinAC2/Slurm

2026-01-07T08:45:05Z

S Behnle:

= General information about Slurm =

Any calculation on the compute nodes of bwForCluster BinAC 2 must be defined as a '''batch job''': a single command or a sequence of commands, together with the required run time, number of CPU cores, and main memory. You submit the batch job to a resource and workload manager. BinAC 2 uses the workload manager Slurm, so all jobs are submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

= External Slurm documentation =

You can find the official Slurm documentation and some other material here:

* Slurm documentation: https://slurm.schedmd.com/documentation.html
* Slurm cheat sheet: https://slurm.schedmd.com/pdfs/summary.pdf
* Slurm tutorials: https://slurm.schedmd.com/tutorials.html

= SLURM terminology =

SLURM knows and mirrors the division of the cluster into '''nodes''' with several '''cores'''. When queuing '''jobs''', there are several ways of requesting resources and it is important to know which term means what in SLURM. Here are some basic SLURM terms:

;Job
: A job is a self-contained computation that may encompass multiple tasks and is given specific resources like individual CPUs/GPUs, a specific amount of RAM or entire nodes. These resources are said to have been allocated for the job.

;Task
: A task is a single run of a single process. By default, one task is run per node and one CPU is assigned per task.

;Partition
: A partition (usually called queue outside SLURM) is a waiting line in which jobs are put by users.

;Socket
: Receptacle on the motherboard for one physically packaged processor (each of which can contain one or more cores).

;Core
: A complete private set of registers, execution units, and retirement queues needed to execute programs.

;Thread
: One or more hardware contexts within a single core. Each thread has attributes of one core, managed & scheduled as a single logical processor by the OS.

;CPU
: A '''CPU''' in Slurm means a '''single core'''. This is different from the more common terminology, where a CPU (a microprocessor chip) consists of multiple cores. Slurm uses the term '''sockets''' when talking about CPU chips. Depending upon system configuration, a CPU can be either a '''core''' or a '''thread'''. On '''BinAC 2 Hyperthreading is activated on every machine'''. This means that the operating system and Slurm see each physical core as two logical cores.

= Slurm Commands =

{| width=750px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue
|-
| [https://slurm.schedmd.com/salloc.html salloc] || Requests resources for an interactive job
|-
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs
|-
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information
|-
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job
|-
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job
|-
|}

== Interactive Jobs ==

You can run interactive jobs for testing and developing your job scripts.
Several nodes are reserved for interactive work, so your jobs should start right away.
You can only submit one job to this partition at a time. A job can run for up to 10 hours (about one workday).

This example command gives you 16 cores and 128 GB of memory for four hours on one of the reserved nodes:

<pre>
salloc --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb
</pre>

You can also use srun to request the same resources:

<pre>
srun --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb --pty bash
</pre>

== Job Submission : sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
=== sbatch Command Parameters ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. The following table shows the syntax and provides examples for each option.

{| class="wikitable"
! colspan="5" | sbatch Options
|-
! Command line
! Job Script
! Purpose
! Example
! Default value
|- style="vertical-align:top;"
| <code>-t ''time''</code> or <code>--time=''time''</code>
| #SBATCH --time=''time''
| Wall clock time limit. 
| <code>-t 2:30:00</code> Limits run time to 2h 30 min. <code>-t 2-12</code> Limits run time to 2 days and 12 hours.
| Depends on Slurm partition.
|-
|- style="vertical-align:top;"
| -N ''count'' or --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
| <code>-N 1</code> Run job on one node. <code>-N 2</code> Run job on two nodes (have to use MPI!)
|
|- style="vertical-align:top;"
| -n ''count'' or --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
| <code>-n 2</code> launch two tasks in the job.
| One task per node
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node. (Replaces the option <code>ppn</code> of MOAB.)
| <code>--ntasks-per-node=2</code> Run 2 tasks per node
| 1 task per node
|-
|- style="vertical-align:top;"
| -c ''count'' or --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
| <code>-c 2</code> Request two CPUs per (MPI-)task.
| 1 CPU per (MPI-)task
|-
|- style="vertical-align:top;"
| <code>--mem=<size>[units]</code>
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. <code>[units]</code> can be one of <code>[K<nowiki>|</nowiki>M<nowiki>|</nowiki>G<nowiki>|</nowiki>T]</code>.
| <code>--mem=10g</code> Request 10GB RAM per node <code>--mem=0</code> Request all memory on node
| Depends on Slurm configuration. It is better to specify <code>--mem</code> in every case.
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J ''name'' or --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added.
|-
|- style="vertical-align:top;"
| -A ''group-name'' or --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p ''queue-name'' or --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C ''LSDF'' or --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF Filesystems.
|-
|}
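
The time formats accepted by <code>--time</code> (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds, as listed in <code>man sbatch</code>) are easy to misread. As a rough sketch, the helper below converts such a string into seconds so you can check what a given value means. The function name <code>slurm_time_to_seconds</code> is hypothetical and not part of Slurm.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: convert a Slurm time string into seconds.
# Accepted forms: M, M:S, H:M:S, D-H, D-H:M, D-H:M:S
slurm_time_to_seconds() {
    local spec="$1" days=0 h=0 m=0 s=0
    local -a f
    if [[ "$spec" == *-* ]]; then
        # Part before the dash is days, the rest is hours[:minutes[:seconds]]
        days="${spec%%-*}"
        IFS=: read -r -a f <<< "${spec#*-}"
        h="${f[0]:-0}"; m="${f[1]:-0}"; s="${f[2]:-0}"
    else
        IFS=: read -r -a f <<< "$spec"
        case "${#f[@]}" in
            1) m="${f[0]}" ;;
            2) m="${f[0]}"; s="${f[1]}" ;;
            3) h="${f[0]}"; m="${f[1]}"; s="${f[2]}" ;;
        esac
    fi
    # 10# forces base 10 so that fields like "08" are not read as octal
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

slurm_time_to_seconds 2:30:00   # 2 hours 30 minutes
slurm_time_to_seconds 2-12      # 2 days 12 hours
slurm_time_to_seconds 10:00     # 10 minutes
</source>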
 

==== sbatch --partition ''queues'' ====
Queue classes define the maximum resources of each queue of the compute system, such as walltime, nodes, and processes per node. Details can be found here:
* [[BinAC2/SLURM_Partitions|BinAC 2 partitions]]
 

=== sbatch Examples ===

If you are coming from Moab/Torque on BinAC 1 or are new to HPC/Slurm, the <code>sbatch</code> options may be confusing. The following examples show how to run typical workloads on BinAC 2.

You can find every file mentioned on this Wiki page on BinAC 2 at: <code>/pfs/10/project/examples</code>

==== Serial Programs ====
When you use serial programs that use only one process, you can omit most of the <code>sbatch</code> parameters, as the default values are sufficient.

The following example submits a serial job that runs the script <code>serial_job.sh</code> and requires 5000 MB of main memory and 10 minutes of wall clock time. Slurm will allocate one '''physical''' core to your job.

a) execute:
<pre>
$ sbatch -p compute -t 10:00 --mem=5000m serial_job.sh
</pre>
or
b) add after the initial line of your script '''serial_job.sh''' the lines:
<source lang="bash">
#SBATCH --time=10:00
#SBATCH --mem=5000m
#SBATCH --job-name=simple-serial-job
</source>
and execute the modified script with the command line option ''--partition=compute''
<pre>
$ sbatch --partition=compute serial_job.sh
</pre>
Note that sbatch command line options overrule script options.
 
 

==== Multithreaded Programs ====

Multithreaded programs run their processes on multiple threads and share resources such as memory. 
You may use a program that includes a built-in option for multithreading (e.g., options like <code>--threads</code>). 
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) a number of threads is defined by the environment variable <code>OMP_NUM_THREADS</code>. By default, this variable is set to 1 (<code>OMP_NUM_THREADS=1</code>).

'''Important:''' Hyperthreading is activated on bwForCluster BinAC 2. Hyperthreading can be beneficial for some applications and codes, but it can also degrade performance in other cases. We therefore recommend running a small test job with and without hyperthreading to determine the best choice.

'''a) Program with built-in multithreading option'''

This example uses <code>samtools</code>, a common bioinformatics tool, to demonstrate built-in multithreading.

The module <code>bio/samtools/1.21</code> provides an example jobscript that requests 4 CPUs and runs <code>samtools sort</code> with 4 threads.

<pre>
#!/bin/bash

#SBATCH --time=19:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5000m
#SBATCH --partition compute
[...]
samtools sort -@ 4 sample.bam -o sample.sorted.bam
</pre>

You can use the example jobscript with this command

<pre>
sbatch /opt/bwhpc/common/bio/samtools/1.21/bwhpc-examples/binac2-samtools-1.21-bwhpc-examples.slurm
</pre>

'''b) OpenMP'''

We will run an example OpenMP Hello-World program. The jobscript looks like this:

<pre>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=1:00
#SBATCH --mem=5000m
#SBATCH -J OpenMP-Hello-World

export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))

echo "Executable running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"

# Run parallel Hello World
/pfs/10/project/examples/openmp_hello_world
</pre>

Submit the job to the <code>compute</code> partition and get the output (in the stdout-file)

<pre>
sbatch --partition=compute /pfs/10/project/examples/openmp_hello_world.sh

Executable running on 4 cores with 4 threads
Hello from process: 0
Hello from process: 2
Hello from process: 1
Hello from process: 3
</pre>
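
Because Slurm counts logical CPUs while OpenMP threads are usually meant to run on physical cores, the thread count is typically derived by halving the allocated CPU count. The sketch below wraps this in a small helper that never returns less than one thread. The function name <code>threads_per_physical_core</code> is hypothetical, and using <code>SLURM_CPUS_PER_TASK</code> with a fallback of 1 is an assumption for running the snippet outside a job.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: derive the OpenMP thread count from the number of
# logical CPUs Slurm allocated. With hyperthreading, two logical CPUs share
# one physical core, so the count is halved (never dropping below 1).
threads_per_physical_core() {
    local logical_cpus="${1:-1}"
    local threads=$(( logical_cpus / 2 ))
    if (( threads < 1 )); then
        threads=1
    fi
    echo "$threads"
}

# Inside a job SLURM_CPUS_PER_TASK is set when --cpus-per-task was given;
# outside a job this sketch falls back to 1.
export OMP_NUM_THREADS="$(threads_per_physical_core "${SLURM_CPUS_PER_TASK:-1}")"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
</source>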

==== OpenMPI ====

If you want to run MPI-jobs on batch nodes, generate a wrapper script <code>mpi_hello_world.sh</code> for '''OpenMPI''' containing the following lines:

<source lang="bash">
#!/bin/bash

#SBATCH --partition compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2000
#SBATCH --time=05:00

# Load the MPI implementation of your choice
module load mpi/openmpi/4.1-gnu-14.2

# Run your MPI program
mpirun --bind-to core --map-by core --report-bindings mpi_hello_world
</source>

'''Attention:''' Do '''NOT''' add mpirun options <code>-n <number_of_processes></code> or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.

'''ALWAYS''' use the mpirun options <code>--bind-to core</code> and <code>--map-by core|socket|node</code>.
Please type <code>man mpirun</code> for an explanation of the different values of the mpirun option <code>--map-by</code>.

The above jobscript runs four OpenMPI tasks, distributed between two nodes. Because of hyperthreading you have to set <code>--cpus-per-task=2</code>. This means each MPI-task will get one physical core. If you omit <code>--cpus-per-task=2</code> MPI will fail.
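
The resource arithmetic behind the jobscript above can be sketched as follows. The helper name <code>mpi_request_summary</code> is hypothetical, and the calculation assumes that Slurm allocates exactly the CPUs requested and that each physical core provides two logical CPUs.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: summarise what an MPI job script asks for.
# Arguments: nodes, tasks per node, CPUs per task, memory per CPU in MB.
mpi_request_summary() {
    local nodes="$1" tasks_per_node="$2" cpus_per_task="$3" mem_per_cpu="$4"
    local tasks=$(( nodes * tasks_per_node ))
    local cpus=$(( tasks * cpus_per_task ))
    local mem_per_node=$(( tasks_per_node * cpus_per_task * mem_per_cpu ))
    echo "tasks=${tasks} logical_cpus=${cpus} physical_cores=$(( cpus / 2 )) mem_per_node_mb=${mem_per_node}"
}

# Values taken from the jobscript above:
# --nodes=2 --ntasks-per-node=2 --cpus-per-task=2 --mem-per-cpu=2000
mpi_request_summary 2 2 2 2000
</source>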

'''Attention:''' Not all compute nodes are connected via Infiniband. Tell Slurm you want Infiniband via <code>--constraint=ib</code> when submitting or add <code>#SBATCH --constraint=ib</code> to your jobscript.

<pre>
$ sbatch --constraint=ib /pfs/10/project/examples/mpi_hello_world.sh
</pre>

This will run a simple Hello World program:

<pre>
[...]
Hello world from processor node2-031, rank 3 out of 4 processors
Hello world from processor node2-031, rank 2 out of 4 processors
Hello world from processor node2-030, rank 1 out of 4 processors
Hello world from processor node2-030, rank 0 out of 4 processors

</pre>

==== Multithreaded + MPI parallel Programs ====

Multithreaded + MPI parallel programs run faster than serial programs on systems with multiple CPUs and multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned across different nodes. '''Because hyperthreading is switched on on BinAC 2, the option --cpus-per-task (-c) must be set to 2*n if you want to use n threads.'''
 
 
===== OpenMPI with Multithreading =====
Multiple MPI tasks using '''OpenMPI''' must be launched by the MPI parallel program '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default, this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''For OpenMPI''', the following job script ''job_ompi_omp.sh'' submits a batch job that runs the MPI program ''ompi_omp_program'' with 4 tasks and 28 threads per task. The program requires 3000 MByte of physical memory per thread (with 28 threads per MPI task this amounts to 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours:

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=56
#SBATCH --time=03:00:00
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up to 83 GB
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Execute the script '''job_ompi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p compute ./job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
* With the option ''--map-by node:PE=<value>'', neighboring MPI tasks are attached to different nodes and each MPI task is bound to the first core of a node. <value> must be set to ${OMP_NUM_THREADS}.
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program.
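
The 2*n rule and the memory calculation for hybrid jobs can be reproduced with a few lines of shell arithmetic. The helper <code>hybrid_request</code> below is a hypothetical sketch, not a Slurm tool; it assumes two logical CPUs per physical core and rounds the memory up to whole gigabytes.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: derive --cpus-per-task and the memory per MPI task
# from the desired thread count and the memory per thread in MB.
hybrid_request() {
    local threads="$1" mb_per_thread="$2"
    local cpus_per_task=$(( 2 * threads ))                # two logical CPUs per physical core
    local mb_per_task=$(( threads * mb_per_thread ))
    local gb_per_task=$(( (mb_per_task + 1023) / 1024 ))  # round up to whole GB
    echo "cpus_per_task=${cpus_per_task} mb_per_task=${mb_per_task} gb_per_task=${gb_per_task}"
}

# 28 threads with 3000 MB each, as in the example above
hybrid_request 28 3000
</source>

For the example values this yields 56 CPUs per task and 84000 MB, which rounds up to 83 GB, matching the settings used in ''job_ompi_omp.sh''.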
 

==== GPU jobs ====

The nodes in the <code>gpu</code> queue have 2 or 4 NVIDIA A30/A100/H200 GPUs. Submitting a job to this queue is not enough to allocate one or more GPUs; you also have to request them with the "--gres=gpu" parameter. You have to specify how many GPUs your job needs, e.g. "--gres=gpu:a30:2" requests two NVIDIA A30 GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) add after the initial line of your script job.sh the line including the
information about the GPU usage: #SBATCH --gres=gpu:a30:2
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:a30:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:a30:2 job.sh
</pre>
 
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Sun Mar 29 15:20:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:3A:00.0 Off | 0 |
| N/A 29C P0 39W / 300W | 9MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 8MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 14228 G /usr/bin/X 8MiB |
| 1 14228 G /usr/bin/X 8MiB |
+-----------------------------------------------------------------------------+
</pre>

Upon successful GPU resource allocation, SLURM will set the environment variable <code>CUDA_VISIBLE_DEVICES</code> appropriately. Do not change this variable!
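
Reading the variable is safe, and a job script can use it to check how many GPUs it actually received. The sketch below assumes the variable holds a comma-separated list of device indices; the helper name <code>count_visible_gpus</code> is hypothetical.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: count the GPUs listed in a CUDA_VISIBLE_DEVICES-style
# string (comma-separated device indices). It only reads the value.
count_visible_gpus() {
    local devices="$1"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    local -a ids
    IFS=, read -r -a ids <<< "$devices"
    echo "${#ids[@]}"
}

echo "GPUs visible to this job: $(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")"
</source>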

 
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
However, there may be warnings, e.g. when running
<pre>
$ module load compiler/gnu/10.3 mpi/openmpi devel/cuda
$ mpirun -np 2 ./mpi_cuda_app
--------------------------------------
WARNING: There are more than one active ports on host 'uc2n520', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
</pre>

Please run Open MPI's mpirun using the following command:
<pre>
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
</pre>
or disable the (older) communication layer, the Byte Transfer Layer (BTL), altogether:
<pre>
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
</pre>

(Please note that CUDA as of v12.8 is officially supported only with GCC up to version 11.)
 
 

== Start time of job or resources : squeue --start ==
Any user can run this command to display the estimated start time of a job. The estimate is based on several types of analysis: historical usage, earliest available reservable resources, and priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by '''any user'''.
 
 

== List of your submitted jobs : squeue ==
Displays information about YOUR active, pending, and/or recently completed jobs. By default, the command lists all of your own active and pending jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by any user.
 
 

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -l, --long
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.
|}
 

=== Examples ===
''squeue'' example on BinAC 2 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18088744 single CPV.sbat ab1234 PD 0:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PD 0:00 2 (Priority)
18090089 multiple CPV.sbat ab1234 R 2:27 2 uc2n[127-128]
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
18088654 single CPV.sbat ab1234 COMPLETI 4:29 2:00:00 1 uc2n374
18088785 single CPV.sbat ab1234 PENDING 0:00 2:00:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PENDING 0:00 2:00:00 2 (Priority)
18088683 single CPV.sbat ab1234 RUNNING 0:14 2:00:00 1 uc2n413
</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
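
Counting jobs per state by hand becomes tedious with many jobs. As a minimal sketch, the helper below tallies the short state codes from default ''squeue'' output read on standard input. The function name <code>count_jobs_by_state</code> is hypothetical, and the sketch assumes the default column layout shown above with job names that contain no spaces.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: count jobs per state from default "squeue" output
# read on standard input. Column 5 (ST) holds the short state code.
count_jobs_by_state() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# On the cluster: squeue | count_jobs_by_state
# Here the first example output from above is used as sample input.
count_jobs_by_state <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18088744 single CPV.sbat ab1234 PD 0:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PD 0:00 2 (Priority)
18090089 multiple CPV.sbat ab1234 R 2:27 2 uc2n[127-128]
EOF
</source>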
 

== Shows free resources : sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Access ===
By default, this command can be used by any user or administrator.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_multiple : 8 nodes idle
Partition multiple : 332 nodes idle
Partition dev_single : 4 nodes idle
Partition single : 76 nodes idle
Partition long : 80 nodes idle
Partition fat : 5 nodes idle
Partition dev_special : 342 nodes idle
Partition special : 342 nodes idle
Partition dev_multiple_e: 7 nodes idle
Partition multiple_e : 335 nodes idle
Partition gpu_4 : 12 nodes idle
Partition gpu_8 : 6 nodes idle
</pre>
* In the above example, jobs in all partitions can be run immediately.
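
To pick a partition with enough free nodes, the output can be filtered with standard tools. The sketch below assumes the line format shown above (<code>Partition <name> : <count> nodes idle</code>); the helper name <code>partitions_with_idle_nodes</code> is hypothetical.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: list partitions that have at least N idle nodes,
# reading sinfo_t_idle-style lines on standard input.
# The field separator accepts both "name : 8" and "name: 8".
partitions_with_idle_nodes() {
    local min_nodes="${1:-1}"
    awk -F'[ :]+' -v min="$min_nodes" '$1 == "Partition" && $3 + 0 >= min { print $2, $3 }'
}

# On the cluster: sinfo_t_idle | partitions_with_idle_nodes 10
# Here a few lines from the example above are used as sample input.
partitions_with_idle_nodes 7 <<'EOF'
Partition dev_multiple : 8 nodes idle
Partition fat : 5 nodes idle
Partition dev_multiple_e: 7 nodes idle
Partition gpu_4 : 12 nodes idle
EOF
</source>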
 

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 
=== Access ===
* End users can use scontrol show job to view the status of their '''own jobs''' only.
 

=== Arguments ===
{| width=750px class="wikitable"
|-
! Option !! Default !! Description !! Example
|- style="vertical-align:top;"
|- style="width:12%;"
| -d
| (n/a)
| Detailed mode
| Example: Display the state with jobid 18089884 in detailed mode. <pre>scontrol -d show job 18089884</pre>
|}
 
 

=== Scontrol show job Example ===
Here is an example from BinAC 2.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18089884 multiple CPV.sbat bq0742 R 33:44 2 uc2n[165-166]

$
$ # now, see what's up with my running job with jobid 18089884
$
$ scontrol show job 18089884

JobId=18089884 JobName=CPV.sbatch
UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A
Priority=3 Nice=0 Account=kit QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54
AccrueTime=2020-03-16T14:14:54
StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51
Partition=multiple AllocNode:Sid=uc2n995:5064
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc2n[165-166]
BatchHost=uc2n165
NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=160,mem=96320M,node=2,billing=160
Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch
WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin
StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
StdIn=/dev/null
StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
Power=
MailUser=(null) MailType=NONE
</pre>
 
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
* What state is the job in?
<pre>$ scontrol show job 18089884 | grep -i State
JobState=COMPLETED Reason=None Dependency=(null)
</pre>
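Because every field in the output has the form <code>Key=Value</code>, a small helper can extract a single field. The following is a minimal sketch, not an official tool: the function name <code>job_field</code> is made up for this illustration, and it assumes that the value you are looking for contains no spaces.

```shell
#!/bin/bash
# job_field KEY: print the value of the first KEY=value field read from stdin.
# Hypothetical helper for illustration, not part of Slurm.
job_field() {
    local key="$1"
    # Put every whitespace-separated field on its own line,
    # keep the first one that starts with KEY=, and strip the key.
    tr -s ' \t' '\n' | grep -m1 "^${key}=" | cut -d= -f2-
}

# Usage on the cluster (job ID taken from the example above):
#   scontrol show job 18089884 | job_field JobState
```

Compared to a plain <code>grep</code>, this prints only the value, which is convenient when the result is used in a script.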
 

== Cancel Slurm Jobs ==
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
 
 
=== Canceling own jobs : scancel ===
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

 
{| width=750px class="wikitable"
! Flag !! Default !! Description !! Example
|- style="vertical-align:top;"
| -i, --interactive
| (n/a)
| Interactive mode.
| Cancel the job 987654 interactively. <pre> scancel -i 987654 </pre>
|-
| -t, --state
| (n/a)
| Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
| Cancel all jobs in state "PENDING". <pre> scancel -t "PENDING" </pre>
|}
 

= Resource Managers =
=== Batch Job (Slurm) Variables ===
The following environment variables of Slurm are added to your environment once your job has started
(only an excerpt of the most important ones).
{| width=750px class="wikitable"
! Environment !! Brief explanation
|-
| SLURM_JOB_CPUS_PER_NODE
| Number of CPUs per node dedicated to the job
|-
| SLURM_JOB_NODELIST
| List of nodes dedicated to the job
|-
| SLURM_JOB_NUM_NODES
| Number of nodes dedicated to the job
|-
| SLURM_MEM_PER_NODE
| Memory per node dedicated to the job
|-
| SLURM_NPROCS
| Total number of processes dedicated to the job
|-
| SLURM_CLUSTER_NAME
| Name of the cluster executing the job
|-
| SLURM_CPUS_PER_TASK
| Number of CPUs requested per task
|-
| SLURM_JOB_ACCOUNT
| Account name
|-
| SLURM_JOB_ID
| Job ID
|-
| SLURM_JOB_NAME
| Job Name
|-
| SLURM_JOB_PARTITION
| Partition/queue running the job
|-
| SLURM_JOB_UID
| User ID of the job's owner
|-
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which sbatch was invoked.
|-
| SLURM_JOB_USER
| User name of the job's owner
|-
| SLURM_RESTART_COUNT
| Number of times job has restarted
|-
| SLURM_PROCID
| Task ID (MPI rank)
|-
| SLURM_NTASKS
| The total number of tasks available for the job
|-
| SLURM_STEP_ID
| Job step ID
|-
| SLURM_STEP_NUM_TASKS
| Task count (number of MPI ranks)
|-
| SLURM_JOB_CONSTRAINT
| Job constraints
|}
See also:
* [https://slurm.schedmd.com/sbatch.html#lbAI Slurm input and output environment variables]
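As a minimal sketch of how these variables can be used, the following snippet prints a short job summary at the start of a job script. The function name <code>job_summary</code> is made up for this illustration; outside of a running job the variables are unset, so a placeholder is printed instead.

```shell
#!/bin/bash
# Print a short summary of the job environment.
# Outside of a Slurm job the variables are unset, so "n/a" is shown instead.
job_summary() {
    echo "Job ID:      ${SLURM_JOB_ID:-n/a}"
    echo "Job name:    ${SLURM_JOB_NAME:-n/a}"
    echo "Partition:   ${SLURM_JOB_PARTITION:-n/a}"
    echo "Nodes:       ${SLURM_JOB_NUM_NODES:-n/a} (${SLURM_JOB_NODELIST:-n/a})"
    echo "Tasks:       ${SLURM_NTASKS:-n/a}"
    echo "Submit dir:  ${SLURM_SUBMIT_DIR:-n/a}"
}

job_summary
```

Writing such a summary to the job's output file makes it easier to reconstruct later where and how a job ran.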
 

=== Job Exit Codes ===
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
 
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
 
 
==== Displaying Exit Codes and Signals ====
SLURM displays a job's exit code in the output of the '''scontrol show job''' and the sview utility.
 
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon(:).
 
 
==== Submitting Termination Signal ====
Here is an example of how to save the exit code of your program (and with it a possible termination signal) in a typical job script.
<source lang="bash">
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
exit_code=$?   # save the exit code immediately after the program has finished
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
</source>
* Do not use ''''time'''' mpirun! The exit code will be the one submitted by the first (time) program.
* You do not need an '''exit $exit_code''' in the scripts.
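The saved exit code can also hint at a termination signal: by shell convention, a command killed by signal N is usually reported with exit code 128 + N. The following is a minimal sketch under that assumption; the function name <code>describe_exit_code</code> is hypothetical, and a program may also return a value above 128 on its own.

```shell
#!/bin/bash
# describe_exit_code CODE: explain an exit code that was saved with exit_code=$?.
# Hypothetical helper; assumes the shell convention that codes above 128
# mean "terminated by signal (CODE - 128)".
describe_exit_code() {
    local code="$1"
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -gt 128 ]; then
        echo "terminated by signal $((code - 128))"
    else
        echo "failed with exit code ${code}"
    fi
}

# Example: run a command, save its exit code right away, then report it.
exit_code=0
sh -c 'exit 3' || exit_code=$?
describe_exit_code "$exit_code"
```

For example, an exit code of 137 corresponds to signal 9 (SIGKILL), and 143 to signal 15 (SIGTERM).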
 
 
----
[[#top|Back to top]]

BinAC2/Hardware and Architecture

2025-12-19T18:09:21Z

S Behnle: More on Lustre (3)

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

[[File:Binac2 schema.png|600px|thumb|center|Overview on the BinAC 2 hardware architecture.]]

== Operating System and Software ==

* Operating System: Rocky Linux 9.6
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP (high-mem) nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gb Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; only 84 standard compute nodes are connected via HDR InfiniBand (100 Gb/s). To get your jobs onto the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC2 has two (almost) separate networks, a 100GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
Concerning the network cards for the nodes, however, there exist VPI network cards which can be configured to work in either mode (https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications).
OpenMPI can use a number of layers for transferring data and messages between processes. When it starts up, it tests all means of communication that were configured during compilation and then tries to figure out the fastest path between all processes.
If OpenMPI encounters such a VPI card, it will first try to establish a Remote Direct Memory Access (RDMA) communication channel using the OpenFabrics Interfaces (OFI) layer.
On nodes with 100Gb Ethernet, this fails as there is no RDMA protocol configured. OpenMPI will fall back to TCP transport, but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the Infiniband partition (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on Infiniband nodes as this will be the best choice between nodes!''
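If the same job script is used both with and without <code>--constraint=ib</code>, the workaround can be applied conditionally. The following is a minimal sketch, not an official recipe: the function name <code>configure_ompi_transport</code> is made up, and it assumes that <code>SLURM_JOB_CONSTRAINT</code> (listed among the Slurm job variables) is set in your job and holds the value passed to <code>--constraint</code>. If it is not set on your system, pass the constraint as an argument.

```shell
#!/bin/bash
# configure_ompi_transport [CONSTRAINT]: disable the OFI/openib layers unless
# the job requested the "ib" constraint.
# Hypothetical helper; assumes SLURM_JOB_CONSTRAINT holds the --constraint value.
configure_ompi_transport() {
    local constraint="${1:-${SLURM_JOB_CONSTRAINT:-}}"
    # Turn the operators & and | into commas so single features can be matched.
    local normalized=",${constraint//[&|]/,},"
    case "$normalized" in
        *,ib,*)
            # InfiniBand nodes: leave Open MPI's transport selection untouched.
            echo "InfiniBand job: keeping OFI enabled"
            ;;
        *)
            # Ethernet nodes: apply the workaround described above.
            export OMPI_MCA_btl="^ofi,openib"
            export OMPI_MCA_mtl="^ofi"
            echo "Ethernet job: OFI and openib disabled"
            ;;
    esac
}

configure_ompi_transport
```

Call the function before <code>mpirun</code> so that the exported variables are visible to Open MPI.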

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work space is a parallel file system (PFS) that offers fast, parallel file access and more capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible for the user who created the workspace.

Additionally, each compute node provides fast temporary storage on a node-local solid state disk (SSD), accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | File System Type
| NFS
| Lustre
| Lustre
| XFS
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited we enforce a quota of 40GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O intense multinode-jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check which project(s) you are a member of via:

<syntaxhighlight>
$ id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
<code>
/pfs/10/project/bw16f003/
</code>
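If you are a member of several projects, the same pattern can be used to list all of your project directories. This is a minimal sketch; the function name <code>project_dirs</code> is made up for this illustration, and it assumes that all project groups start with <code>bw</code>, as in the command above.

```shell
#!/bin/bash
# project_dirs: read the output of `id` from stdin and print one project
# directory per group whose name starts with "bw" (the pattern used above).
# Hypothetical helper for illustration.
project_dirs() {
    grep -o 'bw[^)]*' | sort -u | while read -r project; do
        echo "/pfs/10/project/${project}/"
    done
}

# Usage on the cluster:
#   id "$USER" | project_dirs
```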

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user should create workspaces at <code>/pfs/10/work</code> through the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first or use the <code>--delete-data</code> option.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.
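A typical pattern is to stage input data to the local scratch directory, run the program there, and copy the results back at the end of the job. The following is a minimal sketch of that pattern; the function name <code>run_in_scratch</code> and the <code>input/</code> and <code>output/</code> subdirectories are made up for this illustration. Outside of a batch job <code>$TMPDIR</code> may be unset, so <code>/tmp</code> is used as a fallback.

```shell
#!/bin/bash
# run_in_scratch INPUT_DIR RESULT_DIR COMMAND...: stage input to node-local
# scratch, run COMMAND there, and copy the results back.
# Hypothetical helper; falls back to /tmp if $TMPDIR is not set.
run_in_scratch() {
    local input_dir="$1" result_dir="$2"
    shift 2
    local scratch
    scratch="$(mktemp -d "${TMPDIR:-/tmp}/stage.XXXXXX")" || return 1
    mkdir -p "$scratch/input" "$scratch/output" "$result_dir"

    # Stage in: copy the input once so that repeated reads hit the local disk.
    cp -a "$input_dir/." "$scratch/input/"

    # Run the command inside the scratch directory and remember its exit code.
    local status=0
    ( cd "$scratch" && "$@" ) || status=$?

    # Stage out whatever was written to output/, then clean up.
    cp -a "$scratch/output/." "$result_dir/"
    rm -rf "$scratch"
    return "$status"
}

# Usage in a job script (paths and program name are placeholders):
#   run_in_scratch "$(ws_find mywork)/input" "$(ws_find mywork)/results" ./my_program
```

The command is expected to read from <code>input/</code> and write to <code>output/</code> relative to its working directory.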

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben (storage project), the export to BinAC 2 must first be enabled by the SDS@hd team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

=== More Details on the Lustre File System ===
[https://www.lustre.org/ Lustre] is a distributed parallel file system.
* The entire logical volume as presented to the user is formed by multiple physical or local drives. Data is distributed over more than one physical or logical volume/hard drive, so a single file can be larger than the capacity of a single hard drive.
* The file system can be mounted from all nodes ("clients") in parallel at the same time for reading and writing. This also means that technically you can write to the same file from two different compute nodes! Usually, this will create an unpredictable mess! Never ever do this unless you know exactly what you are doing!
* On a single server or client, the bandwidth of multiple network interfaces can be aggregated to increase the throughput ("multi-rail").

Lustre works by chopping files into many small parts ("stripes", file objects) which are then stored on the object storage servers. The information which part of the file is stored where on which object storage server, when it was changed last etc. and the entire directory structure is stored on the metadata servers. Think of the entries on the metadata server as being pointers pointing to the actual file objects on the object storage servers.
A Lustre file system can consist of many metadata servers (MDS) and object storage servers (OSS).
Each MDS or OSS can in turn hold one or more so-called metadata targets (MDT) or object storage targets (OST), respectively, which can e.g. simply be multiple hard drives.
The capacity of a Lustre file system can hence be easily scaled by adding more servers.

==== Useful Lustre Commands ====
Commands specific to the Lustre file system are divided into user commands (<code>lfs ...</code>) and administrative commands (<code>lctl ...</code>). On BinAC2, users may only execute user commands, and not all of those.
* <code>lfs help <command></code>: Print built-in help for command; Alternative: <code>man lfs <command></code>
* <code>lfs find</code>: Drop-in replacement for the <code>find</code> command, much faster on Lustre filesystems as it directly talks to the metadata server
* <code>lfs --list-commands</code>: Print a list of available commands

==== Moving data between WORK and PROJECT ====
!! IMPORTANT !! Calling <code>mv</code> on files will not physically move them between the fast and the slow pool of the file system. When using <code>mv</code> within the same file system, Lustre only renames the files and makes them available under a different path, i.e. only the file metadata stored on the MDS is modified. The stripes of the file on the OSS remain exactly where they were, so you end up with metadata entries under <code>/pfs/10/project</code> that still point to WORK OSTs. This only changes if you either create a copy of the file at a different path (e.g. with <code>cp</code> or <code>rsync</code>) or explicitly instruct Lustre to move the actual file objects to another storage location, e.g. another pool of the same file system.

Proper ways of moving data between the pools
* Copy the data, which will create new files, then delete the old files. Example:
<pre>
$> cp -ar /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
</pre>
* Alternative to copy: use <code>rsync</code> to copy data between the workspace and the project directories. Example:
<pre>
$> rsync -av /pfs/10/work/tu_abcde01-my-precious-ws/simulation/output /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/simulation25/
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
</pre>
* If there are many subfolders with similar size, you can use <code>xargs</code> to copy them in parallel:
<pre>
$> find . -maxdepth 1 -mindepth 1 -type d -print | xargs -P4 -I{} rsync -aHAXW --inplace --update {} /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/simulation25/
</pre>
will launch four parallel <code>rsync</code> processes at a time, each will copy one of the subdirectories.
* First move the metadata with <code>mv</code>, then use <code>lfs migrate</code> or the wrapper <code>lfs_migrate</code> to actually migrate the file stripes. This is also a possible resolution if you already <code>mv</code>ed data from <code>work</code> to <code>project</code> or vice versa.
** <code>lfs migrate</code> is the raw Lustre command. It can only operate on one file at a time, but offers access to all options.
** <code>lfs_migrate</code> is a versatile wrapper script that can work on single files or recursively on entire directories. If available, it will try to use <code>lfs migrate</code>, otherwise it will fall back to <code>rsync</code> (see <code>lfs_migrate --help</code> for all options.)
Example with <code>lfs migrate</code>:
<pre>
$> mv /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> cd /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> lfs find . -type f --pool work -0 | xargs -0 lfs migrate --pool project # find all files whose file objects are on the work pool and migrate the objects to the project pool
$> ws_release --delete-data my-precious-ws
</pre>
Example with <code>lfs_migrate</code>:
<pre>
$> mv /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> cd /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> lfs_migrate --yes -q -p project * # migrate all file objects in the current directory to the project pool, be quiet (-q) and do not ask for confirmation (--yes)
$> ws_release --delete-data my-precious-ws
</pre>
Both migration commands can also be combined with options to restripe the files during migration, i.e. you can also change the number of OSTs the file is striped over, the size of a single stripe etc.
Attention! Both <code>lfs migrate</code> and <code>lfs_migrate</code> will not change the path of the file(s), you must also <code>mv</code> them! If used without <code>mv</code>, the files will still belong to the workspace although their file object stripes are now on the <code>project</code> pool and a subsequent <code>rm</code> in the workspace will wipe them.

All of the above procedures may take a considerable amount of time depending on the amount of data, so it might be advisable to execute them in a terminal multiplexer like <code>screen</code> or <code>tmux</code> or wrap them into small SLURM jobs with <code>sbatch --wrap="<command>"</code>.
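The copy-based examples above delete the source files unconditionally. As a minimal sketch of a more defensive variant, the following helper removes the originals only if the copy finished without errors. The function name <code>copy_then_delete</code> is made up for this illustration; it uses plain <code>cp</code>, so it creates new files at the destination, as required for moving data between pools.

```shell
#!/bin/bash
# copy_then_delete SRC_DIR DST_DIR: copy the contents of SRC_DIR to DST_DIR and
# remove the originals only if the copy succeeded.
# Hypothetical helper; it creates new files at the destination, as cp does.
copy_then_delete() {
    local src="$1" dst="$2"
    mkdir -p "$dst" || return 1
    if cp -a "$src/." "$dst/"; then
        # Delete the contents of SRC_DIR (including hidden files),
        # but keep the directory itself.
        find "$src" -mindepth 1 -delete
    else
        echo "copy failed, nothing was deleted in $src" >&2
        return 1
    fi
}

# Usage on the cluster (names taken from the examples above):
#   copy_then_delete /pfs/10/work/tu_abcde01-my-precious-ws \
#       /pfs/10/project/bw10a001/tu_abcde01/my-precious-research
#   ws_release --delete-data my-precious-ws
```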

Question: I have completely lost track. How do I find out where my files are located?

Answer:
* Use <code>lfs find</code> to find files on a specific pool. Example:
<pre>
$> lfs find . --pool project # recursively find all files in the current directory whose file objects are on the "project" pool
</pre>
* Use <code>lfs getstripe</code> to query the striping pattern and the pool (also works recursively if called with a directory). Example:
<pre>
$> lfs getstripe parameter.h
parameter.h
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 1
lmm_stripe_offset: 44
lmm_pool: project
obdidx objid objid group
44 7991938 0x79f282 0xd80000400
</pre>
shows that the file is striped over OST 44 (obdidx) which belongs to pool project (lmm_pool).

Why paths and storage pools should match:
There are four different possible scenarios with two subdirectories and two pools:
* File path in <code>/pfs/10/work</code>, file objects on pool <code>work</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>project</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>work</code>: bad. This will "leak" storage from the fast pool, making it unavailable for workspaces.
* File path in <code>/pfs/10/work</code>, file objects on pool <code>project</code>: bad. Access will be slow, and if (volatile) workspaces are purged, data residing on <code>project</code> will (voluntarily or involuntarily) be deleted.
The latter two situations may arise from <code>mv</code>ing data between workspaces and project folders.
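The four scenarios can be checked mechanically by comparing a file's path with the pool reported by <code>lfs getstripe</code>. The following is a minimal sketch; the function names <code>pool_of</code> and <code>placement_status</code> are made up for this illustration, and the parsing assumes the <code>lmm_pool:</code> line shown in the example output above.

```shell
#!/bin/bash
# pool_of: read `lfs getstripe FILE` output from stdin and print the pool name.
# Hypothetical helper; relies on the "lmm_pool:" line of the output.
pool_of() {
    awk '$1 == "lmm_pool:" { print $2; exit }'
}

# placement_status PATH POOL: report whether path and pool match.
# Hypothetical helper; paths and pool names are the ones used on this page.
placement_status() {
    local path="$1" pool="$2" expected
    case "$path" in
        /pfs/10/work/*)    expected="work" ;;
        /pfs/10/project/*) expected="project" ;;
        *) echo "unknown: not below /pfs/10/work or /pfs/10/project"; return 0 ;;
    esac
    if [ "$pool" = "$expected" ]; then
        echo "good"
    else
        echo "bad: path belongs to ${expected}, file objects are on pool '${pool}'"
    fi
}

# Usage on the cluster:
#   f=/pfs/10/project/bw16f003/parameter.h
#   placement_status "$f" "$(lfs getstripe "$f" | pool_of)"
```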

==== More on data striping and how to influence it ====
!! The default striping patterns on BinAC2 are set for good reasons and should not be changed lightly! Getting this wrong will, in the best case, only hurt your own performance. In the worst case, it will also hurt all other users and endanger the stability of the cluster. Please talk to the admins first if you think that you need a non-default pattern.
* Reading striping patterns with <code>lfs getstripe</code>
* Setting striping patterns with <code>lfs setstripe</code> for new files and directories
* Restriping files with <code>lfs migrate</code>
* Progressive File Layout

==== Architecture of BinAC2's Lustre File System ====
Metadata Servers:
* 2 metadata servers
* 1 MDT per server
* MDT Capacity: 31TB, hardware RAID6 on NVMe drives (flash memory/SSD)
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

Object storage servers:
* 8 object storage servers
* 2 fast OSTs per server
** 70 TB per OST, software RAID (raid-z2, 10+2 redundancy)
** NVMe drives, directly attached to the PCIe bus
* 8 slow OSTs per server
** 143 TB per OST, hardware RAID (RAID6, 8+2 redundancy)
** externally attached via SAS
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

* All fast OSTs are assigned to the pool <code>work</code>
* All slow OSTs are assigned to the pool <code>project</code>
* All files that are created under <code>/pfs/10/work</code> are by default stored on the fast pool
* All files that are created under <code>/pfs/10/project</code> are by default stored on the slow pool
* Metadata is distributed over both MDTs. All subdirectories of a directory (workspace or project folder) are typically on the same MDT. Directory striping/placement on MDTs cannot be influenced by users.
* Default OST striping: Stripes have size 1 MiB. Files are striped over one OST if possible, i.e. all stripes of a file are on the same OST. New files are created on the most empty OST.
Internally, the slow and the fast pool belong to the same Lustre file system and namespace.

More reading:
* [https://doc.lustre.org/lustre_manual.xhtml The Lustre 2.X Manual] ([http://doc.lustre.org/lustre_manual.pdf PDF])
* [https://wiki.lustre.org/Main_Page The Lustre Wiki]

BinAC2/Hardware and Architecture

2025-12-19T17:57:00Z

S Behnle: More on Lustre (2)

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

[[File:Binac2 schema.png|600px|thumb|center|Overview on the BinAC 2 hardware architecture.]]

== Operating System and Software ==

* Operating System: Rocky Linux 9.6
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP node
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Milan 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hypertreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100GbE ethernet
In contrast to BinAC 1 not all compute nodes are connected via Infiniband, but there are 84 standard compute nodes connected via HDR Infiniband (100 GbE). In order to get your jobs onto the Infiniband nodes, submit your job with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC2 has two (almost) separate networks, a 100GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
For the network cards in the nodes, however, there are VPI network cards that can be configured to work in either mode (https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications).
OpenMPI can use a number of layers for transferring data and messages between processes. When it starts up, it tests all means of communication that were configured during compilation and then tries to figure out the fastest path between all processes.
If OpenMPI encounters such a VPI card, it first tries to establish a Remote Direct Memory Access (RDMA) communication channel using the OpenFabrics (OFI) layer.
On nodes with 100Gb Ethernet, this fails because no RDMA protocol is configured. OpenMPI then falls back to TCP transport, but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the Infiniband partition (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on Infiniband nodes as this will be the best choice between nodes!''
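
As a rough sketch, the workaround can be wrapped in a small shell function so that the same job script works with and without <code>--constraint=ib</code>. The function name <code>configure_ompi_transport</code> and the variable <code>USE_IB</code> are hypothetical and not part of BinAC 2 or OpenMPI; only the two <code>OMPI_MCA_*</code> settings are taken from the workaround above.

```shell
#!/bin/bash
# Hypothetical helper: apply the workaround only when the job does NOT run on InfiniBand nodes.
configure_ompi_transport() {
    local use_ib="${1:-no}"
    if [ "$use_ib" = "yes" ]; then
        # InfiniBand job: leave OpenMPI's defaults untouched.
        unset OMPI_MCA_btl OMPI_MCA_mtl
    else
        # Ethernet-only or single-node job: disable the OFI/openib transports.
        export OMPI_MCA_btl="^ofi,openib"
        export OMPI_MCA_mtl="^ofi"
    fi
}

# USE_IB is an assumed variable that you set yourself, e.g. USE_IB=yes
# together with "#SBATCH --constraint=ib".
configure_ompi_transport "${USE_IB:-no}"
```

Keeping the decision in one place makes it harder to disable the OFI layer on InfiniBand nodes by accident.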

= File Systems =

The bwForCluster BinAC 2 has two separate storage systems: one for the user's home directory $HOME and one serving as project and work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) based on Lustre. It offers fast, parallel file access from many nodes and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local SSD, available via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | File System Type
| NFS
| Lustre
| Lustre
| XFS
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead, they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check the project(s) you are a member of via:

<syntaxhighlight>
$ id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
<code>
/pfs/10/project/bw16f003/
</code>
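
If you are a member of several projects, the lookup above can be turned into a list of directories. The following is a minimal sketch; the function name <code>list_project_dirs</code> is hypothetical, and it assumes that project groups start with <code>bw</code> as in the example above and that no other group or user name contains that prefix.

```shell
#!/bin/bash
# Print one project directory per project group found in the output of "id".
# Assumption: project groups start with "bw", as in the example above.
list_project_dirs() {
    local id_output="$1"
    echo "$id_output" | grep -o 'bw[^)]*' | sort -u | while read -r project; do
        echo "/pfs/10/project/${project}/"
    done
}

# Show the project directories of the current user.
list_project_dirs "$(id)"
```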

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1, we enforce the work space lifetime because the capacity is limited.
We ask you to store only data that you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you no longer need it on the fast storage.

Each user should create workspaces at <code>/pfs/10/work</code> through the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space, you need to supply a name for it and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first or use the <code>--delete-data</code> option.
|-
|}
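
In a job script it can help to resolve the workspace path once and to stop early if the workspace no longer exists. The following is a sketch under assumptions: the helper <code>workspace_path</code> and the variable <code>WS</code> are hypothetical, and <code>mywork</code> is the example name from the table above. Only <code>ws_find</code> and <code>ws_allocate</code> are actual workspace tools.

```shell
#!/bin/bash
# Resolve a workspace name to its path and fail early if it does not exist.
# "mywork" is the example workspace name from the table above.
workspace_path() {
    local name="$1"
    local path
    path="$(ws_find "$name")" || return 1
    # Treat an empty result or a missing directory as an error.
    [ -n "$path" ] && [ -d "$path" ] || return 1
    echo "$path"
}

if WS="$(workspace_path mywork)"; then
    echo "Using workspace at $WS"
else
    echo "Workspace 'mywork' not found; create it with: ws_allocate mywork 30" >&2
fi
```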

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

The SMP nodes and the GPU nodes in particular are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1 MiB), or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer lower latency and higher bandwidth than <code>WORK</code>.

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

=== More Details on the Lustre File System ===
[https://www.lustre.org/ Lustre] is a distributed parallel file system.
* The entire logical volume as presented to the user is formed by multiple physical or local drives. Data is distributed over more than one physical or logical volume/hard drive, single files can be larger than the capacity of a single hard drive.
* The file system can be mounted from all nodes ("clients") in parallel at the same time for reading and writing. This also means that technically you can write to the same file from two different compute nodes! Usually, this will create an unpredictable mess! Never ever do this unless you know exactly what you are doing!
* On a single server or client, the bandwidth of multiple network interfaces can be aggregated to increase the throughput ("multi-rail").

Lustre works by chopping files into many small parts ("stripes", file objects), which are then stored on the object storage servers. The information about which part of a file is stored on which object storage server, when it was last changed, etc., as well as the entire directory structure, is stored on the metadata servers. Think of the entries on the metadata server as pointers to the actual file objects on the object storage servers.
A Lustre file system can consist of many metadata servers (MDS) and object storage servers (OSS).
Each MDS or OSS can in turn hold one or more metadata targets (MDT) or object storage targets (OST), respectively, which can e.g. simply be multiple hard drives.
The capacity of a Lustre file system can hence be scaled easily by adding more servers.

==== Useful Lustre Commands ====
Commands specific to the Lustre file system are divided into user commands (<code>lfs ...</code>) and administrative commands (<code>lctl ...</code>). On BinAC2, users may only execute user commands, and not even all of those.
* <code>lfs help <command></code>: Print built-in help for command; Alternative: <code>man lfs <command></code>
* <code>lfs find</code>: Drop-in replacement for the <code>find</code> command, much faster on Lustre file systems as it talks directly to the metadata server
* <code>lfs --list-commands</code>: Print a list of available commands

==== Moving data between WORK and PROJECT ====
!! IMPORTANT !! Calling <code>mv</code> on files will not physically move them between the fast and the slow pool of the file system. When <code>mv</code> is used within the same file system, Lustre only renames the files and makes them available under a different path: only the file metadata stored on the MDS, i.e. the path to the file in the directory tree, is modified. The stripes of the file on the OSS remain exactly where they were. The result is the confusing situation that you now have metadata entries under <code>/pfs/10/project</code> that still point to WORK OSTs. This only changes if you either create a copy of the file at a different path (e.g. with <code>cp</code> or <code>rsync</code>) or explicitly instruct Lustre to move the actual file objects to another storage location, e.g. another pool of the same file system.

Proper ways of moving data between the pools:
* Copy the data - which will create new files -, then delete the old files. Example:
<pre>
$> cp -ar /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
</pre>
* Alternative to copy: use <code>rsync</code> to copy data between the workspace and the project directories. Example:
<pre>
$> rsync -av /pfs/10/work/tu_abcde01-my-precious-ws/simulation/output /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/simulation25/
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
</pre>
* If there are many subfolders with similar size, you can use <code>xargs</code> to copy them in parallel:
<pre>
$> find . -maxdepth 1 -mindepth 1 -type d -print | xargs -P4 -I{} rsync -aHAXW --inplace --update {} /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/simulation25/
</pre>
This launches four parallel <code>rsync</code> processes at a time, each copying one of the subdirectories.
* First move the metadata with <code>mv</code>, then use <code>lfs migrate</code> or the wrapper <code>lfs_migrate</code> to actually migrate the file stripes. This is also a possible resolution if you already <code>mv</code>ed data from <code>work</code> to <code>project</code> or vice versa.
** <code>lfs migrate</code> is the raw Lustre command. It can only operate on one file at a time, but offers access to all options.
** <code>lfs_migrate</code> is a versatile wrapper script that can work on single files or recursively on entire directories. If available, it will try to use <code>lfs migrate</code>, otherwise it will fall back to <code>rsync</code> (see <code>lfs_migrate --help</code> for all options.)
Example with <code>lfs migrate</code>:
<pre>
$> mv /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> cd /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> lfs find . -type f --pool work -0 | xargs -0 lfs migrate --pool project # find all files whose file objects are on the work pool and migrate the objects to the project pool
$> ws_release --delete-data my-precious-ws
</pre>
Example with <code>lfs_migrate</code>:
<pre>
$> mv /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> cd /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> lfs_migrate --yes -q -p project * # migrate all file objects in the current directory to the project pool, be quiet (-q) and do not ask for confirmation (--yes)
$> ws_release --delete-data my-precious-ws
</pre>
Both migration commands can also be combined with options to restripe the files during migration, i.e. you can also change the number of OSTs the file is striped over, the size of a single stripe, etc.
Attention! Both <code>lfs migrate</code> and <code>lfs_migrate</code> will not change the path of the file(s), you must also <code>mv</code> them! If used without <code>mv</code>, the files will still belong to the workspace although their file object stripes are now on the <code>project</code> pool and a subsequent <code>rm</code> in the workspace will wipe them.

All of the above procedures may take a considerable amount of time depending on the amount of data, so it might be advisable to execute them in a terminal multiplexer like <code>screen</code> or <code>tmux</code> or wrap them into small SLURM jobs with <code>sbatch --wrap="<command>"</code>.
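
Because neither pool has a backup, it is worth verifying a copy before deleting the source. The copy-then-delete procedure above could be sketched as follows. The helper <code>copy_verify_delete</code> is hypothetical and not a BinAC 2 tool; it assumes that the destination directory is new or empty, and <code>diff -rq</code> reads all data a second time, so expect the check to take about as long as the copy.

```shell
#!/bin/bash
# Copy a directory tree, verify the copy, and empty the source only if both trees match.
# Hypothetical helper; the paths passed in are up to you.
copy_verify_delete() {
    local src="$1" dst="$2"
    mkdir -p "$dst" || return 1
    # The trailing "/." copies the contents of src, including hidden files.
    cp -a "$src"/. "$dst"/ || return 1
    # diff -r compares the trees recursively; -q only reports whether files differ.
    if diff -rq "$src" "$dst" > /dev/null; then
        # Remove everything below src but keep the (now empty) directory itself.
        find "$src" -mindepth 1 -delete
    else
        echo "Verification failed, source left untouched: $src" >&2
        return 1
    fi
}

# Example call with placeholder paths:
# copy_verify_delete /pfs/10/work/<workspace> /pfs/10/project/<project_id>/<target_dir>
```

The source directory itself is kept, so the workspace can still be released with <code>ws_release</code> afterwards.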

Why paths and storage pools should match:
There are four different possible scenarios with two subdirectories and two pools:
* File path in <code>/pfs/10/work</code>, file objects on pool <code>work</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>project</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>work</code>: bad. This will "leak" storage from the fast pool, making it unavailable for workspaces.
* File path in <code>/pfs/10/work</code>, file objects on pool <code>project</code>: bad. Access will be slow, and if (volatile) workspaces are purged, data residing on <code>project</code> will (voluntarily or involuntarily) be deleted.
The latter two situations may arise from <code>mv</code>ing data between workspaces and project folders.
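
To find out whether a project directory is affected by the third scenario, you can count the files whose objects are still on the <code>work</code> pool. The sketch below reuses the <code>lfs find ... --pool</code> call from the migration example; the function names <code>count_files_on_pool</code> and <code>check_project_dir</code> are hypothetical.

```shell
#!/bin/bash
# Count regular files below a directory whose file objects are on a given Lustre pool.
# Uses the same "lfs find ... --pool" call as the migration example above.
count_files_on_pool() {
    local dir="$1" pool="$2"
    lfs find "$dir" -type f --pool "$pool" | wc -l | tr -d ' '
}

# Hypothetical check for a project directory: report files that still occupy the work pool.
check_project_dir() {
    local dir="$1"
    local leaked
    command -v lfs > /dev/null || { echo "lfs is not available on this host" >&2; return 2; }
    leaked="$(count_files_on_pool "$dir" work)"
    if [ "$leaked" -gt 0 ]; then
        echo "$leaked file(s) under $dir still occupy the work pool"
        return 1
    fi
    echo "No files under $dir occupy the work pool"
}

# Example call with a placeholder path:
# check_project_dir /pfs/10/project/<project_id>/<your_dir>
```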

==== More on data striping and how to influence it ====
!! The default striping patterns on BinAC2 are set for good reasons and should not be changed lightly! Doing so incorrectly will, in the best case, only hurt your own performance. In the worst case, it will also hurt all other users and endanger the stability of the cluster. Please talk to the admins first if you think that you need a non-default pattern.
* Reading striping patterns with <code>lfs getstripe</code>
* Setting striping patterns with <code>lfs setstripe</code> for new files and directories
* Restriping files with <code>lfs migrate</code>
* Progressive File Layout

==== Architecture of BinAC2's Lustre File System ====
Metadata Servers:
* 2 metadata servers
* 1 MDT per server
* MDT Capacity: 31TB, hardware RAID6 on NVMe drives (flash memory/SSD)
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

Object storage servers:
* 8 object storage servers
* 2 fast OSTs per server
** 70 TB per OST, software RAID (raid-z2, 10+2 redundancy)
** NVMe drives, directly attached to the PCIe bus
* 8 slow OSTs per server
** 143 TB per OST, hardware RAID (RAID6, 8+2 redundancy)
** externally attached via SAS
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

* All fast OSTs are assigned to the pool <code>work</code>
* All slow OSTs are assigned to the pool <code>project</code>
* All files that are created under <code>/pfs/10/work</code> are by default stored on the fast pool
* All files that are created under <code>/pfs/10/project</code> are by default stored on the slow pool
* Metadata is distributed over both MDTs. All subdirectories of a directory (workspace or project folder) are typically on the same MDT. Directory striping/placement on MDTs cannot be influenced by users.
* Default OST striping: Stripes have size 1 MiB. Files are striped over one OST if possible, i.e. all stripes of a file are on the same OST. New files are created on the most empty OST.
Internally, the slow and the fast pool belong to the same Lustre file system and namespace.
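
As a worked example, the pool sizes can be estimated from the numbers listed above by multiplying servers, OSTs per server and capacity per OST. The function <code>pool_capacity_tb</code> is a hypothetical helper. The results are somewhat larger than the capacities quoted in the file system overview table; the reason is not stated here and may be a difference in units or reserved space, which is an assumption.

```shell
#!/bin/bash
# Raw pool capacity = servers x OSTs per server x capacity per OST (values from the list above).
pool_capacity_tb() {
    local servers="$1" osts_per_server="$2" tb_per_ost="$3"
    echo $(( servers * osts_per_server * tb_per_ost ))
}

echo "work pool:    $(pool_capacity_tb 8 2 70) TB"     # 8 * 2 * 70  = 1120
echo "project pool: $(pool_capacity_tb 8 8 143) TB"    # 8 * 8 * 143 = 9152
```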

More reading:
* [https://doc.lustre.org/lustre_manual.xhtml The Lustre 2.X Manual] ([http://doc.lustre.org/lustre_manual.pdf PDF])
* [https://wiki.lustre.org/Main_Page The Lustre Wiki]

BinAC2/Hardware and Architecture

2025-12-19T16:23:28Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

[[File:Binac2 schema.png|600px|thumb|center|Overview on the BinAC 2 hardware architecture.]]

== Operating System and Software ==

* Operating System: Rocky Linux 9.6
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gb Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand: 84 of the standard compute nodes are connected via HDR-100 InfiniBand (100 Gbit/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC2 has two (almost) separate networks, a 100GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
For the network cards in the nodes, however, there are VPI network cards that can be configured to work in either mode (https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications).
OpenMPI can use a number of layers for transferring data and messages between processes. When it starts up, it tests all means of communication that were configured during compilation and then tries to figure out the fastest path between all processes.
If OpenMPI encounters such a VPI card, it first tries to establish a Remote Direct Memory Access (RDMA) communication channel using the OpenFabrics (OFI) layer.
On nodes with 100Gb Ethernet, this fails because no RDMA protocol is configured. OpenMPI then falls back to TCP transport, but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the Infiniband partition (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on Infiniband nodes as this will be the best choice between nodes!''

= File Systems =

The bwForCluster BinAC 2 has two separate storage systems: one for the user's home directory $HOME and one serving as project and work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) based on Lustre. It offers fast, parallel file access from many nodes and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local SSD, available via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | File System Type
| NFS
| Lustre
| Lustre
| XFS
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead, they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check the project(s) you are a member of via:

<syntaxhighlight>
$ id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
<code>
/pfs/10/project/bw16f003/
</code>

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1, we enforce the work space lifetime because the capacity is limited.
We ask you to store only data that you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you no longer need it on the fast storage.

Each user should create workspaces at <code>/pfs/10/work</code> through the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space, you need to supply a name for it and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first or use the <code>--delete-data</code> option.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

The SMP nodes and the GPU nodes in particular are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1 MiB), or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer lower latency and higher bandwidth than <code>WORK</code>.
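
A common way to follow this advice is to stage data into <code>$TMPDIR</code> at the start of a job, compute there, and copy the results back at the end. The following is a minimal sketch, not a prescribed BinAC 2 workflow: the function <code>run_in_scratch</code>, the subdirectory names <code>job_data</code>, <code>input</code> and <code>output</code>, and the fallback to <code>/tmp</code> outside of a job are all assumptions.

```shell
#!/bin/bash
# Sketch of a stage-in / compute / stage-out pattern for a job script.
# TMPDIR is set by the batch system inside a job; /tmp is only a fallback for testing.
run_in_scratch() {
    local input_dir="$1" result_dir="$2"
    shift 2
    local scratch="${TMPDIR:-/tmp}/job_data"
    mkdir -p "$scratch/input" "$scratch/output" "$result_dir" || return 1
    # Stage in: one sequential copy from the parallel file system to local scratch.
    cp -a "$input_dir"/. "$scratch/input"/ || return 1
    # Run the actual command inside the scratch directory.
    ( cd "$scratch" && "$@" ) || return 1
    # Stage out: copy results back before the job ends and scratch is removed.
    cp -a "$scratch/output"/. "$result_dir"/
}

# Example call with placeholder paths and a placeholder command that writes to ./output:
# run_in_scratch /pfs/10/work/<workspace>/input /pfs/10/work/<workspace>/results \
#     sh -c 'wc -l input/* > output/line_counts.txt'
```

The many small reads and writes of the computation then hit the local XFS disk, while Lustre only sees two bulk copies.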

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

=== More Details on the Lustre File System ===
[https://www.lustre.org/ Lustre] is a distributed parallel file system.
* The entire logical volume as presented to the user is formed by multiple physical or local drives. Data is distributed over more than one physical or logical volume/hard drive, single files can be larger than the capacity of a single hard drive.
* The file system can be mounted from all nodes ("clients") in parallel at the same time for reading and writing. This also means that technically you can write to the same file from two different compute nodes! Usually, this will create an unpredictable mess! Never ever do this unless you know exactly what you are doing!
* On a single server or client, the bandwidth of multiple network interfaces can be aggregated to increase the throughput ("multi-rail").

Lustre works by chopping files into many small parts ("stripes", file objects) which are then stored on the object storage servers. The information which part of the file is stored where on which object storage server, when it was changed last etc. and the entire directory structure is stored on the metadata servers. Think of the entries on the metadata server as being pointers pointing to the actual file objects on the object storage servers.
A Lustre file system can consist of many metadata servers (MDS) and object storage servers (OSS).
Each MDS or OSS can again hold one or more so-called object storage targets (OST) or metadata targets (MDT) which can e.g. be simply multiple hard drives.
The capacity of a Lustre file system can hence be easily scaled by adding more servers.

==== Architecture of BinAC2's Lustre File System ====
Metadata Servers:
* 2 metadata servers
* 1 MDT per server
* MDT Capacity: 31TB, hardware RAID6 on NVMe drives (flash memory/SSD)
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

Object storage servers:
* 8 object storage servers
* 2 fast OSTs per server
** 70 TB per OST, software RAID (raid-z2, 10+2 redundancy)
** NVMe drives, directly attached to the PCIe bus
* 8 slow OSTs per server
** 143 TB per OST, hardware RAID (RAID6, 8+2 redundancy)
** externally attached via SAS
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

* All fast OSTs are assigned to the pool <code>work</code>
* All slow OSTs are assigned to the pool <code>project</code>
* All files that are created under <code>/pfs/10/work</code> are by default stored on the fast pool
* All files that are created under <code>/pfs/10/project</code> are by default stored on the slow pool
* Metadata is distributed over both MDTs. All subdirectories of a directory (workspace or project folder) are typically on the same MDT. Directory striping/placement on MDTs can not be influenced by users.
* Default OST striping: Stripes have size 1 MiB. Files are striped over one OST if possible, i.e. all stripes of a file are on the same OST. New files are created on the most empty OST.
Internally, the slow and the fast pool belong to the same Lustre file system and namespace which has advantages and disadvantages.

==== Moving data between WORK and PROJECT ====
!! IMPORTANT !! Calling <code>mv</code> on files will not physically move them between the pools. When <code>mv</code> is used within the same file system, Lustre only renames the files and makes them available from a different path: the file metadata stored on the MDS, i.e. the path to the file in the directory tree, is modified, while the stripes of the file on the OSS remain exactly where they were. The result is the confusing situation that you now have metadata entries under <code>/pfs/10/project</code> that still point to WORK OSTs. This only changes if you either create a copy of the file at a different path (e.g. with <code>cp</code> or <code>rsync</code>) or explicitly instruct Lustre to move the actual file objects to another storage location, e.g. another pool of the same file system.

Proper ways of moving data between the pools
* Copy the data - which will create new files, then delete the old files. Example:
<pre>
$> cp -ar /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
</pre>
* Alternative to copy: use <code>rsync</code> to copy data between the workspace and the project directories. Example:
<pre>
$> rsync -av /pfs/10/work/tu_abcde01-my-precious-ws/simulation/output /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/simulation25/
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
</pre>
* If there are many subfolders with similar size, you can use <code>xargs</code> to copy them in parallel:
<pre>
$> find . -maxdepth 1 -mindepth 1 -type d -print | xargs -P4 -I{} rsync -aHAXW --inplace --update {} /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/simulation25/
</pre>
This launches four parallel rsync processes at a time, each copying one of the subdirectories.
* Move the metadata with <code>mv</code>, then use <code>lfs migrate</code> or <code>lfs_migrate</code> to actually migrate the stripes.

Why paths and storage pools should match:
There are four different possible scenarios with two subdirectories and two pools:
* File path in <code>/pfs/10/work</code>, file objects on pool <code>work</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>project</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>work</code>: bad. This will "leak" storage from the fast pool, making it unavailable for workspaces.
* File path in <code>/pfs/10/work</code>, file objects on pool <code>project</code>: bad. Access will be slow, and if (volatile) workspaces are purged, data residing on <code>project</code> will (voluntarily or involuntarily) be deleted.
The latter two situations may arise from <code>mv</code>ing data between workspaces and project folders.
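
The four cases can be expressed as a small shell helper. This is only a sketch: the function name <code>classify_placement</code> is made up for illustration, and the pool name has to be supplied by the caller.

```shell
# Prints "good" if the pool matches the directory tree the path lives in,
# "bad" if it does not, and "unknown" for paths outside /pfs/10.
classify_placement() {
    local path="$1" pool="$2" expected
    case "$path" in
        /pfs/10/work/*)    expected="work" ;;
        /pfs/10/project/*) expected="project" ;;
        *) echo "unknown"; return 0 ;;
    esac
    if [ "$pool" = "$expected" ]; then
        echo "good"
    else
        echo "bad"
    fi
}

classify_placement /pfs/10/project/bw10a001/result.dat work
```

On the cluster, the pool of a file can be queried with <code>lfs getstripe -p FILE</code>, so a call could look like <code>classify_placement "$f" "$(lfs getstripe -p "$f")"</code>; this assumes that the Lustre client tools are available on the node you are working on.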

More reading:
* [https://doc.lustre.org/lustre_manual.xhtml The Lustre 2.X Manual] ([http://doc.lustre.org/lustre_manual.pdf PDF])
* [https://wiki.lustre.org/Main_Page The Lustre Wiki]

BinAC2/Hardware and Architecture

2025-12-19T15:06:33Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

[[File:Binac2 schema.png|600px|thumb|center|Overview on the BinAC 2 hardware architecture.]]

== Operating System and Software ==

* Operating System: Rocky Linux 9.6
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 GbE Ethernet.
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; only 84 standard compute nodes are connected via HDR-100 InfiniBand (100 Gb/s). To get your jobs onto the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC 2 has two (almost) separate networks, a 100 GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
For the network cards in the nodes, however, there are VPI cards that can be configured to work in either mode ([https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications ConnectX-6 VPI specifications]).
OpenMPI can use a number of layers for transferring data and messages between processes. At startup, it tests all means of communication that were configured during compilation and then tries to determine the fastest path between all processes.
If OpenMPI encounters such a VPI card, it first tries to establish a Remote Direct Memory Access (RDMA) communication channel using the OpenFabrics (OFI) layer.
On nodes with 100 GbE Ethernet, this fails because no RDMA protocol is configured. OpenMPI falls back to TCP transport, but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the Infiniband partition (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on Infiniband nodes as this will be the best choice between nodes!''
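
If the same job script is used both with and without <code>--constraint=ib</code>, the exports can be made conditional. The following is a sketch under the assumption that you set a variable yourself when submitting to the InfiniBand nodes; the name <code>USE_IB</code> is made up for this example and is not provided by Slurm or BinAC 2.

```shell
# Hypothetical switch: set USE_IB=1 in job scripts submitted with --constraint=ib.
USE_IB="${USE_IB:-0}"

if [ "$USE_IB" != "1" ]; then
    # Ethernet-only or single-node job: disable the OFI/openib transports
    # so that OpenMPI does not probe them and print the warning.
    export OMPI_MCA_btl="^ofi,openib"
    export OMPI_MCA_mtl="^ofi"
fi
```

With <code>USE_IB=1</code> the two variables are left untouched, so OpenMPI can still select the OFI layer on the InfiniBand nodes.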

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work is a parallel file system (PFS) which offers fast and parallel file access and a bigger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible for all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible for the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local SSD, which is accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | File System Type
| NFS
| Lustre
| Lustre
| XFS
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead, they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check which project you are a member of:

<syntaxhighlight>
# id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
```
/pfs/10/project/bw16f003/
```
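
The lookup and the path construction can be combined in a small helper. This is a sketch: the function name <code>project_dirs</code> is made up for illustration, and it assumes that project groups are the only entries in the <code>id</code> output that contain the string <code>bw</code>.

```shell
# Reads the output of `id` on stdin and prints one project directory per bw* group.
project_dirs() {
    grep -o 'bw[^)]*' | sed -e 's|^|/pfs/10/project/|' -e 's|$|/|'
}

# Usage on the cluster: id "$USER" | project_dirs
```

If you are a member of several projects, one directory per project is printed.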

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user should create workspaces at <code>/pfs/10/work</code> through the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first or use the <code>--delete-data</code> option.
|-
|}
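
In job scripts it can be handy to allocate a work space only if it does not exist yet. The following helper is a sketch, not part of the workspace tools: the function name <code>ensure_workspace</code> is made up, and it assumes that <code>ws_find</code> prints nothing (or fails) for a work space that does not exist.

```shell
# Prints the path of the work space NAME, allocating it for DAYS days if needed.
ensure_workspace() {
    local name="$1" days="${2:-30}" dir
    # Assumption: ws_find yields no output for an unknown work space.
    dir="$(ws_find "$name" 2>/dev/null)" || dir=""
    if [ -z "$dir" ]; then
        ws_allocate "$name" "$days" >/dev/null
        dir="$(ws_find "$name")"
    fi
    printf '%s\n' "$dir"
}

# Usage on the cluster: cd "$(ensure_workspace mywork 30)"
```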

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

=== More Details on the Lustre File System ===
[https://www.lustre.org/ Lustre] is a distributed parallel file system.
* The entire logical volume as presented to the user is formed by multiple physical or local drives. Data is distributed over more than one physical or logical volume/hard drive, single files can be larger than the capacity of a single hard drive.
* The file system can be mounted from all nodes ("clients") in parallel at the same time for reading and writing. This also means that technically you can write to the same file from two different compute nodes! Usually, this will create an unpredictable mess! Never ever do this unless you know exactly what you are doing!
* On a single server or client, the bandwidth of multiple network interfaces can be aggregated to increase the throughput ("multi-rail").

Lustre works by chopping files into many small parts ("stripes", file objects) which are then stored on the object storage servers. The information which part of the file is stored where on which object storage server, when it was changed last etc. and the entire directory structure is stored on the metadata servers. Think of the entries on the metadata server as being pointers pointing to the actual file objects on the object storage servers.
A Lustre file system can consist of many metadata servers (MDS) and object storage servers (OSS).
Each MDS or OSS can again hold one or more so-called object storage targets (OST) or metadata targets (MDT) which can e.g. be simply multiple hard drives.
The capacity of a Lustre file system can hence be easily scaled by adding more servers.
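
The mapping from a byte offset to a stripe can be illustrated with a little shell arithmetic. The sketch assumes the usual round-robin (RAID-0 style) layout and a hypothetical stripe count of 4; the 1 MiB stripe size matches the BinAC 2 default, whereas with the default striping over a single OST every stripe of a file ends up on the same OST.

```shell
stripe_size=$((1024 * 1024))      # 1 MiB per stripe
stripe_count=4                    # hypothetical: file striped over 4 OSTs
offset=$((5 * 1024 * 1024 + 1))   # example byte offset within the file

# Which stripe contains the byte, and on which of the file's OSTs it is stored.
stripe_index=$((offset / stripe_size))
ost_slot=$((stripe_index % stripe_count))

echo "byte $offset is in stripe $stripe_index, stored on OST slot $ost_slot of $stripe_count"
```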

==== Architecture of BinAC2's Lustre File System ====
Metadata Servers:
* 2 metadata servers
* 1 MDT per server
* MDT Capacity: 31TB, hardware RAID6 on NVMe drives (flash memory/SSD)
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

Object storage servers:
* 8 object storage servers
* 2 fast OSTs per server
** 70 TB per OST, software RAID (raid-z2, 10+2 redundancy)
** NVMe drives, directly attached to the PCIe bus
* 8 slow OSTs per server
** 143 TB per OST, hardware RAID (RAID6, 8+2 redundancy)
** externally attached via SAS
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

* All fast OSTs are assigned to the pool <code>work</code>
* All slow OSTs are assigned to the pool <code>project</code>
* All files that are created under <code>/pfs/10/work</code> are by default stored on the fast pool
* All files that are created under <code>/pfs/10/project</code> are by default stored on the slow pool
* Metadata is distributed over both MDTs. All subdirectories of a directory (workspace or project folder) are typically on the same MDT. Directory striping/placement on MDTs can not be influenced by users.
* Default OST striping: Stripes have size 1 MiB. Files are striped over one OST if possible, i.e. all stripes of a file are on the same OST. New files are created on the most empty OST.
Internally, the slow and the fast pool belong to the same Lustre file system and namespace which has advantages and disadvantages.

==== Moving data between WORK and PROJECT ====
!! IMPORTANT !! Calling <code>mv</code> on files will not physically move them between the pools. When <code>mv</code> is used within the same file system, Lustre only renames the files and makes them available from a different path: the file metadata stored on the MDS, i.e. the path to the file in the directory tree, is modified, while the stripes of the file on the OSS remain exactly where they were. The result is the confusing situation that you now have metadata entries under <code>/pfs/10/project</code> that still point to WORK OSTs. This only changes if you either create a copy of the file at a different path (e.g. with <code>cp</code> or <code>rsync</code>) or explicitly instruct Lustre to move the actual file objects to another storage location, e.g. another pool of the same file system.

Proper ways of moving data between the pools
* Copy the data - which will create new files, then delete the old files. Example:
<pre>
$> cp -ar /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
</pre>
* Alternative to copy: use <code>rsync</code> to copy data between the workspace and the project directories. Example:
<pre>
$> rsync -av /pfs/10/work/tu_abcde01-my-precious-ws/simulation/output /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/simulation25/
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
</pre>
* If there are many subfolders with similar size, you can use <code>xargs</code> to copy them in parallel:
<pre>
$> find . -maxdepth 1 -mindepth 1 -type d -print | xargs -P4 -I{} rsync -aHAXW --inplace --update {} /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/simulation25/
</pre>
This launches four parallel rsync processes at a time, each copying one of the subdirectories.
* Move the metadata with <code>mv</code>, then use <code>lfs migrate</code> or <code>lfs_migrate</code> to actually migrate the stripes.
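
The parallel copy pattern from the list above can be tried out on throwaway data before it is applied to real directories. The sketch below uses temporary directories and <code>cp -a</code> instead of <code>rsync</code> so that it runs anywhere; it also passes null-delimited names (<code>-print0</code> / <code>-0</code>), which keeps directory names containing spaces intact. All directory and file names are made up.

```shell
# Throwaway source and destination directories.
src="$(mktemp -d)"
dst="$(mktemp -d)"
mkdir -p "$src/run1" "$src/run 2"
echo "a" > "$src/run1/data.txt"
echo "b" > "$src/run 2/data.txt"

# Up to four copy processes at a time, one per subdirectory.
(cd "$src" && find . -mindepth 1 -maxdepth 1 -type d -print0 \
    | xargs -0 -P4 -I{} cp -a {} "$dst"/)

ls "$dst"
```

Parallel copies only pay off if the subdirectories are of similar size; otherwise the largest one determines the total runtime.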

Why paths and storage pools should match:
There are four different possible scenarios with two subdirectories and two pools:
* File path in <code>/pfs/10/work</code>, file objects on pool <code>work</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>project</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>work</code>: bad. This will "leak" storage from the fast pool, making it unavailable for workspaces.
* File path in <code>/pfs/10/work</code>, file objects on pool <code>project</code>: bad. Access will be slow, and if (volatile) workspaces are purged, data residing on <code>project</code> will (voluntarily or involuntarily) be deleted.
The latter two situations may arise from <code>mv</code>ing data between workspaces and project folders.

More reading:
* [https://doc.lustre.org/lustre_manual.xhtml The Lustre 2.X Manual] ([http://doc.lustre.org/lustre_manual.pdf PDF])
* [https://wiki.lustre.org/Main_Page The Lustre Wiki]

BinAC2/Hardware and Architecture

2025-12-19T14:16:01Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

[[File:Binac2 schema.png|600px|thumb|center|Overview on the BinAC 2 hardware architecture.]]

== Operating System and Software ==

* Operating System: Rocky Linux 9.6
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 GbE Ethernet.
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; only 84 standard compute nodes are connected via HDR-100 InfiniBand (100 Gb/s). To get your jobs onto the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC 2 has two (almost) separate networks, a 100 GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
For the network cards in the nodes, however, there are VPI cards that can be configured to work in either mode ([https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications ConnectX-6 VPI specifications]).
OpenMPI can use a number of layers for transferring data and messages between processes. At startup, it tests all means of communication that were configured during compilation and then tries to determine the fastest path between all processes.
If OpenMPI encounters such a VPI card, it first tries to establish a Remote Direct Memory Access (RDMA) communication channel using the OpenFabrics (OFI) layer.
On nodes with 100 GbE Ethernet, this fails because no RDMA protocol is configured. OpenMPI falls back to TCP transport, but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the Infiniband partition (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on Infiniband nodes as this will be the best choice between nodes!''

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work is a parallel file system (PFS) which offers fast and parallel file access and a bigger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible for all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible for the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local SSD, accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | File System Type
| NFS
| Lustre
| Lustre
| XFS
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

global : all nodes access the same file system
local : each node has its own file system
permanent : files are stored permanently
batch job walltime : files are removed at end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are in continuous use, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited we enforce a quota of 40GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O intense multinode-jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check which project you are a member of:

<syntaxhighlight>
# id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
```
/pfs/10/project/bw16f003/
```
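
The two steps above can be combined into a small helper that prints the project directory directly. This is a sketch only: the function name <code>project_dir_from_id</code> is hypothetical, and it assumes that the project group is the first group whose name starts with <code>bw</code>, as in the example above.

```shell
#!/bin/bash
# project_dir_from_id: hypothetical helper. Reads `id` output on stdin and
# prints the project directory for the first group name starting with "bw".
project_dir_from_id() {
    local project_id
    project_id=$(grep -o 'bw[^)]*' | head -n 1)
    [ -n "$project_id" ] || return 1
    printf '/pfs/10/project/%s/\n' "$project_id"
}

# Made-up `id` output for illustration.
# On the cluster you would run: id "$USER" | project_dir_from_id
sample='uid=1234(tu_abcde01) gid=1234(tu_abcde01) groups=1234(tu_abcde01),5678(bw16f003)'
project_dir=$(printf '%s\n' "$sample" | project_dir_from_id)
echo "$project_dir"
```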

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user should create workspaces at <code>/pfs/10/work</code> through the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first or use the <code>--delete-data</code> option.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.
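
A typical pattern is to stage input data to the scratch directory, do the I/O-heavy work there, and copy the results back before the job ends. The sketch below illustrates this under stated assumptions: <code>data_dir</code> is a temporary stand-in for one of your workspace or project directories, the line count is a placeholder for the real computation, and the fallback to <code>mktemp</code> exists only so that the snippet also runs outside a batch job.

```shell
#!/bin/bash
# Inside a batch job, $TMPDIR points to the node-local scratch directory.
# Assumption for this sketch: fall back to a temporary directory outside a job.
scratch="${TMPDIR:-$(mktemp -d)}"

# Hypothetical stand-in for a workspace or project directory with input data.
data_dir=$(mktemp -d)
printf 'a\nb\nc\n' > "$data_dir/input.txt"

# 1. Stage the input data to the local scratch space.
job_dir=$(mktemp -d "$scratch/stage.XXXXXX")
cp "$data_dir/input.txt" "$job_dir/"

# 2. Do the I/O-heavy work on scratch (placeholder: count the input lines).
wc -l < "$job_dir/input.txt" | tr -d ' ' > "$job_dir/result.txt"

# 3. Copy the results back, because scratch is removed at the end of the job.
cp "$job_dir/result.txt" "$data_dir/"
rm -rf "$job_dir"
cat "$data_dir/result.txt"
```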

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

=== More Details on the Lustre File System ===
[https://www.lustre.org/ Lustre] is a distributed parallel file system.
* The entire logical volume as presented to the user is formed by multiple physical or local drives. Data is distributed over more than one physical or logical volume/hard drive, single files can be larger than the capacity of a single hard drive.
* The file system can be mounted from all nodes ("clients") in parallel at the same time for reading and writing. This also means that technically you can write to the same file from two different compute nodes! Usually, this will create an unpredictable mess! Never ever do this unless you know exactly what you are doing!
* On a single server or client, the bandwidth of multiple network interfaces can be aggregated to increase the throughput ("multi-rail").

Lustre works by chopping files into many small parts ("stripes", file objects) which are then stored on the object storage servers. The information which part of the file is stored where on which object storage server, when it was changed last etc. and the entire directory structure is stored on the metadata servers. Think of the entries on the metadata server as being pointers pointing to the actual file objects on the object storage servers.
A Lustre file system can consist of many metadata servers (MDS) and object storage servers (OSS).
Each MDS or OSS can in turn hold one or more metadata targets (MDT) or object storage targets (OST), respectively, which can, for example, simply be multiple hard drives.
The capacity of a Lustre file system can hence be easily scaled by adding more servers.

==== Architecture of BinAC2's Lustre File System ====
Metadata Servers:
* 2 metadata servers
* 1 MDT per server
* MDT Capacity: 31TB, hardware RAID6 on NVMe drives (flash memory/SSD)
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

Object storage servers:
* 8 object storage servers
* 2 fast OSTs per server
** 70 TB per OST, software RAID (raid-z2, 10+2 redundancy)
** NVMe drives, directly attached to the PCIe bus
* 8 slow OSTs per server
** 143 TB per OST, hardware RAID (RAID6, 8+2 redundancy)
** externally attached via SAS
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand
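
As a rough cross-check, the nominal size of each OST class follows from the number of servers, the OSTs per server and the size per OST listed above. The sketch below only multiplies these figures; it ignores file system overhead and any difference between decimal and binary units, so the results need not match the usable capacities quoted elsewhere exactly.

```shell
#!/bin/bash
# Figures taken from the list above.
oss=8                              # object storage servers
fast_tb=$(( oss * 2 * 70 ))        # 2 fast OSTs per server, 70 TB each
slow_tb=$(( oss * 8 * 143 ))       # 8 slow OSTs per server, 143 TB each
echo "fast OSTs: ${fast_tb} TB nominal"
echo "slow OSTs: ${slow_tb} TB nominal"
```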

* All fast OSTs are assigned to the pool <code>work</code>
* All slow OSTs are assigned to the pool <code>project</code>
* All files that are created under <code>/pfs/10/work</code> are by default stored on the fast pool
* All files that are created under <code>/pfs/10/project</code> are by default stored on the slow pool
* Metadata is distributed over both MDTs. All subdirectories of a directory (workspace or project folder) are typically on the same MDT. Directory striping/placement on MDTs can not be influenced by users.
* Default OST striping: Stripes have size 1 MiB. Files are striped over one OST if possible, i.e. all stripes of a file are on the same OST. New files are created on the most empty OST.
Internally, the slow and the fast pool belong to the same Lustre file system and namespace which has advantages and disadvantages.

==== Moving data between WORK and PROJECT ====
!! IMPORTANT !! Calling <code>mv</code> on files will not physically move them. When using <code>mv</code> within the same file system, Lustre only renames the files: the file metadata stored on the MDS, i.e. the path to the file in the directory tree, is modified, while the stripes of the file on the OSS remain exactly where they were. The result is the confusing situation that you now have metadata entries under <code>/pfs/10/project</code> that still point to WORK OSTs. This only changes if you either create a copy of the file at a different path (e.g. with <code>cp</code> or <code>rsync</code>) or explicitly instruct Lustre to move the actual file objects to another storage location, e.g. another pool of the same file system.

Proper ways of moving data between the pools
* Copy the data - which will create new files, then delete the old files. Example:
<pre>
$> cp -ar /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
</pre>
* Alternative to copy: use <code>rsync</code> to copy data between the workspace and the project directories. Example:
<pre>
$> rsync -av /pfs/10/work/tu_abcde01-my-precious-ws/simulation/output /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/simulation25/
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
</pre>
* Move the metadata with <code>mv</code>, then use <code>lfs migrate</code> or <code>lfs_migrate</code> to actually migrate the stripes.

Why paths and storage pools should match:
There are four different possible scenarios with two subdirectories and two pools:
* File path in <code>/pfs/10/work</code>, file objects on pool <code>work</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>project</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>work</code>: bad. This will "leak" storage from the fast pool, making it unavailable for workspaces.
* File path in <code>/pfs/10/work</code>, file objects on pool <code>project</code>: bad. Access will be slow, and if (volatile) workspaces are purged, data residing on <code>project</code> will (voluntarily or involuntarily) be deleted.
The latter two situations may arise from <code>mv</code>ing data between workspaces and project folders.
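
The four scenarios can be checked per file by comparing the pool that Lustre reports with the pool expected from the path. The following is a sketch under assumptions: the helper names <code>expected_pool</code> and <code>check_pool</code> are hypothetical, and the options <code>lfs getstripe -p</code> (print the pool name) and <code>lfs migrate -p</code> (set the target pool) should be verified against <code>lfs help</code> on the system before use. A file without an explicitly recorded pool may report an empty pool name.

```shell
#!/bin/bash
# expected_pool: hypothetical helper that maps a path to the pool
# its file objects should live on.
expected_pool() {
    case "$1" in
        /pfs/10/work/*)    echo work ;;
        /pfs/10/project/*) echo project ;;
        *)                 return 1 ;;
    esac
}

# check_pool: compare the pool reported by Lustre with the expected one.
# Needs the lfs client tool, so it is only meaningful on the cluster.
check_pool() {
    local file=$1 want have
    want=$(expected_pool "$file") || return 2
    have=$(lfs getstripe -p "$file") || return 2
    if [ "$have" = "$want" ]; then
        echo "ok: $file is on pool $want"
    else
        echo "mismatch: $file is on pool '$have', expected '$want'"
        echo "possible fix: lfs migrate -p $want \"$file\""
        return 1
    fi
}

# The path-to-pool mapping works without Lustre (example path is made up):
expected_pool /pfs/10/project/bw16f003/results.dat
```

On the cluster, <code>check_pool</code> could be applied to individual files after a <code>mv</code> between a workspace and a project directory.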

More reading:
* [https://doc.lustre.org/lustre_manual.xhtml The Lustre 2.X Manual] ([http://doc.lustre.org/lustre_manual.pdf PDF])
* [https://wiki.lustre.org/Main_Page The Lustre Wiki]

BinAC2/Hardware and Architecture

2025-12-18T18:45:55Z

S Behnle: More on Lustre

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

[[File:Binac2 schema.png|600px|thumb|center|Overview on the BinAC 2 hardware architecture.]]

== Operating System and Software ==

* Operating System: Rocky Linux 9.6
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gb Ethernet.
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; 84 standard compute nodes are connected via HDR-100 InfiniBand (100 Gb/s). To get your jobs onto the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC2 has two (almost) separate networks, a 100GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
Concerning the network cards for the nodes, however, there exist VPI network cards which can be configured to work in either mode (https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications).
OpenMPI can use a number of layers for transferring data and messages between processes. When it ramps up, it will test all means of communication that were configured during compilation and then tries to figure out the fastest path between all processes.
If OpenMPI encounters such a VPI card, it will first try to establish a Remote Direct Memory Access communication (RDMA) channel using the OpenFabrics (OFI) layer.
On nodes with 100Gb ethernet, this fails as there is no RDMA protocol configured. OpenMPI will fall back to TCP transport but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the Infiniband partition (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on Infiniband nodes as this will be the best choice between nodes!''

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work is a parallel file system (PFS) which offers fast and parallel file access and a bigger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible for all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible for the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local SSD, accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | File System Type
| NFS
| Lustre
| Lustre
| XFS
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

global : all nodes access the same file system
local : each node has its own file system
permanent : files are stored permanently
batch job walltime : files are removed at end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are in continuous use, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited we enforce a quota of 40GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O intense multinode-jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check which project you are a member of:

<syntaxhighlight>
# id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
```
/pfs/10/project/bw16f003/
```

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user should create workspaces at <code>/pfs/10/work</code> through the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first or use the <code>--delete-data</code> option.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

=== More Details on the Lustre File System ===
[https://www.lustre.org/ Lustre] is a distributed parallel file system.
* The entire logical volume as presented to the user is formed by multiple physical or local drives. Data is distributed over more than one physical or logical volume/hard drive, single files can be larger than the capacity of a single hard drive.
* The file system can be mounted from all nodes ("clients") in parallel at the same time for reading and writing. This also means that technically you can write to the same file from two different compute nodes! Usually, this will create an unpredictable mess! Never ever do this unless you know exactly what you are doing!
* On a single server or client, the bandwidth of multiple network interfaces can be aggregated to increase the throughput ("multi-rail").

Lustre works by chopping files into many small parts ("stripes", file objects) which are then stored on the object storage servers. The information which part of the file is stored where on which object storage server, when it was changed last etc. and the entire directory structure is stored on the metadata servers. Think of the entries on the metadata server as being pointers pointing to the actual file objects on the object storage servers.
A Lustre file system can consist of many metadata servers (MDS) and object storage servers (OSS).
Each MDS or OSS can in turn hold one or more metadata targets (MDT) or object storage targets (OST), respectively, which can, for example, simply be multiple hard drives.
The capacity of a Lustre file system can hence be easily scaled by adding more servers.

==== Architecture of BinAC2's Lustre File System ====
Metadata Servers:
* 2 metadata servers
* 1 MDT per server
* MDT Capacity: 31TB, hardware RAID6 on NVMe drives (flash memory/SSD)
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

Object storage servers:
* 8 object storage servers
* 2 fast OSTs per server
** 70 TB per OST, software RAID (raid-z2, 10+2 redundancy)
** NVMe drives, directly attached to the PCIe bus
* 8 slow OSTs per server
** 143 TB per OST, hardware RAID (RAID6, 8+2 redundancy)
** externally attached via SAS
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

* All fast OSTs are assigned to the pool <code>work</code>
* All slow OSTs are assigned to the pool <code>project</code>
* All files that are created under <code>/pfs/10/work</code> are by default stored on the fast pool
* All files that are created under <code>/pfs/10/project</code> are by default stored on the slow pool
* Metadata is distributed over both MDTs. All subdirectories of a directory (workspace or project folder) are typically on the same MDT. Directory striping/placement on MDTs can not be influenced by users.
* Default OST striping: Stripes have size 1 MiB. Files are striped over one OST if possible, i.e. all stripes of a file are on the same OST. New files are created on the most empty OST.
Internally, the slow and the fast pool belong to the same Lustre file system and namespace which has advantages and disadvantages.

==== Moving data between WORK and PROJECT ====
!! IMPORTANT !! Calling <code>mv</code> on files will not physically move them. When using <code>mv</code> within the same file system, Lustre only renames the files: the file metadata stored on the MDS, i.e. the path to the file in the directory tree, is modified, while the stripes of the file on the OSS remain exactly where they were. The result is the confusing situation that you now have metadata entries under <code>/pfs/10/project</code> that still point to WORK OSTs. This only changes if you either create a copy of the file at a different path (e.g. with <code>cp</code> or <code>rsync</code>) or explicitly instruct Lustre to move the actual file objects to another storage location, e.g. another pool of the same file system.

Proper ways of moving data between the pools
* Copy the data - which will create new files, then delete the old files. Example:
<pre>
$> cp -ar /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
</pre>
* Move the metadata with <code>mv</code>, then use <code>lfs migrate</code> or <code>lfs_migrate</code> to actually migrate the stripes.

Why paths and storage pools should match:
There are four different possible scenarios with two subdirectories and two pools:
* File path in <code>/pfs/10/work</code>, file objects on pool <code>work</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>project</code>: good.
* File path in <code>/pfs/10/project</code>, file objects on pool <code>work</code>: bad. This will "leak" storage from the fast pool, making it unavailable for workspaces.
* File path in <code>/pfs/10/work</code>, file objects on pool <code>project</code>: bad. Access will be slow, and if (volatile) workspaces are purged, data residing on <code>project</code> will (voluntarily or involuntarily) be deleted.
The latter two situations may arise from <code>mv</code>ing data between workspaces and project folders.

More reading:
* [https://doc.lustre.org/lustre_manual.xhtml The Lustre 2.X Manual] ([http://doc.lustre.org/lustre_manual.pdf PDF])
* [https://wiki.lustre.org/Main_Page The Lustre Wiki]

BinAC2/Hardware and Architecture

2025-12-18T17:02:08Z

S Behnle: Added a BinAC2 schematic

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

[[File:Binac2 schema.png|600px|thumb|center|Overview on the BinAC 2 hardware architecture.]]

== Operating System and Software ==

* Operating System: Rocky Linux 9.6
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gb Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; only 84 standard compute nodes have an HDR-100 InfiniBand (100 Gb/s) connection. To get your jobs onto the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC2 has two (almost) separate networks, a 100GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
Concerning the network cards for the nodes, however, there exist VPI network cards which can be configured to work in either mode (https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications).
OpenMPI can use a number of layers for transferring data and messages between processes. When it starts up, it tests all means of communication that were configured during compilation and then tries to figure out the fastest path between all processes.
If OpenMPI encounters such a VPI card, it will first try to establish a Remote Direct Memory Access communication (RDMA) channel using the OpenFabrics (OFI) layer.
On nodes with 100Gb ethernet, this fails as there is no RDMA protocol configured. OpenMPI will fall back to TCP transport but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the Infiniband partition (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on Infiniband nodes as this will be the best choice between nodes!''
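
As an illustration of where these lines go, here is a minimal job script sketch. The resource requests and the program name <code>my_mpi_program</code> are placeholders chosen for this example; only the two <code>export</code> lines come from the workaround above.

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

# Disable the OFI/openib transports (values taken from the workaround above).
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"

# Print the settings so they end up in the job log.
env | grep '^OMPI_MCA_' | sort

# Hypothetical application call; replace with your own program.
# mpirun ./my_mpi_program
</syntaxhighlight>

Setting the variables in the job script rather than in your shell startup files keeps them away from jobs that run on the InfiniBand nodes.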

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) that offers fast, parallel file access and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible for the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid-state disk (SSD), available via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | File System Type
| NFS
| Lustre
| Lustre
| XFS
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead, they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check which project you are a member of:

<syntaxhighlight>
$ id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
<syntaxhighlight>
/pfs/10/project/bw16f003/
</syntaxhighlight>

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user should create workspaces at <code>/pfs/10/work</code> through the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first or use the <code>--delete-data</code> option.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.
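
A typical staging pattern for the local scratch space can be sketched as follows. This is only an illustration: the function name <code>stage_run</code> and the file names are made up, and <code>sort</code> stands in for the real computation. Outside of a batch job, where <code>$TMPDIR</code> may be unset, the sketch falls back to a directory created with <code>mktemp</code>.

<syntaxhighlight lang="bash">
# stage_run INPUT OUTPUT_DIR: copy INPUT to node-local scratch, process it
# there, and copy the result back to OUTPUT_DIR.
stage_run() {
    local input="$1" outdir="$2"
    # $TMPDIR is set by the batch system inside a job; fall back to a
    # temporary directory so the sketch also runs outside of a job.
    local scratch="${TMPDIR:-$(mktemp -d)}"
    local work="$scratch/stage.$$"
    mkdir -p "$work" "$outdir"
    cp "$input" "$work/input.dat"
    # Placeholder for the real computation: all I/O stays on local scratch.
    sort "$work/input.dat" > "$work/result.dat"
    # Copy only the final result back to the shared file system.
    cp "$work/result.dat" "$outdir/"
    rm -rf "$work"
}
</syntaxhighlight>

Inside a job script this could be called as <code>stage_run "$(ws_find mywork)/input.dat" "$(ws_find mywork)/results"</code>, so that only two large sequential transfers touch the Lustre file system.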

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

=== More Details on the Lustre File System ===
[https://www.lustre.org/ Lustre] is a distributed parallel file system.
* The entire logical volume as presented to the user is formed by multiple physical or local drives. Data is distributed over more than one physical or logical volume/hard drive, single files can be larger than the capacity of a single hard drive.
* The file system can be mounted from all nodes ("clients") in parallel at the same time for reading and writing. This also means that technically you can write to the same file from two different compute nodes! Usually, this will create an unpredictable mess! Never ever do this unless you know exactly what you are doing!
* On a single server or client, the bandwidth of multiple network interfaces can be aggregated to increase the throughput ("multi-rail").

Lustre works by chopping files into many small parts ("stripes", file objects) which are then stored on the object storage servers. The information which part of the file is stored where on which object storage server, when it was changed last etc. and the entire directory structure is stored on the metadata servers. Think of the entries on the metadata server as being pointers pointing to the actual file objects on the object storage servers.
A Lustre file system can consist of many metadata servers (MDS) and object storage servers (OSS).
Each MDS or OSS can in turn hold one or more metadata targets (MDT) or object storage targets (OST), respectively, which can, for example, simply be multiple hard drives.
The capacity of a Lustre file system can hence be easily scaled by adding more servers.

==== Architecture of BinAC2's Lustre File System ====
Metadata Servers:
* 2 metadata servers
* 1 MDT per server
* MDT Capacity: 31TB, hardware RAID6 on NVMe drives (flash memory/SSD)
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

Object storage servers:
* 8 object storage servers
* 2 fast OSTs per server
** 70 TB per OST, software RAID (raid-z2, 10+2 redundancy)
** NVMe drives, directly attached to the PCIe bus
* 8 slow OSTs per server
** 143 TB per OST, hardware RAID (RAID6, 8+2 redundancy)
** externally attached via SAS
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

* All fast OSTs are assigned to the pool <code>work</code>
* All slow OSTs are assigned to the pool <code>project</code>
* All files that are created under <code>/pfs/10/work</code> are by default stored on the fast pool
* All files that are created under <code>/pfs/10/project</code> are by default stored on the slow pool
* Metadata is distributed over both MDTs. All subdirectories of a directory (workspace or project folder) are typically on the same MDT. Directory striping/placement on MDTs cannot be influenced by users.
* Default OST striping: Stripes have size 1 MiB. Files are striped over one OST if possible, i.e. all stripes of a file are on the same OST. New files are created on the most empty OST.
Internally, the slow and the fast pool belong to the same Lustre file system and namespace which has advantages and disadvantages.

==== Moving data between WORK and PROJECT ====
!! IMPORTANT !! Calling <code>mv</code> on files will not physically move them. Instead, only the file metadata, i.e. the path to the file in the directory tree (data stored on the MDS), will be modified. The stripes of the file on the OSS, however, will remain exactly where they were. The only result will be the confusing situation that you now have metadata entries under <code>/pfs/10/project</code> that still point to WORK OSTs.

Proper ways of moving data between the pools
* Copy the data - which will create new files, then delete the old files
* Move the metadata with mv, then use lfs migrate or lfs_migrate to actually migrate the stripes
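
The second approach could look like the following sketch for a single file. The function name <code>migrate_file</code> and the <code>DRY_RUN</code> switch are hypothetical; the pool name <code>project</code> is taken from the architecture description above, and <code>-p</code> is assumed to be the usual Lustre pool option of <code>lfs migrate</code> (please check <code>lfs migrate --help</code> on the cluster). By default the sketch only prints the commands.

<syntaxhighlight lang="bash">
# migrate_file SRC DEST: rename SRC to DEST (metadata only), then move the
# file objects of DEST to the pool "project".
migrate_file() {
    local src="$1" dest="$2"
    # With DRY_RUN=1 (the default in this sketch) commands are only printed.
    local run=""
    if [ "${DRY_RUN:-1}" = "1" ]; then run="echo"; fi
    $run mv -- "$src" "$dest"
    $run lfs migrate -p project "$dest"
}
</syntaxhighlight>

Running <code>DRY_RUN=0 migrate_file SRC DEST</code> executes the two commands. Afterwards, <code>lfs getstripe -p DEST</code> should report the pool <code>project</code>.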

More reading:
* [https://doc.lustre.org/lustre_manual.xhtml The Lustre 2.X Manual] ([http://doc.lustre.org/lustre_manual.pdf PDF])
* [https://wiki.lustre.org/Main_Page The Lustre Wiki]

File:Binac2 schema.png

2025-12-18T16:56:23Z

S Behnle: A schematic drawing showing the nodes, partitions and connectivity of the BinAC2 cluster

== Summary ==
A schematic drawing showing the nodes, partitions and connectivity of the BinAC2 cluster

BinAC2/Hardware and Architecture

2025-12-16T18:44:21Z

S Behnle: Lustre stuff

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.6
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gb Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; only 84 standard compute nodes have an HDR-100 InfiniBand (100 Gb/s) connection. To get your jobs onto the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC2 has two (almost) separate networks, a 100GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
Concerning the network cards for the nodes, however, there exist VPI network cards which can be configured to work in either mode (https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications).
OpenMPI can use a number of layers for transferring data and messages between processes. When it starts up, it tests all means of communication that were configured during compilation and then tries to figure out the fastest path between all processes.
If OpenMPI encounters such a VPI card, it will first try to establish a Remote Direct Memory Access communication (RDMA) channel using the OpenFabrics (OFI) layer.
On nodes with 100Gb ethernet, this fails as there is no RDMA protocol configured. OpenMPI will fall back to TCP transport but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the Infiniband partition (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on Infiniband nodes as this will be the best choice between nodes!''

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) that offers fast, parallel file access and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible for the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid-state disk (SSD), available via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | File System Type
| NFS
| Lustre
| Lustre
| XFS
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead, they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check which project you are a member of:

<syntaxhighlight>
$ id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
<syntaxhighlight>
/pfs/10/project/bw16f003/
</syntaxhighlight>

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user should create workspaces at <code>/pfs/10/work</code> through the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first or use the <code>--delete-data</code> option.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

=== More Details on the Lustre File System ===
[https://www.lustre.org/ Lustre] is a distributed parallel file system.
* The entire logical volume as presented to the user is formed by multiple physical or local drives. Data is distributed over more than one physical or logical volume/hard drive, single files can be larger than the capacity of a single hard drive.
* The file system can be mounted from all nodes ("clients") in parallel at the same time for reading and writing. This also means that technically you can write to the same file from two different compute nodes! Usually, this will create an unpredictable mess! Never ever do this unless you know exactly what you are doing!
* On a single server or client, the bandwidth of multiple network interfaces can be aggregated to increase the throughput ("multi-rail").

Lustre works by chopping files into many small parts ("stripes", file objects) which are then stored on the object storage servers. The information which part of the file is stored where on which object storage server, when it was changed last etc. and the entire directory structure is stored on the metadata servers. Think of the entries on the metadata server as being pointers pointing to the actual file objects on the object storage servers.
A Lustre file system can consist of many metadata servers (MDS) and object storage servers (OSS).
Each MDS or OSS can in turn hold one or more metadata targets (MDT) or object storage targets (OST), respectively, which can, for example, simply be multiple hard drives.
The capacity of a Lustre file system can hence be easily scaled by adding more servers.

==== Architecture of BinAC2's Lustre File System ====
Metadata Servers:
* 2 metadata servers
* 1 MDT per server
* MDT Capacity: 31TB, hardware RAID6 on NVMe drives (flash memory/SSD)
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

Object storage servers:
* 8 object storage servers
* 2 fast OSTs per server
** 70 TB per OST, software RAID (raid-z2, 10+2 redundancy)
** NVMe drives, directly attached to the PCIe bus
* 8 slow OSTs per server
** 143 TB per OST, hardware RAID (RAID6, 8+2 redundancy)
** externally attached via SAS
* Networking: 2x 100 GbE, 2x HDR-100 InfiniBand

* All fast OSTs are assigned to the pool <code>work</code>
* All slow OSTs are assigned to the pool <code>project</code>
* All files that are created under <code>/pfs/10/work</code> are by default stored on the fast pool
* All files that are created under <code>/pfs/10/project</code> are by default stored on the slow pool
* Metadata is distributed over both MDTs. All subdirectories of a directory (workspace or project folder) are typically on the same MDT. Users cannot influence directory striping/placement on MDTs.
* Default OST striping: Stripes have size 1 MiB. Files are striped over one OST if possible, i.e. all stripes of a file are on the same OST. New files are created on the most empty OST.
Internally, the slow and the fast pool belong to the same Lustre file system and namespace which has advantages and disadvantages.
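
Because both pools share one namespace, the path of a file does not necessarily tell you which pool holds its data. A minimal sketch for checking this, assuming the Lustre client tool <code>lfs</code> is available on the node and that <code>lfs getstripe --pool</code> prints the pool name as described in the Lustre manual (the helper name <code>report_pools</code> is made up for this example):

```shell
# Print "<pool><TAB><file>" for every file given on the command line.
# report_pools is a hypothetical helper, not part of Lustre.
report_pools() {
    local f
    for f in "$@"; do
        printf '%s\t%s\n' "$(lfs getstripe --pool "$f")" "$f"
    done
}

# Example call (not executed here):
#   report_pools /pfs/10/project/<project_id>/*
```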

==== Moving data between WORK and PROJECT ====
!! IMPORTANT !! Calling <code>mv</code> on files will not physically move them between the pools. Only the file metadata stored on the MDS, i.e. the path to the file in the directory tree, is modified. The stripes of the file on the OSS, however, remain exactly where they were. The only result is the confusing situation that you now have metadata entries under <code>/pfs/10/project</code> that still point to WORK OSTs.

Proper ways of moving data between the pools:
* Copy the data, which creates new files, then delete the old files.
* Move the metadata with <code>mv</code>, then use <code>lfs migrate</code> or <code>lfs_migrate</code> to actually migrate the stripes.
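
The second variant could look roughly like the sketch below. The paths are placeholders, the helper names <code>run</code> and <code>move_to_project</code> are made up for this example, and the <code>--pool</code> option of <code>lfs migrate</code> is taken from the Lustre manual; please check <code>lfs migrate --help</code> on the cluster before relying on it. By default the sketch only prints the commands (dry run).

```shell
# Print a command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Relocate one file from WORK to PROJECT, including its data.
move_to_project() {
    local src="$1" dst="$2"
    run mv -- "$src" "$dst"                 # changes the metadata only
    run lfs migrate --pool project "$dst"   # rewrites the stripes onto the slow pool
}

# Example call (not executed here):
#   DRY_RUN=0 move_to_project /pfs/10/work/<workspace>/data.bin /pfs/10/project/<project_id>/data.bin
```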

More reading:
* [https://doc.lustre.org/lustre_manual.xhtml The Lustre 2.X Manual] ([http://doc.lustre.org/lustre_manual.pdf PDF])
* [https://wiki.lustre.org/Main_Page The Lustre Wiki]

BinAC2/Hardware and Architecture

2025-12-16T17:01:16Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.6
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gb Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand: only 84 standard compute nodes are connected via HDR-100 InfiniBand (100 Gb/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC2 has two (almost) separate networks, a 100GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
For the network cards in the nodes, however, there are VPI cards that can be configured to work in either mode (https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications).
OpenMPI can use a number of layers for transferring data and messages between processes. At startup, it tests all means of communication that were configured during compilation and then tries to figure out the fastest path between all processes.
If OpenMPI encounters such a VPI card, it first tries to establish a Remote Direct Memory Access (RDMA) communication channel using the OpenFabrics (OFI) layer.
On nodes with 100Gb Ethernet, this fails as there is no RDMA protocol configured. OpenMPI falls back to TCP transport, but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the Infiniband partition (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on Infiniband nodes as this will be the best choice between nodes!''
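
If the same job script is used on both kinds of nodes, the two settings can be wrapped in a small helper. This is only a sketch: the function name <code>configure_openmpi</code> and its arguments <code>ib</code> / <code>ethernet</code> are conventions of this example, and you have to pass the value that matches how the job was submitted (e.g. <code>ib</code> when using <code>#SBATCH --constraint=ib</code>).

```shell
# Select OpenMPI transport settings for the current job.
# "ib" and "ethernet" are labels used by this example only.
configure_openmpi() {
    case "$1" in
        ib)
            # Keep the OFI layer enabled on InfiniBand nodes.
            unset OMPI_MCA_btl OMPI_MCA_mtl
            ;;
        ethernet)
            # Disable OFI/openib to silence the warning on 100GbE nodes.
            export OMPI_MCA_btl="^ofi,openib"
            export OMPI_MCA_mtl="^ofi"
            ;;
        *)
            echo "usage: configure_openmpi ib|ethernet" >&2
            return 2
            ;;
    esac
}

configure_openmpi ethernet
```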

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work is a parallel file system (PFS) which offers fast and parallel file access and a bigger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible for all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible for the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid-state disk (SSD), which is accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, executable programs, etc.; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check which project you are a member of:

<syntaxhighlight>
$ id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
```
/pfs/10/project/bw16f003/
```
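
Both steps can be combined into one small helper that prints the project directory for every matching group. This is a sketch: the function name <code>list_project_dirs</code> is made up for this example, and it relies on the same assumption as the command above, namely that project groups start with <code>bw</code>.

```shell
# Read the output of `id` on stdin and print one project directory
# per group whose name starts with "bw".
list_project_dirs() {
    grep -o 'bw[^)]*' | sort -u | sed 's|^|/pfs/10/project/|; s|$|/|'
}

# Show the project directories of the current user.
id | list_project_dirs
```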

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user should create workspaces at <code>/pfs/10/work</code> through the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first or use the <code>--delete-data</code> option.
|-
|}
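
In a job script, the commands from the table can be combined so that the workspace path does not have to be hard-coded. The following is a sketch: <code>workspace_dir</code> is a made-up helper name, and it assumes that calling <code>ws_allocate</code> for an already existing workspace succeeds and simply reuses it.

```shell
# Allocate a workspace (or reuse it, see the assumption above) and print its path.
# workspace_dir is a hypothetical helper built on ws_allocate and ws_find.
workspace_dir() {
    local name="$1" days="${2:-30}"
    ws_allocate "$name" "$days" >/dev/null || return 1
    ws_find "$name"
}

# Typical use inside a job script (not executed here):
#   WORKDIR="$(workspace_dir mywork 30)" && cd "$WORKDIR"
```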

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

The SMP nodes and the GPU nodes in particular are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1 MiB), or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer lower latency and higher bandwidth than <code>WORK</code>.
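
A common way to follow this advice is to stage data into the scratch directory, compute there, and copy only the results back. The sketch below illustrates the pattern; the function name <code>stage_and_run</code> and both directory arguments are placeholders, and the <code>ls</code> call merely stands in for the real computation. Outside of a job, where <code>$TMPDIR</code> may be unset, the sketch falls back to <code>/tmp</code>.

```shell
# Stage in, compute on node-local scratch, stage out.
# $1: input directory (e.g. in a workspace), $2: directory for the results.
stage_and_run() {
    local input_dir="$1" result_dir="$2"
    local scratch="${TMPDIR:-/tmp}/stage.$$"
    mkdir -p "$scratch/in" "$scratch/out" "$result_dir" || return 1
    cp -r "$input_dir"/. "$scratch/in/" || return 1        # stage in
    # Placeholder for the real computation: list the staged input files.
    ls "$scratch/in" > "$scratch/out/listing.txt" || return 1
    cp -r "$scratch/out"/. "$result_dir/" || return 1      # stage out
    rm -rf "$scratch"                                      # clean up scratch
}

# Example call (not executed here):
#   stage_and_run "$(ws_find mywork)/input" "$(ws_find mywork)/results"
```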

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

BinAC2/Hardware and Architecture

2025-12-16T16:57:45Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.6
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gb Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand: only 84 standard compute nodes are connected via HDR-100 InfiniBand (100 Gb/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC2 has two (almost) separate networks, a 100GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
For the network cards in the nodes, however, there are VPI cards that can be configured to work in either mode (https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications).
OpenMPI can use a number of layers for transferring data and messages between processes. At startup, it tests all means of communication that were configured during compilation and then tries to figure out the fastest path between all processes.
If OpenMPI encounters such a VPI card, it first tries to establish a Remote Direct Memory Access (RDMA) communication channel using the OpenFabrics (OFI) layer.
On nodes with 100Gb Ethernet, this fails as there is no RDMA protocol configured. OpenMPI falls back to TCP transport, but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the Infiniband partition (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on Infiniband nodes as this will be the best choice between nodes!''

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work is a parallel file system (PFS) which offers fast and parallel file access and a bigger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible for all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible for the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid-state disk (SSD), which is accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, executable programs, etc.; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check which project you are a member of:

<syntaxhighlight>
$ id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
```
/pfs/10/project/bw16f003/
```

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user should create workspaces at <code>/pfs/10/work</code> through the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

The SMP nodes and the GPU nodes in particular are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1 MiB), or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer lower latency and higher bandwidth than <code>WORK</code>.

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

BinAC2/Hardware and Architecture

2025-12-16T16:54:46Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.6
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gb Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand: only 84 standard compute nodes are connected via HDR-100 InfiniBand (100 Gb/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC2 has two (almost) separate networks, a 100GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
For the network cards in the nodes, however, there are VPI cards that can be configured to work in either mode (https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications).
OpenMPI can use a number of layers for transferring data and messages between processes. At startup, it tests all means of communication that were configured during compilation and then tries to figure out the fastest path between all processes.
If OpenMPI encounters such a VPI card, it first tries to establish a Remote Direct Memory Access (RDMA) communication channel using the OpenFabrics (OFI) layer.
On nodes with 100Gb Ethernet, this fails as there is no RDMA protocol configured. OpenMPI falls back to TCP transport, but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the InfiniBand nodes (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on InfiniBand nodes, as it is the best choice for communication between nodes!''
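
The workaround can be wrapped in a small helper so that one job script works on both node types. This is a minimal sketch, not an official recipe: the function name <code>configure_ompi_transport</code> and the variable <code>JOB_NETWORK</code> are hypothetical, and you have to set <code>JOB_NETWORK</code> yourself to match the constraint you submit with.

```shell
#!/bin/bash
# Hypothetical helper: choose the OpenMPI transport settings for the node type
# the job was submitted for ("ib" for InfiniBand nodes, anything else otherwise).
configure_ompi_transport() {
    local network="${1:-eth}"
    if [ "$network" = "ib" ]; then
        # InfiniBand nodes: leave the OFI layer enabled
        unset OMPI_MCA_btl OMPI_MCA_mtl
    else
        # Ethernet nodes and single-node jobs: disable the OFI transport layer
        export OMPI_MCA_btl="^ofi,openib"
        export OMPI_MCA_mtl="^ofi"
    fi
}

# JOB_NETWORK is an assumed variable; it defaults to the Ethernet settings.
configure_ompi_transport "${JOB_NETWORK:-eth}"
```

Call the function before starting your MPI program, so that the variables are set in the environment of every process.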

= File Systems =

The bwForCluster BinAC 2 has two separate storage systems: one for the user's home directory $HOME and one serving as project/work space.
The home directory is limited in space and parallel access but offers snapshots and backups of your files.

The project/work storage is a parallel file system (PFS) that offers fast, parallel file access and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directories. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are accessible only to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid-state disk (SSD), available via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. it is protected against disk failures but not against catastrophic incidents such as cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because backup space is limited, we enforce a quota of 40 GB on home directories.

'''NOTE:'''
Compute jobs must not write temporary data to $HOME.
Instead, they should use the node-local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check which project you are a member of:

<syntaxhighlight>
$ id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
```
/pfs/10/project/bw16f003/
```

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1, we enforce work space lifetimes because the capacity is limited.
Please store only data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you no longer need it on the fast storage.

Each user should create workspaces at <code>/pfs/10/work</code> using the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space, you need to supply a name for the work space and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

The SMP nodes and the GPU nodes in particular are equipped with large, fast local disks that should be used for temporary data, scratch data, or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M), or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer lower latency and higher bandwidth than <code>WORK</code>.
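
A common way to follow this advice is to stage data in and out of the local scratch space. The following is a minimal sketch under stated assumptions: <code>stage_and_run</code> is a hypothetical helper, <code>sort</code> merely stands in for your real workload, and <code>/tmp</code> is only used as a fallback when <code>$TMPDIR</code> is not set (i.e. outside a batch job).

```shell
#!/bin/bash
# Hypothetical helper: copy an input file to node-local scratch, process it
# there, and copy the result back before the job ends.
stage_and_run() {
    local input="$1" output="$2"
    local scratch
    # Inside a batch job, $TMPDIR points to the job's scratch directory.
    scratch="$(mktemp -d "${TMPDIR:-/tmp}/stage.XXXXXX")" || return 1

    cp "$input" "$scratch/input.dat"                    # stage in
    sort "$scratch/input.dat" > "$scratch/output.dat"   # placeholder workload on local disk
    cp "$scratch/output.dat" "$output"                  # stage out
    rm -rf "$scratch"                                   # clean up the staging directory
}
```

Copying results back is essential, because files in <code>$TMPDIR</code> are removed at the end of the batch job.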

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

BinAC2/Hardware and Architecture

2025-12-12T10:17:19Z

S Behnle: /* Network */

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.5
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100GbE.
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand: 84 standard compute nodes are connected via HDR100 InfiniBand (100 Gb/s). To get your jobs onto the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

'''Question:'''
OpenMPI throws the following warning:
<pre>
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node1-083
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
</pre>
What should I do?

'''Answer:'''
BinAC 2 has two (almost) separate networks, a 100GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
As for the network cards in the nodes, however, there are VPI network cards that can be configured to work in either mode (https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications).
OpenMPI can use a number of layers for transferring data and messages between processes. At startup, it tests all means of communication that were configured during compilation and then tries to determine the fastest path between all processes.
If OpenMPI encounters such a VPI card, it first tries to establish a Remote Direct Memory Access (RDMA) communication channel using the OpenFabrics (OFI) layer.
On nodes with 100GbE, this fails because no RDMA protocol is configured. OpenMPI falls back to TCP transport, but not without complaints.

'''Workaround:'''
For single-node jobs or on regular compute nodes, A30 and A100 GPU nodes: Add the lines
<pre>
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
</pre>
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the InfiniBand nodes (<code>#SBATCH --constraint=ib</code>). ''Do not turn off the OFI layer on InfiniBand nodes, as it is the best choice for communication between nodes!''

= File Systems =

The bwForCluster BinAC 2 has two separate storage systems: one for the user's home directory $HOME and one serving as project/work space.
The home directory is limited in space and parallel access but offers snapshots and backups of your files.

The project/work storage is a parallel file system (PFS) that offers fast, parallel file access and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directories. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are accessible only to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid-state disk (SSD), available via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. it is protected against disk failures but not against catastrophic incidents such as cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because backup space is limited, we enforce a quota of 40 GB on home directories.

'''NOTE:'''
Compute jobs must not write temporary data to $HOME.
Instead, they should use the node-local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

Every project gets a dedicated directory located at:

<syntaxhighlight>
/pfs/10/project/<project_id>/
</syntaxhighlight>

You can check which project you are a member of:

<syntaxhighlight>
$ id $USER | grep -o 'bw[^)]*'
bw16f003
</syntaxhighlight>

In this case, your project directory would be:
```
/pfs/10/project/bw16f003/
```
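
Because <code>id</code> lists the primary group and the supplementary groups, the same project name can appear more than once in its output. The following sketch removes such duplicates and prints the matching project directories. The function name <code>project_dirs</code> is hypothetical and not part of BinAC 2.

```shell
#!/bin/bash
# Hypothetical helper: read "id"-style output on stdin and print one project
# directory per distinct group name that starts with "bw".
project_dirs() {
    grep -o 'bw[^)]*' | sort -u | while read -r project_id; do
        echo "/pfs/10/project/${project_id}/"
    done
}

# Usage on the cluster:
#   id "$USER" | project_dirs
```

If you are a member of several compute projects, the function prints one line per project.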

Check our [[BinAC2/Project_Data_Organization | data organization guide ]] for methods to organize data inside the project directory.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1, we enforce work space lifetimes because the capacity is limited.
Please store only data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you no longer need it on the fast storage.

Each user should create workspaces at <code>/pfs/10/work</code> using the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space, you need to supply a name for the work space and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}
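
In a job script, the commands from the table can be combined so that a work space is allocated only if it does not exist yet. This is a sketch under one assumption that you should verify on the cluster: that <code>ws_find</code> prints nothing when the work space does not exist. The function name <code>get_workspace</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the path of a work space, allocating it first
# if ws_find does not report a path (assumed behaviour for unknown names).
get_workspace() {
    local name="$1" days="${2:-30}"
    local path
    path="$(ws_find "$name" 2>/dev/null)" || true
    if [ -z "$path" ]; then
        ws_allocate "$name" "$days" >/dev/null || return 1
        path="$(ws_find "$name")"
    fi
    echo "$path"
}

# Usage inside a job script:
#   WORKDIR="$(get_workspace mywork 30)" && cd "$WORKDIR"
```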

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

The SMP nodes and the GPU nodes in particular are equipped with large, fast local disks that should be used for temporary data, scratch data, or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M), or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer lower latency and higher bandwidth than <code>WORK</code>.

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.
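
Since a single ticket is valid on all nodes, it is enough to check for a ticket on the login node before submitting jobs. The sketch below uses <code>klist -s</code>, which prints nothing and exits with a non-zero status if the credentials cache cannot be read or the ticket has expired. The function name <code>ensure_kerberos_ticket</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: request a new Kerberos ticket only if no valid one exists.
ensure_kerberos_ticket() {
    if klist -s; then
        echo "valid ticket found"
    else
        # kinit asks for the password, so run this in an interactive session
        kinit "$USER"
    fi
}
```

Run the function in an interactive session, because <code>kinit</code> prompts for your password.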

BinAC2/SLURM Partitions

2025-09-18T07:42:40Z

S Behnle: Added BinAC2 long partition

== Partitions ==

The bwForCluster BinAC 2 provides four partitions for job submission.
Within a partition, job allocations are routed automatically to the most suitable compute node(s) for the requested resources (e.g. number of nodes and cores, memory, number and type of GPUs).

The <code>gpu</code> partition will only run 8 jobs per user at the same time. A user can only use 4 A100, 8 A30, and 4 H200 GPUs at the same time.

The <code>interactive</code> partition will only run 1 job per user at the same time.
This partition is dedicated to testing and to using tools via a graphical user interface.
The four nodes <code>node1-00[1-4]</code> are exclusively reserved for this partition.
You can run a VNC server in this partition. Please use <code>#SBATCH --gres=display:1</code> in your job script or <code>--gres=display:1</code> on the command line if you need a display. This ensures that your job starts on a node with "free" displays, because each of the four nodes provides only 20 virtual displays.

The <code>long</code> partition is meant for long-running, parallel jobs. Please pack your jobs as densely as possible. If possible, checkpoint regularly in case the job fails after several days. Due to the small number of GPU nodes in BinAC 2, we cannot offer a <code>long</code> partition with GPU nodes.


{| class="wikitable"
|-
! style="width:10%"| Partition
! style="width:10%"| Node Access Policy
! style="width:10%"| Node Types
! style="width:20%"| Default
! style="width:20%"| Limits
|-
| compute (default)
| shared
| cpu
| ntasks=1, time=00:10:00, mem-per-cpu=1gb
| nodes=2, time=14-00:00:00
|-
| gpu
| shared
| gpu
| ntasks=1, time=00:10:00, mem-per-cpu=1gb
| time=14-00:00:00<br>MaxJobsPerUser: 8<br>MaxTRESPerUser:<pre>gres/gpu:a100=4,
gres/gpu:a30=8,
gres/gpu:h200=4</pre>
|-
| interactive
| shared
| cpu
| ntasks=1, time=00:10:00, mem-per-cpu=1gb
| time=10:00:00<br>MaxJobsPerUser: 1
|-
| long
| shared
| cpu (InfiniBand nodes only)
| time=1-00:00:00, feature=ib
| time=30-00:00:00<br>MaxNodes=10
|-
|}

=== Parallel Jobs ===

In order to submit parallel jobs to the InfiniBand part of the cluster, i.e., for fast inter-node communication, please select the appropriate nodes via the <code>--constraint=ib</code> option in your job script. For less demanding parallel jobs, you may try the <code>--constraint=eth</code> option, which utilizes 100Gb/s Ethernet instead of the low-latency 100Gb/s InfiniBand.
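
A job script for the InfiniBand nodes could look like the following sketch. The resource values are illustrative only, and <code>my_mpi_program</code> as well as the variable <code>MPI_PROGRAM</code> are hypothetical placeholders for your own executable.

```shell
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --constraint=ib
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --time=01:00:00

# MPI_PROGRAM is an assumed variable; replace the default with your own executable.
MPI_PROGRAM="${MPI_PROGRAM:-./my_mpi_program}"

if command -v srun >/dev/null 2>&1 && [ -x "$MPI_PROGRAM" ]; then
    # Start one MPI rank per requested task
    srun "$MPI_PROGRAM"
else
    echo "srun or $MPI_PROGRAM not available; nothing to run" >&2
fi
```

To run the same script on the Ethernet nodes, replace <code>--constraint=ib</code> with <code>--constraint=eth</code>.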

=== GPU Jobs ===

BinAC 2 provides different GPU models for computations. Please select the appropriate GPU type and the number of GPUs with the <code>--gres=gpu:<type>:N</code> option in your job script.

{| class="wikitable"
|-
! style="width:20%"| GPU
! style="width:20%"| GPU Memory
! style="width:20%"| # GPUs per Node [N]
! style="width:20%"| Submit Option
|-
| Nvidia A30
| 24GB
| 2
| <code>--gres=gpu:a30:N</code>
|-
| Nvidia A100
| 80GB
| 4
| <code>--gres=gpu:a100:N</code>
|-
| Nvidia H200
| 141GB
| 4
| <code>--gres=gpu:h200:N</code>
|-
|}
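
The submit options in the table can be generated and checked with a small helper. This is a sketch only: the function name <code>gpu_gres_option</code> is hypothetical, and the per-node limits are copied from the table above.

```shell
#!/bin/bash
# Hypothetical helper: build the --gres option for a GPU type and count,
# rejecting counts above the number of GPUs per node.
gpu_gres_option() {
    local gpu_type="$1" count="$2" max
    case "$gpu_type" in
        a30)       max=2 ;;
        a100|h200) max=4 ;;
        *) echo "unknown GPU type: $gpu_type" >&2; return 1 ;;
    esac
    if [ "$count" -lt 1 ] || [ "$count" -gt "$max" ]; then
        echo "$gpu_type nodes provide 1 to $max GPUs" >&2
        return 1
    fi
    echo "--gres=gpu:${gpu_type}:${count}"
}

# Usage:
#   sbatch --partition=gpu $(gpu_gres_option a100 2) job.sh
```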

BinAC2/Hardware and Architecture

2025-09-16T07:53:24Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.5
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 168 / 12
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80 / 2.95
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100GbE.
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand: 84 standard compute nodes are connected via HDR100 InfiniBand (100 Gb/s). To get your jobs onto the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

= File Systems =

The bwForCluster BinAC 2 has two separate storage systems: one for the user's home directory $HOME and one serving as project/work space.
The home directory is limited in space and parallel access but offers snapshots and backups of your files.

The project/work storage is a parallel file system (PFS) that offers fast, parallel file access and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directories. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are accessible only to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid-state disk (SSD), available via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. it is protected against disk failures but not against catastrophic incidents such as cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because backup space is limited, we enforce a quota of 40 GB on home directories.

'''NOTE:'''
Compute jobs must not write temporary data to $HOME.
Instead, they should use the node-local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible to all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.
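
One possible layout combines a shared folder with personal directories. The sketch below is only an illustration: the function name <code>init_project_layout</code>, the directory name <code>shared</code>, and the chosen permissions are assumptions, not a site convention.

```shell
#!/bin/bash
# Hypothetical layout helper: create a shared directory and a personal
# directory inside a project directory.
init_project_layout() {
    local project_dir="$1" user="${2:-$USER}"
    mkdir -p "$project_dir/shared" "$project_dir/$user"
    chmod 770 "$project_dir/shared"   # group members may read and write
    chmod 750 "$project_dir/$user"    # group members may read, only the owner writes
}

# Usage on the cluster (bw16f003 is the example project from above):
#   init_project_layout /pfs/10/project/bw16f003
```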

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Workspaces ===

Data on the fast storage pool at <code>/pfs/10/work</code> is stored on SSDs.
The primary focus is speed, not capacity.

In contrast to BinAC 1, we enforce work space lifetimes because the capacity is limited.
Please store only data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you no longer need it on the fast storage.

Each user should create workspaces at <code>/pfs/10/work</code> using the workspace tools.

You can find more info on workspace tools on our general page:

:: → '''[[Workspace]]s'''

To create a work space, you need to supply a name for your work space area and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}
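
A common pattern in a job script is to look up an existing work space and allocate it only if it is missing. The following is a minimal sketch that uses only the commands from the table above. The function name <code>ensure_workspace</code> is hypothetical, and the sketch assumes that <code>ws_find</code> prints nothing when the work space does not exist.

<source lang="bash">
#!/bin/bash
# Print the absolute path of a work space, allocating it first if needed.
# Assumption: ws_find prints nothing if the work space does not exist.
ensure_workspace() {
    local name="$1" days="$2" path
    path="$(ws_find "$name" 2>/dev/null || true)"
    if [ -z "$path" ]; then
        # Allocate the work space for the requested number of days.
        ws_allocate "$name" "$days" >/dev/null || return 1
        path="$(ws_find "$name")"
    fi
    printf '%s\n' "$path"
}

# Usage inside a job script (hypothetical work space name):
#   WORKDIR="$(ensure_workspace mywork 30)" && cd "$WORKDIR"
</source>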

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.
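
The staging pattern described above (copy input to local scratch, compute there, copy results back) can be sketched as follows. The function name <code>run_in_scratch</code> and all paths in the usage comment are hypothetical. Outside of a job, where <code>$TMPDIR</code> may be unset, the sketch falls back to <code>/tmp</code>.

<source lang="bash">
#!/bin/bash
# Copy an input directory to node-local scratch, run a command there,
# and copy everything back to a destination directory afterwards.
run_in_scratch() {
    local src="$1" dest="$2"
    shift 2
    local scratch status
    # Inside a job, $TMPDIR points to the job's scratch directory.
    scratch="$(mktemp -d "${TMPDIR:-/tmp}/stage.XXXXXX")" || return 1
    cp -r "$src"/. "$scratch"/ || { rm -rf "$scratch"; return 1; }
    # Run the command with the scratch directory as working directory.
    ( cd "$scratch" && "$@" ) || { rm -rf "$scratch"; return 1; }
    mkdir -p "$dest" && cp -r "$scratch"/. "$dest"/
    status=$?
    rm -rf "$scratch"
    return "$status"
}

# Usage (hypothetical paths and program):
#   run_in_scratch /pfs/10/work/<workspace>/input /pfs/10/work/<workspace>/output ./my_program
</source>

Note that this simple version also copies the input files back. For large inputs you would copy only the result files.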

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.
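
Before a job accesses SDS@hd, it can be useful to check whether a valid ticket already exists. The following is a minimal sketch with a hypothetical function name. It assumes the MIT Kerberos client tools, where <code>klist -s</code> produces no output and returns a non-zero exit status if the ticket cache is missing or expired.

<source lang="bash">
#!/bin/bash
# Request a new Kerberos ticket only if no valid one is cached.
# Assumption: klist -s is silent and fails if the cache is missing or expired.
ensure_kerberos_ticket() {
    if klist -s; then
        echo "Valid Kerberos ticket found."
    else
        kinit "$USER"    # prompts for your password
    fi
}
</source>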

BinAC2/Slurm

2025-09-08T07:42:25Z

S Behnle: updated GPU section

= General information about Slurm =

Any calculation on the compute nodes of bwForCluster BinAC 2 must be defined as a single command or a sequence of commands, together with the required run time, number of CPU cores, and main memory. This whole package, the '''batch job''', is submitted to a resource and workload manager. BinAC 2 uses the workload manager Slurm, so all jobs are submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

= External Slurm documentation =

You can find the official Slurm documentation and some other material here:

* Slurm documentation: https://slurm.schedmd.com/documentation.html
* Slurm cheat sheet: https://slurm.schedmd.com/pdfs/summary.pdf
* Slurm tutorials: https://slurm.schedmd.com/tutorials.html

= SLURM terminology =

SLURM knows and mirrors the division of the cluster into '''nodes''' with several '''cores'''. When queuing '''jobs''', there are several ways of requesting resources and it is important to know which term means what in SLURM. Here are some basic SLURM terms:

;Job
: A job is a self-contained computation that may encompass multiple tasks and is given specific resources like individual CPUs/GPUs, a specific amount of RAM or entire nodes. These resources are said to have been allocated for the job.

;Task
: A task is a single run of a single process. By default, one task is run per node and one CPU is assigned per task.

;Partition
: A partition (usually called queue outside SLURM) is a waiting line in which jobs are put by users.

;Socket
: Receptacle on the motherboard for one physically packaged processor (each of which can contain one or more cores).

;Core
: A complete private set of registers, execution units, and retirement queues needed to execute programs.

;Thread
: One or more hardware contexts within a single core. Each thread has attributes of one core, managed & scheduled as a single logical processor by the OS.

;CPU
: A '''CPU''' in Slurm means a '''single core'''. This is different from the more common terminology, where a CPU (a microprocessor chip) consists of multiple cores. Slurm uses the term '''sockets''' when talking about CPU chips. Depending upon system configuration, a CPU can be either a '''core''' or a '''thread'''. On BinAC 2, '''hyperthreading is activated on every machine'''. This means that the operating system and Slurm see each physical core as two logical cores.

= Slurm Commands =

{| width=750px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue
|-
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job
|-
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs
|-
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information
|-
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job
|-
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job
|-
|}

== Interactive Jobs ==

You can run interactive jobs for testing and developing your job scripts.
Several nodes are reserved for interactive work, so your jobs should start right away.
You can only submit one job to this partition at a time. A job can run for up to 10 hours (about one workday).

This example command gives you 16 cores and 128 GB of memory for four hours on one of the reserved nodes:

<pre>
salloc --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb
</pre>

You can also use srun to request the same resources:

<pre>
srun --partition=interactive --time=4:00:00 --cpus-per-task=16 --mem=128gb --pty bash
</pre>

== Job Submission : sbatch ==

Batch jobs are submitted with the command '''sbatch'''. The main purpose of '''sbatch''' is to specify the resources needed to run the job; '''sbatch''' then queues the batch job. When the job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
=== sbatch Command Parameters ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. The following table shows the syntax and provides examples for each option.

{| class="wikitable"
! colspan="5" | sbatch Options
|-
! Command line
! Job Script
! Purpose
! Example
! Default value
|- style="vertical-align:top;"
| <code>-t ''time''</code> or <code>--time=''time''</code>
| #SBATCH --time=''time''
| Wall clock time limit. 
| <code>-t 2:30:00</code> Limits run time to 2 h 30 min. <code>-t 2-12</code> Limits run time to 2 days and 12 hours.
| Depends on Slurm partition.
|-
|- style="vertical-align:top;"
| -N ''count'' or --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
| <code>-N 1</code> Run job on one node. <code>-N 2</code> Run job on two nodes (requires MPI).
|
|- style="vertical-align:top;"
| -n ''count'' or --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
| <code>-n 2</code> launch two tasks in the job.
| One task per node
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node. (Replaces the option <code>ppn</code> of MOAB.)
| <code>--ntasks-per-node=2</code> Run 2 tasks per node
| 1 task per node
|-
|- style="vertical-align:top;"
| -c ''count'' or --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
| <code>-c 2</code> Request two CPUs per (MPI-)task.
| 1 CPU per (MPI-)task
|-
|- style="vertical-align:top;"
| <code>--mem=<size>[units]</code>
| #SBATCH --mem=''value_in_MB''
| Memory per node; the default unit is megabytes. <code>[units]</code> can be one of <code>[K<nowiki>|</nowiki>M<nowiki>|</nowiki>G<nowiki>|</nowiki>T]</code>.
| <code>--mem=10g</code> Request 10 GB RAM per node. <code>--mem=0</code> Request all memory on the node.
| Depends on Slurm configuration. It is better to specify <code>--mem</code> in every case.
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J ''name'' or --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added.
|-
|- style="vertical-align:top;"
| -A ''group-name'' or --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of <code>scontrol show job</code> shows the project group the job is accounted on after <code>Account=</code>.
|-
|- style="vertical-align:top;"
| -p ''queue-name'' or --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C ''LSDF'' or --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF Filesystems.
|-
|}
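
The time formats accepted by <code>--time</code> can be checked with a small helper. The following sketch converts a Slurm time string to seconds. The function name is hypothetical. The accepted forms are those listed in <code>man sbatch</code>: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds.

<source lang="bash">
#!/bin/bash
# Convert a Slurm time limit string to seconds.
# Accepted forms: M, M:S, H:M:S, D-H, D-H:M, D-H:M:S
slurm_time_to_seconds() {
    local spec="$1" days=0 h=0 m=0 s=0
    if [[ "$spec" == *-* ]]; then
        # Everything before the dash is the number of days.
        days="${spec%%-*}"
        IFS=: read -r h m s <<< "${spec#*-}"
    else
        local a b c
        IFS=: read -r a b c <<< "$spec"
        if [ -n "$c" ]; then h="$a"; m="$b"; s="$c"     # H:M:S
        elif [ -n "$b" ]; then m="$a"; s="$b"           # M:S
        else m="$a"                                     # M
        fi
    fi
    # 10# forces base 10 so that values like "08" are not read as octal.
    echo $(( 10#${days:-0} * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}

slurm_time_to_seconds 2:30:00   # 2 h 30 min
slurm_time_to_seconds 2-12      # 2 days and 12 hours
slurm_time_to_seconds 10:00     # 10 minutes
</source>

The three calls print 9000, 216000 and 600.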
 

==== sbatch --partition ''queues'' ====
Partitions (queues) define the maximum resources a job can request, such as walltime, number of nodes, and processes per node. Details can be found here:
* [[BinAC2/SLURM_Partitions|BinAC 2 partitions]]
 

=== sbatch Examples ===

If you are coming from Moab/Torque on BinAC 1 or you are new to HPC/Slurm, the <code>sbatch</code> options may confuse you. The following examples show how to run typical workloads on BinAC 2.

You can find every file mentioned on this Wiki page on BinAC 2 at: <code>/pfs/10/project/examples</code>

==== Serial Programs ====
When you use serial programs that use only one process, you can omit most of the <code>sbatch</code> parameters, as the default values are sufficient.

The following example submits a serial job that runs the script <code>serial_job.sh</code> and requires 5000 MB of main memory and 10 minutes of wall clock time. Slurm will allocate one '''physical''' core to your job.

a) execute:
<pre>
$ sbatch -p compute -t 10:00 --mem=5000m serial_job.sh
</pre>
or
b) add after the initial line of your script '''serial_job.sh''' the lines:
<source lang="bash">
#SBATCH --time=10:00
#SBATCH --mem=5000m
#SBATCH --job-name=simple-serial-job
</source>
and submit the modified script with the command line option ''--partition=compute'':
<pre>
$ sbatch --partition=compute serial_job.sh
</pre>
Note that sbatch command line options override script options.
 
 

==== Multithreaded Programs ====

Multithreaded programs run several threads within one process; the threads share resources such as memory.
You may use a program that includes a built-in option for multithreading (e.g., options like <code>--threads</code>).
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable <code>OMP_NUM_THREADS</code>. By default, this variable is set to 1 (<code>OMP_NUM_THREADS=1</code>).

'''Important:''' Hyperthreading is activated on bwForCluster BinAC 2. Hyperthreading can be beneficial for some applications and codes, but it can also degrade performance in other cases. We therefore recommend running a small test job with and without hyperthreading to determine the best choice.

'''a) Program with built-in multithreading option'''

This example uses <code>samtools</code>, a common bioinformatics tool, to demonstrate built-in multithreading.

The module <code>bio/samtools/1.21</code> provides an example jobscript that requests 4 CPUs and runs <code>samtools sort</code> with 4 threads.

<pre>
#!/bin/bash

#SBATCH --time=19:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5000m
#SBATCH --partition compute
[...]
samtools sort -@ 4 sample.bam -o sample.sorted.bam
</pre>

You can use the example jobscript with this command

<pre>
sbatch /opt/bwhpc/common/bio/samtools/1.21/bwhpc-examples/binac2-samtools-1.21-bwhpc-examples.slurm
</pre>

'''b) OpenMP'''

We will run an example OpenMP Hello-World program. The jobscript looks like this:

<pre>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=1:00
#SBATCH --mem=5000m
#SBATCH -J OpenMP-Hello-World

export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))

echo "Executable running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"

# Run parallel Hello World
/pfs/10/project/examples/openmp_hello_world
</pre>

Submit the job to the <code>compute</code> partition and get the output (in the stdout-file)

<pre>
sbatch --partition=compute /pfs/10/project/examples/openmp_hello_world.sh

Executable running on 4 cores with 4 threads
Hello from process: 0
Hello from process: 2
Hello from process: 1
Hello from process: 3
</pre>

==== OpenMPI ====

If you want to run MPI-jobs on batch nodes, generate a wrapper script <code>mpi_hello_world.sh</code> for '''OpenMPI''' containing the following lines:

<source lang="bash">
#!/bin/bash

#SBATCH --partition compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2000
#SBATCH --time=05:00

# Load the MPI implementation of your choice
module load mpi/openmpi/4.1-gnu-14.2

# Run your MPI program
mpirun --bind-to core --map-by core --report-bindings mpi_hello_world
</source>

'''Attention:''' Do '''NOT''' add mpirun options <code>-n <number_of_processes></code> or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.

'''ALWAYS''' use the MPI options <code>--bind-to core</code> and <code>--map-by core|socket|node</code>.
Type <code>man mpirun</code> for an explanation of the possible values of the mpirun option <code>--map-by</code>.

The above jobscript runs four OpenMPI tasks, distributed between two nodes. Because of hyperthreading you have to set <code>--cpus-per-task=2</code>. This means each MPI-task will get one physical core. If you omit <code>--cpus-per-task=2</code> MPI will fail.

'''Attention:''' Not all compute nodes are connected via Infiniband. Tell Slurm you want Infiniband via <code>--constraint=ib</code> when submitting or add <code>#SBATCH --constraint=ib</code> to your jobscript.

<pre>
$ sbatch --constraint=ib /pfs/10/project/examples/mpi_hello_world.sh
</pre>

This will run a simple Hello World program:

<pre>
[...]
Hello world from processor node2-031, rank 3 out of 4 processors
Hello world from processor node2-031, rank 2 out of 4 processors
Hello world from processor node2-030, rank 1 out of 4 processors
Hello world from processor node2-030, rank 0 out of 4 processors

</pre>
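
The resource request of the jobscript above can be added up as follows. This is a minimal sketch with a hypothetical helper function. It assumes that <code>--mem-per-cpu</code> is counted per logical CPU and that two logical CPUs share one physical core because of hyperthreading.

<source lang="bash">
#!/bin/bash
# Summarise what an MPI job request adds up to.
# Arguments: nodes, tasks per node, CPUs per task, memory per CPU in MB.
mpi_job_footprint() {
    local nodes="$1" tasks_per_node="$2" cpus_per_task="$3" mem_per_cpu_mb="$4"
    local tasks=$(( nodes * tasks_per_node ))
    local logical_cpus=$(( tasks * cpus_per_task ))
    # With hyperthreading, two logical CPUs share one physical core.
    local physical_cores=$(( logical_cpus / 2 ))
    local mem_mb=$(( logical_cpus * mem_per_cpu_mb ))
    echo "tasks=${tasks} logical_cpus=${logical_cpus} physical_cores=${physical_cores} mem_mb=${mem_mb}"
}

# Values from the jobscript above: 2 nodes, 2 tasks per node,
# 2 CPUs per task, 2000 MB per CPU.
mpi_job_footprint 2 2 2 2000
</source>

For the values of the jobscript this prints <code>tasks=4 logical_cpus=8 physical_cores=4 mem_mb=16000</code>, which matches the four ranks in the output above.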

==== Multithreaded + MPI parallel Programs ====

Multithreaded + MPI parallel programs run faster than serial programs on systems with multiple CPUs and multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spread over different nodes. '''Because hyperthreading is switched on on BinAC 2, the option --cpus-per-task (-c) must be set to 2*n if you want to use n threads.'''
 
 
===== OpenMPI with Multithreading =====
Multiple MPI tasks using '''OpenMPI''' must be launched by the MPI parallel program '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default, this variable is set to 1 (OMP_NUM_THREADS=1).
 
The following job script ''job_ompi_omp.sh'' for '''OpenMPI''' runs the MPI program ''ompi_omp_program'' with 4 tasks and 28 threads per task. It requires 3000 MByte of physical memory per thread (with 28 threads per MPI task this amounts to 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours:

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=56
#SBATCH --time=03:00:00
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Execute the script '''job_ompi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p compute ./job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
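* With the option ''--map-by socket:PE=<value>'', as used in the script above, consecutive MPI tasks are distributed round-robin across sockets and each MPI task is bound to <value> cores. With ''--map-by node:PE=<value>'', consecutive MPI tasks are distributed across nodes instead. <value> must be set to ${OMP_NUM_THREADS}.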
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program.
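
The numbers used in the job script above (<code>--cpus-per-task=56</code> and <code>--mem=83gb</code>) follow from the thread count and the memory per thread. The following sketch reproduces this arithmetic; the function name is hypothetical.

<source lang="bash">
#!/bin/bash
# Derive --cpus-per-task and a memory request in GB (rounded up)
# for one hybrid MPI task.
# Arguments: OpenMP threads per MPI task, memory per thread in MB.
hybrid_task_request() {
    local threads="$1" mb_per_thread="$2"
    # Hyperthreading: request two logical CPUs per thread.
    local cpus_per_task=$(( threads * 2 ))
    local mem_mb=$(( threads * mb_per_thread ))
    # Integer division that rounds up to whole GB.
    local mem_gb=$(( (mem_mb + 1023) / 1024 ))
    echo "--cpus-per-task=${cpus_per_task} --mem=${mem_gb}gb"
}

hybrid_task_request 28 3000
</source>

This prints <code>--cpus-per-task=56 --mem=83gb</code>. Keep in mind that <code>--mem</code> applies per node, so the value covers one such MPI task per node.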
 

==== GPU jobs ====

The nodes in the <code>gpu</code> queue have 2 or 8 NVIDIA A30/A100/H200 GPUs. Submitting a job to this queue is not enough to allocate one or more GPUs; you also have to request them with the <code>--gres=gpu</code> parameter and specify how many GPUs your job needs. For example, <code>--gres=gpu:a30:2</code> requests two NVIDIA A30 GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) Add the line <code>#SBATCH --gres=gpu:a30:2</code>, which specifies the GPU usage, after the initial line of your script <code>job.sh</code>:
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:a30:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:a30:2 job.sh
</pre>
 
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Sun Mar 29 15:20:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:3A:00.0 Off | 0 |
| N/A 29C P0 39W / 300W | 9MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 8MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 14228 G /usr/bin/X 8MiB |
| 1 14228 G /usr/bin/X 8MiB |
+-----------------------------------------------------------------------------+
</pre>

Upon successful GPU resource allocation, SLURM will set the environment variable <code>CUDA_VISIBLE_DEVICES</code> appropriately. Do not change this variable!
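
A job script can read <code>CUDA_VISIBLE_DEVICES</code> to check how many GPUs it received, without modifying the variable. The following is a minimal sketch with a hypothetical function name. It assumes that the variable contains a comma-separated list of device identifiers.

<source lang="bash">
#!/bin/bash
# Count the GPUs that Slurm made visible to this job.
# CUDA_VISIBLE_DEVICES is only read here, never modified.
count_visible_gpus() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    local -a ids=()
    if [ -n "$devices" ]; then
        # Split the comma-separated list into an array.
        IFS=, read -r -a ids <<< "$devices"
    fi
    echo "${#ids[@]}"
}

echo "GPUs allocated to this job: $(count_visible_gpus)"
</source>

For a job submitted with <code>--gres=gpu:a30:2</code> you would expect a count of 2.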

 
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
However, there may be warnings, e.g. when running
<pre>
$ module load compiler/gnu/10.3 mpi/openmpi devel/cuda
$ mpirun -np 2 ./mpi_cuda_app
--------------------------------------
WARNING: There are more than one active ports on host 'uc2n520', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
</pre>

Please run Open MPI's mpirun using the following command:
<pre>
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
</pre>
or disable the (older) communication layer Byte Transfer Layer (BTL) altogether:
<pre>
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
</pre>

(Please note that CUDA, as of v12.8, is officially supported only with GCC versions up to GCC 11.)
 
 

== Start time of job or resources : squeue --start ==
This command displays the estimated start time of a job. The estimate is based on historical usage, the earliest available reservable resources, and the priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
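
A minimal sketch of how the start time query can be wrapped in a job-monitoring script. The function name is hypothetical; <code>--start</code>, <code>--jobs</code> and <code>--user</code> are standard <code>squeue</code> options.

<source lang="bash">
#!/bin/bash
# Show the estimated start time of one job, or of all your pending jobs
# if no job ID is given.
show_start_time() {
    if [ -n "${1:-}" ]; then
        squeue --start --jobs="$1"
    else
        squeue --start --user="$USER"
    fi
}

# Usage:
#   show_start_time 18088744   # one job
#   show_start_time            # all your pending jobs
</source>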
 
 
=== Access ===
By default, this command can be run by '''any user'''.
 
 

== List of your submitted jobs : squeue ==
The command squeue displays information about your own active, pending and/or recently completed jobs. It is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by any user.
 
 

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -l, --long
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.
|}
 

=== Examples ===
''squeue'' example on BinAC 2 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18088744 single CPV.sbat ab1234 PD 0:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PD 0:00 2 (Priority)
18090089 multiple CPV.sbat ab1234 R 2:27 2 uc2n[127-128]
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
18088654 single CPV.sbat ab1234 COMPLETI 4:29 2:00:00 1 uc2n374
18088785 single CPV.sbat ab1234 PENDING 0:00 2:00:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PENDING 0:00 2:00:00 2 (Priority)
18088683 single CPV.sbat ab1234 RUNNING 0:14 2:00:00 1 uc2n413
</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
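
To get the number of jobs per state directly, the <code>squeue</code> output can be aggregated. The following is a minimal sketch with a hypothetical function name. It uses the standard <code>squeue</code> options <code>--noheader</code> and <code>--format</code> with the field <code>%T</code> (job state in extended form).

<source lang="bash">
#!/bin/bash
# Print one line per job state with the number of your jobs in that state.
count_jobs_by_state() {
    squeue --noheader --user="$USER" --format="%T" |
        awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' |
        sort
}

# Usage:
#   count_jobs_by_state
</source>

For the first <code>squeue</code> listing above, this would report two pending jobs and one running job.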
 

== Shows free resources : sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Access ===
By default, this command can be used by any user or administrator.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_multiple : 8 nodes idle
Partition multiple : 332 nodes idle
Partition dev_single : 4 nodes idle
Partition single : 76 nodes idle
Partition long : 80 nodes idle
Partition fat : 5 nodes idle
Partition dev_special : 342 nodes idle
Partition special : 342 nodes idle
Partition dev_multiple_e: 7 nodes idle
Partition multiple_e : 335 nodes idle
Partition gpu_4 : 12 nodes idle
Partition gpu_8 : 6 nodes idle
</pre>
* In the above example, jobs in all partitions can be run immediately.
 

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 
=== Access ===
* End users can use scontrol show job to view the status of their '''own jobs''' only.
 

=== Arguments ===
{| width=750px class="wikitable"
|-
! Option !! Default !! Description !! Example
|- style="vertical-align:top;"
|- style="width:12%;"
| -d
| (n/a)
| Detailed mode
| Example: Display the state with jobid 18089884 in detailed mode. <pre>scontrol -d show job 18089884</pre>
|}
 
 

=== Scontrol show job Example ===
Here is an example from BinAC 2.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18089884 multiple CPV.sbat bq0742 R 33:44 2 uc2n[165-166]

$
$ # now, see what's up with my running job with jobid 18089884
$
$ scontrol show job 18089884

JobId=18089884 JobName=CPV.sbatch
UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A
Priority=3 Nice=0 Account=kit QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54
AccrueTime=2020-03-16T14:14:54
StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51
Partition=multiple AllocNode:Sid=uc2n995:5064
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc2n[165-166]
BatchHost=uc2n165
NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=160,mem=96320M,node=2,billing=160
Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch
WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin
StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
StdIn=/dev/null
StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
Power=
MailUser=(null) MailType=NONE
</pre>
 
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
* Which state is the job in?
<pre>$ scontrol show job 18089884 | grep -i State
JobState=COMPLETED Reason=None Dependency=(null)
</pre>
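
If you need a single value rather than a whole line, you can split the <code>key=value</code> pairs first. The following is a minimal sketch with a hypothetical function name. It assumes that the requested value contains no spaces, which holds for fields such as <code>JobState</code> or <code>TimeLimit</code>.

<source lang="bash">
#!/bin/bash
# Print the value of one key from the output of "scontrol show job".
# Arguments: job ID, key name (e.g. JobState).
job_field() {
    local jobid="$1" key="$2"
    scontrol show job "$jobid" |
        tr -s ' ' '\n' |      # one key=value pair per line
        awk -F= -v k="$key" '$1 == k { print substr($0, length(k) + 2); exit }'
}

# Usage:
#   job_field 18089884 JobState
#   job_field 18089884 TimeLimit
</source>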
 

== Cancel Slurm Jobs ==
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
 
 
=== Canceling own jobs : scancel ===
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

 
{| width=750px class="wikitable"
! Flag !! Default !! Description !! Example
|- style="vertical-align:top;"
| -i, --interactive
| (n/a)
| Interactive mode.
| Cancel the job 987654 interactively. <pre> scancel -i 987654 </pre>
|-
| -t, --state
| (n/a)
| Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
| Cancel all jobs in state "PENDING". <pre> scancel -t "PENDING" </pre>
|}
 

= Resource Managers =
=== Batch Job (Slurm) Variables ===
The following environment variables of Slurm are added to your environment once your job has started
(only an excerpt of the most important ones).
{| width=750px class="wikitable"
! Environment !! Brief explanation
|-
| SLURM_JOB_CPUS_PER_NODE
| Number of CPUs per node dedicated to the job
|-
| SLURM_JOB_NODELIST
| List of nodes dedicated to the job
|-
| SLURM_JOB_NUM_NODES
| Number of nodes dedicated to the job
|-
| SLURM_MEM_PER_NODE
| Memory per node dedicated to the job
|-
| SLURM_NPROCS
| Total number of processes dedicated to the job
|-
| SLURM_CLUSTER_NAME
| Name of the cluster executing the job
|-
| SLURM_CPUS_PER_TASK
| Number of CPUs requested per task
|-
| SLURM_JOB_ACCOUNT
| Account name
|-
| SLURM_JOB_ID
| Job ID
|-
| SLURM_JOB_NAME
| Job Name
|-
| SLURM_JOB_PARTITION
| Partition/queue running the job
|-
| SLURM_JOB_UID
| User ID of the job's owner
|-
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which sbatch was invoked.
|-
| SLURM_JOB_USER
| User name of the job's owner
|-
| SLURM_RESTART_COUNT
| Number of times job has restarted
|-
| SLURM_PROCID
| Task ID (MPI rank)
|-
| SLURM_NTASKS
| The total number of tasks available for the job
|-
| SLURM_STEP_ID
| Job step ID
|-
| SLURM_STEP_NUM_TASKS
| Task count (number of MPI ranks)
|-
| SLURM_JOB_CONSTRAINT
| Job constraints
|}
See also:
* [https://slurm.schedmd.com/sbatch.html#lbAI Slurm input and output environment variables]
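
Printing some of these variables at the top of a job script makes the stdout file easier to interpret later. The following is a minimal sketch with a hypothetical function name. The fallback values after <code>:-</code> are only there so that the function also runs outside of a Slurm job.

<source lang="bash">
#!/bin/bash
# Print a short summary of the current job from Slurm's environment variables.
print_job_summary() {
    echo "Job ${SLURM_JOB_ID:-unknown} (${SLURM_JOB_NAME:-unnamed}) in partition ${SLURM_JOB_PARTITION:-unknown}"
    echo "Nodes: ${SLURM_JOB_NUM_NODES:-?} (${SLURM_JOB_NODELIST:-?}), tasks: ${SLURM_NTASKS:-?}, CPUs per task: ${SLURM_CPUS_PER_TASK:-?}"
    echo "Submitted from: ${SLURM_SUBMIT_DIR:-$PWD}"
}

print_job_summary
</source>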
 

=== Job Exit Codes ===
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
 
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
 
 
==== Displaying Exit Codes and Signals ====
SLURM displays a job's exit code in the output of the '''scontrol show job''' and the sview utility.
 
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon(:).
 
 
==== Submitting Termination Signal ====
Here is an example of how to 'save' the exit code of the executable in a typical jobscript.
<source lang="bash">
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
# $? must be read immediately after the command whose exit code is wanted
exit_code=$?
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
</source>
* Do not use ''''time'''' mpirun! The exit code will be the one submitted by the first (time) program.
* You do not need an '''exit $exit_code''' in the scripts.
 
 
----
[[#top|Back to top]]

BinAC2/Slurm

2025-08-11T10:41:27Z

S Behnle: Updated SLURM memory request

= General information about Slurm =

Any calculation on the compute nodes of bwForCluster BinAC 2 must be defined as a '''batch job''': a single command or a sequence of commands together with the required run time, number of CPU cores and main memory. The batch job is then submitted to a resource and workload manager. BinAC 2 uses the workload manager Slurm, so all jobs are submitted with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.

= External Slurm documentation =

You can find the official Slurm documentation and some other material here:

* Slurm documentation: https://slurm.schedmd.com/documentation.html
* Slurm cheat sheet: https://slurm.schedmd.com/pdfs/summary.pdf
* Slurm tutorials: https://slurm.schedmd.com/tutorials.html

= SLURM terminology =

SLURM knows and mirrors the division of the cluster into '''nodes''' with several '''cores'''. When queuing '''jobs''', there are several ways of requesting resources and it is important to know which term means what in SLURM. Here are some basic SLURM terms:

;Job
: A job is a self-contained computation that may encompass multiple tasks and is given specific resources like individual CPUs/GPUs, a specific amount of RAM or entire nodes. These resources are said to have been allocated for the job.

;Task
: A task is a single run of a single process. By default, one task is run per node and one CPU is assigned per task.

;Partition
: A partition (usually called queue outside SLURM) is a waiting line in which jobs are put by users.

;Socket
: Receptacle on the motherboard for one physically packaged processor (each of which can contain one or more cores).

;Core
: A complete private set of registers, execution units, and retirement queues needed to execute programs.

;Thread
: One or more hardware contexts within a single core. Each thread has attributes of one core and is managed and scheduled as a single logical processor by the OS.

;CPU
: A '''CPU''' in Slurm means a '''single core'''. This is different from the more common terminology, where a CPU (a microprocessor chip) consists of multiple cores. Slurm uses the term '''sockets''' when talking about CPU chips. Depending upon system configuration, a CPU can be either a '''core''' or a '''thread'''. On '''BinAC 2 Hyperthreading is activated on every machine'''. This means that the operating system and Slurm see each physical core as two logical cores.

= Slurm Commands =

{| width=750px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue
|-
| [https://slurm.schedmd.com/salloc.html salloc] || Requests resources for an interactive job
|-
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs
|-
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information
|-
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job
|-
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job
|-
|}

== Job Submission : sbatch ==

Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when the batch job actually starts depends on the availability of the requested resources and the fair sharing value.
 
=== sbatch Command Parameters ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script. The following table shows the syntax and provides examples for each option.

{| class="wikitable"
! colspan="5" | sbatch Options
|-
! Command line
! Job Script
! Purpose
! Example
! Default value
|- style="vertical-align:top;"
| <code>-t ''time''</code> or <code>--time=''time''</code>
| #SBATCH --time=''time''
| Wall clock time limit. 
| <code>-t 2:30:00</code> Limits run time to 2h 30 min. <code>-t 2-12</code> Limits run time to 2 days and 12 hours.
| Depends on Slurm partition.
|-
|- style="vertical-align:top;"
| -N ''count'' or --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
| <code>-N 1</code> Run job on one node. <code>-N 2</code> Run job on two nodes (have to use MPI!)
|
|- style="vertical-align:top;"
| -n ''count'' or --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
| <code>-n 2</code> launch two tasks in the job.
| One task per node
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node. (Replaces the option <code>ppn</code> of MOAB.)
| <code>--ntasks-per-node=2</code> Run 2 tasks per node
| 1 task per node
|-
|- style="vertical-align:top;"
| -c ''count'' or --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
| <code>-c 2</code> Request two CPUs per (MPI-)task.
| 1 CPU per (MPI-)task
|-
|- style="vertical-align:top;"
| <code>--mem=<size>[units]</code>
| #SBATCH --mem=''value_in_MB''
| Memory per node, in megabytes unless a unit is given. <code>[units]</code> can be one of <code>[K<nowiki>|</nowiki>M<nowiki>|</nowiki>G<nowiki>|</nowiki>T]</code>.
| <code>--mem=10g</code> Request 10GB RAM per node. <code>--mem=0</code> Request all memory on node.
| Depends on Slurm configuration. It is better to specify <code>--mem</code> in every case.
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J ''name'' or --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added.
|-
|- style="vertical-align:top;"
| -A ''group-name'' or --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p ''queue-name'' or --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C ''LSDF'' or --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF Filesystems.
|-
|}
 

==== sbatch --partition ''queues'' ====
Queue classes define the maximum resources of each queue of the compute system, such as walltime, nodes and processes per node. Details can be found here:
* [[BinAC2/SLURM_Partitions|BinAC 2 partitions]]
 

=== sbatch Examples ===

If you are coming from Moab/Torque on BinAC 1 or you are new to HPC/Slurm, the <code>sbatch</code> options may confuse you. The following examples show how to run typical workloads on BinAC 2.

You can find every file mentioned on this Wiki page on BinAC 2 at: <code>/pfs/10/project/examples</code>

==== Serial Programs ====
When you use serial programs that use only one process, you can omit most of the <code>sbatch</code> parameters, as the default values are sufficient.

To submit a serial job that runs the script <code>serial_job.sh</code> and requires 5000 MB of main memory and 10 minutes of wall clock time, use one of the two options below. Slurm will allocate one '''physical''' core to your job.

a) execute:
<pre>
$ sbatch -p compute -t 10:00 --mem=5000m serial_job.sh
</pre>
or
b) add after the initial line of your script '''serial_job.sh''' the lines:
<source lang="bash">
#SBATCH --time=10:00
#SBATCH --mem=5000m
#SBATCH --job-name=simple-serial-job
</source>
and execute the modified script with the command line option ''--partition=compute'':
<pre>
$ sbatch --partition=compute serial_job.sh
</pre>
Note, that sbatch command line options overrule script options.
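For orientation, a complete <code>serial_job.sh</code> could look like the following minimal sketch. The workload (sorting ten numbers) and the fallback values after <code>:-</code> are placeholders chosen for illustration; they are not taken from the example files on the cluster.

<source lang="bash">
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --time=10:00
#SBATCH --mem=5000m
#SBATCH --job-name=simple-serial-job

# Slurm sets these variables inside a job. The fallbacks only apply
# when the script is run outside of Slurm, e.g. for a quick local test.
job_id=${SLURM_JOB_ID:-none}
submit_dir=${SLURM_SUBMIT_DIR:-$PWD}

echo "Job ${job_id} was submitted from ${submit_dir}"

# Placeholder for the actual serial workload: find the smallest of ten numbers.
result=$(seq 10 -1 1 | sort -n | head -n 1)
echo "Smallest value: ${result}"
</source>

Because the <code>#SBATCH</code> lines are ordinary comments for the shell, the same file can be tested with <code>bash serial_job.sh</code> before it is submitted.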
 
 

==== Multithreaded Programs ====

Multithreaded programs run their processes on multiple threads and share resources such as memory. 
You may use a program that includes a built-in option for multithreading (e.g., options like <code>--threads</code>). 
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) a number of threads is defined by the environment variable <code>OMP_NUM_THREADS</code>. By default, this variable is set to 1 (<code>OMP_NUM_THREADS=1</code>).

'''Important:''' Hyperthreading is activated on bwForCluster BinAC 2. Hyperthreading can be beneficial for some applications and codes, but it can also degrade performance in other cases. We therefore recommend running a small test job with and without hyperthreading to determine the best choice.

'''a) Program with built-in multithreading option'''

The example uses the common Bioinformatics software called <code>samtools</code> as example for using built-in multithreading.

The module <code>bio/samtools/1.21</code> provides an example jobscript that requests 4 CPUs and runs <code>samtools sort</code> with 4 threads.

<pre>
#!/bin/bash

#SBATCH --time=19:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5000m
#SBATCH --partition compute
[...]
samtools sort -@ 4 sample.bam -o sample.sorted.bam
</pre>

You can use the example jobscript with this command

<pre>
sbatch /opt/bwhpc/common/bio/samtools/1.21/bwhpc-examples/binac2-samtools-1.21-bwhpc-examples.slurm
</pre>

'''b) OpenMP'''

We will run an example OpenMP Hello-World program. The jobscript looks like this:

<pre>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=1:00
#SBATCH --mem=5000m
#SBATCH -J OpenMP-Hello-World

export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))

echo "Executable running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"

# Run parallel Hello World
/pfs/10/project/examples/openmp_hello_world
</pre>

Submit the job to the <code>compute</code> partition and get the output (in the stdout-file)

<pre>
sbatch --partition=compute /pfs/10/project/examples/openmp_hello_world.sh

Executable running on 4 cores with 4 threads
Hello from process: 0
Hello from process: 2
Hello from process: 1
Hello from process: 3
</pre>

==== OpenMPI ====

If you want to run MPI-jobs on batch nodes, generate a wrapper script <code>mpi_hello_world.sh</code> for '''OpenMPI''' containing the following lines:

<source lang="bash">
#!/bin/bash

#SBATCH --partition compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2000
#SBATCH --time=05:00

# Load the MPI implementation of your choice
module load mpi/openmpi/4.1-gnu-14.2

# Run your MPI program
mpirun --bind-to core --map-by core --report-bindings mpi_hello_world
</source>

'''Attention:''' Do '''NOT''' add mpirun options <code>-n <number_of_processes></code> or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.

'''ALWAYS''' use the MPI options <code>--bind-to core</code> and <code>--map-by core|socket|node</code>.
Please type <code>man mpirun</code> for an explanation of the different values of the mpirun option <code>--map-by</code>.

The above jobscript runs four OpenMPI tasks, distributed between two nodes. Because of hyperthreading you have to set <code>--cpus-per-task=2</code>. This means each MPI-task will get one physical core. If you omit <code>--cpus-per-task=2</code> MPI will fail.
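The resource request of the jobscript above can be checked with a little arithmetic. The following sketch only recomputes the numbers from the <code>#SBATCH</code> lines; the variable names are chosen for illustration and are not set by Slurm.

<source lang="bash">
#!/bin/bash
# Values copied from the #SBATCH lines of mpi_hello_world.sh
nodes=2
ntasks_per_node=2
cpus_per_task=2        # two logical CPUs = one physical core (hyperthreading)
mem_per_cpu_mb=2000

mpi_tasks=$(( nodes * ntasks_per_node ))
logical_cpus=$(( mpi_tasks * cpus_per_task ))
physical_cores=$(( logical_cpus / 2 ))
mem_per_node_mb=$(( ntasks_per_node * cpus_per_task * mem_per_cpu_mb ))

echo "MPI tasks:        ${mpi_tasks}"          # 4
echo "Logical CPUs:     ${logical_cpus}"       # 8
echo "Physical cores:   ${physical_cores}"     # 4
echo "Memory per node:  ${mem_per_node_mb} MB" # 8000 MB
</source>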

'''Attention:''' Not all compute nodes are connected via Infiniband. Tell Slurm you want Infiniband via <code>--constraint=ib</code> when submitting or add <code>#SBATCH --constraint=ib</code> to your jobscript.

<pre>
$ sbatch --constraint=ib /pfs/10/project/examples/mpi_hello_world.sh
</pre>

This will run a simple Hello World program:

<pre>
[...]
Hello world from processor node2-031, rank 3 out of 4 processors
Hello world from processor node2-031, rank 2 out of 4 processors
Hello world from processor node2-030, rank 1 out of 4 processors
Hello world from processor node2-030, rank 0 out of 4 processors

</pre>

==== Multithreaded + MPI parallel Programs ====

Multithreaded + MPI parallel programs operate faster than serial programs on systems with multiple CPUs and multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned over different nodes. '''Because hyperthreading is switched on on BinAC 2, the option --cpus-per-task (-c) must be set to 2*n if you want to use n threads.'''
 
 
===== OpenMPI with Multithreading =====
Multiple MPI tasks using '''OpenMPI''' must be launched by the MPI parallel program '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default, this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''For OpenMPI''' a job-script to submit a batch job called ''job_ompi_omp.sh'' that runs a MPI program with 4 tasks and a 28-fold threaded program ''ompi_omp_program'' requiring 3000 MByte of physical memory per thread (using 28 threads per MPI task you will get 28*3000 MByte = 84000 MByte per MPI task) and total wall clock time of 3 hours looks like:

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=56
#SBATCH --time=03:00:00
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.03 GB, rounded up
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$(( SLURM_NTASKS * OMP_NUM_THREADS ))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Execute the script '''job_ompi_omp.sh''' by command sbatch:
<pre>
$ sbatch -p compute ./job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
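* With the option ''--map-by node:PE=<value>'' neighboring MPI tasks will be attached to different nodes and each MPI task is bound to the first core of a node. The script above uses ''--map-by socket:PE=<value>'' instead, which distributes neighboring MPI tasks across sockets. In both cases <value> must be set to ${OMP_NUM_THREADS}.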
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program.
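The numbers used in the hybrid job script above follow from the thread count and the memory per thread. The sketch below redoes that calculation; the variable names are illustrative and not part of Slurm.

<source lang="bash">
#!/bin/bash
threads_per_task=28        # OpenMP threads per MPI task
mem_per_thread_mb=3000     # memory needed by each thread

# Hyperthreading: request two logical CPUs for every thread.
cpus_per_task=$(( 2 * threads_per_task ))

# Memory per MPI task in MB, then converted to GB and rounded up.
mem_per_task_mb=$(( threads_per_task * mem_per_thread_mb ))
mem_gb=$(( (mem_per_task_mb + 1023) / 1024 ))

echo "--cpus-per-task=${cpus_per_task}"   # --cpus-per-task=56
echo "--mem=${mem_gb}gb"                  # --mem=83gb
</source>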
 

==== GPU jobs ====

The nodes in the gpu_4 and gpu_8 queues have 4 or 8 NVIDIA Tesla V100 GPUs. Just submitting a job to these queues is not enough to also allocate one or more GPUs; you have to request them using the "--gres=gpu" parameter. You have to specify how many GPUs your job needs, e.g. "--gres=gpu:2" will request two GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) add after the initial line of your script job.sh the line including the
information about the GPU usage: #SBATCH --gres=gpu:2
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh
</pre>
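Inside a GPU job you can check how many GPUs were assigned. The sketch below assumes that Slurm exports <code>CUDA_VISIBLE_DEVICES</code> for jobs that use <code>--gres=gpu</code>; this is common but depends on the cluster configuration. The fallback value <code>0,1</code> is only there so that the snippet also runs outside of a job.

<source lang="bash">
#!/bin/bash
# Comma-separated list of GPU indices visible to this job (assumed to be set by Slurm).
visible=${CUDA_VISIBLE_DEVICES:-0,1}

# Split the list at the commas into a bash array.
IFS=',' read -r -a gpu_ids <<< "${visible}"
num_gpus=${#gpu_ids[@]}

echo "This job can use ${num_gpus} GPU(s): ${visible}"
</source>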
 
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Sun Mar 29 15:20:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:3A:00.0 Off | 0 |
| N/A 29C P0 39W / 300W | 9MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 8MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 14228 G /usr/bin/X 8MiB |
| 1 14228 G /usr/bin/X 8MiB |
+-----------------------------------------------------------------------------+
</pre>

 
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI's BTL) is CUDA-aware.
However, there may be warnings, e.g. when running
<pre>
$ module load compiler/gnu/10.3 mpi/openmpi devel/cuda
$ mpirun -np 2 ./mpi_cuda_app
--------------------------------------
WARNING: There are more than one active ports on host 'uc2n520', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
</pre>

Please run Open MPI's mpirun using the following command:
<pre>
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app
</pre>
or disable the (older) communication layer Byte Transfer Layer (short BTL) altogether:
<pre>
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app
</pre>

(Please note that CUDA as of v11.4 is only available with GCC up to version 10.)
 
 

== Start time of job or resources : squeue --start ==
Any user can use this command to display the estimated start time of a job. The estimate is based on a number of analysis types that take into account historical usage, earliest available reservable resources, and priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by '''any user'''.
 
 

== List of your submitted jobs : squeue ==
Displays information about your own active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by any user.
 
 

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -l, --long
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.
|}
 

=== Examples ===
''squeue'' example on BinAC 2 (Only your own jobs are displayed!).
<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18088744 single CPV.sbat ab1234 PD 0:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PD 0:00 2 (Priority)
18090089 multiple CPV.sbat ab1234 R 2:27 2 uc2n[127-128]
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
18088654 single CPV.sbat ab1234 COMPLETI 4:29 2:00:00 1 uc2n374
18088785 single CPV.sbat ab1234 PENDING 0:00 2:00:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PENDING 0:00 2:00:00 2 (Priority)
18088683 single CPV.sbat ab1234 RUNNING 0:14 2:00:00 1 uc2n413
</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
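The number of running and pending jobs can also be counted automatically. The following sketch parses the first sample listing from above, stored in a here-document so that it runs without a cluster; in a real session you would pipe the output of <code>squeue</code> into <code>awk</code> instead. It assumes the default column layout shown above, where the fifth column holds the job state.

<source lang="bash">
#!/bin/bash
# Sample output copied from the squeue example above.
squeue_output=$(cat <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18088744 single CPV.sbat ab1234 PD 0:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PD 0:00 2 (Priority)
18090089 multiple CPV.sbat ab1234 R 2:27 2 uc2n[127-128]
EOF
)

# Skip the header line (NR > 1) and count the states in column 5.
running=$(awk 'NR > 1 && $5 == "R"  { n++ } END { print n + 0 }' <<< "${squeue_output}")
pending=$(awk 'NR > 1 && $5 == "PD" { n++ } END { print n + 0 }' <<< "${squeue_output}")

echo "running: ${running}, pending: ${pending}"   # running: 1, pending: 2
</source>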
 

== Shows free resources : sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Access ===
By default, this command can be used by any user or administrator.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_multiple : 8 nodes idle
Partition multiple : 332 nodes idle
Partition dev_single : 4 nodes idle
Partition single : 76 nodes idle
Partition long : 80 nodes idle
Partition fat : 5 nodes idle
Partition dev_special : 342 nodes idle
Partition special : 342 nodes idle
Partition dev_multiple_e: 7 nodes idle
Partition multiple_e : 335 nodes idle
Partition gpu_4 : 12 nodes idle
Partition gpu_8 : 6 nodes idle
</pre>
* For the above example jobs in all partitions can be run immediately.
 

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 
=== Access ===
* End users can use scontrol show job to view the status of their '''own jobs''' only.
 

=== Arguments ===
{| width=750px class="wikitable"
|-
! Option !! Default !! Description !! Example
|- style="vertical-align:top;"
|- style="width:12%;"
| -d
| (n/a)
| Detailed mode
| Example: Display the state with jobid 18089884 in detailed mode. <pre>scontrol -d show job 18089884</pre>
|}
 
 

=== Scontrol show job Example ===
Here is an example from BinAC 2.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18089884 multiple CPV.sbat bq0742 R 33:44 2 uc2n[165-166]

$
$ # now, see what's up with my running job with jobid 18089884
$
$ scontrol show job 18089884

JobId=18089884 JobName=CPV.sbatch
UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A
Priority=3 Nice=0 Account=kit QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54
AccrueTime=2020-03-16T14:14:54
StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51
Partition=multiple AllocNode:Sid=uc2n995:5064
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc2n[165-166]
BatchHost=uc2n165
NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=160,mem=96320M,node=2,billing=160
Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch
WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin
StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
StdIn=/dev/null
StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
Power=
MailUser=(null) MailType=NONE
</pre>
 
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
* In which state is the job?
<pre>$ scontrol show job 18089884 | grep -i State
JobState=COMPLETED Reason=None Dependency=(null)
</pre>
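If only the state itself is needed, for example in a script, it can be cut out of that line with shell parameter expansion. The sketch below works on the sample line from above; in practice the line would come from <code>scontrol show job <jobid> | grep JobState</code>.

<source lang="bash">
#!/bin/bash
# Sample line as printed by the grep command above.
line="JobState=COMPLETED Reason=None Dependency=(null)"

state=${line#*JobState=}   # drop everything up to and including "JobState="
state=${state%% *}         # drop everything after the first space

echo "${state}"            # COMPLETED
</source>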
 

== Cancel Slurm Jobs ==
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
 
 
=== Canceling own jobs : scancel ===
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

 
{| width=750px class="wikitable"
! Flag !! Default !! Description !! Example
|- style="vertical-align:top;"
| -i, --interactive
| (n/a)
| Interactive mode.
| Cancel the job 987654 interactively. <pre> scancel -i 987654 </pre>
|-
| -t, --state
| (n/a)
| Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
| Cancel all jobs in state "PENDING". <pre> scancel -t "PENDING" </pre>
|}
 

= Resource Managers =
=== Batch Job (Slurm) Variables ===
The following environment variables of Slurm are added to your environment once your job has started
(only an excerpt of the most important ones).
{| width=750px class="wikitable"
! Environment !! Brief explanation
|-
| SLURM_JOB_CPUS_PER_NODE
| Number of CPUs per node available to the job
|-
| SLURM_JOB_NODELIST
| List of nodes dedicated to the job
|-
| SLURM_JOB_NUM_NODES
| Number of nodes dedicated to the job
|-
| SLURM_MEM_PER_NODE
| Memory per node dedicated to the job
|-
| SLURM_NPROCS
| Total number of processes dedicated to the job
|-
| SLURM_CLUSTER_NAME
| Name of the cluster executing the job
|-
| SLURM_CPUS_PER_TASK
| Number of CPUs requested per task
|-
| SLURM_JOB_ACCOUNT
| Account name
|-
| SLURM_JOB_ID
| Job ID
|-
| SLURM_JOB_NAME
| Job Name
|-
| SLURM_JOB_PARTITION
| Partition/queue running the job
|-
| SLURM_JOB_UID
| User ID of the job's owner
|-
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which sbatch was invoked.
|-
| SLURM_JOB_USER
| User name of the job's owner
|-
| SLURM_RESTART_COUNT
| Number of times job has restarted
|-
| SLURM_PROCID
| Task ID (MPI rank)
|-
| SLURM_NTASKS
| The total number of tasks available for the job
|-
| SLURM_STEP_ID
| Job step ID
|-
| SLURM_STEP_NUM_TASKS
| Task count (number of MPI ranks)
|-
| SLURM_JOB_CONSTRAINT
| Job constraints
|}
See also:
* [https://slurm.schedmd.com/sbatch.html#lbAI Slurm input and output environment variables]
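As a minimal sketch of how these variables can be used in a jobscript, the following lines build a per-job output directory name and report the job size. The fallback values after <code>:-</code> and the directory naming scheme are assumptions for illustration; they only take effect when the script runs outside of a Slurm job.

<source lang="bash">
#!/bin/bash
# Fallbacks after ":-" are used only outside of a Slurm job.
job_name=${SLURM_JOB_NAME:-interactive}
job_id=${SLURM_JOB_ID:-0}
partition=${SLURM_JOB_PARTITION:-none}
ntasks=${SLURM_NTASKS:-1}
# SLURM_CPUS_PER_TASK is only set if --cpus-per-task was requested.
cpus_per_task=${SLURM_CPUS_PER_TASK:-1}

# Hypothetical naming scheme: one output directory per job below the submit directory.
out_dir="${SLURM_SUBMIT_DIR:-$PWD}/${job_name}_${job_id}"
total_cpus=$(( ntasks * cpus_per_task ))

echo "Job ${job_id} (${job_name}) in partition ${partition}"
echo "Tasks: ${ntasks}, CPUs per task: ${cpus_per_task}, total CPUs: ${total_cpus}"
echo "Output directory: ${out_dir}"
</source>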
 

=== Job Exit Codes ===
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
 
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
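The 8 bit range can be observed directly in the shell, which truncates exit codes in the same way. This is a small illustration with arbitrarily chosen values, not output taken from Slurm.

<source lang="bash">
#!/bin/bash
# A subshell that tries to exit with 300; only the lowest 8 bits survive.
wrapped=0
( exit 300 ) || wrapped=$?
echo "exit 300 is reported as ${wrapped}"    # 300 mod 256 = 44

# The same truncation applied to a negative value.
negative=$(( -1 & 255 ))
echo "exit -1 is displayed as ${negative}"   # 255
</source>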
 
 
==== Displaying Exit Codes and Signals ====
SLURM displays a job's exit code in the output of the '''scontrol show job''' and the sview utility.
 
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon(:).
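The two parts can be separated with shell parameter expansion. The value <code>0:9</code> below is a hypothetical example of what appears after <code>ExitCode=</code> in the job record; it is not taken from a real job.

<source lang="bash">
#!/bin/bash
# Hypothetical ExitCode field in the form <exit code>:<signal number>.
exit_field="0:9"

exit_code=${exit_field%%:*}   # part before the colon
signal=${exit_field##*:}      # part after the colon

if [ "${signal}" -ne 0 ]; then
    echo "Job was terminated by signal ${signal}"
else
    echo "Job finished with exit code ${exit_code}"
fi
</source>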
 
 
==== Submitting Termination Signal ====
Here is an example of how to 'save' the exit code of the executable in a typical jobscript.
<source lang="bash">
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
# $? must be read immediately after the command whose exit code is wanted
exit_code=$?
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
</source>
* Do not use ''''time'''' mpirun! The exit code will be the one submitted by the first (time) program.
* You do not need an '''exit $exit_code''' in the scripts.
 
 
----
[[#top|Back to top]]

BinAC2/Login

2025-07-30T06:59:40Z

S Behnle:

{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#cef2e0; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#cef2e0; text-align:left"|
Access to bwForCluster BinAC 2 is only possible from IP addresses within the [https://www.belwue.de BelWü] network which connects universities and other scientific institutions in Baden-Württemberg.
If your computer is in your University network (e.g. at your office), you should be able to connect to bwForCluster BinAC 2 without restrictions.
If you are outside the BelWü network (e.g. at home), a VPN (virtual private network) connection to your University network must be established first. Please consult the VPN documentation of your University.
|}

'''Prerequisites for successful login:'''

You need to have
* completed the 3-step [[registration/bwForCluster|bwForCluster registration]] procedure.
* [[Registration/Password|set a service password]] for bwForCluster BinAC 2.
* set up the [[BinAC2/Login#TOTP_Second_Factor|two factor authentication (2FA)]].

= Login to bwForCluster BinAC 2 =

Login to bwForCluster BinAC 2 is only possible with a Secure Shell (SSH) client for which you must know your [[BinAC2/Login#Username|username]] on the cluster and the [[BinAC2/Login#Hostname|hostname]] of the BinAC 2 login node.

For more general information on SSH clients, visit the [[Registration/Login/Client|SSH clients Guide]].

== TOTP Second Factor ==

At the moment no second factor is needed. We are currently implementing a new TOTP procedure.

== Username ==

Your <code><username></code> on BinAC 2 consists of a prefix and your local username.
For prefixes please refer to the [[Registration/Login/Username|Username Guide]].

Example: If your local username at your University is <code>ab123</code> and you are a user from Tübingen University, your username on the cluster is: <code>tu_ab123</code>.

== Hostnames ==

BinAC 2 has one login node serving as a load balancer. We use DNS round-robin scheduling to load-balance the incoming connections between the three actual login nodes. If you log in multiple times, different sessions might run on different login nodes, and hence programs started in one session might not be visible in another session.

{| class="wikitable"
! Hostname !! Destination
|-
| login.binac2.uni-tuebingen.de || one of the three login nodes
|-
|}

You can choose a specific login node by using specific ports on the load balancer. Please only do this if there is a real reason for that (e.g. connecting to a running tmux/screen session).

{| class="wikitable"
! Port !! Destination
|-
| 2221 || login01
|-
| 2222 || login02
|-
| 2223 || login03
|-
|}
Usage: <code>ssh -p <port> [other options] <username>@login.binac2.uni-tuebingen.de</code>

== Login with SSH command (Linux, Mac, Windows) ==

Most Unix and Unix-like operating systems like Linux or macOS come with a built-in SSH client provided by the OpenSSH project.
Windows 10 and Windows 11 also come with a built-in OpenSSH client.

For login use one of the following ssh commands:

ssh <username>@login.binac2.uni-tuebingen.de

To run graphical applications on the cluster, you need to enable X11 forwarding with the <code>-X</code> flag:

ssh -X <username>@login.binac2.uni-tuebingen.de

For login to a specific login node (here: login03):

ssh -p 2223 <username>@login.binac2.uni-tuebingen.de
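
The port table above can be wrapped in a small shell helper so that you do not have to remember the numbers. This is only a sketch: the function names <code>binac2_port</code> and <code>binac2_ssh_command</code> are hypothetical, and falling back to port 22 (the load-balanced default) for unknown node names is an assumption.

```shell
# Map a login node name to the load balancer port listed in the table above.
binac2_port() {
    case "$1" in
        login01) echo 2221 ;;
        login02) echo 2222 ;;
        login03) echo 2223 ;;
        *)       echo 22 ;;   # assumption: default SSH port, any login node
    esac
}

# Print (not run) the ssh command for a username and an optional login node.
binac2_ssh_command() {
    echo "ssh -p $(binac2_port "${2:-}") ${1}@login.binac2.uni-tuebingen.de"
}

binac2_ssh_command tu_ab123 login03
# prints: ssh -p 2223 tu_ab123@login.binac2.uni-tuebingen.de
```

The helper prints the command instead of running it, so you can check it before connecting.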

== Login with graphical SSH client (Windows) ==

For Windows we suggest using MobaXterm for login and file transfer.

Start MobaXterm and fill in the following fields:
<pre>
Remote name : login.binac2.uni-tuebingen.de
Specify user name : <username>
Port : 22
</pre>

After that click on 'ok'. Then a terminal will open where you can enter your credentials.

BinAC2/Hardware and Architecture

2025-07-01T10:23:10Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.5
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 180
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 (7443) or 64 / 128 (75F3)
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GiB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gigabit Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; only 84 standard compute nodes have an HDR InfiniBand (100 Gb/s) interconnect. To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.
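
As an illustration, a batch script requesting InfiniBand nodes might look like the following sketch. Only <code>--constraint=ib</code> is taken from this page; the job name, node count, and time limit are placeholder values, and the <code>echo</code> line stands in for a real MPI program.

```shell
#!/bin/bash
# Placeholder job name, node count and walltime; adjust to your needs.
#SBATCH --job-name=ib-example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
# Only schedule this job on nodes with the InfiniBand interconnect.
#SBATCH --constraint=ib

# Slurm sets these variables inside a job; the fallbacks allow a dry run elsewhere.
job_summary="job=${SLURM_JOB_ID:-none} nodes=${SLURM_JOB_NUM_NODES:-1}"
echo "$job_summary"
```

Jobs that do not communicate between nodes usually have no need for this constraint and may start sooner without it.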

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) which offers fast and parallel file access and a bigger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directories. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid-state disk (SSD), accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes)
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used continuously, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.
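
To see how close you are to the home quota, you can compare the output of <code>du</code> with the 40 GB limit. The sketch below assumes the quota is counted in binary units (1 GB = 1024 × 1024 KiB), which this page does not specify; the function name <code>home_usage_percent</code> is hypothetical.

```shell
# Print the used share of a quota as an integer percentage.
# $1: used space in KiB (as printed by `du -sk`), $2: quota in GB.
home_usage_percent() {
    echo $(( $1 * 100 / ($2 * 1024 * 1024) ))
}

# Example with a fixed value: 20971520 KiB (20 GiB) of a 40 GB quota.
home_usage_percent 20971520 40   # prints 50

# On the cluster (may take a while for large directories):
#   home_usage_percent "$(du -sk "$HOME" | cut -f1)" 40
```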


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible to all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Work ===

The data at <code>/pfs/10/work</code> is stored on SSDs. The primary focus is speed, not capacity.
In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user can create workspaces at <code>/pfs/10/work</code> through the workspace tools.
To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}
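
Put together, a typical session could look like the sketch below. It only calls commands from the table above and skips them when the workspace tools are not installed, for example on your own machine. The upper bound on the total lifetime is plain arithmetic on the limits stated in the file system table (30 days per allocation, 5 extensions) and assumes each extension is requested just before the work space expires.

```shell
ws_name="mywork"
max_days=30        # maximum lifetime per allocation or extension
max_extensions=5   # maximum number of extensions

# Upper bound on total lifetime: initial allocation plus all extensions.
max_total_days=$(( max_days * (1 + max_extensions) ))
echo "maximum total lifetime: ${max_total_days} days"

if command -v ws_allocate >/dev/null 2>&1; then
    ws_allocate "$ws_name" "$max_days"
    ws_dir="$(ws_find "$ws_name")"
    echo "work space path: ${ws_dir}"
else
    echo "workspace tools not found; run this on BinAC 2" >&2
fi
```

Because <code>ws_extend</code> counts from the day it is called, extending early shortens the total time you can keep a work space.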

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.
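
A common pattern that follows from this is to stage input data to <code>$TMPDIR</code>, compute there, and copy only the results back to the parallel file system. The sketch below uses hypothetical file names, a hypothetical <code>WORKSPACE_DIR</code> variable, and a <code>sort</code> call as a stand-in for the real computation; outside a batch job it falls back to <code>/tmp</code> and a temporary directory so that it can be tried anywhere.

```shell
# $TMPDIR is set per job on the compute nodes; /tmp is only a fallback for testing.
scratch_base="${TMPDIR:-/tmp}"
# Hypothetical work space path; on the cluster this could come from ws_find.
workspace="${WORKSPACE_DIR:-$(mktemp -d)}"

# Stand-in input data.
printf 'c\nb\na\n' > "${workspace}/input.txt"

# 1. Stage in: copy the input to node-local scratch.
job_scratch="$(mktemp -d "${scratch_base}/stage.XXXXXX")"
cp "${workspace}/input.txt" "${job_scratch}/"

# 2. Compute on local scratch (sort stands in for the real program).
sort "${job_scratch}/input.txt" > "${job_scratch}/output.txt"

# 3. Stage out: copy the results back, then clean up the scratch copy.
cp "${job_scratch}/output.txt" "${workspace}/"
rm -rf "${job_scratch}"
echo "results in ${workspace}/output.txt"
```

Copying results back before the job ends matters because files in <code>$TMPDIR</code> are removed at the end of the batch job.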

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.
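
Before starting a job that reads from SDS@hd, it can be useful to check whether a valid ticket already exists. A minimal sketch, assuming the MIT Kerberos client tools are installed (<code>klist -s</code> reports the ticket state through its exit status); the function name <code>sds_ticket_state</code> is hypothetical.

```shell
# Report whether a valid Kerberos ticket is present.
sds_ticket_state() {
    if ! command -v klist >/dev/null 2>&1; then
        echo "no-kerberos-tools"
    elif klist -s; then
        echo "valid"
    else
        echo "missing"
    fi
}

state="$(sds_ticket_state)"
echo "Kerberos ticket: ${state}"
# If the state is "missing", request a new ticket interactively with: kinit $USER
```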

BinAC2/Hardware and Architecture

2025-07-01T10:15:35Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.5
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 180
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 (7443) or 64 / 128 (75F3)
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gigabit Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; only 84 standard compute nodes have an HDR InfiniBand (100 Gb/s) interconnect. To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) which offers fast and parallel file access and a bigger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directories. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid-state disk (SSD), accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 512 GB per node; 1920 GB on high-mem nodes
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used continuously, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible to all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Work ===

The data at <code>/pfs/10/work</code> is stored on SSDs. The primary focus is speed, not capacity.
In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user can create workspaces at <code>/pfs/10/work</code> through the workspace tools.
To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

BinAC2/Hardware and Architecture

2025-07-01T10:14:12Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.5
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 180
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 (7443) or 64 / 128 (75F3)
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gigabit Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; only 84 standard compute nodes have an HDR InfiniBand (100 Gb/s) interconnect. To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) which offers fast and parallel file access and a bigger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directories. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid-state disk (SSD), accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 512 GB per node; 1920 GB on high-mem nodes
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. immune against disk failures but not immune against catastrophic incidents like cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used continuously, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible to all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Work ===

The data at <code>/pfs/10/work</code> is stored on SSDs. The primary focus is speed, not capacity.
In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user can create workspaces at <code>/pfs/10/work</code> through the workspace tools.
To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

BinAC2/Hardware and Architecture

2025-06-27T12:18:53Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.5
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 180
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100GbE.
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand: only 84 standard compute nodes have an HDR InfiniBand connection (100 Gb/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.
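
A minimal sketch of a job script that requests the InfiniBand nodes is given here. Only <code>--constraint=ib</code> comes from this page; the job name, node count and time limit are illustrative assumptions, and the fallback values merely keep the script runnable outside the batch system.

```shell
#!/bin/bash
#SBATCH --job-name=ib-example      # hypothetical job name
#SBATCH --nodes=2                  # illustrative multi-node request
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00            # illustrative time limit
#SBATCH --constraint=ib            # restrict the job to the InfiniBand nodes

# Slurm sets these variables inside a job; the defaults are only used
# when the script is run outside the batch system.
job_id="${SLURM_JOB_ID:-interactive}"
node_list="${SLURM_JOB_NODELIST:-$(uname -n)}"

# Report where the job ended up.
summary="job ${job_id} runs on: ${node_list}"
echo "$summary"
```

The same constraint can presumably also be passed on the command line, e.g. <code>sbatch --constraint=ib jobscript.sh</code>, where <code>jobscript.sh</code> is a placeholder name.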

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work space is a parallel file system (PFS) which offers fast and parallel file access and a bigger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible for the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid state disk (SSD), accessible via the <code>$TMPDIR</code> environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 512 GB per node; 1920 GB on high-mem nodes
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU A30) / ≈ 26 GB/s (GPU A100 + SMP) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes (nightly)
| '''no'''
| '''no'''
| '''no'''
|}

* global: all nodes access the same file system
* local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

{| class="wikitable" style="color:red; background-color:#ffffcc;" cellpadding="10"
|
Please note that due to the large capacity of '''work''' and '''project''' and due to frequent file changes on these file systems, no backup can be provided.
Backing up these file systems would require a redundant storage facility with multiple times the capacity of '''project'''. Furthermore, regular backups would significantly degrade the performance.
Data is stored redundantly, i.e. it is protected against disk failures but not against catastrophic incidents such as cyber attacks or a fire in the server room.
Please consider using one of the remote storage facilities like [https://wiki.bwhpc.de/e/SDS@hd SDS@hd], [https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/projekte/laufende-projekte/bwsfs bwSFS], [https://www.scc.kit.edu/en/services/lsdf.php LSDF Online Storage] or the [https://www.rda.kit.edu/english/ bwDataArchive] to back up your valuable data.
|}

=== Home ===

Home directories are meant for permanent storage of files that are used continuously, such as source code, configuration files and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible by all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Work ===

The data at <code>/pfs/10/work</code> is stored on SSDs. The primary focus is speed, not capacity.
In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user can create workspaces at <code>/pfs/10/work</code> through the workspace tools.
To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}
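
The commands in the table can be combined into a small helper for job scripts. This is only a sketch: the function name <code>ensure_workspace</code> is hypothetical, and it assumes that <code>ws_find</code> prints nothing (or fails) when the work space does not exist yet.

```shell
#!/bin/bash
# ensure_workspace NAME [DAYS]
# Print the absolute path of work space NAME, allocating it for DAYS days
# (default: 30) if it does not exist yet. Hypothetical helper, not part of
# the workspace tools themselves.
ensure_workspace() {
    local name="$1" days="${2:-30}"
    local path

    # Assumption: ws_find prints an empty result or fails for unknown names.
    path="$(ws_find "$name" 2>/dev/null)" || path=""

    if [ -z "$path" ]; then
        ws_allocate "$name" "$days" >/dev/null
        path="$(ws_find "$name")"
    fi

    printf '%s\n' "$path"
}
```

In a job script it could be used as <code>WORKDIR="$(ensure_workspace mywork 30)"</code>, after which <code>$WORKDIR</code> holds the path reported by <code>ws_find</code>.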

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.
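
A common way to follow this advice is to stage input data to the local scratch directory, work there, and copy only the results back. The following is a minimal sketch of that pattern; the function name <code>stage_and_run</code>, the file names and the <code>sort</code> call (a stand-in for the real workload) are assumptions made for illustration.

```shell
#!/bin/bash
# stage_and_run INPUT OUTDIR
# Copy INPUT to node-local scratch, process it there and copy the result
# to OUTDIR. Hypothetical helper for illustration only.
stage_and_run() {
    local input="$1" outdir="$2"
    local scratch staged

    # Inside a job $TMPDIR points to the job's scratch directory; /tmp is
    # only a fallback so that the sketch also runs outside the batch system.
    scratch="$(mktemp -d "${TMPDIR:-/tmp}/stage.XXXXXX")"

    # Stage in: one large sequential copy instead of many small reads.
    cp "$input" "$scratch/"
    staged="$scratch/$(basename "$input")"

    # Placeholder workload: replace with the actual computation.
    sort "$staged" > "$scratch/result.txt"

    # Stage out: copy the result back to global storage, then clean up.
    mkdir -p "$outdir"
    cp "$scratch/result.txt" "$outdir/"
    rm -rf "$scratch"
}
```

Here <code>INPUT</code> would typically live in a work space and <code>OUTDIR</code> in a work space or the project directory.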

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.
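
Before a job accesses <code>/mnt/sds-hd</code>, it can be useful to check whether a valid ticket exists. The sketch below relies on <code>klist -s</code>, which in MIT Kerberos runs silently and reports the ticket state through its exit status; the function name <code>sds_ticket_status</code> is hypothetical.

```shell
#!/bin/bash
# sds_ticket_status
# Print "valid" if a usable Kerberos ticket is present, "missing" otherwise.
# Hypothetical helper; it only inspects the ticket cache and never prompts.
sds_ticket_status() {
    # klist -s exits with status 0 only if the credentials cache can be
    # read and the ticket has not expired.
    if command -v klist >/dev/null 2>&1 && klist -s; then
        echo "valid"
    else
        echo "missing"
    fi
}

if [ "$(sds_ticket_status)" = "missing" ]; then
    echo "No valid Kerberos ticket found. Run 'kinit \$USER' first." >&2
fi
```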

BinAC2/Hardware and Architecture

2025-06-27T10:10:31Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.5
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 180
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100GbE.
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand: only 84 standard compute nodes have an HDR InfiniBand connection (100 Gb/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work space is a parallel file system (PFS) which offers fast and parallel file access and a bigger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible for the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid state disk (SSD), accessible via the <code>$TMPDIR</code> environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 512 GB per node; 1920 GB on high-mem nodes
|-
!scope="column" | Speed (read)
| ≈ 1 GB/s, shared by all nodes
| max. 12 GB/s
| ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping
| ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU A30) / ≈ 26 GB/s (GPU A100 + SMP) per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes
| no
| no
| no
|}

* global: all nodes access the same file system
* local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

=== Home ===

Home directories are meant for permanent storage of files that are used continuously, such as source code, configuration files and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible by all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Work ===

The data at <code>/pfs/10/work</code> is stored on SSDs. The primary focus is speed, not capacity.
In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user can create workspaces at <code>/pfs/10/work</code> through the workspace tools.
To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (<code>WORK</code> and <code>PROJECT</code>) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than <code>WORK</code>.

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

BinAC2/Hardware and Architecture

2025-06-27T09:24:35Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.5
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 180
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 128 / 256
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100GbE.
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand: only 84 standard compute nodes have an HDR InfiniBand connection (100 Gb/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work space is a parallel file system (PFS) which offers fast and parallel file access and a bigger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are only accessible for the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid state disk (SSD), accessible via the <code>$TMPDIR</code> environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 512 GB per node; 1920 GB on high-mem nodes
|-
!scope="column" | Speed
| ...
| ...
| ...
| ...
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes
| no
| no
| no
|}

* global: all nodes access the same file system
* local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

=== Home ===

Home directories are meant for permanent storage of files that are used continuously, such as source code, configuration files and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs on nodes must not write temporary data to $HOME.
Instead they should use the local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible by all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Work ===

The data at <code>/pfs/10/work</code> is stored on SSDs. The primary focus is speed, not capacity.
In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user can create workspaces at <code>/pfs/10/work</code> through the workspace tools.
To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.

BinAC2/Hardware and Architecture

2025-06-27T09:06:37Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.5
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
* 180 compute nodes
* 16 SMP nodes
* 32 GPU nodes (2xA30)
* 8 GPU nodes (4xA100)
* 4 GPU nodes (4xH200)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
! style="width:10%"| GPU (H200)
|-
!scope="column"| Quantity
| 180
| 14 / 2
| 32
| 8
| 4
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7443.html AMD EPYC Milan 7443] / 2 x [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-75f3.html AMD EPYC Milan 75F3]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/7003-series/amd-epyc-7543.html AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/processors/server/epyc/9005-series/amd-epyc-9555.html AMD EPYC Turin 9555]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80
| 2.85 / 2.95
| 2.80
| 2.80
| 3.20
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96 // 64 / 128
| 64 / 128
| 64 / 128
| 64 / 128
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
| 1536
|-
!scope="column" | Local Disk (GB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 28000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR 100 IB (84 nodes) / 100GbE (96 nodes)
| 100GbE
| 100GbE
| 100GbE
| HDR 200 IB + 100GbE
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
| 4 x [https://www.nvidia.com/de-de/data-center/h200/ NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100GbE.
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand: only 84 standard compute nodes have an HDR InfiniBand connection (100 Gb/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

= File Systems =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) that offers fast, parallel file access and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directories. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are accessible only to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local SSD, accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 512 GB per node; 1920 GB on high-mem nodes
|-
!scope="column" | Speed
| ...
| ...
| ...
| ...
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes
| no
| no
| no
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs. The content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.
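
To get a rough idea of how close you are to the quota, you can sum up the disk usage of your home directory. The following is only a sketch: the function name <code>usage_gb</code> is hypothetical, <code>du</code> reports whole GiB here, and the value may differ from what the quota system itself counts.

<syntaxhighlight lang="bash">
#!/bin/bash
quota_gb=40   # home quota stated above

# Print the disk usage of directory $1 in whole GiB (0 if it cannot be read).
usage_gb() {
    local kb
    kb="$(du -sk "$1" 2>/dev/null | cut -f1)"
    echo $(( ${kb:-0} / 1024 / 1024 ))
}

echo "Home usage: $(usage_gb "$HOME") GiB of ${quota_gb} GB quota"
</syntaxhighlight>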

'''NOTE:'''
Compute jobs must not write temporary data to $HOME.
Instead, they should use the node-local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible to all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Work ===

The data at <code>/pfs/10/work</code> is stored on SSDs. The primary focus is speed, not capacity.
In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user can create workspaces at <code>/pfs/10/work</code> through the workspace tools.
To create a work space, you need to supply a name for your work space and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}
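
The commands from the table can be combined in a script, for example to allocate a work space only if it does not exist yet and to store its path in a variable. This is a sketch under assumptions: it assumes that <code>ws_find</code> prints nothing on standard output when the work space does not exist, and the variable names are hypothetical. On machines without the workspace tools it only prints a hint.

<syntaxhighlight lang="bash">
#!/bin/bash
ws_name="mywork"   # work space name used in the table above

if command -v ws_find >/dev/null 2>&1; then
    # Look up an existing work space first.
    ws_dir="$(ws_find "$ws_name" 2>/dev/null)"
    if [ -z "$ws_dir" ]; then
        # Not found: allocate it for 30 days and look up its path again.
        ws_allocate "$ws_name" 30
        ws_dir="$(ws_find "$ws_name")"
    fi
    echo "Work space path: $ws_dir"
else
    echo "Workspace tools not found; run this on a BinAC 2 login node." >&2
fi
</syntaxhighlight>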

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.
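
A typical pattern is to stage input data to scratch, compute there, and copy the results back before the job ends, because the scratch directory is removed afterwards. The following sketch illustrates this; <code>WORKDIR</code>, the file names, and the use of <code>tr</code> as a stand-in for the real program are assumptions, and the fallbacks only exist so that the sketch also runs outside a batch job.

<syntaxhighlight lang="bash">
#!/bin/bash
# Hypothetical: WORKDIR holds input and results (e.g. a work space path);
# a temporary directory is used when it is not set.
workdir="${WORKDIR:-$(mktemp -d)}"

# Inside a job $TMPDIR is the scratch directory; /tmp is only a fallback.
scratch="${TMPDIR:-/tmp}/stage_demo_$$"
mkdir -p "$scratch"

# 1. Stage the input (a placeholder file is created so the sketch runs anywhere).
echo "example input" > "$workdir/input.txt"
cp "$workdir/input.txt" "$scratch/"

# 2. Do the I/O-heavy work in scratch; tr stands in for the real program.
tr '[:lower:]' '[:upper:]' < "$scratch/input.txt" > "$scratch/output.txt"

# 3. Copy the results back before the job ends, then clean up.
cp "$scratch/output.txt" "$workdir/output.txt"
rm -rf "$scratch"

cat "$workdir/output.txt"
</syntaxhighlight>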

=== SDS@hd ===

SDS@hd is mounted via NFS on login and compute nodes at <syntaxhighlight inline>/mnt/sds-hd</syntaxhighlight>.

To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact [mailto:sds-hd-support@urz.uni-heidelberg.de SDS@hd support] and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.

Once this has been done, you can access your Speichervorhaben as described in the [https://wiki.bwhpc.de/e/SDS@hd/Access/NFS#Access_your_data SDS documentation].

<syntaxhighlight>
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
</syntaxhighlight>

The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.
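
Before a job accesses <code>/mnt/sds-hd</code>, it can be useful to check that a valid ticket exists. The following sketch assumes the MIT Kerberos <code>klist</code>, whose <code>-s</code> option produces no output and reports the ticket state through its exit status; the function name <code>have_ticket</code> is hypothetical.

<syntaxhighlight lang="bash">
#!/bin/bash
# Succeeds only if klist is installed and the credential cache holds a
# readable, non-expired ticket.
have_ticket() {
    command -v klist >/dev/null 2>&1 && klist -s
}

if have_ticket; then
    echo "Kerberos ticket found."
else
    echo 'No valid Kerberos ticket; run "kinit $USER" first.' >&2
fi
</syntaxhighlight>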

BinAC2/Acknowledgement

2025-05-20T10:48:44Z

S Behnle: Added Acknowledgement Page for BinAC2

When preparing a publication describing work that involved the usage of a bwForCluster, e.g. BinAC2, please ensure that you reference the bwHPC initiative, the bwHPC-C5 project and – if appropriate – also the bwHPC facility itself. The following sample text is suggested as a starting point.

Acknowledgement:
The authors acknowledge support by the High Performance and Cloud Computing Group at the Zentrum für Datenverarbeitung of the University of Tübingen, the state of Baden-Württemberg through bwHPC
and the German Research Foundation (DFG) through grant no INST 37/1159-1 FUGG.

In addition, we kindly ask you to notify us of any reports, conference papers, journal articles, theses, posters, talks which contain results obtained on any bwHPC resource by sending an email to
[mailto:publications@bwhpc.de publications@bwhpc.de] stating:
* cluster facility (e.g. bwForCluster BinAC2)
* RV acronym (e.g. bw16A000)
* author(s)
* title ''or'' booktitle
* journal, volume, pages ''or'' editors, address, publisher
* year.

Such recognition is highly important for acquiring funding for the next generation of hardware, support services, data storage and infrastructure.

The publications will be referenced on the bwHPC website:
https://www.bwhpc.de/user_publications.html

SDS@hd/Access

2025-03-31T13:15:09Z

S Behnle: Changed the availability of SDS@HD on BinAC. It was and is currently only mounted on login03, not all login nodes.

This page provides an overview on how to access data served by SDS@hd. To get an introduction to data transfer in general, see [[Data_Transfer|data transfer]].

== Prerequisites ==

* You need to be [[SDS@hd/Registration|registered]].
* You need to be in the BelWü network. This means you have to use the VPN service of your home organization if you want to access SDS@hd from outside the bwHPC clusters (e.g. via eduroam or from your personal notebook).

== Needed Information, independent of the chosen tool ==

* [[Registration/Login/Username| Username]]: Same as for the bwHPC Clusters
* Password: The Service Password that you set at bwServices in the [[SDS@hd/Registration|registration step]].
* Hostname: The hostname depends on the chosen network protocol:
** For [[Data_Transfer/SSHFS|SSHFS]] and [[Data_Transfer/SFTP|SFTP]]: ''lsdf02-sshfs.urz.uni-heidelberg.de''
** For [[SDS@hd/Access/SMB|SMB]] and [[SDS@hd/Access/NFS|NFS]]: ''lsdf02.urz.uni-heidelberg.de''
** For [[Data_Transfer/WebDAV|WebDAV]] the url is: ''https://lsdf02-webdav.urz.uni-heidelberg.de''

== Recommended Setup ==
The following graphic shows the recommended way for accessing SDS@hd via Windows/Mac/Linux. The table provides an overview of the most important access options and links to the related pages. 
If you have various use cases, it is recommended to use [[Data_Transfer/Rclone|Rclone]]. You can copy, sync and mount with it. Thanks to its multithreading capability Rclone is a good fit for transferring big data. 
For an overview of all connection possibilities, please have a look at [[Data_Transfer/All_Data_Transfer_Routes|all data transfer routes]].

[[File:Data_transfer_diagram_simple.jpg|center|500px]]
Figure 1: SDS@hd main transfer routes

{| class="wikitable"
|- style="font-weight:bold; text-align:center; vertical-align:middle;"
!
! Use Case
! Windows
! Mac
! Linux
! Possible Bandwidth
! Firewall Ports
|-
| [[Data_Transfer/Rclone|Rclone]] + <protocol>
| copy, sync and mount, multithreading
| ✓
| ✓
| ✓
| depends on used protocol
| depends on used protocol
|-
| [[SDS@hd/Access/SMB|SMB]]
| mount as network drive in file explorer or usage via Rclone
| [[SDS@hd/Access/SMB#Windows|✓]]
| [[SDS@hd/Access/SMB#Mac|✓]]
| [[SDS@hd/Access/SMB#Linux|✓]]
| up to 40 Gbit/sec
| 139 (netbios), 135 (rpc), 445 (smb)
|-
| [[Data_Transfer/WebDAV|WebDAV]]
| go-to solution for restricted networks
| [✓]
| ✓
| ✓
| up to 100 Gbit/sec
| 80 (http), 443 (https)
|- style="vertical-align:middle;"
| [[Data_Transfer/Graphical_Clients#MobaXterm|MobaXterm]]
| Graphical User Interface (GUI)
| [[Data_Transfer/Graphical_Clients#MobaXterm|✓]]
| ☓
| ☓
| see sftp
| see sftp
|- style="vertical-align:middle;"
| [[SDS@hd/Access/NFS|NFS]]
| mount for multi-user environments
| ☓
| ☓
| [[SDS@hd/Access/NFS|✓]]
| up to 40 Gbit/sec
| -
|- style="vertical-align:middle;"
| [[Data_Transfer/SSHFS|SSHFS]]
| mount, needs stable internet connection
| ☓
| [[Data_Transfer/SSHFS#MacOS_&_Linux|✓]]
| [[Data_Transfer/SSHFS#MacOS_&_Linux|✓]]
| see sftp
| see sftp
|- style="vertical-align:middle;"
| [[Data_Transfer/SFTP|SFTP]]
| interactive shell, better usability when used together with Rclone
| [[Data_Transfer/SFTP#Windows|✓]]
| [[Data_Transfer/SFTP#MacOS_&_Linux|✓]]
| [[Data_Transfer/SFTP#MacOS_&_Linux|✓]]
| up to 40 Gbit/sec
| 22 (ssh)
|}
Table 1: SDS@hd transfer routes

=== Access from a bwHPC Cluster ===

'''bwForCluster Helix''' 
You can directly access your storage space under ''/mnt/sds-hd/'' on all login and compute nodes.

'''bwForCluster BinAC''' 
You can directly access your storage space on the login node ''login03''.

'''Other''' 
You can mount your SDS@hd SV on the cluster yourself by using [[Data_Transfer/Rclone#Usage_Rclone_Mount | Rclone mount]]. As transfer protocol you can use WebDAV or sftp. For a full overview please have a look at [[Data_Transfer/All_Data_Transfer_Routes | All Data Transfer Routes]].

=== Access via Webbrowser (read-only) ===

Visit [https://lsdf02-webdav.urz.uni-heidelberg.de/ lsdf02-webdav.urz.uni-heidelberg.de] and log in with your SDS@hd username and service password. Here you can get an overview of the data in your "Speichervorhaben" and download single files. To do more, such as moving data, uploading new files, or downloading complete folders, you need a suitable client as described above.

== Best Practices ==

* '''Managing access rights with ACLs''' -> Please set ACLs either via the [https://www.urz.uni-heidelberg.de/de/service-katalog/desktop-und-arbeitsplatz/windows-terminalserver Windows terminal server] or via bwForCluster Helix. ACL changes won't work when used locally on a mounted directory.
* '''Multiuser environment''' -> Use [[SDS@hd/Access/NFS|NFS]]

BinAC2/Hardware and Architecture

2025-03-31T07:41:57Z

S Behnle: Domains updated

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.

== Operating System and Software ==

* Operating System: Rocky Linux 9.4
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and two types of GPU nodes.
* 180 compute nodes
* 14 high-mem (SMP) nodes
* 32 GPU nodes (A30)
* 8 GPU nodes (A100)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
|-
!scope="column"| Quantity
| 180
| 14
| 32
| 8
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7443 AMD EPYC Milan 7443]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80
| 2.85
| 2.80
| 2.80
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96
| 64 / 128
| 64 / 128
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
|-
!scope="column" | Local Disk (GB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR IB (80 nodes) / 100GbE
| HDR
| HDR
| HDR
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gigabit Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; 84 standard compute nodes are connected via HDR InfiniBand (100 Gbit/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

= Storage =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) that offers fast, parallel file access and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directories. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are accessible only to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local SSD, accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 512 GB per node; 1920 GB on high-mem nodes
|-
!scope="column" | Speed
| ...
| ...
| ...
| ...
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes
| no
| no
| no
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs. The content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs must not write temporary data to $HOME.
Instead, they should use the node-local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible to all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Work ===

The data at <code>/pfs/10/work</code> is stored on SSDs. The primary focus is speed, not capacity.
In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user can create workspaces at <code>/pfs/10/work</code> through the workspace tools.
To create a work space, you need to supply a name for your work space and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

BinAC2/Hardware and Architecture

2025-01-27T09:31:25Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Astrophysics, and Geosciences.

== Operating System and Software ==

* Operating System: Rocky Linux 9.4
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and two types of GPU nodes.
* 180 compute nodes
* 14 high-mem (SMP) nodes
* 32 GPU nodes (A30)
* 8 GPU nodes (A100)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
|-
!scope="column"| Quantity
| 180
| 14
| 32
| 8
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7443 AMD EPYC Milan 7443]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80
| 2.85
| 2.80
| 2.80
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96
| 64 / 128
| 64 / 128
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
|-
!scope="column" | Local Disk (GB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR IB (80 nodes) / 100GbE
| HDR
| HDR
| HDR
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gigabit Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; 84 standard compute nodes are connected via HDR InfiniBand (100 Gbit/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

= Storage =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) that offers fast, parallel file access and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directories. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are accessible only to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local SSD, accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 512 GB per node; 1920 GB on high-mem nodes
|-
!scope="column" | Speed
| ...
| ...
| ...
| ...
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes
| no
| no
| no
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs. The content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs must not write temporary data to $HOME.
Instead, they should use the node-local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible to all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Work ===

The data at <code>/pfs/10/work</code> is stored on SSDs. The primary focus is speed, not capacity.
In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user can create workspaces at <code>/pfs/10/work</code> through the workspace tools.
To create a work space, you need to supply a name for your work space and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

BinAC2/Hardware and Architecture

2025-01-27T09:29:20Z

S Behnle:

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Astrophysics, and Geosciences.

== Operating System and Software ==

* Operating System: Rocky Linux 9.4
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and two types of GPU nodes.
* 180 compute nodes
* 14 high-mem (SMP) nodes
* 32 GPU nodes (A30)
* 8 GPU nodes (A100)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
|-
!scope="column"| Quantity
| 180
| 14
| 32
| 8
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7443 AMD EPYC Milan 7443]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
|-
!scope="column" | Processor Base Frequency (GHz)
| 2.80
| 2.85
| 2.80
| 2.80
|-
!scope="column" | Number of Physical Cores / Hyperthreads
| 64 / 128
| 48 / 96
| 64 / 128
| 64 / 128
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
|-
!scope="column" | Local Disk (GB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR IB (80 nodes) / 100GbE
| HDR
| HDR
| HDR
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
|}

= Network =

The compute nodes and the parallel file system are connected via 100 Gigabit Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand; 80 standard compute nodes are connected via HDR InfiniBand (100 Gbit/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.

= Storage =

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) that offers fast, parallel file access and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directories. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are accessible only to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local SSD, accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 512 GB per node; 1920 GB on high-mem nodes
|-
!scope="column" | Speed
| ...
| ...
| ...
| ...
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes
| no
| no
| no
|}

* global: all nodes access the same file system
* node local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs. The content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs must not write temporary data to $HOME.
Instead, they should use the node-local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible to all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Work ===

The data at <code>/pfs/10/work</code> is stored on SSDs. The primary focus is speed, not capacity.
In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user can create workspaces at <code>/pfs/10/work</code> through the workspace tools.
To create a work space, you need to supply a name for your work space and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.

BinAC2/Hardware and Architecture

2025-01-27T09:25:11Z

S Behnle: Fixed BinAC2 scratch disk sized

= Hardware and Architecture =

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Astrophysics, and Geosciences.

== Operating System and Software ==

* Operating System: Rocky Linux 9.4
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]

== Compute Nodes ==

BinAC 2 offers compute nodes, high-mem nodes, and two types of GPU nodes.
* 180 compute nodes
* 14 SMP (high-mem) nodes
* 32 GPU nodes (A30)
* 8 GPU nodes (A100)
* plus several special purpose nodes for login, interactive jobs, etc.

Compute node specification:
{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| Standard
! style="width:10%"| High-Mem
! style="width:10%"| GPU (A30)
! style="width:10%"| GPU (A100)
|-
!scope="column"| Quantity
| 180
| 14
| 32
| 8
|-
!scope="column" | Processors
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7443 AMD EPYC Milan 7443]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
| 2 x [https://www.amd.com/de/products/cpu/amd-epyc-7543 AMD EPYC Milan 7543]
|-
!scope="column" | Processor Frequency (GHz)
| 2.80
| 2.85
| 2.80
| 2.80
|-
!scope="column" | Number of Cores
| 64
| 48
| 64
| 64
|-
!scope="column" | Working Memory (GB)
| 512
| 2048
| 512
| 512
|-
!scope="column" | Local Disk (GB)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
| 450 (NVMe-SSD)
| 14000 (NVMe-SSD)
|-
!scope="column" | Interconnect
| HDR IB (80 nodes) / 100GbE
| HDR
| HDR
| HDR
|-
!scope="column" | Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30 (24 GB ECC HBM2, NVLink)]
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100 (80 GB ECC HBM2e)]
|}

== Network ==

The compute nodes and the parallel file system are connected via 100GbE Ethernet.
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand, but there are 80 standard compute nodes connected via HDR InfiniBand (100 Gbit/s). To run your jobs on the InfiniBand nodes, submit them with <code>--constraint=ib</code>.
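
As a minimal sketch, a batch script that requests the InfiniBand nodes might look as follows. Only <code>--constraint=ib</code> is taken from this page; the job name, node count, task count, and walltime are placeholder values chosen for illustration, and the script body merely reports where the job runs.

```shell
#!/bin/bash
#SBATCH --job-name=ib-example      # placeholder job name (assumption)
#SBATCH --nodes=2                  # placeholder node count (assumption)
#SBATCH --ntasks-per-node=1        # placeholder task count (assumption)
#SBATCH --time=00:10:00            # placeholder walltime (assumption)
#SBATCH --constraint=ib            # request InfiniBand nodes (from this page)

# Slurm sets SLURM_JOB_ID and SLURM_JOB_NODELIST inside a job.
# The fallbacks only keep this sketch runnable outside the batch system.
JOB_ID="${SLURM_JOB_ID:-none}"
NODES="${SLURM_JOB_NODELIST:-$(uname -n)}"

echo "Job ${JOB_ID} running on: ${NODES}"
```

The script would be submitted with <code>sbatch</code>, for example <code>sbatch ib-example.sh</code> (the file name is hypothetical).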

== Storage ==

The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space.
The home directory is limited in space and parallel access but offers snapshots of your files and backup.

The project/work storage is a parallel file system (PFS) that offers fast, parallel file access and a larger capacity than the home directory. It is mounted at <code>/pfs/10</code> on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directories. Each compute project has its own directory at <code>/pfs/10/project</code> that is accessible to all members of the compute project.
Each user can create workspaces under <code>/pfs/10/work</code> using the workspace tools. These directories are accessible only to the user who created the workspace.

Additionally, each compute node provides high-speed temporary storage on a node-local solid-state disk (SSD), accessible via the $TMPDIR environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| project
! style="width:10%"| work
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| permanent
| work space lifetime (max. 30 days, max. 5 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| -
| 8.1 PB
| 1000 TB
| 512 GB per node; 1920 GB on high-mem nodes
|-
!scope="column" | Speed
| ...
| ...
| ...
| ...
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| not yet, maybe in the future
| none
| none
|-
!scope="column" | Backup
| yes
| no
| no
| no
|}

* global: all nodes access the same file system
* local: each node has its own file system
* permanent: files are stored permanently
* batch job walltime: files are removed at the end of the batch job

=== Home ===

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
Because the backup space is limited, we enforce a quota of 40 GB on the home directories.

'''NOTE:'''
Compute jobs must not write temporary data to $HOME.
Instead, they should use the node-local $TMPDIR directory for I/O-heavy use cases
and work spaces for less I/O-intensive multi-node jobs.


=== Project ===

Each compute project has its own project directory at <code>/pfs/10/project</code>.

<pre>
$ ls -lh /pfs/10/project/
drwxrwx---. 2 root bw16f003 33K Dec 12 16:46 bw16f003
[...]
</pre>

As you can see, the directory is owned by a group representing your compute project (here bw16f003) and is accessible by all group members. It is up to your group to decide how to use the space inside this directory: shared data folders, personal directories for each project member, software containers, etc.

The data is stored on HDDs. The primary focus of <code>/pfs/10/project</code> is pure capacity, not speed.

=== Work ===

The data at <code>/pfs/10/work</code> is stored on SSDs. The primary focus is speed, not capacity.
In contrast to BinAC 1 we will enforce work space lifetime, as the capacity is limited.
We ask you to only store data you actively use for computations on <code>/pfs/10/work</code>.
Please move data to <code>/pfs/10/project</code> when you don't need it on the fast storage any more.

Each user can create workspaces at <code>/pfs/10/work</code> through the workspace tools.
To create a work space, you need to supply a name for the work space and a lifetime in days.
For more information, read the corresponding help, e.g. <code>ws_allocate -h</code>.
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<code>ws_allocate mywork 30</code>
|Allocate a work space named "mywork" for 30 days.
|-
|<code>ws_allocate myotherwork</code>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<code>ws_list -a</code>
|List all your work spaces.
|-
|<code>ws_find mywork</code>
|Get absolute path of work space "mywork".
|-
|<code>ws_extend mywork 30</code>
|Extend the lifetime of work space "mywork" by 30 days from now.
|-
|<code>ws_release mywork</code>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}
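
The commands in the table can be combined into a small workflow. The following is only a sketch: it allocates the work space "mywork", writes a stand-in result file there, and then moves the result to the project directory once it is no longer needed on the fast storage. The project ID <code>bw16f003</code> is the example shown on this page, the <code>results</code> and <code>archive</code> directory names are hypothetical, and the fallback to temporary directories is an assumption that only keeps the sketch runnable on machines without the workspace tools.

```shell
#!/bin/bash
# Sketch: allocate a work space, use it, then archive results to project storage.
WS_NAME="mywork"                          # example name from the table above
PROJECT_DIR="/pfs/10/project/bw16f003"    # example project ID from this page

if command -v ws_allocate >/dev/null 2>&1; then
    ws_allocate "$WS_NAME" 30             # allocate the work space for 30 days
    WS_DIR="$(ws_find "$WS_NAME")"        # absolute path of the work space
else
    # Assumption: stand-in directories where the workspace tools are missing
    WS_DIR="$(mktemp -d)"
    PROJECT_DIR="$(mktemp -d)"
fi

# Produce a stand-in result on the fast work storage.
mkdir -p "$WS_DIR/results"
echo "example output" > "$WS_DIR/results/result.txt"

# Move finished results to the capacity-oriented project storage:
# copy first, then remove the copy on the work space.
mkdir -p "$PROJECT_DIR/archive"
cp -r "$WS_DIR/results" "$PROJECT_DIR/archive/"
rm -r "$WS_DIR/results"

echo "Archived results to $PROJECT_DIR/archive/results"
```

Copying before deleting means an interrupted transfer does not lose data on the work space.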

=== Scratch ===

Please use the fast local scratch space for storing temporary data during your jobs.

For each job a scratch directory will be created on the compute nodes. It is available via the environment variable <code>$TMPDIR</code>, which points to <code>/scratch/<jobID></code>.
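
A typical pattern is to stage input data to the scratch directory, run the I/O-heavy step there, and copy the results back before the job ends. The sketch below illustrates this under stated assumptions: the input file, the <code>sort</code> call standing in for the real computation, and the <code>SUBMIT_DIR</code> location are all hypothetical, and the <code>/tmp</code> fallback only keeps the example runnable outside a batch job.

```shell
#!/bin/bash
# Sketch: stage in, compute on node-local scratch, stage out.

# Inside a job, $TMPDIR points to the per-job scratch directory.
# Assumption: outside a job, fall back to /tmp.
SCRATCH="$(mktemp -d "${TMPDIR:-/tmp}/scratch.XXXXXX")"

# Hypothetical location of input and results; in practice this
# would be a work space or project directory.
SUBMIT_DIR="$(mktemp -d "${TMPDIR:-/tmp}/submit.XXXXXX")"
printf 'c\nb\na\n' > "$SUBMIT_DIR/input.txt"

# 1. Stage input data to scratch.
cp "$SUBMIT_DIR/input.txt" "$SCRATCH/"

# 2. Run the I/O-heavy step on scratch (sort is a stand-in).
sort "$SCRATCH/input.txt" > "$SCRATCH/output.txt"

# 3. Copy results back; scratch is removed at the end of the job.
cp "$SCRATCH/output.txt" "$SUBMIT_DIR/"

echo "Results written to $SUBMIT_DIR/output.txt"
```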