<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.bwhpc.de/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=R+Laifer</id>
	<title>bwHPC Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.bwhpc.de/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=R+Laifer"/>
	<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/e/Special:Contributions/R_Laifer"/>
	<updated>2026-05-12T00:39:12Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.17</generator>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14609</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14609"/>
		<updated>2025-04-03T15:40:23Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Access restrictions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the largest HPC systems use Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its path. Users can create so-called workspaces on another Lustre file system for non-permanent data with a temporary lifetime. Another workspace file system based on flash storage is available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system IBM Storage Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in the following table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; for details see Table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user &amp;lt;br&amp;gt; (250 GiB for MA users); &amp;lt;br&amp;gt; also limits per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; (2.5 million for MA users)&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of bwUniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally delete data on $HOME,&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]] &lt;br /&gt;
should be stored below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim (MA) the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the &#039;&#039;limits&#039;&#039; &lt;br /&gt;
columns) are 10 percent higher; for example, a 500 GiB soft limit corresponds to a 550 GiB hard limit. &lt;br /&gt;
If you are above the soft limit and below the hard limit during the grace period (7 days), your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit, your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# derive the Lustre project ID of your organization from your home directory path&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is specially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s for writes and reads when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime, and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed 3 times, up to a total maximum of 240 days after workspace creation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
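&lt;br /&gt;
A minimal sketch of this lifecycle with the workspace tools (the workspace name &amp;lt;code&amp;gt;myws&amp;lt;/code&amp;gt; is just an example; &amp;lt;code&amp;gt;ws_extend&amp;lt;/code&amp;gt; is the renewal command of the same tool suite):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# create a workspace with the maximum lifetime of 60 days&lt;br /&gt;
$ ws_allocate myws 60&lt;br /&gt;
# near the end of the period, renew it for another 60 days (possible up to 3 times)&lt;br /&gt;
$ ws_extend myws 60&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;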
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry that reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only restore within the same filesystem. You therefore have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace; use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
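&lt;br /&gt;
For example, a hypothetical invocation that keeps the links below your home directory:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_register $HOME/workspaces&lt;br /&gt;
$ ls -l $HOME/workspaces&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;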
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once (see the sketch after this list),&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks, or use blocks with boundaries at the stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node, store them on $TMPDIR.&lt;br /&gt;
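&lt;br /&gt;
As a sketch of the first point, the following writes one large file sequentially in blocks that match the default stripe size (the workspace name &amp;lt;code&amp;gt;myws&amp;lt;/code&amp;gt; is hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# write 4 GiB in 1 MiB blocks, aligned with the default stripe size&lt;br /&gt;
$ dd if=/dev/zero of=$(ws_find myws)/bigfile bs=1M count=4096&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;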
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size, and the number of used storage subsystems is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even for very large files or to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
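&lt;br /&gt;
For reference, a sketch of the standard Lustre client commands to inspect and set striping parameters (the paths are hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# show the striping parameters of an existing file or directory&lt;br /&gt;
$ lfs getstripe $(ws_find myws)/bigfile&lt;br /&gt;
# let new files in a directory be striped over 8 storage targets&lt;br /&gt;
$ lfs setstripe -c 8 $(ws_find myws)/newdir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;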
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task (see the sketch after this list)&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node, store them on $TMPDIR&lt;br /&gt;
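&lt;br /&gt;
A sketch of the per-task subdirectory approach, assuming a hypothetical application &amp;lt;code&amp;gt;myapp&amp;lt;/code&amp;gt; whose tasks are launched with srun ($SLURM_PROCID is set by Slurm for each task):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# each task writes into its own subdirectory to avoid contention on one directory&lt;br /&gt;
mkdir -p output/task_${SLURM_PROCID}&lt;br /&gt;
myapp -outputdir output/task_${SLURM_PROCID}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;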
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# From the Ice Lake nodes of bwUniCluster 3.0 (queue &#039;&#039;cpu_il&#039;&#039;) the network distance and latency are low compared to the normal workspace file system.&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 3.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 3.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; for all the commands that manage workspaces. On bwUniCluster 3.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 3.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 3.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that each workspace only has to be managed on one of the clusters, since the workspace directory names differ between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, you can restore expired workspaces within 30 days after workspace expiration. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]]. &lt;br /&gt;
The capacity of $TMPDIR is at least 1400 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
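&lt;br /&gt;
Inside a batch job you can inspect the job-private directory and the free space on the local SSD, e.g.:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# print the job-private directory (its name contains the job ID) and check free space&lt;br /&gt;
echo $TMPDIR&lt;br /&gt;
df -h $TMPDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;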
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;All data on $TMPDIR will be deleted when your job completes.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Make sure you have copied your results back to a global filesystem, e.g., $HOME or a workspace, within your job.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should go into the $HOME folder.&lt;br /&gt;
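&lt;br /&gt;
A minimal sketch of this workflow on a login node (the package name &amp;lt;code&amp;gt;mypkg&amp;lt;/code&amp;gt; and its build system are hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# create your own unique subdirectory and build there&lt;br /&gt;
mkdir -p $TMPDIR/$(whoami)/build&lt;br /&gt;
cd $TMPDIR/$(whoami)/build&lt;br /&gt;
tar -xzf $HOME/mypkg.tar.gz&lt;br /&gt;
cd mypkg&lt;br /&gt;
# install the finished package into $HOME, not into $TMPDIR&lt;br /&gt;
./configure --prefix=$HOME/software/mypkg&lt;br /&gt;
make&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;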
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup of /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
Below we provide an example of using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc3n991 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc3n991 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols. It is only available for certain users. For information on how to request storage projects on the LSDF Online Storage, see [https://www.scc.kit.edu/en/services/lsdf].&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage is mounted on the login nodes. It will also be mounted on the compute nodes of your batch job if you request it with the constraint flag &#039;&#039;LSDF&#039;&#039;. You have one of the following options:&lt;br /&gt;
&lt;br /&gt;
1. Add the line &amp;lt;code&amp;gt;#SBATCH --constraint=LSDF&amp;lt;/code&amp;gt; after the initial lines of your script job.sh:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=120&lt;br /&gt;
#SBATCH --mem=200&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to access the LSDF Online Storage the following environment variables are available: &lt;br /&gt;
&amp;lt;code&amp;gt;$LSDF&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFPROJECTS&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFHOME&amp;lt;/code&amp;gt;&lt;br /&gt;
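&lt;br /&gt;
For example, inside a job that requested the LSDF constraint, results could be copied to a storage project (the project name &amp;lt;code&amp;gt;myproject&amp;lt;/code&amp;gt; is hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# list the mounted LSDF projects and copy results into one of them&lt;br /&gt;
ls $LSDFPROJECTS&lt;br /&gt;
rsync -av $TMPDIR/results $LSDFPROJECTS/myproject/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;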
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users can request a private BeeOND (BeeGFS On-Demand) parallel filesystem for each job. The file system is created during job startup and purged when your job completes. &lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;All data on the private BeeOND filesystem will be deleted when your job completes.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Make sure you have copied your results back to a global filesystem, e.g., $HOME or a workspace, within your job.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. All nodes of the batch job have access to the same data below the same path. Tools like cp or rsync can be used to copy data in and out (see the staging sketch at the end of this section). &lt;br /&gt;
&lt;br /&gt;
Starting and stopping BeeOND is integrated in the prolog and epilog of the cluster batch system Slurm. It can be used during job runtime if the compute nodes are used exclusively. You can request the creation of a BeeOND file system with the constraint flags &#039;&#039;BEEOND&#039;&#039;, &#039;&#039;BEEOND_4MDS&#039;&#039; or &#039;&#039;BEEOND_MAXMDS&#039;&#039;.&lt;br /&gt;
* BEEOND: one metadata server is started on the first node&lt;br /&gt;
* BEEOND_4MDS: 4 metadata servers are started within your job. If your job has fewer than 4 nodes, correspondingly fewer metadata servers are started.&lt;br /&gt;
* BEEOND_MAXMDS: on every node of your job a metadata server for the on-demand file system is started&lt;br /&gt;
&lt;br /&gt;
As a starting point we recommend using the constraint &#039;&#039;BEEOND&#039;&#039;. You have one of the following options to request the constraint:&lt;br /&gt;
&lt;br /&gt;
1. Add the line &amp;lt;code&amp;gt;#SBATCH --constraint=BEEOND&amp;lt;/code&amp;gt; to your job script:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=BEEOND   # or BEEOND_4MDS or BEEOND_MAXMDS&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -N &amp;lt;# of nodes&amp;gt; -t &amp;lt;runtime&amp;gt; --mem &amp;lt;mem&amp;gt; -C BEEOND job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After your job has started you can find the private on-demand file system in the &#039;&#039;&#039;/mnt/odfs/${SLURM_JOB_ID}&#039;&#039;&#039; directory. The mountpoint comes with five pre-configured directories:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# For small files (stripe count = 1)&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_1&lt;br /&gt;
# Stripe count = 4&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_default &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_4&lt;br /&gt;
# Stripe count = 8, 16 or 32; use these directories for medium-sized and large files or when using MPI-IO&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_8&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_16 &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_32&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you request fewer nodes than the stripe count, the stripe count will be the number of nodes. For example, if you only request 8 nodes the directory stripe_16 has only a stripe count of 8.&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;font color=red&amp;gt;&#039;&#039;&#039;Attention:&#039;&#039;&#039;&amp;lt;/font&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
:Always use the directory with the greatest stripe count for large files. E.g. if your largest file is 3.1 TB, then you have to use a stripe count greater than 4 (4 x 750 GB = 3 TB), otherwise the available disk space is exceeded.&lt;br /&gt;
&lt;br /&gt;
The capacity of the private file system depends on the number of nodes: for each node you get 750 GB.&lt;br /&gt;
If you request 100 nodes for your job, the private file system has approximately 100 * 750 GB = 75 TB of capacity.&lt;br /&gt;
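&lt;br /&gt;
A sketch of staging data in and out of BeeOND inside a job script (the workspace &amp;lt;code&amp;gt;myws&amp;lt;/code&amp;gt; and the application &amp;lt;code&amp;gt;myapp&amp;lt;/code&amp;gt; are hypothetical):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# on-demand file system of this job, directory with stripe count 8&lt;br /&gt;
ODFS=/mnt/odfs/${SLURM_JOB_ID}/stripe_8&lt;br /&gt;
# stage input in, run the application, and save results before the job completes&lt;br /&gt;
cp -a $(ws_find myws)/input $ODFS/&lt;br /&gt;
myapp -input $ODFS/input -outputdir $ODFS/results&lt;br /&gt;
rsync -av $ODFS/results $(ws_find myws)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;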
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; note, however, that ACLs and extended&lt;br /&gt;
attributes are not saved in backups. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need to restore data from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14608</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14608"/>
		<updated>2025-04-03T15:39:56Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* File System Details */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system IBM Storage Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in the following table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]] &lt;br /&gt;
should be stored below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown on the &#039;&#039;limits&#039;&#039; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days) your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can chek your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calender entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work on the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Therefore, you can use &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within in the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also mentioned as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted if the file size is growing. In normal cases users no longer need to adapt file striping parameters in case they have very huge files or in order to reach better performance. If you know what you are doing you can still change striping parameters but further explanation is beyond the scope of this documentation.&lt;br /&gt;
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should omit metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node store them on $TMPDIR&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system for special requirements available. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# From the Ice Lake nodes of bwUniCluster 3.0 (queue &#039;&#039;cpu_il&#039;&#039;) the network distance and latency is low compared to the normal workspace file system.&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 3.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; to all the commands that manage workspaces. On bwUniCluster 3.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 3.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 3.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you manage a particular workspace on one of the clusters only, since the name of the workspace directory differs between the clusters. However, the path of each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, you can restore expired workspaces within 30 days after workspace expiration. There are quota limits with a default of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name on different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]]. &lt;br /&gt;
The capacity of $TMPDIR is at least 1400 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
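As a minimal illustration, the following line inside a multi-node job prints $TMPDIR once per node; the path (containing the job ID) is identical everywhere, but each node holds its own local data:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# One task per node: same path name, different physical storage&lt;br /&gt;
srun --ntasks-per-node=1 bash -c &#039;echo &amp;quot;$(hostname): $TMPDIR&amp;quot;&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;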
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;All data on $TMPDIR will be deleted when your job completes.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Make sure you have copied your results back to a global filesystem, e.g., $HOME or a workspace, within your job.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
On the login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages, i.e. the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should be made into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
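A possible workflow on a login node is sketched below; the package name and installation prefix are placeholders:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a unique build directory on the fast local SSD&lt;br /&gt;
BUILDDIR=$(mktemp -d $TMPDIR/build.XXXXXX)&lt;br /&gt;
cd $BUILDDIR&lt;br /&gt;
# Unpack, configure and compile on $TMPDIR ...&lt;br /&gt;
tar -xzf $HOME/mypackage.tar.gz&lt;br /&gt;
cd mypackage&lt;br /&gt;
./configure --prefix=$HOME/software/mypackage&lt;br /&gt;
make -j 8&lt;br /&gt;
# ... but install the result into the $HOME folder&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;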
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup of /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, which can cause issues for you and for other users. $TMPDIR, on the other hand, is created when the job starts and removed when the job completes, i.e. the cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc3n991 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc3n991 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols. It is only available for certain users. For information on how to request storage projects on the LSDF Online Storage see [https://www.scc.kit.edu/en/services/lsdf].&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage is mounted on the login nodes. It will also be mounted on the compute nodes of your batch job if you request it with the constraint flag &#039;&#039;LSDF&#039;&#039;. You have one of the following options:&lt;br /&gt;
&lt;br /&gt;
1. Add after the initial lines of your script job.sh the line &amp;lt;code&amp;gt;#SBATCH --constraint=LSDF&amp;lt;/code&amp;gt;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=120&lt;br /&gt;
#SBATCH --mem=200&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to access the LSDF Online Storage the following environment variables are available: &lt;br /&gt;
&amp;lt;code&amp;gt;$LSDF&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFPROJECTS&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFHOME&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
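As an illustration, a job script might stage input data from an LSDF project directory to the local SSD. This is only a sketch: it assumes that $LSDFPROJECTS points to the directory containing your storage projects, and the project and application names are placeholders:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&lt;br /&gt;
# Stage input from the LSDF Online Storage to the fast local SSD&lt;br /&gt;
cp -r $LSDFPROJECTS/myproject/input $TMPDIR/&lt;br /&gt;
myapp -input $TMPDIR/input/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
# Save results back to the LSDF project directory&lt;br /&gt;
rsync -av $TMPDIR/results $LSDFPROJECTS/myproject/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;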
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged when your job completes. &lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;All data on the private BeeOND filesystem will be deleted when your job completes.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Make sure you have copied your results back to a global filesystem, e.g., $HOME or a workspace, within your job.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. All nodes of the batch job have access to the same data below the same path. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
Starting and stopping BeeOND is integrated into the prolog and epilog of the cluster batch system Slurm. It can be used during job runtime if the compute nodes are used exclusively. You can request the creation of a BeeOND file system with the constraint flags &#039;&#039;BEEOND&#039;&#039;, &#039;&#039;BEEOND_4MDS&#039;&#039; or &#039;&#039;BEEOND_MAXMDS&#039;&#039;.&lt;br /&gt;
* BEEOND: one metadata server is started on the first node&lt;br /&gt;
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.&lt;br /&gt;
* BEEOND_MAXMDS: on every node of your job a metadata server for the on-demand file system is started&lt;br /&gt;
&lt;br /&gt;
As a starting point we recommend using the constraint &#039;&#039;BEEOND&#039;&#039;. You have one of the following options to request the constraint:&lt;br /&gt;
&lt;br /&gt;
1. Add the line &amp;lt;code&amp;gt;#SBATCH --constraint=BEEOND&amp;lt;/code&amp;gt; to your job script:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=BEEOND   # or BEEOND_4MDS or BEEOND_MAXMDS&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -N &amp;lt;# of nodes&amp;gt; -t &amp;lt;runtime&amp;gt; --mem &amp;lt;mem&amp;gt; -C BEEOND job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After your job has started you can find the private on-demand file system in &#039;&#039;&#039;/mnt/odfs/${SLURM_JOB_ID}&#039;&#039;&#039; directory. The mountpoint comes with five pre-configured directories:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# For small files (stripe count = 1)&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_1&lt;br /&gt;
# Stripe count = 4&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_default &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_4&lt;br /&gt;
# Stripe count = 8, 16 or 32; use these directories for medium-sized and large files or when using MPI-IO&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_8&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_16 &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_32&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you request fewer nodes than the stripe count, the stripe count will be reduced to the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 only has a stripe count of 8.&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;font color=red&amp;gt;&#039;&#039;&#039;Attention:&#039;&#039;&#039;&amp;lt;/font&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
:Always use the directory with the greatest stripe count for large files. E.g. if your largest file is 3.1 TB, you have to use a stripe count greater than 4 (4 x 750 GB), otherwise the available disk space is exceeded.  &lt;br /&gt;
&lt;br /&gt;
The capacity of the private file system depends on the number of nodes. For each node you get 750 GB.&lt;br /&gt;
If you request 100 nodes for your job, the private file system has a capacity of about 100 * 750 GB ~ 75 TB.&lt;br /&gt;
&lt;br /&gt;
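A minimal job script sketch putting the pieces above together; the node count, application name and workspace are placeholders:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH -t 12:00:00&lt;br /&gt;
#SBATCH --constraint=BEEOND&lt;br /&gt;
&lt;br /&gt;
ODFS=/mnt/odfs/${SLURM_JOB_ID}&lt;br /&gt;
# Stage shared input data into the on-demand file system&lt;br /&gt;
cp -r $(ws_find data-ssd)/dataset ${ODFS}/stripe_default/&lt;br /&gt;
# All nodes of the job see the same data below ${ODFS}&lt;br /&gt;
srun myapp -input ${ODFS}/stripe_default/dataset -outputdir ${ODFS}/stripe_default/results&lt;br /&gt;
# Save results to a workspace before the job (and BeeOND) ends&lt;br /&gt;
rsync -av ${ODFS}/stripe_default/results $(ws_find data-ssd)/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;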
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not saved in backups. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need to restore data from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14607</id>
		<title>BwUniCluster3.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14607"/>
		<updated>2025-04-03T15:34:26Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* File Systems */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 3.0 =&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039; is a parallel computer with distributed memory. &lt;br /&gt;
It consists of the bwUniCluster 3.0 components procured in 2024 and also includes the additional compute nodes which were procured as an extension to the bwUniCluster 2.0 in 2022.&lt;br /&gt;
 &lt;br /&gt;
Each node of the system consists of two Intel Xeon or AMD EPYC processors, local memory, local storage, network adapters and optional accelerators (NVIDIA A100 and H100, AMD Instinct MI300A). All nodes are connected via a fast InfiniBand interconnect.&lt;br /&gt;
&lt;br /&gt;
The parallel file system (Lustre) is connected to the InfiniBand switch of the compute cluster. This provides a fast and scalable parallel file &lt;br /&gt;
system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 9.4.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system act in different roles. From an end user&#039;s point of view the different groups of nodes are login nodes and compute nodes. File server nodes and administrative server nodes are not accessible to users.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes directly accessible by end users. These nodes are used for interactive login, file management, program development, and interactive pre- and post-processing.&lt;br /&gt;
There are two nodes dedicated to this service, and both can be reached via a single address: &amp;lt;code&amp;gt;uc3.scc.kit.edu&amp;lt;/code&amp;gt;. A DNS round-robin alias distributes login sessions to the login nodes.&lt;br /&gt;
To prevent login nodes from being used for activities that are not permitted there and that affect the user experience of other users, &#039;&#039;&#039;long-running and/or compute-intensive tasks are periodically terminated without any prior warning&#039;&#039;&#039;. Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Systems&#039;&#039;&#039;&lt;br /&gt;
bwUniCluster 3.0 comprises two parallel file systems based on Lustre.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:uc3.png|center|800px]]&lt;br /&gt;
&lt;br /&gt;
= Compute Resources =&lt;br /&gt;
&lt;br /&gt;
== Login nodes ==&lt;br /&gt;
&lt;br /&gt;
After a successful [[BwUniCluster3.0/Login|login]], users find themselves on one of the so-called login nodes. Technically, these largely correspond to a standard CPU node, i.e. users have two AMD EPYC 9454 processors with a total of 96 cores at their disposal. Login nodes are the bridgehead for accessing computing resources.&lt;br /&gt;
Data and software are organized here, computing jobs are initiated and managed, and computing resources allocated for interactive use can also be accessed from here.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Any compute intensive job running on the login nodes will be terminated without any notice.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Compute nodes ==&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are executed automatically via a batch script or the nodes can be accessed interactively. Please refer to [[BwUniCluster3.0/Running_Jobs|Running Jobs]] on how to request resources.&amp;lt;br&amp;gt;&lt;br /&gt;
The following compute node types are available:&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt;CPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Standard&#039;&#039;&#039;: Two AMD EPYC 9454 processors per node with a total of 96 physical CPU cores or 192 logical cores (Hyper-Threading) per node. The nodes have been procured in 2024.&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake&#039;&#039;&#039;: Two Intel Xeon Platinum 8358 processors per node with a total of 64 physical CPU cores or 128 logical cores (Hyper-Threading) per node. The nodes have been procured in 2022 as an extension to bwUniCluster 2.0.&lt;br /&gt;
* &#039;&#039;&#039;High Memory&#039;&#039;&#039;: Similar to the standard nodes, but with six times larger memory.&lt;br /&gt;
&amp;lt;b&amp;gt;GPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;NVIDIA GPU x4&#039;&#039;&#039;: Similar to the standard nodes, but with larger memory and four NVIDIA H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;AMD GPU x4&#039;&#039;&#039;: AMD&#039;s accelerated processing unit (APU) MI300A with 4 CPU sockets and 4 compute units which share the same high-bandwidth memory (HBM).&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake NVIDIA GPU x4&#039;&#039;&#039;: Similar to the Ice Lake nodes, but with larger memory and four NVIDIA A100 or H100 GPUs.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Login nodes&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Availability in [[BwUniCluster3.0/Running_Jobs#Queues_on_bwUniCluster_3.0| queues]]&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 272&lt;br /&gt;
| 70&lt;br /&gt;
| 4&lt;br /&gt;
| 12&lt;br /&gt;
| 1&lt;br /&gt;
| 15&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD Zen 4&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency&lt;br /&gt;
| 2.6 GHz &lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 3.7 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96 (4x 24)&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 2.3 TB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 4x 128 GB HBM3&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 3.84 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 7.68 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe &lt;br /&gt;
| 7.68 TB SATA SSD&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA H100 &lt;br /&gt;
| 4x AMD Instinct MI300A&lt;br /&gt;
| 4x NVIDIA A100 / H100 &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 94 GB&lt;br /&gt;
| APU (HBM shared with the CPU)&lt;br /&gt;
| 80 GB / 94 GB &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR200 &lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 4x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x HDR200 &lt;br /&gt;
| IB 1x NDR200&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Hardware overview and properties&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the following file systems are available:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;$HOME&#039;&#039;&#039;&amp;lt;br&amp;gt;The HOME directory is created automatically after account activation, and the environment variable $HOME holds its name. HOME is the place where users find themselves after login.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces&#039;&#039;&#039;&amp;lt;br&amp;gt;Users can create so-called workspaces for non-permanent data with temporary lifetime.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces on flash storage&#039;&#039;&#039;&amp;lt;br&amp;gt;A further workspace file system based on flash-only storage is available for special requirements and certain users.&lt;br /&gt;
* &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039;&amp;lt;br&amp;gt;The directory $TMPDIR is only available and visible on the local node during the runtime of a compute job. It is located on fast SSD storage devices.&lt;br /&gt;
* &#039;&#039;&#039;BeeOND&#039;&#039;&#039; (BeeGFS On-Demand)&amp;lt;br&amp;gt;On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* &#039;&#039;&#039;LSDF Online Storage&#039;&#039;&#039;&amp;lt;br&amp;gt;On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. On the login nodes, LSDF is automatically mounted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Which file system to use?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in Table 1 above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used by many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system BeeOND. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details|File System Details]]&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The $HOME directories of bwUniCluster 3.0 users are located on the parallel file system Lustre.&lt;br /&gt;
You have access to your $HOME directory from all nodes of UC3. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$HOME|Detailed information on $HOME]]&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On UC3 workspaces should be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It can provide high data transfer rates of up to 40 GB/s for write and read when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On UC3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces|Detailed information on Workspaces]]&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system based on flash-only storage is available for special requirements and certain users.&lt;br /&gt;
If possible, this file system should be used from the Ice Lake nodes of bwUniCluster 3.0 (queue &#039;&#039;cpu_il&#039;&#039;). &lt;br /&gt;
It provides high IOPS rates and better performance for small files. The quota limits are lower than on the &lt;br /&gt;
normal workspace file system.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces_on_flash_storage|Detailed information on Workspaces on flash storage]]&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. &lt;br /&gt;
This directory should be used for temporary files being accessed from the local node. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. &lt;br /&gt;
Because of the extremely fast local SSD storage devices, performance with small files is much better than on the parallel file systems. &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$TMPDIR|Detailed information on $TMPDIR]]&lt;br /&gt;
&lt;br /&gt;
== BeeOND (BeeGFS On-Demand) ==&lt;br /&gt;
&lt;br /&gt;
Users have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged when your job completes.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#BeeOND_(BeeGFS_On-Demand)|Detailed information on BeeOND]]&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols and is only available for certain users.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#LSDF_Online_Storage|Detailed information on LSDF Online Storage]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14606</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14606"/>
		<updated>2025-04-03T15:20:46Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* File System Details */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system IBM Storage Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in the following table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user &amp;lt;br&amp;gt; (250 GiB for MA users); &amp;lt;br&amp;gt; also limited per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; (2.5 million for MA users)&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]] &lt;br /&gt;
should be stored below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used by many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the &#039;&#039;limits&#039;&#039; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit, &lt;br /&gt;
your I/O operations will show a warning message during the grace period (7 days). If the grace period has &lt;br /&gt;
passed or if you are above the hard limit, your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It can provide high data transfer rates of up to 40 GB/s for write and read when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime, and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed 3 times at the end of that period, up to a total maximum of 240 days after workspace creation.&lt;br /&gt;
&lt;br /&gt;
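For example, the following sketch allocates a workspace with the maximum lifetime and later extends it; the workspace name is arbitrary:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Allocate a workspace named myws with the maximum lifetime of 60 days&lt;br /&gt;
ws_allocate myws 60&lt;br /&gt;
# Show workspaces with their remaining lifetime and extensions&lt;br /&gt;
ws_list&lt;br /&gt;
# Extend the workspace by another 60 days (possible 3 times)&lt;br /&gt;
ws_extend myws 60&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;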
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; only works within the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace; use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed (see the sketch below).&lt;br /&gt;
&lt;br /&gt;
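Putting these notes together, a complete restore on the flash file system could look like the following sketch; the user name, workspace names and timestamp are placeholders:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# List expired workspaces on the flash file system and note the full name&lt;br /&gt;
ws_restore -l -F ffuc&lt;br /&gt;
# Create a target workspace on the same file system&lt;br /&gt;
ws_allocate -F ffuc my_restored 30&lt;br /&gt;
# Restore the expired data into the target workspace&lt;br /&gt;
ws_restore -F ffuc ab1234-myws-1234567890 my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;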
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once (see the sketch after this list),&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at the stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node, store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
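As a simple illustration of the first point, one large sequential write with a large block size (here via dd; the file name and sizes are arbitrary) performs much better than writing the same amount of data in many small chunks:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Write 10 GiB sequentially in 4 MiB blocks to a workspace&lt;br /&gt;
dd if=/dev/zero of=$(ws_find myws)/testfile bs=4M count=2560 conv=fsync&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;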
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are split into stripes which are distributed across different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size, and the number of storage subsystems used is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define the file striping parameters, which means that the stripe count is adapted as the file size grows. Hence, users normally no longer need to adapt the striping parameters, even for very large files or to reach better performance. If you know what you are doing you can still change the striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as on local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have a few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node, store them on $TMPDIR&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# From the Ice Lake nodes of bwUniCluster 3.0 (queue &#039;&#039;cpu_il&#039;&#039;) the network distance and latency are low compared to the normal workspace file system.&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence, read and write performance with small blocks and small files is better than on the other parallel file systems, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 3.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to pass the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; to all the commands that manage workspaces. On bwUniCluster 3.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace named myws with a lifetime of 60 days on bwUniCluster 3.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 3.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you manage a particular workspace on one of the clusters only, since the name of the workspace directory differs between the clusters. However, the path of each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, you can restore expired workspaces within 30 days after workspace expiration. There are quota limits with a default of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name on different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]]. &lt;br /&gt;
The capacity of $TMPDIR is at least 1400 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;All data on $TMPDIR will be deleted when your job completes.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Make sure you have copied your results back to a global filesystem, e.g., $HOME or a workspace, within your job.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
On the login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages, i.e. the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should be made into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup of /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, which can cause issues for you and for other users. $TMPDIR, on the other hand, is created when the job starts and removed when the job completes, i.e. the cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc3n991 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc3n991 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols. It is only available for certain users. For information on how to request storage projects on the LSDF Online Storage see [[https://www.scc.kit.edu/en/services/lsdf]].&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage is mounted on the login nodes. It will also be mounted on the compute nodes of your batch job if you request it with the constraint flag &#039;&#039;LSDF&#039;&#039;. You have one of the following options:&lt;br /&gt;
&lt;br /&gt;
1. Add after the initial lines of your script job.sh the line &amp;lt;code&amp;gt;#SBATCH --constraint=LSDF&amp;lt;/code&amp;gt;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=120&lt;br /&gt;
#SBATCH --mem=200&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to access the LSDF Online Storage the following environment variables are available: &lt;br /&gt;
&amp;lt;code&amp;gt;$LSDF&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFPROJECTS&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFHOME&amp;lt;/code&amp;gt;&lt;br /&gt;
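&lt;br /&gt;
For example, on a login node or inside a job with the &#039;&#039;LSDF&#039;&#039; constraint you can check the mount location and list your storage projects (a minimal sketch; which projects are visible depends on your permissions):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ echo $LSDF&lt;br /&gt;
$ ls $LSDFPROJECTS&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;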
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users can request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged when your job completes. &lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;All data on the private BeeOND filesystem will be deleted when your job completes.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Make sure you have copied your results back to a global filesystem, e.g., $HOME or a workspace, within your job.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. All nodes of the batch job have access to the same data below the same path. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
Starting and stopping BeeOND is integrated in the prolog and epilog of the cluster batch system Slurm. It can be used during job runtime if the compute nodes are used exclusively. You can request the creation of a BeeOND file system with the constraint flags &#039;&#039;BEEOND&#039;&#039;, &#039;&#039;BEEOND_4MDS&#039;&#039; or &#039;&#039;BEEOND_MAXMDS&#039;&#039;.&lt;br /&gt;
* BEEOND: one metadata server is started on the first node&lt;br /&gt;
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.&lt;br /&gt;
* BEEOND_MAXMDS: on every node of your job a metadata server for the on-demand file system is started&lt;br /&gt;
&lt;br /&gt;
As a starting point we recommend using the constraint &#039;&#039;BEEOND&#039;&#039;. You have one of the following options to request the constraint:&lt;br /&gt;
&lt;br /&gt;
1. Add the line &amp;lt;code&amp;gt;#SBATCH --constraint=BEEOND&amp;lt;/code&amp;gt; to your job script:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=BEEOND   # or BEEOND_4MDS or BEEOND_MAXMDS&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -N &amp;lt;# of nodes&amp;gt; -t &amp;lt;runtime&amp;gt; --mem &amp;lt;mem&amp;gt; -C BEEOND job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After your job has started you can find the private on-demand file system in the &#039;&#039;&#039;/mnt/odfs/${SLURM_JOB_ID}&#039;&#039;&#039; directory. The mountpoint comes with five pre-configured directories:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# For small files (stripe count = 1)&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_1&lt;br /&gt;
# Stripe count = 4&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_default &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_4&lt;br /&gt;
# Stripe count = 8, 16 or 32, use these directories for medium-sized and large files or when using MPI-IO&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_8&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_16 &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_32&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you request fewer nodes than the stripe count, the stripe count will be reduced to the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 only has a stripe count of 8.&lt;br /&gt;
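&lt;br /&gt;
If the BeeGFS client tool &amp;lt;code&amp;gt;beegfs-ctl&amp;lt;/code&amp;gt; is available on the compute nodes (an assumption, this may differ on the cluster), you can verify the effective striping of a file, e.g.:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ beegfs-ctl --mount=/mnt/odfs/${SLURM_JOB_ID} --getentryinfo /mnt/odfs/${SLURM_JOB_ID}/stripe_16/myfile&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;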
&lt;br /&gt;
; &amp;lt;font color=red&amp;gt;&#039;&#039;&#039;Attention:&#039;&#039;&#039;&amp;lt;/font&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
:Always use the directory with the greatest stripe count for large files. E.g. if your largest file is 3.1 TB, you have to use a stripe count greater than 4 (4 x 750 GB = 3 TB), otherwise the available disk space is exceeded.&lt;br /&gt;
&lt;br /&gt;
The capacity of the private file system depends on the number of nodes. For each node you get 750 GB.&lt;br /&gt;
If you request 100 nodes for your job, the private file system has a capacity of approximately 100 * 750 GB = 75 TB.&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not saved in backups. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need to restore data from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14605</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14605"/>
		<updated>2025-04-03T15:15:52Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* File System Details */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system IBM Storage Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in the following table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally delete data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]] &lt;br /&gt;
should be stored below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown on the &#039;&#039;limits&#039;&#039; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days) your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 45 GB/s read and 40 GB/s write performance (see Table 2) when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
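&lt;br /&gt;
For example, assuming the workspace tools provide the &amp;lt;code&amp;gt;ws_extend&amp;lt;/code&amp;gt; command, one of the 3 extensions of a workspace named myws (a hypothetical name) can be used like this:&lt;br /&gt;
 $ ws_extend myws 60&lt;br /&gt;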
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work on the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Therefore, you can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node, store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, neither for very large files nor to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
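&lt;br /&gt;
If you want to inspect the layout parameters of one of your files, you can use the standard Lustre command &amp;lt;code&amp;gt;lfs getstripe&amp;lt;/code&amp;gt; (shown here for a hypothetical file on a workspace):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe $(ws_find myws)/results/output.dat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;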
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task (see the sketch below)&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node, store them on $TMPDIR&lt;br /&gt;
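&lt;br /&gt;
As a minimal sketch (assuming a Slurm batch job; &#039;&#039;myapp&#039;&#039; and the workspace name are placeholders), each task can write into its own subdirectory instead of a single shared directory:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# One output subdirectory per task avoids competitive access to a single directory&lt;br /&gt;
export OUTBASE=$(ws_find myws)/run-${SLURM_JOB_ID}&lt;br /&gt;
mkdir -p $OUTBASE&lt;br /&gt;
# SLURM_PROCID is set individually for each task started by srun&lt;br /&gt;
srun bash -c &#039;mkdir -p $OUTBASE/task_$SLURM_PROCID; myapp -outputdir $OUTBASE/task_$SLURM_PROCID&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;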
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system for special requirements available. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# From the Ice Lake nodes of bwUniCluster 3.0 (queue &#039;&#039;cpu_il&#039;&#039;) the network distance and latency is low compared to the normal workspace file system.&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 3.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; for all the commands that manage workspaces. On bwUniCluster 3.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 3.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 3.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the name of the workspace directory differs between them. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, you can restore expired workspaces for 30 days after workspace expiration. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data to $TMPDIR at the beginning of your batch job and read the data from there; see the usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]]. &lt;br /&gt;
The capacity of $TMPDIR is at least 1400 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages: the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The real installation of the package (e.g. via make install) &lt;br /&gt;
should then go into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, which can cause issues for you and for other users. $TMPDIR, on the other hand, is created when the job starts and removed when the job completes, i.e. it is cleaned up automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc3n991 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc3n991 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols. It is only available for certain users. For information on how to request storage projects on the LSDF Online Storage see [[https://www.scc.kit.edu/en/services/lsdf]].&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage is mounted on the login nodes. It will also be mounted on the compute nodes of your batch job if you request it with the constraint flag &#039;&#039;LSDF&#039;&#039;. You have one of the following options:&lt;br /&gt;
&lt;br /&gt;
1. Add after the initial lines of your script job.sh the line &amp;lt;code&amp;gt;#SBATCH --constraint=LSDF&amp;lt;/code&amp;gt;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=120&lt;br /&gt;
#SBATCH --mem=200&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to access the LSDF Online Storage the following environment variables are available: &lt;br /&gt;
&amp;lt;code&amp;gt;$LSDF&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFPROJECTS&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFHOME&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users can request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged when your job completes. &lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;All data on the private filesystem will be deleted when your job completes.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Make sure you have copied your results back to a global filesystem, e.g., $HOME or a workspace, within your job.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. All nodes of the batch job have access to the same data below the same path. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
Starting and stopping BeeOND is integrated in the prolog and epilog of the cluster batch system Slurm. It can be used during job runtime if the compute nodes are used exclusively. You can request the creation of a BeeOND file system with the constraint flags &#039;&#039;BEEOND&#039;&#039;, &#039;&#039;BEEOND_4MDS&#039;&#039; or &#039;&#039;BEEOND_MAXMDS&#039;&#039;.&lt;br /&gt;
* BEEOND: one metadata server is started on the first node&lt;br /&gt;
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.&lt;br /&gt;
* BEEOND_MAXMDS: on every node of your job a metadata server for the on-demand file system is started&lt;br /&gt;
&lt;br /&gt;
As a starting point we recommend using the constraint &#039;&#039;BEEOND&#039;&#039;. You have one of the following options to request the constraint:&lt;br /&gt;
&lt;br /&gt;
1. Add the line &amp;lt;code&amp;gt;#SBATCH --constraint=BEEOND&amp;lt;/code&amp;gt; to your job script:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=BEEOND   # or BEEOND_4MDS or BEEOND_MAXMDS&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -N &amp;lt;# of nodes&amp;gt; -t &amp;lt;runtime&amp;gt; --mem &amp;lt;mem&amp;gt; -C BEEOND job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After your job has started you can find the private on-demand file system in the &#039;&#039;&#039;/mnt/odfs/${SLURM_JOB_ID}&#039;&#039;&#039; directory. The mountpoint comes with five pre-configured directories:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# For small files (stripe count = 1)&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_1&lt;br /&gt;
# Stripe count = 4&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_default &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_4&lt;br /&gt;
# Stripe count = 8, 16 or 32, use these directories for medium-sized and large files or when using MPI-IO&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_8&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_16 &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_32&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you request fewer nodes than the stripe count, the stripe count will be reduced to the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 only has a stripe count of 8.&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;font color=red&amp;gt;&#039;&#039;&#039;Attention:&#039;&#039;&#039;&amp;lt;/font&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
:Always use the directory with the greatest stripe count for large files. E.g. if your largest file is 3.1 TB, you have to use a stripe count greater than 4 (4 x 750 GB = 3 TB), otherwise the available disk space is exceeded.&lt;br /&gt;
&lt;br /&gt;
The capacity of the private file system depends on the number of nodes. For each node you get 750 GB.&lt;br /&gt;
If you request 100 nodes for your job, the private file system has a capacity of approximately 100 * 750 GB = 75 TB.&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not saved in backups. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need to restore data from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14604</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14604"/>
		<updated>2025-04-03T15:09:46Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Workspaces on flash storage */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally delete data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown on the &#039;&#039;limits&#039;&#039; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days) your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 45 GB/s read and 40 GB/s write performance (see Table 2) when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
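&lt;br /&gt;
As a quick orientation, a typical workspace lifecycle might look like the following sketch (the workspace name myws is an example):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace with a lifetime of 60 days&lt;br /&gt;
ws_allocate myws 60&lt;br /&gt;
# Print the path of the workspace&lt;br /&gt;
ws_find myws&lt;br /&gt;
# List all of your workspaces and their remaining lifetimes&lt;br /&gt;
ws_list&lt;br /&gt;
# Extend the lifetime by another 60 days (possible 3 times)&lt;br /&gt;
ws_extend myws 60&lt;br /&gt;
# Release the workspace once the data is no longer needed&lt;br /&gt;
ws_release myws&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;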
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry that reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; only works within the same filesystem. You therefore have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
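&lt;br /&gt;
Putting the notes above together, a restore might look like the following sketch (the names are placeholders; use the full name printed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# List expired workspaces with their full names&lt;br /&gt;
ws_restore -l&lt;br /&gt;
# Allocate a target workspace on the same filesystem as the expired one&lt;br /&gt;
ws_allocate -F ffuc my_restored 30&lt;br /&gt;
# Restore the expired workspace into the target workspace&lt;br /&gt;
ws_restore ab1234-myws-1712345678 my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;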
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
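&lt;br /&gt;
For example, to maintain such links below &amp;lt;code&amp;gt;$HOME/workspaces&amp;lt;/code&amp;gt;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register $HOME/workspaces&lt;br /&gt;
ls -l $HOME/workspaces&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;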
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once (see the sketch after this list),&lt;br /&gt;
&lt;br /&gt;
* to exploit the complete filesystem bandwidth, use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks, or use blocks with boundaries at the stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node, store them on $TMPDIR.&lt;br /&gt;
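&lt;br /&gt;
A minimal sketch of writing data sequentially in large, stripe-aligned blocks (the workspace name data-ssd is an example, matching the $TMPDIR usage example below):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Write 1 GiB sequentially in 1 MiB blocks, matching the default stripe size&lt;br /&gt;
dd if=/dev/zero of=$(ws_find data-ssd)/testfile bs=1M count=1024&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;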
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file grows. In normal cases users no longer need to adapt file striping parameters, even for very large files or to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
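&lt;br /&gt;
If you nevertheless want to inspect or adjust striping, the standard Lustre commands look like the following hedged sketch (directory and file names are examples):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Show the layout (including PFL components) of an existing file&lt;br /&gt;
lfs getstripe $(ws_find data-ssd)/testfile&lt;br /&gt;
# Create a directory whose new files get a fixed stripe count of 8&lt;br /&gt;
mkdir $(ws_find data-ssd)/wide&lt;br /&gt;
lfs setstripe -c 8 $(ws_find data-ssd)/wide&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;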
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have a few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task (see the sketch after this list)&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node, store them on $TMPDIR&lt;br /&gt;
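&lt;br /&gt;
A hedged sketch of the per-task subdirectory pattern (myapp is a placeholder application; $SLURM_PROCID is set by srun for each task):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Each task writes into its own subdirectory to avoid contention on a single directory&lt;br /&gt;
WORKDIR=$(ws_find data-ssd)/run-${SLURM_JOB_ID}&lt;br /&gt;
mkdir -p $WORKDIR/task_${SLURM_PROCID}&lt;br /&gt;
myapp -outputdir $WORKDIR/task_${SLURM_PROCID}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;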
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# From the Ice Lake nodes of bwUniCluster 3.0 (queue &#039;&#039;cpu_il&#039;&#039;) the network distance is short and the latency is low compared to the normal workspace file system.&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 3.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 3.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; for all the commands that manage workspaces. On bwUniCluster 3.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 3.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 3.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that each workspace only has to be managed on one of the clusters, since the workspace directory names differ between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, you can restore expired workspaces for 30 days after workspace expiration. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]]. &lt;br /&gt;
The capacity of $TMPDIR is at least 1400 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should then go into the $HOME folder.&lt;br /&gt;
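&lt;br /&gt;
A hedged sketch of this workflow on a login node (the package name mytool and the paths are placeholders):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a unique build directory on the local SSD of the login node&lt;br /&gt;
BUILDDIR=$TMPDIR/$(whoami)-build&lt;br /&gt;
mkdir -p $BUILDDIR&lt;br /&gt;
cd $BUILDDIR&lt;br /&gt;
# Unpack, configure and compile on the fast local SSD ...&lt;br /&gt;
tar -xzf $HOME/downloads/mytool-1.0.tar.gz&lt;br /&gt;
cd mytool-1.0&lt;br /&gt;
./configure --prefix=$HOME/software/mytool&lt;br /&gt;
make&lt;br /&gt;
# ... but install the result into $HOME&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;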
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could be still using data below these directories. Hence the corresponding file systems could fill up and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is automatically done.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc3n991 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc3n991 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols. It is only available for certain users. For information on how to request storage projects on the LSDF Online Storage, see [[https://www.scc.kit.edu/en/services/lsdf]].&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage is mounted on the login nodes. It will also be mounted on the compute nodes of your batch job if you request it with the constraint flag &#039;&#039;LSDF&#039;&#039;. You have one of the following options:&lt;br /&gt;
&lt;br /&gt;
1. Add the line &amp;lt;code&amp;gt;#SBATCH --constraint=LSDF&amp;lt;/code&amp;gt; after the initial lines of your job script job.sh:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=120&lt;br /&gt;
#SBATCH --mem=200&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to access the LSDF Online Storage the following environment variables are available: &lt;br /&gt;
&amp;lt;code&amp;gt;$LSDF&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFPROJECTS&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFHOME&amp;lt;/code&amp;gt;&lt;br /&gt;
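&lt;br /&gt;
A hedged sketch of staging data between the LSDF Online Storage and the local SSD inside a job (the project directory myproject below $LSDFPROJECTS is a placeholder; the actual layout depends on your storage project):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Stage input data from the LSDF Online Storage to the fast local SSD&lt;br /&gt;
rsync -av $LSDFPROJECTS/myproject/input/ $TMPDIR/input/&lt;br /&gt;
# ... run the application on $TMPDIR ...&lt;br /&gt;
# Copy results back to the LSDF Online Storage before the job completes&lt;br /&gt;
rsync -av $TMPDIR/results/ $LSDFPROJECTS/myproject/results/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;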
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged when your job completes. &lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;All data on the private filesystem will be deleted when your job completes.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Make sure you have copied your results back to a global filesystem, e.g., $HOME or a workspace, within your job.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. All nodes of the batch job have access to the same data below the same path. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
Starting and stopping BeeOND is integrated into the prolog and epilog of the cluster batch system Slurm. It can be used during job runtime if the compute nodes are used exclusively. You can request the creation of a BeeOND file system with the constraint flags &#039;&#039;BEEOND&#039;&#039;, &#039;&#039;BEEOND_4MDS&#039;&#039; or &#039;&#039;BEEOND_MAXMDS&#039;&#039;.&lt;br /&gt;
* BEEOND: one metadata server is started on the first node&lt;br /&gt;
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, correspondingly fewer metadata servers are started.&lt;br /&gt;
* BEEOND_MAXMDS: a metadata server for the on-demand file system is started on every node of your job&lt;br /&gt;
&lt;br /&gt;
As a starting point we recommend using the constraint &#039;&#039;BEEOND&#039;&#039;. You have one of the following options to request the constraint:&lt;br /&gt;
&lt;br /&gt;
1. Add the line &amp;lt;code&amp;gt;#SBATCH --constraint=BEEOND&amp;lt;/code&amp;gt; to your job script:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=BEEOND   # or BEEOND_4MDS or BEEOND_MAXMDS&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -N &amp;lt;# of nodes&amp;gt; -t &amp;lt;runtime&amp;gt; --mem &amp;lt;mem&amp;gt; -C BEEOND job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After your job has started you can find the private on-demand file system in the &#039;&#039;&#039;/mnt/odfs/${SLURM_JOB_ID}&#039;&#039;&#039; directory. The mountpoint comes with five pre-configured directories:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# For small files (stripe count = 1)&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_1&lt;br /&gt;
# Stripe count = 4&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_default &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_4&lt;br /&gt;
# Stripe count = 8, 16 or 32; use these directories for medium-sized and large files or when using MPI-IO&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_8&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_16 &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_32&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you request fewer nodes than the stripe count, the stripe count will be the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 only has a stripe count of 8.&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;font color=red&amp;gt;&#039;&#039;&#039;Attention:&#039;&#039;&#039;&amp;lt;/font&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
:Always use the directory with the greatest stripe count for large files. E.g. if your largest file is 3.1 TB, you have to use a stripe count greater than 4 (4 x 750 GB = 3 TB), otherwise the available disk space is exceeded. &lt;br /&gt;
&lt;br /&gt;
The capacity of the private file system depends on the number of nodes. For each node you get 750 GB.&lt;br /&gt;
If you request 100 nodes for your job, the private file system has a capacity of about 100 * 750 GB = 75 TB.&lt;br /&gt;
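&lt;br /&gt;
A hedged sketch of a complete BeeOND job (myapp and the workspace name data-ssd are placeholders, reusing the conventions of the $TMPDIR example above):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH -t 04:00:00&lt;br /&gt;
#SBATCH --constraint=BEEOND&lt;br /&gt;
&lt;br /&gt;
ODFS=/mnt/odfs/${SLURM_JOB_ID}&lt;br /&gt;
# Stage input data from a workspace into the on-demand file system&lt;br /&gt;
cp -r $(ws_find data-ssd)/dataset $ODFS/stripe_8/&lt;br /&gt;
# All nodes of the job see the same data below the on-demand file system&lt;br /&gt;
myapp -input $ODFS/stripe_8/dataset -outputdir $ODFS/stripe_8/results&lt;br /&gt;
# Save the results before the job ends; BeeOND data is deleted at job completion&lt;br /&gt;
rsync -av $ODFS/stripe_8/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;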
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not saved in backups. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you have the need to restore backup data.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14602</id>
		<title>BwUniCluster3.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14602"/>
		<updated>2025-04-03T14:33:59Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* BeeOND (BeeGFS On-Demand) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 3.0 =&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039; is a parallel computer with distributed memory. &lt;br /&gt;
It consists of the bwUniCluster 3.0 components procured in 2024 and also includes the additional compute nodes which were procured as an extension to the bwUniCluster 2.0 in 2022.&lt;br /&gt;
 &lt;br /&gt;
Each node of the system consists of two Intel Xeon or AMD EPYC processors, local memory, local storage, network adapters and optional accelerators (NVIDIA A100 and H100, AMD Instinct MI300A). All nodes are connected via a fast InfiniBand interconnect.&lt;br /&gt;
&lt;br /&gt;
The parallel file system (Lustre) is connected to the InfiniBand switch of the compute cluster. This provides a fast and scalable parallel file &lt;br /&gt;
system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 9.4.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system act in different roles. From an end user&#039;s point of view, the different groups of nodes are login nodes and compute nodes. File server nodes and administrative server nodes are not accessible to users.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes directly accessible by end users. These nodes are used for interactive login, file management, program development, and interactive pre- and post-processing.&lt;br /&gt;
There are two nodes dedicated to this service, and both can be reached via a single address: &amp;lt;code&amp;gt;uc3.scc.kit.edu&amp;lt;/code&amp;gt;. A DNS round-robin alias distributes login sessions across the login nodes.&lt;br /&gt;
To prevent login nodes from being used for activities that are not permitted there and that affect the user experience of other users, &#039;&#039;&#039;long-running and/or compute-intensive tasks are periodically terminated without any prior warning&#039;&#039;&#039;. Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
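&lt;br /&gt;
For example, to log in from a Unix-like system (the username ab1234 is a placeholder):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ssh ab1234@uc3.scc.kit.edu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;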
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Systems&#039;&#039;&#039;&lt;br /&gt;
bwUniCluster 3.0 comprises two parallel file systems based on Lustre.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:uc3.png|Optionen|center|Überschrift|800px]]&lt;br /&gt;
&lt;br /&gt;
= Compute Resources =&lt;br /&gt;
&lt;br /&gt;
== Login nodes ==&lt;br /&gt;
&lt;br /&gt;
After a successful [[BwUniCluster3.0/Login|login]], users find themselves on one of the so-called login nodes. Technically, these largely correspond to a standard CPU node, i.e. users have two AMD EPYC 9454 processors with a total of 96 cores at their disposal. Login nodes are the bridgehead for accessing computing resources.&lt;br /&gt;
Data and software are organized here, computing jobs are initiated and managed, and computing resources allocated for interactive use can also be accessed from here.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Any compute intensive job running on the login nodes will be terminated without any notice.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Compute nodes ==&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are executed automatically via a batch script, or the nodes can be accessed interactively. Please refer to [[BwUniCluster3.0/Running_Jobs|Running Jobs]] on how to request resources.&amp;lt;br&amp;gt;&lt;br /&gt;
The following compute node types are available:&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt;CPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Standard&#039;&#039;&#039;: Two AMD EPYC 9454 processors per node with a total of 96 physical CPU cores or 192 logical cores (Hyper-Threading) per node. These nodes were procured in 2024.&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake&#039;&#039;&#039;: Two Intel Xeon Platinum 8358 processors per node with a total of 64 physical CPU cores or 128 logical cores (Hyper-Threading) per node. These nodes were procured in 2022 as an extension to bwUniCluster 2.0.&lt;br /&gt;
* &#039;&#039;&#039;High Memory&#039;&#039;&#039;: Similar to the standard nodes, but with six times larger memory.&lt;br /&gt;
&amp;lt;b&amp;gt;GPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;NVIDIA GPU x4&#039;&#039;&#039;: Similar to the standard nodes, but with larger memory and four NVIDIA H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;AMD GPU x4&#039;&#039;&#039;: AMD&#039;s accelerated processing unit (APU) MI300A with 4 CPU sockets and 4 compute units which share the same high-bandwidth memory (HBM).&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake NVIDIA GPU x4&#039;&#039;&#039;: Similar to the Ice Lake nodes, but with larger memory and four NVIDIA A100 or H100 GPUs.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Login nodes&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Availability in [[BwUniCluster3.0/Running_Jobs#Queues_on_bwUniCluster_3.0| queues]]&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 272&lt;br /&gt;
| 70&lt;br /&gt;
| 4&lt;br /&gt;
| 12&lt;br /&gt;
| 1&lt;br /&gt;
| 15&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD Zen 4&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency&lt;br /&gt;
| 2.6 GHz &lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 3.7 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96 (4x 24)&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 2.3 TB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 4x 128 GB HBM3&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 3.84 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 7.68 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe &lt;br /&gt;
| 7.68 TB SATA SSD&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA H100 &lt;br /&gt;
| 4x AMD Instinct MI300A&lt;br /&gt;
| 4x NVIDIA A100 / H100 &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 94 GB&lt;br /&gt;
| shared HBM3 (APU)&lt;br /&gt;
| 80 GB / 94 GB &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR200 &lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 4x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x HDR200 &lt;br /&gt;
| IB 1x NDR200&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Hardware overview and properties&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the following file systems are available:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;$HOME&#039;&#039;&#039;&amp;lt;br&amp;gt;The HOME directory is created automatically after account activation, and the environment variable $HOME holds its name. HOME is the place where users find themselves after login.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces&#039;&#039;&#039;&amp;lt;br&amp;gt;Users can create so-called workspaces for non-permanent data with temporary lifetime. A further workspace type based on flash-only storage for special requirements is also available.&lt;br /&gt;
* &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039;&amp;lt;br&amp;gt;The directory $TMPDIR is only available and visible on the local node during the runtime of a compute job. It is located on fast SSD storage devices.&lt;br /&gt;
* &#039;&#039;&#039;BeeOND&#039;&#039;&#039; (BeeGFS On-Demand)&amp;lt;br&amp;gt;On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* &#039;&#039;&#039;LSDF Online Storage&#039;&#039;&#039;&amp;lt;br&amp;gt;On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. On the login nodes, LSDF is automatically mounted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Which file system to use?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME,&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME,&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or which exceeds the capacity restrictions should be sent to the LSDF Online Storage&lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in Table 1 above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training,&lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes&lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on the&lt;br /&gt;
parallel on-demand file system BeeOND. Temporary data which can be recomputed or which is the&lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime&lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace, which can be&lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details|File System Details]]&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The $HOME directories of bwUniCluster 3.0 users are located on the parallel file system Lustre.&lt;br /&gt;
You have access to your $HOME directory from all nodes of UC3. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold files that are&lt;br /&gt;
permanently used, like source code, configuration files, executable programs, etc.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$HOME|Detailed information on $HOME]]&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On UC3 workspaces should be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It can provide high data transfer rates of up to 40 GB/s for reads and writes when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On UC3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces|Detailed information on Workspaces]]&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. &lt;br /&gt;
This directory should be used for temporary files being accessed from the local node. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. &lt;br /&gt;
Because of the extremely fast local SSD storage devices, performance with small files is much better than on the parallel file systems. &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$TMPDIR|Detailed information on $TMPDIR]]&lt;br /&gt;
&lt;br /&gt;
== BeeOND (BeeGFS On-Demand) ==&lt;br /&gt;
&lt;br /&gt;
Users have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged when your job completes.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#BeeOND_(BeeGFS_On-Demand)|Detailed information on BeeOND]]&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols and is only available for certain users.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#LSDF_Online_Storage|Detailed information on LSDF Online Storage]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14601</id>
		<title>BwUniCluster3.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14601"/>
		<updated>2025-04-03T14:32:03Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* File Systems */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 3.0 =&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039; is a parallel computer with distributed memory. &lt;br /&gt;
It consists of the bwUniCluster 3.0 components procured in 2024 and also includes the additional compute nodes which were procured as an extension to the bwUniCluster 2.0 in 2022.&lt;br /&gt;
 &lt;br /&gt;
Most nodes of the system consist of two Intel Xeon or AMD EPYC processors, local memory, local storage, network adapters and optional accelerators (NVIDIA A100 and H100, AMD Instinct MI300A). All nodes are connected via a fast InfiniBand interconnect.&lt;br /&gt;
&lt;br /&gt;
The parallel file system (Lustre) is connected to the InfiniBand switch of the compute cluster. This provides a fast and scalable parallel file &lt;br /&gt;
system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 9.4.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system act in different roles. From an end user&#039;s point of view, the different groups of nodes are login nodes and compute nodes. File server nodes and administrative server nodes are not accessible by users.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes directly accessible by end users. These nodes are used for interactive login, file management, program development, and interactive pre- and post-processing.&lt;br /&gt;
There are two nodes dedicated to this service, and both can be reached via a single address: &amp;lt;code&amp;gt;uc3.scc.kit.edu&amp;lt;/code&amp;gt;. A DNS round-robin alias distributes login sessions to the login nodes.&lt;br /&gt;
To prevent login nodes from being used for activities that are not permitted there and that affect the user experience of other users, &#039;&#039;&#039;long-running and/or compute-intensive tasks are periodically terminated without any prior warning&#039;&#039;&#039;. Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system, and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Systems&#039;&#039;&#039;&lt;br /&gt;
bwUniCluster 3.0 comprises two parallel file systems based on Lustre.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:uc3.png|Optionen|center|Überschrift|800px]]&lt;br /&gt;
&lt;br /&gt;
= Compute Resources =&lt;br /&gt;
&lt;br /&gt;
== Login nodes ==&lt;br /&gt;
&lt;br /&gt;
After a successful [[BwUniCluster3.0/Login|login]], users find themselves on one of the so-called login nodes. Technically, these largely correspond to a standard CPU node, i.e. users have two AMD EPYC 9454 processors with a total of 96 cores at their disposal. Login nodes are the bridgehead for accessing computing resources.&lt;br /&gt;
Data and software are organized here, computing jobs are initiated and managed, and computing resources allocated for interactive use can also be accessed from here.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Any compute intensive job running on the login nodes will be terminated without any notice.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Compute nodes ==&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are executed automatically via a batch script, or the nodes can be accessed interactively. Please refer to [[BwUniCluster3.0/Running_Jobs|Running Jobs]] on how to request resources.&amp;lt;br&amp;gt;&lt;br /&gt;
The following compute node types are available:&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt;CPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Standard&#039;&#039;&#039;: Two AMD EPYC 9454 processors per node with a total of 96 physical CPU cores or 192 logical cores (Hyper-Threading) per node. These nodes were procured in 2024.&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake&#039;&#039;&#039;: Two Intel Xeon Platinum 8358 processors per node with a total of 64 physical CPU cores or 128 logical cores (Hyper-Threading) per node. These nodes were procured in 2022 as an extension to bwUniCluster 2.0.&lt;br /&gt;
* &#039;&#039;&#039;High Memory&#039;&#039;&#039;: Similar to the standard nodes, but with six times larger memory.&lt;br /&gt;
&amp;lt;b&amp;gt;GPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;NVIDIA GPU x4&#039;&#039;&#039;: Similar to the standard nodes, but with larger memory and four NVIDIA H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;AMD GPU x4&#039;&#039;&#039;: AMD&#039;s accelerated processing unit (APU) MI300A with 4 CPU sockets and 4 compute units which share the same high-bandwidth memory (HBM).&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake NVIDIA GPU x4&#039;&#039;&#039;: Similar to the Ice Lake nodes, but with larger memory and four NVIDIA A100 or H100 GPUs.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Login nodes&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Availability in [[BwUniCluster3.0/Running_Jobs#Queues_on_bwUniCluster_3.0| queues]]&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 272&lt;br /&gt;
| 70&lt;br /&gt;
| 4&lt;br /&gt;
| 12&lt;br /&gt;
| 1&lt;br /&gt;
| 15&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD Zen 4&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency&lt;br /&gt;
| 2.6 GHz &lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 3.7 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96 (4x 24)&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 2.3 TB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 4x 128 GB HBM3&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 3.84 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 7.68 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe &lt;br /&gt;
| 7.68 TB SATA SSD&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA H100 &lt;br /&gt;
| 4x AMD Instinct MI300A&lt;br /&gt;
| 4x NVIDIA A100 / H100 &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 94 GB&lt;br /&gt;
| shared HBM3 (APU)&lt;br /&gt;
| 80 GB / 94 GB &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR200 &lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 4x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x HDR200 &lt;br /&gt;
| IB 1x NDR200&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Hardware overview and properties&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the following file systems are available:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;$HOME&#039;&#039;&#039;&amp;lt;br&amp;gt;The HOME directory is created automatically after account activation, and the environment variable $HOME holds its name. HOME is the place where users find themselves after login.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces&#039;&#039;&#039;&amp;lt;br&amp;gt;Users can create so-called workspaces for non-permanent data with temporary lifetime. A further workspace type based on flash-only storage for special requirements is also available.&lt;br /&gt;
* &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039;&amp;lt;br&amp;gt;The directory $TMPDIR is only available and visible on the local node during the runtime of a compute job. It is located on fast SSD storage devices.&lt;br /&gt;
* &#039;&#039;&#039;BeeOND&#039;&#039;&#039; (BeeGFS On-Demand)&amp;lt;br&amp;gt;On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* &#039;&#039;&#039;LSDF Online Storage&#039;&#039;&#039;&amp;lt;br&amp;gt;On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. On the login nodes, LSDF is automatically mounted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Which file system to use?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME,&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME,&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or which exceeds the capacity restrictions should be sent to the LSDF Online Storage&lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in Table 1 above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training,&lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes&lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on the&lt;br /&gt;
parallel on-demand file system BeeOND. Temporary data which can be recomputed or which is the&lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime&lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace, which can be&lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details|File System Details]]&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The $HOME directories of bwUniCluster 3.0 users are located on the parallel file system Lustre.&lt;br /&gt;
You have access to your $HOME directory from all nodes of UC3. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold files that are&lt;br /&gt;
permanently used, like source code, configuration files, executable programs, etc.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$HOME|Detailed information on $HOME]]&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On UC3 workspaces should be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It can provide high data transfer rates of up to 40 GB/s for reads and writes when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On UC3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces|Detailed information on Workspaces]]&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. &lt;br /&gt;
This directory should be used for temporary files being accessed from the local node. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. &lt;br /&gt;
Because of the extremely fast local SSD storage devices, performance with small files is much better than on the parallel file systems. &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$TMPDIR|Detailed information on $TMPDIR]]&lt;br /&gt;
&lt;br /&gt;
== BeeOND (BeeGFS On-Demand) ==&lt;br /&gt;
&lt;br /&gt;
Users have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged when your job completes.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#BeeOND_(BeeGFS_On-Demand)|Detailed information on BeeOND]]&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols and is only available for certain users.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#LSDF_Online_Storage|Detailed information on LSDF Online Storage]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14600</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14600"/>
		<updated>2025-04-03T14:30:47Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* LSDF Online Storage */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; for details see Table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user &amp;lt;br&amp;gt; (250 GiB for University of Mannheim users), &amp;lt;br&amp;gt; also limited per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; (2.5 million for University of Mannheim users)&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at the end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME,&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME,&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or which exceeds the capacity restrictions should be sent to the LSDF Online Storage&lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training,&lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes&lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a&lt;br /&gt;
parallel on-demand file system (BeeOND). Temporary data which can be recomputed or which is the&lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime&lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace, which can be&lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located on the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs, etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the &#039;&#039;limits&#039;&#039; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days), your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit, your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times, up to a maximum of 240 days after workspace creation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
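&lt;br /&gt;
As a quick, non-authoritative sketch of the typical workflow (the workspace name &amp;lt;code&amp;gt;myws&amp;lt;/code&amp;gt; is just an example):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_allocate myws 60        # create workspace myws with a lifetime of 60 days&lt;br /&gt;
$ ws_list                    # list your workspaces and their remaining lifetime&lt;br /&gt;
$ cd $(ws_find myws)         # change into the workspace directory&lt;br /&gt;
$ ws_extend myws 60          # renew the lifetime (3 renewals maximum)&lt;br /&gt;
$ ws_release myws            # release the workspace when it is no longer needed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;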
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day, starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces, but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work on the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed, as in the example below.&lt;br /&gt;
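&lt;br /&gt;
A complete restore sequence might look as follows (a hedged sketch; the filesystem name &amp;lt;code&amp;gt;ffuc&amp;lt;/code&amp;gt;, the workspace names and the full expired name are examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_restore -l                                  # list expired workspaces with their full names&lt;br /&gt;
$ ws_allocate -F ffuc my_restored 30             # target workspace on the same filesystem&lt;br /&gt;
$ ws_restore ab1234-myws-1712345678 my_restored  # full name as printed by ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;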
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;; a short usage example follows the list below. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
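&lt;br /&gt;
For example (the directory name is arbitrary):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_register $HOME/workspaces   # create or refresh links to all current workspaces&lt;br /&gt;
$ ls -l $HOME/workspaces         # each entry is a symlink to a workspace directory&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;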
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also called chunks) is called stripe size, and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, neither for very huge files nor to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
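&lt;br /&gt;
If you are curious, you can inspect the layout that Progressive File Layout has assigned to a file (a read-only check; the path is an example):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe $(ws_find myws)/bigfile.dat   # shows stripe size and stripe count per layout component&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;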
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node store them on $TMPDIR&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system for special requirements is available. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 3.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 3.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 3.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 3.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 3.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks; you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]]. &lt;br /&gt;
The capacity of $TMPDIR is at least 1400 GB.&lt;br /&gt;
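&lt;br /&gt;
Inside a job you can verify the capacity actually available on the node you were assigned (a harmless check):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ df -h $TMPDIR   # shows size, usage and free space of the local SSD file system&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;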
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD disk, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes (see the example below). This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should be done into the $HOME folder.&lt;br /&gt;
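&lt;br /&gt;
A minimal sketch for creating such a unique build directory on a login node (&amp;lt;code&amp;gt;mktemp -d&amp;lt;/code&amp;gt; guarantees a fresh directory name; the package path is hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ builddir=$(mktemp -d -p $TMPDIR build.XXXXXX)   # unique subdirectory below $TMPDIR&lt;br /&gt;
$ cd $builddir                                    # unpack, configure and compile here&lt;br /&gt;
$ make install prefix=$HOME/software/mypkg        # install into $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;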
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup of /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc3n991 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc3n991 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols. It is only available for certain users. For information on how to request storage projects on the LSDF Online Storage see [https://www.scc.kit.edu/en/services/lsdf the LSDF services page].&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage is mounted on the login nodes. It will also be mounted on the compute nodes of your batch job if you request it with the constraint flag &#039;&#039;LSDF&#039;&#039;. Use one of the following options:&lt;br /&gt;
&lt;br /&gt;
1. Add after the initial lines of your script job.sh the line &amp;lt;code&amp;gt;#SBATCH --constraint=LSDF&amp;lt;/code&amp;gt;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=120&lt;br /&gt;
#SBATCH --mem=200&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to access the LSDF Online Storage the following environment variables are available: &lt;br /&gt;
&amp;lt;code&amp;gt;$LSDF&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFPROJECTS&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFHOME&amp;lt;/code&amp;gt;&lt;br /&gt;
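&lt;br /&gt;
For example (a sketch; the project name is a placeholder), results can be copied into a storage project like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# List your storage projects&lt;br /&gt;
$ ls $LSDFPROJECTS&lt;br /&gt;
# Copy job results into a project directory&lt;br /&gt;
$ cp -r $TMPDIR/results $LSDFPROJECTS/&amp;lt;your_project&amp;gt;/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;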
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged when the job ends.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
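&lt;br /&gt;
A hedged sketch inside a job script (the mount point, written here as $BEEOND_MOUNT, is a placeholder; see the linked page below for the actual path):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Stage input data into the on-demand file system at job start&lt;br /&gt;
rsync -av $(ws_find data-ssd)/input/ $BEEOND_MOUNT/input/&lt;br /&gt;
# ... run the application ...&lt;br /&gt;
# Save results to a workspace before the job completes&lt;br /&gt;
rsync -av $BEEOND_MOUNT/results/ $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;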
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; note, however, that ACLs and extended attributes are&lt;br /&gt;
not saved in backups. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need to restore data from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14599</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14599"/>
		<updated>2025-04-03T14:29:48Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* LSDF Online Storage */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; see Table 1 for details&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user &amp;lt;br&amp;gt; (250 GiB for MA users); &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; (2.5 million for MA users)&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown on the &#039;&#039;limits&#039;&#039; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days) your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. your university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It provides data transfer rates of up to 45 GB/s read and 40 GB/s write performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work within the same filesystem, so you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;; see the example below. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
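&lt;br /&gt;
For example, to collect links to all of your workspaces below a folder in your home directory:&lt;br /&gt;
 $ ws_register $HOME/workspaces&lt;br /&gt;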
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
a few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also mentioned as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file grows. In normal cases users no longer need to adapt file striping parameters, even for very huge files or in order to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
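&lt;br /&gt;
If you are curious, you can inspect the striping layout of an existing file with the Lustre command &amp;lt;code&amp;gt;lfs getstripe&amp;lt;/code&amp;gt; (the file name is a placeholder):&lt;br /&gt;
 $ lfs getstripe &amp;lt;file_in_workspace&amp;gt;&lt;br /&gt;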
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have a few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node store them on $TMPDIR&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system for special requirements available. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 3.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 3.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 3.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 3.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 3.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name on different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
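For multi-node jobs this means that data needed on every node must be staged once per node, e.g. with one task per node (a minimal sketch inside a job script, using the archive from the usage example below):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Extract an archive to the node-local $TMPDIR on every node of the job&lt;br /&gt;
srun -N $SLURM_JOB_NUM_NODES --ntasks-per-node=1 tar -C $TMPDIR -xzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;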
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data to $TMPDIR at the beginning of your batch job and read it from there; see the usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]]. &lt;br /&gt;
The capacity of $TMPDIR is at least 1400 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages, i.e. the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should go into the $HOME folder, as sketched below.&lt;br /&gt;
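&lt;br /&gt;
A minimal sketch of such a build on a login node (package name, archive path and install prefix are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Create a unique build directory on the local SSD&lt;br /&gt;
$ mkdir -p $TMPDIR/$USER/build&lt;br /&gt;
$ cd $TMPDIR/$USER/build&lt;br /&gt;
# Unpack, configure with an install prefix below $HOME, compile and install&lt;br /&gt;
$ tar -xzf $HOME/mypackage.tar.gz&lt;br /&gt;
$ cd mypackage&lt;br /&gt;
$ ./configure --prefix=$HOME/sw/mypackage&lt;br /&gt;
$ make&lt;br /&gt;
$ make install&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;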
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup of /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, which can cause issues for you and for other users. $TMPDIR, on the other hand, is created when the job starts and removed when the job completes, i.e. cleanup happens automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc3n991 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc3n991 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols. It is only available for certain users. For information on how to request storage projects on the LSDF Online Storage see [https://www.scc.kit.edu/en/services/lsdf the LSDF services page].&lt;br /&gt;
&lt;br /&gt;
The LSDF Online Storage is mounted on the login nodes. It will also be mounted on the compute nodes of your batch job if you request it with the constraint flag &#039;&#039;LSDF&#039;&#039;. Use one of the following options:&lt;br /&gt;
&lt;br /&gt;
1. Add after the initial lines of your script job.sh the line &amp;lt;code&amp;gt;#SBATCH --constraint=LSDF&amp;lt;/code&amp;gt;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=120&lt;br /&gt;
#SBATCH --mem=200&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
2. Add the constraint on command line:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to access the LSDF Online Storage the following environment variables are available: &lt;br /&gt;
&amp;lt;code&amp;gt;$LSDF&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFPROJECTS&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;$LSDFHOME&amp;lt;/code&amp;gt;&lt;br /&gt;
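&lt;br /&gt;
For example (a sketch; the project name is a placeholder), results can be copied into a storage project like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# List your storage projects&lt;br /&gt;
$ ls $LSDFPROJECTS&lt;br /&gt;
# Copy job results into a project directory&lt;br /&gt;
$ cp -r $TMPDIR/results $LSDFPROJECTS/&amp;lt;your_project&amp;gt;/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;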
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged when the job ends.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
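&lt;br /&gt;
A hedged sketch inside a job script (the mount point, written here as $BEEOND_MOUNT, is a placeholder; see the linked page below for the actual path):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Stage input data into the on-demand file system at job start&lt;br /&gt;
rsync -av $(ws_find data-ssd)/input/ $BEEOND_MOUNT/input/&lt;br /&gt;
# ... run the application ...&lt;br /&gt;
# Save results to a workspace before the job completes&lt;br /&gt;
rsync -av $BEEOND_MOUNT/results/ $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;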
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; note, however, that ACLs and extended attributes are&lt;br /&gt;
not saved in backups. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need to restore data from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Slurm&amp;diff=14598</id>
		<title>BwUniCluster2.0/Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Slurm&amp;diff=14598"/>
		<updated>2025-04-03T14:13:20Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* BeeOND (BeeGFS On-Demand) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div id=&amp;quot;top&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
=  Slurm HPC Workload Manager = &lt;br /&gt;
== Specification == &lt;br /&gt;
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Any kind of calculation on the compute nodes of [[bwUniCluster 2.0|bwUniCluster 2.0]] requires the user to define the calculation as a sequence of commands (or a single command) together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to a resource and workload managing software. On bwUniCluster 2.0 the installed workload managing software is Slurm. Therefore any job submission by the user has to be executed via commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.  &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
Some of the most used Slurm commands for non-administrators working on bwUniCluster 2.0.&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Job Submission : sbatch|sbatch]] || Submits a job and queues it in an input queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job or requested resources [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Job Submission : sbatch ==&lt;br /&gt;
Batch jobs are submitted by using the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. The main purpose of the &#039;&#039;&#039;sbatch&#039;&#039;&#039; command is to specify the resources that are needed to run the job. &#039;&#039;&#039;sbatch&#039;&#039;&#039; will then queue the batch job. However, the start of the batch job depends on the availability of the requested resources and the fair sharing value.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== sbatch Command Parameters ===&lt;br /&gt;
The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be used from the command line or in your job script.&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! Command line&lt;br /&gt;
! Script&lt;br /&gt;
! Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t &#039;&#039;time&#039;&#039;  or  --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N &#039;&#039;count&#039;&#039;  or  --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n &#039;&#039;count&#039;&#039;  or  --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count (&amp;lt;= 28 and &amp;lt;= 40 resp.) of tasks per node.&amp;lt;br&amp;gt;(Replaces the option ppn of MOAB.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c &#039;&#039;count&#039;&#039; or --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in MegaByte per node.&amp;lt;br&amp;gt;(Default value is 128000 and 96000 MB resp., i.e. you should omit &amp;lt;br&amp;gt; the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum Memory required per allocated CPU.&amp;lt;br&amp;gt;(Replaces the option pmem of MOAB. You should omit &amp;lt;br&amp;gt; the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state&amp;lt;br&amp;gt;changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J &#039;&#039;name&#039;&#039; or --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission &amp;lt;br&amp;gt; environment are propagated to the launched application. Default &amp;lt;br&amp;gt; is ALL. If adding an environment variable to the submission&amp;lt;br&amp;gt; environment is intended, the argument ALL must be added.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A &#039;&#039;group-name&#039;&#039; or --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Charge resources used by this job to the specified group. You may &amp;lt;br&amp;gt; need this option if your account is assigned to more &amp;lt;br&amp;gt; than one group. With the command &amp;quot;scontrol show job&amp;quot; the project &amp;lt;br&amp;gt; group the job is accounted on is shown behind &amp;quot;Account=&amp;quot;. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p &#039;&#039;queue-name&#039;&#039; or --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C &#039;&#039;LSDF&#039;&#039; or --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Job constraint LSDF Filesystems.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C &#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039; or --constraint=&#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Job constraint BeeOND file system.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== sbatch --partition  &#039;&#039;queues&#039;&#039; ====&lt;br /&gt;
Queue classes define the maximum resources of the compute system, such as walltime, nodes and processes per node, for each queue. Details can be found here:&lt;br /&gt;
* [[BwUniCluster_2.0_Batch_Queues#sbatch_-p_queue|bwUniCluster 2.0 queue settings]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== sbatch Examples ===&lt;br /&gt;
==== Serial Programs ====&lt;br /&gt;
To submit a serial job that runs the script &#039;&#039;&#039;job.sh&#039;&#039;&#039; and that requires 5000 MB of main memory and 10 minutes of wall clock time&lt;br /&gt;
&lt;br /&gt;
a) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p dev_single -n 1 -t 10:00 --mem=5000  job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or&lt;br /&gt;
b) add after the initial line of your script &#039;&#039;&#039;job.sh&#039;&#039;&#039; the lines (here with a high memory request):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=10&lt;br /&gt;
#SBATCH --mem=180gb&lt;br /&gt;
#SBATCH --job-name=simple&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
and execute the modified script with the command line option &#039;&#039;--partition=fat&#039;&#039; (with &#039;&#039;--partition=(dev_)single&#039;&#039; a maximum of &#039;&#039;--mem=96gb&#039;&#039; is possible):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch --partition=fat job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that sbatch command line options overrule script options.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Multithreaded Programs ====&lt;br /&gt;
Multithreaded programs operate faster than serial programs on CPUs with multiple cores.&amp;lt;br&amp;gt;&lt;br /&gt;
Moreover, multiple threads of one process share resources such as memory.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For multithreaded programs based on &#039;&#039;&#039;Open&#039;&#039;&#039; &#039;&#039;&#039;M&#039;&#039;&#039;ulti-&#039;&#039;&#039;P&#039;&#039;&#039;rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Because hyperthreading is switched on on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n, if you want to use n threads.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To submit a batch job called &#039;&#039;OpenMP_Test&#039;&#039; that runs a 40-fold threaded program &#039;&#039;omp_exe&#039;&#039; which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p single --export=ALL,OMP_NUM_THREADS=40 -J OpenMP_Test -N 1 -c 80 -t 40 --mem=6000 ./omp_exe&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or&lt;br /&gt;
* generate the script &#039;&#039;&#039;job_omp.sh&#039;&#039;&#039; containing the following lines:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --cpus-per-task=80&lt;br /&gt;
#SBATCH --time=40:00&lt;br /&gt;
#SBATCH --mem=6000mb   &lt;br /&gt;
#SBATCH --export=ALL,EXECUTABLE=./omp_exe&lt;br /&gt;
#SBATCH -J OpenMP_Test&lt;br /&gt;
&lt;br /&gt;
#Usually you should set&lt;br /&gt;
export KMP_AFFINITY=compact,1,0&lt;br /&gt;
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity&lt;br /&gt;
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE&lt;br /&gt;
&lt;br /&gt;
export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))&lt;br /&gt;
echo &amp;quot;Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads&amp;quot;&lt;br /&gt;
startexe=${EXECUTABLE}&lt;br /&gt;
echo $startexe&lt;br /&gt;
exec $startexe&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
When using the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If necessary, replace &amp;lt;placeholder&amp;gt; with the required modulefile to enable the OpenMP environment, then execute the script &#039;&#039;&#039;job_omp.sh&#039;&#039;&#039; adding the queue class &#039;&#039;single&#039;&#039; as sbatch option:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p single job_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that sbatch command line options overrule script options, e.g.,&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch --partition=single --mem=200 job_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
overwrites the script setting of 6000 MByte with 200 MByte.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== MPI Parallel Programs ====&lt;br /&gt;
MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., &#039;&#039;&#039;MPI tasks&#039;&#039;&#039;,  run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Multiple MPI tasks must be launched via &#039;&#039;&#039;mpirun&#039;&#039;&#039;, e.g. 4 MPI tasks of &#039;&#039;my_par_program&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 4 my_par_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This command runs 4 MPI tasks of &#039;&#039;my_par_program&#039;&#039; on the node you are logged in to.&lt;br /&gt;
To run this command with a loaded Intel MPI, the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset ($ unset I_MPI_HYDRA_BOOTSTRAP).&lt;br /&gt;
&lt;br /&gt;
When running MPI parallel programs in a batch job, the interactive environment - particularly the loaded modules - will also be set in the batch job. If you want a defined module environment in your batch job, you have to purge all modules before loading the desired modules, e.g. as sketched below. &lt;br /&gt;
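A minimal sketch (module names are placeholders, as in the examples below):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Start from a clean module environment inside the job script&lt;br /&gt;
module purge&lt;br /&gt;
module load compiler/&amp;lt;placeholder_for_compiler&amp;gt;/&amp;lt;placeholder_for_compiler_version&amp;gt;&lt;br /&gt;
module load mpi/openmpi/&amp;lt;placeholder_for_mpi_version&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;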
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
===== OpenMPI =====&lt;br /&gt;
&lt;br /&gt;
If you want to run jobs on batch nodes, generate a wrapper script &#039;&#039;job_ompi.sh&#039;&#039; for &#039;&#039;&#039;OpenMPI&#039;&#039;&#039; containing the following lines:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Use when using the module environment for OpenMPI&lt;br /&gt;
module load compiler/&amp;lt;placeholder_for_compiler&amp;gt;/&amp;lt;placeholder_for_compiler_version&amp;gt;&lt;br /&gt;
module load mpi/openmpi/&amp;lt;placeholder_for_mpi_version&amp;gt;&lt;br /&gt;
mpirun --bind-to core --map-by core -report-bindings my_par_program&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Attention:&#039;&#039;&#039; Do &#039;&#039;&#039;NOT&#039;&#039;&#039; add the mpirun option &#039;&#039;-n &amp;lt;number_of_processes&amp;gt;&#039;&#039; or any other option defining processes or nodes, since Slurm instructs mpirun about the number of processes and the node hostnames. &#039;&#039;&#039;ALWAYS&#039;&#039;&#039; use the MPI options &#039;&#039;&#039;--bind-to core&#039;&#039;&#039; and &#039;&#039;&#039;--map-by core|socket|node&#039;&#039;&#039;. Please type &#039;&#039;mpirun --help&#039;&#039; for an explanation of the different arguments of the mpirun option &#039;&#039;--map-by&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Considering 4 OpenMPI tasks on a single node, each requiring 2000 MByte, and running for 1 hour, execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p single -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===== Intel MPI =====&lt;br /&gt;
&lt;br /&gt;
Generate a wrapper script for &#039;&#039;&#039;Intel MPI&#039;&#039;&#039;, &#039;&#039;job_impi.sh&#039;&#039; containing the following lines:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Use when a defined module environment related to Intel MPI is wished&lt;br /&gt;
module load compiler/&amp;lt;placeholder_for_compiler&amp;gt;/&amp;lt;placeholder_for_compiler_version&amp;gt;&lt;br /&gt;
module load mpi/impi/&amp;lt;placeholder_for_version&amp;gt;   &lt;br /&gt;
mpiexec.hydra -bootstrap slurm my_par_program&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;font color=red&amp;gt;&#039;&#039;&#039;Attention:&#039;&#039;&#039;&amp;lt;/font&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Do &#039;&#039;&#039;NOT&#039;&#039;&#039; add the mpirun option &#039;&#039;-n &amp;lt;number_of_processes&amp;gt;&#039;&#039; or any other option defining processes or nodes, since Slurm instructs mpirun about the number of processes and the node hostnames.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To launch and run 200 Intel MPI tasks on 5 nodes, each node requiring 80 GByte of memory, running for 5 hours, execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch --partition=multiple -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to use 128 or more nodes, you must also set the environment variable as follows:           &amp;lt;BR&amp;gt;&lt;br /&gt;
export I_MPI_HYDRA_BRANCH_COUNT=-1&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT as follows: &amp;lt;br&amp;gt;&lt;br /&gt;
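export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off&lt;br /&gt;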
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Multithreaded + MPI parallel Programs ====&lt;br /&gt;
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes. &#039;&#039;&#039;Because hyperthreading is switched on on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n, if you want to use n threads.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
===== OpenMPI with Multithreading =====&lt;br /&gt;
Multiple MPI tasks using &#039;&#039;&#039;OpenMPI&#039;&#039;&#039; must be launched by the MPI parallel program &#039;&#039;&#039;mpirun&#039;&#039;&#039;. For multithreaded programs based on &#039;&#039;&#039;Open&#039;&#039;&#039; &#039;&#039;&#039;M&#039;&#039;&#039;ulti-&#039;&#039;&#039;P&#039;&#039;&#039;rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;For OpenMPI&#039;&#039;&#039; a job script to submit a batch job called &#039;&#039;job_ompi_omp.sh&#039;&#039; that runs an MPI program with 4 tasks and a 28-fold threaded program &#039;&#039;ompi_omp_program&#039;&#039; requiring 3000 MByte of physical memory per thread (using 28 threads per MPI task you will get 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours looks like:&lt;br /&gt;
&amp;lt;!--b)--&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=4&lt;br /&gt;
#SBATCH --cpus-per-task=56&lt;br /&gt;
#SBATCH --time=03:00:00&lt;br /&gt;
#SBATCH --mem=83gb    # 84000 MB = 84000/1024 GB = 82.1 GB, rounded up to 83 GB&lt;br /&gt;
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program&lt;br /&gt;
#SBATCH --output=&amp;quot;parprog_hybrid_%j.out&amp;quot;  &lt;br /&gt;
&lt;br /&gt;
# Use when a defined module environment related to OpenMPI is wished&lt;br /&gt;
module load ${MPI_MODULE}&lt;br /&gt;
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))&lt;br /&gt;
export MPIRUN_OPTIONS=&amp;quot;--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings&amp;quot;&lt;br /&gt;
export NUM_CORES=$((SLURM_NTASKS * OMP_NUM_THREADS))&lt;br /&gt;
echo &amp;quot;${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads&amp;quot;&lt;br /&gt;
startexe=&amp;quot;mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}&amp;quot;&lt;br /&gt;
echo $startexe&lt;br /&gt;
exec $startexe&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Execute the script &#039;&#039;&#039;job_ompi_omp.sh&#039;&#039;&#039; with the command sbatch:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p multiple ./job_ompi_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* With the mpirun option &#039;&#039;--bind-to core&#039;&#039; MPI tasks and OpenMP threads are bound to physical cores.&lt;br /&gt;
* With the option &#039;&#039;--map-by node:PE=&amp;lt;value&amp;gt;&#039;&#039; neighboring MPI tasks are attached to different nodes and each MPI task is bound to the first core of a node. &amp;lt;value&amp;gt; must be set to ${OMP_NUM_THREADS}.&lt;br /&gt;
* The option &#039;&#039;-report-bindings&#039;&#039; shows the bindings between MPI tasks and physical cores.&lt;br /&gt;
* The mpirun-options &#039;&#039;&#039;--bind-to core&#039;&#039;&#039;, &#039;&#039;&#039;--map-by socket|...|node:PE=&amp;lt;value&amp;gt;&#039;&#039;&#039; should always be used when running a multithreaded MPI program.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===== Intel MPI with Multithreading =====&lt;br /&gt;
Multiple Intel MPI tasks must be launched by the MPI parallel program &#039;&#039;&#039;mpiexec.hydra&#039;&#039;&#039;. For multithreaded programs based on &#039;&#039;&#039;Open&#039;&#039;&#039; &#039;&#039;&#039;M&#039;&#039;&#039;ulti-&#039;&#039;&#039;P&#039;&#039;&#039;rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;For Intel MPI&#039;&#039;&#039;, a job script &#039;&#039;job_impi_omp.sh&#039;&#039; that runs an Intel MPI program &#039;&#039;impi_omp_program&#039;&#039; with 10 tasks and 40 threads per task, requiring 96000 MByte of total physical memory per task and a total wall clock time of 1 hour, looks like: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--b)--&amp;gt; &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=10&lt;br /&gt;
#SBATCH --cpus-per-task=80&lt;br /&gt;
#SBATCH --time=60&lt;br /&gt;
#SBATCH --mem=96000&lt;br /&gt;
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program&lt;br /&gt;
#SBATCH --output=&amp;quot;parprog_impi_omp_%j.out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
#If using more than one MPI task per node please set&lt;br /&gt;
export KMP_AFFINITY=compact,1,0&lt;br /&gt;
#export KMP_AFFINITY=verbose,scatter  prints messages concerning the supported affinity &lt;br /&gt;
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE&lt;br /&gt;
&lt;br /&gt;
# Use when a defined module environment related to Intel MPI is wished &lt;br /&gt;
module load ${MPI_MODULE}&lt;br /&gt;
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))&lt;br /&gt;
export MPIRUN_OPTIONS=&amp;quot;-binding domain=omp:compact -print-rank-map -envall&amp;quot;&lt;br /&gt;
export NUM_PROCS=$((SLURM_NTASKS * OMP_NUM_THREADS))&lt;br /&gt;
echo &amp;quot;${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads&amp;quot;&lt;br /&gt;
startexe=&amp;quot;mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}&amp;quot;&lt;br /&gt;
echo $startexe&lt;br /&gt;
exec $startexe&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
When using the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If you run only one MPI task per node, please set KMP_AFFINITY=compact,1,0.&lt;br /&gt;
&amp;lt;BR&amp;gt;&lt;br /&gt;
If you want to use 128 or more nodes, you must also set the environment variable as follows:           &amp;lt;BR&amp;gt;&lt;br /&gt;
export I_MPI_HYDRA_BRANCH_COUNT=-1&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Execute the script &#039;&#039;&#039;job_impi_omp.sh&#039;&#039;&#039; with the command sbatch:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p multiple ./job_impi_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The mpirun option &#039;&#039;-print-rank-map&#039;&#039; shows the bindings between MPI tasks and nodes (not very helpful). The option &#039;&#039;-binding&#039;&#039; binds MPI tasks (processes) to particular processors; &#039;&#039;domain=omp&#039;&#039; means that the domain size is determined by the number of threads. If you choose 2 MPI tasks per node, you should use &#039;&#039;-binding &amp;quot;cell=unit;map=bunch&amp;quot;&#039;&#039;; this binding maps one MPI process to each socket, as shown in the sketch below. &lt;br /&gt;
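A minimal sketch for the 2-tasks-per-node case (values are examples; note that the binding specification must not be quoted a second time inside MPIRUN_OPTIONS):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Sketch: 2 MPI tasks per node, one MPI process per socket&lt;br /&gt;
export MPIRUN_OPTIONS=&amp;quot;-binding cell=unit;map=bunch -print-rank-map -envall&amp;quot;&lt;br /&gt;
mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;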
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Chain jobs ====&lt;br /&gt;
The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem with a job chain. A job chain is a sequence of jobs where each job automatically starts its successor. &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
############################################&lt;br /&gt;
## Simple Slurm submitter script to set   ##&lt;br /&gt;
## up a chain of jobs using Slurm         ##&lt;br /&gt;
############################################&lt;br /&gt;
## ver.  : 2018-11-27, KIT, SCC&lt;br /&gt;
&lt;br /&gt;
## Define maximum number of jobs via positional parameter 1, default is 5&lt;br /&gt;
max_nojob=${1:-5}&lt;br /&gt;
&lt;br /&gt;
## Define your jobscript (e.g. &amp;quot;~/chain_job.sh&amp;quot;)&lt;br /&gt;
chain_link_job=${PWD}/chain_job.sh&lt;br /&gt;
&lt;br /&gt;
## Define type of dependency via positional parameter 2, default is &#039;afterok&#039;&lt;br /&gt;
dep_type=&amp;quot;${2:-afterok}&amp;quot;&lt;br /&gt;
## -&amp;gt; List of all dependencies:&lt;br /&gt;
## https://slurm.schedmd.com/sbatch.html&lt;br /&gt;
&lt;br /&gt;
myloop_counter=1&lt;br /&gt;
## Submit loop&lt;br /&gt;
while [ ${myloop_counter} -le ${max_nojob} ] ; do&lt;br /&gt;
   ##&lt;br /&gt;
   ## Differ slurm_opt depending on chain link number&lt;br /&gt;
   if [ ${myloop_counter} -eq 1 ] ; then&lt;br /&gt;
      slurm_opt=&amp;quot;&amp;quot;&lt;br /&gt;
   else&lt;br /&gt;
      slurm_opt=&amp;quot;-d ${dep_type}:${jobID}&amp;quot;&lt;br /&gt;
   fi&lt;br /&gt;
   ##&lt;br /&gt;
   ## Print current iteration number and sbatch command&lt;br /&gt;
   echo &amp;quot;Chain job iteration = ${myloop_counter}&amp;quot;&lt;br /&gt;
   echo &amp;quot;   sbatch -p &amp;lt;queue&amp;gt; --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}&amp;quot;&lt;br /&gt;
   ## Store the job ID for the next iteration by parsing the output of the sbatch command&lt;br /&gt;
   jobID=$(sbatch -p &amp;lt;queue&amp;gt; --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2&amp;gt;&amp;amp;1 | sed &#039;s/[S,a-z]* //g&#039;)&lt;br /&gt;
   ##   &lt;br /&gt;
   ## Check if ERROR occured&lt;br /&gt;
   if [[ &amp;quot;${jobID}&amp;quot; =~ &amp;quot;ERROR&amp;quot; ]] ; then&lt;br /&gt;
      echo &amp;quot;   -&amp;gt; submission failed!&amp;quot; ; exit 1&lt;br /&gt;
   else&lt;br /&gt;
      echo &amp;quot;   -&amp;gt; job number = ${jobID}&amp;quot;&lt;br /&gt;
   fi&lt;br /&gt;
   ##&lt;br /&gt;
   ## Increase counter&lt;br /&gt;
   let myloop_counter+=1&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
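The submitter script expects a jobscript (here &#039;&#039;chain_job.sh&#039;&#039;) which receives the exported variable myloop_counter. A minimal, hypothetical example of such a chain link could look like:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=10&lt;br /&gt;
&lt;br /&gt;
# myloop_counter is exported by the submitter script above&lt;br /&gt;
echo &amp;quot;This is chain link number ${myloop_counter}&amp;quot;&lt;br /&gt;
# ... restart your application from the checkpoint written by the previous chain link ...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;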
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== GPU jobs ====&lt;br /&gt;
&lt;br /&gt;
The nodes in the gpu_4 and gpu_8 queues have 4 or 8 NVIDIA Tesla V100 GPUs. Just submitting a job to these queues is not enough to also allocate one or more GPUs; you have to do so using the &amp;quot;--gres=gpu&amp;quot; parameter. You have to specify how many GPUs your job needs, e.g. &amp;quot;--gres=gpu:2&amp;quot; will request two GPUs.&lt;br /&gt;
&lt;br /&gt;
The GPU nodes are shared between multiple jobs if the jobs don&#039;t request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.&lt;br /&gt;
&lt;br /&gt;
a) add, after the initial line of your script job.sh, the line with the&lt;br /&gt;
information about the GPU usage:&amp;lt;br&amp;gt;   #SBATCH --gres=gpu:2&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=40&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --mem=4000&lt;br /&gt;
#SBATCH --gres=gpu:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or b) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
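Inside the job, Slurm typically restricts GPU visibility via the environment variable CUDA_VISIBLE_DEVICES; a quick sanity check in your jobscript could look like:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
echo $CUDA_VISIBLE_DEVICES   # e.g. 0,1 for a job submitted with --gres=gpu:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;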
&amp;lt;br/&amp;gt;&lt;br /&gt;
If you start an interactive session on one of the GPU nodes, you can use the &amp;quot;nvidia-smi&amp;quot; command to list the GPUs allocated to your job:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ nvidia-smi&lt;br /&gt;
Sun Mar 29 15:20:05 2020       &lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |&lt;br /&gt;
|-------------------------------+----------------------+----------------------+&lt;br /&gt;
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |&lt;br /&gt;
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |&lt;br /&gt;
|===============================+======================+======================|&lt;br /&gt;
|   0  Tesla V100-SXM2...  Off  | 00000000:3A:00.0 Off |                    0 |&lt;br /&gt;
| N/A   29C    P0    39W / 300W |      9MiB / 32510MiB |      0%      Default |&lt;br /&gt;
+-------------------------------+----------------------+----------------------+&lt;br /&gt;
|   1  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |&lt;br /&gt;
| N/A   30C    P0    41W / 300W |      8MiB / 32510MiB |      0%      Default |&lt;br /&gt;
+-------------------------------+----------------------+----------------------+&lt;br /&gt;
                                                                               &lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
| Processes:                                                       GPU Memory |&lt;br /&gt;
|  GPU       PID   Type   Process name                             Usage      |&lt;br /&gt;
|=============================================================================|&lt;br /&gt;
|    0     14228      G   /usr/bin/X                                     8MiB |&lt;br /&gt;
|    1     14228      G   /usr/bin/X                                     8MiB |&lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
When using Open MPI, the underlying communication infrastructure (UCX and Open MPI&#039;s BTL) is CUDA-aware.&lt;br /&gt;
However, there may be warnings, e.g. when running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load compiler/gnu/10.3 mpi/openmpi devel/cuda&lt;br /&gt;
$ mpirun -np 2 ./mpi_cuda_app&lt;br /&gt;
--------------------------------------------------------------------------&lt;br /&gt;
WARNING: There are more than one active ports on host &#039;uc2n520&#039;, but the&lt;br /&gt;
default subnet GID prefix was detected on more than one of these&lt;br /&gt;
ports.  If these ports are connected to different physical IB&lt;br /&gt;
networks, this configuration will fail in Open MPI.  This version of&lt;br /&gt;
Open MPI requires that every physically separate IB subnet that is&lt;br /&gt;
used between connected MPI processes must have different subnet ID&lt;br /&gt;
values.&lt;br /&gt;
&lt;br /&gt;
Please see this FAQ entry for more details:&lt;br /&gt;
&lt;br /&gt;
  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid&lt;br /&gt;
&lt;br /&gt;
NOTE: You can turn off this warning by setting the MCA parameter&lt;br /&gt;
      btl_openib_warn_default_gid_prefix to 0.&lt;br /&gt;
--------------------------------------------------------------------------&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid this warning, please run Open MPI&#039;s mpirun using the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or disable the (older) communication layer BTL (Byte Transfer Layer) altogether:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(Please note that CUDA as of v11.4 only supports GCC up to version 10.)&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== LSDF Online Storage ====&lt;br /&gt;
On bwUniCluster 2.0 the LSDF Online Storage can be used on the HPC cluster nodes for special cases. Please request this service separately ([https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request]).&lt;br /&gt;
To mount the LSDF Online Storage on the compute nodes during job runtime,&lt;br /&gt;
the constraint flag &amp;quot;LSDF&amp;quot; has to be set.  &lt;br /&gt;
&lt;br /&gt;
a) add, after the initial line of your script job.sh, the line with the&lt;br /&gt;
information about the LSDF Online Storage usage:&amp;lt;br&amp;gt;   #SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=120&lt;br /&gt;
#SBATCH --mem=200&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or b) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage&lt;br /&gt;
the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
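For example, a jobscript could stage input data from the LSDF Online Storage to the fast local $TMPDIR (a sketch; the project directory is a placeholder):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
# &amp;lt;project&amp;gt; is a placeholder for your LSDF project directory&lt;br /&gt;
cp -r ${LSDFPROJECTS}/&amp;lt;project&amp;gt;/input ${TMPDIR}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;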
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====BeeOND (BeeGFS On-Demand)====&lt;br /&gt;
&lt;br /&gt;
Starting and stopping BeeOND is integrated in the prolog and epilog of the cluster batch system Slurm. It can be used during job runtime if the compute nodes are exclusively used. You can request the creation of a BeeOND file system with the constraint flags &amp;quot;BEEOND&amp;quot;, &amp;quot;BEEOND_4MDS&amp;quot; or &amp;quot;BEEOND_MAXMDS&amp;quot; ([[BwUniCluster_2.0_Slurm_common_Features#sbatch_Command_Parameters|Slurm Command Parameters]])&lt;br /&gt;
* BEEOND: one metadata server is started on the first node&lt;br /&gt;
* BEEOND_4MDS: 4 metadata servers are started within your job. If your job has fewer than 4 nodes, correspondingly fewer metadata servers are started.&lt;br /&gt;
* BEEOND_MAXMDS: a metadata server for the on-demand file system is started on every node of your job&lt;br /&gt;
&lt;br /&gt;
As a starting point we recommend using the &amp;quot;BEEOND&amp;quot; option. If you are unsure whether this is sufficient for you, feel free to contact the support team.&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=BEEOND   # or BEEOND_4MDS or BEEOND_MAXMDS&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After your job has started you can find the private on-demand file system in &#039;&#039;&#039;/mnt/odfs/${SLURM_JOB_ID}&#039;&#039;&#039; directory. The mountpoint comes with five pre-configured directories:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# For small files (stripe count = 1)&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_1&lt;br /&gt;
# Stripe count = 4&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_default &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_4&lt;br /&gt;
# Stripe count = 8, 16 or 32, use these directories for medium sized and large files or when using MPI-IO&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_8&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_16 &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_32&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you request fewer nodes than the stripe count, the stripe count will be reduced to the number of nodes. For example, if you request only 8 nodes, the directory stripe_16 only has a stripe count of 8.&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;font color=red&amp;gt;&#039;&#039;&#039;Attention:&#039;&#039;&#039;&amp;lt;/font&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
:Be careful when creating large files: it is recommended to use the directory with the maximum stripe count for large files. For example, if your largest file is 1.1 TB, you have to use a stripe count of at least 2, otherwise the disk space of a single node is exceeded.  &lt;br /&gt;
&lt;br /&gt;
The capacity of the private file system depends on the number of nodes: for each node you get 750 GByte.&lt;br /&gt;
If you request 100 nodes for your job, the private file system has a capacity of 100 * 750 GByte, i.e. about 75 TByte.&lt;br /&gt;
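A typical jobscript using the on-demand file system could look like the following sketch (application and data names are placeholders):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=4&lt;br /&gt;
#SBATCH --constraint=BEEOND&lt;br /&gt;
&lt;br /&gt;
ODFS=/mnt/odfs/${SLURM_JOB_ID}&lt;br /&gt;
# Stage input data into the on-demand file system&lt;br /&gt;
cp -r ${HOME}/input ${ODFS}/stripe_default/&lt;br /&gt;
my_par_program ${ODFS}/stripe_default/input&lt;br /&gt;
# Copy results back before the job ends, the file system is removed afterwards&lt;br /&gt;
cp -r ${ODFS}/stripe_default/output ${HOME}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;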
&lt;br /&gt;
== Start time of job or resources : squeue --start ==&lt;br /&gt;
The command can be used by any user to display the estimated start time of a job, based on historical usage, the earliest available reservable resources, and the priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue). &lt;br /&gt;
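For example, to display the estimated start times of your own pending jobs:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue --start -u $(whoami)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;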
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
By default, this command can be run by &#039;&#039;&#039;any user&#039;&#039;&#039;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== List of your submitted jobs : squeue ==&lt;br /&gt;
Displays information about YOUR active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
By default, this command can be run by any user.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Flags ===&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Flag !! Description&lt;br /&gt;
|-&lt;br /&gt;
| -l, --long&lt;br /&gt;
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&#039;&#039;squeue&#039;&#039; example on bwUniCluster 2.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
          18088744    single CPV.sbat   ab1234 PD       0:00      1 (Priority)&lt;br /&gt;
          18098414  multiple CPV.sbat   ab1234 PD       0:00      2 (Priority) &lt;br /&gt;
          18090089  multiple CPV.sbat   ab1234  R       2:27      2 uc2n[127-128]&lt;br /&gt;
$ squeue -l&lt;br /&gt;
            JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON) &lt;br /&gt;
         18088654    single CPV.sbat   ab1234 COMPLETI       4:29   2:00:00      1 uc2n374&lt;br /&gt;
         18088785    single CPV.sbat   ab1234  PENDING       0:00   2:00:00      1 (Priority)&lt;br /&gt;
         18098414  multiple CPV.sbat   ab1234  PENDING       0:00   2:00:00      2 (Priority)&lt;br /&gt;
         18088683    single CPV.sbat   ab1234  RUNNING       0:14   2:00:00      1 uc2n413  &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* The output of &#039;&#039;squeue&#039;&#039; shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Shows free resources : sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
By default, this command can be used by any user or administrator. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Example ===&lt;br /&gt;
* The following command displays which resources are available for immediate use in each partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle&lt;br /&gt;
Partition dev_multiple  :      8 nodes idle&lt;br /&gt;
Partition multiple      :    332 nodes idle&lt;br /&gt;
Partition dev_single    :      4 nodes idle&lt;br /&gt;
Partition single        :     76 nodes idle&lt;br /&gt;
Partition long          :     80 nodes idle&lt;br /&gt;
Partition fat           :      5 nodes idle&lt;br /&gt;
Partition dev_special   :    342 nodes idle&lt;br /&gt;
Partition special       :    342 nodes idle&lt;br /&gt;
Partition dev_multiple_e:      7 nodes idle&lt;br /&gt;
Partition multiple_e    :    335 nodes idle&lt;br /&gt;
Partition gpu_4         :     12 nodes idle&lt;br /&gt;
Partition gpu_8         :      6 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* In the above example, jobs in all partitions can be run immediately.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Detailed job information : scontrol show job ==&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for a specified job. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
* End users can use scontrol show job to view the status of their &#039;&#039;&#039;own jobs&#039;&#039;&#039; only. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Arguments ===&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Option !! Default !! Description !! Example&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:12%;&amp;quot; &lt;br /&gt;
| -d&lt;br /&gt;
| (n/a)&lt;br /&gt;
| Detailed mode&lt;br /&gt;
| Example: Display the state with jobid 18089884 in detailed mode. &amp;lt;br&amp;gt; &amp;lt;pre&amp;gt;scontrol -d show job 18089884&amp;lt;/pre&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Scontrol show job Example ===&lt;br /&gt;
Here is an example from bwUniCluster 2.0.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue    # show my own jobs (here the userid is replaced!)&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
          18089884  multiple CPV.sbat   bq0742  R      33:44      2 uc2n[165-166]&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my running job with jobid 18089884&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 18089884&lt;br /&gt;
&lt;br /&gt;
JobId=18089884 JobName=CPV.sbatch&lt;br /&gt;
   UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=3 Nice=0 Account=kit QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54&lt;br /&gt;
   AccrueTime=2020-03-16T14:14:54&lt;br /&gt;
   StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51&lt;br /&gt;
   Partition=multiple AllocNode:Sid=uc2n995:5064&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc2n[165-166]&lt;br /&gt;
   BatchHost=uc2n165&lt;br /&gt;
   NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1&lt;br /&gt;
   TRES=cpu=160,mem=96320M,node=2,billing=160&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch&lt;br /&gt;
   WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin&lt;br /&gt;
   StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out&lt;br /&gt;
   Power=&lt;br /&gt;
   MailUser=(null) MailType=NONE&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.&lt;br /&gt;
* In which state is the job?&lt;br /&gt;
&amp;lt;pre&amp;gt;$ scontrol show job 18089884 | grep -i State&lt;br /&gt;
   JobState=COMPLETED Reason=None Dependency=(null)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Cancel Slurm Jobs ==&lt;br /&gt;
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).   &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Flag !! Default !! Description !! Example&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -i, --interactive&lt;br /&gt;
| (n/a)&lt;br /&gt;
| Interactive mode.&lt;br /&gt;
| Cancel the job 987654 interactively. &amp;lt;br&amp;gt; &amp;lt;pre&amp;gt; scancel -i 987654 &amp;lt;/pre&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| -t, --state&lt;br /&gt;
| (n/a)&lt;br /&gt;
| Restrict the scancel operation to jobs in a certain state. &amp;lt;br&amp;gt; &amp;quot;job_state_name&amp;quot; may have a value of either &amp;quot;PENDING&amp;quot;, &amp;quot;RUNNING&amp;quot; or &amp;quot;SUSPENDED&amp;quot;.&lt;br /&gt;
| Cancel all jobs in state &amp;quot;PENDING&amp;quot;. &amp;lt;br&amp;gt; &amp;lt;pre&amp;gt; scancel -t &amp;quot;PENDING&amp;quot; &amp;lt;/pre&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Resource Managers =&lt;br /&gt;
=== Batch Job (Slurm) Variables ===&lt;br /&gt;
The following environment variables of Slurm are added to your environment once your job has started&lt;br /&gt;
&amp;lt;small&amp;gt;(only an excerpt of the most important ones)&amp;lt;/small&amp;gt;.&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Environment !! Brief explanation&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_JOB_CPUS_PER_NODE &lt;br /&gt;
| Number of processes per node dedicated to the job&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_JOB_NODELIST &lt;br /&gt;
| List of nodes dedicated to the job&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_JOB_NUM_NODES &lt;br /&gt;
| Number of nodes dedicated to the job&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_MEM_PER_NODE &lt;br /&gt;
| Memory per node dedicated to the job &lt;br /&gt;
|- &lt;br /&gt;
| SLURM_NPROCS&lt;br /&gt;
| Total number of processes dedicated to the job &lt;br /&gt;
|-&lt;br /&gt;
| SLURM_CLUSTER_NAME&lt;br /&gt;
| Name of the cluster executing the job&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_CPUS_PER_TASK &lt;br /&gt;
| Number of CPUs requested per task&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_ACCOUNT&lt;br /&gt;
| Account name &lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_ID&lt;br /&gt;
| Job ID&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_NAME&lt;br /&gt;
| Job Name&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_PARTITION&lt;br /&gt;
| Partition/queue running the job&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_UID&lt;br /&gt;
| User ID of the job&#039;s owner&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_SUBMIT_DIR&lt;br /&gt;
| Job submit folder.  The directory from which sbatch was invoked. &lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_USER&lt;br /&gt;
| User name of the job&#039;s owner&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_RESTART_COUNT&lt;br /&gt;
| Number of times job has restarted&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_PROCID&lt;br /&gt;
| Task ID (MPI rank)&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_NTASKS&lt;br /&gt;
| The total number of tasks available for the job&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_STEP_ID&lt;br /&gt;
| Job step ID&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_STEP_NUM_TASKS&lt;br /&gt;
| Task count (number of MPI ranks)&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_CONSTRAINT&lt;br /&gt;
| Job constraints&lt;br /&gt;
|}&lt;br /&gt;
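As an illustration, a jobscript can simply echo some of these variables (a minimal sketch):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=4&lt;br /&gt;
echo &amp;quot;Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) runs on ${SLURM_JOB_NUM_NODES} node(s): ${SLURM_JOB_NODELIST}&amp;quot;&lt;br /&gt;
echo &amp;quot;Submitted from ${SLURM_SUBMIT_DIR} with ${SLURM_NTASKS} tasks in partition ${SLURM_JOB_PARTITION}&amp;quot;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;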
See also:&lt;br /&gt;
* [https://slurm.schedmd.com/sbatch.html#lbAI Slurm input and output environment variables]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Job Exit Codes ===&lt;br /&gt;
A job&#039;s exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of &amp;quot;NonZeroExitCode&amp;quot;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
==== Displaying Exit Codes and Signals ====&lt;br /&gt;
SLURM displays a job&#039;s exit code in the output of the &#039;&#039;&#039;scontrol show job&#039;&#039;&#039; command and in the sview utility.&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
When a signal was responsible for a job or step&#039;s termination, the signal number will be displayed after the exit code, delineated by a colon (:).&lt;br /&gt;
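For example (compare the ExitCode field in the scontrol output above):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scontrol show job 18089884 | grep ExitCode&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;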
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
==== Submitting Termination Signal ====&lt;br /&gt;
Here is an example of how to &#039;save&#039; the exit code (and thus a possible termination signal) in a typical jobscript.&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
[...]&lt;br /&gt;
mpirun  -np &amp;lt;#cores&amp;gt;  &amp;lt;EXE_BIN_DIR&amp;gt;/&amp;lt;executable&amp;gt; ... (options)  2&amp;gt;&amp;amp;1&lt;br /&gt;
# Capture the exit code immediately after the mpirun call&lt;br /&gt;
exit_code=$?&lt;br /&gt;
[ &amp;quot;$exit_code&amp;quot; -eq 0 ] &amp;amp;&amp;amp; echo &amp;quot;all clean...&amp;quot; || \&lt;br /&gt;
   echo &amp;quot;Executable &amp;lt;EXE_BIN_DIR&amp;gt;/&amp;lt;executable&amp;gt; finished with exit code ${exit_code}&amp;quot;&lt;br /&gt;
[...]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
* Do not use &#039;&#039;&#039;&#039;time&#039;&#039;&#039;&#039; mpirun! The exit code will then be the one returned by the first program (time).&lt;br /&gt;
* You do not need an &#039;&#039;&#039;exit $exit_code&#039;&#039;&#039; in the scripts.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
[[#top|Back to top]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14597</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14597"/>
		<updated>2025-04-03T13:56:51Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Backup and Archiving */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the &#039;&#039;limits&#039;&#039; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days) your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It can provide high data transfer rates of up to 40 GB/s for writes and reads when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
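For example, a typical workspace workflow could look like this (workspace name and lifetime are examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_allocate myws 60      # create workspace myws with a lifetime of 60 days&lt;br /&gt;
$ ws_list                  # list your workspaces and their remaining lifetimes&lt;br /&gt;
$ ws_find myws             # print the full path of workspace myws&lt;br /&gt;
$ ws_release myws          # release the workspace when it is no longer needed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;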
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work within the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace; use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also mentioned as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even for very large files or to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should omit metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node store them on $TMPDIR&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system for special requirements available. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; to all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not run on the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name on the different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory on each node are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]]. &lt;br /&gt;
The capacity of $TMPDIR is at least 1400 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should target the $HOME folder.&lt;br /&gt;
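&lt;br /&gt;
A minimal sketch of such a build on a login node (package name and paths are placeholders):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# create a unique build directory on the fast local SSD of the login node&lt;br /&gt;
BUILDDIR=$TMPDIR/$(whoami)-build-$$&lt;br /&gt;
mkdir -p $BUILDDIR&lt;br /&gt;
cd $BUILDDIR&lt;br /&gt;
# unpack, configure, compile and link on $TMPDIR (mypackage is a placeholder)&lt;br /&gt;
tar -xzf $HOME/downloads/mypackage.tar.gz&lt;br /&gt;
cd mypackage&lt;br /&gt;
./configure --prefix=$HOME/software/mypackage&lt;br /&gt;
make -j 8&lt;br /&gt;
# install the finished package below $HOME&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;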
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup of /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, which can cause issues for you and for other users. $TMPDIR, on the other hand, is created when the job starts and removed when the job completes, i.e. the cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
The following example shows how to use $TMPDIR and how to transfer data to and from $TMPDIR efficiently. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc3n991 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc3n991 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime by specifying the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
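&lt;br /&gt;
A minimal sketch of a job that saves its results to a project directory on the LSDF Online Storage (the project name myproject below $LSDFPROJECTS and the application myapp are placeholders):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&lt;br /&gt;
# run the application with output on the fast local SSD&lt;br /&gt;
myapp -outputdir $TMPDIR/results&lt;br /&gt;
# save the results to a project directory on the LSDF Online Storage&lt;br /&gt;
rsync -av $TMPDIR/results $LSDFPROJECTS/myproject/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;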
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after the job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out, as sketched below. &lt;br /&gt;
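&lt;br /&gt;
A minimal sketch of staging data in and out of the on-demand file system (the actual mount point is described on the page linked below; $BEEOND_DIR and myapp are placeholders):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# placeholder for the actual BeeOND mount point of the job&lt;br /&gt;
BEEOND_DIR=/path/to/beeond/mountpoint&lt;br /&gt;
# stage input data into the private on-demand file system&lt;br /&gt;
rsync -a $(ws_find data-ssd)/input/ $BEEOND_DIR/input/&lt;br /&gt;
# the application reads and writes on the on-demand file system&lt;br /&gt;
myapp -inputdir $BEEOND_DIR/input -outputdir $BEEOND_DIR/results&lt;br /&gt;
# copy the results back to a workspace before the job completes&lt;br /&gt;
rsync -a $BEEOND_DIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;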
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes&lt;br /&gt;
are not saved in backups. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need to restore data from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14593</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14593"/>
		<updated>2025-04-03T13:29:28Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMPDIR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the &#039;&#039;limits&#039;&#039; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit, your I/O operations &lt;br /&gt;
will show a warning message during the grace period (7 days). If the grace period has &lt;br /&gt;
passed or if you are above the hard limit, your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed 3 times at the end of that period, up to a maximum of 240 days after workspace creation.&lt;br /&gt;
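&lt;br /&gt;
For example, assuming the standard workspace tools, an existing workspace named myws (a placeholder) can be renewed for another 60 days with:&lt;br /&gt;
 ws_extend myws 60&lt;br /&gt;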
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work within the same file system. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same file system as the expired workspace. You can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag for this if needed.&lt;br /&gt;
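&lt;br /&gt;
A minimal sketch of a complete restore (the expired workspace name below is a placeholder; take the real full name from the &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt; output):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# list your expired workspaces (full names include username prefix and timestamp suffix)&lt;br /&gt;
ws_restore -l&lt;br /&gt;
# create a target workspace on the same file system as the expired one&lt;br /&gt;
ws_allocate my_restored 30&lt;br /&gt;
# restore the expired workspace into the target workspace&lt;br /&gt;
ws_restore ab1234-myws-1234567890 my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;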
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following (see the example after this list):&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
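&lt;br /&gt;
For example, to maintain such links in a directory named workspaces below your home directory:&lt;br /&gt;
 ws_register $HOME/workspaces&lt;br /&gt;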
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
a few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit the complete file system bandwidth, use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks, or use blocks with boundaries at the stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node, store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also mentioned as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, neither for very large files nor to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
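&lt;br /&gt;
For example, the current striping parameters of a file or directory (the path below is a placeholder) can be inspected with the standard Lustre client tool:&lt;br /&gt;
 lfs getstripe $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;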
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as on local&lt;br /&gt;
file systems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have a few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in a separate subdirectory for each task&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node, store them on $TMPDIR&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to pass the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; to all commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is called &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace only has to be managed on one of the clusters, since the names of the workspace directories differ between the clusters. However, the path of each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request such a restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not run on the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name on the different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory on each node are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in [[BwUniCluster3.0/Hardware_and_Architecture#Compute_nodes|Table 1]]. &lt;br /&gt;
The capacity of $TMPDIR is at least 1400 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should target the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup of /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, which can cause issues for you and for other users. $TMPDIR, on the other hand, is created when the job starts and removed when the job completes, i.e. the cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
The following example shows how to use $TMPDIR and how to transfer data to and from $TMPDIR efficiently. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc3n991 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc3n991 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime by specifying the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after the job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes&lt;br /&gt;
are not saved in backups. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need to restore data from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14592</id>
		<title>BwUniCluster3.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14592"/>
		<updated>2025-04-03T13:19:16Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMPDIR */ Adapted TMPDIR description&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 3.0 =&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039; is a parallel computer with distributed memory. &lt;br /&gt;
It consists of the bwUniCluster 3.0 components procured in 2024 and also includes the additional compute nodes which were procured as an extension to the bwUniCluster 2.0 in 2022.&lt;br /&gt;
 &lt;br /&gt;
Each node of the system consists of two Intel Xeon or AMD EPYC processors, local memory, local storage, network adapters and optional accelerators (NVIDIA A100 and H100, AMD Instinct MI300A). All nodes are connected via a fast InfiniBand interconnect.&lt;br /&gt;
&lt;br /&gt;
The parallel file system (Lustre) is connected to the InfiniBand switch of the compute cluster. This provides a fast and scalable parallel file &lt;br /&gt;
system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 9.4.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system act in different roles. From an end user&#039;s point of view the different groups of nodes are login nodes and compute nodes. File server nodes and administrative server nodes are not accessible by users.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes directly accessible by end users. These nodes are used for interactive login, file management, program development, and interactive pre- and post-processing.&lt;br /&gt;
There are two nodes dedicated to this service, and both can be reached via a single address: &amp;lt;code&amp;gt;uc3.scc.kit.edu&amp;lt;/code&amp;gt;. A DNS round-robin alias distributes login sessions across the login nodes.&lt;br /&gt;
To prevent login nodes from being used for activities that are not permitted there and that affect the user experience of other users, &#039;&#039;&#039;long-running and/or compute-intensive tasks are periodically terminated without any prior warning&#039;&#039;&#039;. Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
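&lt;br /&gt;
For example, a job script is submitted to one of the queues from a login node (a minimal sketch; cpu is one of the queues listed in Table 1 below and myjob.sh is a placeholder script):&lt;br /&gt;
 sbatch -p cpu myjob.sh&lt;br /&gt;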
&lt;br /&gt;
&#039;&#039;&#039;File Systems&#039;&#039;&#039;&lt;br /&gt;
bwUniCluster 3.0 comprises two parallel file systems based on Lustre.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:uc3.png|center|800px]]&lt;br /&gt;
&lt;br /&gt;
= Compute Resources =&lt;br /&gt;
&lt;br /&gt;
== Login nodes ==&lt;br /&gt;
&lt;br /&gt;
After a successful [[BwUniCluster3.0/Login|login]], users find themselves on one of the so-called login nodes. Technically, these largely correspond to a standard CPU node, i.e. users have two AMD EPYC 9454 processors with a total of 96 cores at their disposal. Login nodes are the bridgehead for accessing computing resources.&lt;br /&gt;
Data and software are organized here, computing jobs are initiated and managed, and computing resources allocated for interactive use can also be accessed from here.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Any compute intensive job running on the login nodes will be terminated without any notice.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Compute nodes ==&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are executed automatically via a batch script, or the nodes can be used interactively. Please refer to [[BwUniCluster3.0/Running_Jobs|Running Jobs]] on how to request resources.&amp;lt;br&amp;gt;&lt;br /&gt;
The following compute node types are available:&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt;CPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Standard&#039;&#039;&#039;: Two AMD EPYC 9454 processors per node with a total of 96 physical CPU cores or 192 logical cores (Hyper-Threading) per node. These nodes were procured in 2024.&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake&#039;&#039;&#039;: Two Intel Xeon Platinum 8358 processors per node with a total of 64 physical CPU cores or 128 logical cores (Hyper-Threading) per node. These nodes were procured in 2022 as an extension to bwUniCluster 2.0.&lt;br /&gt;
* &#039;&#039;&#039;High Memory&#039;&#039;&#039;: Similar to the standard nodes, but with six times the memory.&lt;br /&gt;
&amp;lt;b&amp;gt;GPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;NVIDIA GPU x4&#039;&#039;&#039;: Similar to the standard nodes, but with larger memory and four NVIDIA H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;AMD GPU x4&#039;&#039;&#039;: AMD&#039;s accelerated processing unit (APU) MI300A with 4 CPU sockets and 4 compute units which share the same high-bandwidth memory (HBM).&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake NVIDIA GPU x4&#039;&#039;&#039;: Similar to the Ice Lake nodes, but with larger memory and four NVIDIA A100 or H100 GPUs.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Login nodes&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Availability in [[BwUniCluster3.0/Running_Jobs#Queues_on_bwUniCluster_3.0| queues]]&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 272&lt;br /&gt;
| 70&lt;br /&gt;
| 4&lt;br /&gt;
| 12&lt;br /&gt;
| 1&lt;br /&gt;
| 15&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD Zen 4&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.6 GHz &lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 3.7 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96 (4x 24)&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 2.3 TB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 4x 128 GB HBM3&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 3.84 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 7.68 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe &lt;br /&gt;
| 7.68 TB SATA SSD&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA H100 &lt;br /&gt;
| 4x AMD Instinct MI300A&lt;br /&gt;
| 4x NVIDIA A100 / H100 &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 94 GB&lt;br /&gt;
| APU&lt;br /&gt;
| 80 GB / 94 GB &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR200 &lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 4x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x HDR200 &lt;br /&gt;
| IB 1x NDR200&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Hardware overview and properties&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the following file systems are available:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;$HOME&#039;&#039;&#039;&amp;lt;br&amp;gt;The HOME directory is created automatically after account activation, and the environment variable $HOME holds its name. HOME is the place where users find themselves after login.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces&#039;&#039;&#039;&amp;lt;br&amp;gt;Users can create so-called workspaces for non-permanent data with temporary lifetime. A further workspace type based on flash-only storage for special requirements is also available.&lt;br /&gt;
* &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039;&amp;lt;br&amp;gt;The directory $TMPDIR is only available and visible on the local node during the runtime of a compute job. It is located on fast SSD storage devices.&lt;br /&gt;
* &#039;&#039;&#039;BeeOND&#039;&#039;&#039; (BeeGFS On-Demand)&amp;lt;br&amp;gt;On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* &#039;&#039;&#039;LSDF Online Storage&#039;&#039;&#039;&amp;lt;br&amp;gt;On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. On the login nodes, LSDF is automatically mounted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Which file system to use?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in Table 1 above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system BeeOND. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details|File System Details]]&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The $HOME directories of bwUniCluster 3.0 users are located on the parallel file system Lustre.&lt;br /&gt;
You have access to your $HOME directory from all nodes of UC3. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$HOME|Detailed information on $HOME]]&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On UC3 workspaces should be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On UC3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces|Detailed information on Workspaces]]&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. &lt;br /&gt;
This directory should be used for temporary files being accessed from the local node. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. &lt;br /&gt;
Because of the extremely fast local SSD storage devices performance with small files is much better than on the parallel file systems. &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$TMPDIR|Detailed information on $TMPDIR]]&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime by specifying the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_3.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_3.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after the job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_3.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- =Backup and Archiving=&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes&lt;br /&gt;
are not saved in backups. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need to restore data from backup.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14591</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14591"/>
		<updated>2025-04-03T13:13:58Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMPDIR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the &#039;&#039;limits&#039;&#039; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days), your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit, your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
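Broken into steps, this one-liner works as follows (a sketch for illustration; it relies on the mapping file /pfs/data6/project_ids.txt used in the command above):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Strip the last two path components from $HOME&lt;br /&gt;
ORGDIR=$(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;)&lt;br /&gt;
# Look up the Lustre project ID assigned to that path prefix&lt;br /&gt;
PROJID=$(grep $ORGDIR /pfs/data6/project_ids.txt | cut -f 1 -d\ )&lt;br /&gt;
# Show usage and limits of this project quota&lt;br /&gt;
lfs quota -ph $PROJID $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;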
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It can provide high data transfer rates of up to 40 GB/s for write and read when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace creation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
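For example, a typical workspace lifecycle could look as follows (the workspace name myws is a placeholder):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Create a workspace named myws with a lifetime of 60 days&lt;br /&gt;
ws_allocate myws 60&lt;br /&gt;
# List your workspaces and their remaining lifetimes&lt;br /&gt;
ws_list&lt;br /&gt;
# Find the path of the workspace, e.g. for use in batch scripts&lt;br /&gt;
ws_find myws&lt;br /&gt;
# Extend the lifetime by another 60 days (possible 3 times)&lt;br /&gt;
ws_extend myws 60&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;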
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; only works within the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. You can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag for this if needed.&lt;br /&gt;
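Putting this together, restoring a workspace that expired on the flash file system could look as follows (a sketch; the full workspace name is the one printed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# List expired workspaces and note the full name of the one to restore&lt;br /&gt;
ws_restore -l&lt;br /&gt;
# Create a target workspace on the same file system (here: ffuc, see below)&lt;br /&gt;
ws_allocate -F ffuc my_restored 30&lt;br /&gt;
# Restore the expired data into the target workspace&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;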
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt; (see the example after the following list). Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
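For example, to collect links to all your workspaces below a directory in your home (the directory name workspaces is arbitrary):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Create/update links to all personal workspaces&lt;br /&gt;
ws_register $HOME/workspaces&lt;br /&gt;
# Inspect the managed links&lt;br /&gt;
ls -l $HOME/workspaces&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;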
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases, users no longer need to adapt file striping parameters, even for very large files or to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
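If you want to inspect the layout that was applied to a file or directory, the standard Lustre tool can display it (a short sketch; the workspace name myws is a placeholder):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Show stripe count, stripe size and layout of a file&lt;br /&gt;
lfs getstripe $(ws_find myws)/myfile&lt;br /&gt;
# Show the default layout of a directory&lt;br /&gt;
lfs getstripe -d $(ws_find myws)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;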
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have a few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task (see the sketch after this list)&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node store them on $TMPDIR&lt;br /&gt;
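A minimal sketch of the per-task subdirectory pattern inside a batch job (the application name myapp is a placeholder; SLURM_PROCID is set by srun for each task):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Each task writes into its own subdirectory to avoid contended directory access&lt;br /&gt;
OUTDIR=$TMPDIR/task_${SLURM_PROCID:-0}&lt;br /&gt;
mkdir -p $OUTDIR&lt;br /&gt;
myapp -outputdir $OUTDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;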
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; for all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 1400 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should be done into the $HOME folder.&lt;br /&gt;
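A sketch of this pattern on a login node (package name, version and installation prefix are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Create a unique build directory on the fast local SSD&lt;br /&gt;
BUILDDIR=$(mktemp -d $TMPDIR/build.XXXXXX)&lt;br /&gt;
cd $BUILDDIR&lt;br /&gt;
# Unpack, configure with an installation prefix below $HOME, compile and install&lt;br /&gt;
tar -xzf $HOME/downloads/mypackage-1.0.tar.gz&lt;br /&gt;
cd mypackage-1.0&lt;br /&gt;
./configure --prefix=$HOME/sw/mypackage-1.0&lt;br /&gt;
make -j $(nproc)&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;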
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could be still using data below these directories. Hence the corresponding file systems could fill up and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is automatically done.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore, the LSDF Online Storage is mounted on the login and datamover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during the job runtime by setting the constraint flag &amp;quot;LSDF&amp;quot; (see [[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after the job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to a global filesystem within the job, e.g., to $HOME or to a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data of the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14590</id>
		<title>BwUniCluster3.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14590"/>
		<updated>2025-04-03T13:09:49Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* File Systems */ some corrections&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 3.0 =&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039; is a parallel computer with distributed memory. &lt;br /&gt;
It consists of the bwUniCluster 3.0 components procured in 2024 and also includes the additional compute nodes which were procured as an extension to the bwUniCluster 2.0 in 2022.&lt;br /&gt;
 &lt;br /&gt;
Each node of the system consists of two Intel Xeon or AMD EPYC processors, local memory, local storage, network adapters and optional accelerators (NVIDIA A100 and H100, AMD Instinct MI300A). All nodes are connected via a fast InfiniBand interconnect.&lt;br /&gt;
&lt;br /&gt;
The parallel file system (Lustre) is connected to the InfiniBand switch of the compute cluster. This provides a fast and scalable parallel file &lt;br /&gt;
system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 9.4.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system act in different roles. From an end user&#039;s point of view the different groups of nodes are login nodes and compute nodes. File server nodes and administrative server nodes are not accessible by users.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes directly accessible by end users. These nodes are used for interactive login, file management, program development, and interactive pre- and post-processing.&lt;br /&gt;
There are two nodes dedicated to this service, but both can be reached from a single address: &amp;lt;code&amp;gt;uc3.scc.kit.edu&amp;lt;/code&amp;gt;. A DNS round-robin alias distributes login sessions to the login nodes.&lt;br /&gt;
To prevent login nodes from being used for activities that are not permitted there and that affect the user experience of other users, &#039;&#039;&#039;long-running and/or compute-intensive tasks are periodically terminated without any prior warning&#039;&#039;&#039;. Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Systems&#039;&#039;&#039;&lt;br /&gt;
bwUniCluster 3.0 comprises two parallel file systems based on Lustre.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:uc3.png|Optionen|center|Überschrift|800px]]&lt;br /&gt;
&lt;br /&gt;
= Compute Resources =&lt;br /&gt;
&lt;br /&gt;
== Login nodes ==&lt;br /&gt;
&lt;br /&gt;
After a successful [[BwUniCluster3.0/Login|login]], users find themselves on one of the so called login nodes. Technically, these largely correspond to a standard CPU node, i.e. users have two AMD EPYC 9454 processors with a total of 96 cores at their disposal. Login nodes are the bridgehead for accessing computing resources.&lt;br /&gt;
Data and software are organized here, computing jobs are initiated and managed, and computing resources allocated for interactive use can also be accessed from here.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Any compute intensive job running on the login nodes will be terminated without any notice.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Compute nodes ==&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, tasks are executed automatically via a batch script, or the resources can be used interactively (see the example after Table 1 below). Please refer to [[BwUniCluster3.0/Running_Jobs|Running Jobs]] on how to request resources.&amp;lt;br&amp;gt;&lt;br /&gt;
The following compute node types are available:&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt;CPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Standard&#039;&#039;&#039;: Two AMD EPYC 9454 processors per node with a total of 96 physical CPU cores or 192 logical cores (Hyper-Threading) per node. The nodes have been procured in 2024.&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake&#039;&#039;&#039;: Two Intel Xeon Platinum 8358 processors per node with a total of 64 physical CPU cores or 128 logical cores (Hyper-Threading) per node. The nodes have been procured in 2022 as an extension to bwUniCluster 2.0.&lt;br /&gt;
* &#039;&#039;&#039;High Memory&#039;&#039;&#039;: Similar to the standard nodes, but with six times larger memory.&lt;br /&gt;
&amp;lt;b&amp;gt;GPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;NVIDIA GPU x4&#039;&#039;&#039;: Similar to the standard nodes, but with larger memory and four NVIDIA H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;AMD GPU x4&#039;&#039;&#039;: AMD&#039;s accelerated processing unit (APU) MI300A with 4 CPU sockets and 4 compute units which share the same high-bandwidth memory (HBM).&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake NVIDIA GPU x4&#039;&#039;&#039;: Similar to the Ice Lake nodes, but with larger memory and four NVIDIA A100 or H100 GPUs.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Login nodes&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Availability in [[BwUniCluster3.0/Running_Jobs#Queues_on_bwUniCluster_3.0| queues]]&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 272&lt;br /&gt;
| 70&lt;br /&gt;
| 4&lt;br /&gt;
| 12&lt;br /&gt;
| 1&lt;br /&gt;
| 15&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD Zen 4&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.6 GHz &lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 3.7 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96 (4x 24)&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 2.3 TB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 4x 128 GB HBM3&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 3.84 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 7.68 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe &lt;br /&gt;
| 7.68 TB SATA SSD&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA H100 &lt;br /&gt;
| 4x AMD Instinct MI300A&lt;br /&gt;
| 4x NVIDIA A100 / H100 &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 94 GB&lt;br /&gt;
| APU&lt;br /&gt;
| 80 GB / 94 GB &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR200 &lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 4x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x HDR200 &lt;br /&gt;
| IB 1x NDR200&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Hardware overview and properties&lt;br /&gt;
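For illustration, resources from the queues listed above can be requested roughly as follows (a sketch; node count, time and script name are arbitrary examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Submit a batch script to the standard CPU queue&lt;br /&gt;
sbatch -p cpu -N 1 -t 02:00:00 myjob.sh&lt;br /&gt;
# Request an interactive session in the development CPU queue (30 minutes)&lt;br /&gt;
salloc -p dev_cpu -n 1 -t 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;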
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the following file systems are available:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;$HOME&#039;&#039;&#039;&amp;lt;br&amp;gt;The HOME directory is created automatically after account activation, and the environment variable $HOME holds its name. HOME is the place where users find themselves after login.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces&#039;&#039;&#039;&amp;lt;br&amp;gt;Users can create so-called workspaces for non-permanent data with temporary lifetime. A further workspace type based on flash-only storage for special requirements is also available.&lt;br /&gt;
* &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039;&amp;lt;br&amp;gt;The directory $TMPDIR is only available and visible on the local node during the runtime of a compute job. It is located on fast SSD storage devices.&lt;br /&gt;
* &#039;&#039;&#039;BeeOND&#039;&#039;&#039; (BeeGFS On-Demand)&amp;lt;br&amp;gt;On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* &#039;&#039;&#039;LSDF Online Storage&#039;&#039;&#039;&amp;lt;br&amp;gt;On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. On the login nodes, LSDF is automatically mounted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Which file system to use?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in Table 1 above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system (BeeOND). Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details|File System Details]]&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The $HOME directories of bwUniCluster 3.0 users are located on the parallel file system Lustre.&lt;br /&gt;
You have access to your $HOME directory from all nodes of UC3. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$HOME|Detailed information on $HOME]]&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On UC3 workspaces should be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It can provide high data transfer rates of up to 40 GB/s for write and read when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On UC3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces|Detailed information on Workspaces]]&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 1400 GB.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$TMPDIR|Detailed information on $TMPDIR]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore, the LSDF Online Storage is mounted on the login and datamover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during the job runtime by setting the constraint flag &amp;quot;LSDF&amp;quot; (see [[bwUniCluster_3.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_3.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after the job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to a global filesystem within the job, e.g., to $HOME or to a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_3.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- =Backup and Archiving=&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data of the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up.&lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Data_Migration_Guide&amp;diff=14589</id>
		<title>BwUniCluster3.0/Data Migration Guide</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Data_Migration_Guide&amp;diff=14589"/>
		<updated>2025-04-03T12:34:32Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Migration of Workspaces */ Hinweis auf ws_list auf UC2&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
= Summary of changes =&lt;br /&gt;
&lt;br /&gt;
bwUniCluster 3.0 is located on the North Campus of KIT to meet the requirements of energy-efficient and environmentally friendly HPC operation by using the hot water cooling available there. bwUniCluster 3.0 has new parallel file systems for HOME and workspaces. The most important changes compared to bwUniCluster 2.0 are listed below.&lt;br /&gt;
&lt;br /&gt;
== Entitlement, Registration and Login ==&lt;br /&gt;
All users who already have an entitlement on bwUniCluster 2.0 are authorized to access bwUniCluster 3.0. The user only needs to &#039;&#039;&#039;register for the new service&#039;&#039;&#039; at https://bwidm.scc.kit.edu (as described here: [[Registration/bwUniCluster/Service|Step B: bwUniCluster Registration]]).&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The service bwUniCluster 3.0 will be visible in bwIDM from 07.04.2025, 8:00.&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
The new hostname for login via Secure Shell (SSH) is: &#039;&#039;&#039;uc3.scc.kit.edu&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
Like all other users, KIT users now also have to use the &amp;lt;code&amp;gt;ka_&amp;lt;/code&amp;gt; prefix in front of their username.&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
The new bwUniCluster 3.0 features more than 340 CPU nodes and 28 GPU nodes. Most of the CPU nodes originate from the bwUniCluster 2.0 Extension and are equipped with the well-known Intel Xeon Platinum 8358 processors (Ice Lake) with 64 cores per dual socket node. The new CPU partition consists of 70 AMD EPYC 9454 nodes with 96 cores per dual socket node. The GPU nodes feature A100 and H100 accelerators by NVIDIA. The AMD APU node features 4 MI300A accelerators.&amp;lt;br&amp;gt;&lt;br /&gt;
The node interconnect for the new partitions is InfiniBand 2x/4x NDR200, which is expected to provide even better parallel performance. For details please refer to [[BwUniCluster3.0/Hardware_and_Architecture|Hardware and Architecture]].&lt;br /&gt;
&lt;br /&gt;
== Software ==&lt;br /&gt;
The operating system on all nodes is Red Hat Enterprise Linux (RHEL) 9.4.&lt;br /&gt;
&lt;br /&gt;
== Operations ==&lt;br /&gt;
There are no dedicated single-node job queues anymore. Compute resources can hence be allocated with a minimum of one CPU core or one GPU, regardless of the hardware partition. For details please refer to [[BwUniCluster3.0/Running_Jobs#Queues_on_bwUniCluster_3.0|Queues on bwUniCluster 3.0]].&lt;br /&gt;
&lt;br /&gt;
== Policy ==&lt;br /&gt;
* &#039;&#039;&#039;New Quotas&#039;&#039;&#039;&lt;br /&gt;
** HOME: &#039;&#039;&#039;500GB&#039;&#039;&#039;, &#039;&#039;&#039;5 million files (inodes)&#039;&#039;&#039;&lt;br /&gt;
** Workspace: &#039;&#039;&#039;40TB&#039;&#039;&#039;, &#039;&#039;&#039;20 million files (inodes)&#039;&#039;&#039;&lt;br /&gt;
** Throttling Policies: The &#039;&#039;&#039;maximum number of cores&#039;&#039;&#039; used at any given time by running jobs is 1920 per user (aggregated over all running jobs).&lt;br /&gt;
* &#039;&#039;&#039;Username and HOME directory for KIT users&#039;&#039;&#039;&lt;br /&gt;
** Like everyone else, KIT users&#039; usernames now have the two-character prefix of their home location: &#039;&#039;&#039;&amp;lt;code&amp;gt;ka_&amp;lt;/code&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
** The HOME directory for user &#039;&#039;ab1234&#039;&#039; would be: &#039;&#039;&#039;&amp;lt;code&amp;gt;/home/ka/ka_OE/ka_ab1234&amp;lt;/code&amp;gt;&#039;&#039;&#039; (OE: organizational unit)&lt;br /&gt;
** Login with SSH: &#039;&#039;&#039;&amp;lt;code&amp;gt;ssh ka_ab1234@uc3.scc.kit.edu&amp;lt;/code&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Access for KIT students&#039;&#039;&#039;&lt;br /&gt;
** KIT students can be granted access with their regular u-student account in the context of a lecture (cf. https://www.scc.kit.edu/servicedesk/formulare.php &amp;amp;rarr; Application Form for Students accounts on bwUniCluster).&lt;br /&gt;
** The account is only enabled &#039;&#039;&#039;during the lecture period&#039;&#039;&#039;. After the end of the semester, the accounts will be deprovisioned and the user data is deleted.&lt;br /&gt;
** A guest and partner account (GuP) is required for all other projects of KIT students on bwUniCluster 3.0.&lt;br /&gt;
&lt;br /&gt;
= Data Migration =&lt;br /&gt;
&lt;br /&gt;
bwUniCluster 3.0 features a completely new file system; there is no automatic migration of user data! Users have to actively migrate their data to the new file system. For a limited period of time (&#039;&#039;&#039;until July 6, 2025&#039;&#039;&#039;), however, the old file system and login nodes remain in operation. The old file system is mounted on the new system. It will also be possible to log in to the old bwUniCluster 2.0 login nodes during this period. This leaves enough time to copy any HOME directory and workspace data to be migrated to the new file systems.&lt;br /&gt;
Please be aware of the new, slightly more stringent quota policies! Before the data can be copied, the new quotas must be checked to see if they are sufficient to accept the old data. If there are any quota issues, users should review their data lifecycle management.&lt;br /&gt;
&lt;br /&gt;
You perform the data migration while logged in to bwUniCluster 3.0.&lt;br /&gt;
&lt;br /&gt;
== Assisted data migration ==&lt;br /&gt;
&lt;br /&gt;
To facilitate the transfer of data between the old and new HOME directories and workspaces, we provide a script that guides you through the copy process or even automates the transfer: &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;migrate_data_uc2_uc3.sh&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to mitigate the effects of the quota changes, the script first performs a quota check. If the quota check detects that the storage capacity or the number of files (inodes) has been exceeded, the program terminates with an error message.&lt;br /&gt;
If the quota check passes, the data migration command &#039;&#039;&#039;is displayed, not executed!&#039;&#039;&#039; For the fearless, the &amp;lt;code&amp;gt;-x&amp;lt;/code&amp;gt; flag can even be used to initiate the copy process itself.&lt;br /&gt;
The script can automate the data transfer to the new HOME directory. If you intend to also transfer data resident in workspaces, the script can automate this, too. However, the target workspaces on the new system first have to be set up manually (cf. [[#Migration_of_Workspaces|Migration of Workspaces]]).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Options of the script&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;-h&amp;lt;/code&amp;gt; provides detailed information about its usage&lt;br /&gt;
* &amp;lt;code&amp;gt;-v&amp;lt;/code&amp;gt; provides verbose output including the quota checks&lt;br /&gt;
* &amp;lt;code&amp;gt;-x&amp;lt;/code&amp;gt; will execute the migration, if the quota checks did not fail&lt;br /&gt;
&lt;br /&gt;
If the data migration fails due to the time limit, or if you do not intend to do the data transfer interactively, the help message (&amp;lt;code&amp;gt;-h&amp;lt;/code&amp;gt;) provides an example on how to do the data transfer via a batch job. This even accelerates the copy process due to the exclusive usage of a compute node. Alternatively, rsync can be run repeatedly, as it performs incremental synchronizations. This means that data that has already been copied will not be copied a second time; only files that are missing or have changed in the target directory will be copied.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
We explicitly ask users NOT to migrate their old dot files and dot directories, which possibly contain settings not compatible with the new system (&amp;lt;code&amp;gt;.bashrc&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;.config/&amp;lt;/code&amp;gt;, ...). The script therefore excludes these files from migration. We recommend that you start with a new set of default configuration files and adapt them to your needs as required. Please see section [[#Migration_of_Software_and_Settings|Migration of Software and Settings]].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examples&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Getting the help text&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;padding-left: 20px;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot; style=&amp;quot;width:100%;max-width:1000px; overflow:visible;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;font-weight:bold;line-height:1.6;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;migrate_data_uc2_uc3.sh -h&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
migrate_data_uc2_uc3.sh [-h|--help] [-d|--debug] [-x|--execute] [-f|--force] [-v|--verbose] [-w|--workspace &amp;lt;name&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
Without options this script will print the recommended rsync command which can be used to copy data&lt;br /&gt;
from the home directory of bwUniCluster 2.0 to bwUniCluster 3.0. You can either select different&lt;br /&gt;
rsync options (see &amp;quot;man rsync&amp;quot; for explanations) or start the script again with the option &amp;quot;-x&amp;quot;&lt;br /&gt;
in order to execute the rsync command. Note that the recommended options exclude files and directories&lt;br /&gt;
on the old home directory path which start with a dot, for example &#039;&#039;.bashrc&#039;&#039;. This is done because&lt;br /&gt;
these files and directories typically include configuration and cache data which is probably different&lt;br /&gt;
on the new system. If these dot files and directories include data which is still needed you should&lt;br /&gt;
migrate it manually.&lt;br /&gt;
&lt;br /&gt;
The script can also be used to migrate the data of a workspace, see option &amp;quot;-w&amp;quot;. Here the option&lt;br /&gt;
&amp;quot;-x&amp;quot; is only allowed if the old and the new workspace have the same name. If you want to modify the&lt;br /&gt;
name of the old workspace just use the printed rsync command and select an appropriate target directory.&lt;br /&gt;
Note that you have to create the new workspace beforehand.&lt;br /&gt;
&lt;br /&gt;
You should start the script inside a batch job, since limits on the login node would otherwise probably&lt;br /&gt;
abort the actions and the login node would be overloaded.&lt;br /&gt;
Example for starting the batch job:&lt;br /&gt;
sbatch -p cpu -N 1 -t 24:00:00  --mem=30gb /pfs/data6/scripts/migrate_data_uc2_uc3.sh -x&lt;br /&gt;
&lt;br /&gt;
Options:&lt;br /&gt;
  -d|--debug             Provide debug messages.&lt;br /&gt;
  -f|--force             Continue if capacity or inode usage on old file system are higher than&lt;br /&gt;
                         quota limits on new file system.&lt;br /&gt;
  -h|--help              This help.&lt;br /&gt;
  -x|--execute           Execute rsync command. If this option is not set only print rsync command to terminal.&lt;br /&gt;
  -v|--verbose           Provide verbose messages.&lt;br /&gt;
  -w|--workspace &amp;lt;name&amp;gt;  Do rsync for the workspace &amp;lt;name&amp;gt;. If this option is not set do it for your home directory.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&amp;lt;/div&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Give verbose hints&#039;&#039;&#039; (quota OK)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;padding-left: 20px;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot; style=&amp;quot;width:100%;max-width:1000px; overflow:visible;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;font-weight:bold;line-height:1.6;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;migrate_data_uc2_uc3.sh -v&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;Doing the actions for the home directory.&lt;br /&gt;
Checking if capacity and inode usage on the old home file system is lower than the limits on the new file system. &lt;br /&gt;
✅ Quota checks for capacity and inode usage of the home directory have passed.&lt;br /&gt;
Recommended command line for the rsync command:&lt;br /&gt;
rsync -x --numeric-ids -S -rlptoD -H -A --exclude=&#039;/.*&#039; /pfs/data5/home/kit/scc/ab1234/ /home/ka/ka_scc/ka_ab1234/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&amp;lt;/div&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Give verbose hints&#039;&#039;&#039; (quota not OK)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;padding-left: 20px;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot; style=&amp;quot;width:100%;max-width:1000px; overflow:visible;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;font-weight:bold;line-height:1.6;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;migrate_data_uc2_uc3.sh -v&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;Doing the actions for the home directory.&lt;br /&gt;
Checking if capacity and inode usage on the old home file system is lower than the limits on the new file system.&lt;br /&gt;
❌ Exiting because old capacity usage (563281380) is higher than new capacity limit (524288000).&lt;br /&gt;
Please remove data of your old home directory (/pfs/data5/home/kit/scc/ab1234).&lt;br /&gt;
You can also use the force option if you believe that the new limit is sufficient.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&amp;lt;/div&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Migration of HOME ==&lt;br /&gt;
If the guided migration fails due to quota issues, you need to reduce the number of inodes or the amount of data first. A manual check of the used resources, as described below, helps to identify what to clean up.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; 1. Check the quota of HOME &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Show user quota of the &#039;&#039;&#039;old&#039;&#039;&#039; HOME:&amp;lt;br&amp;gt; &amp;lt;code&amp;gt;$ lfs quota -uh $USER /pfs/data5&amp;lt;/code&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Show user quota of the &#039;&#039;&#039;new&#039;&#039;&#039; HOME:&amp;lt;br&amp;gt; &amp;lt;code&amp;gt;$ lfs quota -uh $USER /pfs/data6&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the new file system, the limits for capacity and number of files must be higher than the capacity and the number of files used in the old file system in order to avoid I/O errors during data transfer. Pay attention to the respective &#039;&#039;used&#039;&#039;, &#039;&#039;files&#039;&#039;, and &#039;&#039;quota&#039;&#039; columns of the outputs.&lt;br /&gt;
&lt;br /&gt;
[[File:quotas-uc2.png|600px]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039; 2. Cleanup &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If the capacity limit or the maximum number of files is exceeded, now is the right time to clean up.&amp;lt;br&amp;gt;&lt;br /&gt;
Either delete data in the source directory before the rsync command or use additional &amp;lt;code&amp;gt;--exclude&amp;lt;/code&amp;gt; statements during rsync.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
If the file limit is exceeded, you should, for example, delete all existing Python virtual environments, which often contain a massive number of small files and which are not functional on the new system anyway.&lt;br /&gt;
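&lt;br /&gt;
Python venvs can usually be located via their &amp;lt;code&amp;gt;pyvenv.cfg&amp;lt;/code&amp;gt; marker file. A minimal sketch for finding and removing them on the old system; the path in the last line is a hypothetical example, review the list carefully before deleting anything:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Print the root directory of every venv below $HOME (GNU find)&lt;br /&gt;
find &amp;quot;$HOME&amp;quot; -name pyvenv.cfg -printf &#039;%h\n&#039;&lt;br /&gt;
# After reviewing the list, remove a venv that is no longer needed&lt;br /&gt;
rm -rf &amp;quot;$HOME/projects/old-venv&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
Note that conda environments are not found this way; they usually live below the directory of the conda installation instead.&lt;br /&gt;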
&lt;br /&gt;
&#039;&#039;&#039; 3. Migrate the data &#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
The easiest way to get a suitable rsync command that fits your needs is to use the output of &amp;lt;code&amp;gt;migrate_data_uc2_uc3.sh&amp;lt;/code&amp;gt; and, if necessary, to add further &amp;lt;code&amp;gt;--exclude&amp;lt;/code&amp;gt; statements, as shown in the sketch below.&lt;br /&gt;
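&lt;br /&gt;
For example, starting from the command printed by the script (cf. the verbose example above), additional directories can be skipped like this; the two extra excluded directory names are hypothetical:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Recommended command from the script plus two additional excludes&lt;br /&gt;
rsync -x --numeric-ids -S -rlptoD -H -A --exclude=&#039;/.*&#039; \&lt;br /&gt;
      --exclude=&#039;/old_results/&#039; --exclude=&#039;/tmp_data/&#039; \&lt;br /&gt;
      /pfs/data5/home/kit/scc/ab1234/ /home/ka/ka_scc/ka_ab1234/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;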
&lt;br /&gt;
== Migration of Workspaces ==&lt;br /&gt;
Show user quota of the &#039;&#039;&#039;old&#039;&#039;&#039; workspaces:&amp;lt;br&amp;gt; &amp;lt;code&amp;gt;$ lfs quota -uh $USER /pfs/work7&amp;lt;/code&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Show user quota of the &#039;&#039;&#039;new&#039;&#039;&#039; workspaces:&amp;lt;br&amp;gt; &amp;lt;code&amp;gt;$ lfs quota -uh $USER /pfs/work9&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the quota checks are successful, then an appropriate workspace needs to be created and the data transfer can be initiated:&amp;lt;br&amp;gt;&lt;br /&gt;
1. Create a new workspace with the same name as the old one, e.g. &amp;lt;code&amp;gt;ws_allocate demospace 10&amp;lt;/code&amp;gt;. If you do not remember your old workspace names, go to a login node of bwUniCluster 2.0 and execute the &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt; command.&amp;lt;br&amp;gt;&lt;br /&gt;
2. Transfer the data using the recommended command:&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;padding-left: 20px;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot; style=&amp;quot;width:100%;max-width:1200px; overflow:visible;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;font-weight:bold;line-height:1.6;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;migrate_data_uc2_uc3.sh --workspace demospace -v&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;Doing the actions for workspace &amp;quot;demospace&amp;quot;.&lt;br /&gt;
Found old workspace path (/pfs/work7/workspace/scratch/ej4555-demospace/).&lt;br /&gt;
Found new workspace path (/pfs/work9/workspace/scratch/ka_ej4555-demospace/).&lt;br /&gt;
Recommended command line for the rsync command:&lt;br /&gt;
rsync -x --numeric-ids -S -rlptoD -H -A --stats /pfs/work7/workspace/scratch/ej4555-demospace/ /pfs/work9/workspace/scratch/ka_ej4555-demospace/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&amp;lt;/div&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Migration of Software and Settings =&lt;br /&gt;
&lt;br /&gt;
We explicitly and intentionally exclude all dot files and dot directories (&amp;lt;code&amp;gt;.bashrc&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;.config/&amp;lt;/code&amp;gt;, ...) from the data migration helper script above. Our users should NOT migrate their old dot files and dot directories, which possibly contain settings not compatible with the new system. We recommend that you start with a new set of default configuration files and adapt them to your needs as required.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Settings and Configurations&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
The change to the new system will probably be most noticeable if the behavior of the Bash shell has been customized.&lt;br /&gt;
Please consult your old &#039;&#039;.bashrc&#039;&#039; file and copy aliases, bash functions or other settings you have defined there to the new &#039;&#039;.bashrc&#039;&#039; file.&lt;br /&gt;
Do not simply copy the old &#039;&#039;.bashrc&#039;&#039; file. Avoid moving settings that were made by conda in &#039;&#039;.bashrc&#039;&#039; (cf. [[Development/Conda#Conda_Installation|Conda Installation]]).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Python environments&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
Virtual Python environments such as venvs or conda environments should NOT be migrated to the new system. For one, it is very likely that the virtual environment will not be functional on the new system. Also, these environments usually contain a large number of small files for which any data movement on the parallel file system will provide only mediocre performance.&lt;br /&gt;
Fortunately, reinstalling or setting up Python environments is relatively easy if, for example, the use of &#039;&#039;requirements.txt&#039;&#039; is consistently followed. Please refer to the [[Development/Python#Best_Practice|Best Practice]] guidelines for the handling and usage of Python.&lt;br /&gt;
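&lt;br /&gt;
A minimal sketch of this workflow; the environment name and paths are examples, see the linked Best Practice page for the recommended setup:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# On bwUniCluster 2.0: record the packages of an existing venv&lt;br /&gt;
source ~/my_venv/bin/activate&lt;br /&gt;
pip freeze &amp;gt; ~/requirements.txt&lt;br /&gt;
deactivate&lt;br /&gt;
&lt;br /&gt;
# On bwUniCluster 3.0: recreate the environment from scratch&lt;br /&gt;
python3 -m venv ~/my_venv&lt;br /&gt;
source ~/my_venv/bin/activate&lt;br /&gt;
pip install -r ~/requirements.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;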
&lt;br /&gt;
&#039;&#039;&#039;User software&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
Application software that you have not compiled yourself must be reinstalled; the relevant installation file may be required for this.&lt;br /&gt;
Self-compiled software must be recompiled from the sources and installed in the HOME directory.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Containers&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
Containers such as Apptainer or Enroot can either be exported to an image or freshly downloaded and set up (cf. [[BwUniCluster3.0/Containers#Exporting_and_transfering_containers|Exporting and transferring containers]]).&lt;br /&gt;
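&lt;br /&gt;
For Apptainer, a container is a single &#039;&#039;.sif&#039;&#039; image file which can be copied like any other file. A minimal sketch; the image name, source and workspace paths are hypothetical:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Copy an existing image into a workspace on the new system ...&lt;br /&gt;
rsync -av myimage.sif /pfs/work9/workspace/scratch/ka_ab1234-containers/&lt;br /&gt;
# ... or rebuild it freshly on the new system&lt;br /&gt;
apptainer pull myimage.sif docker://ubuntu:24.04&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;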
&lt;br /&gt;
&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14588</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14588"/>
		<updated>2025-04-03T12:28:10Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $HOME */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the &#039;&#039;limit&#039;&#039; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days) your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit your I/O operations will abort. &lt;br /&gt;
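&lt;br /&gt;
Illustrative output of the quota command above; all values are hypothetical and the exact column layout may differ between Lustre versions. The &#039;&#039;quota&#039;&#039; columns show the soft limits for capacity and files, the &#039;&#039;limit&#039;&#039; columns the hard limits:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Disk quotas for usr ab1234 (uid 12345):&lt;br /&gt;
     Filesystem    used   quota   limit   grace   files   quota   limit   grace&lt;br /&gt;
          /home  120.5G    500G    550G       -  823456 5000000 5500000       -&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;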
&lt;br /&gt;
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Look up your organization&#039;s project ID in project_ids.txt and show its project quota&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; only works within the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. You can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag for this if needed, as in the example below.&lt;br /&gt;
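&lt;br /&gt;
Putting the steps together, a restore could look like this; the workspace names and the file system are examples:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# List expired workspaces to get the full name&lt;br /&gt;
ws_restore -l&lt;br /&gt;
# Create a target workspace on the same file system as the expired one&lt;br /&gt;
ws_allocate -F &amp;lt;filesystem&amp;gt; my_restored 30&lt;br /&gt;
# Restore the expired workspace into the target workspace&lt;br /&gt;
ws_restore ab1234-old_ws-1712345678 my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;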
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
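&lt;br /&gt;
For example, to maintain such links below a directory &amp;lt;code&amp;gt;~/workspaces&amp;lt;/code&amp;gt; (the name is an example), simply run the command again whenever your workspaces change:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register ~/workspaces&lt;br /&gt;
ls -l ~/workspaces&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;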
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases, users no longer need to adapt file striping parameters, even for very large files or to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should omit metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node store them on $TMPDIR&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user, you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace only has to be managed on one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD disk but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should then be done into the $HOME folder, as sketched below.&lt;br /&gt;
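&lt;br /&gt;
A minimal sketch of this pattern on a login node; the package name and build system are hypothetical:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create your own unique build directory on the local SSD&lt;br /&gt;
BUILDDIR=$TMPDIR/$(whoami)-build&lt;br /&gt;
mkdir -p $BUILDDIR&lt;br /&gt;
cd $BUILDDIR&lt;br /&gt;
&lt;br /&gt;
# Unpack, configure and compile on the fast local SSD ...&lt;br /&gt;
tar -xzf $HOME/mytool-1.0.tar.gz&lt;br /&gt;
cd mytool-1.0&lt;br /&gt;
./configure --prefix=$HOME/sw/mytool-1.0&lt;br /&gt;
make -j 8&lt;br /&gt;
&lt;br /&gt;
# ... but install the final result into the $HOME folder&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;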
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could be still using data below these directories. Hence the corresponding file systems could fill up and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is automatically done.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during the job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of the LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster can request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
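&lt;br /&gt;
A minimal sketch of staging data in and out within a job script. The request flag and the mount point are assumptions here; please check the link below for the actual syntax on this cluster:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH -t 08:00:00&lt;br /&gt;
#SBATCH --constraint=BEEOND   # assumed flag to request the on-demand file system&lt;br /&gt;
&lt;br /&gt;
# Assumed mount point of the private BeeOND file system&lt;br /&gt;
ODFS=/mnt/odfs/$SLURM_JOB_ID&lt;br /&gt;
&lt;br /&gt;
# Stage input data in from a workspace&lt;br /&gt;
cp -r $(ws_find mydata)/input $ODFS/&lt;br /&gt;
&lt;br /&gt;
# ... run the application with its working data on $ODFS ...&lt;br /&gt;
&lt;br /&gt;
# Copy results back before the job (and the file system) ends&lt;br /&gt;
rsync -av $ODFS/results $(ws_find mydata)/results-$SLURM_JOB_ID/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;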
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data of the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14587</id>
		<title>BwUniCluster3.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture&amp;diff=14587"/>
		<updated>2025-04-03T12:12:12Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Workspaces */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 3.0 =&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039; is a parallel computer with distributed memory. &lt;br /&gt;
It consists of the bwUniCluster 3.0 components procured in 2024 and also includes the additional compute nodes which were procured as an extension to the bwUniCluster 2.0 in 2022.&lt;br /&gt;
 &lt;br /&gt;
Each node of the system consists of two Intel Xeon or AMD EPYC processors, local memory, local storage, network adapters and optional accelerators (NVIDIA A100 and H100, AMD Instinct MI300A). All nodes are connected via a fast InfiniBand interconnect.&lt;br /&gt;
&lt;br /&gt;
The parallel file system (Lustre) is connected to the InfiniBand switch of the compute cluster. This provides a fast and scalable parallel file &lt;br /&gt;
system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 9.4.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system act in different roles. From an end user&#039;s point of view the different groups of nodes are login nodes and compute nodes. File server nodes and administrative server nodes are not accessible by users.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes directly accessible by end users. These nodes are used for interactive login, file management, program development, and interactive pre- and post-processing.&lt;br /&gt;
There are two nodes dedicated to this service, and both can be reached via a single address: &amp;lt;code&amp;gt;uc3.scc.kit.edu&amp;lt;/code&amp;gt;. A DNS round-robin alias distributes login sessions to the login nodes.&lt;br /&gt;
To prevent login nodes from being used for activities that are not permitted there and that affect the user experience of other users, &#039;&#039;&#039;long-running and/or compute-intensive tasks are periodically terminated without any prior warning&#039;&#039;&#039;. Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Systems&#039;&#039;&#039;&lt;br /&gt;
bwUniCluster 3.0 comprises two parallel file systems based on Lustre.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:uc3.png|center|800px]]&lt;br /&gt;
&lt;br /&gt;
= Compute Resources =&lt;br /&gt;
&lt;br /&gt;
== Login nodes ==&lt;br /&gt;
&lt;br /&gt;
After a successful [[BwUniCluster3.0/Login|login]], users find themselves on one of the so-called login nodes. Technically, these largely correspond to a standard CPU node, i.e. users have two AMD EPYC 9454 processors with a total of 96 cores at their disposal. Login nodes are the bridgehead for accessing computing resources.&lt;br /&gt;
Data and software are organized here, computing jobs are initiated and managed, and computing resources allocated for interactive use can also be accessed from here.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffa500; text-align:left&amp;quot;|&lt;br /&gt;
&#039;&#039;&#039;Any compute intensive job running on the login nodes will be terminated without any notice.&#039;&#039;&#039;&amp;lt;br/&amp;gt;&lt;br /&gt;
Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Compute nodes ==&lt;br /&gt;
All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script or they can be accessed interactively. Please refer to [[BwUniCluster3.0/Running_Jobs|Running Jobs]] on how to request resources.&amp;lt;br&amp;gt;&lt;br /&gt;
The following compute node types are available:&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt;CPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Standard&#039;&#039;&#039;: Two AMD EPYC 9454 processors per node with a total of 96 physical CPU cores or 192 logical cores (Hyper-Threading) per node. The nodes have been procured in 2024.&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake&#039;&#039;&#039;: Two Intel Xeon Platinum 8358 processors per node with a total of 64 physical CPU cores or 128 logical cores (Hyper-Threading) per node. The nodes have been procured in 2022 as an extension to bwUniCluster 2.0.&lt;br /&gt;
* &#039;&#039;&#039;High Memory&#039;&#039;&#039;: Similar to the standard nodes, but with six times larger memory.&lt;br /&gt;
&amp;lt;b&amp;gt;GPU nodes&amp;lt;/b&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;NVIDIA GPU x4&#039;&#039;&#039;: Similar to the standard nodes, but with larger memory and four NVIDIA H100 GPUs.&lt;br /&gt;
* &#039;&#039;&#039;AMD GPU x4&#039;&#039;&#039;: AMD&#039;s accelerated processing unit (APU) MI300A with 4 CPU sockets and 4 compute units which share the same high-bandwidth memory (HBM).&lt;br /&gt;
* &#039;&#039;&#039;Ice Lake NVIDIA GPU x4&#039;&#039;&#039;: Similar to the Ice Lake nodes, but with larger memory and four NVIDIA A100 or H100 GPUs.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Node Type&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Ice Lake&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;Standard&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| CPU nodes&amp;lt;br/&amp;gt;High Memory&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU node&amp;lt;br/&amp;gt;AMD GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| GPU nodes&amp;lt;br/&amp;gt;Ice Lake&amp;lt;br/&amp;gt;NVIDIA GPU x4&lt;br /&gt;
! style=&amp;quot;width:10%&amp;quot;| Login nodes&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Availability in [[BwUniCluster3.0/Running_Jobs#Queues_on_bwUniCluster_3.0| queues]]&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu_il&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;cpu&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_cpu&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_highmem&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_h100&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;dev_gpu_h100&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_mi300&amp;lt;/code&amp;gt;&lt;br /&gt;
| &amp;lt;code&amp;gt;gpu_a100_il&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;gpu_h100_il&amp;lt;/code&amp;gt;&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 272&lt;br /&gt;
| 70&lt;br /&gt;
| 4&lt;br /&gt;
| 12&lt;br /&gt;
| 1&lt;br /&gt;
| 15&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
| AMD Zen 4&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| AMD EPYC 9454&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.6 GHz &lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
| 3.7 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.75 GHz&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96&lt;br /&gt;
| 96 (4x 24)&lt;br /&gt;
| 64&lt;br /&gt;
| 96&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 2.3 TB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 4x 128 GB HBM3&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 3.84 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 15.36 TB NVMe&lt;br /&gt;
| 7.68 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe &lt;br /&gt;
| 7.68 TB SATA SSD&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA H100 &lt;br /&gt;
| 4x AMD Instinct MI300A&lt;br /&gt;
| 4x NVIDIA A100 / H100 &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 94 GB&lt;br /&gt;
| APU&lt;br /&gt;
| 80 GB / 94 GB &lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR200 &lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 4x NDR200&lt;br /&gt;
| IB 2x NDR200&lt;br /&gt;
| IB 2x HDR200 &lt;br /&gt;
| IB 1x NDR200&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Hardware overview and properties&lt;br /&gt;
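&lt;br /&gt;
For example, resources from the queues listed above can be requested as follows (times and resources are examples, see [[BwUniCluster3.0/Running_Jobs|Running Jobs]] for details):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Submit a batch script to the standard CPU queue&lt;br /&gt;
sbatch -p cpu -N 1 -t 04:00:00 myjob.sh&lt;br /&gt;
# Or start a short interactive session on a development node&lt;br /&gt;
salloc -p dev_cpu -N 1 -t 00:30:00&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;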
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 the following file systems are available:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;$HOME&#039;&#039;&#039;&amp;lt;br&amp;gt;The HOME directory is created automatically after account activation, and the environment variable $HOME holds its name. HOME is the place where users find themselves after login.&lt;br /&gt;
* &#039;&#039;&#039;Workspaces&#039;&#039;&#039;&amp;lt;br&amp;gt;Users can create so-called workspaces for non-permanent data with temporary lifetime. A further workspace type based on flash-only storage for special requirements is also available.&lt;br /&gt;
* &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039;&amp;lt;br&amp;gt;The directory $TMPDIR is only available and visible on the local node during the runtime of a compute job. It is located on fast SSD storage devices.&lt;br /&gt;
* &#039;&#039;&#039;LSDF Online Storage&#039;&#039;&#039;&amp;lt;br&amp;gt;On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. On the login nodes, LSDF is automatically mounted.&lt;br /&gt;
* &#039;&#039;&#039;BeeOND&#039;&#039;&#039; (BeeGFS On-Demand)&amp;lt;br&amp;gt;On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Which file system to use?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details|File System Details]]&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The $HOME directories of bwUniCluster 3.0 users are located on the parallel file system Lustre.&lt;br /&gt;
You have access to your $HOME directory from all nodes of UC3. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$HOME|Detailed information on $HOME]]&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On UC3 workspaces should be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On UC3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user.&lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces|Detailed information on Workspaces]]&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
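&lt;br /&gt;
A minimal sketch of the staging pattern mentioned above (the workspace path and the program name are placeholders):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=04:00:00&lt;br /&gt;
&lt;br /&gt;
# Stage the input data once from a workspace to the fast local SSD&lt;br /&gt;
cp -r /path/to/my_workspace/training_data ${TMPDIR}/&lt;br /&gt;
&lt;br /&gt;
# Read the data many times from the local SSD instead of the parallel file system&lt;br /&gt;
./my_training_program --input ${TMPDIR}/training_data --output ${TMPDIR}/results&lt;br /&gt;
&lt;br /&gt;
# Copy the results back to a globally visible file system before the job ends&lt;br /&gt;
cp -r ${TMPDIR}/results /path/to/my_workspace/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;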
&lt;br /&gt;
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$TMPDIR|Detailed information on $TMPDIR]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during the job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_3.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of the LSDF batch usage: [[bwUniCluster_3.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
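As an illustration only, a batch job could stage data from an LSDF storage project like this (the project and file names are placeholders):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=30&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&lt;br /&gt;
# $LSDFPROJECTS points to the top-level directory of the LSDF storage projects&lt;br /&gt;
cp ${LSDFPROJECTS}/myproject/input.dat ${TMPDIR}/&lt;br /&gt;
./my_program ${TMPDIR}/input.dat ${TMPDIR}/result.dat&lt;br /&gt;
cp ${TMPDIR}/result.dat ${LSDFPROJECTS}/myproject/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;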
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after your job has ended.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
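&lt;br /&gt;
A minimal sketch of this copy-in/copy-out pattern, assuming the on-demand file system is mounted under /mnt/odfs/${SLURM_JOB_ID} and requested with the constraint flag BEEOND as on bwUniCluster 2.0 (paths and the program name are placeholders):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=4&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --constraint=BEEOND&lt;br /&gt;
&lt;br /&gt;
ODFS=/mnt/odfs/${SLURM_JOB_ID}&lt;br /&gt;
# Copy the input data into the private parallel file system&lt;br /&gt;
cp -r /path/to/my_workspace/input ${ODFS}/&lt;br /&gt;
# Run the application on the on-demand file system&lt;br /&gt;
./my_parallel_program ${ODFS}/input ${ODFS}/output&lt;br /&gt;
# Copy the results back before the job ends; all BeeOND data is deleted afterwards&lt;br /&gt;
rsync -a ${ODFS}/output/ /path/to/my_workspace/output/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;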
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_3.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- =Backup and Archiving=&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data of the home directories, whereas ACLs and extended attributes will&lt;br /&gt;
not be backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Slurm&amp;diff=14586</id>
		<title>BwUniCluster2.0/Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Slurm&amp;diff=14586"/>
		<updated>2025-04-03T12:05:20Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* BeeOND (BeeGFS On-Demand) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div id=&amp;quot;top&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
=  Slurm HPC Workload Manager = &lt;br /&gt;
== Specification == &lt;br /&gt;
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Any kind of calculation on the compute nodes of [[bwUniCluster 2.0|bwUniCluster 2.0]] requires the user to define the calculation as a sequence of commands or a single command, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e., the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to a resource and workload managing software. bwUniCluster 2.0 uses the workload managing software Slurm. Therefore any job submission by the user is to be executed by commands of the Slurm software. Slurm queues and runs user jobs based on fair sharing policies.  &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
Some of the most used Slurm commands for non-administrators working on bwUniCluster 2.0.&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Job Submission : sbatch|sbatch]] || Submits a job and queues it in an input queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Start time of job or resources : squeue --start|squeue --start]] || Returns start time of submitted job or requested resources [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Job Submission : sbatch ==&lt;br /&gt;
Batch jobs are submitted by using the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. The main purpose of the &#039;&#039;&#039;sbatch&#039;&#039;&#039; command is to specify the resources that are needed to run the job. &#039;&#039;&#039;sbatch&#039;&#039;&#039; will then queue the batch job. However, the start of the batch job depends on the availability of the requested resources and the fair sharing value.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== sbatch Command Parameters ===&lt;br /&gt;
The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be used from the command line or in your job script.&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! Command line&lt;br /&gt;
! Script&lt;br /&gt;
! Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t &#039;&#039;time&#039;&#039;  or  --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N &#039;&#039;count&#039;&#039;  or  --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n &#039;&#039;count&#039;&#039;  or  --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count (&amp;lt;= 28 and &amp;lt;= 40 resp.) of tasks per node.&amp;lt;br&amp;gt;(Replaces the option ppn of MOAB.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c &#039;&#039;count&#039;&#039; or --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in MegaByte per node.&amp;lt;br&amp;gt;(Default value is 128000 and 96000 MB resp., i.e. you should omit &amp;lt;br&amp;gt; the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum Memory required per allocated CPU.&amp;lt;br&amp;gt;(Replaces the option pmem of MOAB. You should omit &amp;lt;br&amp;gt; the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state&amp;lt;br&amp;gt;changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J &#039;&#039;name&#039;&#039; or --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission &amp;lt;br&amp;gt; environment are propagated to the launched application. Default &amp;lt;br&amp;gt; is ALL. If adding an environment variable to the submission&amp;lt;br&amp;gt; environment is intended, the argument ALL must be added.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A &#039;&#039;group-name&#039;&#039; or --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Charge resources used by this job to the specified group. You may &amp;lt;br&amp;gt; need this option if your account is assigned to more &amp;lt;br&amp;gt; than one group. With the command &amp;quot;scontrol show job&amp;quot; the project &amp;lt;br&amp;gt; group the job is accounted on is shown behind &amp;quot;Account=&amp;quot;. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p &#039;&#039;queue-name&#039;&#039; or --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C &#039;&#039;LSDF&#039;&#039; or --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Job constraint LSDF Filesystems.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C &#039;&#039;BEEOND&#039;&#039; (or &#039;&#039;BEEOND_4MDS&#039;&#039;, &#039;&#039;BEEOND_MAXMDS&#039;&#039;) or --constraint=&#039;&#039;BEEOND&#039;&#039; (or &#039;&#039;BEEOND_4MDS&#039;&#039;, &#039;&#039;BEEOND_MAXMDS&#039;&#039;)&lt;br /&gt;
| #SBATCH --constraint=BEEOND (or BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Job constraint BeeOND file system.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== sbatch --partition  &#039;&#039;queues&#039;&#039; ====&lt;br /&gt;
Queue classes define the maximum resources per queue of the compute system, such as walltime, nodes and processes per node. Details can be found here:&lt;br /&gt;
* [[BwUniCluster_2.0_Batch_Queues#sbatch_-p_queue|bwUniCluster 2.0 queue settings]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== sbatch Examples ===&lt;br /&gt;
==== Serial Programs ====&lt;br /&gt;
To submit a serial job that runs the script &#039;&#039;&#039;job.sh&#039;&#039;&#039; and that requires 5000 MB of main memory and 10 minutes of wall clock time&lt;br /&gt;
&lt;br /&gt;
a) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p dev_single -n 1 -t 10:00 --mem=5000  job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or&lt;br /&gt;
b) add after the initial line of your script &#039;&#039;&#039;job.sh&#039;&#039;&#039; the lines (here with a high memory request):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=10&lt;br /&gt;
#SBATCH --mem=180gb&lt;br /&gt;
#SBATCH --job-name=simple&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
and execute the modified script with the command line option &#039;&#039;--partition=fat&#039;&#039; (with &#039;&#039;--partition=(dev_)single&#039;&#039; maximum &#039;&#039;--mem=96gb&#039;&#039; is possible):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch --partition=fat job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note, that sbatch command line options overrule script options.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Multithreaded Programs ====&lt;br /&gt;
Multithreaded programs operate faster than serial programs on CPUs with multiple cores.&amp;lt;br&amp;gt;&lt;br /&gt;
Moreover, multiple threads of one process share resources such as memory.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For multithreaded programs based on &#039;&#039;&#039;Open&#039;&#039;&#039; &#039;&#039;&#039;M&#039;&#039;&#039;ulti-&#039;&#039;&#039;P&#039;&#039;&#039;rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Because hyperthreading is switched on on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n, if you want to use n threads.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To submit a batch job called &#039;&#039;OpenMP_Test&#039;&#039; that runs a 40-fold threaded program &#039;&#039;omp_exe&#039;&#039; which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p single --export=ALL,OMP_NUM_THREADS=40 -J OpenMP_Test -N 1 -c 80 -t 40 --mem=6000 ./omp_exe&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or&lt;br /&gt;
* generate the script &#039;&#039;&#039;job_omp.sh&#039;&#039;&#039; containing the following lines:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --cpus-per-task=80&lt;br /&gt;
#SBATCH --time=40:00&lt;br /&gt;
#SBATCH --mem=6000mb   &lt;br /&gt;
#SBATCH --export=ALL,EXECUTABLE=./omp_exe&lt;br /&gt;
#SBATCH -J OpenMP_Test&lt;br /&gt;
&lt;br /&gt;
#Usually you should set&lt;br /&gt;
export KMP_AFFINITY=compact,1,0&lt;br /&gt;
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity&lt;br /&gt;
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE&lt;br /&gt;
&lt;br /&gt;
export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))&lt;br /&gt;
echo &amp;quot;Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads&amp;quot;&lt;br /&gt;
startexe=${EXECUTABLE}&lt;br /&gt;
echo $startexe&lt;br /&gt;
exec $startexe&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
When using the Intel compiler, the environment variable KMP_AFFINITY switches on the binding of threads to specific cores. If necessary, replace &amp;lt;placeholder&amp;gt; with the required modulefile to enable the OpenMP environment. Then execute the script &#039;&#039;&#039;job_omp.sh&#039;&#039;&#039;, adding the queue class &#039;&#039;single&#039;&#039; as sbatch option:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p single job_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note, that sbatch command line options overrule script options, e.g.,&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch --partition=single --mem=200 job_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
overwrites the script setting of 6000 MByte with 200 MByte.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== MPI Parallel Programs ====&lt;br /&gt;
MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., &#039;&#039;&#039;MPI tasks&#039;&#039;&#039;,  run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Multiple MPI tasks must be launched via &#039;&#039;&#039;mpirun&#039;&#039;&#039;, e.g. 4 MPI tasks of &#039;&#039;my_par_program&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 4 my_par_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This command runs 4 MPI tasks of &#039;&#039;my_par_program&#039;&#039; on the node you are logged in to.&lt;br /&gt;
To run this command with a loaded Intel MPI, the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset ($ unset I_MPI_HYDRA_BOOTSTRAP).&lt;br /&gt;
&lt;br /&gt;
When running MPI parallel programs in a batch job, the interactive environment - particularly the loaded modules - will also be set in the batch job. If you want a defined module environment in your batch job, you have to purge all modules before loading the desired modules, as sketched below. &lt;br /&gt;
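For example (the module names are placeholders):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
module purge                                 # remove all modules inherited from the submission shell&lt;br /&gt;
module load compiler/&amp;lt;compiler&amp;gt;/&amp;lt;version&amp;gt;   # load only the modules the job really needs&lt;br /&gt;
module load mpi/&amp;lt;mpi&amp;gt;/&amp;lt;version&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;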
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
===== OpenMPI =====&lt;br /&gt;
&lt;br /&gt;
If you want to run jobs on batch nodes, generate a wrapper script &#039;&#039;job_ompi.sh&#039;&#039; for &#039;&#039;&#039;OpenMPI&#039;&#039;&#039; containing the following lines:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Use when using the module environment for OpenMPI&lt;br /&gt;
module load compiler/&amp;lt;placeholder_for_compiler&amp;gt;/&amp;lt;placeholder_for_compiler_version&amp;gt;&lt;br /&gt;
module load mpi/openmpi/&amp;lt;placeholder_for_mpi_version&amp;gt;&lt;br /&gt;
mpirun --bind-to core --map-by core -report-bindings my_par_program&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Attention:&#039;&#039;&#039; Do &#039;&#039;&#039;NOT&#039;&#039;&#039; add mpirun options &#039;&#039;-n &amp;lt;number_of_processes&amp;gt;&#039;&#039; or any other option defining processes or nodes, since Slurm instructs mpirun about the number of processes and the node hostnames. &#039;&#039;&#039;ALWAYS&#039;&#039;&#039; use the MPI options &#039;&#039;&#039;&#039;&#039;--bind-to core&#039;&#039;&#039;&#039;&#039; and &#039;&#039;&#039;&#039;&#039;--map-by core|socket|node&#039;&#039;&#039;&#039;&#039;. Please type &#039;&#039;mpirun --help&#039;&#039; for an explanation of the different arguments of the &#039;&#039;--map-by&#039;&#039; option.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To launch 4 OpenMPI tasks on a single node, each task requiring 2000 MByte, with a wall time of 1 hour, execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p single -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===== Intel MPI =====&lt;br /&gt;
&lt;br /&gt;
Generate a wrapper script for &#039;&#039;&#039;Intel MPI&#039;&#039;&#039;, &#039;&#039;job_impi.sh&#039;&#039; containing the following lines:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Use when a defined module environment related to Intel MPI is wished&lt;br /&gt;
module load compiler/&amp;lt;placeholder_for_compiler&amp;gt;/&amp;lt;placeholder_for_compiler_version&amp;gt;&lt;br /&gt;
module load mpi/impi/&amp;lt;placeholder_for_version&amp;gt;   &lt;br /&gt;
mpiexec.hydra -bootstrap slurm my_par_program&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;font color=red&amp;gt;&#039;&#039;&#039;Attention:&#039;&#039;&#039;&amp;lt;/font&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Do &#039;&#039;&#039;NOT&#039;&#039;&#039; add mpirun options &#039;&#039;-n &amp;lt;number_of_processes&amp;gt;&#039;&#039; or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To launch and run 200 Intel MPI tasks on 5 nodes, each node requiring 80 GByte of memory, with a wall time of 5 hours, execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch --partition=multiple -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to use 128 or more nodes, you must also set the environment variable as follows:           &amp;lt;BR&amp;gt;&lt;br /&gt;
export I_MPI_HYDRA_BRANCH_COUNT=-1&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Multithreaded + MPI parallel Programs ====&lt;br /&gt;
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes. &#039;&#039;&#039;Because hyperthreading is switched on on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n, if you want to use n threads.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
===== OpenMPI with Multithreading =====&lt;br /&gt;
Multiple MPI tasks using &#039;&#039;&#039;OpenMPI&#039;&#039;&#039; must be launched by the MPI parallel program &#039;&#039;&#039;mpirun&#039;&#039;&#039;. For multithreaded programs based on &#039;&#039;&#039;Open&#039;&#039;&#039; &#039;&#039;&#039;M&#039;&#039;&#039;ulti-&#039;&#039;&#039;P&#039;&#039;&#039;rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;For OpenMPI&#039;&#039;&#039; a job-script to submit a batch job called &#039;&#039;job_ompi_omp.sh&#039;&#039; that runs an MPI program with 4 tasks and a 28-fold threaded program &#039;&#039;ompi_omp_program&#039;&#039; requiring 3000 MByte of physical memory per thread (using 28 threads per MPI task you will get 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours looks like:&lt;br /&gt;
&amp;lt;!--b)--&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=4&lt;br /&gt;
#SBATCH --cpus-per-task=56&lt;br /&gt;
#SBATCH --time=03:00:00&lt;br /&gt;
#SBATCH --mem=83gb    # 84000 MB = 84000/1024 GiB ~ 82.03 GiB, rounded up&lt;br /&gt;
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program&lt;br /&gt;
#SBATCH --output=&amp;quot;parprog_hybrid_%j.out&amp;quot;  &lt;br /&gt;
&lt;br /&gt;
# Use when a defined module environment related to OpenMPI is wished&lt;br /&gt;
module load ${MPI_MODULE}&lt;br /&gt;
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))&lt;br /&gt;
export MPIRUN_OPTIONS=&amp;quot;--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings&amp;quot;&lt;br /&gt;
export NUM_CORES=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))   # arithmetic expansion, not the literal string&lt;br /&gt;
echo &amp;quot;${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads&amp;quot;&lt;br /&gt;
startexe=&amp;quot;mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}&amp;quot;&lt;br /&gt;
echo $startexe&lt;br /&gt;
exec $startexe&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Execute the script &#039;&#039;&#039;job_ompi_omp.sh&#039;&#039;&#039; by command sbatch:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p multiple ./job_ompi_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* With the mpirun option &#039;&#039;--bind-to core&#039;&#039; MPI tasks and OpenMP threads are bound to physical cores.&lt;br /&gt;
* With the option &#039;&#039;--map-by node:PE=&amp;lt;value&amp;gt;&#039;&#039; (neighbored) MPI tasks will be attached to different nodes and each MPI task is bound to the first core of a node. &amp;lt;value&amp;gt; must be set to ${OMP_NUM_THREADS}.&lt;br /&gt;
* The option &#039;&#039;-report-bindings&#039;&#039; shows the bindings between MPI tasks and physical cores.&lt;br /&gt;
* The mpirun-options &#039;&#039;&#039;--bind-to core&#039;&#039;&#039;, &#039;&#039;&#039;--map-by socket|...|node:PE=&amp;lt;value&amp;gt;&#039;&#039;&#039; should always be used when running a multithreaded MPI program.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===== Intel MPI with Multithreading =====&lt;br /&gt;
&lt;br /&gt;
Multiple Intel MPI tasks must be launched by the MPI parallel program &#039;&#039;&#039;mpiexec.hydra&#039;&#039;&#039;. For multithreaded programs based on &#039;&#039;&#039;Open&#039;&#039;&#039; &#039;&#039;&#039;M&#039;&#039;&#039;ulti-&#039;&#039;&#039;P&#039;&#039;&#039;rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;For Intel MPI&#039;&#039;&#039; a job-script to submit a batch job called &#039;&#039;job_impi_omp.sh&#039;&#039; that runs an Intel MPI program with 10 tasks and a 40-fold threaded program &#039;&#039;impi_omp_program&#039;&#039; requiring 96000 MByte of total physical memory per task and a total wall clock time of 1 hour looks like: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--b)--&amp;gt; &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=10&lt;br /&gt;
#SBATCH --cpus-per-task=80&lt;br /&gt;
#SBATCH --time=60&lt;br /&gt;
#SBATCH --mem=96000&lt;br /&gt;
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program&lt;br /&gt;
#SBATCH --output=&amp;quot;parprog_impi_omp_%j.out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
#If using more than one MPI task per node please set&lt;br /&gt;
export KMP_AFFINITY=compact,1,0&lt;br /&gt;
#export KMP_AFFINITY=verbose,scatter  prints messages concerning the supported affinity &lt;br /&gt;
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE&lt;br /&gt;
&lt;br /&gt;
# Use when a defined module environment related to Intel MPI is wished &lt;br /&gt;
module load ${MPI_MODULE}&lt;br /&gt;
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))&lt;br /&gt;
export MPIRUN_OPTIONS=&amp;quot;-binding domain=omp:compact -print-rank-map -envall&amp;quot;   # no inner quotes needed: the value contains no spaces&lt;br /&gt;
export NUM_PROCS=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))&lt;br /&gt;
echo &amp;quot;${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads&amp;quot;&lt;br /&gt;
startexe=&amp;quot;mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}&amp;quot;&lt;br /&gt;
echo $startexe&lt;br /&gt;
exec $startexe&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
When using the Intel compiler, the environment variable KMP_AFFINITY switches on the binding of threads to specific cores. If you only run one MPI task per node, please set KMP_AFFINITY=compact,1,0.&lt;br /&gt;
&amp;lt;BR&amp;gt;&lt;br /&gt;
If you want to use 128 or more nodes, you must also set the environment variable as follows:           &amp;lt;BR&amp;gt;&lt;br /&gt;
export I_MPI_HYDRA_BRANCH_COUNT=-1&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Execute the script &#039;&#039;&#039;job_impi_omp.sh&#039;&#039;&#039; by command sbatch:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p multiple ./job_impi_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The mpirun option &#039;&#039;-print-rank-map&#039;&#039; shows the bindings between MPI tasks and nodes (not very beneficial). The option &#039;&#039;-binding&#039;&#039; binds MPI tasks (processes) to a particular processor; &#039;&#039;domain=omp&#039;&#039; means that the domain size is determined by the number of threads. If you would choose 2 MPI tasks per node, you should choose &#039;&#039;-binding &amp;quot;cell=unit;map=bunch&amp;quot;&#039;&#039;; this binding maps one MPI process to each socket. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Chain jobs ====&lt;br /&gt;
The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem by a job chain. A job chain is a sequence of jobs where each job automatically starts its successor. &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
##################################################&lt;br /&gt;
## Simple Slurm submitter script to set up      ##&lt;br /&gt;
## a chain of jobs using Slurm                  ##&lt;br /&gt;
##################################################&lt;br /&gt;
## ver.  : 2018-11-27, KIT, SCC&lt;br /&gt;
&lt;br /&gt;
## Define maximum number of jobs via positional parameter 1, default is 5&lt;br /&gt;
max_nojob=${1:-5}&lt;br /&gt;
&lt;br /&gt;
## Define your jobscript (e.g. &amp;quot;~/chain_job.sh&amp;quot;)&lt;br /&gt;
chain_link_job=${PWD}/chain_job.sh&lt;br /&gt;
&lt;br /&gt;
## Define type of dependency via positional parameter 2, default is &#039;afterok&#039;&lt;br /&gt;
dep_type=&amp;quot;${2:-afterok}&amp;quot;&lt;br /&gt;
## -&amp;gt; List of all dependencies:&lt;br /&gt;
## https://slurm.schedmd.com/sbatch.html&lt;br /&gt;
&lt;br /&gt;
myloop_counter=1&lt;br /&gt;
## Submit loop&lt;br /&gt;
while [ ${myloop_counter} -le ${max_nojob} ] ; do&lt;br /&gt;
   ##&lt;br /&gt;
   ## Differ slurm_opt depending on chain link number&lt;br /&gt;
   if [ ${myloop_counter} -eq 1 ] ; then&lt;br /&gt;
      slurm_opt=&amp;quot;&amp;quot;&lt;br /&gt;
   else&lt;br /&gt;
      slurm_opt=&amp;quot;-d ${dep_type}:${jobID}&amp;quot;&lt;br /&gt;
   fi&lt;br /&gt;
   ##&lt;br /&gt;
   ## Print current iteration number and sbatch command&lt;br /&gt;
   echo &amp;quot;Chain job iteration = ${myloop_counter}&amp;quot;&lt;br /&gt;
   echo &amp;quot;   sbatch --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}&amp;quot;&lt;br /&gt;
   ## Store the job ID for the next iteration by parsing the output of the sbatch command&lt;br /&gt;
   jobID=$(sbatch -p &amp;lt;queue&amp;gt; --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2&amp;gt;&amp;amp;1 | sed &#039;s/[S,a-z]* //g&#039;)&lt;br /&gt;
   ##   &lt;br /&gt;
   ## Check if ERROR occured&lt;br /&gt;
   if [[ &amp;quot;${jobID}&amp;quot; =~ &amp;quot;ERROR&amp;quot; ]] ; then&lt;br /&gt;
      echo &amp;quot;   -&amp;gt; submission failed!&amp;quot; ; exit 1&lt;br /&gt;
   else&lt;br /&gt;
      echo &amp;quot;   -&amp;gt; job number = ${jobID}&amp;quot;&lt;br /&gt;
   fi&lt;br /&gt;
   ##&lt;br /&gt;
   ## Increase counter&lt;br /&gt;
   let myloop_counter+=1&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
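The jobscript &#039;&#039;chain_job.sh&#039;&#039; itself is not shown above; a minimal sketch of what such a chain link could look like (the restart handling is application specific and only indicated):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=10&lt;br /&gt;
&lt;br /&gt;
echo &amp;quot;This is chain link number ${myloop_counter}&amp;quot;&lt;br /&gt;
# Continue from the restart file written by the previous chain link, e.g.:&lt;br /&gt;
# ./my_program --restart restart_$((myloop_counter-1)).dat&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;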
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== GPU jobs ====&lt;br /&gt;
&lt;br /&gt;
The nodes in the gpu_4 and gpu_8 queues have 4 or 8 NVIDIA Tesla V100 GPUs, respectively. Just submitting a job to these queues is not enough to also allocate one or more GPUs; you have to do so using the &amp;quot;--gres=gpu&amp;quot; parameter. You have to specify how many GPUs your job needs, e.g. &amp;quot;--gres=gpu:2&amp;quot; will request two GPUs.&lt;br /&gt;
&lt;br /&gt;
The GPU nodes are shared between multiple jobs if the jobs don&#039;t request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.&lt;br /&gt;
&lt;br /&gt;
a) add after the initial line of your script job.sh the line including the&lt;br /&gt;
information about the GPU usage:&amp;lt;br&amp;gt;   #SBATCH --gres=gpu:2&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=40&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --mem=4000&lt;br /&gt;
#SBATCH --gres=gpu:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or b) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
If you start an interactive session on one of the GPU nodes, you can use the &amp;quot;nvidia-smi&amp;quot; command to list the GPUs allocated to your job:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ nvidia-smi&lt;br /&gt;
Sun Mar 29 15:20:05 2020       &lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |&lt;br /&gt;
|-------------------------------+----------------------+----------------------+&lt;br /&gt;
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |&lt;br /&gt;
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |&lt;br /&gt;
|===============================+======================+======================|&lt;br /&gt;
|   0  Tesla V100-SXM2...  Off  | 00000000:3A:00.0 Off |                    0 |&lt;br /&gt;
| N/A   29C    P0    39W / 300W |      9MiB / 32510MiB |      0%      Default |&lt;br /&gt;
+-------------------------------+----------------------+----------------------+&lt;br /&gt;
|   1  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |&lt;br /&gt;
| N/A   30C    P0    41W / 300W |      8MiB / 32510MiB |      0%      Default |&lt;br /&gt;
+-------------------------------+----------------------+----------------------+&lt;br /&gt;
                                                                               &lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
| Processes:                                                       GPU Memory |&lt;br /&gt;
|  GPU       PID   Type   Process name                             Usage      |&lt;br /&gt;
|=============================================================================|&lt;br /&gt;
|    0     14228      G   /usr/bin/X                                     8MiB |&lt;br /&gt;
|    1     14228      G   /usr/bin/X                                     8MiB |&lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI&#039;s BTL) is CUDA-aware.&lt;br /&gt;
However, there may be warnings, e.g. when running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load compiler/gnu/10.3 mpi/openmpi devel/cuda&lt;br /&gt;
$ mpirun -np 2 ./mpi_cuda_app&lt;br /&gt;
--------------------------------------&lt;br /&gt;
WARNING: There are more than one active ports on host &#039;uc2n520&#039;, but the&lt;br /&gt;
default subnet GID prefix was detected on more than one of these&lt;br /&gt;
ports.  If these ports are connected to different physical IB&lt;br /&gt;
networks, this configuration will fail in Open MPI.  This version of&lt;br /&gt;
Open MPI requires that every physically separate IB subnet that is&lt;br /&gt;
used between connected MPI processes must have different subnet ID&lt;br /&gt;
values.&lt;br /&gt;
&lt;br /&gt;
Please see this FAQ entry for more details:&lt;br /&gt;
&lt;br /&gt;
  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid&lt;br /&gt;
&lt;br /&gt;
NOTE: You can turn off this warning by setting the MCA parameter&lt;br /&gt;
      btl_openib_warn_default_gid_prefix to 0.&lt;br /&gt;
--------------------------------------------------------------------------&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please run Open MPI&#039;s mpirun using the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or by disabling the (older) communication layer, the Byte Transfer Layer (short BTL), altogether:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(Please note that CUDA as of v11.4 supports GCC only up to version 10.)&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== LSDF Online Storage ====&lt;br /&gt;
On bwUniCluster 2.0 you can, for special cases, use the LSDF Online Storage on the HPC cluster nodes. Please request this service separately ([https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request]).&lt;br /&gt;
To mount the LSDF Online Storage on the compute nodes during the job runtime,&lt;br /&gt;
the constraint flag &amp;quot;LSDF&amp;quot; has to be set.  &lt;br /&gt;
&lt;br /&gt;
a) add after the initial line of your script job.sh the line including the&lt;br /&gt;
information about the LSDF Online Storage usage:&amp;lt;br&amp;gt;   #SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=120&lt;br /&gt;
#SBATCH --mem=200&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or b) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage&lt;br /&gt;
the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====BeeOND (BeeGFS On-Demand)====&lt;br /&gt;
&lt;br /&gt;
Starting and stopping BeeOND is integrated in the prolog and epilog of the cluster batch system Slurm. It can be used during job runtime if the compute nodes are exclusively used. You can request the creation of a BeeOND file system with the constraint flags &amp;quot;BEEOND&amp;quot;, &amp;quot;BEEOND_4MDS&amp;quot; or &amp;quot;BEEOND_MAXMDS&amp;quot; ([[BwUniCluster_2.0_Slurm_common_Features#sbatch_Command_Parameters|Slurm Command Parameters]]):&lt;br /&gt;
* BEEOND: one metadata server is started on the first node&lt;br /&gt;
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.&lt;br /&gt;
* BEEOND_MAXMDS: on every node of your job a metadata server for the on-demand file system is started&lt;br /&gt;
&lt;br /&gt;
As a starting point we recommend using the &amp;quot;BEEOND&amp;quot; option. If you are unsure whether this is sufficient for you, feel free to contact the support team.&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=BEEOND   # or BEEOND_4MDS or BEEOND_MAXMDS&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After your job has started, you can find the private on-demand file system in the directory &#039;&#039;&#039;/mnt/odfs/${SLURM_JOB_ID}&#039;&#039;&#039;. The mountpoint comes with five pre-configured directories:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# For small files (stripe count = 1)&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_1&lt;br /&gt;
# Stripe count = 4&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_default &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_4&lt;br /&gt;
# Stripe count = 8, 16 or 32; use these directories for medium-sized and large files or when using MPI-IO&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_8&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_16 &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_32&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you request fewer nodes than the stripe count, the stripe count will be the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 has only a stripe count of 8.&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;font color=red&amp;gt;&#039;&#039;&#039;Attention:&#039;&#039;&#039;&amp;lt;/font&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
:Be careful when creating large files: It is recommended to use the directory with the maximum stripe count for large files. For example, if your largest file is 1.1 TB, then you have to use a stripe count of at least 2 (a single 750 GByte SSD cannot hold the file), otherwise the local disk space is exceeded.  &lt;br /&gt;
&lt;br /&gt;
The capacity of the private file system depends on the number of nodes. For each node you get 750 GByte.&lt;br /&gt;
If you request 100 nodes for your job, the private file system has a capacity of 100 * 750 GByte = 75 TByte.&lt;br /&gt;
&lt;br /&gt;
== Start time of job or resources : squeue --start ==&lt;br /&gt;
The command can be used by any user to display the estimated start time of a job, based on historical usage, the earliest available reservable resources, and the priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue). &lt;br /&gt;
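For example, to display the estimated start times of all your pending jobs, or of one specific job:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue --start&lt;br /&gt;
$ squeue --start -j &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;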
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
By default, this command can be run by &#039;&#039;&#039;any user&#039;&#039;&#039;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== List of your submitted jobs : squeue ==&lt;br /&gt;
Displays information about YOUR active, pending and/or recently completed jobs; other users&#039; jobs are not shown. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
By default, this command can be run by any user.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Flags ===&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Flag !! Description&lt;br /&gt;
|-&lt;br /&gt;
| -l, --long&lt;br /&gt;
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&#039;&#039;squeue&#039;&#039; example on bwUniCluster 2.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
          18088744    single CPV.sbat   ab1234 PD       0:00      1 (Priority)&lt;br /&gt;
          18098414  multiple CPV.sbat   ab1234 PD       0:00      2 (Priority) &lt;br /&gt;
          18090089  multiple CPV.sbat   ab1234  R       2:27      2 uc2n[127-128]&lt;br /&gt;
$ squeue -l&lt;br /&gt;
            JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON) &lt;br /&gt;
         18088654    single CPV.sbat   ab1234 COMPLETI       4:29   2:00:00      1 uc2n374&lt;br /&gt;
         18088785    single CPV.sbat   ab1234  PENDING       0:00   2:00:00      1 (Priority)&lt;br /&gt;
         18098414  multiple CPV.sbat   ab1234  PENDING       0:00   2:00:00      2 (Priority)&lt;br /&gt;
         18088683    single CPV.sbat   ab1234  RUNNING       0:14   2:00:00      1 uc2n413  &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* The output of &#039;&#039;squeue&#039;&#039; shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Shows free resources : sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
By default, this command can be used by any user or administrator. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Example ===&lt;br /&gt;
* The following command displays which resources are available for immediate use in each partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle&lt;br /&gt;
Partition dev_multiple  :      8 nodes idle&lt;br /&gt;
Partition multiple      :    332 nodes idle&lt;br /&gt;
Partition dev_single    :      4 nodes idle&lt;br /&gt;
Partition single        :     76 nodes idle&lt;br /&gt;
Partition long          :     80 nodes idle&lt;br /&gt;
Partition fat           :      5 nodes idle&lt;br /&gt;
Partition dev_special   :    342 nodes idle&lt;br /&gt;
Partition special       :    342 nodes idle&lt;br /&gt;
Partition dev_multiple_e:      7 nodes idle&lt;br /&gt;
Partition multiple_e    :    335 nodes idle&lt;br /&gt;
Partition gpu_4         :     12 nodes idle&lt;br /&gt;
Partition gpu_8         :      6 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* In the above example, jobs can start immediately in all partitions.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Detailed job information : scontrol show job ==&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
* End users can use scontrol show job to view the status of their &#039;&#039;&#039;own jobs&#039;&#039;&#039; only. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Arguments ===&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Option !! Default !! Description !! Example&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:12%;&amp;quot; &lt;br /&gt;
| -d&lt;br /&gt;
| (n/a)&lt;br /&gt;
| Detailed mode&lt;br /&gt;
| Example: Display the state with jobid 18089884 in detailed mode. &amp;lt;br&amp;gt; &amp;lt;pre&amp;gt;scontrol -d show job 18089884&amp;lt;/pre&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Scontrol show job Example ===&lt;br /&gt;
Here is an example from bwUniCluster 2.0.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
squeue    # show my own jobs (here the userid is replaced!)&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
          18089884  multiple CPV.sbat   bq0742  R      33:44      2 uc2n[165-166]&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my running job with jobid 18089884&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 18089884&lt;br /&gt;
&lt;br /&gt;
JobId=18089884 JobName=CPV.sbatch&lt;br /&gt;
   UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=3 Nice=0 Account=kit QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54&lt;br /&gt;
   AccrueTime=2020-03-16T14:14:54&lt;br /&gt;
   StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51&lt;br /&gt;
   Partition=multiple AllocNode:Sid=uc2n995:5064&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc2n[165-166]&lt;br /&gt;
   BatchHost=uc2n165&lt;br /&gt;
   NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1&lt;br /&gt;
   TRES=cpu=160,mem=96320M,node=2,billing=160&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch&lt;br /&gt;
   WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin&lt;br /&gt;
   StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out&lt;br /&gt;
   Power=&lt;br /&gt;
   MailUser=(null) MailType=NONE&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.&lt;br /&gt;
* Which state is the job in?&lt;br /&gt;
&amp;lt;pre&amp;gt;$ scontrol show job 18089884 | grep -i State&lt;br /&gt;
   JobState=COMPLETED Reason=None Dependency=(null)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Cancel Slurm Jobs ==&lt;br /&gt;
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).   &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Flag !! Default !! Description !! Example&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -i, --interactive&lt;br /&gt;
| (n/a)&lt;br /&gt;
| Interactive mode.&lt;br /&gt;
| Cancel the job 987654 interactively. &amp;lt;br&amp;gt; &amp;lt;pre&amp;gt; scancel -i 987654 &amp;lt;/pre&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| -t, --state&lt;br /&gt;
| (n/a)&lt;br /&gt;
| Restrict the scancel operation to jobs in a certain state. &amp;lt;br&amp;gt; &amp;quot;job_state_name&amp;quot; may have a value of either &amp;quot;PENDING&amp;quot;, &amp;quot;RUNNING&amp;quot; or &amp;quot;SUSPENDED&amp;quot;.&lt;br /&gt;
| Cancel all jobs in state &amp;quot;PENDING&amp;quot;. &amp;lt;br&amp;gt; &amp;lt;pre&amp;gt; scancel -t &amp;quot;PENDING&amp;quot; &amp;lt;/pre&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Resource Managers =&lt;br /&gt;
=== Batch Job (Slurm) Variables ===&lt;br /&gt;
The following environment variables of Slurm are added to your environment once your job has started&lt;br /&gt;
&amp;lt;small&amp;gt;(only an excerpt of the most important ones)&amp;lt;/small&amp;gt;.&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Environment !! Brief explanation&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_JOB_CPUS_PER_NODE &lt;br /&gt;
| Number of processes per node dedicated to the job&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_JOB_NODELIST &lt;br /&gt;
| List of nodes dedicated to the job&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_JOB_NUM_NODES &lt;br /&gt;
| Number of nodes dedicated to the job&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_MEM_PER_NODE &lt;br /&gt;
| Memory per node dedicated to the job &lt;br /&gt;
|- &lt;br /&gt;
| SLURM_NPROCS&lt;br /&gt;
| Total number of processes dedicated to the job &lt;br /&gt;
|-&lt;br /&gt;
| SLURM_CLUSTER_NAME&lt;br /&gt;
| Name of the cluster executing the job&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_CPUS_PER_TASK &lt;br /&gt;
| Number of CPUs requested per task&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_ACCOUNT&lt;br /&gt;
| Account name &lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_ID&lt;br /&gt;
| Job ID&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_NAME&lt;br /&gt;
| Job Name&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_PARTITION&lt;br /&gt;
| Partition/queue running the job&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_UID&lt;br /&gt;
| User ID of the job&#039;s owner&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_SUBMIT_DIR&lt;br /&gt;
| Job submit folder.  The directory from which sbatch was invoked. &lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_USER&lt;br /&gt;
| User name of the job&#039;s owner&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_RESTART_COUNT&lt;br /&gt;
| Number of times job has restarted&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_PROCID&lt;br /&gt;
| Task ID (MPI rank)&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_NTASKS&lt;br /&gt;
| The total number of tasks available for the job&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_STEP_ID&lt;br /&gt;
| Job step ID&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_STEP_NUM_TASKS&lt;br /&gt;
| Task count (number of MPI ranks)&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_CONSTRAINT&lt;br /&gt;
| Job constraints&lt;br /&gt;
|}&lt;br /&gt;
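As an illustration, a minimal jobscript can log some of these variables at startup, which helps to verify that the resource request matches your expectation (note that e.g. SLURM_CPUS_PER_TASK is only set if the corresponding sbatch option was used):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 2&lt;br /&gt;
#SBATCH -t 00:05:00&lt;br /&gt;
&lt;br /&gt;
echo &amp;quot;Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) on cluster ${SLURM_CLUSTER_NAME}&amp;quot;&lt;br /&gt;
echo &amp;quot;Partition: ${SLURM_JOB_PARTITION}, nodes: ${SLURM_JOB_NUM_NODES} (${SLURM_JOB_NODELIST})&amp;quot;&lt;br /&gt;
echo &amp;quot;Tasks: ${SLURM_NTASKS}&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;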
See also:&lt;br /&gt;
* [https://slurm.schedmd.com/sbatch.html#lbAI Slurm input and output environment variables]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Job Exit Codes ===&lt;br /&gt;
A job&#039;s exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of &amp;quot;NonZeroExitCode&amp;quot;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The exit code is an 8-bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.&lt;br /&gt;
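For example, since only the low 8 bits are kept, an application calling exit(300) is reported with exit code 44 (300 modulo 256):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ bash -c &#039;exit 300&#039;; echo $?&lt;br /&gt;
44&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;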
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
==== Displaying Exit Codes and Signals ====&lt;br /&gt;
SLURM displays a job&#039;s exit code in the output of &#039;&#039;&#039;scontrol show job&#039;&#039;&#039; and in the sview utility.&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
When a signal was responsible for a job or step&#039;s termination, the signal number will be displayed after the exit code, separated by a colon (:).&lt;br /&gt;
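For example, a job that was terminated by SIGKILL (signal 9) could be reported as follows (illustrative output):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scontrol show job &amp;lt;job-id&amp;gt; | grep -o &amp;quot;ExitCode=[0-9:]*&amp;quot;&lt;br /&gt;
ExitCode=0:9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;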
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
==== Saving the Exit Code ====&lt;br /&gt;
Here is an example of how to save the exit code of your application in a typical jobscript:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
[...]&lt;br /&gt;
mpirun  -np &amp;lt;#cores&amp;gt;  &amp;lt;EXE_BIN_DIR&amp;gt;/&amp;lt;executable&amp;gt; ... (options)  2&amp;gt;&amp;amp;1&lt;br /&gt;
exit_code=$?   # capture the exit code immediately after mpirun&lt;br /&gt;
[ &amp;quot;$exit_code&amp;quot; -eq 0 ] &amp;amp;&amp;amp; echo &amp;quot;all clean...&amp;quot; || \&lt;br /&gt;
   echo &amp;quot;Executable &amp;lt;EXE_BIN_DIR&amp;gt;/&amp;lt;executable&amp;gt; finished with exit code ${exit_code}&amp;quot;&lt;br /&gt;
[...]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
* Do not prefix mpirun with &#039;&#039;&#039;time&#039;&#039;&#039;! The exit code will then be the one returned by the first program (time), not by your application.&lt;br /&gt;
* You do not need an &#039;&#039;&#039;exit $exit_code&#039;&#039;&#039; in the scripts.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
[[#top|Back to top]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14585</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14585"/>
		<updated>2025-04-03T11:42:12Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Improving Metadata Performance */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the &amp;lt;code&amp;gt;limits&amp;lt;/code&amp;gt; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days) your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit your I/O operations will abort. &lt;br /&gt;
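The output looks roughly as follows (illustrative values; the exact format depends on the Lustre version). The quota/limit pairs are the soft and hard limits for capacity and inodes:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
Disk quotas for usr ab1234 (uid 12345):&lt;br /&gt;
     Filesystem    used   quota   limit   grace   files   quota   limit   grace&lt;br /&gt;
     ...           120G    500G    550G       -  200000 5000000 5500000       -&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;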
&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
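For example, to create a workspace with a lifetime of 60 days and later extend it by another 60 days (a sketch using the ws_extend command of the workspace tools):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_allocate myws 60&lt;br /&gt;
$ ws_extend myws 60&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;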
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work on the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. You can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag for this if needed.&lt;br /&gt;
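Putting this together, a complete restore could look like this (all names are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_restore -l                                # list full names of expired workspaces&lt;br /&gt;
$ ws_allocate -F &amp;lt;filesystem&amp;gt; my_restored 30    # target workspace on the same filesystem&lt;br /&gt;
$ ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;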
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt; (see the example after the list). Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
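For example, to maintain links to all of your workspaces below the directory workspaces in your home directory:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_register $HOME/workspaces&lt;br /&gt;
$ ls $HOME/workspaces&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;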
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node, store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users therefore no longer need to adapt file striping parameters, neither for very large files nor to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
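If you want to inspect the striping parameters of an existing file or directory, you can use the lfs getstripe command (myws and myfile are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe $(ws_find myws)/myfile&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;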
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task (see the sketch after this list)&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one node, store them on $TMPDIR&lt;br /&gt;
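As a sketch of the second point, each task could write into its own subdirectory of a workspace, e.g. when tasks are started with srun (myws, run01 and myapp are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
outdir=$(ws_find myws)/run01/task_${SLURM_PROCID}&lt;br /&gt;
mkdir -p ${outdir}&lt;br /&gt;
myapp -outputdir ${outdir}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;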
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; to all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks; please open a ticket to request such a restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD disk but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The real installation of the package (e.g. make install) &lt;br /&gt;
should be made into the $HOME folder.&lt;br /&gt;
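A sketch of this workflow on a login node (package name and configure options are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
builddir=$(mktemp -d -p $TMPDIR)   # create your own unique subdirectory&lt;br /&gt;
cd ${builddir}&lt;br /&gt;
tar -xzf $HOME/mypackage.tar.gz&lt;br /&gt;
cd mypackage&lt;br /&gt;
./configure --prefix=$HOME/sw/mypackage &amp;amp;&amp;amp; make &amp;amp;&amp;amp; make install&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;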
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and datamover nodes.&lt;br /&gt;
Furthermore it can be used on the compute nodes during the job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
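For example, to save results into a storage project at the end of a job (the project directory below $LSDFPROJECTS is a placeholder):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cp -a $TMPDIR/results $LSDFPROJECTS/&amp;lt;your-project&amp;gt;/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;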
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
== BeeOND (BeeGFS On-Demand) ==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after the job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back (within the job) to a global filesystem, e.g. $HOME or a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
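A minimal jobscript sketch could look as follows; the exact constraint name and mount point are documented on the linked page (here the constraint BEEOND and a mount point below /mnt/odfs are assumed for illustration):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH --constraint=BEEOND&lt;br /&gt;
&lt;br /&gt;
ondemand=/mnt/odfs/${SLURM_JOB_ID}   # assumed mount point of the private BeeOND filesystem&lt;br /&gt;
cp -a $(ws_find myws)/input ${ondemand}/&lt;br /&gt;
myapp -input ${ondemand}/input -outputdir ${ondemand}/results&lt;br /&gt;
# copy results back before the job completes&lt;br /&gt;
rsync -av ${ondemand}/results $(ws_find myws)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;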
&lt;br /&gt;
== Backup and Archiving ==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, but ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14584</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14584"/>
		<updated>2025-04-03T11:39:21Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Improving Throughput Performance */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the &amp;lt;code&amp;gt;limits&amp;lt;/code&amp;gt; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days) your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work on the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. You can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag for this if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. &lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used from one node, store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users therefore no longer need to adapt file striping parameters, neither for very large files nor to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.&lt;br /&gt;
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process, store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
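If you only want to disable colorization for a single invocation, you can bypass the alias instead of redefining it:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ \ls          # the backslash bypasses the alias&lt;br /&gt;
$ command ls   # same effect with the command builtin&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;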
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; to all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the workspace directory name differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks; you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages, i.e. the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The final installation of the package (e.g. make install) &lt;br /&gt;
should then go into the $HOME folder.&lt;br /&gt;
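&lt;br /&gt;
A minimal sketch of creating such a unique subdirectory on a login node (the directory name is only an example):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mkdir -p $TMPDIR/$(whoami)_build&lt;br /&gt;
$ cd $TMPDIR/$(whoami)_build&lt;br /&gt;
# unpack, configure and compile here, then install into $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;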
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup of /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example of using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is also useful to have access to the LSDF Online Storage on the HPC clusters. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
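&lt;br /&gt;
As an illustration only, the sketch below copies job results into a storage project directory; the name myproject below $LSDFPROJECTS is a placeholder that depends on your storage project:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&lt;br /&gt;
# myproject is a placeholder for your storage project directory&lt;br /&gt;
rsync -av $TMPDIR/results $LSDFPROJECTS/myproject/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;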
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back (within the job) to a global filesystem, e.g. $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
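A minimal sketch of such a copy step inside a job script is shown below; $BEEOND_DIR is a placeholder for the actual BeeOND mount point, which is documented on the page linked below:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# copy input data to the on-demand file system ($BEEOND_DIR is a placeholder)&lt;br /&gt;
cp -r $(ws_find myws)/input $BEEOND_DIR/&lt;br /&gt;
myapp -input $BEEOND_DIR/input/myinput.csv -outputdir $BEEOND_DIR/results&lt;br /&gt;
# copy results back before the job ends; BeeOND is purged afterwards&lt;br /&gt;
rsync -av $BEEOND_DIR/results $(ws_find myws)/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;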
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14583</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14583"/>
		<updated>2025-04-03T11:26:19Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $HOME */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the &amp;lt;code&amp;gt;limits&amp;lt;/code&amp;gt; &lt;br /&gt;
columns) are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days) your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
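As a quick reference, the most common workspace commands are sketched below; the workspace name myws and the lifetime of 60 days are only examples:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_allocate myws 60     # create workspace myws with a lifetime of 60 days&lt;br /&gt;
$ ws_list                 # list your workspaces&lt;br /&gt;
$ cd $(ws_find myws)      # change into the workspace directory&lt;br /&gt;
$ ws_release myws         # release the workspace when it is no longer needed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;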
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work within the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Therefore, you can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size, and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit the complete filesystem bandwidth, use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process, store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts has been used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing, you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir, you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should note down for which directories you changed the&lt;br /&gt;
striping parameters so that you can repeat these changes if required.&lt;br /&gt;
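&lt;br /&gt;
A minimal sketch of the copy-and-rename procedure described above for re-striping an existing file (my_file and the directory name are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir    # new files inherit stripe count -1&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new&lt;br /&gt;
$ rm $HOME/my_output_dir/my_file&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;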
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application, the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process, store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace named myws with a lifetime of 60 days on bwUniCluster 2.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the workspace directory name differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks; you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages, i.e. the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The final installation of the package (e.g. make install) &lt;br /&gt;
should then go into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup of /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example of using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is also useful to have access to the LSDF Online Storage on the HPC clusters. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back (within the job) to a global filesystem, e.g. $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14582</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14582"/>
		<updated>2025-04-03T11:23:53Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Workspaces */ Anpassung an UC3&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the limits columns) &lt;br /&gt;
are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days) your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work within the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. You can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag for this if needed.&lt;br /&gt;
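For example, to allocate the target workspace on a specific file system and then restore into it (names are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_allocate -F &amp;lt;filesystem&amp;gt; my_restored 30&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;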
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
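&lt;br /&gt;
A typical invocation (the target directory is just an example):&lt;br /&gt;
 $ ws_register $HOME/workspaces&lt;br /&gt;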
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are split into stripes and distributed across different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size, and the number of storage subsystems used is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
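&lt;br /&gt;
To illustrate the first point, writing a large amount of data sequentially in large blocks can be done as follows (file name and size are arbitrary):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Write 10 GiB sequentially in 1 MiB blocks, aligned with the default stripe size&lt;br /&gt;
$ dd if=/dev/zero of=testfile bs=1M count=10240&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;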
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layouts is used to define the file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing, you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput for a single very large file which is created in the directory $HOME/my_output_dir, you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1, which means that all storage subsystems of the file system are used to store files created in that directory. If you change the stripe count of a directory, the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files,&lt;br /&gt;
and move the new files back to the old names. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
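A minimal sketch of the copy-and-rename procedure described above (directory, file name and new stripe count are only examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c 8 $HOME/my_output_dir   # set the new stripe count on the directory&lt;br /&gt;
$ cd $HOME/my_output_dir&lt;br /&gt;
$ cp my_file my_file.new                   # the copy inherits the new striping&lt;br /&gt;
$ rm my_file&lt;br /&gt;
$ mv my_file.new my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;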
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should keep a record of the directories for which you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as on local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request a restore. There are quota limits with a default of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. Such a subdirectory should be used for &lt;br /&gt;
building software packages, i.e. the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should go into the $HOME folder.&lt;br /&gt;
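&lt;br /&gt;
A minimal sketch of this workflow on a login node (the package name and paths are hypothetical):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a unique build directory on the local SSD of the login node&lt;br /&gt;
mkdir -p $TMPDIR/$(whoami)/build&lt;br /&gt;
cd $TMPDIR/$(whoami)/build&lt;br /&gt;
# Unpack, configure and compile the (hypothetical) package&lt;br /&gt;
tar -xzf $HOME/mypackage-1.0.tar.gz&lt;br /&gt;
cd mypackage-1.0&lt;br /&gt;
./configure --prefix=$HOME/software/mypackage-1.0&lt;br /&gt;
make&lt;br /&gt;
# Install into $HOME, not into $TMPDIR&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;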
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could be still using data below these directories. Hence the corresponding file systems could fill up and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is automatically done.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to also have access to the LSDF Online Storage on the HPC clusters. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
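&lt;br /&gt;
As a minimal sketch, a job script that stages results to a (hypothetical) storage project directory could look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 01:00:00&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&lt;br /&gt;
# Copy results into the hypothetical storage project directory myproject&lt;br /&gt;
rsync -av results/ $LSDFPROJECTS/myproject/results/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;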
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
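&lt;br /&gt;
As a rough sketch only, staging data in and out of the on-demand file system could look like this; the mount point below is an assumption, please check the linked page for the actual path:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Assumed mount point of the private BeeOND file system (check the linked page)&lt;br /&gt;
ODFS=/mnt/odfs/$SLURM_JOB_ID&lt;br /&gt;
# Stage input data into the private file system&lt;br /&gt;
cp -r $(ws_find data-ssd)/input $ODFS/&lt;br /&gt;
# ... run the application with input and output on $ODFS ...&lt;br /&gt;
# Save results to a workspace before the job completes&lt;br /&gt;
rsync -av $ODFS/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;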
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14580</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14580"/>
		<updated>2025-04-03T11:08:47Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $HOME */ Anpassung an UC3&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc3. A regular backup of these directories &lt;br /&gt;
to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like &lt;br /&gt;
source code, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 250 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the limits columns) &lt;br /&gt;
are 10 percent higher. If you are above the soft limit and below the hard limit &lt;br /&gt;
during the grace period (7 days), your I/O operations will show a warning message. If the grace period has &lt;br /&gt;
passed or if you are above the hard limit, your I/O operations will abort. &lt;br /&gt;
&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. your university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It can provide data transfer rates of up to 54 GB/s for reads and writes when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc2 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc2 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work within the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. You can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag for this if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are split into stripes and distributed across different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size, and the number of storage subsystems used is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the Lustre feature Progressive File Layouts is used to define the file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing, you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput for a single very large file which is created in the directory $HOME/my_output_dir, you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1, which means that all storage subsystems of the file system are used to store files created in that directory. If you change the stripe count of a directory, the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files,&lt;br /&gt;
and move the new files back to the old names. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should keep a record of the directories for which you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as on local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request a restore. There are quota limits with a default of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. Such a subdirectory should be used for &lt;br /&gt;
building software packages, i.e. the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should go into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could be still using data below these directories. Hence the corresponding file systems could fill up and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is automatically done.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to also have access to the LSDF Online Storage on the HPC clusters. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14469</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=14469"/>
		<updated>2025-03-27T09:27:31Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* File System Details */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 250 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at the end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data in $HOME,&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used by many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
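&lt;br /&gt;
As a compact illustration of these rules, the following sketch shows where each kind of data belongs (all paths and names are placeholders, not real directories):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Permanent data (sources, configs, important results): $HOME&lt;br /&gt;
cp final_results.tar $HOME/results/&lt;br /&gt;
# Temporary data needed on a single node only: $TMPDIR&lt;br /&gt;
cp input.dat $TMPDIR/&lt;br /&gt;
# Recomputable or intermediate data shared between jobs: a workspace&lt;br /&gt;
cp restart.chk $(ws_find my_workspace)/&lt;br /&gt;
# Data not needed for months or exceeding quotas: LSDF Online Storage or archive&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;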
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university), which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
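The command above is dense; the following commented sketch performs the same steps one at a time (the variable names are arbitrary):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Strip the last two path components from $HOME to get the organization directory&lt;br /&gt;
ORG_DIR=$(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;)&lt;br /&gt;
# Look up the Lustre project ID of that directory in the mapping file&lt;br /&gt;
PROJ_ID=$(grep $ORG_DIR /pfs/data5/project_ids.txt | cut -f 1 -d\ )&lt;br /&gt;
# Show the usage and limits of that project&lt;br /&gt;
lfs quota -ph $PROJ_ID $HOME&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;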
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It can provide data transfer rates of up to 54 GB/s for reading and writing when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc2 is 60 days, but it can be renewed 3 times at the end of that period, up to a total maximum of 240 days after workspace creation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
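&lt;br /&gt;
A minimal sketch of the typical lifecycle follows; the workspace name is an example, and &amp;lt;code&amp;gt;ws_release&amp;lt;/code&amp;gt; is not described on this page, so check the [[workspace]] page for the exact options:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace named myrun with a lifetime of 60 days&lt;br /&gt;
ws_allocate myrun 60&lt;br /&gt;
# Find its path and use it for job output&lt;br /&gt;
cd $(ws_find myrun)&lt;br /&gt;
# List all of your workspaces with their remaining lifetimes&lt;br /&gt;
ws_list&lt;br /&gt;
# Release the workspace once the data is no longer needed&lt;br /&gt;
ws_release myrun&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;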
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can also send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc2 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only restore within the same filesystem, so you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag for this if needed, as in the sketch below.&lt;br /&gt;
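&lt;br /&gt;
Putting these steps together, a sketch of restoring onto the same filesystem (the names and the full-name format are examples, and we assume here that the expired workspace was located on &#039;&#039;ffuc&#039;&#039;):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# List expired workspaces and note the full name, e.g. ab1234-myws-1234567890&lt;br /&gt;
ws_restore -l&lt;br /&gt;
# Allocate a target workspace on the same filesystem as the expired one&lt;br /&gt;
ws_allocate -F ffuc my_restored 30&lt;br /&gt;
# Restore the expired workspace (full name) into the target (short name)&lt;br /&gt;
ws_restore ab1234-myws-1234567890 my_restored&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;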
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following (a usage example follows the list):&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
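&lt;br /&gt;
For example (the directory name is arbitrary):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create or update symbolic links to all of your workspaces below $HOME/workspaces&lt;br /&gt;
ws_register $HOME/workspaces&lt;br /&gt;
# The links can then be used like normal paths&lt;br /&gt;
ls $HOME/workspaces/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;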
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered (an illustration follows the list):&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks, or use blocks aligned to stripe size boundaries (the default stripe size is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process, store them on $TMPDIR.&lt;br /&gt;
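&lt;br /&gt;
As a simple illustration of the large-block recommendation, write rates with different block sizes can be compared with standard tools; this is only a sketch, the target path is a placeholder, and dd results only roughly reflect application I/O:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
WS=$(ws_find my_workspace)&lt;br /&gt;
# Many small writes: 4 KiB blocks (1 GiB in total)&lt;br /&gt;
dd if=/dev/zero of=$WS/testfile bs=4k count=262144&lt;br /&gt;
# Few large writes: 1 MiB blocks matching the default stripe size (1 GiB in total)&lt;br /&gt;
dd if=/dev/zero of=$WS/testfile bs=1M count=1024&lt;br /&gt;
rm $WS/testfile&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;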
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted automatically as a file grows. In normal cases users therefore no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput for a single very large file created in the directory $HOME/my_output_dir, you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
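The re-striping procedure described above, as a sketch (file and directory names are examples):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Set the desired stripe count on the parent directory&lt;br /&gt;
lfs setstripe -c -1 $HOME/my_output_dir&lt;br /&gt;
cd $HOME/my_output_dir&lt;br /&gt;
# Copy the file so that the copy picks up the new striping, then replace the original&lt;br /&gt;
cp my_file my_file.new&lt;br /&gt;
rm my_file&lt;br /&gt;
mv my_file.new my_file&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;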
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should note down for which directories you made changes&lt;br /&gt;
to the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process, store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use this file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace named myws with a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace only has to be managed on one of the clusters, since the workspace directory name differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should go into the $HOME folder, as in the sketch below.&lt;br /&gt;
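&lt;br /&gt;
A sketch of this pattern (package name and paths are placeholders):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a private build directory on the fast local SSD of the login node&lt;br /&gt;
BUILD=$TMPDIR/$(whoami)-build&lt;br /&gt;
mkdir -p $BUILD&lt;br /&gt;
cd $BUILD&lt;br /&gt;
# Unpack, configure and compile on the SSD, but install into $HOME&lt;br /&gt;
tar -xzf $HOME/downloads/mypackage.tar.gz&lt;br /&gt;
cd mypackage&lt;br /&gt;
./configure --prefix=$HOME/software/mypackage&lt;br /&gt;
make -j 8&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;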
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
Below we provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to also have access to the LSDF Online Storage on the HPC clusters. Therefore, the LSDF Online Storage is mounted on the login and datamover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
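&lt;br /&gt;
A minimal sketch of using these variables inside a job; the storage project directory &#039;&#039;myproject&#039;&#039; is a placeholder for your own project:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&lt;br /&gt;
# Copy results into your LSDF storage project before the job completes&lt;br /&gt;
# (myproject is a placeholder for the name of your storage project)&lt;br /&gt;
rsync -av $TMPDIR/results/ $LSDFPROJECTS/myproject/results/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;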
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to a global filesystem within the job, e.g. $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out, as in the sketch below. &lt;br /&gt;
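&lt;br /&gt;
A minimal sketch of staging data in and out; the mount point $BEEOND_DIR is a placeholder, see the link below for the actual path and usage details:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# $BEEOND_DIR is a placeholder for the actual BeeOND mount point&lt;br /&gt;
# Stage input from a workspace into the on-demand file system at job start&lt;br /&gt;
rsync -av $(ws_find my_workspace)/input/ $BEEOND_DIR/input/&lt;br /&gt;
&lt;br /&gt;
# ... run the application against $BEEOND_DIR ...&lt;br /&gt;
&lt;br /&gt;
# Copy results back before the job ends; BeeOND is purged afterwards&lt;br /&gt;
rsync -av $BEEOND_DIR/results/ $(ws_find my_workspace)/results/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;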
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data of the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=13864</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=13864"/>
		<updated>2025-02-04T14:29:20Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* File System Details */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 256 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 20 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can chek your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc2 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calender entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc2 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work on the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Therefore, you can use &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within in the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also mentioned as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts has been used to define file striping parameters. This means that the stripe count is adapted if the file size is growing. In normal cases users no longer need to adapt file striping parameters in case they have very huge files or in order to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should annotate for which directories you made changes&lt;br /&gt;
to the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should omit metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the -classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system for special requirements available. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; to all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for few weeks and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 millions inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD disk but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The real installation of the package (e.g. make install) &lt;br /&gt;
should be made into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could be still using data below these directories. Hence the corresponding file systems could fill up and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is automatically done.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to access the LSDF Online Storage from the HPC clusters as well. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime by specifying the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
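&lt;br /&gt;
For example, inside a batch job these variables can be used to access a storage project and to save results. This is a minimal sketch; myproject is only a placeholder for your storage project:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# inside a batch job submitted with --constraint=LSDF&lt;br /&gt;
# myproject is a placeholder for your storage project&lt;br /&gt;
ls $LSDFPROJECTS/myproject&lt;br /&gt;
rsync -av $TMPDIR/results $LSDFPROJECTS/myproject/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;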
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after the job ends.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out, as sketched below. &lt;br /&gt;
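&lt;br /&gt;
The following minimal job sketch stages data through the on-demand file system. The constraint name BEEOND and the mount point /mnt/odfs/${SLURM_JOB_ID} are assumptions for illustration only; the page linked below documents the authoritative options:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH --constraint=BEEOND   # assumed flag, see linked page&lt;br /&gt;
&lt;br /&gt;
# assumed mount point of the private on-demand file system&lt;br /&gt;
ODFS=/mnt/odfs/${SLURM_JOB_ID}&lt;br /&gt;
&lt;br /&gt;
# stage data in, compute, and save results before the job ends&lt;br /&gt;
cp -r $(ws_find data-ssd)/dataset $ODFS/&lt;br /&gt;
myapp -input $ODFS/dataset -outputdir $ODFS/results&lt;br /&gt;
rsync -av $ODFS/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;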
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=13863</id>
		<title>BwUniCluster3.0/Hardware and Architecture/Filesystem Details</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details&amp;diff=13863"/>
		<updated>2025-02-04T14:28:43Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* File System Details updated perf values of pfs7 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= File System Details =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. Another workspace file system, based on flash storage, is available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;width:9%&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.6 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 500 GiB per user, for &amp;lt;br&amp;gt; MA users 256 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px; padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 5 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;column&amp;quot; style=&amp;quot;height=20px;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 63 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 40 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally delete data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It can provide high data transfer rates of up to 54 GB/s for both read and write when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc2 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc2 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work on the same filesystem. So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Therefore, you can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed, as in the example below.&lt;br /&gt;
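&lt;br /&gt;
For example, to restore an expired workspace from the flash file system ffuc into a fresh workspace (names are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# allocate the target workspace on the same file system as the expired one&lt;br /&gt;
ws_allocate -F ffuc my_restored 30&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;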
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;, as shown in the example below. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
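&lt;br /&gt;
For example, to maintain links to all your workspaces below your home directory (the directory name is just an example):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register $HOME/workspaces&lt;br /&gt;
ls -l $HOME/workspaces&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;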
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== Improving Throughput Performance ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size and the number of used storage subsystems is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users therefore no longer need to adapt file striping parameters, even for very large files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
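&lt;br /&gt;
For example, the copy-and-rename procedure described above can be sketched as follows (my_output_dir and my_file as above):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# set the new stripe count on the parent directory&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
# the copy inherits the new striping; then replace the old file&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new&lt;br /&gt;
$ rm $HOME/my_output_dir/my_file&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;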
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should annotate for which directories you made changes&lt;br /&gt;
to the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== Improving Metadata Performance ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; to all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace only has to be managed on one of the clusters, since the name of the workspace file system differs. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes are different.&lt;br /&gt;
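&lt;br /&gt;
You can observe this inside a multi-node batch job, for example by listing $TMPDIR once per node (a minimal sketch):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# the same path shows different content on each node of the job&lt;br /&gt;
srun --ntasks-per-node=1 -l ls $TMPDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;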
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages: the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR, while the final installation of the package (e.g. make install) &lt;br /&gt;
should go into the $HOME folder.&lt;br /&gt;
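&lt;br /&gt;
A minimal sketch of this workflow on a login node (the package name mypkg and all paths are only examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# create your own unique build directory on the local SSD&lt;br /&gt;
$ mkdir -p $TMPDIR/$(whoami)/build&lt;br /&gt;
$ cd $TMPDIR/$(whoami)/build&lt;br /&gt;
# unpack, configure and compile on the fast local SSD&lt;br /&gt;
$ tar -xzf $HOME/mypkg.tar.gz&lt;br /&gt;
$ cd mypkg&lt;br /&gt;
$ ./configure --prefix=$HOME/mypkg&lt;br /&gt;
$ make&lt;br /&gt;
# install the final package below $HOME, not below $TMPDIR&lt;br /&gt;
$ make install&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;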
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, and this can cause issues for you and for other users. On the other hand, $TMPDIR is created when the job starts and removed when the job completes, i.e. a cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We provide an example of using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to access the LSDF Online Storage from the HPC clusters as well. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime by specifying the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
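&lt;br /&gt;
For example, inside a batch job these variables can be used to access a storage project and to save results. This is a minimal sketch; myproject is only a placeholder for your storage project:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# inside a batch job submitted with --constraint=LSDF&lt;br /&gt;
# myproject is a placeholder for your storage project&lt;br /&gt;
ls $LSDFPROJECTS/myproject&lt;br /&gt;
rsync -av $TMPDIR/results $LSDFPROJECTS/myproject/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;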
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after the job ends.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out, as sketched below. &lt;br /&gt;
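&lt;br /&gt;
The following minimal job sketch stages data through the on-demand file system. The constraint name BEEOND and the mount point /mnt/odfs/${SLURM_JOB_ID} are assumptions for illustration only; the page linked below documents the authoritative options:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH --constraint=BEEOND   # assumed flag, see linked page&lt;br /&gt;
&lt;br /&gt;
# assumed mount point of the private on-demand file system&lt;br /&gt;
ODFS=/mnt/odfs/${SLURM_JOB_ID}&lt;br /&gt;
&lt;br /&gt;
# stage data in, compute, and save results before the job ends&lt;br /&gt;
cp -r $(ws_find data-ssd)/dataset $ODFS/&lt;br /&gt;
myapp -input $ODFS/dataset -outputdir $ODFS/results&lt;br /&gt;
rsync -av $ODFS/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;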
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13285</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13285"/>
		<updated>2024-11-26T12:08:20Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMPDIR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition the file system&lt;br /&gt;
Lustre, that is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, e.g. SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Three nodes are dedicated to this service but they are all accessible via&lt;br /&gt;
one address, and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes are delivering additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 3&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. Another workspace file system, based on flash storage, is available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user, for &amp;lt;br&amp;gt; MA users 256 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally delete data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It can provide high data transfer rates of up to 54 GB/s for both read and write when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc2 is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
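&lt;br /&gt;
For example, for a workspace named data-ssd (the email address is a placeholder):&lt;br /&gt;
 $ ws_send_ical data-ssd my.name@example.org&lt;br /&gt;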
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc2 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; only works within the same filesystem. You therefore have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
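&lt;br /&gt;
Putting both notes together, a typical restore session could look as follows (a sketch with example names, here assuming the expired workspace resided on the &#039;&#039;ffuc&#039;&#039; filesystem):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# List expired workspaces and note the full name of the one to restore&lt;br /&gt;
ws_restore -l&lt;br /&gt;
# Allocate a target workspace on the same filesystem as the expired one&lt;br /&gt;
ws_allocate -F ffuc my_restored 30&lt;br /&gt;
# Restore the expired workspace into the target workspace&lt;br /&gt;
ws_restore ab1234-myws-1234567890 my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;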
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following (see the example after this list):&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
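&lt;br /&gt;
For example, to maintain such links in a folder below your home directory (the folder name is just an example):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register $HOME/workspaces&lt;br /&gt;
ls $HOME/workspaces&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;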
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks, or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process, store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users therefore no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should keep a note of the directories where you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
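The procedure for existing files described above could look like this for a single file (a sketch; my_output_dir and my_file are example names):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Set the new stripe count on the parent directory&lt;br /&gt;
lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
# Copy the file so the copy inherits the new striping, then replace the original&lt;br /&gt;
cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new&lt;br /&gt;
rm $HOME/my_output_dir/my_file&lt;br /&gt;
mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
# Verify the new stripe settings&lt;br /&gt;
lfs getstripe $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;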
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you manage a particular workspace on only one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should go into the $HOME folder.&lt;br /&gt;
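A minimal sketch of this workflow on a login node (package name and paths are just examples):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create your own unique subdirectory on the local SSD of the login node&lt;br /&gt;
mkdir -p $TMPDIR/$(whoami)/build&lt;br /&gt;
cd $TMPDIR/$(whoami)/build&lt;br /&gt;
# Unpack, configure and compile in $TMPDIR (hypothetical package archive)&lt;br /&gt;
tar -xzf $HOME/mypackage-1.0.tar.gz&lt;br /&gt;
cd mypackage-1.0&lt;br /&gt;
# Install the result into $HOME, not into $TMPDIR&lt;br /&gt;
./configure --prefix=$HOME/sw/mypackage-1.0&lt;br /&gt;
make&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;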
&lt;br /&gt;
{|style=&amp;quot;background:#ffdeee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#f2cece; text-align:left&amp;quot;|&lt;br /&gt;
Note that you should &#039;&#039;&#039;not&#039;&#039;&#039; use /tmp or /scratch! Please use &#039;&#039;&#039;$TMPDIR&#039;&#039;&#039; instead.&amp;lt;br/&amp;gt;&lt;br /&gt;
The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, and this can cause issues for you and for other users. $TMPDIR, on the other hand, is created when the job starts and removed when the job completes, i.e. a cleanup is done automatically.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
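&lt;br /&gt;
Assuming the script above is saved as tmpdir_job.sh (the name is just an example), submit it with:&lt;br /&gt;
 $ sbatch tmpdir_job.sh&lt;br /&gt;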
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to also have access to the LSDF Online Storage on the HPC clusters. Therefore, the LSDF Online Storage is mounted on the login and datamover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
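For example, inside a job submitted with --constraint=LSDF, results could be copied to a storage project like this (a sketch; the project name myproject is a placeholder, and we assume $LSDFPROJECTS points to the directory containing your storage projects):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rsync -av $TMPDIR/results $LSDFPROJECTS/myproject/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;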
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
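As a sketch, a job could stage data in and out like this (BEEOND_DIR is a placeholder for the mount point of the on-demand file system, not a predefined variable; the actual path is given on the page linked below):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# BEEOND_DIR is a placeholder for the BeeOND mount point of the job&lt;br /&gt;
cp -r $HOME/input $BEEOND_DIR/&lt;br /&gt;
myapp -inputdir $BEEOND_DIR/input -outputdir $BEEOND_DIR/out&lt;br /&gt;
# Save results to a global filesystem before the job ends&lt;br /&gt;
rsync -av $BEEOND_DIR/out $(ws_find data-ssd)/beeond-results/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;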
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data of the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13104</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13104"/>
		<updated>2024-10-29T09:44:54Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMPDIR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition the file system&lt;br /&gt;
Lustre, that is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages like e.g. SLURM have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Three nodes are dedicated to this service but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes deliver additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 3&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user, for &amp;lt;br&amp;gt; Mannheim users 256 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; for Mannheim users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold files that are&lt;br /&gt;
needed permanently, such as source code, configuration files, executable programs, etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is specifically designed for parallel access and high throughput to large&lt;br /&gt;
files. It can provide data transfer rates of up to 54 GB/s for read and for write when data is accessed in parallel. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc2 is 60 days, but it can be renewed 3 times at the end of that period, up to a total maximum of 240 days after workspace creation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc2 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; only works within the same filesystem. You therefore have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks, or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process, store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users therefore no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should keep a note of the directories where you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you manage a particular workspace on only one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should go into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Use $TMPDIR instead of /tmp or /scratch ===&lt;br /&gt;
&lt;br /&gt;
Note that you should &#039;&#039;not&#039;&#039; use /tmp or /scratch; use $TMPDIR instead. The reason is that an automatic cleanup on /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, and this can cause issues for you and for other users. $TMPDIR, on the other hand, is created when the job starts and removed when the job completes, i.e. a cleanup is done automatically.&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to also have access to the LSDF Online Storage on the HPC clusters. Therefore, the LSDF Online Storage is mounted on the login and datamover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For using the LSDF Online Storage, the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
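&lt;br /&gt;
A minimal sketch of a job script using these variables (myproject and myapp are placeholders; the project directory layout below $LSDFPROJECTS is an assumption):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 02:00:00&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&lt;br /&gt;
# Stage input from the LSDF Online Storage to the fast local SSD&lt;br /&gt;
cp $LSDFPROJECTS/myproject/input.dat $TMPDIR/&lt;br /&gt;
myapp -input $TMPDIR/input.dat -outputdir $TMPDIR/results&lt;br /&gt;
# Save results back to the LSDF Online Storage&lt;br /&gt;
cp -r $TMPDIR/results $LSDFPROJECTS/myproject/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;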
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after the job completes.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to a global filesystem within the job, e.g. to $HOME or to a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
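As a rough sketch (not authoritative), a BeeOND job could look as follows; the constraint flag BEEOND and the mount point /mnt/odfs/$SLURM_JOB_ID are assumptions, see the linked page for the exact details:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH -t 12:00:00&lt;br /&gt;
#SBATCH --constraint=BEEOND   # assumed flag, see linked page&lt;br /&gt;
&lt;br /&gt;
# Assumed mount point of the private on-demand file system&lt;br /&gt;
ONDEMAND=/mnt/odfs/$SLURM_JOB_ID&lt;br /&gt;
# Stage data in, run the application, and copy results back&lt;br /&gt;
cp -r $(ws_find data-ssd)/dataset $ONDEMAND/&lt;br /&gt;
myapp -input $ONDEMAND/dataset/myinput.csv -outputdir $ONDEMAND/results&lt;br /&gt;
# BeeOND is purged after the job, so save results within the job&lt;br /&gt;
rsync -av $ONDEMAND/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;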
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, but ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13030</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13030"/>
		<updated>2024-10-17T17:50:30Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Linking workspaces in Home */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, e.g. SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view, the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Three nodes are dedicated to this service but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes deliver additional services such as resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 3&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with a limited lifetime. Another workspace file system, based on flash storage, is available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system IBM Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; see Table 1 for details&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user &amp;lt;br&amp;gt; (256 GiB for MA users), &amp;lt;br&amp;gt; also limited per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; (2.5 million for MA users)&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally delete data in $HOME,&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It can provide data transfer rates of up to 54 GB/s for both read and write when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc2 is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace creation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc2 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; only works within the same filesystem. You therefore have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace; use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
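&lt;br /&gt;
A minimal sketch of the complete restore procedure (names in angle brackets are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# List expired workspaces (full names incl. username prefix and timestamp suffix)&lt;br /&gt;
ws_restore -l&lt;br /&gt;
# Allocate a target workspace on the same filesystem as the expired one&lt;br /&gt;
ws_allocate -F &amp;lt;filesystem&amp;gt; my_restored 30&lt;br /&gt;
# Restore the expired workspace into the target workspace&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;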
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following (see the example after the list):&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
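&lt;br /&gt;
For example, to collect links to all of your workspaces below $HOME/workspaces (the directory name is an example):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_register $HOME/workspaces&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;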
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size, and the number of storage subsystems used is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application,&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit the complete filesystem bandwidth, use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks, or use blocks with boundaries at the stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process, store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases, users no longer need to adapt file striping parameters, even for very large files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c -1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should note down for which directories you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
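&lt;br /&gt;
A minimal sketch of the re-striping procedure for existing files described above (my_output_dir and my_file are examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Change the stripe count of the parent directory&lt;br /&gt;
$ lfs setstripe -c -1 $HOME/my_output_dir&lt;br /&gt;
# Copy the file so that the copy inherits the new striping,&lt;br /&gt;
# then replace the old file with the copy&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new&lt;br /&gt;
$ rm $HOME/my_output_dir/my_file&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;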
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have a few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application, the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process, store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence, performance for read and write access with small blocks and with small files is better than on the other parallel file systems, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user, you can use the file system in the same way as a normal workspace. You just have to pass the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; to all commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace named myws with a lifetime of 60 days on bwUniCluster 2.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD disk but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should then be done into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files that is frequently used by batch jobs, you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted to $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single large file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job, extract the archive to $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR, &lt;br /&gt;
and finally save the results to a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to also have access to the LSDF Online Storage on the HPC clusters. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For using the LSDF Online Storage, the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after the job completes.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to a global filesystem within the job, e.g. to $HOME or to a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, but ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13029</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13029"/>
		<updated>2024-10-17T17:49:15Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Restoring expired Workspaces */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, e.g. SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view, the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Three nodes are dedicated to this service but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes deliver additional services such as resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 3&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with a limited lifetime. Another workspace file system, based on flash storage, is available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system IBM Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; see Table 1 for details&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user &amp;lt;br&amp;gt; (256 GiB for MA users), &amp;lt;br&amp;gt; also limited per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; (2.5 million for MA users)&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME,&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. your university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre, which is specially designed for parallel access and high throughput to large&lt;br /&gt;
files. It can provide data transfer rates of up to 54 GB/s for both read and write when data is accessed in parallel. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc2 is 60 days, but it can be renewed three times at the end of that period, up to a total maximum of 240 days after the workspace was created.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
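&lt;br /&gt;
As a quick orientation, a typical lifecycle might look like the following minimal sketch (the workspace name my_data is just an example; ws_list, ws_find and ws_extend are part of the same workspace tool suite, see the [[workspace]] page for the authoritative syntax):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Create a workspace named my_data with a lifetime of 60 days&lt;br /&gt;
$ ws_allocate my_data 60&lt;br /&gt;
# Show all of your workspaces and their remaining lifetimes&lt;br /&gt;
$ ws_list&lt;br /&gt;
# Print the path of the workspace, e.g. for use in job scripts&lt;br /&gt;
$ ws_find my_data&lt;br /&gt;
# Renew the lifetime for another 60 days (possible 3 times)&lt;br /&gt;
$ ws_extend my_data 60&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;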
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can also send yourself a calendar entry that reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc2 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only restore within the same filesystem, so you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. You can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag for this if needed.&lt;br /&gt;
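&lt;br /&gt;
Putting these notes together, a complete restore might look like the following minimal sketch (the placeholders in angle brackets are kept as above; the 30-day lifetime is just an example):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# List expired workspaces; note the full name including prefix and timestamp&lt;br /&gt;
$ ws_restore -l&lt;br /&gt;
# Allocate a target workspace on the same filesystem as the expired one&lt;br /&gt;
$ ws_allocate -F &amp;lt;filesystem&amp;gt; my_restored 30&lt;br /&gt;
# Restore the expired data into the target workspace&lt;br /&gt;
$ ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;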
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g., below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
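&lt;br /&gt;
For example, to keep links to all of your workspaces below $HOME/workspaces (the directory name is just an example):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_register $HOME/workspaces&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;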
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size and the number of storage subsystems used is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users therefore no longer need to adapt file striping parameters, even for very large files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput for a single very large file which is created in the directory $HOME/my_output_dir, you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
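&lt;br /&gt;
The copy-based procedure for re-striping existing files described above might look like the following minimal sketch (the directory and file names are just examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Set the new stripe count on the parent directory (-1 = use all storage subsystems)&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
# Copy the file so that the copy picks up the new striping, then swap the names&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new&lt;br /&gt;
$ rm $HOME/my_output_dir/my_file&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
# Verify the new stripe settings&lt;br /&gt;
$ lfs getstripe $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;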
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should keep a record of the directories for which you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; for all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace only needs to be managed on one of the clusters, since the name of the workspace directory differs between them. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks; you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD disk but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should then be made into the $HOME folder.&lt;br /&gt;
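&lt;br /&gt;
A minimal sketch of this recommendation (the directory and package names are just examples, and not every package uses configure/make):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a unique build directory on the local SSD of the login node&lt;br /&gt;
mkdir -p $TMPDIR/build.$USER.$$&lt;br /&gt;
cd $TMPDIR/build.$USER.$$&lt;br /&gt;
# Unpack, configure and compile in the fast local directory ...&lt;br /&gt;
tar -xzf $HOME/downloads/mypackage.tar.gz&lt;br /&gt;
cd mypackage&lt;br /&gt;
./configure --prefix=$HOME/software/mypackage&lt;br /&gt;
make&lt;br /&gt;
# ... but install the result into $HOME&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;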
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
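&lt;br /&gt;
For example, inside a job that requested the LSDF constraint, input data could be staged from a storage project to the node-local SSD (the project name my_project is a placeholder):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
# Stage input data from the LSDF Online Storage to the node-local SSD&lt;br /&gt;
rsync -av $LSDFPROJECTS/my_project/input/ $TMPDIR/input/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;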
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back (within the job) to a global filesystem, e.g. $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
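&lt;br /&gt;
As a rough sketch only: assuming BeeOND is requested via a Slurm constraint and mounted at a job-specific path (both the constraint flag and the mount point below are assumptions, please check the page linked above for the actual values), a job could stage data in and out like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --constraint=BEEOND&lt;br /&gt;
# Assumed job-specific BeeOND mount point&lt;br /&gt;
BEEOND_DIR=/mnt/odfs/$SLURM_JOB_ID&lt;br /&gt;
# Copy input data in, run the application, copy results back out&lt;br /&gt;
cp -r $(ws_find data-ssd)/dataset $BEEOND_DIR/&lt;br /&gt;
myapp -input $BEEOND_DIR/dataset/myinput.csv -outputdir $BEEOND_DIR/results&lt;br /&gt;
rsync -av $BEEOND_DIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;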
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, but ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13028</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13028"/>
		<updated>2024-10-17T17:41:51Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Linking workspaces in Home */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, such as SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Three nodes are dedicated to this service, but they are all accessible via&lt;br /&gt;
one address; a DNS round-robin alias distributes the login sessions across the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Node&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes are delivering additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 3&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user, for &amp;lt;br&amp;gt; MA users 256 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME,&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. your university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre, which is specially designed for parallel access and high throughput to large&lt;br /&gt;
files. It can provide data transfer rates of up to 54 GB/s for both read and write when data is accessed in parallel. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc2 is 60 days, but it can be renewed three times at the end of that period, up to a total maximum of 240 days after the workspace was created.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can also send yourself a calendar entry that reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc2 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only restore within the same filesystem, so you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. You can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag for this if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g., below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released or expired workspaces&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size and the number of storage subsystems used is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users therefore no longer need to adapt file striping parameters, even for very large files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput for a single very large file which is created in the directory $HOME/my_output_dir, you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should keep a record of the directories for which you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; for all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace only needs to be managed on one of the clusters, since the name of the workspace directory differs between them. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks; you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD disk but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The final installation of the package (e.g. make install)&lt;br /&gt;
should then go into the $HOME folder.&lt;br /&gt;
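&lt;br /&gt;
A minimal sketch of such a build on a login node follows; the package name, archive path and install prefix are placeholders:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create your own unique build directory on the local SSD of the login node&lt;br /&gt;
BUILDDIR=$TMPDIR/build-$(whoami)-$$&lt;br /&gt;
mkdir -p $BUILDDIR&lt;br /&gt;
cd $BUILDDIR&lt;br /&gt;
# Unpack, configure, compile and link on the fast local SSD (placeholder package)&lt;br /&gt;
tar -xzf $HOME/downloads/mypackage.tar.gz&lt;br /&gt;
cd mypackage&lt;br /&gt;
./configure --prefix=$HOME/software/mypackage&lt;br /&gt;
make&lt;br /&gt;
# The final installation goes into the $HOME folder&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;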
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and datamover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime by specifying the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
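&lt;br /&gt;
As a sketch, these variables can be used inside a batch job as follows; the project name myproject and the file name are placeholders:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Inside a batch job submitted with --constraint=LSDF&lt;br /&gt;
# List the contents of your storage project (placeholder name)&lt;br /&gt;
ls $LSDFPROJECTS/myproject&lt;br /&gt;
# Copy input data from the LSDF Online Storage to the local SSD&lt;br /&gt;
cp $LSDFPROJECTS/myproject/input.dat $TMPDIR/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;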
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged when the job ends.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private file system will be deleted when your job ends. Make sure you have copied your data back to a global file system within the job, e.g., to $HOME or to a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
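&lt;br /&gt;
A sketch of staging data in and out follows. Note that the mount point below is an assumption for illustration only; see the linked page below for the actual location of the on-demand file system:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Assumed job-specific mount point of the on-demand file system (verify on the linked page)&lt;br /&gt;
ONDEMAND=/mnt/odfs/$SLURM_JOB_ID&lt;br /&gt;
# Stage input data from a workspace into the on-demand file system&lt;br /&gt;
rsync -a $(ws_find data-ssd)/dataset/ $ONDEMAND/dataset/&lt;br /&gt;
# ... run the application on $ONDEMAND ...&lt;br /&gt;
# Copy results back before the job ends; the file system is purged afterwards&lt;br /&gt;
rsync -a $ONDEMAND/results/ $(ws_find data-ssd)/results/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;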
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not backed up.&lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13027</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13027"/>
		<updated>2024-10-17T17:24:23Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Reminder for workspace deletion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, that is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, such as SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Three nodes are dedicated to this service but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes are delivering additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 3&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.5&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with a temporary lifetime. Another workspace file system, based on flash storage, is available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; for details see Table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user &amp;lt;br&amp;gt; (MA users: 256 GiB), &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; (MA users: 2.5 million)&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command (it derives your organization&#039;s directory from the path of $HOME, looks up the corresponding project ID in /pfs/data5/project_ids.txt and queries the project quota):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It provides data transfer rates of up to 54 GB/s for both read and write when data access is parallel.&lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc2 is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace creation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can also send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc2 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; only works within the same file system! You have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same file system as the expired workspace. Use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
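&lt;br /&gt;
Putting these steps together, a restore session might look like this (a sketch; the target workspace name and lifetime are placeholders, and the full name of the expired workspace is taken from the output of the first command):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_restore -l&lt;br /&gt;
$ ws_allocate my_restored 30&lt;br /&gt;
$ ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;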
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g., below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following (an example follows the list):&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released workspaces&lt;br /&gt;
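&lt;br /&gt;
For example, to maintain links to all of your workspaces below a directory in your home directory (the directory name is a placeholder):&lt;br /&gt;
 ws_register $HOME/my_workspaces&lt;br /&gt;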
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size and the number of used storage subsystems is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once (see the sketch after this list),&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
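&lt;br /&gt;
As a toy illustration of the large-block recommendation (a sketch only; the file name and sizes are placeholders), a single stream writing 1 GiB in large blocks usually performs far better on Lustre than the same amount written in many small blocks:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# 1 GiB written in 1 MiB blocks (matches the default stripe size)&lt;br /&gt;
dd if=/dev/zero of=testfile bs=1M count=1024&lt;br /&gt;
# The same amount written in 4 KiB blocks is typically much slower&lt;br /&gt;
dd if=/dev/zero of=testfile bs=4k count=262144&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;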
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, not even for very large files or in order to reach better performance.&lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput for a single very large file which is created in the directory $HOME/my_output_dir, you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c -1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
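As a sketch, the copy-and-rename procedure described above could look like this (directory and file names are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c -1 my_output_dir&lt;br /&gt;
$ cp my_output_dir/my_file my_output_dir/my_file.new&lt;br /&gt;
$ rm my_output_dir/my_file&lt;br /&gt;
$ mv my_output_dir/my_file.new my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;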
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should make a note of the directories for which you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have a few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to pass the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; to all commands that manage workspaces. On bwUniCluster 2.0 the file system is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is called &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that each workspace only has to be managed on one of the clusters, since the workspace directory names differ between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks; please open a ticket to request such a restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD disk but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The final installation of the package (e.g. make install)&lt;br /&gt;
should then go into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and datamover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime by specifying the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged when the job ends.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private file system will be deleted when your job ends. Make sure you have copied your data back to a global file system within the job, e.g., to $HOME or to a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not backed up.&lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13021</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13021"/>
		<updated>2024-10-17T13:05:26Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Workspaces */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, that is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, such as SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Three nodes are dedicated to this service but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes are delivering additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 3&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.5&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with a temporary lifetime. Another workspace file system, based on flash storage, is available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; for details see Table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user &amp;lt;br&amp;gt; (MA users: 256 GiB), &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; (MA users: 2.5 million)&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
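# look up the Lustre project ID of your organization in /pfs/data5/project_ids.txt, then query its quota:&lt;br /&gt;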
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc2 is 60 days, but it can be renewed 3 times at the end of that period, up to a total maximum of 240 days after workspace creation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
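&lt;br /&gt;
For example, a typical workspace lifecycle might look as follows (a minimal sketch, assuming the usual workspace tools ws_find, ws_extend and ws_release as described on the [[workspace]] page; myws is a hypothetical workspace name):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_allocate myws 60   # create workspace myws with a lifetime of 60 days&lt;br /&gt;
$ ws_find myws          # print the full path of the workspace&lt;br /&gt;
$ ws_extend myws 60     # renew the lifetime (possible 3 times)&lt;br /&gt;
$ ws_release myws       # release the workspace when it is no longer needed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;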
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On uc2 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work on the same filesystem! So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Therefore, you can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
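&lt;br /&gt;
Putting these steps together, a restore might look like this (a minimal sketch; all names are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_restore -l                                  # list expired workspaces with their full names&lt;br /&gt;
$ ws_allocate -F &amp;lt;filesystem&amp;gt; my_restored 30   # target workspace on the same filesystem&lt;br /&gt;
$ ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;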
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g., below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt; (see the example below). Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released workspaces&lt;br /&gt;
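&lt;br /&gt;
For example, the following manages links below a hypothetical directory $HOME/my_workspaces (a minimal sketch):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_register $HOME/my_workspaces   # create/update the links&lt;br /&gt;
$ ls -l $HOME/my_workspaces         # inspect the managed links&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;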
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also called chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application,&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks, or use blocks with boundaries at the stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should keep a record of the directories where you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
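&lt;br /&gt;
As a concrete sketch of the procedure above, re-striping an existing large file my_file in $HOME/my_output_dir could look like this (the copy step is what actually applies the new striping):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c -1 $HOME/my_output_dir                          # new files use all storage subsystems&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new   # the copy picks up the new striping&lt;br /&gt;
$ rm $HOME/my_output_dir/my_file&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
$ lfs getstripe $HOME/my_output_dir/my_file                        # verify the new stripe count&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;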
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as on local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application, the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The final installation of the package (e.g. make install) &lt;br /&gt;
should go into the $HOME folder.&lt;br /&gt;
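&lt;br /&gt;
A minimal sketch of such a build on a login node (mypackage and the install prefix are hypothetical names):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mkdir -p $TMPDIR/$(whoami)/build   # create your own unique subdirectory&lt;br /&gt;
$ cd $TMPDIR/$(whoami)/build&lt;br /&gt;
$ tar -xzf $HOME/mypackage.tar.gz&lt;br /&gt;
$ cd mypackage&lt;br /&gt;
$ ./configure --prefix=$HOME/software/mypackage&lt;br /&gt;
$ make&lt;br /&gt;
$ make install                       # installs into $HOME, not into $TMPDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;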
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and datamover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
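&lt;br /&gt;
Inside a job that was submitted with the LSDF constraint, results can be saved to a storage project before the job ends (a minimal sketch; myproject is a hypothetical project directory below $LSDFPROJECTS):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# save results from the local SSD to the LSDF Online Storage&lt;br /&gt;
rsync -av $TMPDIR/results $LSDFPROJECTS/myproject/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;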
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after the job ends.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to a global filesystem within the job, e.g. to $HOME or to a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
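&lt;br /&gt;
A minimal job script sketch (the constraint flag and the mount point are assumptions; check the link below for the exact values):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH --constraint=BEEOND          # assumption: flag to request the on-demand file system&lt;br /&gt;
&lt;br /&gt;
ONDEMAND=/mnt/odfs/$SLURM_JOB_ID     # assumption: job-specific BeeOND mount point&lt;br /&gt;
# stage input data in, compute, and copy results back to a workspace&lt;br /&gt;
cp -r $(ws_find data-ssd)/input $ONDEMAND/&lt;br /&gt;
myapp -inputdir $ONDEMAND/input -outputdir $ONDEMAND/results&lt;br /&gt;
rsync -av $ONDEMAND/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;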
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13020</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13020"/>
		<updated>2024-10-17T13:02:25Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Workspaces */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages like SLURM have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others, which are of greater importance to system&lt;br /&gt;
administrators, will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Three nodes are dedicated to this service, but they are all accessible via&lt;br /&gt;
one address; a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Node&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes are delivering additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 3&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; for details see Table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user, for &amp;lt;br&amp;gt; MA users 256 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on bwUniCluster 2.0 is 60 days, but it can be renewed 3 times at the end of that period, up to a total maximum of 240 days after workspace creation.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding, extending and sharing workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On bwUniCluster 2.0 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work on the same filesystem! So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Therefore, you can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
&lt;br /&gt;
=== Linking workspaces in Home ===&lt;br /&gt;
&lt;br /&gt;
It might be valuable to have links to personal workspaces within a certain directory, e.g., below the user home directory. The command &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_register &amp;lt;DIR&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
will create and manage links to all personal workspaces within the directory &amp;lt;DIR&amp;gt;. Calling this command will do the following:&lt;br /&gt;
&lt;br /&gt;
* The directory &amp;lt;DIR&amp;gt; will be created if necessary&lt;br /&gt;
* Links to all personal workspaces will be managed:&lt;br /&gt;
** Creates links to all available workspaces if not already present&lt;br /&gt;
** Removes links to released workspaces&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also called chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application,&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks, or use blocks with boundaries at the stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should keep a record of the directories where you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as on local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application, the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace only needs to be managed on one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
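&lt;br /&gt;
To verify the capacity actually available on the node you were allocated, a simple check with the standard df tool can be run inside the job:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ df -h $TMPDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;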
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The real installation of the package (e.g. make install) &lt;br /&gt;
should be made into the $HOME folder.&lt;br /&gt;
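&lt;br /&gt;
As a hedged sketch of this workflow on a login node (the package name mypackage and the configure-based build are placeholders; adapt them to the build system of your software):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a unique build directory on the local SSD of the login node&lt;br /&gt;
BUILDDIR=$(mktemp -d $TMPDIR/build.XXXXXX)&lt;br /&gt;
cd $BUILDDIR&lt;br /&gt;
# Unpack, configure, compile and link on the fast local SSD ...&lt;br /&gt;
tar -xzf $HOME/mypackage.tar.gz&lt;br /&gt;
cd mypackage&lt;br /&gt;
./configure --prefix=$HOME/mypackage&lt;br /&gt;
make -j 8&lt;br /&gt;
# ... but install the final result into $HOME&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;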
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to also have access to the LSDF Online Storage on the HPC clusters. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
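&lt;br /&gt;
As a minimal, hedged sketch of how these variables might be used in a job script (the project directory myproject and the application myapp are placeholders):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&lt;br /&gt;
# Stage input from the LSDF project (placeholder path) to the local SSD&lt;br /&gt;
rsync -av $LSDFPROJECTS/myproject/input/ $TMPDIR/input/&lt;br /&gt;
myapp -input $TMPDIR/input -outputdir $TMPDIR/results&lt;br /&gt;
# Save results back to the LSDF project before the job ends&lt;br /&gt;
rsync -av $TMPDIR/results $LSDFPROJECTS/myproject/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;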
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of bwUniCluster can request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after the job ends.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private file system will be deleted after your job. Make sure you have copied your data back to a global file system within the job, e.g. to $HOME or to a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
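&lt;br /&gt;
The following sketch of a BeeOND job is hedged: both the constraint name and the mount point below are assumptions, so check the page linked above for the actual values.&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH --constraint=BEEOND    # assumed constraint name, see linked page&lt;br /&gt;
&lt;br /&gt;
# ASSUMPTION: mount point of the on-demand file system, verify before use&lt;br /&gt;
BEEOND_DIR=/mnt/odfs/$SLURM_JOB_ID&lt;br /&gt;
# Copy input in, run the application (placeholder), copy results out&lt;br /&gt;
cp -r $(ws_find myws)/input $BEEOND_DIR/&lt;br /&gt;
myapp -input $BEEOND_DIR/input -outputdir $BEEOND_DIR/results&lt;br /&gt;
# Copy results back within the job - BeeOND data is purged afterwards&lt;br /&gt;
rsync -av $BEEOND_DIR/results $(ws_find myws)/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;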
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13019</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13019"/>
		<updated>2024-10-17T12:39:33Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Restoring expired Workspaces */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand fabric of the file servers with the InfiniBand&lt;br /&gt;
switch of the compute cluster, provides a fast and scalable&lt;br /&gt;
parallel file system for bwUniCluster 2.0.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages such as SLURM have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Three nodes are dedicated to this service but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes are delivering additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 3&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.5&lt;br /&gt;
| 2.6&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user, for &amp;lt;br&amp;gt; MA users 256 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It can provide data transfer rates of up to 54 GB/s for reads and writes when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired we can restore the data during a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the  [[workspace]] page.&lt;br /&gt;
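&lt;br /&gt;
For orientation, a typical sequence with the workspace tools looks as follows (myws is a placeholder; treat the exact options as illustrative and see the [[workspace]] page for details):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_allocate myws 60        # create workspace myws with a lifetime of 60 days&lt;br /&gt;
$ ws_list                    # list your workspaces and remaining lifetimes&lt;br /&gt;
$ cd $(ws_find myws)         # change into the workspace directory&lt;br /&gt;
$ ws_extend myws 60          # renew the lifetime (possible 3 times)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;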
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Restoring expired Workspaces ===&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On bwUniCluster 2.0 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work on the same filesystem! So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Therefore, you can use the &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
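&lt;br /&gt;
Putting these steps together, a complete restore could look like this (the full workspace name below is a placeholder for a name as printed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_restore -l                        # list expired workspaces with full names&lt;br /&gt;
$ ws_allocate my_restored 30           # target workspace on the same filesystem&lt;br /&gt;
$ ws_restore ab1234-myws-1712345678 my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;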
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
a few large files. In more detail, to increase the throughput performance of a parallel application,&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once (see the sketch after this list),&lt;br /&gt;
&lt;br /&gt;
* to exploit the complete filesystem bandwidth, use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at the stripe size (the default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process, store them on $TMPDIR.&lt;br /&gt;
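&lt;br /&gt;
As referenced in the first item above, the following sketch contrasts many small writes with one large sequential stream using the standard dd tool (the workspace name myws is a placeholder); on a parallel file system the second variant is generally much faster:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
WS=$(ws_find myws)&lt;br /&gt;
# 64 MiB written in 4 KiB blocks: many small requests, slow on Lustre&lt;br /&gt;
dd if=/dev/zero of=$WS/small_blocks bs=4k count=16384&lt;br /&gt;
# The same 64 MiB written in 16 MiB blocks: few large requests, fast&lt;br /&gt;
dd if=/dev/zero of=$WS/large_blocks bs=16M count=4&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;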
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts has been used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases, users no longer need to adapt file striping parameters for very large files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput for a single very large file created in the directory $HOME/my_output_dir, you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
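&lt;br /&gt;
The copy-based restriping procedure described above can be scripted. A minimal sketch (my_dir is a placeholder and must already carry the new stripe count):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# New files inherit the striping of my_dir, so copying re-stripes the data&lt;br /&gt;
for f in my_dir/*; do&lt;br /&gt;
    cp &amp;quot;$f&amp;quot; &amp;quot;$f.restriped&amp;quot;&lt;br /&gt;
    rm &amp;quot;$f&amp;quot;&lt;br /&gt;
    mv &amp;quot;$f.restriped&amp;quot; &amp;quot;$f&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;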
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should record for which directories you made changes&lt;br /&gt;
to the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should omit metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have a few large files rather than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application, the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process, store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. It is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to pass the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; to all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace named myws with a lifetime of 60 days on bwUniCluster 2.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace only needs to be managed on one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The real installation of the package (e.g. make install) &lt;br /&gt;
should be made into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to also have access to the LSDF Online Storage on the HPC clusters. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of bwUniCluster can request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after the job ends.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private file system will be deleted after your job. Make sure you have copied your data back to a global file system within the job, e.g. to $HOME or to a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13018</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=13018"/>
		<updated>2024-10-17T12:38:56Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Reminder for workspace deletion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand fabric of the file servers with the InfiniBand&lt;br /&gt;
switch of the compute cluster, provides a fast and scalable&lt;br /&gt;
parallel file system for bwUniCluster 2.0.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages such as SLURM have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Three nodes are dedicated to this service but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes are delivering additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 3&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.5&lt;br /&gt;
| 2.6&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user, for &amp;lt;br&amp;gt; MA users 256 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally delete data on $HOME,&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, up to a maximum of 240 days after workspace creation. If a workspace has inadvertently expired we can restore the data during a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the  [[workspace]] page.&lt;br /&gt;
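&lt;br /&gt;
For illustration, a typical workspace lifecycle might look like the following sketch (the workspace name is an example; please check the [[workspace]] page for the authoritative command reference):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# create a workspace with a lifetime of 60 days&lt;br /&gt;
$ ws_allocate myws 60&lt;br /&gt;
# list your workspaces and their remaining lifetime&lt;br /&gt;
$ ws_list&lt;br /&gt;
# extend the lifetime (possible 3 times)&lt;br /&gt;
$ ws_extend myws 60&lt;br /&gt;
# release the workspace when the data is no longer needed&lt;br /&gt;
$ ws_release myws&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;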
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Restoring expired Workspaces ====&lt;br /&gt;
&lt;br /&gt;
At expiration time your workspace will be moved to a special, hidden directory. On bwUniCluster 2.0 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to get a list of your expired workspaces, and then restore them into an &#039;&#039;&#039;existing, active workspace&#039;&#039;&#039; (here with name &amp;lt;code&amp;gt;my_restored&amp;lt;/code&amp;gt;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ws_restore &amp;lt;full_name_of_expired_workspace&amp;gt; my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
NOTE: The expired workspace has to be specified using the full name as listed by &amp;lt;code&amp;gt;ws_restore -l&amp;lt;/code&amp;gt;, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified).&lt;br /&gt;
The target workspace, on the other hand, must be given with just its short name as listed by &amp;lt;code&amp;gt;ws_list&amp;lt;/code&amp;gt;, without the username prefix.&lt;br /&gt;
&lt;br /&gt;
NOTE: &amp;lt;code&amp;gt;ws_restore&amp;lt;/code&amp;gt; can only work on the same filesystem! So you have to ensure that the new workspace allocated with &amp;lt;code&amp;gt;ws_allocate&amp;lt;/code&amp;gt; is placed on the same filesystem as the expired workspace. Therefore, you can use &amp;lt;code&amp;gt;-F &amp;lt;filesystem&amp;gt;&amp;lt;/code&amp;gt; flag if needed.&lt;br /&gt;
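&lt;br /&gt;
Putting these steps together, a restore might look like the following sketch (workspace names and the timestamp are examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# list expired workspaces and note the full name of the one to restore&lt;br /&gt;
$ ws_restore -l&lt;br /&gt;
# create a target workspace on the same filesystem as the expired one&lt;br /&gt;
$ ws_allocate my_restored 30&lt;br /&gt;
# restore the expired workspace into the target workspace&lt;br /&gt;
$ ws_restore ab1234-myws-1234567890 my_restored&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;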
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size and the number of storage subsystems used is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit the complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even if they have very huge files or want to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should annotate for which directories you made changes&lt;br /&gt;
to the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
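&lt;br /&gt;
As a sketch, re-striping an existing file along the lines described above could look like this (file and directory names are examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# set the desired stripe count on the parent directory&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
# copy the file so that the copy inherits the new striping, then replace the original&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;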
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should be made into the $HOME folder.&lt;br /&gt;
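&lt;br /&gt;
A minimal sketch of this workflow on a login node (package name and paths are examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# create a unique build directory on the local SSD&lt;br /&gt;
$ builddir=$(mktemp -d $TMPDIR/build.XXXXXX)&lt;br /&gt;
$ cd $builddir&lt;br /&gt;
# unpack, configure and compile below $TMPDIR, but install into $HOME&lt;br /&gt;
$ tar -xzf $HOME/downloads/mypackage.tar.gz&lt;br /&gt;
$ cd mypackage&lt;br /&gt;
$ ./configure --prefix=$HOME/software/mypackage&lt;br /&gt;
$ make &amp;amp;&amp;amp; make install&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;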
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to also have access to the LSDF Online Storage on the HPC clusters. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
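&lt;br /&gt;
As a sketch, inside a job with the LSDF constraint you could stage results into your storage project like this (the project directory name is an example):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# copy job results into the LSDF storage project (example path)&lt;br /&gt;
rsync -av $TMPDIR/results/ $LSDFPROJECTS/my-project/results/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;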
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
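&lt;br /&gt;
A minimal sketch of a BeeOND job script follows; the constraint name and the mount point are assumptions here, so please check the page linked above for the exact values:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH --constraint=BEEOND&lt;br /&gt;
&lt;br /&gt;
# the on-demand file system is mounted below a job-specific path (assumed here)&lt;br /&gt;
BEEOND_DIR=/mnt/odfs/$SLURM_JOB_ID&lt;br /&gt;
# stage input data in, run the application, copy results back before the job ends&lt;br /&gt;
cp -r $(ws_find data-ssd)/input $BEEOND_DIR/&lt;br /&gt;
myapp -input $BEEOND_DIR/input -outputdir $BEEOND_DIR/results&lt;br /&gt;
rsync -av $BEEOND_DIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;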
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data of the home directories, but ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=12929</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=12929"/>
		<updated>2024-08-14T12:58:42Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Selecting the appropriate file system */ backup par corrected&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, that is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, such as SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Three nodes are dedicated to this service but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
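&lt;br /&gt;
For illustration, a job is typically submitted with a script similar to the following sketch (partition name, resources and application are examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --partition=single&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 01:00:00&lt;br /&gt;
&lt;br /&gt;
./myapp&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Submit the script with sbatch and monitor it with squeue.&lt;br /&gt;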
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes are delivering additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 3&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; for details see Table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user, for &amp;lt;br&amp;gt; MA users 256 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally delete data on $HOME,&lt;br /&gt;
there is a chance that we can restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, up to a maximum of 240 days after workspace creation. If a workspace has inadvertently expired we can restore the data during a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the  [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size and the number of storage subsystems used is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit the complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even if they have very huge files or want to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should annotate for which directories you made changes&lt;br /&gt;
to the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages: the software package should be unpacked, compiled and linked in a &lt;br /&gt;
subdirectory of $TMPDIR, while the final installation of the package (e.g. make install) &lt;br /&gt;
should go into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
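A minimal sketch of this build workflow on a login node (mysoftware and the archive name are placeholders):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a unique build directory on the local SSD of the login node&lt;br /&gt;
mkdir -p $TMPDIR/$(whoami)/build&lt;br /&gt;
cd $TMPDIR/$(whoami)/build&lt;br /&gt;
# Unpack, configure and compile the package on the fast local SSD&lt;br /&gt;
tar -xzf $HOME/mysoftware.tar.gz&lt;br /&gt;
cd mysoftware&lt;br /&gt;
./configure --prefix=$HOME/mysoftware&lt;br /&gt;
make&lt;br /&gt;
# Install the final result into $HOME, not into $TMPDIR&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;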
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and datamover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
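&lt;br /&gt;
As a minimal sketch, these variables can be used inside a job script like this (the storage project name myproject is a placeholder):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&lt;br /&gt;
# List the storage projects you have access to&lt;br /&gt;
ls $LSDFPROJECTS&lt;br /&gt;
# Copy input data from a storage project to the fast local SSD&lt;br /&gt;
cp $LSDFPROJECTS/myproject/input.dat $TMPDIR/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;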
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after the job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private file system will be deleted after your job. Make sure you have copied your data back to a global file system within the job, e.g. to $HOME or to a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
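&lt;br /&gt;
A hedged sketch of this copy-in/copy-out pattern inside a job script ($BEEOND_DIR is a placeholder for the BeeOND mount point; the actual path is given on the page linked below):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Copy the input data from a workspace to the on-demand file system&lt;br /&gt;
cp -r $(ws_find data-ssd)/dataset $BEEOND_DIR/&lt;br /&gt;
# The application reads from and writes to the fast private file system&lt;br /&gt;
myapp -input $BEEOND_DIR/dataset/myinput.csv -outputdir $BEEOND_DIR/results&lt;br /&gt;
# Save the results before the job ends, BeeOND is purged afterwards&lt;br /&gt;
rsync -av $BEEOND_DIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;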
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, but ACLs and extended attributes are&lt;br /&gt;
not backed up.&lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Environment_Modules&amp;diff=12670</id>
		<title>Environment Modules</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Environment_Modules&amp;diff=12670"/>
		<updated>2024-03-25T08:24:06Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* Software job examples  added echo and comment to omit that users are copying &amp;quot;/&amp;quot; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
Software on the bwHPC Clusters is provided as &#039;&#039;&#039;Software Environment Modules&#039;&#039;&#039;, or short &#039;&#039;&#039;Modules&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Modules make it possible to have different versions of a software package installed at the same time. &lt;br /&gt;
The complete environment for a software package, including the compilers and libraries needed by this specific version, is then loaded with a single command. This usually happens at the beginning of the job script.&lt;br /&gt;
&lt;br /&gt;
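For example, a job script typically starts like this (a minimal sketch using the generic system/example module described later in this article):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 00:10:00&lt;br /&gt;
&lt;br /&gt;
# Load the complete environment for this software version&lt;br /&gt;
module load system/example/1.0&lt;br /&gt;
# ... then run the program provided by the module&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;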
= Basic Usage =&lt;br /&gt;
== General Documentation on the Modules Environment Software ==&lt;br /&gt;
&lt;br /&gt;
We will provide an overview of the most important commands in the next sections. &lt;br /&gt;
&lt;br /&gt;
For anything not covered here, the full documentation written by the software developers is available on the cluster via the commands: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;module help&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;man module&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Online documentation of the project is available on the [https://lmod.readthedocs.io/en/latest/ Environment Modules Website].&lt;br /&gt;
&lt;br /&gt;
== Module categories, versions  and defaults ==&lt;br /&gt;
The bwHPC clusters organize &#039;&#039;Modules&#039;&#039; into categories, and each software package can exist in several versions: &lt;br /&gt;
&lt;br /&gt;
 category/softwarename/version&lt;br /&gt;
For instance, the Intel compiler X.Y belongs to the category of compilers, therefore the &lt;br /&gt;
modulefile &#039;&#039;X.Y&#039;&#039; is placed under the category &#039;&#039;compiler&#039;&#039; and the name &#039;&#039;intel&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In case of multiple software versions, one version will always be defined as the &#039;&#039;&#039;default&#039;&#039;&#039; &lt;br /&gt;
version. The default &#039;&#039;Module&#039;&#039; can be addressed by simply omitting the version number:&lt;br /&gt;
 category/softwarename&lt;br /&gt;
&lt;br /&gt;
e.g. if mathematica is installed, it is in the module&lt;br /&gt;
&lt;br /&gt;
 math/mathematica&lt;br /&gt;
&lt;br /&gt;
Currently all bwHPC software packages are assigned to the following &#039;&#039;Module&#039;&#039; categories:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt; bio cae chem compiler devel lib math mpi numlib phys system vis &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
[[:Category:Biology_software|bio]]&lt;br /&gt;
[[:Category:Engineering_software|cae]]&lt;br /&gt;
[[:Category:Chemistry_software|chem]]&lt;br /&gt;
[[:Category:Compiler_software|compiler]]&lt;br /&gt;
[[:Category:Debugger_software|devel]]&lt;br /&gt;
[[:Category:Mathematics_software|math]]&lt;br /&gt;
mpi&lt;br /&gt;
[[:Category:Numerical libraries|numlib]]&lt;br /&gt;
[[:Category:Physics software|phys]]&lt;br /&gt;
[[:Category:System software|system]]&lt;br /&gt;
[[:Category:Visualization|vis]]&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Display and search available Modules ==&lt;br /&gt;
Available &#039;&#039;Modules&#039;&#039; are modulefiles that can be loaded by the user. A &#039;&#039;Module&#039;&#039; must be loaded before it changes your environment. You can display all available &#039;&#039;Modules&#039;&#039; on the system by executing:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module avail&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can selectively list the software in one of these categories (e.g. the category &amp;quot;compiler&amp;quot;) or just all versions of a certain module:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module avail compiler/&lt;br /&gt;
$ module avail compiler/gnu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== module help ==&lt;br /&gt;
A help message for a specific &#039;&#039;Module&#039;&#039; can be displayed with &#039;&#039;&#039;&#039;module help category/softwarename/version&#039;&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The help message usually contains additional information about the software and points to the software website and documentation.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module help system/example/1.0 &lt;br /&gt;
----------------- Module Specific Help for &amp;quot;system/example/1.0&amp;quot; ---------------------------&lt;br /&gt;
&amp;quot;This module provides a bwhpc-examples job that works on every cluster.&lt;br /&gt;
&lt;br /&gt;
[... rest of the output is omitted in the Wiki for clarity ...]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Loading Modules and checking they are loaded ==&lt;br /&gt;
To load a software &#039;&#039;Module&#039;&#039; and display all loaded modules:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module list&lt;br /&gt;
No Modulefiles Currently Loaded.&lt;br /&gt;
$ module load system/example/1.0&lt;br /&gt;
$ module list&lt;br /&gt;
Currently Loaded Modulefiles:&lt;br /&gt;
  1) system/example/1.0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Modules make software available only in your current shell session. Whenever you log in, you have to load the software again. Please do not auto-load modules in .bashrc at login; this can lead to problems with other modules you may load later.&lt;br /&gt;
&lt;br /&gt;
== Software job examples ==&lt;br /&gt;
bwHPC provides example job scripts for most installed software modules.&lt;br /&gt;
&lt;br /&gt;
For a software &#039;&#039;Module&#039;&#039; with the software called &#039;&#039;&#039;SOMESOFTWARE&#039;&#039;&#039;, you can find the example directory with:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ cd  $SOMESOFTWARE_EXA_DIR&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Copy the whole example folder to your $HOME directory, so you can edit those job examples:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ cd&lt;br /&gt;
$ mkdir softwarename_examples&lt;br /&gt;
$ echo $SOMESOFTWARE_EXA_DIR&lt;br /&gt;
# Please do not proceed if the command above does not print any text!&lt;br /&gt;
# Otherwise you would copy all system data (the directory &amp;quot;/&amp;quot;).&lt;br /&gt;
$ cp -r $SOMESOFTWARE_EXA_DIR/ softwarename_examples/&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your specific software isn&#039;t installed, there is a dummy software example module &amp;quot;system/example&amp;quot; present on all clusters. For this module, the process looks like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Load the example module&lt;br /&gt;
$ module load system/example/1.0&lt;br /&gt;
&lt;br /&gt;
# Run example in a temporary directory&lt;br /&gt;
$ mkdir tmp_example_dir&lt;br /&gt;
$ cp -r $EXAMPLE_EXA_DIR/ tmp_example_dir/&lt;br /&gt;
$ cd tmp_example_dir/bwhpc-examples&lt;br /&gt;
&lt;br /&gt;
# Example jobscript for clusters using the SLURM batch system&lt;br /&gt;
sbatch examples-1.0.slurm&lt;br /&gt;
# Example jobscript for clusters using PBS&lt;br /&gt;
qsub examples-1.0.pbs&lt;br /&gt;
&lt;br /&gt;
# Print the results&lt;br /&gt;
cat examples_result.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
= Additional Usage  Recommendations = &lt;br /&gt;
&lt;br /&gt;
=== Loading conflicts ===&lt;br /&gt;
By default you cannot load different versions of the same software &#039;&#039;Module&#039;&#039; in the same session. For example, loading Intel compiler version X while Intel compiler version Y is loaded results in an error message like the following:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
Module &#039;compiler/intel/X&#039; conflicts with the currently loaded module(s) &#039;compiler/intel/Y&#039;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The solution is [[#Unloading Modules|unloading]] or switching &#039;&#039;Modules&#039;&#039;.&lt;br /&gt;
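&lt;br /&gt;
For example, to replace version Y with version X (a sketch; the Lmod module system documented above also provides module swap as a single step):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module unload compiler/intel/Y&lt;br /&gt;
$ module load compiler/intel/X&lt;br /&gt;
# or, in one step:&lt;br /&gt;
$ module swap compiler/intel/Y compiler/intel/X&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;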
&lt;br /&gt;
=== Showing the changes introduced by a Module ===&lt;br /&gt;
Loading a &#039;&#039;Module&#039;&#039; will change the environment of the current shell session. For instance the $PATH variable will be expanded by the software&#039;s binary directory. Other &#039;&#039;Module&#039;&#039; variables may even change the behavior of the current shell session or the software program(s) in a more drastic way. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Loaded &#039;&#039;Modules&#039;&#039; may also invoke an additional set of environment variables, which e.g. point to directories or destinations of documentation and examples. Their nomenclature is systematic: &lt;br /&gt;
{| width=600px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Variable&lt;br /&gt;
! Pointing to&lt;br /&gt;
|-&lt;br /&gt;
| $SWN_HOME&lt;br /&gt;
| Root directory of the software package&lt;br /&gt;
|-&lt;br /&gt;
| $SWN_DOC_DIR&lt;br /&gt;
| Documentation&lt;br /&gt;
|-&lt;br /&gt;
| $SWN_EXA_DIR&lt;br /&gt;
| Examples&lt;br /&gt;
|-&lt;br /&gt;
| $SWN_BPR_URL&lt;br /&gt;
| URL of software&#039;s Wiki article&lt;br /&gt;
|-&lt;br /&gt;
| and many many more...&lt;br /&gt;
| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
with SWN being the placeholder for the software &#039;&#039;Module&#039;&#039; name.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
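For instance, with the generic example module loaded, such a variable can be inspected directly:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load system/example/1.0&lt;br /&gt;
$ echo $EXAMPLE_EXA_DIR&lt;br /&gt;
/opt/bwhpc/common/system/example/1.0/bwhpc-examples&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;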
All the changes to the current shell session to be invoked by loading the &#039;&#039;Module&#039;&#039; can be reviewed using &#039;&#039;&#039;&#039;module show category/softwarename/version&#039;&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module show system/example/1.0&lt;br /&gt;
---------------------------------------------------------------------------------------------------&lt;br /&gt;
   /opt/bwhpc/common/modulefiles/Core/system/example/1.0.lua:&lt;br /&gt;
---------------------------------------------------------------------------------------------------&lt;br /&gt;
whatis(&amp;quot;A generic module containing a working bwhpc-examples job.&amp;quot;)&lt;br /&gt;
setenv(&amp;quot;EXAMPLE_VERSION&amp;quot;,&amp;quot;1.0&amp;quot;)&lt;br /&gt;
setenv(&amp;quot;EXAMPLE_HOME&amp;quot;,&amp;quot;/opt/bwhpc/common/system/example/1.0&amp;quot;)&lt;br /&gt;
setenv(&amp;quot;EXAMPLE_BIN_DIR&amp;quot;,&amp;quot;/opt/bwhpc/common/system/example/1.0/bin&amp;quot;)&lt;br /&gt;
setenv(&amp;quot;EXAMPLE_EXA_DIR&amp;quot;,&amp;quot;/opt/bwhpc/common/system/example/1.0/bwhpc-examples&amp;quot;)&lt;br /&gt;
prepend_path(&amp;quot;PATH&amp;quot;,&amp;quot;/opt/bwhpc/common/system/example/1.0/bin&amp;quot;)&lt;br /&gt;
help([[&amp;quot;This module provides a bwhpc-examples job that works on every cluster.&lt;br /&gt;
The module is used as example in the bwHPC-Wiki and therefore should be installed on every cluster,&lt;br /&gt;
such that users can try the commands out.&lt;br /&gt;
&lt;br /&gt;
* The executable of this module can be found in the folder&lt;br /&gt;
  $EXAMPLE_BIN_DIR&lt;br /&gt;
  Upon loading the module, the binaries are added to PATH.&lt;br /&gt;
&lt;br /&gt;
* Further documentation for using the example can be found in&lt;br /&gt;
  https://wiki.bwhpc.de/e/Environment_Modules&lt;br /&gt;
&lt;br /&gt;
* Examples are located at:&lt;br /&gt;
  $EXAMPLE_EXA_DIR&lt;br /&gt;
]])&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Modules depending on Modules ===&lt;br /&gt;
Some program &#039;&#039;Modules&#039;&#039; depend on libraries that must be loaded into the user environment. Therefore the&lt;br /&gt;
corresponding &#039;&#039;Modules&#039;&#039; of the software must be loaded together with the &#039;&#039;Modules&#039;&#039; of &lt;br /&gt;
the libraries. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
By default such software &#039;&#039;Modules&#039;&#039; try to load required &#039;&#039;Modules&#039;&#039; and corresponding versions automatically. However, automatic loading might fail if a different version of that required &#039;&#039;Module&#039;&#039; &lt;br /&gt;
is already loaded (cf. [[#Loading conflicts|Loading conflicts]]).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Unloading Modules ==&lt;br /&gt;
To unload or to remove a software &#039;&#039;Module&#039;&#039; execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module unload category/softwarename/version&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Unloading all loaded modules ===&lt;br /&gt;
In order to remove all previously loaded software modules from your environment issue the command &#039;module purge&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module list&lt;br /&gt;
Currently Loaded Modulefiles:&lt;br /&gt;
  1) devel/gdb/7.7&lt;br /&gt;
  2) compiler/intel/14.0&lt;br /&gt;
  3) mpi/openmpi/1.8-intel-14.0(default)&lt;br /&gt;
$&lt;br /&gt;
$ module purge&lt;br /&gt;
$ module list&lt;br /&gt;
No Modulefiles Currently Loaded.&lt;br /&gt;
$ &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Other Module commands ==&lt;br /&gt;
=== module whatis ===&lt;br /&gt;
A short description for a specific &#039;&#039;Module&#039;&#039; can be displayed with &#039;&#039;&#039;&#039;module whatis category/softwarename/version&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module whatis system/example/1.0 &lt;br /&gt;
system/example/1.0  : A generic module containing a working bwhpc-examples job.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=12497</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=12497"/>
		<updated>2023-11-28T17:19:16Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMPDIR */ adapted capacity&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand of the file servers with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, e.g. SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others, which are of greater importance to system&lt;br /&gt;
administrators, are not covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Two nodes are dedicated to this service, but both are accessible via&lt;br /&gt;
one address: a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes are delivering additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 4&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.5&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; for details see Table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user &amp;lt;br&amp;gt; (256 GiB for MA users); &amp;lt;br&amp;gt; also limited per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; (2.5 million for MA users)&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
you can usually restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace creation. If a workspace has inadvertently expired we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the  [[workspace]] page.&lt;br /&gt;
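&lt;br /&gt;
As a quick sketch of the typical cycle (see the [[workspace]] page for all commands and options):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Create a workspace named myws with a lifetime of 60 days&lt;br /&gt;
$ ws_allocate myws 60&lt;br /&gt;
# Print the path of the workspace, e.g. for use in job scripts&lt;br /&gt;
$ ws_find myws&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;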
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even if they have very large files or want to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1, which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory, the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old names (see the sketch after the next command). In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
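A minimal sketch of the re-striping procedure described above (my_output_dir and my_file are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
# Existing files keep their old layout, so rewrite them:&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new&lt;br /&gt;
$ rm $HOME/my_output_dir/my_file&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;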
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should annotate for which directories you made changes&lt;br /&gt;
to the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. This file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; for all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace named myws with a lifetime of 60 days on bwUniCluster 2.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that each workspace only has to be managed on one of the clusters, since the workspace directory names differ between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks; please open a ticket to request a restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type &lt;br /&gt;
is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages: the software package should be unpacked, compiled and linked in a &lt;br /&gt;
subdirectory of $TMPDIR, while the final installation of the package (e.g. make install) &lt;br /&gt;
should go into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and datamover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after the job.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;IMPORTANT: &#039;&#039;&#039;&lt;br /&gt;
: All data on the private file system will be deleted after your job. Make sure you have copied your data back to a global file system within the job, e.g. to $HOME or to a workspace.&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, but ACLs and extended attributes are&lt;br /&gt;
not backed up.&lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=12389</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=12389"/>
		<updated>2023-10-10T15:08:52Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* File Systems replace all remaining TMP with TMPDIR*/&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, e.g. SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From the point of view of an end user, the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Four nodes are dedicated to this service, but they are all accessible via&lt;br /&gt;
one address, and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes deliver additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 4&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the largest HPC systems use Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with a temporary lifetime. Another workspace file system, based on flash storage, is available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user, for &amp;lt;br&amp;gt; MA users 256 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME,&lt;br /&gt;
you can usually restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace creation. If a workspace has inadvertently expired, we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.&lt;br /&gt;
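&lt;br /&gt;
As a quick overview, the most common workspace commands look like this (the workspace name myws and the lifetimes are just examples; see the [[workspace]] page for all options):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Create a workspace named myws with a lifetime of 30 days&lt;br /&gt;
$ ws_allocate myws 30&lt;br /&gt;
# List your workspaces and print the path of myws&lt;br /&gt;
$ ws_list&lt;br /&gt;
$ ws_find myws&lt;br /&gt;
# Extend the lifetime of myws by another 30 days&lt;br /&gt;
$ ws_extend myws 30&lt;br /&gt;
# Release the workspace when it is no longer needed&lt;br /&gt;
$ ws_release myws&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;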
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size and the number of used storage subsystems is called the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once (see the sketch after this list),&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at the stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
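As a toy illustration of the first bullet, the following command writes a single large file sequentially in stripe-sized blocks (file name and size are arbitrary examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Write 1 GiB sequentially in 1 MiB blocks, matching the default stripe size&lt;br /&gt;
$ dd if=/dev/zero of=$(ws_find data-ssd)/testfile bs=1M count=1024&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;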
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even if they have very huge files or want to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
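As a concrete sketch of the copy-and-replace procedure described above (the file and directory names are just examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Stripe new files in the directory over all storage subsystems&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
# Re-create the existing file so that it picks up the new stripe count&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new&lt;br /&gt;
$ rm $HOME/my_output_dir/my_file&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;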
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should make a note of the directories for which you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have a few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; for all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is called &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace only has to be managed on one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name on different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory differ from node to node.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of $TMPDIR for each node type &lt;br /&gt;
can be checked in Table 1 above. The capacity is at least 900 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages. This means that the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The real installation of the package (e.g. make install) &lt;br /&gt;
should be made into the $HOME folder.&lt;br /&gt;
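&lt;br /&gt;
A minimal sketch of this workflow on a login node (the package name and the install prefix are just examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Create a unique build directory on the local SSD of the login node&lt;br /&gt;
$ builddir=$(mktemp -d -p $TMPDIR)&lt;br /&gt;
$ cd $builddir&lt;br /&gt;
# Unpack, configure and compile in the build directory&lt;br /&gt;
$ tar -xzf $HOME/mypackage.tar.gz&lt;br /&gt;
$ cd mypackage&lt;br /&gt;
$ ./configure --prefix=$HOME/my_software&lt;br /&gt;
$ make&lt;br /&gt;
# Install the final package into $HOME&lt;br /&gt;
$ make install&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;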
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after your job ends.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within the job), e.g., $HOME or any workspace.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, but ACLs and extended attributes are&lt;br /&gt;
not included in the backup. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=12072</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=12072"/>
		<updated>2023-07-03T07:49:47Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* File Systems */  add special quota limits for MA users&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, e.g. SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From the point of view of an end user, the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Four nodes are dedicated to this service, but they are all accessible via&lt;br /&gt;
one address, and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes deliver additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 200 + 60&lt;br /&gt;
| 260&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 4&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the largest HPC systems use Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with a temporary lifetime. Another workspace file system, based on flash storage, is available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMP&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user, for &amp;lt;br&amp;gt; MA users 256 GiB &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user &amp;lt;br&amp;gt; for MA users 2.5 million&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME,&lt;br /&gt;
you can usually restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMP. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMP and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
For users of the University of Mannheim the limit is 256 GiB and 2.5 million inodes.&lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It can provide high data transfer rates of up to 54 GB/s for reading and writing when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace creation. If a workspace has inadvertently expired, we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.&lt;br /&gt;
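&lt;br /&gt;
The following is a minimal sketch of the typical workspace life cycle (the workspace name myws is just an example; all options are documented on the [[workspace]] page):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_allocate myws 60   # create workspace myws with a lifetime of 60 days&lt;br /&gt;
$ ws_list               # list your workspaces and their remaining lifetimes&lt;br /&gt;
$ ws_find myws          # print the directory path of myws&lt;br /&gt;
$ ws_extend myws 60     # renew the lifetime (possible 3 times)&lt;br /&gt;
$ ws_release myws       # release the workspace before it expires&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;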
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application, some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are split into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also called chunks) is the stripe size, and the number of storage subsystems used is the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at the stripe size (the default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process, store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. Normally, users no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir, you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1, which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory, the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old names. In order to check the stripe settings of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should note down for which directories you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
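&lt;br /&gt;
As a minimal sketch, the restriping procedure described above looks as follows (my_output_dir and my_file are example names):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir                            # newly created files use all storage subsystems&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new   # the copy inherits the new stripe count&lt;br /&gt;
$ rm $HOME/my_output_dir/my_file&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;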
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as on local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process, store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence, compared to the other parallel file systems, performance is better for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name on different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of $TMPDIR for each node type &lt;br /&gt;
can be checked in Table 1 above. The capacity is at least 900 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages, i.e. the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should be made into the $HOME folder.&lt;br /&gt;
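&lt;br /&gt;
A minimal sketch of this workflow on a login node (the package name myapp-1.0 and the installation prefix are just examples):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mkdir -p $TMPDIR/$(whoami)/build &amp;&amp; cd $TMPDIR/$(whoami)/build   # unique subdirectory on the local SSD&lt;br /&gt;
$ tar -xzf $HOME/myapp-1.0.tar.gz &amp;&amp; cd myapp-1.0&lt;br /&gt;
$ ./configure --prefix=$HOME/software/myapp-1.0                    # install target below $HOME&lt;br /&gt;
$ make &amp;&amp; make install&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;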
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during the job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
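&lt;br /&gt;
A minimal sketch of a batch job using these variables (the project directory name pn1234 and the application myapp are placeholders; see the Slurm LSDF example linked above for the authoritative usage):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&lt;br /&gt;
# Copy input from the LSDF project directory (placeholder name) to the fast local SSD&lt;br /&gt;
cp -r $LSDFPROJECTS/pn1234/input $TMPDIR/&lt;br /&gt;
myapp -input $TMPDIR/input -outputdir $TMPDIR/results&lt;br /&gt;
# Save the results back to the LSDF Online Storage&lt;br /&gt;
rsync -av $TMPDIR/results $LSDFPROJECTS/pn1234/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;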
&lt;br /&gt;
== BeeOND (BeeGFS On-Demand) ==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job ends.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to a global filesystem (within the job), e.g. $HOME or any workspace.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]. A rough sketch of the typical copy-in/copy-out pattern follows below.&lt;br /&gt;
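&lt;br /&gt;
This sketch only illustrates the pattern; the mount point below is an assumption for illustration, and the linked page documents the exact request syntax and path:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH ...                          # request BeeOND as described on the linked page&lt;br /&gt;
&lt;br /&gt;
BEEOND_DIR=/mnt/odfs/$SLURM_JOB_ID   # assumed mount point, check the linked page&lt;br /&gt;
cp -r $(ws_find data-ssd)/input $BEEOND_DIR/&lt;br /&gt;
myapp -input $BEEOND_DIR/input -outputdir $BEEOND_DIR/results&lt;br /&gt;
# Copy results back before the job ends, BeeOND is purged afterwards&lt;br /&gt;
rsync -av $BEEOND_DIR/results $(ws_find data-ssd)/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;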
&lt;br /&gt;
== Backup and Archiving ==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11996</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11996"/>
		<updated>2023-05-12T08:26:37Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMPDIR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages such as SLURM have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Several nodes are dedicated to this service, but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes deliver additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 100 + 60&lt;br /&gt;
| 360&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 4&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMP&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME,&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME,&lt;br /&gt;
you can usually restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or which exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMP. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMP and read from there. Temporary data which is used by many nodes &lt;br /&gt;
of your batch job and which is only needed during the job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace, which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold files that are&lt;br /&gt;
used permanently, like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It can provide high data transfer rates of up to 54 GB/s for reading and writing when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace creation. If a workspace has inadvertently expired, we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application, some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are split into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also called chunks) is the stripe size, and the number of storage subsystems used is the stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at the stripe size (the default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process, store them on $TMP.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. Normally, users no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir, you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1, which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory, the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old names. In order to check the stripe settings of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should note down for which directories you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as on local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process, store them on $TMP,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence, compared to the other parallel file systems, performance is better for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0, execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters, since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name on different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of $TMPDIR for each node type &lt;br /&gt;
can be checked in Table 1 above. The capacity is at least 900 GB.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages, i.e. the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should be made into the $HOME folder.&lt;br /&gt;
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during the job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
&lt;br /&gt;
== BeeOND (BeeGFS On-Demand) ==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job ends.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to a global filesystem (within the job), e.g. $HOME or any workspace.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]].&lt;br /&gt;
&lt;br /&gt;
== Backup and Archiving ==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11995</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11995"/>
		<updated>2023-05-12T08:24:45Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMPDIR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages such as SLURM have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Several nodes are dedicated to this service, but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes deliver additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 100 + 60&lt;br /&gt;
| 360&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 4&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.5&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems use Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with a temporary lifetime. Another workspace file system, based on flash storage, is available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system IBM Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; for details see Table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME,&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME,&lt;br /&gt;
you can usually restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage&lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. during AI training,&lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used by many nodes&lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a&lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the&lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime&lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace, which can be&lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold files that are&lt;br /&gt;
permanently used, like source code, configuration files, executable programs, etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. your university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is specially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It is able to provide data transfer rates of up to 54 GB/s for both read and write when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime, and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace creation. If a workspace has inadvertently expired, we can restore the data within a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.&lt;br /&gt;
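&lt;br /&gt;
As a quick orientation (the [[workspace]] page remains authoritative), a typical workspace lifecycle with these tools might look as follows; the workspace name myws is only an example:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ws_allocate myws 60   # create workspace myws with a lifetime of 60 days&lt;br /&gt;
$ ws_find myws          # print the directory path of the workspace&lt;br /&gt;
$ ws_list               # list your workspaces and their remaining lifetimes&lt;br /&gt;
$ ws_extend myws 60     # renew the lifetime (possible 3 times)&lt;br /&gt;
$ ws_release myws       # release the workspace when it is no longer needed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;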
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once (see the short illustration after this list),&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
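&lt;br /&gt;
As a small illustration of the first point above, a single large sequential write uses the parallel file system much more efficiently than many small writes. A hedged sketch with dd; the workspace name myws and all sizes are only examples:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# write 1 GiB as one sequential stream of 1 MiB blocks into a workspace&lt;br /&gt;
$ dd if=/dev/zero of=$(ws_find myws)/testfile bs=1M count=1024&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;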
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1, which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory, the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old names. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
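The re-striping procedure described above might look like the following sketch; the stripe count of 8 and the file name are only examples:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# set a new stripe count on the parent directory (example value)&lt;br /&gt;
$ lfs setstripe -c 8 $HOME/my_output_dir&lt;br /&gt;
# copy the file so that the copy picks up the new striping&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new&lt;br /&gt;
# remove the old file and move the copy back to the old name&lt;br /&gt;
$ rm $HOME/my_output_dir/my_file&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;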
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should keep a record of the directories for which you changed&lt;br /&gt;
the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase the metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
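If you occasionally want the colored output despite such an alias, standard shell behaviour lets you bypass the alias for a single invocation:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ \ls    # the leading backslash bypasses the alias for this one call&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;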
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; with all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace named myws with a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
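&lt;br /&gt;
Since &#039;&#039;-F&#039;&#039; works with all workspace commands, as stated above, you could for example look up the path of that workspace with:&lt;br /&gt;
 ws_find -F ffuc myws&lt;br /&gt;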
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
The environment variable $TMPDIR contains the name of a directory which is local to each node. This means &lt;br /&gt;
that different tasks of a parallel application use different directories when they do not utilize the same node. &lt;br /&gt;
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the &lt;br /&gt;
content of this directory path on these nodes are different.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed from the local node during job runtime. It should &lt;br /&gt;
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this &lt;br /&gt;
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
&lt;br /&gt;
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance &lt;br /&gt;
on small files is much better than on the parallel file systems. The capacity of $TMPDIR for each node type &lt;br /&gt;
is shown in Table 1 above. &lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. &lt;br /&gt;
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique &lt;br /&gt;
for each job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
On login nodes $TMPDIR also points to a fast directory on a local SSD disk, but this directory is not unique. &lt;br /&gt;
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the &lt;br /&gt;
installation of software packages: the software package to be installed should be unpacked, &lt;br /&gt;
compiled and linked in a subdirectory of $TMPDIR. The actual installation of the package (e.g. make install) &lt;br /&gt;
should be made into the $HOME folder.&lt;br /&gt;
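&lt;br /&gt;
A minimal sketch of this build pattern on a login node; the package name and all paths are only examples:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# create a unique build directory on the local SSD&lt;br /&gt;
$ builddir=$(mktemp -d $TMPDIR/build.XXXXXX)&lt;br /&gt;
$ cd $builddir&lt;br /&gt;
# unpack, configure, compile and link below $TMPDIR ...&lt;br /&gt;
$ tar -xzf $HOME/src/mypackage.tar.gz&lt;br /&gt;
$ cd mypackage&lt;br /&gt;
$ ./configure --prefix=$HOME/mypackage&lt;br /&gt;
$ make&lt;br /&gt;
# ... but install the result into $HOME&lt;br /&gt;
$ make install&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;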
&lt;br /&gt;
=== Usage example for $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
Below we provide an example of using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage ==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
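&lt;br /&gt;
A hedged sketch of how these variables might be used inside a batch job; the storage project name myproject below $LSDFPROJECTS is only an assumed example:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
# copy job results into a storage project in the LSDF Online Storage&lt;br /&gt;
# (myproject is a placeholder for your real storage project)&lt;br /&gt;
rsync -av $TMPDIR/results/ $LSDFPROJECTS/myproject/results/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;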
&lt;br /&gt;
== BeeOND (BeeGFS On-Demand) ==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after the job.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]&lt;br /&gt;
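&lt;br /&gt;
A hedged sketch of the copy-in/copy-out pattern inside a job; the BeeOND mount point below is only a placeholder, the real path is given on the page linked above:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# BEEOND_DIR is a placeholder for the job-private BeeOND mount point&lt;br /&gt;
BEEOND_DIR=/mnt/beeond&lt;br /&gt;
# copy input data into the on-demand file system&lt;br /&gt;
cp -r $(ws_find data-ssd)/input $BEEOND_DIR/&lt;br /&gt;
# the application reads and writes on BeeOND&lt;br /&gt;
myapp -inputdir $BEEOND_DIR/input -outputdir $BEEOND_DIR/results&lt;br /&gt;
# copy results back to a global file system before the job ends&lt;br /&gt;
rsync -av $BEEOND_DIR/results $(ws_find data-ssd)/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;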
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11990</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11990"/>
		<updated>2023-05-10T08:47:56Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMPDIR */ typo&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system&lt;br /&gt;
Lustre, which is connected by coupling the InfiniBand fabric of the file servers with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is attached to bwUniCluster 2.0 to provide fast and scalable&lt;br /&gt;
parallel storage.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, e.g. SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others, which are of greater importance to system&lt;br /&gt;
administrators, are not covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Several nodes are dedicated to this service, but they are all accessible via&lt;br /&gt;
one address; a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The parallel file system Lustre is served by dedicated file server nodes; the file&lt;br /&gt;
system is connected by coupling the InfiniBand fabric of the file servers with the independent InfiniBand switch of the compute cluster. In addition to the shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes deliver additional services like resource management, external&lt;br /&gt;
network connection, administration, etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 100 + 60&lt;br /&gt;
| 360&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 4&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.1&lt;br /&gt;
| 2.6&lt;br /&gt;
| 2.5&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems use Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with a temporary lifetime. Another workspace file system, based on flash storage, is available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system IBM Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; for details see Table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME,&lt;br /&gt;
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME,&lt;br /&gt;
you can usually restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage&lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. during AI training,&lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used by many nodes&lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a&lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the&lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime&lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace, which can be&lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold files that are&lt;br /&gt;
permanently used, like source code, configuration files, executable programs, etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. your university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is specially designed for parallel access and for high throughput to large&lt;br /&gt;
files. It is able to provide data transfer rates of up to 54 GB/s for both read and write when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime, and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace creation. If a workspace has inadvertently expired, we can restore the data within a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase the throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even for very huge files or to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1, which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory, the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old names. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
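For illustration, the complete restriping procedure described above might look like the following sketch (my_output_dir and my_file are example names):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Set a new stripe count on the parent directory&lt;br /&gt;
$ lfs setstripe -c -1 $HOME/my_output_dir&lt;br /&gt;
# Copy the file so that the copy picks up the new striping&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new&lt;br /&gt;
# Remove the old file and move the copy back to the old name&lt;br /&gt;
$ rm $HOME/my_output_dir/my_file&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;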
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should note down for which directories you made changes&lt;br /&gt;
to the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
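If you only need the uncolored output occasionally, you can also bypass the alias for a single invocation by prefixing the command with a backslash:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ \ls    # runs ls without the alias and hence without colorization&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;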
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; for all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace is managed on only one of the clusters, since the workspace directory names differ between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
While all tasks of a parallel application access the same $HOME and workspace directory, the &lt;br /&gt;
$TMPDIR directory is local to each node on bwUniCluster 2.0. All nodes have fast SSD &lt;br /&gt;
local storage devices which are used to store data below $TMPDIR. The capacity of $TMPDIR for each node type &lt;br /&gt;
can be checked above. Different tasks of a parallel application use different $TMPDIR directories when &lt;br /&gt;
they run on different nodes. Although $TMPDIR points to the same path name on different nodes of a job, &lt;br /&gt;
the physical location on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on each node and assigned to the job. &lt;br /&gt;
$TMPDIR is newly set and the name of the subdirectory contains the job ID so that the&lt;br /&gt;
subdirectory name is unique for each job. This unique name is then assigned to the&lt;br /&gt;
environment variable $TMPDIR within the job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed by single tasks. It should also be used &lt;br /&gt;
if you read the same data many times from a single node, e.g. if you are doing AI training. In this case &lt;br /&gt;
you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
In addition, this directory should be used for the installation of software packages. This means that &lt;br /&gt;
the software package to be installed should be unpacked, compiled and linked in a subdirectory of $TMPDIR. &lt;br /&gt;
The actual installation of the package (e.g. make install) can then be done into the $HOME filesystem.&lt;br /&gt;
&lt;br /&gt;
=== Usage example of $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
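As an illustration, data might be copied into a storage project from within a job; this is only a sketch, and myproject is a placeholder for the name of an actual storage project below $LSDFPROJECTS:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# myproject is a placeholder for your storage project&lt;br /&gt;
cp -r results/ $LSDFPROJECTS/myproject/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;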
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to a global filesystem within the job, e.g., to $HOME or any workspace.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
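A minimal staging pattern might look like the following sketch; /path/to/beeond is a placeholder for the actual mount point of the on-demand file system, which is described on the page linked below:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
# Stage input data into the on-demand file system (placeholder path)&lt;br /&gt;
cp -r $(ws_find myws)/input /path/to/beeond/&lt;br /&gt;
# ... run the application on the data below /path/to/beeond ...&lt;br /&gt;
# Stage results out before the job ends, since all data is purged afterwards&lt;br /&gt;
rsync -av /path/to/beeond/results $(ws_find myws)/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;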
&lt;br /&gt;
For detailed usage see [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]].&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from the backup.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11989</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11989"/>
		<updated>2023-05-10T08:40:05Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMP */  Umbenennung von $TMP in $TMPDIR&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition the file system&lt;br /&gt;
Lustre, that is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, such as SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Two nodes are dedicated to this service but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes deliver additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 100 + 60&lt;br /&gt;
| 360&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 4&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMPDIR&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
you can usually restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired we can restore the data during a limited time (few weeks). In this case you should create a new workspace and report the name of the new and of the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the  [[workspace]] page.&lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can also send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are split into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts has been used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even if they have very large files or want to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1, which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory, the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
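For illustration, the complete restriping procedure described above might look like the following sketch (my_output_dir and my_file are example names):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Set a new stripe count on the parent directory&lt;br /&gt;
$ lfs setstripe -c -1 $HOME/my_output_dir&lt;br /&gt;
# Copy the file so that the copy picks up the new striping&lt;br /&gt;
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new&lt;br /&gt;
# Remove the old file and move the copy back to the old name&lt;br /&gt;
$ rm $HOME/my_output_dir/my_file&lt;br /&gt;
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;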
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should note down for which directories you made changes&lt;br /&gt;
to the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
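If you only need the uncolored output occasionally, you can also bypass the alias for a single invocation by prefixing the command with a backslash:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ \ls    # runs ls without the alias and hence without colorization&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;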
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
Another workspace file system is available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system with the option &#039;&#039;-F&#039;&#039; for all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace is managed on only one of the clusters, since the workspace directory names differ between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMPDIR ==&lt;br /&gt;
&lt;br /&gt;
While all tasks of a parallel application access the same $HOME and workspace directory, the &lt;br /&gt;
$TMPDIR directory is local to each node on bwUniCluster 2.0. All nodes have fast SSD &lt;br /&gt;
local storage devices which are used to store data below $TMPDIR. The capacity of $TMPDIR for each node type &lt;br /&gt;
can be checked above. Different tasks of a parallel application use different $TMPDIR directories when &lt;br /&gt;
they run on different nodes. Although $TMPDIR points to the same path name on different nodes of a job, &lt;br /&gt;
the physical location on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on each node and assigned to the job. &lt;br /&gt;
$TMPDIR is newly set and the name of the subdirectory contains the job ID so that the&lt;br /&gt;
subdirectory name is unique for each job. This unique name is then assigned to the&lt;br /&gt;
environment variable $TMPDIR within the job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed by single tasks. It should also be used &lt;br /&gt;
if you read the same data many times from a single node, e.g. if you are doing AI training. In this case &lt;br /&gt;
you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.&lt;br /&gt;
In addition, this directory should be used for the installation of software packages. This means that &lt;br /&gt;
the software package to be installed should be unpacked, compiled and linked in a subdirectory of $TMPDIR. &lt;br /&gt;
The actual installation of the package (e.g. make install) can then be done into the $HOME filesystem.&lt;br /&gt;
&lt;br /&gt;
=== Usage example of $TMPDIR ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR &lt;br /&gt;
and save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMPDIR&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR&lt;br /&gt;
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
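As an illustration, data might be copied into a storage project from within a job; this is only a sketch, and myproject is a placeholder for the name of an actual storage project below $LSDFPROJECTS:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# myproject is a placeholder for your storage project&lt;br /&gt;
cp -r results/ $LSDFPROJECTS/myproject/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;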
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to a global filesystem within the job, e.g., to $HOME or any workspace.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
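A minimal staging pattern might look like the following sketch; /path/to/beeond is a placeholder for the actual mount point of the on-demand file system, which is described on the page linked below:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
# Stage input data into the on-demand file system (placeholder path)&lt;br /&gt;
cp -r $(ws_find myws)/input /path/to/beeond/&lt;br /&gt;
# ... run the application on the data below /path/to/beeond ...&lt;br /&gt;
# Stage results out before the job ends, since all data is purged afterwards&lt;br /&gt;
rsync -av /path/to/beeond/results $(ws_find myws)/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;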
&lt;br /&gt;
For detailed usage see [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]].&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories; however, ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from the backup.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11903</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11903"/>
		<updated>2023-03-29T07:58:26Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMP */ correct typos&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition the file system&lt;br /&gt;
Lustre, that is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, such as SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Two nodes are dedicated to this service but they are all accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes deliver additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 100 + 60&lt;br /&gt;
| 360&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 4&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMP&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
you can usually restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMP. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMP and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired, we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the  [[workspace]] page.&lt;br /&gt;
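&lt;br /&gt;
For example, to create a workspace with a lifetime of 60 days and later extend it by another 60 days, you can run (a sketch; for the exact commands see the [[workspace]] page):&lt;br /&gt;
 $ ws_allocate myws 60&lt;br /&gt;
 $ ws_extend myws 60&lt;br /&gt;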
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMP.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even if they have very huge files or want to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should note down for which directories you made changes&lt;br /&gt;
to the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
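&lt;br /&gt;
The copy-and-rename procedure for re-striping existing files could look as follows (a sketch with hypothetical names; it assumes file names without spaces):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Let new files in this directory use all storage subsystems&lt;br /&gt;
lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
# Rewrite each existing file so that it picks up the new striping&lt;br /&gt;
for f in $HOME/my_output_dir/*; do&lt;br /&gt;
  cp $f $f.restriped   # the copy inherits the striping of the directory&lt;br /&gt;
  rm $f&lt;br /&gt;
  mv $f.restriped $f&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;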
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMP,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; to all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace only has to be managed on one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMP ==&lt;br /&gt;
&lt;br /&gt;
While all tasks of a parallel application access the same $HOME and workspace directory, the &lt;br /&gt;
$TMP directory is local to each node on bwUniCluster 2.0. All nodes have fast SSD &lt;br /&gt;
local storage devices which are used to store data below $TMP. The capacity of $TMP for each node type &lt;br /&gt;
can be checked above. Different tasks of a parallel application use different $TMP directories when &lt;br /&gt;
they do not run on the same node. Although $TMP points to the same path name for different nodes of a job, &lt;br /&gt;
the physical location on these nodes is different.&lt;br /&gt;
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on each node and assigned to the job. &lt;br /&gt;
$TMP is set anew and the name of the subdirectory contains the job ID so that the&lt;br /&gt;
subdirectory name is unique for each job. This unique name is then assigned to the&lt;br /&gt;
environment variable $TMP within the job. At the end of the job the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files being accessed by single tasks. It should also be used &lt;br /&gt;
if you read the same data many times from a single node, e.g. if you are doing AI training. In this case &lt;br /&gt;
you should copy the data at the beginning of your batch job to $TMP and read the data from there.&lt;br /&gt;
In addition, this directory should be used for the installation of software packages. This means that &lt;br /&gt;
the software package to be installed should be unpacked, compiled and linked in a subdirectory of $TMP. &lt;br /&gt;
The actual installation of the package (e.g. make install) can then be done into the $HOME filesystem.&lt;br /&gt;
&lt;br /&gt;
=== Usage example of $TMP ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example for using $TMP and describe efficient data transfer to and from $TMP. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted to $TMP inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job extract the archive to $TMP, read input data from $TMP, store results on $TMP &lt;br /&gt;
and before the job ends save the results on a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMP&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset on local SSD&lt;br /&gt;
tar -C $TMP/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMP and writes results to $TMP&lt;br /&gt;
myapp -input $TMP/dataset/myinput.csv -outputdir $TMP/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMP/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during job runtime with the constraint flag &amp;quot;LSDF&amp;quot; (see [[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
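&lt;br /&gt;
For example, inside a batch job which requested the LSDF constraint, results can be staged to a storage project as follows (a sketch; the project directory name &#039;&#039;myproject&#039;&#039; is a placeholder):&lt;br /&gt;
 $ rsync -av $TMP/results/ $LSDFPROJECTS/myproject/results/&lt;br /&gt;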
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
&lt;br /&gt;
For detailed usage see [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]].&lt;br /&gt;
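&lt;br /&gt;
A minimal job sketch follows. It assumes that BeeOND is requested via a Slurm constraint and mounted under a job-specific path; the constraint name BEEOND and the mount point /mnt/odfs/$SLURM_JOB_ID are assumptions, please check the linked page for the exact values.&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH -t 12:00:00&lt;br /&gt;
#SBATCH --constraint=BEEOND   # assumed constraint name&lt;br /&gt;
&lt;br /&gt;
# Assumed job-private mount point of the BeeOND file system&lt;br /&gt;
BEEOND=/mnt/odfs/$SLURM_JOB_ID&lt;br /&gt;
&lt;br /&gt;
# Stage input data from a workspace into BeeOND&lt;br /&gt;
cp -r $(ws_find data-ssd)/dataset $BEEOND/&lt;br /&gt;
&lt;br /&gt;
# All nodes of the job can read from and write to the shared BeeOND file system&lt;br /&gt;
myapp -inputdir $BEEOND/dataset -outputdir $BEEOND/results&lt;br /&gt;
&lt;br /&gt;
# IMPORTANT: save results to a workspace before the job ends, BeeOND is purged afterwards&lt;br /&gt;
rsync -av $BEEOND/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;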
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data of the home directories, but ACLs and extended attributes are&lt;br /&gt;
not included in the backup. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11902</id>
		<title>BwUniCluster2.0/Hardware and Architecture</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Hardware_and_Architecture&amp;diff=11902"/>
		<updated>2023-03-28T16:39:45Z</updated>

		<summary type="html">&lt;p&gt;R Laifer: /* $TMP */ Added usage example&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Architecture of bwUniCluster 2.0 =&lt;br /&gt;
&lt;br /&gt;
The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition the file system&lt;br /&gt;
Lustre, that is connected by coupling the InfiniBand of the file server with the InfiniBand&lt;br /&gt;
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable&lt;br /&gt;
parallel file system.&lt;br /&gt;
&lt;br /&gt;
The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, e.g. SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly&lt;br /&gt;
discussed in this document. Others which are of greater importance to system&lt;br /&gt;
administrators will not be covered by this document.&lt;br /&gt;
&lt;br /&gt;
The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user&#039;s point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Login Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The login nodes are the only nodes that are directly accessible by end users. These nodes&lt;br /&gt;
are used for interactive login, file management, program development and interactive pre-&lt;br /&gt;
and postprocessing. Two nodes are dedicated to this service, but both are accessible via&lt;br /&gt;
one address and a DNS round-robin alias distributes the login sessions to the&lt;br /&gt;
different login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The majority of nodes are compute nodes which are managed by a batch system. Users &lt;br /&gt;
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;File Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The hardware of the parallel file system Lustre incorporates some file server nodes; the file&lt;br /&gt;
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter &amp;quot;File Systems&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Administrative Server Nodes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Some other nodes are delivering additional services like resource management, external&lt;br /&gt;
network connection, administration etc. These nodes can be accessed directly by system administrators only.&lt;br /&gt;
&lt;br /&gt;
= Components of bwUniCluster =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|- &lt;br /&gt;
! style=&amp;quot;width:9%&amp;quot;|&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Thin&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;HPC&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;IceLake&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Compute nodes &amp;quot;Fat&amp;quot;&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| GPU x8&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| IceLake + GPU x4&lt;br /&gt;
! style=&amp;quot;width:13%&amp;quot;| Login&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of nodes&lt;br /&gt;
| 100 + 60&lt;br /&gt;
| 360&lt;br /&gt;
| 272&lt;br /&gt;
| 6&lt;br /&gt;
| 14&lt;br /&gt;
| 10&lt;br /&gt;
| 15&lt;br /&gt;
| 4&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processors&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6230&lt;br /&gt;
| Intel Xeon Gold 6248&lt;br /&gt;
| Intel Xeon Platinum 8358&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Number of sockets&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 4&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
| 2&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Processor frequency (GHz)&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.1 GHz&lt;br /&gt;
| 2.6 GHz&lt;br /&gt;
| 2.5 GHz&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Total number of cores&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 80&lt;br /&gt;
| 40&lt;br /&gt;
| 40&lt;br /&gt;
| 64&lt;br /&gt;
| 40&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Main memory&lt;br /&gt;
| 96 GB / 192 GB&lt;br /&gt;
| 96 GB&lt;br /&gt;
| 256 GB&lt;br /&gt;
| 3 TB&lt;br /&gt;
| 384 GB&lt;br /&gt;
| 768 GB&lt;br /&gt;
| 512 GB&lt;br /&gt;
| 384 GB&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Local SSD&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 960 GB SATA&lt;br /&gt;
| 1.8 TB NVMe&lt;br /&gt;
| 4.8 TB NVMe&lt;br /&gt;
| 3.2 TB NVMe&lt;br /&gt;
| 15 TB NVMe&lt;br /&gt;
| 6.4 TB NVMe&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerators&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 4x NVIDIA Tesla V100&lt;br /&gt;
| 8x NVIDIA Tesla V100&lt;br /&gt;
| 4x NVIDIA A100 / 4x NVIDIA H100&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Accelerator memory&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| -&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 32 GB&lt;br /&gt;
| 80 GB / 94 GB&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
!scope=&amp;quot;column&amp;quot;| Interconnect&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
| IB HDR100&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR&lt;br /&gt;
| IB HDR200&lt;br /&gt;
| IB HDR100 (blocking)&lt;br /&gt;
|}&lt;br /&gt;
Table 1: Properties of the nodes&lt;br /&gt;
&lt;br /&gt;
= File Systems =&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage available for special requirements.&lt;br /&gt;
&lt;br /&gt;
Within a batch job further file systems are available: &lt;br /&gt;
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.&lt;br /&gt;
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.&lt;br /&gt;
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system IBM Spectrum Scale. &lt;br /&gt;
&lt;br /&gt;
Some of the characteristics of the file systems are shown in Table 2.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:20%;height=20px; text-align:left;padding:3px&amp;quot;&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Property&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $TMP&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| BeeOND&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| $HOME&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace&lt;br /&gt;
! style=&amp;quot;background-color:#AAA;padding:3px&amp;quot;| Workspace &amp;lt;br&amp;gt; on flash&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Visibility &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| local node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| nodes of batch job&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| global&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Lifetime &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| batch job runtime&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| permanent&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| max. 240 days&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Disk space &lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 960 GB - 6.4 TB &amp;lt;br&amp;gt; details see table 1&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*750 GB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1.2 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 4.1 PiB&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 236 TiB&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Capacity Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user &amp;lt;br&amp;gt; also per organization&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 40 TiB per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 1 TiB per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Inode Quotas&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 10 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 30 million per user&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes &amp;lt;br&amp;gt; 5 million per user&lt;br /&gt;
|- &lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Backup&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| yes&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| no&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Read perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 6 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 520 MB/s @ single / multiple &amp;lt;br&amp;gt; 800 MB/s @ multiple_e &amp;lt;br&amp;gt; 6600 MB/s @ fat &amp;lt;br&amp;gt; 6500 MB/s @ gpu_4 &amp;lt;br&amp;gt; 6500 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 400 MB/s - 500 MB/s&amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 500 MB/s @ multiple &amp;lt;br&amp;gt; 400 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Write perf./node&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 500 MB/s - 4 GB/s &amp;lt;br&amp;gt; depends on type of local SSD / job queue: &amp;lt;br&amp;gt; 500 MB/s @ single / multiple &amp;lt;br&amp;gt; 650 MB/s @ multiple_e &amp;lt;br&amp;gt; 2900 MB/s @ fat &amp;lt;br&amp;gt; 2090 MB/s @ gpu_4 &amp;lt;br&amp;gt; 4060 MB/s @ gpu_8&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 250 MB/s - 350 MB/s &amp;lt;br&amp;gt; depends on type of local SSDs / job queue: &amp;lt;br&amp;gt; 350 MB/s @ multiple &amp;lt;br&amp;gt; 250 MB/s @ multiple_e&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 1 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total read perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-6000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*400-500 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 45 GB/s&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| style=&amp;quot;background-color:#d3ddd8;height=20px; text-align:left;padding:3px&amp;quot;| Total write perf.&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*500-4000 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| n*250-350 MB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 18 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 54 GB/s&lt;br /&gt;
| style=&amp;quot;height=20px; text-align:left;padding:3px&amp;quot;| 38 GB/s&lt;br /&gt;
|}&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
  global: all nodes of UniCluster access the same file system;&lt;br /&gt;
  local: each node has its own file system;&lt;br /&gt;
  permanent: files are stored permanently;&lt;br /&gt;
  batch job: files are removed at end of the batch job.&lt;br /&gt;
---------------------------------------------------------------------------------------------------------&lt;br /&gt;
Table 2: Properties of the file systems&lt;br /&gt;
&lt;br /&gt;
== Selecting the appropriate file system ==&lt;br /&gt;
&lt;br /&gt;
In general, you should separate your data and store it on the appropriate file system.&lt;br /&gt;
Permanently needed data like software or important results should be stored below $HOME&lt;br /&gt;
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME&lt;br /&gt;
you can usually restore it from backup. Permanent data which is not needed for months&lt;br /&gt;
or exceeds the capacity restrictions should be sent to the LSDF Online Storage &lt;br /&gt;
or to the archive and deleted from the file systems. Temporary data which is only needed on a single&lt;br /&gt;
node and which does not exceed the disk space shown in the table above should be stored&lt;br /&gt;
below $TMP. Data which is read many times on a single node, e.g. if you are doing AI training, &lt;br /&gt;
should be copied to $TMP and read from there. Temporary data which is used from many nodes &lt;br /&gt;
of your batch job and which is only needed during job runtime should be stored on a &lt;br /&gt;
parallel on-demand file system. Temporary data which can be recomputed or which is the &lt;br /&gt;
result of one job and input for another job should be stored in workspaces. The lifetime &lt;br /&gt;
of data in workspaces is limited and depends on the lifetime of the workspace which can be &lt;br /&gt;
several months.&lt;br /&gt;
&lt;br /&gt;
For further details please check the chapters below.&lt;br /&gt;
&lt;br /&gt;
== $HOME ==&lt;br /&gt;
&lt;br /&gt;
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.&lt;br /&gt;
You have access to your home directory from all nodes of uc2. A regular backup of these directories &lt;br /&gt;
to tape archive is done automatically. The directory $HOME is used to hold those files that are&lt;br /&gt;
permanently used like source codes, configuration files, executable programs etc. &lt;br /&gt;
&lt;br /&gt;
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lfs quota -ph $(grep $(echo $HOME | sed -e &amp;quot;s|/[^/]*/[^/]*$||&amp;quot;) /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces ==&lt;br /&gt;
&lt;br /&gt;
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output&lt;br /&gt;
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large&lt;br /&gt;
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel. &lt;br /&gt;
&lt;br /&gt;
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired, we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.&lt;br /&gt;
&lt;br /&gt;
Creating, deleting, finding and extending workspaces is explained on the  [[workspace]] page.&lt;br /&gt;
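&lt;br /&gt;
For example, to create a workspace with a lifetime of 60 days and later extend it by another 60 days, you can run (a sketch; for the exact commands see the [[workspace]] page):&lt;br /&gt;
 $ ws_allocate myws 60&lt;br /&gt;
 $ ws_extend myws 60&lt;br /&gt;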
&lt;br /&gt;
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. &lt;br /&gt;
You can check your current usage and limits with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs quota -uh $(whoami) /pfs/work7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).&lt;br /&gt;
&lt;br /&gt;
=== Reminder for workspace deletion ===&lt;br /&gt;
&lt;br /&gt;
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:&lt;br /&gt;
&lt;br /&gt;
 $ ws_send_ical.sh &amp;lt;workspace&amp;gt; &amp;lt;email&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Improving Performance on $HOME and workspaces ==&lt;br /&gt;
&lt;br /&gt;
The following recommendations might help to improve throughput and metadata&lt;br /&gt;
performance on Lustre filesystems.&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Throughput Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Depending on your application some adaptations might be necessary if you want to reach&lt;br /&gt;
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.&lt;br /&gt;
&lt;br /&gt;
When you are designing your application you should consider that the performance of&lt;br /&gt;
parallel filesystems is generally better if data is transferred in large blocks and stored in&lt;br /&gt;
few large files. In more detail, to increase throughput performance of a parallel application&lt;br /&gt;
the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* collect large chunks of data and write them sequentially at once,&lt;br /&gt;
&lt;br /&gt;
* to exploit complete filesystem bandwidth use several clients,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),&lt;br /&gt;
&lt;br /&gt;
* if files are small enough for the SSDs and are only used by one process store them on $TMP.&lt;br /&gt;
&lt;br /&gt;
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even if they have very huge files or want to reach better performance. &lt;br /&gt;
&lt;br /&gt;
If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this&lt;br /&gt;
directory is not changed. If you want to change the stripe count of existing files, change&lt;br /&gt;
the stripe count of the parent directory, copy the files to new files, remove the old files&lt;br /&gt;
and move the new files back to the old name. In order to check the stripe setting of&lt;br /&gt;
the file my_file use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ lfs getstripe my_file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the&lt;br /&gt;
backup, i.e. if directories have to be recreated this information is lost and the default stripe&lt;br /&gt;
count will be used. Therefore, you should note down for which directories you made changes&lt;br /&gt;
to the striping parameters so that you can repeat these changes if required.&lt;br /&gt;
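&lt;br /&gt;
The copy-and-rename procedure for re-striping existing files could look as follows (a sketch with hypothetical names; it assumes file names without spaces):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Let new files in this directory use all storage subsystems&lt;br /&gt;
lfs setstripe -c-1 $HOME/my_output_dir&lt;br /&gt;
# Rewrite each existing file so that it picks up the new striping&lt;br /&gt;
for f in $HOME/my_output_dir/*; do&lt;br /&gt;
  cp $f $f.restriped   # the copy inherits the striping of the directory&lt;br /&gt;
  rm $f&lt;br /&gt;
  mv $f.restriped $f&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;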
&lt;br /&gt;
=== &#039;&#039;&#039;Improving Metadata Performance&#039;&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
Metadata performance on parallel file systems is usually not as good as with local&lt;br /&gt;
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,&lt;br /&gt;
you should avoid metadata operations whenever possible. For example, it is much better&lt;br /&gt;
to have few large files than lots of small files. In more detail, to increase metadata&lt;br /&gt;
performance of a parallel application the following aspects should be considered:&lt;br /&gt;
&lt;br /&gt;
* avoid creating many small files,&lt;br /&gt;
&lt;br /&gt;
* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,&lt;br /&gt;
&lt;br /&gt;
* if many small files are only used within a batch job and accessed by one process store them on $TMP,&lt;br /&gt;
&lt;br /&gt;
* change the default colorization setting of the command ls (see below).&lt;br /&gt;
&lt;br /&gt;
On modern Linux systems, the GNU ls command often uses colorization by default to&lt;br /&gt;
visually highlight the file type; this is especially true if the command is run within a terminal&lt;br /&gt;
session. This is because the default shell profile initializations usually contain an alias&lt;br /&gt;
directive similar to the following for the ls command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=tty&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
However, running the ls command in this way for files on a Lustre file system requires&lt;br /&gt;
a stat() call to be used to determine the file type. This can result in a performance&lt;br /&gt;
overhead, because the stat() call always needs to determine the size of a file, and that&lt;br /&gt;
in turn means that the client node must query the object size of all the backing objects&lt;br /&gt;
that make up a file. As a result of the default colorization setting, running a simple&lt;br /&gt;
ls command on a Lustre file system often takes as much time as running the ls command&lt;br /&gt;
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option &lt;br /&gt;
that requires information from a stat() call, is used). To avoid this performance overhead&lt;br /&gt;
when using ls commands, add an alias directive similar to the following&lt;br /&gt;
to your shell startup script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ alias ls=&amp;quot;ls --color=never&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Workspaces on flash storage ==&lt;br /&gt;
&lt;br /&gt;
There is another workspace file system available for special requirements. The file system is called &#039;&#039;full flash pfs&#039;&#039; and is based on the parallel file system Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Advantages of this file system ===&lt;br /&gt;
&lt;br /&gt;
# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.&lt;br /&gt;
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.&lt;br /&gt;
&lt;br /&gt;
=== Access restrictions ===&lt;br /&gt;
&lt;br /&gt;
Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.&lt;br /&gt;
&lt;br /&gt;
=== Using the file system ===&lt;br /&gt;
&lt;br /&gt;
As KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option &#039;&#039;-F&#039;&#039; to all the commands that manage workspaces. On bwUniCluster 2.0 it is called &#039;&#039;ffuc&#039;&#039;, on HoreKa it is &#039;&#039;ffhk&#039;&#039;. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:&lt;br /&gt;
 ws_allocate -F ffuc myws 60&lt;br /&gt;
&lt;br /&gt;
If you want to use the full flash pfs on bwUniCluster 2.0 &#039;&#039;&#039;and&#039;&#039;&#039; HoreKa at the same time, please note that a particular workspace only has to be managed on one of the clusters, since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.&lt;br /&gt;
&lt;br /&gt;
Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with&lt;br /&gt;
 lfs quota -uh $(whoami) /pfs/work8&lt;br /&gt;
&lt;br /&gt;
== $TMP ==&lt;br /&gt;
&lt;br /&gt;
While all tasks of a parallel application access the same $HOME and workspace directory, the &lt;br /&gt;
$TMP directory is local to each node on bwUniCluster 2.0. All nodes have fast local SSD &lt;br /&gt;
storage devices which are used to store data below $TMP. The capacity of $TMP for each node type &lt;br /&gt;
can be checked above. Different tasks of a parallel application use different $TMP directories when &lt;br /&gt;
they do not run on the same node. Although $TMP points to the same path name on the different nodes of a job, &lt;br /&gt;
the physical location on each node is different.&lt;br /&gt;
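&lt;br /&gt;
A quick way to observe this is to print the host name and $TMP on each node of a job. A minimal sketch, run from within a job that was allocated two nodes:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# One task per node: the printed path is identical on both nodes,&lt;br /&gt;
# but it refers to a different local SSD on each node.&lt;br /&gt;
srun -N 2 --ntasks-per-node=1 bash -c &#039;echo &amp;quot;$(hostname): $TMP&amp;quot;; df -h &amp;quot;$TMP&amp;quot;&#039;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;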
&lt;br /&gt;
Each time a batch job is started, a subdirectory is created on each node and assigned to the job. &lt;br /&gt;
The name of this subdirectory contains the job ID, so it is unique for each job. This unique&lt;br /&gt;
name is then assigned to the environment variable $TMP within the job. At the end of the job&lt;br /&gt;
the subdirectory is removed.&lt;br /&gt;
&lt;br /&gt;
This directory should be used for temporary files which are accessed by single tasks. It should also be used &lt;br /&gt;
if you read the same data many times on a single node, e.g. if you are doing AI training. In this case &lt;br /&gt;
you should copy the data to $TMP at the beginning of your batch job and read the data from there.&lt;br /&gt;
In addition, this directory should be used for building software packages. This means that &lt;br /&gt;
the software package to be installed should be unpacked, compiled and linked in a subdirectory of $TMP. &lt;br /&gt;
The final installation of the package (e.g. make install) can then go to the $HOME file system; a sketch follows below.&lt;br /&gt;
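&lt;br /&gt;
The following sketch outlines such a build; the package name mytool-1.0 and the paths are hypothetical:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Unpack, configure and compile on the fast node-local $TMP&lt;br /&gt;
cd $TMP&lt;br /&gt;
tar -xzf $HOME/src/mytool-1.0.tar.gz&lt;br /&gt;
cd mytool-1.0&lt;br /&gt;
./configure --prefix=$HOME/sw/mytool-1.0&lt;br /&gt;
make -j &amp;quot;$(nproc)&amp;quot;&lt;br /&gt;
# Only the final installation step writes to the $HOME file system&lt;br /&gt;
make install&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;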
&lt;br /&gt;
=== Usage example of $TMP ===&lt;br /&gt;
&lt;br /&gt;
We will provide an example of using $TMP and describe efficient data transfer to and from $TMP. &lt;br /&gt;
&lt;br /&gt;
If you have a data set with many files which is frequently used by batch jobs, you should create &lt;br /&gt;
a compressed archive on a workspace. This archive can be extracted to $TMP inside your batch jobs. &lt;br /&gt;
Such an archive can be read efficiently from a parallel file system since it is a single huge file. &lt;br /&gt;
On a login node you can create such an archive with the following steps: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Create a workspace to store the archive&lt;br /&gt;
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60&lt;br /&gt;
# Create the archive from a local dataset folder (example)&lt;br /&gt;
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inside a batch job, extract the archive to $TMP, read input data from $TMP, store results on $TMP, &lt;br /&gt;
and before the job ends, save the results to a workspace: &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# very simple example on how to use local $TMP&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 24:00:00&lt;br /&gt;
&lt;br /&gt;
# Extract compressed input dataset to local SSD&lt;br /&gt;
tar -C $TMP/ -xvzf $(ws_find data-ssd)/dataset.tgz&lt;br /&gt;
&lt;br /&gt;
# The application reads data from dataset on $TMP and writes results to $TMP&lt;br /&gt;
myapp -input $TMP/dataset/myinput.csv -outputdir $TMP/results&lt;br /&gt;
&lt;br /&gt;
# Before job completes save results on a workspace&lt;br /&gt;
rsync -av $TMP/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== LSDF Online Storage==&lt;br /&gt;
&lt;br /&gt;
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore, the LSDF Online Storage is mounted on the login and data mover nodes.&lt;br /&gt;
Furthermore, it can be used on the compute nodes during the job runtime with the constraint flag &amp;quot;LSDF&amp;quot; (see [[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).&lt;br /&gt;
There is also an example of the LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS and $LSDFHOME.&lt;br /&gt;
Please request storage projects in the LSDF Online Storage separately:&lt;br /&gt;
[https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].&lt;br /&gt;
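&lt;br /&gt;
A minimal batch job sketch using these variables; the project name myproject, the application myapp and the file names are hypothetical:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 1&lt;br /&gt;
#SBATCH -t 02:00:00&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&lt;br /&gt;
# Copy input from the LSDF project space to the fast node-local $TMP&lt;br /&gt;
cp $LSDFPROJECTS/myproject/input.dat $TMP/&lt;br /&gt;
&lt;br /&gt;
myapp -input $TMP/input.dat -outputdir $TMP/results&lt;br /&gt;
&lt;br /&gt;
# Save the results back to the LSDF project space before the job ends&lt;br /&gt;
rsync -av $TMP/results $LSDFPROJECTS/myproject/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;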
&lt;br /&gt;
==BeeOND (BeeGFS On-Demand)==&lt;br /&gt;
&lt;br /&gt;
Users of the UniCluster can request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after your job.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT: All data on the private file system will be deleted after your job. Make sure you copy your data back to a global file system within the job, e.g., to $HOME or any workspace.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. &lt;br /&gt;
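&lt;br /&gt;
As an illustration, a job could stage data in and out as in the following sketch. The mount point (assumed here to be /mnt/odfs/$SLURM_JOB_ID) and the way to request BeeOND are described on the page linked below; verify them there before use:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH -N 4&lt;br /&gt;
#SBATCH -t 08:00:00&lt;br /&gt;
#SBATCH --constraint=BEEOND    # assumed constraint, see the link below&lt;br /&gt;
&lt;br /&gt;
# Assumed BeeOND mount point, shared by all nodes of the job&lt;br /&gt;
BEEOND_DIR=/mnt/odfs/$SLURM_JOB_ID&lt;br /&gt;
&lt;br /&gt;
# Stage input data in from a workspace&lt;br /&gt;
rsync -av $(ws_find data-ssd)/dataset/ $BEEOND_DIR/dataset/&lt;br /&gt;
&lt;br /&gt;
myapp -input $BEEOND_DIR/dataset -outputdir $BEEOND_DIR/results&lt;br /&gt;
&lt;br /&gt;
# Stage results out before the job ends; BeeOND is purged afterwards&lt;br /&gt;
rsync -av $BEEOND_DIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;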
&lt;br /&gt;
For detailed usage see [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]].&lt;br /&gt;
&lt;br /&gt;
==Backup and Archiving==&lt;br /&gt;
&lt;br /&gt;
There are regular backups of all data in the home directories, whereas ACLs and extended attributes are&lt;br /&gt;
not backed up. &lt;br /&gt;
&lt;br /&gt;
Please open a ticket if you need data restored from backup.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]&lt;/div&gt;</summary>
		<author><name>R Laifer</name></author>
	</entry>
</feed>