bwHPC Wiki - User contributions [en]

BwUniCluster2.0/Hardware and Architecture

2023-06-05T09:12:57Z

S Raffeiner:

= Architecture of bwUniCluster 2.0 =

The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system Lustre is attached to bwUniCluster 2.0 by coupling the InfiniBand of the file server with the InfiniBand switch of the compute cluster; it provides a fast and scalable parallel file system.

The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, e.g. SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly discussed in this document. Others, which are of greater importance to system administrators, are not covered here.

The individual nodes of the system may act in different roles. According to the services they supply, the nodes are separated into disjoint groups. From an end user's point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.

'''Login Nodes'''

The login nodes are the only nodes that are directly accessible by end users. These nodes
are used for interactive login, file management, program development and interactive pre-
and postprocessing. Two nodes are dedicated to this service but they are all accessible via
one address and a DNS round-robin alias distributes the login sessions to the
different login nodes.

'''Compute Nodes'''

The majority of nodes are compute nodes which are managed by a batch system. Users
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).

'''File Server Nodes'''

The hardware of the parallel file system Lustre incorporates some file server nodes; the file
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter "File Systems").

'''Administrative Server Nodes'''

Some other nodes deliver additional services like resource management, external
network connection, administration etc. These nodes can be accessed directly by system administrators only.

= Components of bwUniCluster =

{| class="wikitable"
|-
! style="width:9%"|
! style="width:13%"| Compute nodes "Thin"
! style="width:13%"| Compute nodes "HPC"
! style="width:13%"| Compute nodes "IceLake"
! style="width:13%"| Compute nodes "Fat"
! style="width:13%"| GPU x4
! style="width:13%"| GPU x8
! style="width:13%"| IceLake + GPU x4
! style="width:13%"| Login
|-
!scope="column"| Number of nodes
| 200 + 60
| 260
| 272
| 6
| 14
| 10
| 15
| 4
|-
!scope="column"| Processors
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Platinum 8358
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Gold 6248
| Intel Xeon Platinum 8358
|
|-
!scope="column"| Number of sockets
| 2
| 2
| 2
| 4
| 2
| 2
| 2
| 2
|-
!scope="column"| Processor frequency (GHz)
| 2.1
| 2.1
| 2.6
| 2.1
| 2.1
| 2.6
| 2.5
|
|-
!scope="column"| Total number of cores
| 40
| 40
| 64
| 80
| 40
| 40
| 64
| 40
|-
!scope="column"| Main memory
| 96 GB / 192 GB
| 96 GB
| 256 GB
| 3 TB
| 384 GB
| 768 GB
| 512 GB
| 384 GB
|-
!scope="column"| Local SSD
| 960 GB SATA
| 960 GB SATA
| 1.8 TB NVMe
| 4.8 TB NVMe
| 3.2 TB NVMe
| 15 TB NVMe
| 6.4 TB NVMe
|
|-
!scope="column"| Accelerators
| -
| -
| -
| -
| 4x NVIDIA Tesla V100
| 8x NVIDIA Tesla V100
| 4x NVIDIA A100 / 4x NVIDIA H100
| -
|-
!scope="column"| Accelerator memory
| -
| -
| -
| -
| 32 GB
| 32 GB
| 80 GB / 94 GB
| -
|-
!scope="column"| Interconnect
| IB HDR100 (blocking)
| IB HDR100
| IB HDR200
| IB HDR
| IB HDR
| IB HDR
| IB HDR200
| IB HDR100 (blocking)
|}
Table 1: Properties of the nodes

= File Systems =

On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. A further workspace file system based on flash storage is available for special requirements.

Within a batch job further file systems are available:
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.

Some of the characteristics of the file systems are shown in Table 2.

{| style="width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px"
|- style="width:20%;height=20px; text-align:left;padding:3px"
! style="background-color:#AAA;padding:3px"| Property
! style="background-color:#AAA;padding:3px"| $TMP
! style="background-color:#AAA;padding:3px"| BeeOND
! style="background-color:#AAA;padding:3px"| $HOME
! style="background-color:#AAA;padding:3px"| Workspace
! style="background-color:#AAA;padding:3px"| Workspace on flash
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Visibility
| style="height=20px; text-align:left;padding:3px"| local node
| style="height=20px; text-align:left;padding:3px"| nodes of batch job
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Lifetime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| permanent
| style="height=20px; text-align:left;padding:3px"| max. 240 days
| style="height=20px; text-align:left;padding:3px"| max. 240 days
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Disk space
| style="height=20px; text-align:left;padding:3px"| 960 GB - 6.4 TB, for details see Table 1
| style="height=20px; text-align:left;padding:3px"| n*750 GB
| style="height=20px; text-align:left;padding:3px"| 1.2 PiB
| style="height=20px; text-align:left;padding:3px"| 4.1 PiB
| style="height=20px; text-align:left;padding:3px"| 236 TiB
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Capacity Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes 1 TiB per user also per organization
| style="height=20px; text-align:left;padding:3px"| yes 40 TiB per user
| style="height=20px; text-align:left;padding:3px"| yes 1 TiB per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Inode Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes 10 million per user
| style="height=20px; text-align:left;padding:3px"| yes 30 million per user
| style="height=20px; text-align:left;padding:3px"| yes 5 million per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Backup
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Read perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 6 GB/s, depends on type of local SSD / job queue:<br />520 MB/s @ single / multiple<br />800 MB/s @ multiple_e<br />6600 MB/s @ fat<br />6500 MB/s @ gpu_4<br />6500 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 400 MB/s - 500 MB/s, depends on type of local SSDs / job queue:<br />500 MB/s @ multiple<br />400 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Write perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 4 GB/s, depends on type of local SSD / job queue:<br />500 MB/s @ single / multiple<br />650 MB/s @ multiple_e<br />2900 MB/s @ fat<br />2090 MB/s @ gpu_4<br />4060 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 250 MB/s - 350 MB/s, depends on type of local SSDs / job queue:<br />350 MB/s @ multiple<br />250 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total read perf.
| style="height=20px; text-align:left;padding:3px"| n*500-6000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*400-500 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
| style="height=20px; text-align:left;padding:3px"| 45 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total write perf.
| style="height=20px; text-align:left;padding:3px"| n*500-4000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*250-350 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
| style="height=20px; text-align:left;padding:3px"| 38 GB/s
|}
---------------------------------------------------------------------------------------------------------
global: all nodes of bwUniCluster 2.0 access the same file system;
local: each node has its own file system;
permanent: files are stored permanently;
batch job: files are removed at end of the batch job.
---------------------------------------------------------------------------------------------------------
Table 2: Properties of the file systems

== Selecting the appropriate file system ==

In general, you should separate your data and store it on the appropriate file system.
Permanently needed data like software or important results should be stored below $HOME
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME
you can usually restore it from backup. Permanent data which is not needed for months
or exceeds the capacity restrictions should be sent to the LSDF Online Storage
or to the archive and deleted from the file systems. Temporary data which is only needed on a single
node and which does not exceed the disk space shown in the table above should be stored
below $TMP. Data which is read many times on a single node, e.g. if you are doing AI training,
should be copied to $TMP and read from there. Temporary data which is used from many nodes
of your batch job and which is only needed during job runtime should be stored on a
parallel on-demand file system. Temporary data which can be recomputed or which is the
result of one job and input for another job should be stored in workspaces. The lifetime
of data in workspaces is limited and depends on the lifetime of the workspace which can be
several months.

For further details please check the chapters below.

== $HOME ==

The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.
You have access to your home directory from all nodes of uc2. A regular backup of these directories
to tape archive is done automatically. The directory $HOME is used to hold those files that are
permanently used like source codes, configuration files, executable programs etc.

On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) $HOME
</pre>
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:
<pre>
lfs quota -ph $(grep $(echo $HOME | sed -e "s|/[^/]*/[^/]*$||") /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME
</pre>
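The organization lookup above works in three steps: it strips the last two path components from $HOME, searches the remaining directory in /pfs/data5/project_ids.txt, and takes the first space-separated field of the matching line as the project ID. The following sketch replays these steps offline. The home path and the line from the ID file are invented examples (assumptions); the real directory layout and IDs on uc2 may differ.

```shell
#!/bin/bash
# Hypothetical values for illustration only; real paths and IDs on uc2 will differ.
example_home="/pfs/data5/home/xx/xx_yy/ab1234"
example_ids_line="1234 /pfs/data5/home/xx"

# Step 1: remove the last two path components (same sed expression as in the command above).
org_dir=$(echo "$example_home" | sed -e "s|/[^/]*/[^/]*$||")

# Steps 2 and 3: select the line that contains the directory and
# keep its first space-separated field, which is the project ID.
project_id=$(echo "$example_ids_line" | grep "$org_dir" | cut -f 1 -d ' ')

echo "organization directory: $org_dir"
echo "project ID: $project_id"
```

The resulting ID is what the original command passes to the option -p of lfs quota.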

== Workspaces ==

On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel.

Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed up to 3 times at the end of that period, to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired we can restore the data during a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.
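The maximum total lifetime follows directly from the period length and the number of renewals. A minimal sketch of this arithmetic, with a helper name that is only illustrative:

```shell
#!/bin/bash
# Total lifetime in days: the initial period plus one further period per renewal.
workspace_max_days() {
    local period_days=$1 renewals=$2
    echo $(( period_days * (1 + renewals) ))
}

# Values from the text above: 60 days per period, 3 renewals.
workspace_max_days 60 3   # prints 240
```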

Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.

On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) /pfs/work7
</pre>
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).

=== Reminder for workspace deletion ===

Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:
<pre>
$ ws_send_ical.sh <workspace> <email>
</pre>

== Improving Performance on $HOME and workspaces ==

The following recommendations might help to improve throughput and metadata
performance on Lustre filesystems.

=== '''Improving Throughput Performance''' ===

Depending on your application some adaptations might be necessary if you want to reach
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also called chunks) is called stripe size and the number of used storage subsystems is called stripe count.

When you are designing your application you should consider that the performance of
parallel filesystems is generally better if data is transferred in large blocks and stored in
few large files. In more detail, to increase throughput performance of a parallel application
the following aspects should be considered:

* collect large chunks of data and write them sequentially at once,

* to exploit complete filesystem bandwidth use several clients,

* avoid competing file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),

* if files are small enough for the SSDs and are only used by one process store them on $TMP.
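
One way to keep block boundaries at the stripe size is to round the block size of each task up to a multiple of the stripe size, so that no two tasks write into the same stripe of a shared file. The following sketch shows only the arithmetic. It takes the default stripe size as 1 MiB (an assumption about how "1 MB" is meant), and the function names are illustrative.

```shell
#!/bin/bash
# Default stripe size named in the list above, taken here as 1 MiB (assumption).
STRIPE_BYTES=$(( 1024 * 1024 ))

# Round a per-task block size up to the next multiple of the stripe size.
round_up_to_stripe() {
    local bytes=$1
    echo $(( (bytes + STRIPE_BYTES - 1) / STRIPE_BYTES * STRIPE_BYTES ))
}

# Byte offset at which the task with the given rank starts writing in a shared file.
task_offset() {
    local rank=$1 block_bytes=$2
    echo $(( rank * $(round_up_to_stripe "$block_bytes") ))
}

round_up_to_stripe 1500000   # prints 2097152
task_offset 3 1500000        # prints 6291456
```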

With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts has been used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, neither for very large files nor to reach better performance.

If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command
<pre>
$ lfs setstripe -c-1 $HOME/my_output_dir
</pre>
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this
directory is not changed. If you want to change the stripe count of existing files, change
the stripe count of the parent directory, copy the files to new files, remove the old files
and move the new files back to the old name. In order to check the stripe setting of
the file my_file use
<pre>
$ lfs getstripe my_file
</pre>
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the
backup, i.e. if directories have to be recreated this information is lost and the default stripe
count will be used. Therefore, you should annotate for which directories you made changes
to the striping parameters so that you can repeat these changes if required.
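
The copy-and-rename procedure for existing files described above can be wrapped in a small helper. This is a minimal sketch that uses only cp and mv; the function name is illustrative, and the new layout comes from the striping of the parent directory, not from the script itself.

```shell
#!/bin/bash
# Rewrite a file so that it is created anew and thereby picks up the
# current striping of its parent directory.
restripe_file() {
    local file=$1
    local tmp="${file}.restripe.$$"   # temporary name in the same directory
    # Copy to a new file; on failure remove the partial copy and keep the original.
    cp -p -- "$file" "$tmp" || { rm -f -- "$tmp"; return 1; }
    # Replace the old file: this removes it and gives the copy the old name.
    mv -f -- "$tmp" "$file"
}
```

For example, after changing the stripe count of $HOME/my_output_dir, the helper could be called for each existing file in that directory, and lfs getstripe can then be used to check the result.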

=== '''Improving Metadata Performance''' ===

Metadata performance on parallel file systems is usually not as good as with local
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,
you should avoid metadata operations whenever possible. For example, it is much better
to have few large files than lots of small files. In more detail, to increase metadata
performance of a parallel application the following aspects should be considered:

* avoid creating many small files,

* avoid competing directory access, e.g. by creating files in separate subdirectories for each task,

* if many small files are only used within a batch job and accessed by one process store them on $TMP,

* change the default colorization setting of the command ls (see below).
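
Separate subdirectories per task can be derived from the task number. The sketch below assumes that the tasks are started by Slurm, which sets SLURM_PROCID for each task; the function name and the directory naming scheme are illustrative.

```shell
#!/bin/bash
# Create and print a private output directory for one task of a parallel job.
# The rank defaults to SLURM_PROCID; the fallback 0 is only for runs outside Slurm.
task_output_dir() {
    local base=$1
    local rank=${2:-${SLURM_PROCID:-0}}
    local dir
    dir=$(printf '%s/task_%05d' "$base" "$rank")
    mkdir -p -- "$dir" && echo "$dir"
}
```

Each task then creates its files only below its own directory, so that no two tasks modify the same directory at the same time.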

On modern Linux systems, the GNU ls command often uses colorization by default to
visually highlight the file type; this is especially true if the command is run within a terminal
session. This is because the default shell profile initializations usually contain an alias
directive similar to the following for the ls command:
<pre>
$ alias ls="ls --color=tty"
</pre>
However, running the ls command in this way for files on a Lustre file system requires
a stat() call to be used to determine the file type. This can result in a performance
overhead, because the stat() call always needs to determine the size of a file, and that
in turn means that the client node must query the object size of all the backing objects
that make up a file. As a result of the default colorization setting, running a simple
ls command on a Lustre file system often takes as much time as running the ls command
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option
that requires information from a stat() call, is used). To avoid this performance overhead
when using ls commands, add an alias directive similar to the following
to your shell startup script:
<pre>
$ alias ls="ls --color=never"
</pre>

== Workspaces on flash storage ==

A further workspace file system is available for special requirements. The file system is called ''full flash pfs'' and is based on the parallel file system Lustre.

=== Advantages of this file system ===

# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.

=== Access restrictions ===

Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.

=== Using the file system ===

As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option ''-F'' to all the commands that manage workspaces. On bwUniCluster 2.0 it is called ''ffuc'', on HoreKa it is ''ffhk''. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:
<pre>
$ ws_allocate -F ffuc myws 60
</pre>

If you want to use the full flash pfs on bwUniCluster 2.0 '''and''' HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.

Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with
<pre>
$ lfs quota -uh $(whoami) /pfs/work8
</pre>

== $TMPDIR ==

The environment variable $TMPDIR contains the name of a directory which is local to each node. This means
that different tasks of a parallel application use different directories when they do not utilize the same node.
Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the
content of this directory path on these nodes is different.

This directory should be used for temporary files being accessed from the local node during job runtime. It should
also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this
case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.

The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance
on small files is much better than on the parallel file systems. The capacity of $TMPDIR for each node type
can be checked in Table 1 above. The capacity is at least 900 GB.

Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job.
$TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique
for each job. At the end of the job the subdirectory is removed.

On login nodes $TMPDIR also points to a fast directory on a local SSD disk but this directory is not unique.
It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the
installation of software packages. This means that the software package to be installed should be unpacked,
compiled and linked in a subdirectory of $TMPDIR. The real installation of the package (e.g. make install)
should be made into the $HOME folder.
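
A unique subdirectory on a login node can be created with mktemp. This is a minimal sketch; the prefix "build" and the function name are arbitrary choices, and /tmp is only a fallback in case $TMPDIR is not set.

```shell
#!/bin/bash
# Create a private, uniquely named build directory below $TMPDIR and print its name.
make_build_dir() {
    mktemp -d "${TMPDIR:-/tmp}/build.XXXXXX"
}

build_dir=$(make_build_dir)
echo "unpack, compile and link in: $build_dir"
```

Since the directory on the login nodes is not removed automatically, it should be deleted once the installation into $HOME is finished.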

=== Usage example for $TMPDIR ===

We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR.

If you have a data set with many files which is frequently used by batch jobs you should create
a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs.
Such an archive can be read efficiently from a parallel file system since it is a single huge file.
On a login node you can create such an archive with the following steps:
<source lang="bash">
# Create a workspace to store the archive
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60
# Create the archive from a local dataset folder (example)
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/
</source>

Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR
and save the results on a workspace:
<source lang="bash">
#!/bin/bash
# very simple example on how to use local $TMPDIR
#SBATCH -N 1
#SBATCH -t 24:00:00

# Extract compressed input dataset on local SSD
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz

# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results

# Before job completes save results on a workspace
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/
</source>
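
Because the capacity of $TMPDIR depends on the node type, a job may want to check the free space before extracting a large archive. The following sketch uses only df and awk; the threshold of 1 GiB and the function names are illustrative assumptions and not part of the example above.

```shell
#!/bin/bash
# Print the free space (in KiB) of the file system that holds the given directory.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if the directory has at least the requested number of KiB free.
has_space() {
    local dir=$1 needed_kib=$2
    [ "$(free_kib "$dir")" -ge "$needed_kib" ]
}

# Example: require 1 GiB (1048576 KiB) on the node-local directory before unpacking.
if has_space "${TMPDIR:-/tmp}" 1048576; then
    echo "enough space on ${TMPDIR:-/tmp}"
else
    echo "not enough space on ${TMPDIR:-/tmp}" >&2
fi
```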

== LSDF Online Storage ==

In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and datamover nodes.
Furthermore, it can be used on the compute nodes during job runtime by setting the constraint flag "LSDF" (see [[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]). There is also an example of LSDF usage in a batch job: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].
<pre>
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=LSDF
</pre>
 
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
Please request storage projects in the LSDF Online Storage separately:
[https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request].

== BeeOND (BeeGFS On-Demand) ==

Users of the bwUniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.

'''IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.'''

BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out.

For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]

== Backup and Archiving ==

There are regular backups of all data in the home directories; ACLs and extended attributes, however, are not backed up.

Please open a ticket if you need data restored from the backup.

[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]

BwUniCluster2.0/Batch Queues

2023-06-05T09:11:57Z

S Raffeiner:

=== sbatch -p ''queue'' ===
Compute resources such as (wall-)time, nodes and memory are restricted and must fit into '''queues'''. Since requested compute resources are NOT always automatically mapped to the correct queue class, '''you must add the correct queue class to your sbatch command'''. The specification of a queue is obligatory on bwUniCluster 2.0.
 
Details are:

{| width=750px class="wikitable"
! colspan="5" | bwUniCluster 2.0 sbatch -p ''queue''
|- style="text-align:left;"
! queue !! node !! default resources !! minimum resources !! maximum resources
|- style="text-align:left"
| dev_single
| thin
| time=10, mem-per-cpu=1125mb
|
| time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2) 6 nodes are reserved for this queue. Only for development, i.e. debugging or performance optimization ...
|- style="text-align:left;"
| single
| thin
| time=30, mem-per-cpu=1125mb
|
| time=72:00:00, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2)
|- style="text-align:left;"
| dev_multiple
| hpc
| time=10, mem-per-cpu=1125mb
| nodes=2
| time=30, nodes=4, mem=90000mb, ntasks-per-node=40, (threads-per-core=2) 8 nodes are reserved for this queue. Only for development, i.e. debugging or performance optimization ...
|- style="text-align:left;"
| multiple
| hpc
| time=30, mem-per-cpu=1125mb
| nodes=2
| time=72:00:00, mem=90000mb, nodes=80, ntasks-per-node=40, (threads-per-core=2)
|- style="text-align:left;"
| dev_multiple_il
| IceLake
| time=10, mem-per-cpu=1950mb
| nodes=2
| time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2) 8 nodes are reserved for this queue. Only for development, i.e. debugging or performance optimization ...
|- style="text-align:left;"
| multiple_il
| IceLake
| time=10, mem-per-cpu=1950mb
| nodes=2
| time=72:00:00, nodes=80, mem=249600mb, ntasks-per-node=64, (threads-per-core=2)
|- style="text-align:left;"
| dev_gpu_4_a100
| IceLake + A100
| time=10, mem-per-gpu=127500mb, cpus-per-gpu=16
|
| time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|- style="text-align:left;"
| gpu_4_a100
| IceLake + A100
| time=10, mem-per-gpu=127500mb, cpus-per-gpu=16
|
| time=48:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|- style="text-align:left;"
| gpu_4_h100
| IceLake + H100
| time=10, mem-per-gpu=127500mb, cpus-per-gpu=16
|
| time=48:00:00, nodes=5, mem=510000mb, ntasks-per-node=64, (threads-per-core=2)
|- style="vertical-align:top; text-align:left"
| fat
| fat
| time=10, mem-per-cpu=18750mb
| mem=180001mb
| time=72:00:00, nodes=1, mem=3000000mb, ntasks-per-node=80, (threads-per-core=2)
|- style="vertical-align:top; text-align:left"
| dev_gpu_4
| gpu4
| time=10, mem-per-gpu=94000mb, cpus-per-gpu=10
|
| time=30, nodes=1, mem=376000mb, ntasks-per-node=40, (threads-per-core=2) 1 node is reserved for this queue. Only for development, i.e. debugging or performance optimization ...
|- style="text-align:left;"
| gpu_4
| gpu4
| time=10, mem-per-gpu=94000mb, cpus-per-gpu=10
|
| time=48:00:00, mem=376000mb, nodes=14, ntasks-per-node=40, (threads-per-core=2)
|- style="vertical-align:top; text-align:left"
| gpu_8
| gpu8
| time=10, mem-per-gpu=94000mb, cpus-per-gpu=10
|
| time=48:00:00, mem=752000mb, nodes=10, ntasks-per-node=40, (threads-per-core=2)
|-
|}

The default resources of a queue class define time, #tasks and memory if these are not explicitly given with the sbatch command. The resource list acronyms ''--time'', ''--ntasks'', ''--nodes'', ''--mem'' and ''--mem-per-cpu'' are described [[BwUniCluster_2.0_Slurm_common_Features|here]].
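
As a worked example of how the default values combine: if a job requests only a number of tasks, its memory results from the default mem-per-cpu of the queue. The sketch below assumes one CPU per task, which is an assumption and not stated in the table; the function name is illustrative.

```shell
#!/bin/bash
# Default memory (in MB) of a job that requests only a number of tasks,
# assuming one CPU per task (assumption).
default_mem_mb() {
    local ntasks=$1 mem_per_cpu_mb=$2
    echo $(( ntasks * mem_per_cpu_mb ))
}

default_mem_mb 40 1125   # queue "multiple": prints 45000
default_mem_mb 64 1950   # queue "multiple_il": prints 124800
```

With 80 hardware threads per node (40 cores, threads-per-core=2) the same product gives 90000 MB, which matches the mem limit listed for the queue multiple.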
 
 
==== Queue class examples ====
To run your batch job in the queue dev_multiple, i.e. on the hpc nodes, please use:

<pre>
$ sbatch --partition=dev_multiple
or
$ sbatch -p dev_multiple
</pre>
 

==== Interactive Jobs ====
On bwUniCluster 2.0 you are only allowed to run short jobs (<< 1 hour) with little memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs with a request of more than 8 GByte of memory, you must allocate resources for so-called interactive jobs by using the command salloc on a login node. For example, for a serial application on a compute node that requires 5000 MByte of memory, with the interactive run limited to 2 hours, the following command has to be executed:
<pre>
$ salloc -p single -n 1 -t 120 --mem=5000
</pre>
Then you will get one core on a compute node within the partition "single". After execution of this command '''DO NOT CLOSE''' your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.
<pre>
$ ./<my_serial_program>
</pre>
Please be aware that your serial job must run less than 2 hours in this example, else the job will be killed during runtime by the system.
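
Slurm interprets a plain number given to the option -t as minutes, so -t 120 in the example above corresponds to 2 hours; the form hours:minutes:seconds is accepted as well. The following sketch converts between the two notations; the function name is illustrative.

```shell
#!/bin/bash
# Convert a time limit given in minutes to the form HH:MM:SS.
minutes_to_hms() {
    local minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

minutes_to_hms 120   # prints 02:00:00
minutes_to_hms 30    # prints 00:30:00
```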
 
You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. You can start it with the command:
<pre>
$ xterm
</pre>
Note that once the walltime limit has been reached, the resources (i.e. the compute node) will automatically be revoked.
 
An interactive parallel application can run on one or on several compute nodes (here, e.g., 5 nodes with 40 cores each). It usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:
<pre>
$ salloc -p multiple -N 5 --ntasks-per-node=40 -t 01:00:00 --mem=50gb
</pre>
Now you can run parallel jobs on 200 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node, you have to open a new terminal, connect it to bwUniCluster 2.0 as well and type the following commands to
connect to the running interactive job and then to a specific node:
<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc2nXXX --pty /bin/bash
</pre>
With the command:
<pre>
$ squeue
</pre>
the jobid and the nodelist can be shown.

To run MPI programs, simply type mpirun <program_name>; the program then runs on all 200 cores. A very simple example of starting a parallel job is:
<pre>
$ mpirun <my_mpi_program>
</pre>
You can also start the debugger ddt by the commands:
<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>
The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:
<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>
If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).
 
 

----
[[Category:bwUniCluster 2.0|Batch Jobs - bwUniCluster 2.0 Features]]

BwUniCluster2.0/Hardware and Architecture

2023-01-05T13:35:08Z

S Raffeiner: /* Components of bwUniCluster */

= Architecture of bwUniCluster 2.0 =

The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system
Lustre is attached to bwUniCluster 2.0 by coupling the InfiniBand of the file server with the InfiniBand
switch of the compute cluster, which provides a fast and scalable
parallel file system.

The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, e.g. SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly
discussed in this document. Others which are of greater importance to system
administrators will not be covered by this document.

The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user's point of view, the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.

'''Login Nodes'''

The login nodes are the only nodes that are directly accessible by end users. These nodes
are used for interactive login, file management, program development and interactive pre-
and postprocessing. Two nodes are dedicated to this service but they are all accessible via
one address and a DNS round-robin alias distributes the login sessions to the
different login nodes.

'''Compute Nodes'''

The majority of nodes are compute nodes which are managed by a batch system. Users
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).

'''File Server Nodes'''

The hardware of the parallel file system Lustre incorporates some file server nodes; the file
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter "File Systems").

'''Administrative Server Nodes'''

Some other nodes are delivering additional services like resource management, external
network connection, administration etc. These nodes can be accessed directly by system administrators only.

= Components of bwUniCluster =

{| class="wikitable"
|-
! style="width:9%"|
! style="width:13%"| Compute nodes "Thin"
! style="width:13%"| Compute nodes "HPC"
! style="width:13%"| Compute nodes "IceLake"
! style="width:13%"| Compute nodes "Fat"
! style="width:13%"| GPU x4
! style="width:13%"| GPU x8
! style="width:13%"| IceLake + GPU x4
! style="width:13%"| Login
|-
!scope="column"| Number of nodes
| 100 + 60
| 360
| 272
| 6
| 14
| 10
| 15
| 4
|-
!scope="column"| Processors
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Platinum 8358
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Gold 6248
| Intel Xeon Platinum 8358
|
|-
!scope="column"| Number of sockets
| 2
| 2
| 2
| 4
| 2
| 2
| 2
| 2
|-
!scope="column"| Processor frequency (GHz)
| 2.1 GHz
| 2.1 GHz
| 2.6 GHz
| 2.1 GHz
| 2.1 GHz
| 2.6 GHz
| 2.5 GHz
|
|-
!scope="column"| Total number of cores
| 40
| 40
| 64
| 80
| 40
| 40
| 64
| 40
|-
!scope="column"| Main memory
| 96 GB / 192 GB
| 96 GB
| 256 GB
| 3 TB
| 384 GB
| 768 GB
| 512 GB
| 384 GB
|-
!scope="column"| Local SSD
| 960 GB SATA
| 960 GB SATA
| 1.8 TB NVMe
| 4.8 TB NVMe
| 3.2 TB NVMe
| 15 TB NVMe
| 6.4 TB NVMe
|
|-
!scope="column"| Accelerators
| -
| -
| -
| -
| 4x NVIDIA Tesla V100
| 8x NVIDIA Tesla V100
| 4x NVIDIA A100 / 4x NVIDIA H100
| -
|-
!scope="column"| Accelerator memory
| -
| -
| -
| -
| 32 GB
| 32 GB
| 80 GB / 94 GB
| -
|-
!scope="column"| Interconnect
| IB HDR100 (blocking)
| IB HDR100
| IB HDR200
| IB HDR
| IB HDR
| IB HDR
| IB HDR200
| IB HDR100 (blocking)
|}
Table 1: Properties of the nodes

= File Systems =

On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. Another workspace file system based on flash storage is available for special requirements.

Within a batch job further file systems are available:
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.

Some of the characteristics of the file systems are shown in Table 2.

{| style="width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px"
|- style="width:20%;height=20px; text-align:left;padding:3px"
! style="background-color:#AAA;padding:3px"| Property
! style="background-color:#AAA;padding:3px"| $TMP
! style="background-color:#AAA;padding:3px"| BeeOND
! style="background-color:#AAA;padding:3px"| $HOME
! style="background-color:#AAA;padding:3px"| Workspace
! style="background-color:#AAA;padding:3px"| Workspace on flash
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Visibility
| style="height=20px; text-align:left;padding:3px"| local node
| style="height=20px; text-align:left;padding:3px"| nodes of batch job
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Lifetime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| permanent
| style="height=20px; text-align:left;padding:3px"| max. 240 days
| style="height=20px; text-align:left;padding:3px"| max. 240 days
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Disk space
| style="height=20px; text-align:left;padding:3px"| 960 GB - 6.4 TB (for details see Table 1)
| style="height=20px; text-align:left;padding:3px"| n*750 GB
| style="height=20px; text-align:left;padding:3px"| 1.2 PiB
| style="height=20px; text-align:left;padding:3px"| 4.1 PiB
| style="height=20px; text-align:left;padding:3px"| 236 TiB
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Capacity Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes, 1 TiB per user; also per organization
| style="height=20px; text-align:left;padding:3px"| yes, 40 TiB per user
| style="height=20px; text-align:left;padding:3px"| yes, 1 TiB per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Inode Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes, 10 million per user
| style="height=20px; text-align:left;padding:3px"| yes, 30 million per user
| style="height=20px; text-align:left;padding:3px"| yes, 5 million per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Backup
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Read perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 6 GB/s, depending on the type of local SSD / job queue: 520 MB/s @ single / multiple, 800 MB/s @ multiple_e, 6600 MB/s @ fat, 6500 MB/s @ gpu_4, 6500 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 400 MB/s - 500 MB/s, depending on the type of local SSDs / job queue: 500 MB/s @ multiple, 400 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Write perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 4 GB/s, depending on the type of local SSD / job queue: 500 MB/s @ single / multiple, 650 MB/s @ multiple_e, 2900 MB/s @ fat, 2090 MB/s @ gpu_4, 4060 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 250 MB/s - 350 MB/s, depending on the type of local SSDs / job queue: 350 MB/s @ multiple, 250 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total read perf.
| style="height=20px; text-align:left;padding:3px"| n*500-6000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*400-500 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
| style="height=20px; text-align:left;padding:3px"| 45 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total write perf.
| style="height=20px; text-align:left;padding:3px"| n*500-4000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*250-350 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
| style="height=20px; text-align:left;padding:3px"| 38 GB/s
|}
---------------------------------------------------------------------------------------------------------
global: all nodes of UniCluster access the same file system;
local: each node has its own file system;
permanent: files are stored permanently;
batch job: files are removed at end of the batch job.
---------------------------------------------------------------------------------------------------------
Table 2: Properties of the file systems

== Selecting the appropriate file system ==

In general, you should separate your data and store it on the appropriate file system.
Permanently needed data like software or important results should be stored below $HOME
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME
you can usually restore it from backup. Permanent data which is not needed for months
or exceeds the capacity restrictions should be sent to the LSDF Online Storage
or to the archive and deleted from the file systems. Temporary data which is only needed on a single
node and which does not exceed the disk space shown in the table above should be stored
below $TMP. Data which is read many times on a single node, e.g. if you are doing AI training,
should be copied to $TMP and read from there. Temporary data which is used from many nodes
of your batch job and which is only needed during job runtime should be stored on a
parallel on-demand file system. Temporary data which can be recomputed or which is the
result of one job and input for another job should be stored in workspaces. The lifetime
of data in workspaces is limited and depends on the lifetime of the workspace which can be
several months.

For further details please check the chapters below.

== $HOME ==

The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.
You have access to your home directory from all nodes of uc2. A regular backup of these directories
to tape archive is done automatically. The directory $HOME is used to hold those files that are
permanently used like source codes, configuration files, executable programs etc.

On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) $HOME
</pre>
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:
<pre>
lfs quota -ph $(grep $(echo $HOME | sed -e "s|/[^/]*/[^/]*$||") /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME
</pre>
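The command above is dense, so here is a small sketch of what its inner sed expression does. The home directory path used here is a made-up example and not a real account; only the sed expression itself is taken from the command above.

```shell
# Hypothetical home directory path, used only to illustrate the sed expression.
example_home="/pfs/data5/home/kit/scc/ab1234"

# Strip the last two path components, as the command above does with $HOME.
org_dir=$(echo "$example_home" | sed -e "s|/[^/]*/[^/]*$||")
echo "$org_dir"
```

For this example path the result is /pfs/data5/home/kit. In the full command, grep looks this directory up in /pfs/data5/project_ids.txt, and cut passes the first space-separated field of the matching line (presumably the project ID of the organization) to ''lfs quota -p''.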

== Workspaces ==

On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel.

Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired, we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.
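The total of 240 days follows from the numbers above, assuming that each renewal again adds the maximum of 60 days. A small sketch of the arithmetic (the variable names are only illustrative):

```shell
initial_days=60      # maximum lifetime when the workspace is created
extension_days=60    # assumed lifetime added by each renewal
max_extensions=3     # number of renewals allowed

# Total maximum lifetime after workspace generation.
total_days=$(( initial_days + max_extensions * extension_days ))
echo "$total_days"   # prints 240
```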

Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.

On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) /pfs/work7
</pre>
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).

=== Reminder for workspace deletion ===

Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:

$ ws_send_ical.sh <workspace> <email>

== Improving Performance on $HOME and workspaces ==

The following recommendations might help to improve throughput and metadata
performance on Lustre filesystems.

=== '''Improving Throughput Performance''' ===

Depending on your application some adaptations might be necessary if you want to reach
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.

When you are designing your application you should consider that the performance of
parallel filesystems is generally better if data is transferred in large blocks and stored in
few large files. In more detail, to increase throughput performance of a parallel application
the following aspects should be considered:

* collect large chunks of data and write them sequentially at once,

* to exploit complete filesystem bandwidth use several clients,

* avoid competing file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),

* if files are small enough for the SSDs and are only used by one process, store them on $TMP.
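As a rough illustration of writing data sequentially in large blocks, the following sketch writes a file in blocks of 1 MiB, so that each write begins on a multiple of the default stripe size. The file name and the total size are assumptions chosen only to keep the example small; a real application would apply the same idea in its own I/O routines.

```shell
# Write 4 MiB as four sequential blocks of 1 MiB each, so that every write
# starts on a multiple of the 1 MB default stripe size mentioned above.
# The file name is an illustrative assumption.
out_file="large_block_demo.dat"
dd if=/dev/zero of="$out_file" bs=1048576 count=4 2>/dev/null

# Report the resulting file size in bytes.
size_bytes=$(wc -c < "$out_file" | tr -d ' ')
echo "$size_bytes"
```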

With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the new Lustre feature Progressive File Layouts has been used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, neither for very large files nor to reach better performance.

If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command
<pre>
$ lfs setstripe -c-1 $HOME/my_output_dir
</pre>
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this
directory is not changed. If you want to change the stripe count of existing files, change
the stripe count of the parent directory, copy the files to new files, remove the old files
and move the new files back to the old name. In order to check the stripe setting of
the file my_file use
<pre>
$ lfs getstripe my_file
</pre>
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the
backup, i.e. if directories have to be recreated this information is lost and the default stripe
count will be used. Therefore, you should annotate for which directories you made changes
to the striping parameters so that you can repeat these changes if required.
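The copy, remove and rename steps described above for existing files can be sketched as a small shell function. The name restripe_file is a hypothetical helper and not a Lustre tool; the sketch assumes that the stripe count of the parent directory has already been changed with ''lfs setstripe''.

```shell
# Rewrite a file so that it is created anew and picks up the current
# default striping of its parent directory.
# restripe_file is a hypothetical helper name, not a Lustre command.
restripe_file() {
    local file="$1"
    local tmp="${file}.restripe.$$"
    cp -p "$file" "$tmp" || return 1   # copy the data into a newly created file
    mv -f "$tmp" "$file"               # replace the old file under the old name
}
```

It could be called as ''restripe_file $HOME/my_output_dir/my_file'', followed by ''lfs getstripe'' on the file to check the result. The temporary copy needs as much free space and quota as the original file.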

=== '''Improving Metadata Performance''' ===

Metadata performance on parallel file systems is usually not as good as with local
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,
you should avoid metadata operations whenever possible. For example, it is much better
to have few large files than lots of small files. In more detail, to increase metadata
performance of a parallel application the following aspects should be considered:

* avoid creating many small files,

* avoid competing directory access, e.g. by creating files in separate subdirectories for each task,

* if many small files are only used within a batch job and accessed by one process, store them on $TMP,

* change the default colorization setting of the command ls (see below).
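The recommendation to use a separate subdirectory for each task can be sketched as follows. The base directory name output is an assumption; SLURM_PROCID is the task rank that Slurm sets for tasks started with srun, and the fallback to 0 only keeps the sketch usable outside of a job.

```shell
# One output subdirectory per task, so that tasks do not create files
# in the same directory at the same time.
# The base directory name "output" is an illustrative assumption.
task_id="${SLURM_PROCID:-0}"
task_dir="output/task_${task_id}"
mkdir -p "$task_dir"
echo "result of task ${task_id}" > "${task_dir}/result.txt"
```

Programs started with another launcher may need a different rank variable provided by their MPI library.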

On modern Linux systems, the GNU ls command often uses colorization by default to
visually highlight the file type; this is especially true if the command is run within a terminal
session. This is because the default shell profile initializations usually contain an alias
directive similar to the following for the ls command:
<pre>
$ alias ls="ls --color=tty"
</pre>
However, running the ls command in this way for files on a Lustre file system requires
a stat() call to be used to determine the file type. This can result in a performance
overhead, because the stat() call always needs to determine the size of a file, and that
in turn means that the client node must query the object size of all the backing objects
that make up a file. As a result of the default colorization setting, running a simple
ls command on a Lustre file system often takes as much time as running the ls command
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option
that requires information from a stat() call, is used). To avoid this performance overhead
when using ls commands, add an alias directive similar to the following
to your shell startup script:
<pre>
$ alias ls="ls --color=never"
</pre>

== Workspaces on flash storage ==

Another workspace file system is available for special requirements. The file system is called ''full flash pfs'' and is based on the parallel file system Lustre.

=== Advantages of this file system ===

# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.

=== Access restrictions ===

Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.

=== Using the file system ===

After access is granted, you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option ''-F'' to all the commands that manage workspaces. On bwUniCluster 2.0 it is called ''ffuc'', on HoreKa it is ''ffhk''. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:
 ws_allocate -F ffuc myws 60

If you want to use the full flash pfs on bwUniCluster 2.0 '''and''' HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.

Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TB capacity and 5 million inodes per user. You can check your current usage with
 lfs quota -uh $(whoami) /pfs/work8

== $TMP ==

While all tasks of a parallel application access the same $HOME and workspace directory, the
$TMP directory is local to each node on bwUniCluster 2.0. All nodes have fast local SSD
storage devices which are used to store data below $TMP. Different tasks of a parallel
application use different $TMP directories when they do not run on the same node. This directory should
be used for temporary files being accessed by single tasks. It should also be used if you read the
same data many times from a single node, e.g. if you are doing AI training. In this case you should
copy the data at the beginning of your batch job to $TMP and read the data from there.
In addition, this directory should be used when installing
software packages. This means that the software package to be installed should be
unpacked, compiled and linked in a subdirectory of $TMP. The actual installation of the
package (e.g. make install) should then be made into the Lustre filesystem.

Each time a batch job is started, a subdirectory is created on each node and assigned to the job.
$TMP is newly set and the name of the subdirectory contains the Job-id so that the
subdirectory name is unique for each job. This unique name is then assigned to the
environment variable $TMP within the job. At the end of the job the subdirectory is removed.
Although $TMP points to the same path name for different nodes of a job, the physical location
on these nodes is different.
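Copying a data set that is read many times to $TMP at the beginning of a batch job can be sketched as follows. The function name stage_to_tmp is a hypothetical helper; the fallback to /tmp is only there so that the sketch also runs outside of a batch job, where $TMP may not be set.

```shell
# Copy a read-often data set to node-local storage and print the local path.
# stage_to_tmp is a hypothetical helper name; TMP is set by the batch system
# inside a job, /tmp is only a fallback for runs outside of a job.
stage_to_tmp() {
    local src="$1"
    local dest="${TMP:-/tmp}/$(basename "$src")"
    cp -r "$src" "$dest" || return 1   # copy the whole directory tree
    echo "$dest"                       # local path to read the data from
}
```

In a job script it could be used as ''data_dir=$(stage_to_tmp <path_to_dataset>)'', after which the application reads from $data_dir. Since $TMP is local to each node, a multi-node job has to do this copy once on every node.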

== LSDF Online Storage ==

In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and datamover nodes.
Furthermore, it can be used on the compute nodes during the job runtime with the constraint flag "LSDF" ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]). There is also an example of LSDF usage in a batch job: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].
<pre>
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=LSDF
</pre>
 
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
Please request storage projects in the LSDF Online Storage separately:
[https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request].

==BeeOND (BeeGFS On-Demand)==

Users of bwUniCluster 2.0 can request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after the job has ended.

'''IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure that you copy your data back to a global filesystem, e.g. $HOME or a workspace, within the job.'''

BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out.

For detailed usage see [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]].

==Backup and Archiving==

There are regular backups of all data in the home directories; however, ACLs and extended attributes are
not backed up.

Please open a ticket if you need data restored from the backup.

[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]

BwUniCluster2.0/Hardware and Architecture

2023-01-05T13:34:24Z

S Raffeiner: /* Components of bwUniCluster */

= Architecture of bwUniCluster 2.0 =

The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the file system
Lustre is attached to bwUniCluster 2.0 by coupling the InfiniBand of the file server with the InfiniBand
switch of the compute cluster, which provides a fast and scalable
parallel file system.

The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, e.g. SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly
discussed in this document. Others which are of greater importance to system
administrators will not be covered by this document.

The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user's point of view, the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.

'''Login Nodes'''

The login nodes are the only nodes that are directly accessible by end users. These nodes
are used for interactive login, file management, program development and interactive pre-
and postprocessing. Two nodes are dedicated to this service but they are all accessible via
one address and a DNS round-robin alias distributes the login sessions to the
different login nodes.

'''Compute Nodes'''

The majority of nodes are compute nodes which are managed by a batch system. Users
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).

'''File Server Nodes'''

The hardware of the parallel file system Lustre incorporates some file server nodes; the file
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter "File Systems").

'''Administrative Server Nodes'''

Some other nodes are delivering additional services like resource management, external
network connection, administration etc. These nodes can be accessed directly by system administrators only.

= Components of bwUniCluster =

{| class="wikitable"
|-
! style="width:9%"|
! style="width:13%"| Compute nodes "Thin"
! style="width:13%"| Compute nodes "HPC"
! style="width:13%"| Compute nodes "IceLake"
! style="width:13%"| Compute nodes "Fat"
! style="width:13%"| GPU x4
! style="width:13%"| GPU x8
! style="width:13%"| IceLake + GPU x4
! style="width:13%"| Login
|-
!scope="column"| Number of nodes
| 100 + 60
| 360
| 272
| 6
| 14
| 10
| 15
| 4
|-
!scope="column"| Processors
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Platinum 8358
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Gold 6248
| Intel Xeon Platinum 8358
|
|-
!scope="column"| Number of sockets
| 2
| 2
| 2
| 4
| 2
| 2
| 2
| 2
|-
!scope="column"| Processor frequency (GHz)
| 2.1 GHz
| 2.1 GHz
| 2.6 GHz
| 2.1 GHz
| 2.1 GHz
| 2.6 GHz
| 2.5 GHz
|
|-
!scope="column"| Total number of cores
| 40
| 40
| 64
| 80
| 40
| 40
| 64
| 40
|-
!scope="column"| Main memory
| 96 GB / 192 GB
| 96 GB
| 256 GB
| 3 TB
| 384 GB
| 768 GB
| 512 GB
| 384 GB
|-
!scope="column"| Local SSD
| 960 GB SATA
| 960 GB SATA
| 1.8 TB NVMe
| 4.8 TB NVMe
| 3.2 TB NVMe
| 15 TB NVMe
| 6.4 TB NVMe
|
|-
!scope="column"| Accelerators
| -
| -
| -
| -
| 4x NVIDIA Tesla V100
| 8x NVIDIA Tesla V100
| 4x NVIDIA A100 / 4x NVIDIA H100
|-
!scope="column"| Accelerator memory
| -
| -
| -
| -
| 32 GB
| 32 GB
| 80 GB / 94 GB
|-
!scope="column"| Interconnect
| IB HDR100 (blocking)
| IB HDR100
| IB HDR200
| IB HDR
| IB HDR
| IB HDR
| IB HDR200
| IB HDR100 (blocking)
|}
Table 1: Properties of the nodes

= File Systems =

On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. Another workspace file system based on flash storage is available for special requirements.

Within a batch job further file systems are available:
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.

Some of the characteristics of the file systems are shown in Table 2.

{| style="width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px"
|- style="width:20%;height=20px; text-align:left;padding:3px"
! style="background-color:#AAA;padding:3px"| Property
! style="background-color:#AAA;padding:3px"| $TMP
! style="background-color:#AAA;padding:3px"| BeeOND
! style="background-color:#AAA;padding:3px"| $HOME
! style="background-color:#AAA;padding:3px"| Workspace
! style="background-color:#AAA;padding:3px"| Workspace on flash
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Visibility
| style="height=20px; text-align:left;padding:3px"| local node
| style="height=20px; text-align:left;padding:3px"| nodes of batch job
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Lifetime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| permanent
| style="height=20px; text-align:left;padding:3px"| max. 240 days
| style="height=20px; text-align:left;padding:3px"| max. 240 days
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Disk space
| style="height=20px; text-align:left;padding:3px"| 960 GB - 6.4 TB (for details see Table 1)
| style="height=20px; text-align:left;padding:3px"| n*750 GB
| style="height=20px; text-align:left;padding:3px"| 1.2 PiB
| style="height=20px; text-align:left;padding:3px"| 4.1 PiB
| style="height=20px; text-align:left;padding:3px"| 236 TiB
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Capacity Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes 1 TiB per user also per organization
| style="height=20px; text-align:left;padding:3px"| yes 40 TiB per user
| style="height=20px; text-align:left;padding:3px"| yes 1 TiB per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Inode Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes 10 million per user
| style="height=20px; text-align:left;padding:3px"| yes 30 million per user
| style="height=20px; text-align:left;padding:3px"| yes 5 million per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Backup
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Read perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 6 GB/s<br>depends on type of local SSD / job queue:<br>520 MB/s @ single / multiple<br>800 MB/s @ multiple_e<br>6600 MB/s @ fat<br>6500 MB/s @ gpu_4<br>6500 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 400 MB/s - 500 MB/s<br>depends on type of local SSDs / job queue:<br>500 MB/s @ multiple<br>400 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Write perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 4 GB/s<br>depends on type of local SSD / job queue:<br>500 MB/s @ single / multiple<br>650 MB/s @ multiple_e<br>2900 MB/s @ fat<br>2090 MB/s @ gpu_4<br>4060 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 250 MB/s - 350 MB/s<br>depends on type of local SSDs / job queue:<br>350 MB/s @ multiple<br>250 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total read perf.
| style="height=20px; text-align:left;padding:3px"| n*500-6000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*400-500 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
| style="height=20px; text-align:left;padding:3px"| 45 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total write perf.
| style="height=20px; text-align:left;padding:3px"| n*500-4000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*250-350 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
| style="height=20px; text-align:left;padding:3px"| 38 GB/s
|}
---------------------------------------------------------------------------------------------------------
global: all nodes of UniCluster access the same file system;
local: each node has its own file system;
permanent: files are stored permanently;
batch job: files are removed at end of the batch job.
---------------------------------------------------------------------------------------------------------
Table 2: Properties of the file systems

== Selecting the appropriate file system ==

In general, you should separate your data and store it on the appropriate file system.
Permanently needed data like software or important results should be stored below $HOME
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME
you can usually restore it from backup. Permanent data which is not needed for months
or exceeds the capacity restrictions should be sent to the LSDF Online Storage
or to the archive and deleted from the file systems. Temporary data which is only needed on a single
node and which does not exceed the disk space shown in the table above should be stored
below $TMP. Data which is read many times on a single node, e.g. if you are doing AI training,
should be copied to $TMP and read from there. Temporary data which is used from many nodes
of your batch job and which is only needed during job runtime should be stored on a
parallel on-demand file system. Temporary data which can be recomputed or which is the
result of one job and input for another job should be stored in workspaces. The lifetime
of data in workspaces is limited and depends on the lifetime of the workspace which can be
several months.

For further details please check the chapters below.

== $HOME ==

The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.
You have access to your home directory from all nodes of uc2. A regular backup of these directories
to tape archive is done automatically. The directory $HOME is used to hold those files that are
permanently used like source codes, configuration files, executable programs etc.

On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) $HOME
</pre>
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:
<pre>
lfs quota -ph $(grep $(echo $HOME | sed -e "s|/[^/]*/[^/]*$||") /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME
</pre>
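
The one-liner above is dense, so it may help to read it in steps. The following is only a sketch: the helper names <code>org_dir</code> and <code>org_project_id</code> are made up for illustration, and the layout of the ID table (project ID in the first space-separated field, organization directory on the same line) is inferred from the command itself rather than documented here.

```shell
# Hypothetical helper: strip the last two path components of a home
# directory to obtain the organization directory (same sed expression
# as in the command above).
org_dir() {
    printf '%s\n' "$1" | sed -e 's|/[^/]*/[^/]*$||'
}

# Hypothetical helper: look up the project ID of the organization that
# owns home directory $1 in the ID table $2 (assumed format: "ID path").
org_project_id() {
    grep "$(org_dir "$1")" "$2" | cut -f 1 -d ' '
}

# Usage on the cluster (not executed here):
#   lfs quota -ph "$(org_project_id "$HOME" /pfs/data5/project_ids.txt)" "$HOME"
```

Splitting the lookup this way makes it easier to see which part fails if the command prints nothing, e.g. when the organization directory is not found in the table.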

== Workspaces ==

On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel.

Workspaces have a lifetime, and the data in a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, to a total maximum of 240 days after workspace creation. If a workspace has inadvertently expired, we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and the expired workspace in a ticket.

Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.

On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) /pfs/work7
</pre>
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).

=== Reminder for workspace deletion ===

Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:
<pre>
$ ws_send_ical.sh <workspace> <email>
</pre>

== Improving Performance on $HOME and workspaces ==

The following recommendations might help to improve throughput and metadata
performance on Lustre filesystems.

=== '''Improving Throughput Performance''' ===

Depending on your application some adaptations might be necessary if you want to reach
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are split into stripes and distributed across different storage subsystems. In Lustre, the size of these stripes (sometimes also called chunks) is the stripe size, and the number of storage subsystems used is the stripe count.

When you are designing your application you should consider that the performance of
parallel filesystems is generally better if data is transferred in large blocks and stored in
few large files. In more detail, to increase the throughput performance of a parallel application
the following aspects should be considered:

* collect large chunks of data and write them sequentially at once,

* use several clients to exploit the complete filesystem bandwidth,

* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),

* if files are small enough for the SSDs and are only used by one process, store them on $TMP.
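
The recommendation to use blocks with boundaries at the stripe size can be sketched as a small calculation. The helper <code>stripe_aligned_ranges</code> below is hypothetical and not a cluster tool; it assumes a stripe size of 1 MiB unless another value is given, and it splits a shared file among tasks so that every block starts on a stripe boundary.

```shell
# Hypothetical helper: print "task offset length" (in bytes) for each task
# so that every block of a shared file starts at a multiple of the stripe
# size. Arguments: file size, number of tasks, stripe size (default 1 MiB).
stripe_aligned_ranges() {
    file_size=$1
    tasks=$2
    stripe=${3:-1048576}
    # Number of stripes in the file, rounded up.
    stripes=$(( (file_size + stripe - 1) / stripe ))
    # Whole stripes handled by each task, rounded up.
    per_task=$(( (stripes + tasks - 1) / tasks ))
    i=0
    while [ "$i" -lt "$tasks" ]; do
        start=$(( i * per_task * stripe ))
        # Tasks beyond the end of the file get no block.
        [ "$start" -ge "$file_size" ] && break
        end=$(( start + per_task * stripe ))
        [ "$end" -gt "$file_size" ] && end=$file_size
        echo "$i $start $(( end - start ))"
        i=$(( i + 1 ))
    done
}

# Example: a 10 MiB file shared by 4 tasks.
stripe_aligned_ranges 10485760 4
```

Each task then reads or writes only its own range, so no two tasks touch the same stripe and the last task simply receives the shorter remainder.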

With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted automatically as the file size grows. In normal cases users no longer need to adapt file striping parameters, neither for very large files nor to reach better performance.

If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command
<pre>
$ lfs setstripe -c-1 $HOME/my_output_dir
</pre>
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this
directory is not changed. If you want to change the stripe count of existing files, change
the stripe count of the parent directory, copy the files to new files, remove the old files
and move the new files back to the old name. In order to check the stripe setting of
the file my_file use
<pre>
$ lfs getstripe my_file
</pre>
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the
backup, i.e. if directories have to be recreated this information is lost and the default stripe
count will be used. Therefore, you should annotate for which directories you made changes
to the striping parameters so that you can repeat these changes if required.
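
The copy, remove and rename procedure for existing files described above can be sketched as follows. The helper name <code>recopy_file</code> is hypothetical, and the sketch assumes there is enough free quota to hold a second copy of the file while it runs.

```shell
# Hypothetical helper: re-create a file so that the new copy picks up the
# striping currently set on its parent directory
# (copy to a new file, remove the old file, move the copy back).
recopy_file() {
    f=$1
    tmp="$f.restripe.$$"
    # Keep mode and timestamps; stop if the copy fails.
    cp -p "$f" "$tmp" || return 1
    rm "$f" && mv "$tmp" "$f"
}

# Usage on a Lustre directory (not executed here):
#   lfs setstripe -c-1 "$HOME/my_output_dir"
#   recopy_file "$HOME/my_output_dir/my_file"
#   lfs getstripe "$HOME/my_output_dir/my_file"
```

The original is removed only after the copy has succeeded, so an interrupted run leaves at least one complete version of the file.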

=== '''Improving Metadata Performance''' ===

Metadata performance on parallel file systems is usually not as good as on local
filesystems. In addition, it usually does not scale, i.e. it is a limited resource. Therefore,
you should avoid metadata operations whenever possible. For example, it is much better
to have few large files than lots of small files. In more detail, to increase the metadata
performance of a parallel application the following aspects should be considered:

* avoid creating many small files,

* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,

* if many small files are only used within a batch job and accessed by one process store them on $TMP,

* change the default colorization setting of the command ls (see below).

On modern Linux systems, the GNU ls command often uses colorization by default to
visually highlight the file type; this is especially true if the command is run within a terminal
session. This is because the default shell profile initializations usually contain an alias
directive similar to the following for the ls command:
<pre>
$ alias ls="ls --color=tty"
</pre>
However, running the ls command in this way for files on a Lustre file system requires
a stat() call to be used to determine the file type. This can result in a performance
overhead, because the stat() call always needs to determine the size of a file, and that
in turn means that the client node must query the object size of all the backing objects
that make up a file. As a result of the default colorization setting, running a simple
ls command on a Lustre file system often takes as much time as running the ls command
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option
that requires information from a stat() call, is used). To avoid this performance overhead
when using ls commands, add an alias directive similar to the following
to your shell startup script:
<pre>
$ alias ls="ls --color=never"
</pre>
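
The recommendation to avoid competitive directory access by giving each task its own subdirectory can be sketched like this. The helper name <code>task_dir</code> is hypothetical; it relies on the variable SLURM_PROCID, which Slurm sets for every task of a job step, and falls back to 0 outside of a job step.

```shell
# Hypothetical helper: create and print a per-task output directory below
# the base directory $1, so that tasks do not create files in one shared
# directory.
task_dir() {
    dir="$1/task_${SLURM_PROCID:-0}"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Usage inside a task started by srun (the base path is a placeholder):
#   out=$(task_dir /path/to/workspace/output)
#   my_program > "$out/result.dat"
```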

== Workspaces on flash storage ==

An additional workspace file system is available for special requirements. The file system is called ''full flash pfs'' and is based on the parallel file system Lustre.

=== Advantages of this file system ===

# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.

=== Access restrictions ===

Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.

=== Using the file system ===

After access is granted, you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option ''-F'' to all the commands that manage workspaces. On bwUniCluster 2.0 it is called ''ffuc'', on HoreKa it is ''ffhk''. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:
<pre>
ws_allocate -F ffuc myws 60
</pre>

If you want to use the full flash pfs on bwUniCluster 2.0 '''and''' HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.

Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with
<pre>
lfs quota -uh $(whoami) /pfs/work8
</pre>

== $TMP ==

While all tasks of a parallel application access the same $HOME and workspace directory, the
$TMP directory is local to each node on bwUniCluster 2.0. All nodes have fast local SSD
storage devices which are used to store data below $TMP. Different tasks of a parallel
application use different $TMP directories when they do not run on the same node. This directory should
be used for temporary files that are accessed by single tasks. It should also be used if you read the
same data many times from a single node, e.g. if you are doing AI training. In this case you should
copy the data to $TMP at the beginning of your batch job and read the data from there.
In addition, this directory should be used for the installation
of software packages. This means that the software package to be installed should be
unpacked, compiled and linked in a subdirectory of $TMP. The real installation of the
package (e.g. make install) should be made into the Lustre filesystem.

Each time a batch job is started, a subdirectory is created on each node and assigned to the job.
$TMP is newly set and the name of the subdirectory contains the Job-id so that the
subdirectory name is unique for each job. This unique name is then assigned to the
environment variable $TMP within the job. At the end of the job the subdirectory is removed.
Although $TMP points to the same path name for different nodes of a job, the physical location
on these nodes is different.
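
Copying input data to $TMP at the beginning of a job, as recommended above, might look like the following sketch. The helper name <code>stage_in</code> and all paths are placeholders, and the fallback to /tmp is only there so that the function also works outside of a batch job.

```shell
# Hypothetical helper: copy a data set to the node-local $TMP directory
# and print the path of the local copy.
stage_in() {
    src=$1
    dest="${TMP:-/tmp}/$(basename "$src")"
    cp -r "$src" "$dest" && printf '%s\n' "$dest"
}

# Usage inside a job script (paths and program name are placeholders):
#   data=$(stage_in /path/to/workspace/train_data)
#   my_program "$data"
```

Because $TMP is local to each node, a job that spans several nodes has to make this copy on every node that reads the data.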

== LSDF Online Storage ==

In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and datamover nodes.
Furthermore, it can be used on the compute nodes during the job runtime with the constraint flag "LSDF" ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]). There is also an example of LSDF usage in batch jobs: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].
<pre>
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=LSDF
</pre>
 
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
Please request storage projects in the LSDF Online Storage separately:
[https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request].

== BeeOND (BeeGFS On-Demand) ==

Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.

'''IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.'''

BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out.

For detailed usage see here:[[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]
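
Since everything on the on-demand file system is deleted at the end of the job, the last step of a job script is typically a copy back to a global file system. The following is a minimal sketch: the helper name <code>save_results</code> is hypothetical, and both directories are passed in by the caller because the BeeOND mount point is described on the linked page rather than here.

```shell
# Hypothetical helper: copy the contents of a results directory on the
# on-demand file system ($1) to a directory on a global file system ($2),
# e.g. below $HOME or in a workspace, before the job ends.
save_results() {
    src=$1
    dest=$2
    mkdir -p "$dest" && cp -a "$src/." "$dest/"
}

# Usage at the end of a job script (paths are placeholders):
#   save_results /path/on/beeond/results /path/to/workspace/results
```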

== Backup and Archiving ==

There are regular backups of all data in the home directories; however, ACLs and extended attributes are
not backed up.

Please open a ticket if you need data restored from the backup.

[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]

BwUniCluster2.0/Hardware and Architecture

2023-01-05T13:33:46Z

S Raffeiner: /* Components of bwUniCluster */

= Architecture of bwUniCluster 2.0 =

The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the Lustre file system is attached to bwUniCluster 2.0 by coupling the InfiniBand of the file server with the InfiniBand
switch of the compute cluster; it provides a fast and scalable
parallel file system.

The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages such as SLURM have been installed on top. Some of these components are of special interest to end users and are briefly
discussed in this document. Others which are of greater importance to system
administrators will not be covered by this document.

The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user's point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.

'''Login Nodes'''

The login nodes are the only nodes that are directly accessible by end users. These nodes
are used for interactive login, file management, program development and interactive pre-
and postprocessing. Two nodes are dedicated to this service; they are all accessible via
one address, and a DNS round-robin alias distributes the login sessions across the
different login nodes.

'''Compute Nodes'''

The majority of nodes are compute nodes which are managed by a batch system. Users
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).

'''File Server Nodes'''

The hardware of the parallel file system Lustre incorporates some file server nodes; the file
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter "File Systems").

'''Administrative Server Nodes'''

Some other nodes are delivering additional services like resource management, external
network connection, administration etc. These nodes can be accessed directly by system administrators only.

= Components of bwUniCluster =

{| class="wikitable"
|-
! style="width:9%"|
! style="width:13%"| Compute nodes "Thin"
! style="width:13%"| Compute nodes "HPC"
! style="width:13%"| Compute nodes "IceLake"
! style="width:13%"| Compute nodes "Fat"
! style="width:13%"| GPU x4
! style="width:13%"| GPU x8
! style="width:13%"| IceLake + GPU x4
! style="width:13%"| Login
|-
!scope="column"| Number of nodes
| 100 + 60
| 360
| 272
| 6
| 14
| 10
| 15
| 4
|-
!scope="column"| Processors
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Platinum 8358
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Gold 6248
| Intel Xeon Platinum 8358
|-
!scope="column"| Number of sockets
| 2
| 2
| 2
| 4
| 2
| 2
| 2
| 2
|-
!scope="column"| Processor frequency (GHz)
| 2.1 GHz
| 2.1 GHz
| 2.6 GHz
| 2.1 GHz
| 2.1 GHz
| 2.6 GHz
| 2.5 GHz
|
|-
!scope="column"| Total number of cores
| 40
| 40
| 64
| 80
| 40
| 40
| 64
| 40
|-
!scope="column"| Main memory
| 96 GB / 192 GB
| 96 GB
| 256 GB
| 3 TB
| 384 GB
| 768 GB
| 512 GB
| 384 GB
|-
!scope="column"| Local SSD
| 960 GB SATA
| 960 GB SATA
| 1.8 TB NVMe
| 4.8 TB NVMe
| 3.2 TB NVMe
| 15 TB NVMe
| 6.4 TB NVMe
|
|-
!scope="column"| Accelerators
| -
| -
| -
| -
| 4x NVIDIA Tesla V100
| 8x NVIDIA Tesla V100
| 4x NVIDIA A100 / 4x NVIDIA H100
|-
!scope="column"| Interconnect
| IB HDR100 (blocking)
| IB HDR100
| IB HDR200
| IB HDR
| IB HDR
| IB HDR
| IB HDR200
| IB HDR100 (blocking)
|}
Table 1: Properties of the nodes

= File Systems =

On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems use Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its path. Users can create so-called workspaces on another Lustre file system for non-permanent data with a limited lifetime. An additional workspace file system based on flash storage is available for special requirements.

Within a batch job further file systems are available:
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.

Some of the characteristics of the file systems are shown in Table 2.

{| style="width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px"
|- style="width:20%;height=20px; text-align:left;padding:3px"
! style="background-color:#AAA;padding:3px"| Property
! style="background-color:#AAA;padding:3px"| $TMP
! style="background-color:#AAA;padding:3px"| BeeOND
! style="background-color:#AAA;padding:3px"| $HOME
! style="background-color:#AAA;padding:3px"| Workspace
! style="background-color:#AAA;padding:3px"| Workspace on flash
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Visibility
| style="height=20px; text-align:left;padding:3px"| local node
| style="height=20px; text-align:left;padding:3px"| nodes of batch job
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Lifetime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| permanent
| style="height=20px; text-align:left;padding:3px"| max. 240 days
| style="height=20px; text-align:left;padding:3px"| max. 240 days
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Disk space
| style="height=20px; text-align:left;padding:3px"| 960 GB - 6.4 TB (for details see Table 1)
| style="height=20px; text-align:left;padding:3px"| n*750 GB
| style="height=20px; text-align:left;padding:3px"| 1.2 PiB
| style="height=20px; text-align:left;padding:3px"| 4.1 PiB
| style="height=20px; text-align:left;padding:3px"| 236 TiB
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Capacity Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes 1 TiB per user also per organization
| style="height=20px; text-align:left;padding:3px"| yes 40 TiB per user
| style="height=20px; text-align:left;padding:3px"| yes 1 TiB per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Inode Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes 10 million per user
| style="height=20px; text-align:left;padding:3px"| yes 30 million per user
| style="height=20px; text-align:left;padding:3px"| yes 5 million per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Backup
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Read perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 6 GB/s<br>depends on type of local SSD / job queue:<br>520 MB/s @ single / multiple<br>800 MB/s @ multiple_e<br>6600 MB/s @ fat<br>6500 MB/s @ gpu_4<br>6500 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 400 MB/s - 500 MB/s<br>depends on type of local SSDs / job queue:<br>500 MB/s @ multiple<br>400 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Write perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 4 GB/s<br>depends on type of local SSD / job queue:<br>500 MB/s @ single / multiple<br>650 MB/s @ multiple_e<br>2900 MB/s @ fat<br>2090 MB/s @ gpu_4<br>4060 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 250 MB/s - 350 MB/s<br>depends on type of local SSDs / job queue:<br>350 MB/s @ multiple<br>250 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total read perf.
| style="height=20px; text-align:left;padding:3px"| n*500-6000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*400-500 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
| style="height=20px; text-align:left;padding:3px"| 45 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total write perf.
| style="height=20px; text-align:left;padding:3px"| n*500-4000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*250-350 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
| style="height=20px; text-align:left;padding:3px"| 38 GB/s
|}
---------------------------------------------------------------------------------------------------------
global: all nodes of UniCluster access the same file system;
local: each node has its own file system;
permanent: files are stored permanently;
batch job: files are removed at end of the batch job.
---------------------------------------------------------------------------------------------------------
Table 2: Properties of the file systems

== Selecting the appropriate file system ==

In general, you should separate your data and store it on the appropriate file system.
Permanently needed data like software or important results should be stored below $HOME
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME
you can usually restore it from backup. Permanent data which is not needed for months
or exceeds the capacity restrictions should be sent to the LSDF Online Storage
or to the archive and deleted from the file systems. Temporary data which is only needed on a single
node and which does not exceed the disk space shown in the table above should be stored
below $TMP. Data which is read many times on a single node, e.g. if you are doing AI training,
should be copied to $TMP and read from there. Temporary data which is used from many nodes
of your batch job and which is only needed during job runtime should be stored on a
parallel on-demand file system. Temporary data which can be recomputed or which is the
result of one job and input for another job should be stored in workspaces. The lifetime
of data in workspaces is limited and depends on the lifetime of the workspace which can be
several months.

For further details please check the chapters below.

== $HOME ==

The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.
You have access to your home directory from all nodes of uc2. A regular backup of these directories
to tape archive is done automatically. The directory $HOME is used to hold those files that are
permanently used like source codes, configuration files, executable programs etc.

On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) $HOME
</pre>
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:
<pre>
lfs quota -ph $(grep $(echo $HOME | sed -e "s|/[^/]*/[^/]*$||") /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME
</pre>
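The one-line command above can be hard to read. The following sketch splits it into two steps. It assumes that each line of the ID table has the form "<id> <directory>", which is what the <code>cut</code> call in the original command implies; the table contents and the home path used in the demo are made up for illustration.

```shell
#!/bin/bash
# Step 1: strip the last two path components of a home directory path.
# This mirrors the sed expression used in the command above.
project_dir() {
    printf '%s\n' "$1" | sed -e 's|/[^/]*/[^/]*$||'
}

# Step 2: look up the project ID for that directory in an ID table.
# Assumed line format: "<id> <directory>" (ID first, separated by one space).
project_id() {
    grep -- "$(project_dir "$1")" "$2" | cut -f 1 -d ' '
}

# Demo with a hypothetical table and a hypothetical home path.
ids_file=$(mktemp)
printf '%s\n' '1001 /pfs/data5/home/aa' '1002 /pfs/data5/home/bb' > "$ids_file"
project_dir /pfs/data5/home/bb/inst/user1              # prints /pfs/data5/home/bb
project_id /pfs/data5/home/bb/inst/user1 "$ids_file"   # prints 1002
```

On the cluster the result would be passed to <code>lfs quota -ph</code> together with $HOME, exactly as in the original command.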

== Workspaces ==

On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel.

Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.
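
The 240 day limit follows from the initial allocation plus three renewals of 60 days each. A minimal sketch of that arithmetic (the function name is only illustrative and not part of the workspace tools):

```shell
#!/bin/bash
# Maximum total lifetime = days per period * (initial period + number of renewals).
max_lifetime() {
    days_per_period=$1
    renewals=$2
    echo $(( days_per_period * (1 + renewals) ))
}

max_lifetime 60 3   # prints 240
```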

Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.

On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) /pfs/work7
</pre>
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).

=== Reminder for workspace deletion ===

Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:

$ ws_send_ical.sh <workspace> <email>

== Improving Performance on $HOME and workspaces ==

The following recommendations might help to improve throughput and metadata
performance on Lustre filesystems.

=== '''Improving Throughput Performance''' ===

Depending on your application some adaptations might be necessary if you want to reach
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also mentioned as chunks) is called stripe size and the number of used storage subsystems is called stripe count.

When you are designing your application you should consider that the performance of
parallel filesystems is generally better if data is transferred in large blocks and stored in
few large files. In more detail, to increase the throughput performance of a parallel application
the following aspects should be considered:

* collect large chunks of data and write them sequentially at once,

* to exploit complete filesystem bandwidth use several clients,

* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),

* if files are small enough for the SSDs and are only used by one process store them on $TMP.

With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted automatically as the file size grows. In normal cases users no longer need to adapt file striping parameters, neither for very large files nor to reach better performance.

If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command
<pre>
$ lfs setstripe -c-1 $HOME/my_output_dir
</pre>
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this
directory is not changed. If you want to change the stripe count of existing files, change
the stripe count of the parent directory, copy the files to new files, remove the old files
and move the new files back to the old name. In order to check the stripe setting of
the file my_file use
<pre>
$ lfs getstripe my_file
</pre>
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the
backup, i.e. if directories have to be recreated this information is lost and the default stripe
count will be used. Therefore, you should annotate for which directories you made changes
to the striping parameters so that you can repeat these changes if required.
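
The copy, remove and move procedure for existing files described above can be sketched as a small helper. This is only an illustration under the assumption that the directory default has already been changed with <code>lfs setstripe</code>; the helper name and the temporary file suffix are made up, and the sketch does not handle hard links or files that are open in other processes.

```shell
#!/bin/bash
# Rewrite a file so that the new copy is created with the directory's
# current default striping. The data itself is unchanged.
restripe_file() {
    file=$1
    tmp="${file}.restripe.$$"             # temporary name in the same directory
    cp -p -- "$file" "$tmp" || return 1   # new file inherits the directory layout
    mv -f -- "$tmp" "$file"               # replaces the old file under the old name
}

# Intended use on a Lustre file system (not executed here):
#   lfs setstripe -c-1 "$HOME/my_output_dir"
#   restripe_file "$HOME/my_output_dir/my_file"
#   lfs getstripe "$HOME/my_output_dir/my_file"
```

Using <code>mv</code> onto the old name combines the "remove" and "move back" steps, so the old file is only replaced once the copy has succeeded.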

=== '''Improving Metadata Performance''' ===

Metadata performance on parallel file systems is usually not as good as with local
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,
you should avoid metadata operations whenever possible. For example, it is much better
to have few large files than lots of small files. In more detail, to increase the metadata
performance of a parallel application the following aspects should be considered:

* avoid creating many small files,

* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,

* if many small files are only used within a batch job and accessed by one process store them on $TMP,

* change the default colorization setting of the command ls (see below).
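
As an illustration of the second point, each task can create its output in its own subdirectory. The sketch below assumes the tasks are started by Slurm, which sets SLURM_PROCID to the rank of the task; the helper name and the directory naming scheme are hypothetical.

```shell
#!/bin/bash
# Create (if needed) and print a per-task output directory below a base directory.
# Falls back to rank 0 when SLURM_PROCID is not set, e.g. outside of srun.
task_dir() {
    base=$1
    dir="${base}/task_${SLURM_PROCID:-0}"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Example inside a task (the workspace path is only a placeholder):
#   out=$(task_dir /path/to/my/workspace/output)
#   ./my_program > "$out/result.dat"
```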

On modern Linux systems, the GNU ls command often uses colorization by default to
visually highlight the file type; this is especially true if the command is run within a terminal
session. This is because the default shell profile initializations usually contain an alias
directive similar to the following for the ls command:
<pre>
$ alias ls="ls --color=tty"
</pre>
However, running the ls command in this way for files on a Lustre file system requires
a stat() call to be used to determine the file type. This can result in a performance
overhead, because the stat() call always needs to determine the size of a file, and that
in turn means that the client node must query the object size of all the backing objects
that make up a file. As a result of the default colorization setting, running a simple
ls command on a Lustre file system often takes as much time as running the ls command
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option
that requires information from a stat() call, is used). To avoid this performance overhead
when using ls commands, add an alias directive similar to the following
to your shell startup script:
<pre>
$ alias ls="ls --color=never"
</pre>
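One possible way to add the alias without duplicating it on repeated runs is sketched below. The helper name is made up, and the choice of startup file (for example ~/.bashrc for bash) is an assumption that depends on your shell.

```shell
#!/bin/bash
# Append the alias line to a startup file unless the exact line is already present.
add_ls_alias() {
    rc=$1
    line='alias ls="ls --color=never"'
    grep -qxF -- "$line" "$rc" 2>/dev/null || printf '%s\n' "$line" >> "$rc"
}

# Example (assuming bash):
#   add_ls_alias "$HOME/.bashrc"
```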

== Workspaces on flash storage ==

There is another workspace file system for special requirements available. The file system is called ''full flash pfs'' and is based on the parallel file system Lustre.

=== Advantages of this file system ===

# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.

=== Access restrictions ===

Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.

=== Using the file system ===

After access is granted, you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option ''-F'' to all the commands that manage workspaces. On bwUniCluster 2.0 it is called ''ffuc'', on HoreKa it is ''ffhk''. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:
ws_allocate -F ffuc myws 60

If you want to use the full flash pfs on bwUniCluster 2.0 '''and''' HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.

Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TB capacity and 5 million inodes per user. You can check your current usage with
lfs quota -uh $(whoami) /pfs/work8

== $TMP ==

While all tasks of a parallel application access the same $HOME and workspace directory, the
$TMP directory is local to each node on bwUniCluster 2.0. All nodes have fast local SSD
storage devices which are used to store data below $TMP. Different tasks of a parallel
application use different $TMP directories when they do not run on the same node. This directory should
be used for temporary files being accessed by single tasks. It should also be used if you read the
same data many times from a single node, e.g. if you are doing AI training. In this case you should
copy the data at the beginning of your batch job to $TMP and read the data from there.
In addition, this directory should be used for the installation
of software packages. This means that the software package to be installed should be
unpacked, compiled and linked in a subdirectory of $TMP. The real installation of the
package (e.g. make install) should be made in(to) the Lustre filesystem.

Each time a batch job is started, a subdirectory is created on each node and assigned to the job.
$TMP is newly set and the name of the subdirectory contains the Job-id so that the
subdirectory name is unique for each job. This unique name is then assigned to the
environment variable $TMP within the job. At the end of the job the subdirectory is removed.
Although $TMP points to the same path name for different nodes of a job, the physical location
on these nodes is different.
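
Copying input data to $TMP at the beginning of a batch job, as recommended above for data that is read many times, could look like the following sketch. The helper name and the example paths are hypothetical; the fallback to /tmp is only there so that the function also works outside of a batch job.

```shell
#!/bin/bash
# Copy a directory into $TMP and print the location of the local copy.
stage_to_tmp() {
    src=$1
    dest="${TMP:-/tmp}/$(basename "$src")"
    mkdir -p "$dest" || return 1
    cp -r "$src"/. "$dest"/ || return 1
    printf '%s\n' "$dest"
}

# Example inside a job script (paths are placeholders):
#   data=$(stage_to_tmp /path/to/my/workspace/training_data)
#   python train.py --data "$data"
```

For a multi-node job the copy has to run once on every node, since each node has its own $TMP.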

== LSDF Online Storage==

In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and datamover nodes.
Furthermore it can be used on the compute nodes during the job runtime with the constraint flag "LSDF" ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].
<pre>
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=LSDF
</pre>
 
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
Please request storage projects in the LSDF Online Storage separately:
[https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request].

==BeeOND (BeeGFS On-Demand)==

Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.

'''IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.'''

BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out.
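
A job script can make the copy back to a global file system part of the same step that runs the application. The sketch below takes the on-demand directory and the destination as parameters, because the actual mount point is described on the linked Slurm page and is not assumed here; both helper names are hypothetical.

```shell
#!/bin/bash
# Copy everything from the on-demand directory to a destination directory.
copy_back() {
    mkdir -p "$2" && cp -r "$1"/. "$2"/
}

# Run a command, then copy the results back and return the command's exit status.
run_then_copy_back() {
    scratch=$1
    dest=$2
    shift 2
    "$@"
    status=$?
    copy_back "$scratch" "$dest" || return 1
    return "$status"
}

# Example (all paths are placeholders):
#   run_then_copy_back /path/to/ondemand/fs /path/to/my/workspace/results ./my_program
```

Note that the copy step is not reached if the job is killed at its time limit, so the requested runtime should leave enough margin for copying.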

For detailed usage see here:[[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]

==Backup and Archiving==

There are regular backups of all data of the home directories, whereas ACLs and extended attributes will
not be backed up.

Please open a ticket if you need data restored from backup.

[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]

BwUniCluster2.0/Hardware and Architecture

2023-01-05T13:09:51Z

S Raffeiner: /* Components of bwUniCluster */

= Architecture of bwUniCluster 2.0 =

The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition the file system
Lustre, that is connected by coupling the InfiniBand of the file server with the InfiniBand
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable
parallel file system.

The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages such as SLURM have been installed on top. Some of these components are of special interest to end users and are briefly
discussed in this document. Others which are of greater importance to system
administrators will not be covered by this document.

The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user's point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.

'''Login Nodes'''

The login nodes are the only nodes that are directly accessible by end users. These nodes
are used for interactive login, file management, program development and interactive pre-
and postprocessing. Two nodes are dedicated to this service but they are all accessible via
one address and a DNS round-robin alias distributes the login sessions to the
different login nodes.

'''Compute Node'''

The majority of nodes are compute nodes which are managed by a batch system. Users
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).

'''File Server Nodes'''

The hardware of the parallel file system Lustre incorporates some file server nodes; the file
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter "File Systems").

'''Administrative Server Nodes'''

Some other nodes are delivering additional services like resource management, external
network connection, administration etc. These nodes can be accessed directly by system administrators only.

= Components of bwUniCluster =

{| class="wikitable"
|-
! style="width:9%"|
! style="width:13%"| Compute nodes "Thin"
! style="width:13%"| Compute nodes "HPC"
! style="width:13%"| Compute nodes "IceLake"
! style="width:13%"| Compute nodes "Fat"
! style="width:13%"| GPU x4
! style="width:13%"| GPU x8
! style="width:13%"| IceLake + GPU x4
! style="width:13%"| Login
|-
!scope="column"| Number of nodes
| 100 + 60
| 360
| 272
| 6
| 14
| 10
| 15
| 4
|-
!scope="column"| Processors
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Platinum 8358
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Gold 6248
| Intel Xeon Platinum 8358
|-
!scope="column"| Number of sockets
| 2
| 2
| 2
| 4
| 2
| 2
| 2
| 2
|-
!scope="column"| Processor frequency (GHz)
| 2.1 GHz
| 2.1 GHz
| 2.6 GHz
| 2.1 GHz
| 2.1 GHz
| 2.6 GHz
| 2.5 GHz
|
|-
!scope="column"| Total number of cores
| 40
| 40
| 64
| 80
| 40
| 40
| 64
| 40
|-
!scope="column"| Main memory
| 96 GB / 192 GB
| 96 GB
| 256 GB
| 3 TB
| 384 GB
| 768 GB
| 512 GB
| 384 GB
|-
!scope="column"| Local SSD
| 960 GB SATA
| 960 GB SATA
| 1.8 TB NVMe
| 4.8 TB NVMe
| 3.2 TB NVMe
| 15 TB NVMe
| 6.4 TB NVMe
|
|-
!scope="column"| Accelerators
| -
| -
| -
| -
| 4x NVIDIA Tesla V100
| 8x NVIDIA Tesla V100
| 4x NVIDIA A100 / 4x NVIDIA H100
|-
!scope="column"| Interconnect
| IB HDR100 (blocking)
| IB HDR100
| IB HDR200
| IB HDR
| IB HDR
| IB HDR
| IB HDR200
| IB HDR100 (blocking)
|}
Table 1: Properties of the nodes

= File Systems =

On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime. There is another workspace file system based on flash storage for special requirements available.

Within a batch job further file systems are available:
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.
* On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.

Some of the characteristics of the file systems are shown in Table 2.

{| style="width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px"
|- style="width:20%;height=20px; text-align:left;padding:3px"
! style="background-color:#AAA;padding:3px"| Property
! style="background-color:#AAA;padding:3px"| $TMP
! style="background-color:#AAA;padding:3px"| BeeOND
! style="background-color:#AAA;padding:3px"| $HOME
! style="background-color:#AAA;padding:3px"| Workspace
! style="background-color:#AAA;padding:3px"| Workspace on flash
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Visibility
| style="height=20px; text-align:left;padding:3px"| local node
| style="height=20px; text-align:left;padding:3px"| nodes of batch job
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Lifetime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| permanent
| style="height=20px; text-align:left;padding:3px"| max. 240 days
| style="height=20px; text-align:left;padding:3px"| max. 240 days
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Disk space
| style="height=20px; text-align:left;padding:3px"| 960 GB - 6.4 TB details see table 1
| style="height=20px; text-align:left;padding:3px"| n*750 GB
| style="height=20px; text-align:left;padding:3px"| 1.2 PiB
| style="height=20px; text-align:left;padding:3px"| 4.1 PiB
| style="height=20px; text-align:left;padding:3px"| 236 TiB
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Capacity Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes 1 TiB per user also per organization
| style="height=20px; text-align:left;padding:3px"| yes 40 TiB per user
| style="height=20px; text-align:left;padding:3px"| yes 1 TiB per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Inode Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes 10 million per user
| style="height=20px; text-align:left;padding:3px"| yes 30 million per user
| style="height=20px; text-align:left;padding:3px"| yes 5 million per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Backup
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Read perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 6 GB/s, depending on the type of local SSD / job queue: 520 MB/s @ single / multiple; 800 MB/s @ multiple_e; 6600 MB/s @ fat; 6500 MB/s @ gpu_4; 6500 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 400 MB/s - 500 MB/s, depending on the type of local SSDs / job queue: 500 MB/s @ multiple; 400 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Write perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 4 GB/s, depending on the type of local SSD / job queue: 500 MB/s @ single / multiple; 650 MB/s @ multiple_e; 2900 MB/s @ fat; 2090 MB/s @ gpu_4; 4060 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 250 MB/s - 350 MB/s, depending on the type of local SSDs / job queue: 350 MB/s @ multiple; 250 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total read perf.
| style="height=20px; text-align:left;padding:3px"| n*500-6000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*400-500 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
| style="height=20px; text-align:left;padding:3px"| 45 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total write perf.
| style="height=20px; text-align:left;padding:3px"| n*500-4000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*250-350 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
| style="height=20px; text-align:left;padding:3px"| 38 GB/s
|}
---------------------------------------------------------------------------------------------------------
global: all nodes of UniCluster access the same file system;
local: each node has its own file system;
permanent: files are stored permanently;
batch job: files are removed at end of the batch job.
---------------------------------------------------------------------------------------------------------
Table 2: Properties of the file systems
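
The BeeOND column of Table 2 scales with the number of nodes n of the batch job. The following sketch only evaluates the table's formulas (n*750 GB capacity, n*400-500 MB/s read, n*250-350 MB/s write) for a given n; the function name and output format are made up, and real values depend on the node type.

```shell
#!/bin/bash
# Evaluate the per-job BeeOND figures from Table 2 for n nodes.
beeond_estimate() {
    n=$1
    echo "capacity_gb=$(( n * 750 ))"
    echo "read_mbs=$(( n * 400 ))-$(( n * 500 ))"
    echo "write_mbs=$(( n * 250 ))-$(( n * 350 ))"
}

beeond_estimate 4
# prints:
#   capacity_gb=3000
#   read_mbs=1600-2000
#   write_mbs=1000-1400
```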

== Selecting the appropriate file system ==

In general, you should separate your data and store it on the appropriate file system.
Permanently needed data like software or important results should be stored below $HOME
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME
you can usually restore it from backup. Permanent data which is not needed for months
or exceeds the capacity restrictions should be sent to the LSDF Online Storage
or to the archive and deleted from the file systems. Temporary data which is only needed on a single
node and which does not exceed the disk space shown in the table above should be stored
below $TMP. Data which is read many times on a single node, e.g. if you are doing AI training,
should be copied to $TMP and read from there. Temporary data which is used from many nodes
of your batch job and which is only needed during job runtime should be stored on a
parallel on-demand file system. Temporary data which can be recomputed or which is the
result of one job and input for another job should be stored in workspaces. The lifetime
of data in workspaces is limited and depends on the lifetime of the workspace which can be
several months.

For further details please check the chapters below.

== $HOME ==

The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.
You have access to your home directory from all nodes of uc2. A regular backup of these directories
to tape archive is done automatically. The directory $HOME is used to hold those files that are
permanently used like source codes, configuration files, executable programs etc.

On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) $HOME
</pre>
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:
<pre>
lfs quota -ph $(grep $(echo $HOME | sed -e "s|/[^/]*/[^/]*$||") /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME
</pre>

== Workspaces ==

On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel.

Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.

Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.

On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) /pfs/work7
</pre>
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).

=== Reminder for workspace deletion ===

Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:

$ ws_send_ical.sh <workspace> <email>

== Improving Performance on $HOME and workspaces ==

The following recommendations might help to improve throughput and metadata
performance on Lustre filesystems.

=== '''Improving Throughput Performance''' ===

Depending on your application some adaptations might be necessary if you want to reach
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also mentioned as chunks) is called stripe size and the number of used storage subsystems is called stripe count.

When you are designing your application you should consider that the performance of
parallel filesystems is generally better if data is transferred in large blocks and stored in
few large files. In more detail, to increase the throughput performance of a parallel application
the following aspects should be considered:

* collect large chunks of data and write them sequentially at once,

* to exploit complete filesystem bandwidth use several clients,

* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),

* if files are small enough for the SSDs and are only used by one process store them on $TMP.

With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted automatically as the file size grows. In normal cases users no longer need to adapt file striping parameters, neither for very large files nor to reach better performance.

If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command
<pre>
$ lfs setstripe -c-1 $HOME/my_output_dir
</pre>
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this
directory is not changed. If you want to change the stripe count of existing files, change
the stripe count of the parent directory, copy the files to new files, remove the old files
and move the new files back to the old name. In order to check the stripe setting of
the file my_file use
<pre>
$ lfs getstripe my_file
</pre>
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the
backup, i.e. if directories have to be recreated this information is lost and the default stripe
count will be used. Therefore, you should annotate for which directories you made changes
to the striping parameters so that you can repeat these changes if required.

=== '''Improving Metadata Performance''' ===

Metadata performance on parallel file systems is usually not as good as with local
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,
you should avoid metadata operations whenever possible. For example, it is much better
to have few large files than lots of small files. In more detail, to increase the metadata
performance of a parallel application the following aspects should be considered:

* avoid creating many small files,

* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,

* if many small files are only used within a batch job and accessed by one process store them on $TMP,

* change the default colorization setting of the command ls (see below).

On modern Linux systems, the GNU ls command often uses colorization by default to
visually highlight the file type; this is especially true if the command is run within a terminal
session. This is because the default shell profile initializations usually contain an alias
directive similar to the following for the ls command:
<pre>
$ alias ls="ls --color=tty"
</pre>
However, running the ls command in this way on files in a Lustre file system requires
a stat() call to determine the file type. This can cause a performance
overhead, because the stat() call always needs to determine the size of a file, which
in turn means that the client node must query the object size of all the backing objects
that make up the file. As a result of the default colorization setting, running a simple
ls command on a Lustre file system often takes as much time as running ls
with the -l option (the same is true if the -F, -p or --classify option, or any other option
that requires information from a stat() call, is used). To avoid this performance overhead
when using ls, add an alias directive similar to the following
to your shell startup script:
<pre>
alias ls="ls --color=never"
</pre>
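
The advice to prefer a few large files over many small ones can be illustrated with a minimal sketch: the small files are kept in a single tar archive on the parallel file system and unpacked below $TMP inside the batch job. The function names pack_dir and unpack_to_tmp are hypothetical, and the fallback to /tmp is only an assumption for use outside a batch job.

```shell
#!/bin/bash
# pack_dir SRC_DIR ARCHIVE: store all files below SRC_DIR in one tar archive,
# so the parallel file system only has to handle a single large file.
pack_dir() {
    tar -cf "$2" -C "$1" .
}

# unpack_to_tmp ARCHIVE: unpack the archive below $TMP (node-local inside a
# batch job) and print the directory that now holds the small files.
unpack_to_tmp() {
    local dest
    dest="$(mktemp -d "${TMP:-/tmp}/unpacked.XXXXXX")" || return 1
    tar -xf "$1" -C "$dest" && printf '%s\n' "$dest"
}
```

All further accesses to the small files then go to the local storage and cause no metadata operations on Lustre.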

== Workspaces on flash storage ==

Another workspace file system is available for special requirements. It is called ''full flash pfs'' and is based on the parallel file system Lustre.

=== Advantages of this file system ===

# All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.
# The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.

=== Access restrictions ===

Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.

=== Using the file system ===

After access is granted, you can use the file system in the same way as a normal workspace. You only have to pass the name of the flash-based workspace file system with the option ''-F'' to all commands that manage workspaces. On bwUniCluster 2.0 it is called ''ffuc'', on HoreKa it is ''ffhk''. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0, execute:
<pre>
$ ws_allocate -F ffuc myws 60
</pre>

If you want to use the full flash pfs on bwUniCluster 2.0 '''and''' HoreKa at the same time, please note that you only need to manage a particular workspace on one of the clusters, since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.

Other features are similar to normal workspaces. For example, expired workspaces can be restored for a few weeks; to request a restore, please open a ticket. There are quota limits with a default of 1 TB capacity and 5 million inodes per user. You can check your current usage with
<pre>
$ lfs quota -uh $(whoami) /pfs/work8
</pre>

== $TMP ==

While all tasks of a parallel application access the same $HOME and workspace directory, the
$TMP directory is local to each node on bwUniCluster 2.0. All nodes have fast local SSD
storage devices, which are used to store data below $TMP. Tasks of a parallel
application that run on different nodes therefore use different $TMP directories. This directory should
be used for temporary files accessed by single tasks. It should also be used if you read the
same data many times from a single node, e.g. for AI training. In this case,
copy the data to $TMP at the beginning of your batch job and read it from there.
In addition, this directory should be used when installing
software packages: the software package to be installed should be
unpacked, compiled and linked in a subdirectory of $TMP, while the actual installation of the
package (e.g. make install) should go into the Lustre file system.

Each time a batch job is started, a subdirectory is created on each node and assigned to the job.
The name of the subdirectory contains the job ID, so it is unique for each job, and the
environment variable $TMP is set to this name within the job.
At the end of the job the subdirectory is removed.
Although $TMP points to the same path name on all nodes of a job, the physical location
differs from node to node.
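
The staging pattern described in this section can be sketched as follows. The sketch is a minimal single-node example under stated assumptions: stage_in and stage_out are hypothetical helper names, the #SBATCH values are placeholders, and the directory and program names in the comments are examples only.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=01:00:00

# stage_in SRC_DIR: copy the input data set once to node-local $TMP.
stage_in() {
    mkdir -p "$TMP/input" && cp -r -- "$1"/. "$TMP/input/"
}

# stage_out DEST_DIR: copy results back to a global file system before the
# job ends, because the job's $TMP subdirectory is removed afterwards.
stage_out() {
    mkdir -p "$1" && cp -r -- "$TMP/output"/. "$1/"
}

# Typical order inside the job (paths and program name are placeholders):
#   stage_in  "$HOME/my_dataset"
#   mkdir -p  "$TMP/output"
#   ./my_training --data "$TMP/input" --out "$TMP/output"
#   stage_out "$HOME/my_results"
```

For jobs that span several nodes, the copy to $TMP would have to be repeated on every node, since each node has its own $TMP.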

== LSDF Online Storage ==

In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. The LSDF Online Storage is therefore mounted on the login and datamover nodes.
It can also be used on the compute nodes during job runtime by requesting the constraint "LSDF" (see [[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).
There is also an example of LSDF usage in batch jobs: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].
<pre>
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=LSDF
</pre>
 
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
Please request storage projects in the LSDF Online Storage separately:
[https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request].
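
Since the LSDF variables are only useful where the storage is actually available, a job script can check them before touching any LSDF path. The following is a minimal sketch; require_lsdf_env is a hypothetical helper name, and it is an assumption that the variables are unset or empty when the LSDF Online Storage is not available on a node.

```shell
#!/bin/bash
#SBATCH --constraint=LSDF

# require_lsdf_env: return non-zero and name the first variable that is
# unset or empty, so the job can stop before using any LSDF path.
require_lsdf_env() {
    local name
    for name in LSDF LSDFPROJECTS LSDFHOME; do
        if [ -z "${!name}" ]; then
            echo "LSDF variable \$$name is not set on this node" >&2
            return 1
        fi
    done
}
```

In a job script the function would typically be called once at the top, e.g. followed by "|| exit 1".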

==BeeOND (BeeGFS On-Demand)==

Users of bwUniCluster 2.0 can request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged after the job ends.

'''IMPORTANT: All data on the private file system will be deleted after your job. Make sure you copy your data back to a global file system, e.g. $HOME or a workspace, within the job.'''

BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out.

For detailed usage see [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]].

==Backup and Archiving==

All data in the home directories is backed up regularly; ACLs and extended attributes, however, are
not backed up.

Please open a ticket if you need data restored from the backup.

[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]

BwUniCluster2.0

2022-11-25T14:57:06Z

S Raffeiner:

[[File:BwUniCluster_2.0_Feb2020_1024x423.jpg|right|frameless|thumb|alt=bwUniCluster2.0 |upright=1| bwUniCluster 2.0 ]]


The '''bwUniCluster 2.0''' is the joint high-performance computer system of Baden-Württemberg's Universities and Universities of Applied Sciences for '''general purpose and teaching'''. It is located at the Steinbuch Centre for Computing (SCC) at Karlsruhe Institute of Technology (KIT). The bwUniCluster 2.0 complements the four bwForClusters and their dedicated scientific areas.

{| style="background:#FFCCCC; width:100%;"
| '''The following issue is known:''' Due to the hardware configuration, there is a known problem with OpenMPI on the nodes in the "multiple_il" partition. When an MPI application starts, the warning "No OpenFabrics connection schemes reported" appears and refers to the device "mlx5_2". This device is an Ethernet port, which is not supposed to be used by OpenMPI. The warning is purely informational; we are working on suppressing it.
|}


{| style=" background:#FEF4AB; width:100%;"
| style="padding:8px; background:#FFE856; font-size:120%; font-weight:bold; text-align:left" | News
|-
|
* 2022-11-25: Most of the new nodes from bwUniCluster 2.0 Stage 2 are now available.
|}

{| style=" background:#eeeefe; width:100%;"
| style="padding:8px; background:#dedefe; font-size:120%; font-weight:bold; text-align:left" | Training & Support
|-
|
* [[BwUniCluster2.0/First_Steps|Getting Started]]
* [https://training.bwhpc.de E-Learning Courses]
* [[BwUniCluster2.0/Support|Support]]
* [[BwUniCluster2.0/FAQ|FAQ]]
* Send [[:Category:Feedback|Feedback]] about Wiki pages
|}

{| style=" background:#deffee; width:100%;"
| style="padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left" | User Documentation
|-
|
* Access: [[Registration/bwUniCluster|Registration]], [[Registration/Deregistration|Deregistration]], [[BwUniCluster2.0/Jupyter|Using Jupyter]]
* [[BwUniCluster2.0/Login|Login]]
* [[BwUniCluster2.0/Hardware_and_Architecture|Hardware and Architecture]]
** [[BwUniCluster2.0/Hardware_and_Architecture#File_Systems|File Systems and Workspaces]]
* [[BwUniCluster2.0/Software|Cluster Specific Software]]
** [[BwUniCluster2.0/Containers|Using Containers]]
* [[BwUniCluster2.0/Slurm|Batch System]]
** [[BwUniCluster2.0/Batch_Queues|Queues and interactive Jobs]]
* [[BwUniCluster2.0/Maintenance|Operational Changes]]
|}

{| style=" background:#deffee; width:100%;"
| style="padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left" | Cluster Funding
|-
|
* Please [[BwUniCluster2.0/Acknowledgement|acknowledge]] bwUniCluster 2.0 in your publications.
|}

BwUniCluster2.0/Maintenance/2022-11

2022-11-25T14:53:30Z

S Raffeiner:

The following changes were introduced during the maintenance interval between 07.11.2022 (Monday) 08:00 and 10.11.2022 (Thursday) 17:00.

The host keys of the system have not changed. You should not receive any warnings from your SSH client(s). If a warning does appear, or if you want to check that you are connecting to the correct system, you can verify the key hashes against the following list:

{|class="wikitable"
! Algorithm
! Hash (SHA256)
! Hash (MD5)
|-
|RSA
|p6Ion2YKZr5cnzf6L6DS1xGnIwnC1BhLbOEmDdp7FA0
|59:2a:67:44:4a:d7:89:6c:c0:0d:74:ba:3c:c4:63:6d
|-
|ECDSA
|k8l1JnfLf1y1Qi55IQmo11+/NZx06Rbze7akT5R7tE8
|85:d4:d9:97:e0:f0:43:30:6e:66:8e:d0:b6:9b:85:d1
|-
|ED25519
|yEe5nJ5hZZ1YbgieWr+phqRZKYbrV7zRe8OR3X03cn0
|42:d2:0d:ab:87:48:fc:1d:5d:b3:7c:bf:22:c3:5f:b7
|}
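
One way to compare a host key with the hashes listed above is to fetch the key with ssh-keyscan and print its fingerprint with ssh-keygen. The following is a minimal sketch, not an official procedure: fingerprint_of and check_host_key are hypothetical helper names, and it assumes an OpenSSH version whose ssh-keygen can read keys from standard input (-f -).

```shell
#!/bin/bash
# fingerprint_of LINE: extract the bare hash from one line of
# "ssh-keygen -l" output, e.g. "256 SHA256:<hash> host (ED25519)".
fingerprint_of() {
    awk '{ sub(/^(SHA256|MD5):/, "", $2); print $2 }' <<< "$1"
}

# check_host_key HOST TYPE EXPECTED: fetch the host key of the given type
# (rsa, ecdsa or ed25519) and compare its SHA256 fingerprint with the
# published value. Returns 0 on a match.
check_host_key() {
    local line
    line="$(ssh-keyscan -t "$2" "$1" 2>/dev/null | ssh-keygen -l -f -)" || return 2
    [ "$(fingerprint_of "$line")" = "$3" ]
}
```

A call would pass the cluster hostname, the key type and the SHA256 value from the table. For the MD5 column, ssh-keygen additionally needs the option -E md5.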

= Hardware =

* All firmware versions on all components were upgraded

= Operating system =

* The operating system remains at RHEL 8.4 EUS

* The Mellanox OFED InfiniBand Stack was updated to version 5.5-2.1.7.0

= Compilers, Libraries and Runtime Environments =
* The Intel Parallel Studio (Compiler, MKL, MPI) version 2019 modules were removed during the maintenance.
* clang 9 and llvm 10 modules were removed

= Userspace tools =

= Software Modules =

= Batch system =

* The Slurm version was upgraded to version 22.05.5
* Pyxis Plugin version was upgraded to 0.14.0

= Storage =
* Lustre client, BeeGFS client and Spectrum Scale client were updated

= Graphics stack =

* The NVIDIA driver was upgraded to version 515.65.07
* CUDA 11.7 was installed

= Containers =

= JupyterHub =

BwUniCluster2.0

2022-11-25T14:30:44Z

S Raffeiner:


BwUniCluster2.0/Maintenance/2022-11

2022-11-25T14:18:30Z

S Raffeiner:


The following issue is known:

{| style=" background:#FFCCCC; width:100%;"
| Due to the hardware configuration, there is a known problem with OpenMPI on the nodes in the "multiple_il" partition. When an MPI application starts, the warning "No OpenFabrics connection schemes reported" appears and refers to the device "mlx5_2". This device is an Ethernet port, which is not supposed to be used by OpenMPI. The warning is purely informational; we are working on suppressing it.
|}


BwUniCluster2.0

2022-10-26T10:22:09Z

S Raffeiner:



{| style=" background:#FEF4AB; width:100%;"
| style="padding:8px; background:#FFE856; font-size:120%; font-weight:bold; text-align:left" | News
|-
|
* 2022-09-22: A [[BwUniCluster2.0/Maintenance/2022-11|maintenance interval]] from 07.11.2022 to 10.11.2022 was announced.
|}


BwUniCluster2.0/Maintenance/2022-11

2022-10-26T10:18:14Z

S Raffeiner: Created page with "The following changes have been introduced during the maintenance interval between on 07.11.2022 (Monday) 08:00 and 10.11.2022 (Thursday) 17:00. The host key of the system ha..."

BwUniCluster2.0/Maintenance

2022-10-26T10:16:43Z

S Raffeiner:

'''2022'''

* [[BwUniCluster2.0/Maintenance/2022-11]] from 07.11.2022 to 10.11.2022

* [[BwUniCluster2.0/Maintenance/2022-03]] from 28.03.2022 to 31.03.2022

'''2021'''

* [[BwUniCluster2.0/Maintenance/2021-10]] from 11.10.2021 to 15.10.2021

'''2020'''

* [[BwUniCluster2.0/Maintenance/2020-10]] from 06.10.2020 to 13.10.2020

=== Maintenance records of retired bwUniCluster 1.0 ===

[[Category:BwUniCluster 2.0]]

'''2019'''

* [[BwUniCluster/Maintenance/2019-02]] from 02.02.2019 to 08.02.2019

'''2017'''

* [[BwUniCluster/Maintenance/2017-05]] from 02.05.2017 to 02.05.2017
* [[BwUniCluster/Maintenance/2017-03]] from 20.03.2017 to 21.03.2017

'''2016'''

* [[BwUniCluster/Maintenance/2016-10]] from 17.10.2016 to 21.10.2016

Category:BwUniCluster 2.0

2021-12-15T15:41:33Z

S Raffeiner:

BwUniCluster 2.0 User Access

2021-10-20T11:47:56Z

S Raffeiner: /* Client application: MobaXterm */

[[bwUniCluster_2.0|bwUniCluster 2.0]] is Baden-Württemberg's general purpose tier 3 high performance computing (HPC)
cluster co-financed by Baden-Württemberg's ministry of science, research and arts and the shareholders:

* Albert Ludwig University of Freiburg
* Eberhard Karls University, Tübingen
* Karlsruhe Institute of Technology (KIT)
* Heidelberg University (Ruprecht-Karls-Universität Heidelberg)
* Ulm University
* University of Hohenheim
* University of Konstanz
* University of Mannheim
* University of Stuttgart
* HAW BW e.V. (an association of several universities of applied sciences in Baden-Württemberg, see below)
 
To '''log on''' to [[bwUniCluster_2.0|bwUniCluster 2.0]], a user account is required. All members of the shareholder
universities can apply for an account.
 

{| style="width: 100%; border-spacing: 5px;"
| style="text-align:left; color:#000;vertical-align:top;" |__TOC__
| [[File:bwUniCluster_17Jan2014_p044-rot_t10.10.00.jpg|center|border|250px|bwUniCluster wiring by Holger Obermaier, copyright: KIT (SCC)]] bwUniCluster wiring © KIT (SCC)
|}

= Registration =

Granting access and issuing a user account for '''bwUniCluster 2.0''' requires registration at the KIT service website
* [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] (step B).
However, this registration depends on the
* '''bwUniCluster entitlement''' (step A)
issued by your university.
 
Please log in to
* https://bwidm.scc.kit.edu/
to see a list of your entitlements. If the list contains
<pre> http://bwidm.de/entitlement/bwUniCluster </pre> you already have the entitlement and can skip step A.

== Step A: bwUniCluster entitlement for registration ==
'''The entitlement is called bwUniCluster (not bwUniCluster 2.0)''', and each university issues the bwUniCluster entitlement '''only''' for its own members. Some have established online processes or provide downloads of the entitlement application forms. If there is no link behind the name of an institution in the following list, please contact the local IT support services:

* [[BwCluster_User_Access_Uni_Freiburg|Albert Ludwig University of Freiburg]]
* [https://bwunicluster.urz.uni-heidelberg.de/ Heidelberg University]
* [https://kim.uni-hohenheim.de/bwhpc-account University of Hohenheim]
* [http://www.scc.kit.edu/downloads/ism/Accessform_bwUniCluster_DE_EN.pdf Karlsruhe Institute of Technology (KIT)]
* [[BWUniCluster_User_Access_Members_Uni_Konstanz|University of Konstanz]]
* [[BWUniCluster_User_Access_Members_Uni_Mannheim|University of Mannheim]]
* [https://www.hlrs.de/solutions-services/academic-users/bwunicluster-access/ University of Stuttgart]
* [https://uni-tuebingen.de/de/155157 Eberhard Karls University Tübingen]
* [[BWUniCluster_User_Access_Members_Uni_Ulm|Ulm University]]
* Hochschule Aalen
* Hochschule Albstadt-Sigmaringen
* Hochschule Esslingen
* Hochschule Heilbronn
* Hochschule Karlsruhe
* Hochschule Konstanz
* Hochschule Mannheim
* Hochschule Offenburg
* Hochschule Reutlingen
* Hochschule Rottenburg
* Hochschule Stuttgart (HfT)
* Hochschule Ulm
 

== Step B: Web Registration, service password and 2-factor authentication ==

After completing step A, i.e. after the bwUniCluster entitlement has been issued successfully, you have to register yourself for the service. To do so, please visit [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] and complete the following steps.

1. Select your home organization from the list on the main page and click '''Proceed''' or '''Fortfahren'''.

[[File:Bwidm-register-red.png|center|border|]]
 

2. You will be directed to the ''Identity Provider'' of your home organisation. Enter the user ID / username and password of your home organisation - this is usually the same password used for your e-mail account and other services - and click on '''Login''', '''Einloggen''' or something similar.
 

3. You will be redirected back to the registration website [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu/]. If you are logging into bwIDM for the first time, there will be a summary screen which shows the account details your home institution is providing to the central system. Please check that all data is valid and then click on '''Continue''' or '''Weiter'''.
 

4. Once you have successfully logged into the bwIDM system, you will be greeted by a home screen showing all state-wide services you have access to. There will be a box labelled "bwUniCluster". Click on '''Register''' or '''Registrieren''' to start the registration process.

[[File:Bwidm-2-red.png|center|border|]]
 

5. Since August 13, 2020, a '''2-factor authentication''' mechanism (2FA) has been enforced to improve security. If you have never registered a 2FA token on bwIDM before, the following error message will appear:

[[File:Bwidm-3-red.png|center|]]

Click on the [https://bwidm.scc.kit.edu/user/twofa.xhtml Link] or on the '''My Tokens''' link in the main menu. The instructions for registering a new 2FA token can be found on the following page: [[bwUniCluster 2.0 User Access/2FA Tokens]]. Please complete them before proceeding.
 

6. Make sure all requirements are met by checking the '''Requirements''' box at the top. If the requirements are not met, you might be able to correct the issue by following the instructions. In all other cases please [[Registration_Support_-_bwUniCluster|contact your local hotline]].

[[File:BwUniCluster 2.0 access login bwidm registration requirements.png|center|border|]]
 

7. Read the Terms of Use ('''Nutzungsbedingungen und -richtlinien'''), check the box beside '''I have read and accepted the terms of use''' and click on '''Register''' or '''Registrieren'''.
 

8. Set a service password for the bwUniCluster and click on '''Save''' or '''Speichern'''. Logging in with the password of your home organisation, like on the former bwUniCluster 1, is no longer possible. Please make sure to use a strong password which is different from any other password you are currently using or have used on other systems. You will also be asked to change the service password regularly.

[[File:Bwidm-5-red.png|center|]]
 
 

== Step C: Fill out the bwUniCluster questionnaire ==

Filling out the bwUniCluster questionnaire on

https://zas.bwhpc.de/shib/en/bwunicluster_survey.php

is mandatory for all users. The input is solely used to improve our support activities and for capacity planning of future HPC resources. '''If the questionnaire is not filled out, access to bwUniCluster 2.0 is blocked 14 days after the registration.'''
 
 

== Changing the Service Password ==

Your bwUniCluster 2.0 '''password''' is the service password you set during the web registration (see step 8 of Step B). At any time, you can set a new bwUniCluster 2.0 password via the registration website [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] by carrying out the following steps:
# Go to [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] and select your home organization
# Authenticate yourself via the user id / username and password provided by your home institution
# Find the entry '''bwUniCluster''' and select '''Set Service Password'''
# Enter the new password, repeat it and click the '''Save''' button.
# If the change was successful, the message "Das Passwort wurde bei dem Dienst geändert" ("Password has been changed") will be shown
# Proceed to log in using the new password
 
 
== Contact / Support ==
If you have questions or problems concerning the bwUniCluster (2.0) registration, please [[bwUniCluster 2.0 Support|contact your local hotline]].
 
 

= Establishing network access =

Access to bwUniCluster 2.0 is '''limited to IP addresses from the so-called BelWü networks'''. All home institutions of our current users are connected to BelWü, so if you are on your campus network (e.g. in your office or on the campus WiFi), you should be able to connect to bwUniCluster 2.0 without restrictions. If you are outside the BelWü networks (e.g. in your home office instead of your campus office), you first have to establish a VPN connection to your home institution (see e.g. [1] for the KIT).
 
 

= Login =

After finishing the web registration and making sure that you are on a network from which you have access to bwUniCluster 2.0 (e.g. by establishing a VPN connection), the HPC cluster is ready for your '''SSH''' based login. Recommended SSH client applications are:

* the ssh (OpenSSH) command included in all Linux distributions and macOS, run from the ''Terminal'' application
* [http://mobaxterm.mobatek.net/ MobaXterm] under Windows
 

== Hostnames ==

The main hostname required to connect to bwUniCluster 2.0 is '''bwunicluster.scc.kit.edu''' or '''uc2.scc.kit.edu'''. The system has four login nodes and we use so-called ''DNS round-robin scheduling'' to load-balance the incoming connections between the nodes. If you open multiple SSH sessions to bwUniCluster 2.0, these sessions may be established to different login nodes, so processes started in one session might not be visible in other sessions.

The older Broadwell extension partition of the former bwUniCluster 1 is connected to bwUniCluster 2.0.

If you need to connect to specific login nodes, you can use the following hostnames:

{| class="wikitable"
! Hostname !! Node type
|-
| '''uc2-login1.scc.kit.edu''' || bwUniCluster 2.0, first login node
|-
| '''uc2-login2.scc.kit.edu''' || bwUniCluster 2.0, second login node
|-
| '''uc2-login3.scc.kit.edu''' || bwUniCluster 2.0, third login node
|-
| '''uc2-login4.scc.kit.edu''' || bwUniCluster 2.0, fourth login node
|-
|}

Only secure shell (''SSH'') logins are allowed. Other protocols like ''telnet'' or ''rlogin'' are not allowed for security reasons.
 

== Usernames ==

Your username will be the same as the one provided by your home institution, but '''prefixed''' with two characters and an underscore indicating your home institution. For example, if you are a member of the University of Konstanz and your local username is ab1234, your username on bwUniCluster 2.0 is kn_ab1234.

The following list contains all prefixes currently in use:

{| class="wikitable"
! Home organization !! <UserID>
|-
| Universität Freiburg || ''fr_''username
|-
| Universität Heidelberg || ''hd_''username
|-
| Universität Hohenheim || ''ho_''username
|-
| KIT || username ''(without any prefix)''
|-
| Universität Konstanz || ''kn_''username
|-
| Universität Mannheim || ''ma_''username
|-
| Universität Stuttgart || ''st_''username
|-
| Universität Tübingen || ''tu_''username
|-
| Universität Ulm || ''ul_''username
|-
| Hochschule Aalen || ''aa_''username
|-
| Hochschule Albstadt-Sigmaringen || ''as_''username
|-
| Hochschule Esslingen || ''es_''username
|-
| Hochschule Heilbronn || ''hn_''username
|-
| Hochschule Karlsruhe || ''hk_''username
|-
| HTWG Konstanz || ''ht_''username
|-
| Hochschule Mannheim || ''mn_''username
|-
| Hochschule Offenburg || ''of_''username
|-
| Hochschule Reutlingen || ''hr_''username
|-
| Hochschule Rottenburg || ''ro_''username
|-
| Hochschule für Technik Stuttgart || ''hs_''username
|-
| Hochschule Ulm || ''hu_''username
|-
|}
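
The naming rule from the table can be written down as a short shell function. This is only an illustration of the rule; uc2_username is a hypothetical helper name and not a tool provided on the cluster.

```shell
#!/bin/bash
# uc2_username PREFIX LOCAL_NAME: build the cluster username from the
# institution prefix (e.g. kn, hd, st) and the local username.
# An empty prefix is used for KIT, whose usernames carry no prefix.
uc2_username() {
    if [ -n "$1" ]; then
        printf '%s_%s\n' "$1" "$2"
    else
        printf '%s\n' "$2"
    fi
}
```

For the example above, calling uc2_username with the arguments kn and ab1234 prints kn_ab1234.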
 

== Client application: OpenSSH ==

Most Unix and Unix-like operating systems like Linux, macOS and *BSD come with a built-in SSH client provided by the OpenSSH project. More recent versions of Windows 10 and the Windows Subsystem for Linux also come with a built-in OpenSSH client.

To use this client, simply open a command line terminal (the exact process differs on every operating system, but usually involves starting an application called '''Terminal''' or '''Command Prompt''') and enter the following command to connect to bwUniCluster 2.0:

<pre>
$ ssh <UserID>@bwunicluster.scc.kit.edu
</pre>

If you are on a Linux or Unix system running the X Window System (X11) and want to use a GUI-based application on bwUniCluster 2.0, you can use the ''-X'' option for the ssh command to set up X11 forwarding:

<pre>
$ ssh -X <UserID>@uc2.scc.kit.edu
</pre>

Windows users requiring X11 forwarding for graphical applications should use '''MobaXterm''' instead.
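
To avoid typing the hostname, user ID and the X11 option for every connection, the OpenSSH client can read them from its configuration file. The following sketch prints a suitable entry; ssh_config_entry is a hypothetical helper name and the alias uc2 is an arbitrary choice. The keywords Host, HostName, User and ForwardX11 are standard OpenSSH client options, and ForwardX11 yes corresponds to the ''-X'' option.

```shell
#!/bin/bash
# ssh_config_entry USERID: print an OpenSSH client configuration block for
# the cluster. "uc2" is an arbitrary alias chosen for this example.
ssh_config_entry() {
    cat <<EOF
Host uc2
    HostName uc2.scc.kit.edu
    User $1
    ForwardX11 yes
EOF
}
```

The output can be appended to the file ~/.ssh/config; afterwards the command "ssh uc2" opens the connection with these settings.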
 

== Client application: MobaXterm ==

The bwHPC-C5 support team strongly recommends using [http://mobaxterm.mobatek.net/ MobaXterm] instead of ''PuTTY'' or ''WinSCP'' on Windows. ''MobaXterm'' provides a built-in X11 server, which allows you to start GUI-based software.

Start ''MobaXterm'' and fill in the following fields:
<pre>
Remote name : uc2.scc.kit.edu
Specify user name : <UserID>
Port : 22
</pre>

Then click on '''OK'''. A terminal opens in which you can enter your credentials.
 

== Client application: FileZilla ==

Many GUI applications that support SFTP transfers on Linux don't work well with 2-factor authentication, e.g. Nautilus and Dolphin don't support it. A good alternative for Linux is FileZilla.

Start FileZilla, select "File -> Site Manager..." from the main menu and set up a new connection with the following settings:

<pre>
Protocol: SFTP - SSH File Transfer Protocol
Host: uc2.scc.kit.edu
Logon Type: Interactive
User: <UserID>
</pre>

Then click on the "Connect" button.

Files can be transferred between the local system and the cluster by navigating to the respective folders in the split file view and then either dragging files and folders between the views or by clicking on a file/folder with the right mouse button and then selecting "Upload" or "Download" from the menu.

== Example login process ==

After the connection has been initiated, a successful login process will go through the following three steps:

1. The system asks for a '''One-Time Password'''. Generate one using the Software or Hardware Token registered on the bwIDM system (see [[bwUniCluster 2.0 User Access/2FA Tokens]]) and enter it after the '''Your OTP:''' prompt.

2. The system asks for your service password. Enter it after the '''Password:''' prompt.

3. You are greeted by the bwUniCluster 2.0 banner followed by a shell.

The result should look like this:

[[File:BwUniCluster 2.0 access login example.png|center|]]
 

== Troubleshooting ==

'''Issue: The "Your OTP:" prompt never appears and the connection hangs/times out instead'''

Likely cause: You are most likely not on a network from which access to the bwUniCluster 2.0 system is allowed. Please check if you might have to establish a VPN connection first.
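
To tell a blocked network apart from other login problems, it can help to test whether the SSH port answers at all before attempting a full login. The following is a minimal sketch, not an official tool: the function name <code>can_reach_ssh</code> and the five-second limit are assumptions made for illustration, and it relies on Bash's <code>/dev/tcp</code> redirection and the coreutils <code>timeout</code> command being available on your local machine.

```shell
# Hypothetical helper: succeeds (exit status 0) if a TCP connection to
# host:port can be opened within 5 seconds, and fails otherwise.
can_reach_ssh() {
    local host="$1" port="${2:-22}"
    # /dev/tcp/HOST/PORT is a Bash feature; the child shell closes the
    # connection again as soon as it exits.
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage (run on your local machine):
#   if can_reach_ssh uc2.scc.kit.edu; then
#       echo "SSH port reachable"
#   else
#       echo "SSH port not reachable - a VPN connection may be required"
#   fi
```

A failing check only shows that no connection could be opened from the current network; it does not say anything about the state of your account.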

 

'''Issue: The system asks for the One-Time Password multiple times'''

Likely cause: Make sure you are using the correct Software Token to generate the One-Time Password.

 

'''Issue: The system asks for the service password multiple times'''

Likely cause: Make sure you are using the service password set on bwIDM and not the password valid for your home institution. Unlike the bwUniCluster 1, the bwUniCluster 2.0 only accepts the service password.

 

'''Issue: There is an error message from the pam_ses_open.sh script'''

Likely cause: Your account is in the "LOST_ACCESS" state because the entitlement is no longer valid, the questionnaire was not filled out or there was a problem during the communication between your home institution and the central bwIDM system. Please try the following steps:

* Log into [https://bwidm.scc.kit.edu bwIDM], look for the bwUniCluster entry and click on '''Registry info'''. Your "Status:" should be "ACTIVE". If it is not, please wait for ten minutes since logging into the bwIDM causes a refresh and the problem might fix itself. If the status does not change to ACTIVE after a longer amount of time, please contact the support channels.

* If you have not filled out the questionnaire, please do so on [https://zas.bwhpc.de/shib/en/bwunicluster_survey.php https://zas.bwhpc.de/shib/en/bwunicluster_survey.php] and then wait for about ten minutes before attempting to log into the HPC system again.
 
 

== Allowed activities on login nodes ==

The login nodes of bwUniCluster 2.0 are the access point to the compute system and to your bwUniCluster 2.0 $HOME directory. They are shared by all users of bwUniCluster 2.0. Your activities on the login nodes are therefore limited primarily to setting up your batch jobs. The following activities are also permitted:

* '''short''' compilation of your program code and
* '''short''' pre- and postprocessing of your batch jobs.

To guarantee usability for all the users of bwUniCluster 2.0 '''you must not run your compute jobs on the login nodes'''. Compute jobs must be submitted to the
[[bwUniCluster Batch Jobs|queueing system]]. Any compute job running on the login nodes will be terminated without any notice. Any long-running compilation or any long-running pre- or postprocessing of batch jobs must also be submitted to the [[bwUniCluster Batch Jobs|queueing system]].
 
 

== SSH Keys ==

In contrast to the bwUniCluster 1 and many other HPC systems it is '''no longer possible to self-manage your SSH Keys by adding them to the ~/.ssh/authorized_keys file'''. Existing files will no longer be evaluated. SSH Keys have to be managed via the central bwIDM system instead. Please refer to the user guide for this functionality:

[[bwUniCluster 2.0 User Access/SSH Keys]]
 
 

= [[First_Steps_on_bwHPC_cluster|First steps on bwUniCluster]] =

The first and most important steps on bwUniCluster 2.0 are described [[First_Steps_on_bwHPC_cluster|here]].
 
 

= Deregistration =

Aka: unsubscribe from bwUniCluster mailing list

If you plan to permanently leave the bwUniCluster 2.0, follow the deregistration checklist:
# Transfer all your data in $HOME and your workspaces to your local computer/storage, then delete all your data on the cluster
# Visit [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu]
#* Select your home organization from the list and click '''Proceed'''
#* Enter your home organisation's user ID / username and password and click the '''Login''' button
#* You will be redirected back to the registration website [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu/]
#* <div>Select '''Registry Info''' of the service '''bwUniCluster''' (on the left hand side) [[File:bwUniCluster_registration_sidebar.png|center|border|]]</div>
#* Click '''Deregister'''
Note that Step 2 will automatically unsubscribe you from the bwUniCluster mailing list.

----
[[Category:bwUniCluster_2.0]][[Category:Access]]

BwUniCluster 2.0 User Access

2021-10-20T11:47:42Z

S Raffeiner: /* Hostnames */

[[bwUniCluster_2.0|bwUniCluster 2.0]] is Baden-Württemberg's general purpose tier 3 high performance computing (HPC)
cluster co-financed by Baden-Württemberg's ministry of science, research and arts and the shareholders:

* Albert Ludwig University of Freiburg
* Eberhard Karls University, Tübingen
* Karlsruhe Institute of Technology (KIT)
* Heidelberg University (Ruprecht-Karls-Universität Heidelberg)
* Ulm University
* University of Hohenheim
* University of Konstanz
* University of Mannheim
* University of Stuttgart
* HAW BW e.V. (an association of several universities of applied sciences in Baden-Württemberg, see below)
 
To '''log on''' to [[bwUniCluster_2.0|bwUniCluster 2.0]] a user account is required. All members of the shareholder
universities can apply for an account.
 

{| style="width: 100%; border-spacing: 5px;"
| style="text-align:left; color:#000;vertical-align:top;" |__TOC__
| [[File:bwUniCluster_17Jan2014_p044-rot_t10.10.00.jpg|center|border|250px|bwUniCluster wiring by Holger Obermaier, copyright: KIT (SCC)]] bwUniCluster wiring © KIT (SCC)
|}

= Registration =

Granting access and issuing a user account for '''bwUniCluster 2.0''' requires the registration at the KIT service website
* [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] (step B).
However, this registration depends on the
* '''bwUniCluster entitlement''' (step A)
issued by your university.
 
Please log in to
* https://bwidm.scc.kit.edu/
to see a list of your entitlements. If the list contains
<pre> http://bwidm.de/entitlement/bwUniCluster </pre> you already have the entitlement and can skip step A.

== Step A: bwUniCluster entitlement for registration ==
'''The entitlement is called bwUniCluster (not bwUniCluster 2.0)''' and each university issues the bwUniCluster entitlement '''only''' for their own respective members. Some have established on-line processes or provide downloads of the entitlement application forms. If there is no link behind the name of an institution in the following list, please contact the local IT support services:

* [[BwCluster_User_Access_Uni_Freiburg|Albert Ludwig University of Freiburg]]
* [https://bwunicluster.urz.uni-heidelberg.de/ Heidelberg University]
* [https://kim.uni-hohenheim.de/bwhpc-account University of Hohenheim]
* [http://www.scc.kit.edu/downloads/ism/Accessform_bwUniCluster_DE_EN.pdf Karlsruhe Institute of Technology (KIT)]
* [[BWUniCluster_User_Access_Members_Uni_Konstanz|University of Konstanz]]
* [[BWUniCluster_User_Access_Members_Uni_Mannheim|University of Mannheim]]
* [https://www.hlrs.de/solutions-services/academic-users/bwunicluster-access/ University of Stuttgart]
* [https://uni-tuebingen.de/de/155157 Eberhard Karls University Tübingen]
* [[BWUniCluster_User_Access_Members_Uni_Ulm|Ulm University]]
* Hochschule Aalen
* Hochschule Albstadt-Sigmaringen
* Hochschule Esslingen
* Hochschule Heilbronn
* Hochschule Karlsruhe
* Hochschule Konstanz
* Hochschule Mannheim
* Hochschule Offenburg
* Hochschule Reutlingen
* Hochschule Rottenburg
* Hochschule Stuttgart (HfT)
* Hochschule Ulm
 

== Step B: Web Registration, service password and 2-factor authentication ==

After completing step A, i.e. after the bwUniCluster entitlement has been issued successfully, you have to register for the service. To do so, please visit [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] and complete the following steps.

1. Select your home organization from the list on the main page and click '''Proceed''' or '''Fortfahren'''.

[[File:Bwidm-register-red.png|center|border|]]
 

2. You will be directed to the ''Identity Provider'' of your home organisation. Enter the user ID / username and password of your home organisation - this is usually the same password used for your e-mail account and other services - and click on '''Login''', '''Einloggen''' or something similar.
 

3. You will be redirected back to the registration website [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu/]. If you are logging into bwIDM for the first time, there will be a summary screen which shows the account details your home institution is providing to the central system. Please check that all data is valid and then click on '''Continue''' or '''Weiter'''.
 

4. Once you have successfully logged into the bwIDM system, you will be greeted by a home screen showing all state-wide services you have access to. There will be a box labelled "bwUniCluster". Click on '''Register''' or '''Registrieren''' to start the registration process.

[[File:Bwidm-2-red.png|center|border|]]
 

5. Since August 13, 2020 a '''2-factor authentication''' mechanism (2FA) is being enforced to improve security. If you have never registered a 2FA token on bwIDM before, the following error message will appear:

[[File:Bwidm-3-red.png|center|]]

Click on the [https://bwidm.scc.kit.edu/user/twofa.xhtml Link] or on the '''My Tokens''' link in the main menu. The instructions for registering a new 2FA token can be found on the following page: [[bwUniCluster 2.0 User Access/2FA Tokens]]. Please complete them before proceeding.
 

6. Make sure all requirements are met by checking the '''Requirements''' box at the top. If the requirements are not met, you might be able to correct the issue by following the instructions. In all other cases please [[Registration_Support_-_bwUniCluster|contact your local hotline]].

[[File:BwUniCluster 2.0 access login bwidm registration requirements.png|center|border|]]
 

7. Read the Terms of Use ('''Nutzungsbedingungen und -richtlinien'''), check the box beside '''I have read and accepted the terms of use''' and click on '''Register''' or '''Registrieren'''.
 

8. Set a service password for the bwUniCluster and click on '''Save''' or '''Speichern'''. Logging in with the password of your home organisation, like on the former bwUniCluster 1, is no longer possible. Please make sure to use a strong password which is different from any other password you are currently using or have used on other systems. You will also be asked to change the service password regularly.

[[File:Bwidm-5-red.png|center|]]
 
 

== Step C: Fill out the bwUniCluster questionnaire ==

Filling out the bwUniCluster questionnaire on

https://zas.bwhpc.de/shib/en/bwunicluster_survey.php

is mandatory for all users. The input is solely used to improve our support activities and for capacity planning of future HPC resources. '''If the questionnaire is not filled out, access to bwUniCluster 2.0 is blocked 14 days after the registration.'''
 
 

== Changing the Service Password ==

Your bwUniCluster 2.0 '''password''' is the service password you set during the web registration (see step 8 of Step B above). At any time, you can set a new bwUniCluster 2.0 password via the registration website [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] by carrying out the following steps:
# Go to [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] and select your home organization
# Authenticate yourself via the user id / username and password provided by your home institution
# Find the entry '''bwUniCluster''' and select '''Set Service Password'''
# Enter the new password, repeat it and click the '''Save''' button
# If the change was successful, the message "Das Passwort wurde bei dem Dienst geändert" ("Password has been changed") will be shown
# Proceed to log in using the new password
 
 
== Contact / Support ==
If you have questions or problems concerning the bwUniCluster (2.0) registration, please [[bwUniCluster 2.0 Support|contact your local hotline]].
 
 

= Establishing network access =

Access to bwUniCluster 2.0 is '''limited to IP addresses from the so-called BelWü networks'''. All home institutions of our current users are connected to BelWü, so if you are on your campus network (e.g. in your office or on the campus WiFi) you should be able to connect to bwUniCluster 2.0 without restrictions. If you are outside the BelWü networks (e.g. in your home office instead of your campus office), you first have to establish a VPN connection to your home institution (see e.g. [1] for the KIT).
 
 

= Login =

After finishing the web registration and making sure that you are on a network from which you have access to bwUniCluster 2.0 (e.g. by establishing a VPN connection), the HPC cluster is ready for your '''SSH'''-based login. Recommended SSH client applications are:

* the ''ssh'' command (OpenSSH) included in all Linux distributions and macOS, run from the ''Terminal'' application
* [http://mobaxterm.mobatek.net/ MobaXterm] under Windows
 

== Hostnames ==

The main hostname required to connect to bwUniCluster 2.0 is '''bwunicluster.scc.kit.edu''' or '''uc2.scc.kit.edu'''. The system has four login nodes, and so-called ''DNS round-robin scheduling'' is used to load-balance incoming connections between them. If you open multiple SSH sessions to bwUniCluster 2.0, they may be established to different login nodes, so processes started in one session might not be visible in the others.

The older Broadwell extension partition of the former bwUniCluster 1 is connected to bwUniCluster 2.0.

If you need to connect to specific login nodes, you can use the following hostnames:

{| class="wikitable"
! Hostname !! Node type
|-
| '''uc2-login1.scc.kit.edu''' || bwUniCluster 2.0, first login node
|-
| '''uc2-login2.scc.kit.edu''' || bwUniCluster 2.0, second login node
|-
| '''uc2-login3.scc.kit.edu''' || bwUniCluster 2.0, third login node
|-
| '''uc2-login4.scc.kit.edu''' || bwUniCluster 2.0, fourth login node
|-
|}

Only the secure shell protocol (''SSH'') may be used to log in. Other protocols like ''telnet'' or ''rlogin'' are not allowed for security reasons.
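
Because the round-robin alias can place each new session on a different node, it is sometimes useful to address one login node directly, for example to return to the node on which a process was started. The following is a small sketch built from the hostnames in the table above; the function name <code>uc2_login_host</code> is a hypothetical helper, not something provided by the cluster.

```shell
# Hypothetical helper: print the direct hostname of login node N (1-4),
# following the naming scheme listed in the table above.
uc2_login_host() {
    local n="$1"
    case "$n" in
        1|2|3|4) printf 'uc2-login%s.scc.kit.edu\n' "$n" ;;
        *) echo "login node number must be between 1 and 4" >&2; return 1 ;;
    esac
}

# Usage: connect to the second login node instead of the round-robin alias.
#   ssh <UserID>@"$(uc2_login_host 2)"
```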
 

== Usernames ==

Your username will be the same as the one provided by your home institution, but '''prefixed''' with two characters and an underscore indicating your home institution. For example, if you are a member of the University of Konstanz and your local username is ab1234, your username on bwUniCluster 2.0 is kn_ab1234.

The following list contains all prefixes currently in use:

{| class="wikitable"
! Home organization !! <UserID>
|-
| Universität Freiburg || ''fr_''username
|-
| Universität Heidelberg || ''hd_''username
|-
| Universität Hohenheim || ''ho_''username
|-
| KIT || username ''(without any prefix)''
|-
| Universität Konstanz || ''kn_''username
|-
| Universität Mannheim || ''ma_''username
|-
| Universität Stuttgart || ''st_''username
|-
| Universität Tübingen || ''tu_''username
|-
| Universität Ulm || ''ul_''username
|-
| Hochschule Aalen || ''aa_''username
|-
| Hochschule Albstadt-Sigmaringen || ''as_''username
|-
| Hochschule Esslingen || ''es_''username
|-
| Hochschule Heilbronn || ''hn_''username
|-
| Hochschule Karlsruhe || ''hk_''username
|-
| HTWG Konstanz || ''ht_''username
|-
| Hochschule Mannheim || ''mn_''username
|-
| Hochschule Offenburg || ''of_''username
|-
| Hochschule Reutlingen || ''hr_''username
|-
| Hochschule Rottenburg || ''ro_''username
|-
| Hochschule für Technik Stuttgart || ''hs_''username
|-
| Hochschule Ulm || ''hu_''username
|-
|}
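
The prefix rule can be written down as a short shell function. This is only an illustrative sketch: the name <code>uc2_userid</code> is hypothetical, and the prefix has to be taken from the table above (an empty prefix stands for KIT, whose usernames are used unchanged).

```shell
# Hypothetical helper: build the bwUniCluster 2.0 <UserID> from a two-letter
# prefix (see the table above) and the local username.
# Pass an empty prefix for KIT accounts, which carry no prefix.
uc2_userid() {
    local prefix="$1" local_name="$2"
    if [ -z "$prefix" ]; then
        printf '%s\n' "$local_name"
    else
        printf '%s_%s\n' "$prefix" "$local_name"
    fi
}

# Usage:
#   ssh "$(uc2_userid kn ab1234)"@uc2.scc.kit.edu   # logs in as kn_ab1234
```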
 

== Client application: OpenSSH ==

Most Unix and Unix-like operating systems like Linux, macOS and *BSD come with a built-in SSH client provided by the OpenSSH project. More recent versions of Windows 10 and the Windows Subsystem for Linux also come with a built-in OpenSSH client.

To use this client, simply open a command line terminal (the exact process differs on every operating system, but usually involves starting an application called '''Terminal''' or '''Command Prompt''') and enter the following command to connect to bwUniCluster 2.0:

<pre>
$ ssh <UserID>@bwunicluster.scc.kit.edu
</pre>

If you are on a Linux or Unix system running the X Window System (X11) and want to use a GUI-based application on bwUniCluster 2.0, you can use the ''-X'' option for the ssh command to set up X11 forwarding:

<pre>
$ ssh -X <UserID>@uc2.scc.kit.edu
</pre>

Windows users requiring X11 forwarding for graphical applications should use '''MobaXterm''' instead.
 

== Client application: MobaXterm ==

The bwHPC-C5 support team strongly recommends using [http://mobaxterm.mobatek.net/ MobaXterm] instead of ''PuTTY'' or ''WinSCP'' on Windows. ''MobaXterm'' provides a built-in X11 server, which makes it possible to run GUI-based software.

Start ''MobaXterm'' and fill in the following fields:
<pre>
Remote name : uc2.scc.kit.edu # or uc1e.scc.kit.edu
Specify user name : <UserID>
Port : 22
</pre>

Then click '''OK'''. A terminal opens in which you can enter your credentials.
 

== Client application: FileZilla ==

Many GUI applications that support SFTP transfers on Linux do not work well with 2-factor authentication; Nautilus and Dolphin, for example, do not support it. A good alternative for Linux is FileZilla.

Start FileZilla, select "File -> Site Manager..." from the main menu and set up a new connection with the following settings:

<pre>
Protocol: SFTP - SSH File Transfer Protocol
Host: uc2.scc.kit.edu
Logon Type: Interactive
User: <UserID>
</pre>

Then click on the "Connect" button.

Files can be transferred between the local system and the cluster by navigating to the respective folders in the split file view and then either dragging files and folders between the views or by clicking on a file/folder with the right mouse button and then selecting "Upload" or "Download" from the menu.

== Example login process ==

After the connection has been initiated, a successful login process will go through the following three steps:

1. The system asks for a '''One-Time Password'''. Generate one using the Software or Hardware Token registered on the bwIDM system (see [[bwUniCluster 2.0 User Access/2FA Tokens]]) and enter it after the '''Your OTP:''' prompt.

2. The system asks for your service password. Enter it after the '''Password:''' prompt.

3. You are greeted by the bwUniCluster 2.0 banner followed by a shell.

The result should look like this:

[[File:BwUniCluster 2.0 access login example.png|center|]]
 

== Troubleshooting ==

'''Issue: The "Your OTP:" prompt never appears and the connection hangs/times out instead'''

Likely cause: You are most likely not on a network from which access to the bwUniCluster 2.0 system is allowed. Please check if you might have to establish a VPN connection first.

 

'''Issue: The system asks for the One-Time Password multiple times'''

Likely cause: Make sure you are using the correct Software Token to generate the One-Time Password.

 

'''Issue: The system asks for the service password multiple times'''

Likely cause: Make sure you are using the service password set on bwIDM and not the password valid for your home institution. Unlike the bwUniCluster 1, the bwUniCluster 2.0 only accepts the service password.

 

'''Issue: There is an error message from the pam_ses_open.sh script'''

Likely cause: Your account is in the "LOST_ACCESS" state because the entitlement is no longer valid, the questionnaire was not filled out or there was a problem during the communication between your home institution and the central bwIDM system. Please try the following steps:

* Log into [https://bwidm.scc.kit.edu bwIDM], look for the bwUniCluster entry and click on '''Registry info'''. Your "Status:" should be "ACTIVE". If it is not, please wait for ten minutes since logging into the bwIDM causes a refresh and the problem might fix itself. If the status does not change to ACTIVE after a longer amount of time, please contact the support channels.

* If you have not filled out the questionnaire, please do so on [https://zas.bwhpc.de/shib/en/bwunicluster_survey.php https://zas.bwhpc.de/shib/en/bwunicluster_survey.php] and then wait for about ten minutes before attempting to log into the HPC system again.
 
 

== Allowed activities on login nodes ==

The login nodes of bwUniCluster 2.0 are the access point to the compute system and to your bwUniCluster 2.0 $HOME directory. They are shared by all users of bwUniCluster 2.0. Your activities on the login nodes are therefore limited primarily to setting up your batch jobs. The following activities are also permitted:

* '''short''' compilation of your program code and
* '''short''' pre- and postprocessing of your batch jobs.

To guarantee usability for all the users of bwUniCluster 2.0 '''you must not run your compute jobs on the login nodes'''. Compute jobs must be submitted to the
[[bwUniCluster Batch Jobs|queueing system]]. Any compute job running on the login nodes will be terminated without any notice. Any long-running compilation or any long-running pre- or postprocessing of batch jobs must also be submitted to the [[bwUniCluster Batch Jobs|queueing system]].
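
As an illustration of moving a long-running build off the login nodes, the sketch below prints a minimal Slurm job script that runs <code>make</code> inside a batch job. It is an assumption-laden example rather than a recommended configuration: the function name <code>print_build_job</code>, the job name, the four CPU cores and the default time limit are all hypothetical, and no partition is selected, although one may be required (see the [[bwUniCluster Batch Jobs|queueing system]] documentation).

```shell
# Hypothetical helper: print a minimal Slurm job script that runs a build
# inside a batch job instead of on a login node.
# $1 = time limit in minutes (defaults to 30).
print_build_job() {
    local minutes="${1:-30}"
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=build
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=${minutes}
# A partition may also have to be selected; none is assumed in this sketch.
make -j "\${SLURM_CPUS_PER_TASK:-1}"
EOF
}

# Usage (on a login node, in the directory containing the Makefile):
#   print_build_job 45 > build.job
#   sbatch build.job
```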
 
 

== SSH Keys ==

In contrast to the bwUniCluster 1 and many other HPC systems it is '''no longer possible to self-manage your SSH Keys by adding them to the ~/.ssh/authorized_keys file'''. Existing files will no longer be evaluated. SSH Keys have to be managed via the central bwIDM system instead. Please refer to the user guide for this functionality:

[[bwUniCluster 2.0 User Access/SSH Keys]]
 
 

= [[First_Steps_on_bwHPC_cluster|First steps on bwUniCluster]] =

The first and most important steps on bwUniCluster 2.0 are described [[First_Steps_on_bwHPC_cluster|here]].
 
 

= Deregistration =

Aka: unsubscribe from bwUniCluster mailing list

If you plan to permanently leave the bwUniCluster 2.0, follow the deregistration checklist:
# Transfer all your data in $HOME and your workspaces to your local computer/storage, then delete all your data on the cluster
# Visit [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu]
#* Select your home organization from the list and click '''Proceed'''
#* Enter your home organisation's user ID / username and password and click the '''Login''' button
#* You will be redirected back to the registration website [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu/]
#* <div>Select '''Registry Info''' of the service '''bwUniCluster''' (on the left hand side) [[File:bwUniCluster_registration_sidebar.png|center|border|]]</div>
#* Click '''Deregister'''
Note that Step 2 will automatically unsubscribe you from the bwUniCluster mailing list.

----
[[Category:bwUniCluster_2.0]][[Category:Access]]

BwUniCluster2.0/Maintenance/2021-10

2021-10-15T08:22:35Z

S Raffeiner:

The following changes were introduced during the maintenance interval between 11.10.2021 (Monday) 08:00 and 15.10.2021 (Friday) 12:00.

The host keys of the system have not changed. You should not receive any warnings from your SSH client(s), but if a warning does appear, or if you want to check that you are connecting to the correct system, you can verify the key hashes against the following list:

{|class="wikitable"
! Algorithm
! Hash (SHA256)
! Hash (MD5)
|-
|RSA
|p6Ion2YKZr5cnzf6L6DS1xGnIwnC1BhLbOEmDdp7FA0
|59:2a:67:44:4a:d7:89:6c:c0:0d:74:ba:3c:c4:63:6d
|-
|ECDSA
|k8l1JnfLf1y1Qi55IQmo11+/NZx06Rbze7akT5R7tE8
|85:d4:d9:97:e0:f0:43:30:6e:66:8e:d0:b6:9b:85:d1
|-
|ED25519
|yEe5nJ5hZZ1YbgieWr+phqRZKYbrV7zRe8OR3X03cn0
|42:d2:0d:ab:87:48:fc:1d:5d:b3:7c:bf:22:c3:5f:b7
|}
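
One possible way to compare a fingerprint against this list from the command line is sketched below. The SHA256 values are copied from the table above; the variable and function names are hypothetical, and the usage lines assume that the OpenSSH tools <code>ssh-keyscan</code> and <code>ssh-keygen</code> are installed on your local machine.

```shell
# SHA256 host key fingerprints copied from the table above.
UC2_KNOWN_FINGERPRINTS="
p6Ion2YKZr5cnzf6L6DS1xGnIwnC1BhLbOEmDdp7FA0
k8l1JnfLf1y1Qi55IQmo11+/NZx06Rbze7akT5R7tE8
yEe5nJ5hZZ1YbgieWr+phqRZKYbrV7zRe8OR3X03cn0
"

# Hypothetical helper: succeeds if the given fingerprint (with or without the
# "SHA256:" prefix that ssh-keygen prints) appears in the list above.
uc2_fingerprint_known() {
    local fp="${1#SHA256:}"
    [ -n "$fp" ] || return 1
    # -F: fixed string, -x: match the whole line, -q: no output
    printf '%s\n' "$UC2_KNOWN_FINGERPRINTS" | grep -Fxq -- "$fp"
}

# Usage: fetch the ED25519 host key and check its SHA256 fingerprint.
#   fp=$(ssh-keyscan -t ed25519 uc2.scc.kit.edu 2>/dev/null \
#        | ssh-keygen -lf - | awk '{print $2}')
#   uc2_fingerprint_known "$fp" && echo "host key matches"
# For the MD5 column, ssh-keygen can be called with "-E md5" instead.
```

Note that a key fetched with <code>ssh-keyscan</code> is only as trustworthy as the network it was fetched over; the comparison is meaningful because the reference values come from this page.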

= Hardware =

* The Broadwell Login nodes (uc1e.scc.kit.edu) have been taken out of service due to ongoing hardware and software issues and low utilisation. It is still possible to generate code for the Broadwell nodes using the compilers installed on the normal login nodes, or by starting an interactive job on the Broadwell compute nodes.

* All firmware versions on all components have been upgraded.

= Operating system =

* The operating system version is still based on Red Hat Enterprise Linux (RHEL) 8.2.

= Compilers, Libraries and Runtime Environments =

* The modules below toolkit/oneAPI have been integrated into the normal module structure.

* Intel oneAPI versions 2021.1.0 to 2021.3.0 have been replaced by the current version 2021.4.0.

* The obsolete Intel compiler version 18.0 has been removed. The officially supported Intel Compiler versions are now 19.0, 19.1 and 2021.4.0 (oneAPI). The latest version, namely 2021.4.0, has become the default compiler.

'''Please note that Intel Compiler Version 2021.4.0 is delivered as part of oneAPI in two different versions'''. The so-called Classic Compilers are based on the previous Intel Compilers. The next generation of Intel Compilers, on the other hand, is based on LLVM and, according to Intel's current plans, will replace the Classic Compilers in the medium term (see [https://software.intel.com/content/www/us/en/develop/blogs/adoption-of-llvm-complete-icx.html]). The new LLVM-based compilers support additional features that the Classic Compilers didn't, e.g. offloading to GPUs.

The Classic compilers are available via the ''compiler/intel/2021.4.0'' module and are marked as default. The LLVM-based compilers are available via the ''compiler/intel/2021.4.0_llvm'' module and are not marked as default.

We recommend that you start testing the new LLVM-based compilers now. Please note that they accept different command line arguments than the Classic compilers and produce different compiler messages. You may therefore have to modify your build scripts. If you do not use environment variables like CC, CXX etc., you also have to keep in mind that the LLVM-based compilers use different names for the commands (''icx'', ''icpx'', ''dpcpp''). Further information can be found, for example, at [https://software.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top/compiler-setup/using-the-command-line/invoking-the-compiler.html].
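
Build scripts that should work with either compiler generation can select the module and the command names in one place. The following is a minimal sketch under stated assumptions: the function name <code>intel_compiler_setup</code> is hypothetical, the module names are the two mentioned above, and ''icc'' / ''icpc'' are the usual command names of the Classic C and C++ compilers, which this page does not list explicitly.

```shell
# Hypothetical helper: print the module name followed by the C and C++
# compiler commands for the chosen flavour of the Intel 2021.4.0 compilers.
intel_compiler_setup() {
    case "$1" in
        classic) echo "compiler/intel/2021.4.0 icc icpc" ;;
        llvm)    echo "compiler/intel/2021.4.0_llvm icx icpx" ;;
        *)       echo "usage: intel_compiler_setup classic|llvm" >&2; return 1 ;;
    esac
}

# Usage in a Bash build script on the cluster:
#   read -r mod CC CXX <<< "$(intel_compiler_setup llvm)"
#   module load "$mod"
#   export CC CXX
#   make
```

Keeping the choice in a single function means that switching between the Classic and the LLVM-based compilers only requires changing one argument, although compiler flags may still need to be adapted separately.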

= Userspace tools =

* The workspace tools have been switched to a different, more modern implementation. The workspace management commands have not changed, and all existing workspaces have been converted automatically, but the new commands produce slightly different messages on the command line. You may therefore have to modify scripts that depend on this output.

= Software Modules =

* The ''math/R'' software module has been updated to the latest version 4.1.x.

= Batch system =

* The Slurm version has been updated to 20.11.5, the same version currently being used on HoreKa.

= Storage =

* The Lustre file systems have been updated.

= Graphics stack =

* The NVIDIA driver has been updated to version 470.57.02.

= Containers =

* Enroot has been updated to version 3.3.1.

* Singularity has been updated to version 3.8.3.

= JupyterHub =

* The JupyterHub version has been updated to version 1.4.2.

BwUniCluster2.0/Maintenance/2021-10

2021-09-29T08:24:44Z

S Raffeiner: /* Operating system */

The following changes are currently planned for the maintenance interval starting on 11.10.2021 (Monday) at 08:00 and ending on 15.10.2021 (Friday) at 12:00.

This list will be extended and refined over the coming days and weeks and is subject to change.

= Hardware =

* The Broadwell Login nodes (uc1e.scc.kit.edu) will be taken out of service permanently due to ongoing hardware and software issues and low utilisation. It is still possible to generate code for the Broadwell nodes using the compilers installed on the normal login nodes, or by starting an interactive job on the Broadwell compute nodes.

* All firmware versions on all components will be upgraded.

= Operating system =

* The operating system version will continue to be based on Red Hat Enterprise Linux (RHEL) 8.2.

= Compilers, Libraries and Runtime Environments =

= Development tools =

= Userspace tools =

* The workspace tools will be switched to a different, more modern implementation. The workspace management commands will not change, and all existing workspaces will be converted automatically.

= Software Modules =

The following list is not exhaustive.

* Update math/R software module to latest version 4.1.x.

= Batch system =

= Storage =

* Upgrade of the Lustre file systems

= Graphics stack =

* The NVIDIA driver will be updated to the most recent version available.

= Containers =

= JupyterHub =

* The JupyterHub version will be updated to a more recent version.

BwUniCluster2.0/Maintenance/2021-10

2021-09-29T07:25:55Z

S Raffeiner: /* Userspace tools */

The following changes are currently planned for the maintenance interval starting on 11.10.2021 (Monday) at 08:00 and ending on 15.10.2021 (Friday) at 12:00.

This list will be extended and refined over the coming days and weeks and is subject to change.

= Hardware =

* The Broadwell Login nodes (uc1e.scc.kit.edu) will be taken out of service permanently due to ongoing hardware and software issues and low utilisation. It is still possible to generate code for the Broadwell nodes using the compilers installed on the normal login nodes, or by starting an interactive job on the Broadwell compute nodes.

* All firmware versions on all components will be upgraded.

= Operating system =

* The operating system version may be upgraded to Red Hat Enterprise Linux (RHEL) 8.4. Should this happen, we recommend recompiling all applications after the upgrade.

= Compilers, Libraries and Runtime Environments =

= Development tools =

= Userspace tools =

* The workspace tools will be switched to a different, more modern implementation. The workspace management commands will not change, and all existing workspaces will be converted automatically.

= Software Modules =

The following list is not exhaustive.

* Update math/R software module to latest version 4.1.x.

= Batch system =

= Storage =

* Upgrade of the Lustre file systems

= Graphics stack =

* The NVIDIA driver will be updated to the most recent version available.

= Containers =

= JupyterHub =

* The JupyterHub version will be updated to a more recent version.

Category:BwUniCluster 2.0

2021-09-27T08:03:50Z

S Raffeiner:

{| style="width: 100%; border-spacing: 5px;"
| style="text-align:center; color:#000;vertical-align:middle;font-size:75%;" |
[[File:BwUniCluster_2.0_Feb2020.jpg|center|border|550px|Close-up of bwUniCluster by Simon Raffeiner, Copyright: KIT (SCC)]]
|-
| style="text-align:center; color:#000;vertical-align:middle;" |Close-up of bwUniCluster © KIT (Simon Raffeiner/SCC)
|}

On 17.03.2020, the Steinbuch Centre for Computing (SCC) at Karlsruhe Institute of Technology (KIT) commissioned a new parallel computer system called "bwUniCluster 2.0+GFB-HPC" as a state service within the bwHPC framework. The bwUniCluster 2.0 replaces the predecessor system [[bwUniCluster]] and also includes the additional compute nodes which were procured as an extension to the bwUniCluster in November 2016.

The modern bwUniCluster 2.0 system consists of more than 840 SMP nodes with 64-bit Intel Xeon processors. It provides the universities of the state of Baden-Württemberg with general compute resources and can be used free of charge by the staff of all universities in Baden-Württemberg. Users who currently have access to bwUniCluster will automatically also have access to bwUniCluster 2.0. There is no need to apply for new entitlements or to re-register.


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:100%; border:1px solid #BBBBBB; background:lightyellow; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{yellow}}| Next maintenance
|-
|
Due to regular maintenance work the HPC system bwUniCluster 2.0 will not be available from

11.10.2021 at 08:00 AM until 15.10.2021 at 12:00 AM

Please see the [[BwUniCluster_2.0_Maintenance/2021-10|maintenance page]] for more information about planned upgrades and other changes.
|}
|}


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:100%; border:1px solid #BBBBBB; background:#fff5fa; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Red}}| New security measures
|-
|
On 13.08.2020 at 10 AM the following changes to the security policies will take effect:

* For authentication, the use of a second factor (2-factor authentication) in addition to the service password will be mandatory. [[BwUniCluster 2.0 User Access/2FA Tokens|You can find the user documentation for this function here]].

* The use of SSH keys will be possible again. However, these can no longer be managed via the authorized_keys files, but only centrally via bwIDM. [[BwUniCluster 2.0 User Access/SSH Keys|You can find the user documentation for this function here]].

The following restrictions still apply:

* Access is limited to IP addresses from within the campus networks of the respective home institutions of our current users. If you are outside of one of these networks (e.g. in your home office), a VPN connection to your home institution has to be established first (see e.g. [https://www.scc.kit.edu/dienste/openvpn.php] for the KIT).
|}
|}


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:50%; border:1px solid #BBBBBB; background:#f5fffa; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Green}}| Access
|-
|
* bwUniCluster [[BwUniCluster_2.0_User_Access|Registration and Login]]
* Registration [[bwUniCluster 2.0 Support|support]] & [[BwUniCluster_2.0_User_Access#Deregistration|Deregistration]]
* [[First_Steps_on_bwHPC_cluster|First steps on bwUniCluster]]
* [[Jupyter_at_SCC|Access with Jupyter]]
 
|-
|{{Green}}| Software
|-
|
* [[bwUniCluster_2.0_Software|Software and Environment Modules]]
* [[Containers|Using Containers]]
|}
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Green}}| Hardware
|-
|
* [[bwUniCluster_2.0_Hardware_and_Architecture|Hardware and Architecture]]
* [[BwUniCluster_2.0_Hardware_and_Architecture#File_Systems|File Systems]]
|}

| style="padding:2px;" |

| style="width:50%; border:1px solid #BBBBBB; background:#f5faff; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Blue}}| Batch/Compute Jobs
|-
|
* [[bwUniCluster_2.0_Slurm_common_Features|Slurm common Features]]
* [[BwUniCluster_2.0_Batch_Queues|Batch Queues and interactive Jobs]]
|-
|{{Blue}}| [[BwHPC_Best_Practices_Repository|bwHPC Best Practice Guides]] / FAQs
|-
|




* [[FAQ - bwUniCluster_broadwell_partition|FAQ - bwUniCluster 2.0 Broadwell partition]]
|-
|{{Blue}}| Miscellaneous
|-
|
* [[bwUniCluster_Acknowledgement|Acknowledgement]] of work performed on bwUniCluster (2.0)
* [[BwUniCluster_2.0_File_System_Migration_Guide|File system migration guide]] and [[BwUniCluster_2.0_Batch_System_Migration_Guide|Batch system migration guide]] for users migrating from the former bwUniCluster 1
|}
|}

 
-----
 
 
[[Category:bwHPC_infrastructure]][[Category:bwHPC_Cluster]][[Category:bwCluster]]

Category:BwUniCluster 2.0

2021-09-27T08:03:40Z

S Raffeiner:

BwUniCluster2.0/Maintenance/2021-10

2021-09-22T11:30:41Z

S Raffeiner:

The following changes are currently planned for the maintenance interval starting on 11.10.2021 (Monday) at 8 AM and ending on 15.10.2021 (Friday) at 12 AM.

This list will be extended and refined over the coming days and weeks and is subject to change.

= Hardware =

* The Broadwell Login nodes (uc1e.scc.kit.edu) will be taken out of service permanently due to ongoing hardware and software issues and low utilisation. It is still possible to generate code for the Broadwell nodes using the compilers installed on the normal login nodes, or by starting an interactive job on the Broadwell compute nodes.

* All firmware versions on all components will be upgraded.

= Operating system =

* The operating system version may be upgraded to Red Hat Enterprise Linux (RHEL) 8.4. Should this happen, we recommend recompiling all applications after the upgrade.

= Compilers, Libraries and Runtime Environments =

= Development tools =

= Userspace tools =

* The workspace tools will be switched to a different, more modern implementation. The workspace management commands will not change, but currently it is still being clarified how existing workspaces will be handled.

= Software Modules =

The following list is not exhaustive.

* Update math/R software module to latest version 4.1.x.

= Batch system =

= Storage =

* Upgrade of the Lustre file systems

= Graphics stack =

* The NVIDIA driver will be updated to the most recent version available.

= Containers =

= JupyterHub =

* The JupyterHub version will be updated to a more recent version.

BwUniCluster2.0/Maintenance/2021-10

2021-09-01T11:24:13Z

S Raffeiner:

The following changes are currently planned for the maintenance interval starting on 11.10.2021 (Monday) at 8 AM and ending on 15.10.2021 (Friday) at 12 AM.

This list will be extended and refined over the coming days and weeks and is subject to change.

= Operating system =

* The operating system version is planned to be upgraded to Red Hat Enterprise Linux (RHEL) 8.4. We recommend recompiling all applications after the upgrade.

= Compilers, Libraries and Runtime Environments =

= Development tools =

= Userspace tools =

* The workspace tools will be switched to a different, more modern implementation. The workspace management commands will not change, but currently it is still being clarified how existing workspaces will be handled.

= Software Modules =

The following list is not exhaustive.

* Update math/R software module to latest version 4.1.x.

= Batch system =

= Storage =

* Upgrade of the Lustre file systems

= Graphics stack =

= Containers =

BwUniCluster2.0/Maintenance/2021-10

2021-08-20T11:28:23Z

S Raffeiner: /* Storage */

The following changes are currently planned for the maintenance interval starting on 11.10.2021 (Monday) at 8 AM and ending on 15.10.2021 (Friday) at 12 AM.

This list will be extended and refined over the coming days and weeks and is subject to change.

= Operating system =

* The operating system version will be upgraded to Red Hat Enterprise Linux (RHEL) 8.4. We recommend recompiling all applications after the upgrade.

= Compilers, Libraries and Runtime Environments =

= Development tools =

= Userspace tools =

* The workspace tools will be switched to a different, more modern implementation. The workspace management commands will not change, but currently it is still being clarified how existing workspaces will be handled.

= Software Modules =

The following list is not exhaustive.

= Batch system =

= Storage =

* Upgrade of the Lustre file systems

= Graphics stack =

= Containers =

BwUniCluster2.0/Maintenance

2021-08-20T11:26:01Z

S Raffeiner:

= 2020 =

[[BwUniCluster 2.0 Maintenance/2020-10]] from 06.10.2020 to 13.10.2020

= 2021 =

[[BwUniCluster 2.0 Maintenance/2021-10]] from 11.10.2021 to 15.10.2021

[[Category:BwUniCluster 2.0]]

BwUniCluster2.0/Maintenance/2021-10

2021-08-20T11:21:29Z

S Raffeiner:

The following changes are currently planned for the maintenance interval starting on 11.10.2021 (Monday) at 8 AM and ending on 15.10.2021 (Friday) at 12 AM.

This list will be extended and refined over the coming days and weeks and is subject to change.

= Operating system =

* The operating system version will be upgraded to Red Hat Enterprise Linux (RHEL) 8.4. We recommend recompiling all applications after the upgrade.

= Compilers, Libraries and Runtime Environments =

= Development tools =

= Userspace tools =

* The workspace tools will be switched to a different, more modern implementation. The workspace management commands will not change, but currently it is still being clarified how existing workspaces will be handled.

= Software Modules =

The following list is not exhaustive.

= Batch system =

= Storage =

= Graphics stack =

= Containers =

BwUniCluster2.0/Maintenance/2021-10

2021-08-20T11:13:50Z

S Raffeiner:

The following changes are planned for the maintenance interval starting on 11.10.2021 (Monday) at 8 AM and ending on 15.10.2021 (Friday) at 12 AM.

= Operating system =

* The operating system version will be upgraded to Red Hat Enterprise Linux (RHEL) 8.4. We recommend recompiling all applications after the upgrade.

= Compilers, Libraries and Runtime Environments =

= Development tools =

= Userspace tools =

= Software Modules =

The following list is not exhaustive.

= Batch system =

= Storage =

= Graphics stack =

= Containers =

BwUniCluster2.0/Maintenance/2021-10

2021-08-20T11:01:21Z

S Raffeiner: Created page with "= A ="

= A =

BwUniCluster2.0/Support

2021-08-04T11:56:45Z

S Raffeiner:

== Registration ==

{| style="vertical-align:top;background:#f5fffa;border:2px solid #000000;"
| The primary support channel for all inquiries is the ticket system at

* '''[https://bw-support.scc.kit.edu/ bwSupport Portal]'''
|}

If you are having issues connected to your local home institution, e.g. getting the entitlement for registration on bwUniCluster 2.0, you may also contact your local hotline:

{| style="vertical-align:top;"
! University !! Hotline
|-
| Albert Ludwig University of Freiburg || hpc-support (ät) hpc.uni-freiburg.de
|-
| Eberhard Karls University, Tübingen || hpcmaster (ät) uni-tuebingen.de
|-
| Hochschule Esslingen || cluster-support (ät) hs-esslingen.de
|-
| Karlsruhe Institute of Technology || servicedesk (ät) scc.kit.edu
|-
| Ruprecht-Karls-Universität Heidelberg || hpc-support (ät) urz.uni-heidelberg.de
|-
| Ulm University || helpdesk (ät) uni-ulm.de
|-
| University of Hohenheim || kim-bw-projekt (ät) uni-hohenheim.de
|-
| University of Konstanz || support (ät) uni-konstanz.de
|-
| University of Mannheim ||hpc-support (ät) mailman.uni-mannheim.de
|-
| University of Stuttgart || bwunicluster (ät) hlrs.de
|}

== Problems with Login/2-Factor-Authentication ==

'''Q: How do I register/deactivate a token?'''

A: Please refer to the [[bwUniCluster 2.0 User Access/2FA Tokens|2FA documentation wiki page]].

 

'''Q: I have lost my 2FA device or can no longer generate One-Time Passwords using my software token.'''

A: KIT users can contact the [https://www.scc.kit.edu/en/services/servicedesk.php KIT ServiceDesk]; users from all other institutions should open a support ticket via the [https://bw-support.scc.kit.edu/ bwSupport Portal].

== Deregistration ==

If you are no longer using the bwUniCluster 2.0 and want to deregister yourself, please follow the [[BwUniCluster_2.0_User_Access#Deregistration|Deregistration instructions]]. This will also automatically remove your e-mail address from the user mailing list and you will no longer receive user announcements for this system.

----
[[Category:Support]][[Category:bwUniCluster_2.0]]

BwUniCluster 2.0 User Access

2021-08-04T11:42:02Z

S Raffeiner: /* Login */

[[bwUniCluster_2.0|bwUniCluster 2.0]] is Baden-Württemberg's general purpose tier 3 high performance computing (HPC)
cluster co-financed by Baden-Württemberg's ministry of science, research and arts and the shareholders:

* Albert Ludwig University of Freiburg
* Eberhard Karls University, Tübingen
* Karlsruhe Institute of Technology (KIT)
* Heidelberg University (Ruprecht-Karls-Universität Heidelberg)
* Ulm University
* University of Hohenheim
* University of Konstanz
* University of Mannheim
* University of Stuttgart
* HAW BW e.V. (an association of several universities of applied sciences in Baden-Württemberg, see below)
 
To '''log on''' to [[bwUniCluster_2.0|bwUniCluster 2.0]], a user account is required. All members of the shareholder
universities can apply for an account.
 

{| style="width: 100%; border-spacing: 5px;"
| style="text-align:left; color:#000;vertical-align:top;" |__TOC__
| [[File:bwUniCluster_17Jan2014_p044-rot_t10.10.00.jpg|center|border|250px|bwUniCluster wiring by Holger Obermaier, copyright: KIT (SCC)]] bwUniCluster wiring © KIT (SCC)
|}

= Registration =

Granting access and issuing a user account for '''bwUniCluster 2.0''' requires registration at the KIT service website
* [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] (step B).
However, this registration depends on the
* '''bwUniCluster entitlement''' (step A)
issued by your university.
 
Please log in to
* https://bwidm.scc.kit.edu/
to see a list of your entitlements. If the list contains
<pre> http://bwidm.de/entitlement/bwUniCluster </pre> you already have the entitlement and can skip step A.

== Step A: bwUniCluster entitlement for registration ==
'''The entitlement is called bwUniCluster (not bwUniCluster 2.0)''' and each university issues the bwUniCluster entitlement '''only''' for their own respective members. Some have established on-line processes or provide downloads of the entitlement application forms. If there is no link behind the name of an institution in the following list, please contact the local IT support services:

* [[BwCluster_User_Access_Uni_Freiburg|Albert Ludwig University of Freiburg]]
* [https://bwunicluster.urz.uni-heidelberg.de/ Heidelberg University]
* [https://kim.uni-hohenheim.de/bwhpc-account University of Hohenheim]
* [http://www.scc.kit.edu/downloads/ism/Accessform_bwUniCluster_DE_EN.pdf Karlsruhe Institute of Technology (KIT)]
* [[BWUniCluster_User_Access_Members_Uni_Konstanz|University of Konstanz]]
* [[BWUniCluster_User_Access_Members_Uni_Mannheim|University of Mannheim]]
* [https://www.hlrs.de/solutions-services/academic-users/bwunicluster-access/ University of Stuttgart]
* [https://uni-tuebingen.de/de/155157 Eberhard Karls University Tübingen]
* [[BWUniCluster_User_Access_Members_Uni_Ulm|Ulm University]]
* Hochschule Aalen
* Hochschule Albstadt-Sigmaringen
* Hochschule Esslingen
* Hochschule Heilbronn
* Hochschule Karlsruhe
* Hochschule Konstanz
* Hochschule Mannheim
* Hochschule Offenburg
* Hochschule Reutlingen
* Hochschule Rottenburg
* Hochschule Stuttgart (HfT)
* Hochschule Ulm
 

== Step B: Web Registration, service password and 2-factor authentication ==

After completing step A, i.e., after the bwUniCluster entitlement has been successfully issued, you have to register yourself for the service. To do so, please visit [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] and complete the following steps.

1. Select your home organization from the list on the main page and click '''Proceed''' or '''Fortfahren'''.

[[File:Bwidm-register-red.png|center|border|]]
 

2. You will be directed to the ''Identity Provider'' of your home organisation. Enter the user ID / username and password of your home organisation - this is usually the same password used for your e-mail account and other services - and click on '''Login''', '''Einloggen''' or something similar.
 

3. You will be redirected back to the registration website [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu/]. If you are logging into bwIDM for the first time, there will be a summary screen which shows the account details your home institution is providing to the central system. Please check that all data is valid and then click on '''Continue''' or '''Weiter'''.
 

4. Once you have successfully logged into the bwIDM system, you will be greeted by a home screen showing all state-wide services you have access to. There will be a box labelled "bwUniCluster". Click on '''Register''' or '''Registrieren''' to start the registration process.

[[File:Bwidm-2-red.png|center|border|]]
 

5. Since August 13, 2020, '''2-factor authentication''' (2FA) has been enforced to improve security. If you have never registered a 2FA token on bwIDM before, the following error message will appear:

[[File:Bwidm-3-red.png|center|]]

Click on the [https://bwidm.scc.kit.edu/user/twofa.xhtml Link] or on the '''My Tokens''' link in the main menu. The instructions for registering a new 2FA token can be found on the following page: [[bwUniCluster 2.0 User Access/2FA Tokens]]. Please complete them before proceeding.
 

6. Make sure all requirements are met by checking the '''Requirements''' box at the top. If the requirements are not met, you might be able to correct the issue by following the instructions. In all other cases please [[Registration_Support_-_bwUniCluster|contact your local hotline]].

[[File:BwUniCluster 2.0 access login bwidm registration requirements.png|center|border|]]
 

7. Read the Terms of Use ('''Nutzungsbedingungen und -richtlinien'''), check the box beside '''I have read and accepted the terms of use''' and click on '''Register''' or '''Registrieren'''.
 

8. Set a service password for the bwUniCluster and click on '''Save''' or '''Speichern'''. Logging in with the password of your home organisation, like on the former bwUniCluster 1, is no longer possible. Please make sure to use a strong password which is different from any other password you are currently using or have used on other systems. You will also be asked to change the service password regularly.

[[File:Bwidm-5-red.png|center|]]
 
 

== Step C: Fill out the bwUniCluster questionnaire ==

Filling out the bwUniCluster questionnaire on

https://zas.bwhpc.de/shib/en/bwunicluster_survey.php

is mandatory for all users. The input is solely used to improve our support activities and for capacity planning of future HPC resources. '''If the questionnaire is not filled out, access to bwUniCluster 2.0 is blocked 14 days after the registration.'''
 
 

== Changing the Service Password ==

Your bwUniCluster 2.0 '''password''' is the service password you set during the web registration (compare step 8 of Step B). At any time, you can set a new bwUniCluster 2.0 password via the registration website [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] by carrying out the following steps:
# Go to [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu] and select your home organization
# Authenticate yourself via the user id / username and password provided by your home institution
# Find the entry '''bwUniCluster''' and select '''Set Service Password'''
# Enter the new password, repeat it and click the '''Save''' button.
# If the change was successful, the message "Das Passwort wurde bei dem Dienst geändert" ("Password has been changed") will be shown
# Proceed to log in using the new password
 
 
== Contact / Support ==
If you have questions or problems concerning the bwUniCluster (2.0) registration, please [[bwUniCluster 2.0 Support|contact your local hotline]].
 
 

= Establishing network access =

Access to bwUniCluster 2.0 is '''limited to IP addresses from the so-called BelWü networks'''. All home institutions of our current users are connected to BelWü, so if you are on your campus network (e.g. in your office or on the campus WiFi) you should be able to connect to bwUniCluster 2.0 without restrictions. If you are outside of one of the BelWü networks (e.g. in your home office instead of in your campus office), a VPN connection to your home institution has to be established first (see e.g. [https://www.scc.kit.edu/dienste/openvpn.php] for the KIT).
 
 

= Login =

After finishing the web registration and making sure that you are on a network from which you have access to bwUniCluster 2.0 (e.g. by establishing a VPN connection), the HPC cluster is ready for your '''SSH''' based login. Recommended SSH client applications are:

* the ''ssh'' (OpenSSH) command, included in all Linux distributions and macOS and run from a terminal application
* [http://mobaxterm.mobatek.net/ MobaXterm] under Windows
 

== Hostnames ==

The main hostname required to connect to bwUniCluster 2.0 is '''bwunicluster.scc.kit.edu''' or '''uc2.scc.kit.edu'''. The system has four login nodes and we use so-called ''DNS round-robin scheduling'' to load-balance the incoming connections between the nodes. If you open multiple SSH sessions to bwUniCluster 2.0, these sessions will be established to different login nodes, so processes started in one session might not be visible in other sessions.

The older Broadwell extension partition of the former bwUniCluster 1 is connected to bwUniCluster 2.0. You can use the hostname '''uc1e.scc.kit.edu''' to connect to the login nodes of this partition.

If you need to connect to specific login nodes, you can use the following hostnames:

{| class="wikitable"
! Hostname !! Node type
|-
| '''uc2-login1.scc.kit.edu''' || bwUniCluster 2.0, first login node
|-
| '''uc2-login2.scc.kit.edu''' || bwUniCluster 2.0, second login node
|-
| '''uc2-login3.scc.kit.edu''' || bwUniCluster 2.0, third login node
|-
| '''uc2-login4.scc.kit.edu''' || bwUniCluster 2.0, fourth login node
|-
| '''uc1e-login1.scc.kit.edu''' || Broadwell partition, first login node
|-
| '''uc1e-login2.scc.kit.edu''' || Broadwell partition, second login node
|-
|}

Only the secure shell protocol (''SSH'') may be used to log in. Other protocols like ''telnet'' or ''rlogin'' are not allowed for security reasons.
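
The per-node hostnames in the table above follow a regular pattern, so a small shell helper can build them when a session has to land on a specific login node. This is only a sketch: the function name ''uc2_login_host'' is hypothetical and not a tool provided on the cluster, and it covers only the four bwUniCluster 2.0 login nodes listed above.

```shell
# Print the hostname of a specific bwUniCluster 2.0 login node (1-4).
# uc2_login_host is a hypothetical helper name, not a cluster-provided tool.
uc2_login_host() {
    case "$1" in
        1|2|3|4) printf 'uc2-login%s.scc.kit.edu\n' "$1" ;;
        *) echo "login node must be 1-4" >&2; return 1 ;;
    esac
}

uc2_login_host 2
# Connect to that node explicitly (replace <UserID> with your username):
#   ssh "<UserID>@$(uc2_login_host 2)"
```

Pinning a node this way bypasses the DNS round-robin alias, which is useful mainly when you need to return to a process started in an earlier session.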
 

== Usernames ==

Your username will be the same as the one provided by your home institution, but '''prefixed''' with two characters and an underscore indicating your home institution. For example, if you are a member of the University of Konstanz and your local username is ab1234, your username on bwUniCluster 2.0 is kn_ab1234.

The following list contains all prefixes currently in use:

{| class="wikitable"
! Home organization !! <UserID>
|-
| Universität Freiburg || ''fr_''username
|-
| Universität Heidelberg || ''hd_''username
|-
| Universität Hohenheim || ''ho_''username
|-
| KIT || username ''(without any prefix)''
|-
| Universität Konstanz || ''kn_''username
|-
| Universität Mannheim || ''ma_''username
|-
| Universität Stuttgart || ''st_''username
|-
| Universität Tübingen || ''tu_''username
|-
| Universität Ulm || ''ul_''username
|-
| Hochschule Aalen || ''aa_''username
|-
| Hochschule Albstadt-Sigmaringen || ''as_''username
|-
| Hochschule Esslingen || ''es_''username
|-
| Hochschule Heilbronn || ''hn_''username
|-
| Hochschule Karlsruhe || ''hk_''username
|-
| HTWG Konstanz || ''ht_''username
|-
| Hochschule Mannheim || ''mn_''username
|-
| Hochschule Offenburg || ''of_''username
|-
| Hochschule Reutlingen || ''hr_''username
|-
| Hochschule Rottenburg || ''ro_''username
|-
| Hochschule für Technik Stuttgart || ''hs_''username
|-
| Hochschule Ulm || ''hu_''username
|-
|}
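
The prefix rule from the table can be sketched as a short shell function. The name ''uc2_username'' is hypothetical and used for illustration only; an empty prefix models KIT accounts, which are used without any prefix.

```shell
# Build a bwUniCluster 2.0 username from an institution prefix and the
# local username. uc2_username is a hypothetical helper for illustration.
uc2_username() {
    prefix="$1"
    local_user="$2"
    if [ -z "$prefix" ]; then
        # KIT accounts carry no prefix.
        printf '%s\n' "$local_user"
    else
        printf '%s_%s\n' "$prefix" "$local_user"
    fi
}

uc2_username kn ab1234   # University of Konstanz example from the text
uc2_username "" ab1234   # KIT account: username stays unchanged
```

The result is the value to use wherever this page writes <UserID>.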
 

== Client application: OpenSSH ==

Most Unix and Unix-like operating systems like Linux, macOS and *BSD come with a built-in SSH client provided by the OpenSSH project. More recent versions of Windows 10 and the Windows Subsystem for Linux also come with a built-in OpenSSH client.

To use this client, simply open a command line terminal (the exact process differs on every operating system, but usually involves starting an application called '''Terminal''' or '''Command Prompt''') and enter the following command to connect to bwUniCluster 2.0:

<pre>
$ ssh <UserID>@bwunicluster.scc.kit.edu
</pre>

If you are on a Linux or Unix system running the X Window System (X11) and want to use a GUI-based application on bwUniCluster 2.0, you can use the ''-X'' option for the ssh command to set up X11 forwarding:

<pre>
$ ssh -X <UserID>@uc2.scc.kit.edu
</pre>

Windows users requiring X11 forwarding for graphical applications should use '''MobaXterm''' instead.
 

== Client application: MobaXterm ==

The bwHPC-C5 support team strongly recommends using [http://mobaxterm.mobatek.net/ MobaXterm] instead of ''PuTTY'' or ''WinSCP'' on Windows. ''MobaXterm'' provides a built-in X11 server, which allows you to start GUI-based software.

Start ''MobaXterm'' and fill in the following fields:
<pre>
Remote name : uc2.scc.kit.edu # or uc1e.scc.kit.edu
Specify user name : <UserID>
Port : 22
</pre>

After that, click on 'OK'. A terminal will open in which you can enter your credentials.
 

== Client application: FileZilla ==

Many GUI applications that support SFTP transfers on Linux don't work well with 2-factor authentication, e.g. Nautilus and Dolphin don't support it. A good alternative for Linux is FileZilla.

Start FileZilla, select "File -> Site Manager..." from the main menu and set up a new connection with the following settings:

<pre>
Protocol: SFTP - SSH File Transfer Protocol
Host: uc2.scc.kit.edu
Logon Type: Interactive
User: <UserID>
</pre>

Then click on the "Connect" button.

Files can be transferred between the local system and the cluster by navigating to the respective folders in the split file view and then either dragging files and folders between the views or by clicking on a file/folder with the right mouse button and then selecting "Upload" or "Download" from the menu.

== Example login process ==

After the connection has been initiated, a successful login process will go through the following three steps:

1. The system asks for a '''One-Time Password'''. Generate one using the Software or Hardware Token registered on the bwIDM system (see [[bwUniCluster 2.0 User Access/2FA Tokens]]) and enter it after the '''Your OTP:''' prompt.

2. The system asks for your service password. Enter it after the '''Password:''' prompt.

3. You are greeted by the bwUniCluster 2.0 banner followed by a shell.

The result should look like this:

[[File:BwUniCluster 2.0 access login example.png|center|]]
 

== Troubleshooting ==

'''Issue: The "Your OTP:" prompt never appears and the connection hangs/times out instead'''

Likely cause: You are most likely not on a network from which access to the bwUniCluster 2.0 system is allowed. Please check if you might have to establish a VPN connection first.

 

'''Issue: The system asks for the One-Time Password multiple times'''

Likely cause: Make sure you are using the correct Software Token to generate the One-Time Password.

 

'''Issue: The system asks for the service password multiple times'''

Likely cause: Make sure you are using the service password set on bwIDM and not the password valid for your home institution. Unlike the bwUniCluster 1, the bwUniCluster 2.0 only accepts the service password.

 

'''Issue: There is an error message from the pam_ses_open.sh script'''

Likely cause: Your account is in the "LOST_ACCESS" state because the entitlement is no longer valid, the questionnaire was not filled out or there was a problem during the communication between your home institution and the central bwIDM system. Please try the following steps:

* Log into [https://bwidm.scc.kit.edu bwIDM], look for the bwUniCluster entry and click on '''Registry info'''. Your "Status:" should be "ACTIVE". If it is not, please wait for ten minutes since logging into the bwIDM causes a refresh and the problem might fix itself. If the status does not change to ACTIVE after a longer amount of time, please contact the support channels.

* If you have not filled out the questionnaire, please do so on [https://zas.bwhpc.de/shib/en/bwunicluster_survey.php https://zas.bwhpc.de/shib/en/bwunicluster_survey.php] and then wait for about ten minutes before attempting to log into the HPC system again.
 
 

== Allowed activities on login nodes ==

The login nodes of bwUniCluster 2.0 are the access point to the compute system and to your bwUniCluster 2.0 $HOME directory. The login nodes are shared with all the users of bwUniCluster 2.0. Therefore, your activities on the login nodes are primarily limited to setting up your batch jobs. Your activities may also be:

* '''short''' compilation of your program code and
* '''short''' pre- and postprocessing of your batch jobs.

To guarantee usability for all the users of bwUniCluster 2.0 '''you must not run your compute jobs on the login nodes'''. Compute jobs must be submitted to the
[[bwUniCluster Batch Jobs|queueing system]]. Any compute job running on the login nodes will be terminated without any notice. Any long-running compilation or any long-running pre- or postprocessing of batch jobs must also be submitted to the [[bwUniCluster Batch Jobs|queueing system]].
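
As a rough illustration of moving a long-running build off the login nodes, the following sketch writes a minimal Slurm batch script. The helper name ''write_build_job'', the time limit and the ''make'' build command are placeholder assumptions. The partition is deliberately left out because it is site-specific; please take the appropriate queue from the batch system documentation.

```shell
# Write a minimal Slurm batch script for a long-running build to stdout.
# write_build_job is a hypothetical helper; the time limit and the
# "make" build command are placeholder assumptions.
write_build_job() {
    job_name="$1"
    time_limit="$2"
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=${job_name}
#SBATCH --ntasks=1
#SBATCH --time=${time_limit}

# Run the build on a compute node instead of a login node.
make
EOF
}

write_build_job long_build 00:30:00
# Save the output to a file and submit it to the queueing system:
#   write_build_job long_build 00:30:00 > build_job.sh
#   sbatch build_job.sh
```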
 
 

== SSH Keys ==

In contrast to the bwUniCluster 1 and many other HPC systems it is '''no longer possible to self-manage your SSH Keys by adding them to the ~/.ssh/authorized_keys file'''. Existing files will no longer be evaluated. SSH Keys have to be managed via the central bwIDM system instead. Please refer to the user guide for this functionality:

[[bwUniCluster 2.0 User Access/SSH Keys]]
 
 

= [[First_Steps_on_bwHPC_cluster|First steps on bwUniCluster]] =

The first and most important steps on bwUniCluster 2.0 can be found [[First_Steps_on_bwHPC_cluster|here]].
 
 

= Deregistration =

Aka: unsubscribe from bwUniCluster mailing list

If you plan to permanently leave the bwUniCluster 2.0, follow the deregistration checklist:
# Transfer all your data in $HOME and your workspaces to your local computer/storage and then delete all your data from the cluster
# Visit [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu]
#* Select your home organization from the list and click '''Proceed'''
#* Enter your home-organisational user ID / username and your home-organisational password and click '''Login''' button
#* You will be redirected back to the registration website [https://bwidm.scc.kit.edu/ https://bwidm.scc.kit.edu/]
#* <div>Select '''Registry Info''' of the service '''bwUniCluster''' (on the left hand side) [[File:bwUniCluster_registration_sidebar.png|center|border|]]</div>
#* Click '''Deregister'''
Note that Step 2 will automatically unsubscribe you from the bwUniCluster mailing list.

----
[[Category:bwUniCluster_2.0]][[Category:Access]]

MediaWiki:Sidebar

2021-05-26T06:38:59Z

S Raffeiner:

* SEARCH
* bwHPC Wiki
** mainpage|Home
** BwHPC_Best_Practices_Repository|Best Practices
** Category:BwHPC News|bwHPC News
** helppage|Wiki help
* Best Practice Guides
** BwHPC_Best_Practices_Repository|Overview
** Batch_Jobs|-- Batch Jobs
** Software_Modules|-- Software Modules
** BwHPC_BPG_Compiler|-- Compiler
** BwHPC_BPG_Numerical_Libraries|-- Numerical Libraries
** BwHPC_BPG_for_Parallel_Programming|-- Parallel Programming
* bwHPC tier 3
** Category:BwUniCluster_2.0|bwUniCluster_2.0
** Category:BwForCluster_JUSTUS_2|bwForCluster JUSTUS_2
** Category:bwForCluster_MLS&WISO|bwForCluster MLS&WISO
** Category:BwForCluster_NEMO|bwForCluster NEMO
** Category:BwForCluster_BinAC|bwForCluster BinAC
* bwHPC tier 1+2
** https://kb.hlrs.de/platforms/index.php/HPE_Hawk|Hawk
** https://www.nhr.kit.edu/userdocs/horeka/|HoreKa
* bwHPC Support Services
** https://training.bwhpc.de|bwHPC courses
** http://www.support.bwhpc-c5.de/|Support/Ticket System
** https://www.bwhpc.de/software.html|Software Search
* Scientific Data Storage
** Category:Sds-hd|SDS@hd
** https://www.rda.kit.edu/english|bwDataArchive

BwUniCluster2.0/Hardware and Architecture

2021-03-17T15:09:52Z

S Raffeiner: /* Selecting the appropriate file system */

= Architecture of bwUniCluster 2.0 =

The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100). All nodes are connected by a fast InfiniBand interconnect. In addition, the Lustre file system, which is connected by coupling the InfiniBand of the file server with the InfiniBand switch of the compute cluster, provides bwUniCluster 2.0 with a fast and scalable parallel file system.

The operating system on each node is Red Hat Enterprise Linux (RHEL) 7.7. A number of additional software packages, such as SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly discussed in this document. Others, which are of greater importance to system administrators, are not covered by this document.

The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user's point of view, the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.

'''Login Nodes'''

The login nodes are the only nodes that are directly accessible by end users. These nodes
are used for interactive login, file management, program development and interactive pre-
and postprocessing. Two nodes are dedicated to this service but they are all accessible via
one address and a DNS round-robin alias distributes the login sessions to the
different login nodes.

'''Compute Nodes'''

The majority of nodes are compute nodes which are managed by a batch system. Users
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).

'''File Server Nodes'''

The hardware of the parallel file system Lustre incorporates some file server nodes; the file
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter "File Systems").

'''Administrative Server Nodes'''

Some other nodes deliver additional services such as resource management, external
network connection and administration. These nodes can be accessed directly by system administrators only.

= Components of bwUniCluster =

{| class="wikitable"
|-
! style="width:9%"|
! style="width:13%"| Compute nodes "Thin"
! style="width:13%"| Compute nodes "HPC"
! style="width:13%"| Compute nodes "HPC Broadwell"
! style="width:13%"| Compute nodes "Fat"
! style="width:13%"| GPU x4
! style="width:13%"| GPU x8
! style="width:13%"| Login
|-
!scope="column"| Number of nodes
| 100 + 60
| 360
| 352
| 6
| 14
| 10
| 4 + 2 (Broadwell)
|-
!scope="column"| Processors
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon E5-2660 v4
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Gold 6248
|
|-
!scope="column"| Number of sockets
| 2
| 2
| 2
| 4
| 2
| 2
| 2
|-
!scope="column"| Processor frequency (GHz)
| 2.1 GHz
| 2.1 GHz
| 2.0 GHz
| 2.1 GHz
| 2.1 GHz
| 2.1 GHz
|
|-
!scope="column"| Total number of cores
| 40
| 40
| 28
| 80
| 40
| 40
| 40 / 20 (Broadwell)
|-
!scope="column"| Main memory
| 96 GB / 192 GB
| 96 GB
| 128 GB
| 3 TB
| 384 GB
| 768 GB
| 384 GB / 128 GB (Broadwell)
|-
!scope="column"| Local SSD
| 960 GB SATA
| 960 GB SATA
| 480 GB SATA
| 4.8 TB NVMe
| 3.2 TB NVMe
| 6.4 TB NVMe
|
|-
!scope="column"| Accelerators
| -
| -
| -
| -
| 4x NVIDIA Tesla V100
| 8x NVIDIA Tesla V100
|
|-
!scope="column"| Interconnect
| IB HDR100 (blocking)
| IB HDR100
| IB FDR
| IB HDR
| IB HDR
| IB HDR
| IB HDR100 (blocking)
|}
Table 1: Properties of the nodes

= File Systems =

Details about changes on the file systems between bwUniCluster 1 and bwUniCluster 2.0 are described in the [[BwUniCluster_2.0_File_System_Migration_Guide|File system migration guide]]. Note that $WORK is deprecated.

On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created during the first login, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime.

Within a batch job further file systems are available:
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.
* On request a parallel on-demand (BeeOND) file system is created which uses the SSDs of the nodes which were allocated to the batch job.
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.

Some of the characteristics of the file systems are shown in Table 2.

{| style="width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px"
|- style="width:20%;height=20px; text-align:left;padding:3px"
! style="background-color:#AAA;padding:3px"| Property
! style="background-color:#AAA;padding:3px"| $TMP
! style="background-color:#AAA;padding:3px"| BeeOND
! style="background-color:#AAA;padding:3px"| $HOME
! style="background-color:#AAA;padding:3px"| Workspace
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Visibility
| style="height=20px; text-align:left;padding:3px"| local node
| style="height=20px; text-align:left;padding:3px"| nodes of batch job
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Lifetime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| permanent
| style="height=20px; text-align:left;padding:3px"| max. 240 days
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Disk space
| style="height=20px; text-align:left;padding:3px"| 960 GB - 6.4 TB (for details see Table 1)
| style="height=20px; text-align:left;padding:3px"| n*250 GB
| style="height=20px; text-align:left;padding:3px"| 1.2 PiB
| style="height=20px; text-align:left;padding:3px"| 4.1 PiB
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Capacity Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes, 1 TiB per user; also per organization
| style="height=20px; text-align:left;padding:3px"| yes, 40 TiB per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Inode Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes, 10 million per user
| style="height=20px; text-align:left;padding:3px"| yes, 30 million per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Backup
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes
| style="height=20px; text-align:left;padding:3px"| no
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Read perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 6 GB/s, depending on type of local SSD / job queue: 520 MB/s @ single / multiple; 800 MB/s @ multiple_e; 6600 MB/s @ fat; 6500 MB/s @ gpu_4; 6500 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 400 MB/s - 500 MB/s, depending on type of local SSDs / job queue: 500 MB/s @ multiple; 400 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Write perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 4 GB/s, depending on type of local SSD / job queue: 500 MB/s @ single / multiple; 650 MB/s @ multiple_e; 2900 MB/s @ fat; 2090 MB/s @ gpu_4; 4060 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 250 MB/s - 350 MB/s, depending on type of local SSDs / job queue: 350 MB/s @ multiple; 250 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total read perf.
| style="height=20px; text-align:left;padding:3px"| n*500-6000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*400-500 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total write perf.
| style="height=20px; text-align:left;padding:3px"| n*500-4000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*250-350 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
|}
---------------------------------------------------------------------------------------------------------
global: all nodes of uc2 access the same file system;
local: each node has its own file system;
permanent: files are stored permanently;
batch job: files are removed at end of the batch job.
---------------------------------------------------------------------------------------------------------
Table 2: Properties of the file systems

== Selecting the appropriate file system ==

In general, you should separate your data and store it on the appropriate file system.
Permanently needed data like software or important results should be stored below $HOME,
but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME,
you can usually restore it from backup. Permanent data which is not needed for months
or exceeds the capacity restrictions should be sent to the LSDF Online Storage
or to the archive and deleted from the file systems. Temporary data which is only needed on a single
node and which does not exceed the disk space shown in the table above should be stored
below $TMP. Temporary data which is only needed during job runs should be stored on a
parallel on-demand file system. Temporary data which can be recomputed or which is the
result of one job and input for another job should be stored in workspaces. The lifetime
of data in workspaces is limited and depends on the lifetime of the workspace, which can be
several months.

For further details please check the chapters below.

== $HOME ==

The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.
You have access to your home directory from all nodes of uc2. A regular backup of these directories
to tape archive is done automatically. The directory $HOME is used to hold those files that are
permanently used like source codes, configuration files, executable programs etc.

On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) $HOME
</pre>
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:
<pre>
lfs quota -ph $(grep $(echo $HOME | sed -e "s|/[^/]*/[^/]*$||") /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME
</pre>

== Workspaces ==

On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel.

Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired, we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket or in an email to the hotline.
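
The total maximum lifetime follows from the period length and the number of renewals. As a minimal sketch of that arithmetic (the function name <code>max_workspace_days</code> is hypothetical and not part of the workspace tools):

```shell
# Hypothetical helper: maximum total workspace lifetime in days.
# Arguments: days per period, number of allowed renewals.
max_workspace_days() {
    period_days=$1
    renewals=$2
    # The initial period plus one further period per renewal.
    echo $(( period_days * (1 + renewals) ))
}

max_workspace_days 60 3   # prints 240
```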

Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.

On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) /pfs/work7
</pre>

=== Reminder for workspace deletion ===

Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:

$ ws_send_ical.sh <workspace> <email>

== Improving Performance on $HOME and workspaces ==

The following recommendations might help to improve throughput and metadata
performance on Lustre filesystems.

=== '''Improving Throughput Performance''' ===

Depending on your application some adaptations might be necessary if you want to reach
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.

When you are designing your application you should consider that the performance of
parallel filesystems is generally better if data is transferred in large blocks and stored in
few large files. In more detail, to increase the throughput performance of a parallel application
the following aspects should be considered:

* collect large chunks of data and write them sequentially at once,

* to exploit complete filesystem bandwidth use several clients,

* avoid competing file access by different tasks or use blocks with boundaries at stripe size (default is 1 MB),

* if files are small enough for the SSDs and are only used by one process store them on $TMP.
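
To illustrate the recommendation on block boundaries, the following sketch rounds a block size up to the next multiple of the stripe size. It assumes that the default stripe size of 1 MB corresponds to 1048576 bytes; the function name <code>align_to_stripe</code> is hypothetical.

```shell
# Hypothetical helper: round a byte count up to the next stripe boundary.
# Assumption: the default stripe size of 1 MB means 1048576 bytes.
STRIPE_SIZE=1048576

align_to_stripe() {
    bytes=$1
    # Integer division truncates, so add (STRIPE_SIZE - 1) before dividing.
    echo $(( (bytes + STRIPE_SIZE - 1) / STRIPE_SIZE * STRIPE_SIZE ))
}

align_to_stripe 1500000   # prints 2097152
```

If each task writes blocks of such an aligned size at aligned offsets, no two tasks share a stripe.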

With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even for very large files or in order to reach better performance.

If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir, you can use the command
<pre>
$ lfs setstripe -c-1 $HOME/my_output_dir
</pre>
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this
directory is not changed. If you want to change the stripe count of existing files, change
the stripe count of the parent directory, copy the files to new files, remove the old files
and move the new files back to the old name. In order to check the stripe setting of
the file my_file use
<pre>
$ lfs getstripe my_file
</pre>
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the
backup, i.e. if directories have to be recreated this information is lost and the default stripe
count will be used. Therefore, you should note down for which directories you made changes
to the striping parameters so that you can repeat these changes if required.

=== '''Improving Metadata Performance''' ===

Metadata performance on parallel file systems is usually not as good as with local
filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore,
you should avoid metadata operations whenever possible. For example, it is much better
to have few large files than lots of small files. In more detail, to increase the metadata
performance of a parallel application the following aspects should be considered:

* avoid creating many small files,

* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,

* if many small files are only used within a batch job and accessed by one process store them on $TMP,

* change the default colorization setting of the command ls (see below).
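
The recommendation on separate subdirectories could be sketched as follows. The example assumes that the tasks are started with srun, which sets the environment variable SLURM_PROCID for each task; the function name <code>task_output_dir</code> is hypothetical.

```shell
# Hypothetical helper: create and print a private output directory for the
# calling task. SLURM_PROCID is set by srun for each task; the sketch falls
# back to 0 when the variable is unset (e.g. outside a job step).
task_output_dir() {
    base_dir=$1
    dir="$base_dir/task_${SLURM_PROCID:-0}"
    mkdir -p "$dir" && echo "$dir"
}
```

Each task then creates its files below its own directory, so that the tasks do not compete for the same directory.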

On modern Linux systems, the GNU ls command often uses colorization by default to
visually highlight the file type; this is especially true if the command is run within a terminal
session. This is because the default shell profile initializations usually contain an alias
directive similar to the following for the ls command:
<pre>
$ alias ls="ls --color=tty"
</pre>
However, running the ls command in this way for files on a Lustre file system requires
a stat() call to be used to determine the file type. This can result in a performance
overhead, because the stat() call always needs to determine the size of a file, and that
in turn means that the client node must query the object size of all the backing objects
that make up a file. As a result of the default colorization setting, running a simple
ls command on a Lustre file system often takes as much time as running the ls command
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option
that requires information from a stat() call, is used). To avoid this performance overhead
when using ls commands, add an alias directive similar to the following
to your shell startup script:
<pre>
$ alias ls="ls --color=never"
</pre>

== $TMP ==

While all tasks of a parallel application access the same $HOME and workspace directory, the
$TMP directory is local to each node on bwUniCluster 2.0. Tasks of a parallel
application that run on different nodes therefore use different directories. This directory should
be used for temporary files being accessed by single tasks. All nodes have fast local SSD
storage devices which are used to store data below $TMP.
In addition, this directory should be used for the installation
of software packages. This means that the software package to be installed should be
unpacked, compiled and linked in a subdirectory of $TMP. The real installation of the
package (e.g. make install) should be made into the Lustre filesystem.

Each time a batch
job is started, a subdirectory is created on each node and assigned to the job. $TMP is newly
set; the name of the subdirectory contains the Job-id and the starting time so that the
subdirectory name is unique for each job. This unique name is then assigned to the
environment variable $TMP within the job. At the end of the job the subdirectory is removed.
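
Because the subdirectory is removed at the end of the job, results have to be copied to a global file system before the job finishes. A minimal sketch of this pattern, in which the function name <code>run_in_tmp</code> and the file names are hypothetical and <code>sort</code> merely stands in for the real application:

```shell
# Hypothetical helper: copy an input file to the node-local $TMP, process it
# there, and copy the result back to a directory on a global file system.
# "sort" stands in for the real application.
run_in_tmp() {
    input_file=$1
    result_dir=$2
    # Fall back to /tmp when $TMP is not set (e.g. outside a batch job).
    work_dir="${TMP:-/tmp}"

    cp "$input_file" "$work_dir/input.dat" || return 1
    sort "$work_dir/input.dat" > "$work_dir/output.dat" || return 1
    # $TMP is removed at the end of the job, so save the result before exiting.
    mkdir -p "$result_dir" && cp "$work_dir/output.dat" "$result_dir/"
}
```

Within a job script the function could be called, for example, with an input file below $HOME and a result directory inside a workspace.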

== LSDF Online Storage==

In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.
Furthermore, it can be used on the compute nodes during the job runtime with the constraint flag "LSDF" ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]). An example of LSDF usage in a batch job is also available: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].
<pre>
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=LSDF
</pre>
 
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
Please request storage projects in the LSDF Online Storage separately:
[https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request].

==BeeOND (BeeGFS On-Demand)==

Users of the HPC cluster have the possibility to request a private BeeOND (BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.

'''IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.'''

BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. This feature is currently in beta. If you encounter any problems or have questions, please contact fh-hotline@lists.kit.edu

For detailed usage see here:[[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]

==Backup and Archiving==

There are regular backups of all data of the home directories, whereas ACLs and extended attributes are
not backed up.

Please contact the hotline if you need data restored from backup.

[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]

BwUniCluster2.0/Hardware and Architecture

2021-03-17T15:09:25Z

S Raffeiner: /* Access to other HPC-Filesystems */

= Architecture of bwUniCluster 2.0 =

The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100). All nodes are connected by a fast InfiniBand interconnect. In addition, the parallel file system Lustre is attached to bwUniCluster 2.0 by coupling the InfiniBand of the file server with the InfiniBand
switch of the compute cluster, which provides a fast and scalable
parallel file system.

The operating system on each node is Red Hat Enterprise Linux (RHEL) 7.7. A number of additional software packages such as SLURM have been installed on top. Some of these components are of special interest to end users and are briefly
discussed in this document. Others, which are of greater importance to system
administrators, are not covered by this document.

The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user's point of view, the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.

'''Login Nodes'''

The login nodes are the only nodes that are directly accessible by end users. These nodes
are used for interactive login, file management, program development and interactive pre-
and postprocessing. Two nodes are dedicated to this service but they are all accessible via
one address and a DNS round-robin alias distributes the login sessions to the
different login nodes.

'''Compute Nodes'''

The majority of nodes are compute nodes which are managed by a batch system. Users
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).

'''File Server Nodes'''

The hardware of the parallel file system Lustre incorporates some file server nodes; the file
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter "File Systems").

'''Administrative Server Nodes'''

Some other nodes deliver additional services such as resource management, external
network connection and administration. These nodes can be accessed directly by system administrators only.

= Components of bwUniCluster =

{| class="wikitable"
|-
! style="width:9%"|
! style="width:13%"| Compute nodes "Thin"
! style="width:13%"| Compute nodes "HPC"
! style="width:13%"| Compute nodes "HPC Broadwell"
! style="width:13%"| Compute nodes "Fat"
! style="width:13%"| GPU x4
! style="width:13%"| GPU x8
! style="width:13%"| Login
|-
!scope="column"| Number of nodes
| 100 + 60
| 360
| 352
| 6
| 14
| 10
| 4 + 2 (Broadwell)
|-
!scope="column"| Processors
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon E5-2660 v4
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Gold 6248
|
|-
!scope="column"| Number of sockets
| 2
| 2
| 2
| 4
| 2
| 2
| 2
|-
!scope="column"| Processor frequency (GHz)
| 2.1 GHz
| 2.1 GHz
| 2.0 GHz
| 2.1 GHz
| 2.1 GHz
| 2.1 GHz
|
|-
!scope="column"| Total number of cores
| 40
| 40
| 28
| 80
| 40
| 40
| 40 / 20 (Broadwell)
|-
!scope="column"| Main memory
| 96 GB / 192 GB
| 96 GB
| 128 GB
| 3 TB
| 384 GB
| 768 GB
| 384 GB / 128 GB (Broadwell)
|-
!scope="column"| Local SSD
| 960 GB SATA
| 960 GB SATA
| 480 GB SATA
| 4.8 TB NVMe
| 3.2 TB NVMe
| 6.4 TB NVMe
|
|-
!scope="column"| Accelerators
| -
| -
| -
| -
| 4x NVIDIA Tesla V100
| 8x NVIDIA Tesla V100
|
|-
!scope="column"| Interconnect
| IB HDR100 (blocking)
| IB HDR100
| IB FDR
| IB HDR
| IB HDR
| IB HDR
| IB HDR100 (blocking)
|}
Table 1: Properties of the nodes

= File Systems =

Details about changes on the file systems between bwUniCluster 1 and bwUniCluster 2.0 are described in the [[BwUniCluster_2.0_File_System_Migration_Guide|File system migration guide]]. Note that $WORK is deprecated.

On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created during the first login, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime.

Within a batch job further file systems are available:
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.
* On request a parallel on-demand (BeeOND) file system is created which uses the SSDs of the nodes which were allocated to the batch job.
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.

Some of the characteristics of the file systems are shown in Table 2.

{| style="width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px"
|- style="width:20%;height=20px; text-align:left;padding:3px"
! style="background-color:#AAA;padding:3px"| Property
! style="background-color:#AAA;padding:3px"| $TMP
! style="background-color:#AAA;padding:3px"| BeeOND
! style="background-color:#AAA;padding:3px"| $HOME
! style="background-color:#AAA;padding:3px"| Workspace
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Visibility
| style="height=20px; text-align:left;padding:3px"| local node
| style="height=20px; text-align:left;padding:3px"| nodes of batch job
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Lifetime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| permanent
| style="height=20px; text-align:left;padding:3px"| max. 240 days
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Disk space
| style="height=20px; text-align:left;padding:3px"| 960 GB - 6.4 TB (for details see Table 1)
| style="height=20px; text-align:left;padding:3px"| n*250 GB
| style="height=20px; text-align:left;padding:3px"| 1.2 PiB
| style="height=20px; text-align:left;padding:3px"| 4.1 PiB
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Capacity Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes, 1 TiB per user; also per organization
| style="height=20px; text-align:left;padding:3px"| yes, 40 TiB per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Inode Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes, 10 million per user
| style="height=20px; text-align:left;padding:3px"| yes, 30 million per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Backup
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes
| style="height=20px; text-align:left;padding:3px"| no
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Read perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 6 GB/s, depending on type of local SSD / job queue: 520 MB/s @ single / multiple; 800 MB/s @ multiple_e; 6600 MB/s @ fat; 6500 MB/s @ gpu_4; 6500 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 400 MB/s - 500 MB/s, depending on type of local SSDs / job queue: 500 MB/s @ multiple; 400 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Write perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 4 GB/s, depending on type of local SSD / job queue: 500 MB/s @ single / multiple; 650 MB/s @ multiple_e; 2900 MB/s @ fat; 2090 MB/s @ gpu_4; 4060 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 250 MB/s - 350 MB/s, depending on type of local SSDs / job queue: 350 MB/s @ multiple; 250 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total read perf.
| style="height=20px; text-align:left;padding:3px"| n*500-6000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*400-500 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total write perf.
| style="height=20px; text-align:left;padding:3px"| n*500-4000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*250-350 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
|}
---------------------------------------------------------------------------------------------------------
global: all nodes of uc2 access the same file system;
local: each node has its own file system;
permanent: files are stored permanently;
batch job: files are removed at end of the batch job.
---------------------------------------------------------------------------------------------------------
Table 2: Properties of the file systems

== Selecting the appropriate file system ==

In general, you should separate your data and store it on the appropriate file system.
Permanently needed data like software or important results should be stored below $HOME
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME
you can usually restore it from backup. Permanent data which is not needed for months
or exceeds the capacity restrictions should be sent to bwFileStorage, to the LSDF Online Storage,
or to the archive and deleted from the file systems. Temporary data which is only needed on a single
node and which does not exceed the disk space shown in the table above should be stored
below $TMP. Temporary data which is only needed during job runs should be stored on a
parallel on-demand file system. Temporary data which can be recomputed or which is the
result of one job and input for another job should be stored in workspaces. The lifetime
of data in workspaces is limited and depends on the lifetime of the workspace which can be
several months.

For further details please check the chapters below.

== $HOME ==

The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.
You have access to your home directory from all nodes of uc2. A regular backup of these directories
to tape archive is done automatically. The directory $HOME is used to hold those files that are
permanently used like source codes, configuration files, executable programs etc.

On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) $HOME
</pre>
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:
<pre>
lfs quota -ph $(grep $(echo $HOME | sed -e "s|/[^/]*/[^/]*$||") /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME
</pre>

== Workspaces ==

On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel.

Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket or in an email to the hotline.
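
As a quick check of the numbers above, the maximum lifetime follows from the initial period plus the renewals. The following is only an arithmetic sketch; the function name is made up for illustration and is not part of the workspace tools:

```shell
# Values taken from the text above: 60 days per period, up to 3 renewals.
period_days=60
max_renewals=3

# Total lifetime = initial period plus one further period per renewal.
max_lifetime_days() {
    echo $(( period_days * (1 + max_renewals) ))
}

max_lifetime_days   # prints 240
```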

Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.

On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) /pfs/work7
</pre>

=== Reminder for workspace deletion ===

Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:

$ ws_send_ical.sh <workspace> <email>

== Improving Performance on $HOME and workspaces ==

The following recommendations might help to improve throughput and metadata
performance on Lustre filesystems.

=== '''Improving Throughput Performance''' ===

Depending on your application some adaptations might be necessary if you want to reach
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.

When you are designing your application you should consider that the performance of
parallel filesystems is generally better if data is transferred in large blocks and stored in
few large files. In more detail, to increase the throughput of a parallel application
the following aspects should be considered:

* collect large chunks of data and write them sequentially at once,

* to exploit complete filesystem bandwidth use several clients,

* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),

* if files are small enough for the SSDs and are only used by one process store them on $TMP.
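
The stripe-boundary recommendation can be illustrated with a bit of shell arithmetic. This is a minimal sketch that assumes the default stripe size of 1 MB means 1 MiB (1048576 bytes); the helper names align_up and task_offset are hypothetical:

```shell
# Assumption: the default stripe size of 1 MB is 1 MiB = 1048576 bytes.
stripe_size=$((1024 * 1024))

# Round a byte count up to the next multiple of the stripe size.
align_up() {
    echo $(( ($1 + stripe_size - 1) / stripe_size * stripe_size ))
}

# Start offset of a task's region in a shared file when every task
# gets a region of the same, stripe-aligned size.
# usage: task_offset <task rank> <bytes per task>
task_offset() {
    echo $(( $1 * $(align_up "$2") ))
}

task_offset 3 1500000   # prints 6291456 (3 regions of 2 MiB each)
```

With offsets chosen this way no two tasks write into the same stripe, which is one way to avoid competitive access to a shared file.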

With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted automatically as the file size grows. In normal cases users therefore no longer need to adapt file striping parameters, neither for very large files nor to reach better performance.

If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command
<pre>
$ lfs setstripe -c-1 $HOME/my_output_dir
</pre>
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this
directory is not changed. If you want to change the stripe count of existing files, change
the stripe count of the parent directory, copy the files to new files, remove the old files
and move the new files back to the old name. In order to check the stripe setting of
the file my_file use
<pre>
$ lfs getstripe my_file
</pre>
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the
backup, i.e. if directories have to be recreated this information is lost and the default stripe
count will be used. Therefore, you should annotate for which directories you made changes
to the striping parameters so that you can repeat these changes if required.
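
The copy-and-rename procedure described above could be scripted roughly as follows. This is a sketch, not an official tool: restripe_file and the temporary file name are hypothetical, and it assumes the stripe count of the parent directory has already been changed with lfs setstripe:

```shell
# Rewrite a file so that the new copy is created with the current
# striping settings of its parent directory.
# usage: restripe_file <file>
restripe_file() {
    file=$1
    tmp="$file.restripe.$$"               # hypothetical temporary name in the same directory
    cp -p -- "$file" "$tmp" || return 1   # the copy is a newly created file
    mv -f -- "$tmp" "$file"               # replaces the old file under the old name
}

# Possible use after "lfs setstripe -c-1 $HOME/my_output_dir" (not run here):
#   for f in "$HOME"/my_output_dir/*; do
#       [ -f "$f" ] && restripe_file "$f"
#   done
```

Note that the file temporarily exists twice, so enough free quota for one additional copy is needed.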

=== '''Improving Metadata Performance''' ===

Metadata performance on parallel file systems is usually not as good as with local
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,
you should avoid metadata operations whenever possible. For example, it is much better
to have a few large files than lots of small files. In more detail, to increase the metadata
performance of a parallel application the following aspects should be considered:

* avoid creating many small files,

* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,

* if many small files are only used within a batch job and accessed by one process store them on $TMP,

* change the default colorization setting of the command ls (see below).

On modern Linux systems, the GNU ls command often uses colorization by default to
visually highlight the file type; this is especially true if the command is run within a terminal
session. This is because the default shell profile initializations usually contain an alias
directive similar to the following for the ls command:
<pre>
$ alias ls="ls --color=tty"
</pre>
However, running the ls command in this way for files on a Lustre file system requires
a stat() call to be used to determine the file type. This can result in a performance
overhead, because the stat() call always needs to determine the size of a file, and that
in turn means that the client node must query the object size of all the backing objects
that make up a file. As a result of the default colorization setting, running a simple
ls command on a Lustre file system often takes as much time as running the ls command
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option
that requires information from a stat() call, is used). To avoid this performance overhead
when using ls commands, add an alias directive similar to the following
to your shell startup script:
<pre>
$ alias ls="ls --color=never"
</pre>

== $TMP ==

While all tasks of a parallel application access the same $HOME and workspace directory, the
$TMP directory is local to each node on bwUniCluster 2.0. Different tasks of a parallel
application use different directories when they do not run on the same node. This directory should
be used for temporary files that are accessed by single tasks. All nodes have fast local SSD
storage devices which are used to store data below $TMP.
In addition, this directory should be used for the installation
of software packages. This means that the software package to be installed should be
unpacked, compiled and linked in a subdirectory of $TMP. The real installation of the
package (e.g. make install) should be made in(to) the Lustre filesystem.

Each time a batch
job is started, a subdirectory is created on each node and assigned to the job. $TMP is newly
set; the name of the subdirectory contains the Job-id and the starting time so that the
subdirectory name is unique for each job. This unique name is then assigned to the
environment variable $TMP within the job. At the end of the job the subdirectory is removed.
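
A typical pattern following from this is to stage input data to $TMP at the start of a job and to copy results back before the job ends. The sketch below uses a hypothetical helper run_in_tmp; the fallback to /tmp outside a batch job and the name of the working directory are assumptions as well:

```shell
# Copy an input file to node-local storage, run a command there and
# copy everything back to a result directory on a global file system.
# usage: run_in_tmp <input file> <result dir> <command> [args...]
run_in_tmp() {
    input=$1
    result_dir=$2
    shift 2
    workdir="${TMP:-/tmp}/run.$$"          # assumption: fall back to /tmp if $TMP is unset
    mkdir -p "$workdir" "$result_dir" || return 1
    cp -- "$input" "$workdir/" || return 1
    ( cd "$workdir" && "$@" ) || return 1  # the command works on the local copy
    cp -r -- "$workdir"/. "$result_dir"/   # save results before the job ends
    rm -rf -- "$workdir"
}

# Possible use inside a job script (paths are placeholders, not run here):
#   run_in_tmp "$HOME/input.dat" "$HOME/results" ./my_program input.dat
```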

== LSDF Online Storage==

In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.
Furthermore, it can be used on the compute nodes during the job runtime with the constraint flag "LSDF" ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].
<pre>
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=LSDF
</pre>
 
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
Please request storage projects in the LSDF Online Storage separately:
[https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request].

==BeeOND (BeeGFS On-Demand)==

Users of the HPC cluster have the possibility to request a private BeeOND (BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.

'''IMPORTANT: All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within job), e.g., $HOME or any workspace.'''

BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. This feature is currently in beta. If you encounter any problems or have questions, please contact fh-hotline@lists.kit.edu.

For detailed usage see [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]].

==Backup and Archiving==

There are regular backups of all data of the home directories, whereas ACLs and extended attributes are
not backed up.

Please contact the hotline if you need data restored from backup.

[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]

BwUniCluster2.0/Hardware and Architecture

2021-03-17T15:09:11Z

S Raffeiner: /* Selecting the appropriate file system */

= Architecture of bwUniCluster 2.0 =

The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100). All nodes are connected by a fast InfiniBand interconnect. In addition the file system
Lustre, that is connected by coupling the InfiniBand of the file server with the InfiniBand
switch of the compute cluster, is added to bwUniCluster 2.0 to provide a fast and scalable
parallel file system.

The operating system on each node is Red Hat Enterprise Linux (RHEL) 7.7. A number of additional software packages such as SLURM have been installed on top. Some of these components are of special interest to end users and are briefly
discussed in this document. Others which are of greater importance to system
administrators will not be covered by this document.

The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user's point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.

'''Login Nodes'''

The login nodes are the only nodes that are directly accessible by end users. These nodes
are used for interactive login, file management, program development and interactive pre-
and postprocessing. Two nodes are dedicated to this service but they are all accessible via
one address and a DNS round-robin alias distributes the login sessions to the
different login nodes.

'''Compute Nodes'''

The majority of nodes are compute nodes which are managed by a batch system. Users
submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).

'''File Server Nodes'''

The hardware of the parallel file system Lustre incorporates some file server nodes; the file
system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter "File Systems").

'''Administrative Server Nodes'''

Some other nodes are delivering additional services like resource management, external
network connection, administration etc. These nodes can be accessed directly by system administrators only.

= Components of bwUniCluster =

{| class="wikitable"
|-
! style="width:9%"|
! style="width:13%"| Compute nodes "Thin"
! style="width:13%"| Compute nodes "HPC"
! style="width:13%"| Compute nodes "HPC Broadwell"
! style="width:13%"| Compute nodes "Fat"
! style="width:13%"| GPU x4
! style="width:13%"| GPU x8
! style="width:13%"| Login
|-
!scope="column"| Number of nodes
| 100 + 60
| 360
| 352
| 6
| 14
| 10
| 4 + 2 (Broadwell)
|-
!scope="column"| Processors
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon E5-2660 v4
| Intel Xeon Gold 6230
| Intel Xeon Gold 6230
| Intel Xeon Gold 6248
|
|-
!scope="column"| Number of sockets
| 2
| 2
| 2
| 4
| 2
| 2
| 2
|-
!scope="column"| Processor frequency (GHz)
| 2.1 GHz
| 2.1 GHz
| 2.0 GHz
| 2.1 GHz
| 2.1 GHz
| 2.1 GHz
|
|-
!scope="column"| Total number of cores
| 40
| 40
| 28
| 80
| 40
| 40
| 40 / 20 (Broadwell)
|-
!scope="column"| Main memory
| 96 GB / 192 GB
| 96 GB
| 128 GB
| 3 TB
| 384 GB
| 768 GB
| 384 GB / 128 GB (Broadwell)
|-
!scope="column"| Local SSD
| 960 GB SATA
| 960 GB SATA
| 480 GB SATA
| 4.8 TB NVMe
| 3.2 TB NVMe
| 6.4 TB NVMe
|
|-
!scope="column"| Accelerators
| -
| -
| -
| -
| 4x NVIDIA Tesla V100
| 8x NVIDIA Tesla V100
|
|-
!scope="column"| Interconnect
| IB HDR100 (blocking)
| IB HDR100
| IB FDR
| IB HDR
| IB HDR
| IB HDR
| IB HDR100 (blocking)
|}
Table 1: Properties of the nodes

= File Systems =

Details about changes on the file systems between bwUniCluster 1 and bwUniCluster 2.0 are described in the [[BwUniCluster_2.0_File_System_Migration_Guide|File system migration guide]]. Note that $WORK is deprecated.

On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. An initial home directory on a Lustre file system is created during the first login, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with temporary lifetime.

Within a batch job further file systems are available:
* The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.
* On request a parallel on-demand (BeeOND) file system is created which uses the SSDs of the nodes which were allocated to the batch job.
* On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.

Some of the characteristics of the file systems are shown in Table 2.

{| style="width:100%; vertical-align:top; background:#f5fffa;border:1px solid #000000;padding:1px"
|- style="width:20%;height=20px; text-align:left;padding:3px"
! style="background-color:#AAA;padding:3px"| Property
! style="background-color:#AAA;padding:3px"| $TMP
! style="background-color:#AAA;padding:3px"| BeeOND
! style="background-color:#AAA;padding:3px"| $HOME
! style="background-color:#AAA;padding:3px"| Workspace
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Visibility
| style="height=20px; text-align:left;padding:3px"| local node
| style="height=20px; text-align:left;padding:3px"| nodes of batch job
| style="height=20px; text-align:left;padding:3px"| global
| style="height=20px; text-align:left;padding:3px"| global
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Lifetime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| batch job runtime
| style="height=20px; text-align:left;padding:3px"| permanent
| style="height=20px; text-align:left;padding:3px"| max. 240 days
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Disk space
| style="height=20px; text-align:left;padding:3px"| 960 GB - 6.4 TB (for details see Table 1)
| style="height=20px; text-align:left;padding:3px"| n*250 GB
| style="height=20px; text-align:left;padding:3px"| 1.2 PiB
| style="height=20px; text-align:left;padding:3px"| 4.1 PiB
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Capacity Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes, 1 TiB per user; also per organization
| style="height=20px; text-align:left;padding:3px"| yes, 40 TiB per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Inode Quotas
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes 10 million per user
| style="height=20px; text-align:left;padding:3px"| yes 30 million per user
|-
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Backup
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| no
| style="height=20px; text-align:left;padding:3px"| yes
| style="height=20px; text-align:left;padding:3px"| no
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Read perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 6 GB/s, depending on type of local SSD / job queue: 520 MB/s @ single / multiple, 800 MB/s @ multiple_e, 6600 MB/s @ fat, 6500 MB/s @ gpu_4, 6500 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 400 MB/s - 500 MB/s, depending on type of local SSDs / job queue: 500 MB/s @ multiple, 400 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Write perf./node
| style="height=20px; text-align:left;padding:3px"| 500 MB/s - 4 GB/s, depending on type of local SSD / job queue: 500 MB/s @ single / multiple, 650 MB/s @ multiple_e, 2900 MB/s @ fat, 2090 MB/s @ gpu_4, 4060 MB/s @ gpu_8
| style="height=20px; text-align:left;padding:3px"| 250 MB/s - 350 MB/s, depending on type of local SSDs / job queue: 350 MB/s @ multiple, 250 MB/s @ multiple_e
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
| style="height=20px; text-align:left;padding:3px"| 1 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total read perf.
| style="height=20px; text-align:left;padding:3px"| n*500-6000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*400-500 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
|- style="vertical-align:top;"
| style="background-color:#d3ddd8;height=20px; text-align:left;padding:3px"| Total write perf.
| style="height=20px; text-align:left;padding:3px"| n*500-4000 MB/s
| style="height=20px; text-align:left;padding:3px"| n*250-350 MB/s
| style="height=20px; text-align:left;padding:3px"| 18 GB/s
| style="height=20px; text-align:left;padding:3px"| 54 GB/s
|}
---------------------------------------------------------------------------------------------------------
global: all nodes of uc2 access the same file system;
local: each node has its own file system;
permanent: files are stored permanently;
batch job: files are removed at end of the batch job.
---------------------------------------------------------------------------------------------------------
Table 2: Properties of the file systems

== Selecting the appropriate file system ==

In general, you should separate your data and store it on the appropriate file system.
Permanently needed data like software or important results should be stored below $HOME
but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME
you can usually restore it from backup. Permanent data which is not needed for months
or exceeds the capacity restrictions should be sent to bwFileStorage, to the LSDF Online Storage,
or to the archive and deleted from the file systems. Temporary data which is only needed on a single
node and which does not exceed the disk space shown in the table above should be stored
below $TMP. Temporary data which is only needed during job runs should be stored on a
parallel on-demand file system. Temporary data which can be recomputed or which is the
result of one job and input for another job should be stored in workspaces. The lifetime
of data in workspaces is limited and depends on the lifetime of the workspace which can be
several months.

For further details please check the chapters below.

== $HOME ==

The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre.
You have access to your home directory from all nodes of uc2. A regular backup of these directories
to tape archive is done automatically. The directory $HOME is used to hold those files that are
permanently used like source codes, configuration files, executable programs etc.

On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) $HOME
</pre>
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:
<pre>
lfs quota -ph $(grep $(echo $HOME | sed -e "s|/[^/]*/[^/]*$||") /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME
</pre>
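The one-liner above is fairly dense. The following sketch splits it into its two steps using made-up sample data; the example home path, the sample table content and the helper names org_dir and project_id are assumptions for illustration only, and the real format of /pfs/data5/project_ids.txt may differ:

```shell
# Step 1: strip the last two path components from a home directory path,
# which is what the sed expression in the command above does.
org_dir() {
    printf '%s\n' "$1" | sed -e 's|/[^/]*/[^/]*$||'
}

# Step 2: find the line containing that directory in an ID table and
# print its first space-separated field (used as the project ID).
# usage: project_id <directory> <table file>
project_id() {
    grep -- "$1" "$2" | cut -f 1 -d ' '
}

# Example with a hypothetical home directory:
org_dir /home/kit/scc/ab1234   # prints /home/kit
```

The resulting ID is what the original command passes to lfs quota -ph.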

== Workspaces ==

On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output
data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large
files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel.

Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket or in an email to the hotline.

Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.

On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user.
You can check your current usage and limits with the command
<pre>
$ lfs quota -uh $(whoami) /pfs/work7
</pre>

=== Reminder for workspace deletion ===

Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:

$ ws_send_ical.sh <workspace> <email>

== Improving Performance on $HOME and workspaces ==

The following recommendations might help to improve throughput and metadata
performance on Lustre filesystems.

=== '''Improving Throughput Performance''' ===

Depending on your application some adaptations might be necessary if you want to reach
the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.

When you are designing your application you should consider that the performance of
parallel filesystems is generally better if data is transferred in large blocks and stored in
few large files. In more detail, to increase the throughput of a parallel application
the following aspects should be considered:

* collect large chunks of data and write them sequentially at once,

* to exploit complete filesystem bandwidth use several clients,

* avoid competitive file access by different tasks or use blocks with boundaries at stripe size (default is 1MB),

* if files are small enough for the SSDs and are only used by one process store them on $TMP.

With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted automatically as the file size grows. In normal cases users therefore no longer need to adapt file striping parameters, neither for very large files nor to reach better performance.

If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command
<pre>
$ lfs setstripe -c-1 $HOME/my_output_dir
</pre>
to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this
directory is not changed. If you want to change the stripe count of existing files, change
the stripe count of the parent directory, copy the files to new files, remove the old files
and move the new files back to the old name. In order to check the stripe setting of
the file my_file use
<pre>
$ lfs getstripe my_file
</pre>
Also note that changes on the striping parameters (e.g. stripe count) are not saved in the
backup, i.e. if directories have to be recreated this information is lost and the default stripe
count will be used. Therefore, you should annotate for which directories you made changes
to the striping parameters so that you can repeat these changes if required.

=== '''Improving Metadata Performance''' ===

Metadata performance on parallel file systems is usually not as good as with local
filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore,
you should avoid metadata operations whenever possible. For example, it is much better
to have a few large files than lots of small files. In more detail, to increase the metadata
performance of a parallel application the following aspects should be considered:

* avoid creating many small files,

* avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,

* if many small files are only used within a batch job and accessed by one process store them on $TMP,

* change the default colorization setting of the command ls (see below).

On modern Linux systems, the GNU ls command often uses colorization by default to
visually highlight the file type; this is especially true if the command is run within a terminal
session. This is because the default shell profile initializations usually contain an alias
directive similar to the following for the ls command:
<pre>
$ alias ls="ls --color=tty"
</pre>
However, running the ls command in this way for files on a Lustre file system requires
a stat() call to be used to determine the file type. This can result in a performance
overhead, because the stat() call always needs to determine the size of a file, and that
in turn means that the client node must query the object size of all the backing objects
that make up a file. As a result of the default colorization setting, running a simple
ls command on a Lustre file system often takes as much time as running the ls command
with the -l option (the same is true if the -F, -p, or the --classify option, or any other option
that requires information from a stat() call, is used). To avoid this performance overhead
when using ls commands, add an alias directive similar to the following
to your shell startup script:
<pre>
$ alias ls="ls --color=never"
</pre>
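One possible way to follow the recommendation of avoiding many small files is to bundle them into a single archive before storing them on Lustre. The sketch below uses tar for this; the helper names pack_dir and unpack_to are hypothetical:

```shell
# Pack the contents of a directory of small files into one archive file.
# usage: pack_dir <source dir> <archive file>
pack_dir() {
    tar -cf "$2" -C "$1" .
}

# Unpack the archive again, e.g. into $TMP at the start of a later job.
# usage: unpack_to <archive file> <destination dir>
unpack_to() {
    mkdir -p "$2" && tar -xf "$1" -C "$2"
}

# Possible use (paths are placeholders, not run here):
#   pack_dir "$TMP/small_results" "$HOME/small_results.tar"
#   unpack_to "$HOME/small_results.tar" "$TMP/small_results"
```

This way the global file system only sees one large file, while the many small files live on the node-local SSD.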

== $TMP ==

While all tasks of a parallel application access the same $HOME and workspace directory, the
$TMP directory is local to each node on bwUniCluster 2.0. Different tasks of a parallel
application use different directories when they do not run on the same node. This directory should
be used for temporary files that are accessed by single tasks. All nodes have fast local SSD
storage devices which are used to store data below $TMP.
In addition, this directory should be used for the installation
of software packages. This means that the software package to be installed should be
unpacked, compiled and linked in a subdirectory of $TMP. The real installation of the
package (e.g. make install) should be made in(to) the Lustre filesystem.

Each time a batch
job is started, a subdirectory is created on each node and assigned to the job. $TMP is newly
set; the name of the subdirectory contains the Job-id and the starting time so that the
subdirectory name is unique for each job. This unique name is then assigned to the
environment variable $TMP within the job. At the end of the job the subdirectory is removed.

== LSDF Online Storage==

In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes.
Furthermore, it can be used on the compute nodes during the job runtime with the constraint flag "LSDF" ([[bwUniCluster_2.0_Slurm_common_Features|Slurm common features]]).
There is also an example of LSDF batch usage: [[bwUniCluster_2.0_Slurm_common_Features#LSDF_Online_Storage|Slurm LSDF example]].
<pre>
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=LSDF
</pre>
 
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
Please request storage projects in the LSDF Online Storage separately:
[https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request].
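A job that relies on the LSDF mount can verify early that the variables listed above are actually set. The following is a sketch only; the function name <code>check_lsdf</code> is hypothetical, and it checks nothing beyond the three variable names mentioned above.

```shell
#!/bin/bash
#SBATCH --constraint=LSDF

# Return non-zero if any of the LSDF variables named in this section is empty or unset.
check_lsdf() {
    local var missing=0
    for var in LSDF LSDFPROJECTS LSDFHOME; do
        # ${!var} is bash indirect expansion: the value of the variable named by $var.
        if [ -z "${!var:-}" ]; then
            echo "LSDF check: \$$var is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Typical use at the top of a job script:
# check_lsdf || exit 1
```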

== Access to other HPC-Filesystems ==

===$WORK of SCC HPC-Clusters===

Users of ForHLR II can transfer data from the $WORK file system to bwUniCluster with the tool "rdata".

===$PROJECT of the ForHLR I===

Users of ForHLR II can transfer data from the $PROJECT file system to bwUniCluster with the tool "rdata".

===LSDF online storage===

Users of the '''LSDF online storage''' can also transfer data to bwUniCluster with the tool '''rdata'''.
For this purpose the environment variables $LSDF, $LSDFPROJECTS and $LSDFHOME are set.

==BeeOND (BeeGFS On-Demand)==

Users of the HPC cluster can request a private BeeOND (BeeGFS On-Demand) parallel file system for each job. The file system is created during job startup and purged after the job ends.

'''IMPORTANT: All data on the private file system is deleted after your job. Make sure you copy your data back to a global file system, e.g. $HOME or a workspace, from within the job.'''

BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out. This feature is currently in beta. If you encounter any problems or have questions, please contact fh-hotline@lists.kit.edu.

For detailed usage see [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]].
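Copying results back before the job ends could look like the sketch below. The function name <code>copy_back</code>, the variable <code>ONDEMAND_DIR</code> and both directory arguments are hypothetical; in particular, the mount point of the on-demand file system is not stated here and has to be taken from the linked Slurm page.

```shell
#!/bin/bash
# Copy the contents of a directory on the private on-demand file system
# to a directory on a global file system. Both paths are supplied by the caller.
copy_back() {
    local src="$1" dest="$2"
    [ -d "$src" ] || { echo "copy_back: no such directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    if command -v rsync >/dev/null 2>&1; then
        rsync -a "$src"/ "$dest"/     # -a preserves permissions and timestamps
    else
        cp -a "$src"/. "$dest"/       # fallback when rsync is not installed
    fi
}

# Example call at the end of a job script (hypothetical paths):
# copy_back "$ONDEMAND_DIR/output" "$HOME/results/$SLURM_JOB_ID"
```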

==Backup and Archiving==

There are regular backups of all data in the home directories. ACLs and extended attributes, however, are
not backed up.

Please contact the hotline if you need data restored from a backup.

[[Category:bwUniCluster 2.0|File System]][[Category:Hardware_and_Architecture|bwUniCluster 2.0]]

BwUniCluster2.0/Software Modules

2021-03-05T10:52:05Z

S Raffeiner: /* Finding software Modules */

<div id="top"></div>
 
= Introduction =
'''Environment Modules''', or '''Modules''' for short, are the means by which most of the installed scientific software is provided on bwUniCluster 2.0.
 
The use of different compilers, libraries and software packages requires users to set up a specific session environment suited for the program they want to run. bwUniCluster 2.0 provides users with the possibility to load and unload complete environments for compilers, libraries and software packages by a single command.
 
 

= Description =
The Environment ''Modules'' package enables dynamic modification of your environment by the
use of so-called ''modulefiles''. A ''modulefile'' contains the information needed to configure the shell
for a software package. Typically, a modulefile contains instructions that alter or set shell
environment variables, such as PATH and MANPATH, to enable access to various installed
software.
 
One of the key features of using the Environment ''Modules'' software is to allow multiple versions of the same software to be used in your environment in a controlled manner.
For example, two different versions of the Intel C compiler can be installed on the system at the same time - the version used is based upon which Intel C compiler modulefile is loaded.
 
The software stack of bwUniCluster 2.0 provides a number of modulefiles. You can also
create your own modulefiles. ''Modulefiles'' may be shared by many users on a system, and
users may have their own collection of modulefiles to supplement or replace the shared
modulefiles.
 
A modulefile does not provide configuration of your environment until it is explicitly loaded,
i.e., the specific modulefile for a software product or application must be loaded in your environment before the configuration information in the modulefile is effective.
 
To see which modules are currently loaded, execute
''''module list''''.
 
<pre>
$ module list
Currently Loaded Modules:
1) compiler/intel/19.1 2) mpi/impi/2019 3) numlib/mkl/2019
</pre>
 

= Usage =
bwUniCluster 2.0 uses Lmod as its module system (see "Lmod: A New Environment Module System", http://lmod.readthedocs.org/en/latest/).
== Documentation ==
Execute ''''module help'''' or ''''man module'''' for help on how to use ''Modules'' software.
<pre>
$ module help
Usage: module [options] sub-command [args ...]

Options:
-h -? -H --help This help message
-s availStyle --style=availStyle Site controlled avail style: system (default: system)
--regression_testing Lmod regression testing
-D Program tracing written to stderr
--debug=dbglvl Program tracing written to stderr (where dbglvl is a number 1,2,3)
--pin_versions=pinVersions When doing a restore use specified version, do not follow defaults
-d --default List default modules only when used with avail
-q --quiet Do not print out warnings
--expert Expert mode
-t --terse Write out in machine readable format for commands: list, avail, spider, savelist
--initial_load loading Lmod for first time in a user shell
--latest Load latest (ignore default)
--ignore_cache Treat the cache file(s) as out-of-date
--novice Turn off expert and quiet flag
--raw Print modulefile in raw output when used with show
-w twidth --width=twidth Use this as max term width
-v --version Print version info and quit
-r --regexp use regular expression match
--gitversion Dump git version in a machine readable way and quit
--dumpversion Dump version in a machine readable way and quit
--check_syntax --checkSyntax Checking module command syntax: do not load
--config Report Lmod Configuration
--config_json Report Lmod Configuration in json format
--mt Report Module Table State
--timer report run times
--force force removal of a sticky module or save an empty collection
--redirect Send the output of list, avail, spider to stdout (not stderr)
--no_redirect Force output of list, avail and spider to stderr
--show_hidden Avail and spider will report hidden modules
--spider_timeout=timeout a timeout for spider
-T --trace

module [options] sub-command [args ...]

Help sub-commands:
------------------
help prints this message
help module [...] print help message from module(s)

Loading/Unloading sub-commands:
-------------------------------
load | add module [...] load module(s)
try-load | try-add module [...] Add module(s), do not complain if not found
del | unload module [...] Remove module(s), do not complain if not found
swap | sw | switch m1 m2 unload m1 and load m2
purge unload all modules
refresh reload aliases from current list of modules.
update reload all currently loaded modules.

Listing / Searching sub-commands:
---------------------------------
list List loaded modules
list s1 s2 ... List loaded modules that match the pattern
avail | av List available modules
avail | av string List available modules that contain "string".
spider List all possible modules
spider module List all possible version of that module file
spider string List all module that contain the "string".
spider name/version Detailed information about that version of the module.
whatis module Print whatis information about module
keyword | key string Search all name and whatis that contain "string".

Searching with Lmod:
--------------------
All searching (spider, list, avail, keyword) support regular expressions:

-r spider '^p' Finds all the modules that start with `p' or `P'
-r spider mpi Finds all modules that have "mpi" in their name.
-r spider 'mpi$ Finds all modules that end with "mpi" in their name.

Handling a collection of modules:
--------------------------------
save | s Save the current list of modules to a user defined "default" collection.
save | s name Save the current list of modules to "name" collection.
reset The same as "restore system"
restore | r Restore modules from the user's "default" or system default.
restore | r name Restore modules from "name" collection.
restore system Restore module state to system defaults.
savelist List of saved collections.
describe | mcc name Describe the contents of a module collection.
disable name Disable (i.e. remove) a collection.

Deprecated commands:
--------------------
getdefault [name] load name collection of modules or user's "default" if no name given.
===> Use "restore" instead <====
setdefault [name] Save current list of modules to name if given, otherwise save as the default list for you the user.
===> Use "save" instead. <====

Miscellaneous sub-commands:
---------------------------
is-loaded modulefile return a true status if module is loaded
is-avail modulefile return a true status if module can be loaded
show modulefile show the commands in the module file.
use [-a] path Prepend or Append path to MODULEPATH.
unuse path remove path from MODULEPATH.
tablelist output list of active modules as a lua table.

Important Environment Variables:
--------------------------------
LMOD_COLORIZE If defined to be "YES" then Lmod prints properties and warning in color.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Lmod Web Sites

Documentation: http://lmod.readthedocs.org
Github: https://github.com/TACC/Lmod
Sourceforge: https://lmod.sf.net
TACC Homepage: https://www.tacc.utexas.edu/research-development/tacc-projects/lmod

To report a bug please read http://lmod.readthedocs.io/en/latest/075_bug_reporting.html
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Modules based on Lua: Version 8.2 (8.2-1-g9c98036c) 2019-10-30 11:17 -05:00
by Robert McLay mclay@tacc.utexas.edu

</pre>
For help on a particular ''Module'', e.g. the default version of the Intel compiler, execute
''''module help compiler/intel''''.
<pre>
$ module help compiler/intel
---------------------- Module Specific Help for "compiler/intel/19.1" ----------------------
Intel(R) Compilers 19.1 for Linux*
For details see: https://software.intel.com/en-us/intel-compilers
In case of problems, please contact: Hartmut Häfner <hartmut.haefner@kit.edu>
SCC support end: 2022-12-31
</pre>
 
=== Online Documentation ===
[http://lmod.readthedocs.org Lmod: A New Environment Module System]
 
 

== Display all available Modules ==
Available ''Modules'' are modulefiles that can be loaded by the user. A ''Module'' must be loaded before it changes your environment, as described above. You can display all available ''Modules'' on the system by executing:
<pre>
$ module avail
</pre>
The short form of the command is:
<pre>
$ module av
</pre>
Available ''Modules'' can also be displayed in other modes, for example:
* one ''Module'' per line
<pre>
$ module -t avail
</pre>
Some modules may not be available right now, because their requirements are not met. To get a complete list of all possible modules use the [[#Display all possible Modules|module spider command]].
 
 

== Module categories, versions and defaults ==
The ForHLR clusters traditionally provide a large variety of software and software versions. ''Modules'' are therefore organized in category folders, which contain one subfolder per software package, which in turn contains one modulefile per version. A ''Module'' must be addressed as follows:
category/softwarename/version
For instance, all versions of the Intel compiler belong to the category of compilers, so the corresponding modulefiles are placed under the category ''compiler'' and the software name ''intel''.
 
If several versions of a software package are installed, one version is always defined as the '''default'''
version. The ''Module'' of the default version can be addressed by simply omitting the version number:
category/softwarename
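The naming scheme can be illustrated with a small helper that splits a module name into its parts. This is a sketch for illustration only; <code>split_module_name</code> is a hypothetical function and not part of Lmod.

```shell
#!/bin/bash
# Split "category/softwarename[/version]" into its components.
# A missing version is reported as "default", mirroring the rule above.
split_module_name() {
    local name="$1"
    local category="${name%%/*}"     # text before the first slash
    local rest="${name#*/}"          # text after the first slash
    local software="${rest%%/*}"     # software name without version
    local version=""
    if [[ "$rest" == */* ]]; then
        version="${rest#*/}"         # everything after the software name
    fi
    printf 'category=%s software=%s version=%s\n' \
        "$category" "$software" "${version:-default}"
}

split_module_name compiler/intel/19.1
split_module_name compiler/intel
```

The first call prints <code>category=compiler software=intel version=19.1</code>, the second one <code>category=compiler software=intel version=default</code>.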
 

== Finding software Modules ==
Currently all bwUniCluster 2.0 software packages are assigned to the following ''Module'' categories:

* bio
* cae
* chem
* compiler
* devel
* lib
* math
* mpi
* numlib
* phys
* system
* toolkit
* vis

You can selectively list the software in one of these categories, e.g. for the category "compiler":
<pre>
$ module avail compiler/
</pre>
The search string is matched against the beginning of the name, so the following command lists all software in categories starting with a "c"
<pre>
$ module avail c
</pre>
while this would find nothing
<pre>
$ module avail hem
</pre>
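To get a quick overview of several categories at once, the terse listing can be combined with a loop. The sketch below assumes that Lmod writes its listings to stderr (the <code>module help</code> output mentions this for list, avail and spider), hence the <code>2>&1</code>. The helper name <code>count_per_category</code> is hypothetical, and the counts are only approximate if the terse listing contains entries other than full module names.

```shell
#!/bin/bash
# Print the number of available modules for each category given as an argument.
count_per_category() {
    local category count
    for category in "$@"; do
        # Terse mode prints one entry per line; keep only lines of this category.
        count="$(module -t avail "$category/" 2>&1 | grep -c "^$category/")"
        printf '%s: %s\n' "$category" "$count"
    done
}

# Example call:
# count_per_category compiler mpi numlib
```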
 

== Loading Modules ==
You can load a ''Module'' into your environment, and thereby gain access to the software
you want to use, by executing:
<pre>
$ module load category/softwarename/version
</pre>
or
<pre>
$ module add category/softwarename/version
</pre>
Loading a ''Module'' in this manner affects ONLY your environment for the current session.
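In scripts it can be useful to load a module only when it is not loaded yet. The following sketch uses the <code>is-loaded</code> sub-command listed in the <code>module help</code> output; the function name <code>ensure_loaded</code> is hypothetical.

```shell
#!/bin/bash
# Load a module unless it is already part of the current environment.
ensure_loaded() {
    local name="$1"
    if module is-loaded "$name"; then
        return 0                 # nothing to do
    fi
    module load "$name"
}

# Example call:
# ensure_loaded compiler/intel/19.1
```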
 
 
=== Loading conflicts ===
You cannot load different versions of the same software at the same time. Loading the Intel compiler in version X while the Intel compiler in version Y is loaded automatically unloads version Y.
 
 

=== Showing the changes introduced by a Module ===
Loading a ''Module'' changes the environment of the current shell session. For instance, the $PATH variable is extended by the software's binary directory. Other variables set by a ''Module'' may change the behavior of the current shell session or of the software itself in a more drastic way.
 
 
All changes that loading a ''Module'' would make to the current shell session can be reviewed using
 
''''module show category/softwarename/version''''.
 
 
'''Example (Intel compiler)'''
<pre>
$ module show compiler/intel
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/opt/bwhpc/common/modulefiles/Core/compiler/intel/19.1.lua:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
setenv("INTEL_LICENSE_FILE","28518@scclic1.scc.kit.edu")
setenv("AR","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/xiar")
setenv("CC","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/icc")
setenv("CXX","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/icpc")
setenv("F77","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/ifort")
setenv("FC","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/ifort")
setenv("CFLAGS","-O2 -xCORE-AVX2")
setenv("CXXFLAGS","-O2 -xCORE-AVX2")
setenv("FFLAGS","-O2 -xCORE-AVX2")
setenv("FCFLAGS","-O2 -xCORE-AVX2")
setenv("INTEL_VERSION","19.1.0.166")
setenv("INTEL_HOME","/opt/intel/compilers_and_libraries_2020/linux")
setenv("INTEL_BIN_DIR","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64")
setenv("INTEL_LIB_DIR","/opt/intel/compilers_and_libraries_2020/linux/lib/intel64")
setenv("INTEL_INC_DIR","/opt/intel/compilers_and_libraries_2020/linux/include")
setenv("INTEL_MAN_DIR","/opt/intel/compilers_and_libraries_2020/linux/man/common")
setenv("INTEL_DOC_DIR","/opt/intel/compilers_and_libraries_2020/linux/documentation/en")
setenv("GDB_VERSION","19.1.0.166")
setenv("GDB_HOME","/opt/intel/debugger_2020/gdb/intel64")
setenv("GDB_BIN_DIR","/opt/intel/debugger_2020/gdb/intel64/bin")
setenv("GDB_LIB_DIR","/opt/intel/debugger_2020/libipt/intel64/lib")
setenv("GDB_INC_DIR","/opt/intel/debugger_2020/gdb/intel64/include")
setenv("GDB_INF_DIR","/opt/intel/documentation_2020/en/debugger/gdb-ia/info")
setenv("GDB_MAN_DIR","/opt/intel/documentation_2020/en/debugger/gdb-ia/man")
setenv("KMP_AFFINITY","noverbose,granularity=core,respect,warnings,compact,1")
prepend_path("PATH","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64")
prepend_path("MANPATH","/opt/intel/compilers_and_libraries_2020/linux/man/common")
prepend_path("LD_LIBRARY_PATH","/opt/intel/compilers_and_libraries_2020/linux/lib/intel64")
whatis("Sets up Intel C/C++ and Fortran compiler version 19.1 (Intel(R) Compilers 19.1 for Linux*) - supported by SCC till 2022-12-31!")
help([[Intel(R) Compilers 19.1 for Linux*
For details see: https://software.intel.com/en-us/intel-compilers
In case of problems, please contact: Hartmut Häfner <hartmut.haefner@kit.edu>
SCC support end: 2022-12-31]])
prepend_path("MODULEPATH","/software/bwhpc/common/modulefiles/Compiler/intel/19.1")
family("compiler")
</pre>
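When only a single setting is of interest, the output of <code>module show</code> can be filtered. The helper below is a sketch with a hypothetical name (<code>module_setenv_value</code>); it assumes the <code>setenv("NAME","value")</code> line format shown in the example above and that the output arrives on stderr.

```shell
#!/bin/bash
# Print the value that a module would assign to one environment variable.
# Arguments: module name, variable name.
module_setenv_value() {
    local modname="$1" var="$2"
    # Merge stderr into the pipe and keep only the matching setenv line.
    module show "$modname" 2>&1 |
        sed -n "s/^setenv(\"$var\",\"\(.*\)\")\$/\1/p"
}

# Example call:
# module_setenv_value compiler/intel CC
```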
 
 

=== Modules depending on Modules ===
Some software ''Modules'' depend on libraries that must be loaded into the user environment. In that case the
''Module'' of the software must be loaded together with the ''Modules'' of
the libraries.
 
By default such software ''Modules'' try to load the required ''Modules'' in the corresponding versions automatically.
 
 
 

== Unloading Modules ==
To unload or remove a software ''Module'' execute:
<pre>
$ module unload category/softwarename/version
</pre>
or
<pre>
$ module remove category/softwarename/version
</pre>
 

=== Unloading all loaded modules ===
==== Purge ====
Unloading a ''Module'' that has been loaded by default makes it inactive for the current session only - it will be reloaded the next time you log in.
 
In order to remove all previously loaded software modules from your environment issue the command 'module purge'.
 
Example
<pre>
$ module list
Currently Loaded Modules:
1) compiler/intel/19.1 2) mpi/impi/2019 3) numlib/mkl/2019
$
$ module purge
$ module list
No modules loaded
$
</pre>
Beware!
 
'module purge' takes effect immediately, without asking for confirmation.
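One possible pattern for job scripts is to start from a clean slate and then load exactly the modules a program needs. The following is a sketch; the helper name <code>reset_modules</code> is hypothetical.

```shell
#!/bin/bash
# Unload everything, then load the given modules in order.
# Stops at the first module that cannot be loaded.
reset_modules() {
    local name
    module purge || return 1
    for name in "$@"; do
        module load "$name" || {
            echo "reset_modules: failed to load $name" >&2
            return 1
        }
    done
}

# Example call, using module names from the listing above:
# reset_modules compiler/intel/19.1 mpi/impi/2019 numlib/mkl/2019
```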
 
 

== Display your loaded Modules ==
All ''Modules'' that are currently loaded for you can be displayed by the
command ''''module list''''. [[#Purge|See example above]].
 
Note: You only have to load further ''Modules'', if you want to use additional software
packages or to change the version of an already loaded software.
 
 

== Display all possible Modules ==
Modulefiles can be searched by the user. You can display all possible modules by executing:
<pre>
$ module spider

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The following is a list of the modules and extensions currently available:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cae/abaqus: cae/abaqus/2018, cae/abaqus/2019

cae/adina: cae/adina/9.1.2

cae/ansys: cae/ansys/19.2, cae/ansys/2019R3, cae/ansys/2020R1

cae/comsol: cae/comsol/5.4, cae/comsol/5.5

cae/cst: cae/cst/2018

cae/lsdyna: cae/lsdyna/901

cae/openfoam: cae/openfoam/v1912, cae/openfoam/2.4.x, cae/openfoam/6, cae/openfoam/7

cae/paraview: cae/paraview/5.8

cae/starccm+: cae/starccm+/14.02.010, cae/starccm+/2019.2.1

cae/starcd: cae/starcd/4.28

compiler/clang: compiler/clang/9.0

compiler/gnu: compiler/gnu/9.2

compiler/intel: compiler/intel/18.0, compiler/intel/19.0, compiler/intel/19.1

compiler/pgi: compiler/pgi/2019

devel/cmake: devel/cmake/3.16

devel/cuda: devel/cuda/9.2, devel/cuda/10.0, devel/cuda/10.2

devel/gdb: devel/gdb/9.1

devel/python: devel/python/3.7.4_gnu_9.2, devel/python/3.8.1_gnu_9.2, devel/python/3.8.1_intel_19.1

math/R: math/R/3.6.3

math/julia: math/julia/1.3.1

mpi/impi: mpi/impi/2018, mpi/impi/2019, mpi/impi/2020

mpi/openmpi: mpi/openmpi/4.0

numlib/mkl: numlib/mkl/2018, numlib/mkl/2019, numlib/mkl/2020

numlib/python_numpy: numlib/python_numpy/1.17.2_python_3.7.4_gnu_9.2

numlib/python_scipy: numlib/python_scipy/1.3.1_numpy_1.17.2_python_3.7.4_gnu_9.2

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

To learn more about a package execute:

$ module spider Foo

where "Foo" is the name of a module.

To find detailed information about a particular package you
must specify the version if there is more than one version:

$ module spider Foo/11.1

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
</pre>
''''module spider name/version'''': If you search for the full name and version of a module, the search returns detailed information about that module version.
<pre>
$ module spider devel/cmake

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
devel/cmake: devel/cmake/3.16
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This module can be loaded directly: module load devel/cmake/3.16

Help:
Home page: https://www.cmake.org
Online Documentation: https://www.cmake.org/HTML/Documentation.html
Local Documentation: /opt/bwhpc/common/devel/cmake/3.16.4/docFAQ: https://gitlab.kitware.com/cmake/community/wikis/FAQ

In case of problems, please contact 'bwunicluster-hotline (at) lists.kit.edu'
or submit a trouble ticket at http://www.support.bwhpc-c5.de.

</pre>
Moreover, the same command shows the dependencies of a module. For example, the following command shows which modules need to be loaded before the module mpi/impi/2019 can be loaded:
<pre>
$ module spider mpi/impi/2019

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
mpi/impi: mpi/impi/2019
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

You will need to load all module(s) on any one of the lines below before the "mpi/impi/2019" module is available to load.

compiler/clang/9.0
compiler/gnu/9.2
compiler/intel/18.0
compiler/intel/19.0
compiler/intel/19.1

Help:
Intel(R) MPI Library

</pre>
 

= How do Modules work? =
The default shell on the bwHPC clusters is bash, so explanations and examples will be shown for bash. In general, programs cannot modify the environment of the shell they are being run from, so how can the module command do exactly that?
 
The module command is not a program, but a bash-function.
You can view its content using:
<pre>
$ type module
</pre>
and you will get the following result:
<pre>
$ type module
module is a function
module ()
{
eval $($LMOD_CMD bash "$@");
[ $? = 0 ] && eval $(${LMOD_SETTARG_CMD:-:} -s sh)
}
</pre>
In this function, Lmod is called. Its output to stdout is then executed inside your current shell using the bash built-in ''eval'' command. As a consequence, all output that you see from the module command is transmitted via stderr (file descriptor 2) or, in some cases, even via stdin (file descriptor 0).
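The mechanism can be reproduced without Lmod. In the sketch below, <code>DEMO_TOOL_HOME</code> is a made-up variable: a child process cannot change the environment of the calling shell, but shell code that the child prints can be applied with ''eval'', which is what the module function does with the output of Lmod.

```shell
#!/bin/bash
unset DEMO_TOOL_HOME

# 1) A child process sets the variable: the change is lost when the child exits.
bash -c 'export DEMO_TOOL_HOME=/opt/demo'
echo "after child process: ${DEMO_TOOL_HOME:-unset}"   # prints "after child process: unset"

# 2) The child prints shell code instead; eval runs it in the current shell.
eval "$(bash -c 'echo "export DEMO_TOOL_HOME=/opt/demo"')"
echo "after eval: ${DEMO_TOOL_HOME:-unset}"            # prints "after eval: /opt/demo"
```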
 
 
----
[[Category:bwUniCluster_2.0|bwUniCluster 2.0]]
[[#top|Back to top]]

BwUniCluster2.0/Software Modules

2021-03-05T10:49:13Z

S Raffeiner: /* Finding software Modules */

<div id="top"></div>
 
= Introduction =
'''Environment Modules''', or '''Modules''' for short, are the means by which most of the installed scientific software is provided on bwUniCluster 2.0.
 
The use of different compilers, libraries and software packages requires users to set up a specific session environment suited for the program they want to run. bwUniCluster 2.0 provides users with the possibility to load and unload complete environments for compilers, libraries and software packages by a single command.
 
 

= Description =
The Environment ''Modules'' package enables dynamic modification of your environment by the
use of so-called ''modulefiles''. A ''modulefile'' contains the information needed to configure the shell
for a software package. Typically, a modulefile contains instructions that alter or set shell
environment variables, such as PATH and MANPATH, to enable access to various installed
software.
 
One of the key features of using the Environment ''Modules'' software is to allow multiple versions of the same software to be used in your environment in a controlled manner.
For example, two different versions of the Intel C compiler can be installed on the system at the same time - the version used is based upon which Intel C compiler modulefile is loaded.
 
The software stack of bwUniCluster 2.0 provides a number of modulefiles. You can also
create your own modulefiles. ''Modulefiles'' may be shared by many users on a system, and
users may have their own collection of modulefiles to supplement or replace the shared
modulefiles.
 
A modulefile does not provide configuration of your environment until it is explicitly loaded,
i.e., the specific modulefile for a software product or application must be loaded in your environment before the configuration information in the modulefile is effective.
 
To see which modules are currently loaded, execute
''''module list''''.
 
<pre>
$ module list
Currently Loaded Modules:
1) compiler/intel/19.1 2) mpi/impi/2019 3) numlib/mkl/2019
</pre>
 

= Usage =
bwUniCluster 2.0 uses Lmod as its module system (see "Lmod: A New Environment Module System", http://lmod.readthedocs.org/en/latest/).
== Documentation ==
Execute ''''module help'''' or ''''man module'''' for help on how to use ''Modules'' software.
<pre>
$ module help
Usage: module [options] sub-command [args ...]

Options:
-h -? -H --help This help message
-s availStyle --style=availStyle Site controlled avail style: system (default: system)
--regression_testing Lmod regression testing
-D Program tracing written to stderr
--debug=dbglvl Program tracing written to stderr (where dbglvl is a number 1,2,3)
--pin_versions=pinVersions When doing a restore use specified version, do not follow defaults
-d --default List default modules only when used with avail
-q --quiet Do not print out warnings
--expert Expert mode
-t --terse Write out in machine readable format for commands: list, avail, spider, savelist
--initial_load loading Lmod for first time in a user shell
--latest Load latest (ignore default)
--ignore_cache Treat the cache file(s) as out-of-date
--novice Turn off expert and quiet flag
--raw Print modulefile in raw output when used with show
-w twidth --width=twidth Use this as max term width
-v --version Print version info and quit
-r --regexp use regular expression match
--gitversion Dump git version in a machine readable way and quit
--dumpversion Dump version in a machine readable way and quit
--check_syntax --checkSyntax Checking module command syntax: do not load
--config Report Lmod Configuration
--config_json Report Lmod Configuration in json format
--mt Report Module Table State
--timer report run times
--force force removal of a sticky module or save an empty collection
--redirect Send the output of list, avail, spider to stdout (not stderr)
--no_redirect Force output of list, avail and spider to stderr
--show_hidden Avail and spider will report hidden modules
--spider_timeout=timeout a timeout for spider
-T --trace

module [options] sub-command [args ...]

Help sub-commands:
------------------
help prints this message
help module [...] print help message from module(s)

Loading/Unloading sub-commands:
-------------------------------
load | add module [...] load module(s)
try-load | try-add module [...] Add module(s), do not complain if not found
del | unload module [...] Remove module(s), do not complain if not found
swap | sw | switch m1 m2 unload m1 and load m2
purge unload all modules
refresh reload aliases from current list of modules.
update reload all currently loaded modules.

Listing / Searching sub-commands:
---------------------------------
list List loaded modules
list s1 s2 ... List loaded modules that match the pattern
avail | av List available modules
avail | av string List available modules that contain "string".
spider List all possible modules
spider module List all possible version of that module file
spider string List all module that contain the "string".
spider name/version Detailed information about that version of the module.
whatis module Print whatis information about module
keyword | key string Search all name and whatis that contain "string".

Searching with Lmod:
--------------------
All searching (spider, list, avail, keyword) support regular expressions:

-r spider '^p' Finds all the modules that start with `p' or `P'
-r spider mpi Finds all modules that have "mpi" in their name.
-r spider 'mpi$ Finds all modules that end with "mpi" in their name.

Handling a collection of modules:
--------------------------------
save | s Save the current list of modules to a user defined "default" collection.
save | s name Save the current list of modules to "name" collection.
reset The same as "restore system"
restore | r Restore modules from the user's "default" or system default.
restore | r name Restore modules from "name" collection.
restore system Restore module state to system defaults.
savelist List of saved collections.
describe | mcc name Describe the contents of a module collection.
disable name Disable (i.e. remove) a collection.

Deprecated commands:
--------------------
getdefault [name] load name collection of modules or user's "default" if no name given.
===> Use "restore" instead <====
setdefault [name] Save current list of modules to name if given, otherwise save as the default list for you the user.
===> Use "save" instead. <====

Miscellaneous sub-commands:
---------------------------
is-loaded modulefile return a true status if module is loaded
is-avail modulefile return a true status if module can be loaded
show modulefile show the commands in the module file.
use [-a] path Prepend or Append path to MODULEPATH.
unuse path remove path from MODULEPATH.
tablelist output list of active modules as a lua table.

Important Environment Variables:
--------------------------------
LMOD_COLORIZE If defined to be "YES" then Lmod prints properties and warning in color.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Lmod Web Sites

Documentation: http://lmod.readthedocs.org
Github: https://github.com/TACC/Lmod
Sourceforge: https://lmod.sf.net
TACC Homepage: https://www.tacc.utexas.edu/research-development/tacc-projects/lmod

To report a bug please read http://lmod.readthedocs.io/en/latest/075_bug_reporting.html
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Modules based on Lua: Version 8.2 (8.2-1-g9c98036c) 2019-10-30 11:17 -05:00
by Robert McLay mclay@tacc.utexas.edu

</pre>
For help on a particular ''Module'', e.g. the default version of the Intel compiler, execute
''''module help compiler/intel''''.
<pre>
$ module help compiler/intel
---------------------- Module Specific Help for "compiler/intel/19.1" ----------------------
Intel(R) Compilers 19.1 for Linux*
For details see: https://software.intel.com/en-us/intel-compilers
In case of problems, please contact: Hartmut Häfner <hartmut.haefner@kit.edu>
SCC support end: 2022-12-31
</pre>
 
=== Online Documentation ===
[http://lmod.readthedocs.org Lmod: A New Environment Module System]
 
 

== Display all available Modules ==
Available ''Modules'' are modulefiles that can be loaded by the user. A ''Module'' must be loaded before it changes your environment, as described above. You can display all available ''Modules'' on the system by executing:
<pre>
$ module avail
</pre>
The short form of the command is:
<pre>
$ module av
</pre>
Available ''Modules'' can also be displayed in different modes, such as
* one ''Module'' per line
<pre>
$ module -t avail
</pre>
Some modules may not be available right now, because their requirements are not met. To get a complete list of all possible modules use the [[#Display all possible Modules|module spider command]].
 
 

== Module categories, versions and defaults ==
The ForHLR clusters traditionally provide a large variety of software and software versions. Therefore ''Modules'' are organized in category folders, each of which contains one subfolder per software package, which in turn contains one modulefile per version. A ''Module'' must be addressed as follows:
category/softwarename/version
For instance, all versions of the Intel compiler belong to the category of compilers, so the corresponding modulefiles are placed in the category ''compiler'' under the software name ''intel''.
 
If multiple versions of a software package are installed, one version is always defined as the '''default'''
version. The ''Module'' of the default version can be addressed by simply omitting the version number:
category/softwarename
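
The naming scheme can be taken apart with plain bash parameter expansion. The following is only a minimal sketch: the function name <code>split_module_name</code> is hypothetical and not part of Lmod, and the word "default" is printed merely to mark that no version was given.

```shell
#!/bin/bash
# split_module_name is a hypothetical helper, not an Lmod command.
# It splits "category/softwarename[/version]" into its parts.
split_module_name() {
    local name=$1
    local category=${name%%/*}      # text before the first "/"
    local rest=${name#*/}           # text after the first "/"
    local software=${rest%%/*}      # text before the next "/"
    local version="default"         # no version given: the default version applies
    if [ "$rest" != "$software" ]; then
        version=${rest#*/}          # everything after "softwarename/"
    fi
    echo "$category $software $version"
}

split_module_name compiler/intel/19.1
split_module_name compiler/intel
```

The first call prints the three parts of a fully qualified name, the second shows the case in which the version is omitted.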
 

== Finding software Modules ==
Currently all bwUniCluster 2.0 software packages are assigned to the following ''Module'' categories:


* bio

* cae

* chem

* compiler

* devel

* lib

* math
* mpi

* numlib

* phys

* system

* toolkit

* vis
You can selectively list software in one of those categories using, e.g. for the category "compiler"
<pre>
$ module avail compiler/
</pre>
Searches match a substring starting at the beginning of the name, so this would list all software in categories starting with a "c"
<pre>
$ module avail c
</pre>
while this would find nothing
<pre>
$ module avail hem
</pre>
 

== Loading Modules ==
You can load a software ''Module'' into your environment, to enable easier access to the software
you want to use, by executing:
<pre>
$ module load category/softwarename/version
</pre>
or
<pre>
$ module add category/softwarename/version
</pre>
Loading a ''Module'' in this manner affects ONLY your environment for the current session.
 
 
=== Loading conflicts ===
You cannot load different versions of the same software at the same time! Loading version X of the Intel compiler while version Y is loaded automatically unloads version Y.
 
 

=== Showing the changes introduced by a Module ===
Loading a ''Module'' will change the environment of the current shell session. For instance the $PATH variable will be expanded by the software's binary directory. Other ''Module'' variables may even change the behavior of the current shell session or the software program(s) in a more drastic way.
 
 
All changes that loading a ''Module'' would make to the current shell session can be reviewed using
 
''''module show category/softwarename/version''''.
 
 
'''Example (Intel compiler)'''
<pre>
$ module show compiler/intel
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/opt/bwhpc/common/modulefiles/Core/compiler/intel/19.1.lua:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
setenv("INTEL_LICENSE_FILE","28518@scclic1.scc.kit.edu")
setenv("AR","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/xiar")
setenv("CC","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/icc")
setenv("CXX","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/icpc")
setenv("F77","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/ifort")
setenv("FC","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/ifort")
setenv("CFLAGS","-O2 -xCORE-AVX2")
setenv("CXXFLAGS","-O2 -xCORE-AVX2")
setenv("FFLAGS","-O2 -xCORE-AVX2")
setenv("FCFLAGS","-O2 -xCORE-AVX2")
setenv("INTEL_VERSION","19.1.0.166")
setenv("INTEL_HOME","/opt/intel/compilers_and_libraries_2020/linux")
setenv("INTEL_BIN_DIR","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64")
setenv("INTEL_LIB_DIR","/opt/intel/compilers_and_libraries_2020/linux/lib/intel64")
setenv("INTEL_INC_DIR","/opt/intel/compilers_and_libraries_2020/linux/include")
setenv("INTEL_MAN_DIR","/opt/intel/compilers_and_libraries_2020/linux/man/common")
setenv("INTEL_DOC_DIR","/opt/intel/compilers_and_libraries_2020/linux/documentation/en")
setenv("GDB_VERSION","19.1.0.166")
setenv("GDB_HOME","/opt/intel/debugger_2020/gdb/intel64")
setenv("GDB_BIN_DIR","/opt/intel/debugger_2020/gdb/intel64/bin")
setenv("GDB_LIB_DIR","/opt/intel/debugger_2020/libipt/intel64/lib")
setenv("GDB_INC_DIR","/opt/intel/debugger_2020/gdb/intel64/include")
setenv("GDB_INF_DIR","/opt/intel/documentation_2020/en/debugger/gdb-ia/info")
setenv("GDB_MAN_DIR","/opt/intel/documentation_2020/en/debugger/gdb-ia/man")
setenv("KMP_AFFINITY","noverbose,granularity=core,respect,warnings,compact,1")
prepend_path("PATH","/opt/intel/compilers_and_libraries_2020/linux/bin/intel64")
prepend_path("MANPATH","/opt/intel/compilers_and_libraries_2020/linux/man/common")
prepend_path("LD_LIBRARY_PATH","/opt/intel/compilers_and_libraries_2020/linux/lib/intel64")
whatis("Sets up Intel C/C++ and Fortran compiler version 19.1 (Intel(R) Compilers 19.1 for Linux*) - supported by SCC till 2022-12-31!")
help([[Intel(R) Compilers 19.1 for Linux*
For details see: https://software.intel.com/en-us/intel-compilers
In case of problems, please contact: Hartmut Häfner <hartmut.haefner@kit.edu>
SCC support end: 2022-12-31]])
prepend_path("MODULEPATH","/software/bwhpc/common/modulefiles/Compiler/intel/19.1")
family("compiler")
</pre>
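
To make the effect of the two most frequent modulefile functions in such a listing concrete, here is a rough bash sketch of what they amount to for the current shell. The shell functions below are illustrative stand-ins written for this page, not Lmod's implementation, and the names <code>DEMO_CC</code>, <code>DEMO_PATH</code> and <code>/opt/demo/bin</code> are made up.

```shell
#!/bin/bash
# Illustrative stand-ins only; these are not Lmod's own implementation.

# setenv NAME VALUE: set and export an environment variable.
setenv() {
    export "$1=$2"
}

# prepend_path NAME DIR: put DIR in front of a colon-separated list.
prepend_path() {
    local var=$1 dir=$2
    local current=${!var}           # indirect expansion: value of the variable named by $var
    if [ -n "$current" ]; then
        export "$var=$dir:$current"
    else
        export "$var=$dir"
    fi
}

DEMO_PATH=/usr/bin
setenv DEMO_CC /opt/demo/bin/icc
prepend_path DEMO_PATH /opt/demo/bin
echo "$DEMO_CC"
echo "$DEMO_PATH"
```

Because the new directory is placed in front of the existing entries, programs from a loaded ''Module'' take precedence over programs of the same name found later in the search path.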
 
 

=== Modules depending on Modules ===
Some program ''Modules'' depend on libraries to be loaded to the user environment. Therefore the
corresponding ''Modules'' of the software must be loaded together with the ''Modules'' of
the libraries.
 
By default such software ''Modules'' try to load required ''Modules'' and corresponding versions automatically.
 
 
 

== Unloading Modules ==
To unload or remove a software ''Module'' execute:
<pre>
$ module unload category/softwarename/version
</pre>
or
<pre>
$ module remove category/softwarename/version
</pre>
 

=== Unloading all loaded modules ===
==== Purge ====
Unloading a ''Module'' that has been loaded by default makes it inactive for the current session only - it will be reloaded the next time you log in.
 
In order to remove all previously loaded software modules from your environment issue the command 'module purge'.
 
Example
<pre>
$ module list
Currently Loaded Modules:
1) compiler/intel/19.1 2) mpi/impi/2019 3) numlib/mkl/2019
$
$ module purge
$ module list
No modules loaded
$
</pre>
Beware!
 
'module purge' takes effect immediately, without asking for confirmation.
 
 

== Display your loaded Modules ==
All ''Modules'' that are currently loaded for you can be displayed by the
command ''''module list''''. [[#Purge|See example above]].
 
Note: You only have to load further ''Modules'', if you want to use additional software
packages or to change the version of an already loaded software.
 
 

== Display all possible Modules ==
Modulefiles can be searched by the user. You can display all possible modules by executing:
<pre>
$ module spider

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The following is a list of the modules and extensions currently available:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cae/abaqus: cae/abaqus/2018, cae/abaqus/2019

cae/adina: cae/adina/9.1.2

cae/ansys: cae/ansys/19.2, cae/ansys/2019R3, cae/ansys/2020R1

cae/comsol: cae/comsol/5.4, cae/comsol/5.5

cae/cst: cae/cst/2018

cae/lsdyna: cae/lsdyna/901

cae/openfoam: cae/openfoam/v1912, cae/openfoam/2.4.x, cae/openfoam/6, cae/openfoam/7

cae/paraview: cae/paraview/5.8

cae/starccm+: cae/starccm+/14.02.010, cae/starccm+/2019.2.1

cae/starcd: cae/starcd/4.28

compiler/clang: compiler/clang/9.0

compiler/gnu: compiler/gnu/9.2

compiler/intel: compiler/intel/18.0, compiler/intel/19.0, compiler/intel/19.1

compiler/pgi: compiler/pgi/2019

devel/cmake: devel/cmake/3.16

devel/cuda: devel/cuda/9.2, devel/cuda/10.0, devel/cuda/10.2

devel/gdb: devel/gdb/9.1

devel/python: devel/python/3.7.4_gnu_9.2, devel/python/3.8.1_gnu_9.2, devel/python/3.8.1_intel_19.1

math/R: math/R/3.6.3

math/julia: math/julia/1.3.1

mpi/impi: mpi/impi/2018, mpi/impi/2019, mpi/impi/2020

mpi/openmpi: mpi/openmpi/4.0

numlib/mkl: numlib/mkl/2018, numlib/mkl/2019, numlib/mkl/2020

numlib/python_numpy: numlib/python_numpy/1.17.2_python_3.7.4_gnu_9.2

numlib/python_scipy: numlib/python_scipy/1.3.1_numpy_1.17.2_python_3.7.4_gnu_9.2

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

To learn more about a package execute:

$ module spider Foo

where "Foo" is the name of a module.

To find detailed information about a particular package you
must specify the version if there is more than one version:

$ module spider Foo/11.1

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
</pre>
''''module spider name/version'''' : If you search for the full name and version of a module, the result gives detailed information about that module version.
<pre>
$ module spider devel/cmake

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
devel/cmake: devel/cmake/3.16
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This module can be loaded directly: module load devel/cmake/3.16

Help:
Home page: https://www.cmake.org
Online Documentation: https://www.cmake.org/HTML/Documentation.html
Local Documentation: /opt/bwhpc/common/devel/cmake/3.16.4/doc
FAQ: https://gitlab.kitware.com/cmake/community/wikis/FAQ

In case of problems, please contact 'bwunicluster-hotline (at) lists.kit.edu'
or submit a trouble ticket at http://www.support.bwhpc-c5.de.

</pre>
Moreover, the same command shows the dependencies of a module. For example, the following command shows which modules need to be loaded before the module mpi/impi/2019 can be loaded:
<pre>
$ module spider mpi/impi/2019

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
mpi/impi: mpi/impi/2019
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

You will need to load all module(s) on any one of the lines below before the "mpi/impi/2019" module is available to load.

compiler/clang/9.0
compiler/gnu/9.2
compiler/intel/18.0
compiler/intel/19.0
compiler/intel/19.1

Help:
Intel(R) MPI Library

</pre>
 

= How do Modules work? =
The default shell on the bwHPC clusters is bash, so explanations and examples will be shown for bash. In general, programs cannot modify the environment of the shell they are being run from, so how can the module command do exactly that?
 
The module command is not a program, but a bash-function.
You can view its content using:
<pre>
$ type module
</pre>
and you will get the following result:
<pre>
$ type module
module is a function
module ()
{
eval $($LMOD_CMD bash "$@");
[ $? = 0 ] && eval $(${LMOD_SETTARG_CMD:-:} -s sh)
}
</pre>
In this function, Lmod is called. Its output on stdout is then executed inside your current shell using the bash built-in ''eval'' command. As a consequence, all output that you see from the module command is transmitted via stderr (file descriptor 2) or in some cases even stdin (file descriptor 0).
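
The mechanism can be reproduced in a few lines of bash. In the sketch below, <code>fake_lmod</code> and <code>mymodule</code> are hypothetical names invented for this illustration: <code>fake_lmod</code> stands in for the program behind <code>$LMOD_CMD</code> and simply prints shell code on stdout and a message on stderr.

```shell
#!/bin/bash
# fake_lmod stands in for the real Lmod program (hypothetical, for illustration).
# It is run inside $(...), i.e. in a subshell, so it cannot change the caller's
# environment itself; it can only print shell code for the caller to evaluate.
fake_lmod() {
    echo "export DEMO_TOOL_HOME=/opt/demo/tool;"    # stdout: code for the caller to eval
    echo "Loading demo/tool" >&2                     # stderr: message shown to the user
}

# mymodule mirrors the structure of the module function shown above:
# run the helper, then evaluate what it printed in the current shell.
mymodule() {
    eval "$(fake_lmod "$@")"
}

mymodule load demo/tool
echo "DEMO_TOOL_HOME is now: $DEMO_TOOL_HOME"
```

Only the text printed on stdout is evaluated, which is why the helper has to send its user-visible messages through a different channel.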
 
 
----
[[Category:bwUniCluster_2.0|bwUniCluster 2.0]]
[[#top|Back to top]]

BwUniCluster 2.0 User Access/SSH Keys

2021-02-02T08:51:42Z

S Raffeiner: /* Registering a Command Key */

{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:100%; border:1px solid #BBBBBB; background:#fff5fa; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
'''NOTE:''' Interactive SSH Keys are not valid all the time, but only for one hour after the last 2-factor login. They have to be "unlocked" by entering the OTP and service password. Please see the more detailed description in the Section [[BwUniCluster_2.0_User_Access/SSH_Keys#Registering_an_Interactive_Key|Registering an interactive key]].
|}
|}

'''SSH Keys''' are a mechanism for logging into a computer system without having to enter a password. Instead of authenticating yourself with something you know (a password), you prove your identity by showing the server something you have (a cryptographic key).

The usual process is the following:

* The user generates a pair of SSH Keys, a private key and a public key, on their local system. The private key never leaves the local system.

* The user then logs into the remote system using the remote system password and adds the public key to a file called ~/.ssh/authorized_keys.

* All following logins will no longer require the entry of the remote system password because the local system can prove to the remote system that it has a private key matching the public key on file.

While SSH Keys have many advantages, the concept also has '''a number of issues''' which make it hard to handle them securely:

* The private key on the local system is supposed to be protected by a strong passphrase. There is no possibility for the server to check if this is the case. Many users do not use a strong passphrase or do not use any passphrase at all. If such a private key is stolen, an attacker can immediately use it to access the remote system.

* There is no concept of validity. Users are not forced to regularly generate new SSH Key pairs and replace the old ones. Often the same key pair is used for many years and the users have no overview of how many systems they have stored their SSH Keys on.

* SSH Keys can be restricted so they can only be used to execute specific commands on the server, or to log in from specified IP addresses. Most users do not do this.

To fix these issues '''it is no longer possible to self-manage your SSH Keys by adding them to the ~/.ssh/authorized_keys file''' on bwUniCluster. SSH Keys have to be managed via the central bwIDM system instead. Existing authorized_keys files are ignored.

= Minimum requirements for SSH Keys =

Algorithms and Key sizes:

* 2048 bits or more for RSA
* 521 bits for ECDSA
* 256 bits (default) for ED25519

ECDSA-SK and ED25519-SK keys (for use with U2F Hardware Tokens) cannot be used yet.

'''Please set a strong passphrase for your private keys.'''
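
As a sketch of how these requirements translate into practice, the comments below show typical <code>ssh-keygen</code> calls, and a small helper checks one line of <code>ssh-keygen -l -f</code> output against the minimum sizes listed above. The key file names and the function <code>check_key_line</code> are assumptions made for this example; the authoritative check is the one performed by bwIDM.

```shell
#!/bin/bash
# Generating a key pair on your local system (each call prompts for a passphrase).
# The file names are arbitrary examples:
#   ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_bwunicluster
#   ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_bwunicluster
#   ssh-keygen -t ecdsa -b 521 -f ~/.ssh/id_ecdsa_bwunicluster

# check_key_line LINE  (hypothetical helper)
# LINE is one line of output from: ssh-keygen -l -f <public key file>
# Its format is: <bits> <fingerprint> <comment> (<TYPE>)
check_key_line() {
    local line=$1
    local bits=${line%% *}          # first field: key size in bits
    local type=${line##*\(}         # text after the last "("
    type=${type%\)}                 # strip the trailing ")"
    case "$type" in
        RSA)     [ "$bits" -ge 2048 ] ;;
        ECDSA)   [ "$bits" -eq 521 ] ;;
        ED25519) [ "$bits" -eq 256 ] ;;
        *)       return 1 ;;        # other types, e.g. the -SK variants, are rejected
    esac
}

# Example with a made-up fingerprint line:
if check_key_line "256 SHA256:examplefingerprint user@laptop (ED25519)"; then
    echo "key meets the minimum requirements"
fi
```

For a real key, the line to check can be obtained with <code>ssh-keygen -l -f</code> followed by the path of the public key file.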

 
 

= Adding a new SSH Key =

1. Log into [https://bwidm.scc.kit.edu https://bwidm.scc.kit.edu].

2. Click on '''My SSH Pubkeys''' or '''Meine SSH Pubkeys''' in the main menu.

3. Click on the '''Add SSH Key''' or '''SSH Key hochladen''' button.

[[File:Bwunicluster 2.0 access ssh keys empty.png|center]]

4. A new window will appear. Enter a name for the key and paste your SSH public key (NOT the private key!) into the box labelled "SSH Key:". Click on the button labelled '''Add''' or '''Hinzufügen'''. '''Note that you cannot add an SSH public key that has already been used before.'''

[[File:Bwunicluster 2.0 access ssh keys add.png|center]]

5. If everything worked fine your new key will show up in the user interface:

[[File:Bwunicluster 2.0 access ssh keys added.png|center]]

Newly added keys have a validity of three months. After that they will be revoked and put on a blocklist, so they cannot be used again.

Once you have added SSH Public Keys to the system you can bind them to one or more services for one of two reasons: to use them for generic, interactive logins ('''Interactive Key'''), or for automated logins ('''Command Key''').

 
 

= Registering an Interactive Key =

'''Interactive Keys''' can be used to log into a system for normal interactive use. They are not valid all the time, but only for one hour after the last 2-factor login. This means that on the first attempt to log into the bwUniCluster 2.0 system your SSH key will not be accepted, and you have to log in with a One-Time Password (OTP) and your service password. After that you won't have to enter the OTP and service password for one hour because your SSH Key has been unlocked. After the hour has passed, you have to enter the OTP and service password again on your next login attempt, and then your SSH Key will be unlocked for another hour.

Perform the following steps to register an interactive key:

1. Log into [https://bwidm.scc.kit.edu https://bwidm.scc.kit.edu].

2. Locate the requested service (bwUniCluster) in the main menu and click on '''Set SSH Key''' or '''SSH Key setzen'''.

3. The upper block shows the SSH Keys currently registered for the service. The lower block shows all SSH public keys which have been added to your account. Locate the SSH Key you want to use and click on '''Add''' or '''Hinzufügen'''.

[[File:Bwunicluster 2.0 access ssh keys service list.png|center]]

4. A new window appears. Choose '''Interactive''' under '''Type of usage''', enter an optional comment and click on '''Add''' or '''Hinzufügen'''.

[[File:Bwunicluster 2.0 access ssh keys service add.png|center]]

5. Your SSH key has now been registered to the service and can be used.

[[File:Bwunicluster 2.0 access ssh keys service added.png|center]]

 
 

= Registering a Command Key =

Passphrases, 2-factor authentication and service passwords make it impossible to integrate many scientific workflows with bwUniCluster 2.0. We therefore offer a second type of registration: '''Command Keys''', special keys which can be used for automation.

Command Keys are always valid and don't have to be unlocked. This makes these keys extremely valuable to a possible attacker and poses a security risk, so we enforce additional restrictions on these keys:

* They have to be restricted to a single command which can be executed.
* They have to be restricted to a single IP address (e.g. the workflow server) or a small number of IP addresses (e.g. the subnet of the institute).
* They have to be checked and approved by an HPC administrator before they can be used.
* The validity is reduced to one month.

The process for registering a Command Key is the same as the one for an Interactive Key, but after selecting '''Command''' under '''Type of usage''' two additional fields labelled '''Command''' and '''From (network address)''' appear which have to be filled in. Please also provide a comment to speed up the approval process.

If you want to register a command key to be able to transfer data automatically, please use the following string as the '''Command''':

<pre>
/usr/bin/rrsync -ro / -rw /
</pre>

After the key has been added, it will be marked as '''Pending''':

[[File:Bwunicluster 2.0 access ssh keys service add command.png|center]]

You will receive an e-mail as soon as the key has been approved and can be used.

 
 

= Revoke/Delete an SSH Key =

1. Log into [https://bwidm.scc.kit.edu https://bwidm.scc.kit.edu].

2. Click on '''My SSH Pubkeys''' or '''Meine SSH Pubkeys''' in the main menu.

3. Click on the '''Revoke''' or '''Zurückziehen''' button next to the SSH Key you want to revoke.

'''Please note that revoked keys are blocked and cannot be used again.'''

BwUniCluster2.0/Slurm

2020-12-22T09:58:17Z

S Raffeiner: /* sbatch Command Parameters */

<div id="top"></div>
= Slurm HPC Workload Manager =
== Specification ==
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
 
 
Any kind of calculation on the compute nodes of [[bwUniCluster 2.0|bwUniCluster 2.0]] requires the user to define the calculation as a single command or a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to a resource and workload managing software. bwUniCluster 2.0 uses the workload manager Slurm, so every job submission is carried out with Slurm commands. Slurm queues and runs user jobs based on fair sharing policies.
 
 

== Slurm Commands (excerpt) ==
The following table lists some of the most frequently used Slurm commands for non-administrators working on bwUniCluster 2.0.
{| width=750px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job Submission : sbatch|sbatch]] || Submits a job and queues it in an input queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job or requested resources [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}
If your job was submitted to the "multiple" queue you can log into the allocated nodes via SSH as soon as the job is running.

 
* [https://slurm.schedmd.com/tutorials.html Slurm Tutorials]
* [https://slurm.schedmd.com/pdfs/summary.pdf Slurm command/option summary (2 pages)]
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
 

== Job Submission : sbatch ==
Batch jobs are submitted by using the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. However, when a batch job actually starts depends on the availability of the requested resources and on the fair sharing value.
 
 
=== sbatch Command Parameters ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be used from the command line or in your job script.
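
As an illustration of the script form, the sketch below writes a small job script that uses only options documented in this section. The file name <code>demo_job.sh</code>, the job name and the chosen values are assumptions made for this example and should be adjusted to the actual job.

```shell
#!/bin/bash
# Write an example job script (file name and values are illustrative).
# In the --output file name, Slurm replaces %j with the job ID.
cat > demo_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00
#SBATCH --output=demo_%j.out

echo "Job started on $(hostname)"
EOF

# Submit the script on a login node with:
#   sbatch demo_job.sh
# The same options can be given on the command line instead, e.g.:
#   sbatch --time=00:10:00 --nodes=1 --ntasks-per-node=4 demo_job.sh
```

Lines starting with <code>#SBATCH</code> are ordinary comments to bash, so the script can also be run directly for a quick syntax check before it is submitted.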
{| width=750px class="wikitable"
! colspan="3" | sbatch Options
|-
! Command line
! Script
! Purpose
|- style="vertical-align:top;"
| -t ''time'' or --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit. 
|-
|- style="vertical-align:top;"
| -N ''count'' or --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n ''count'' or --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count (<= 28 and <= 40 resp.) of tasks per node. (Replaces the option ppn of MOAB.)
|-
|- style="vertical-align:top;"
| -c ''count'' or --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory in MegaByte per node. (Default value is 128000 and 96000 MB resp., i.e. you should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum Memory required per allocated CPU. (Replaces the option pmem of MOAB. You should omit the setting of this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J ''name'' or --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If adding an environment variable to the submission environment is intended, the argument ALL must be added.
|-
|- style="vertical-align:top;"
| -A ''group-name'' or --account=''group-name''
| #SBATCH --account=''group-name''
| Change resources used by this job to specified group. You may need this option if your account is assigned to more than one group. By command "scontrol show job" the project group the job is accounted on can be seen behind "Account=".
|-
|- style="vertical-align:top;"
| -p ''queue-name'' or --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C ''LSDF'' or --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Job constraint LSDF Filesystems.
|-
|- style="vertical-align:top;"
| -C ''BEEOND'' or --constraint=''BEEOND''
| #SBATCH --constraint=BEEOND
| Job constraint BeeOND file system.
|-
|}
 

==== sbatch --partition ''queues'' ====
Queue classes define the maximum resources of each queue of the compute system, such as walltime, nodes and processes per node. Details can be found here:
* [[BwUniCluster_2.0_Batch_Queues#sbatch_-p_queue|bwUniCluster 2.0 queue settings]]
 

=== sbatch Examples ===
==== Serial Programs ====
To submit a serial job that runs the script '''job.sh''' and that requires 5000 MB of main memory and 10 minutes of wall clock time

a) execute:
<pre>
$ sbatch -p dev_single -n 1 -t 10:00 --mem=5000 job.sh
</pre>
or
b) add after the initial line of your script '''job.sh''' the lines (here with a high memory request):
<source lang="bash">
#SBATCH --ntasks=1
#SBATCH --time=10
#SBATCH --mem=200gb
#SBATCH --job-name=simple
</source>
and execute the modified script with the command line option ''--partition=fat'' (with ''--partition=(dev_)single'' maximum ''--mem=96gb'' is possible):
<pre>
$ sbatch --partition=fat job.sh
</pre>
Note that sbatch command line options override script options.
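The two examples above write the same ten-minute limit in two ways (''-t 10:00'' and ''--time=10''), because Slurm reads a bare number as minutes and ''MM:SS'' as minutes and seconds. A minimal sketch of that interpretation, assuming only the three formats used on this page; the function name is a hypothetical helper and not part of Slurm:

<source lang="bash">
#!/bin/bash
# Hypothetical helper (not part of Slurm): convert a --time value to seconds.
# Handles only "MM", "MM:SS" and "HH:MM:SS"; the "days-..." forms are left out.
slurm_time_to_seconds() {
    local value=$1 h=0 m=0 s=0
    case "${value}" in
        *:*:*) IFS=: read -r h m s <<< "${value}" ;;
        *:*)   IFS=: read -r m s   <<< "${value}" ;;
        *)     m=${value} ;;
    esac
    # 10# forces base 10 so that values like "08" are not read as octal
    echo $(( 10#${h} * 3600 + 10#${m} * 60 + 10#${s} ))
}

slurm_time_to_seconds 10        # --time=10
slurm_time_to_seconds 10:00     # -t 10:00
slurm_time_to_seconds 01:00:00  # --time=01:00:00
</source>

The first two calls print the same number of seconds, which is why both submission variants request the same wall clock time.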
 
 

==== Multithreaded Programs ====
Multithreaded programs operate faster than serial programs on CPUs with multiple cores. 
Moreover, multiple threads of one process share resources such as memory.
 
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) a number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''Because hyperthreading is switched on on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n, if you want to use n threads.'''
 
To submit a batch job called ''OpenMP_Test'' that runs a 40-fold threaded program ''omp_exe'' which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:
 
a) execute (since sbatch expects a batch script, the executable is passed via ''--wrap''):
<pre>
$ sbatch -p single --export=ALL,OMP_NUM_THREADS=40 -J OpenMP_Test -N 1 -c 80 -t 40 --mem=6000 --wrap="./omp_exe"
</pre>
or
b) generate the script '''job_omp.sh''' containing the following lines:
<source lang="bash">
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=80
#SBATCH --time=40:00
#SBATCH --mem=6000mb
#SBATCH --export=ALL,EXECUTABLE=./omp_exe
#SBATCH -J OpenMP_Test

#Usually you should set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))
echo "Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe
</source>
When using the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If necessary, add a line loading the required modulefile to enable the OpenMP environment. Then execute the script '''job_omp.sh''', adding the queue class ''single'' as sbatch option:
<pre>
$ sbatch -p single job_omp.sh
</pre>
Note that sbatch command line options override script options, e.g.,
<pre>
$ sbatch --partition=single --mem=200 job_omp.sh
</pre>
overrides the script setting of 6000 MByte with 200 MByte.
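The hyperthreading rule stated above (request 2*n CPUs for n threads) can be sketched as two small conversions. Both function names are hypothetical helpers introduced for illustration; the factor 2 is the assumption of two hardware threads per physical core.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: CPUs to request with -c for a given number of threads,
# assuming two hardware threads per physical core as stated above.
cpus_for_threads() {
    local threads=$1
    echo $(( threads * 2 ))
}

# Inverse direction, as done inside the job script: derive the thread count
# from SLURM_CPUS_PER_TASK (defaults to 2 here so the sketch runs outside a job).
threads_from_allocation() {
    local cpus=${SLURM_CPUS_PER_TASK:-2}
    echo $(( cpus / 2 ))
}

cpus_for_threads 40                              # prints 80
SLURM_CPUS_PER_TASK=80 threads_from_allocation   # prints 40
</source>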
 
 

==== MPI Parallel Programs ====
MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., '''MPI tasks''', run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
 
Multiple MPI tasks must be launched via '''mpirun''', e.g. 4 MPI tasks of ''my_par_program'':
<pre>
$ mpirun -n 4 my_par_program
</pre>
This command runs 4 MPI tasks of ''my_par_program'' on the node you are logged in to.
To run this command with a loaded Intel MPI, the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset first ($ unset I_MPI_HYDRA_BOOTSTRAP).

When MPI parallel programs run in a batch job, the interactive environment, in particular the loaded modules, is also set in the batch job. If you want a defined module environment in your batch job, purge all modules before loading the desired ones.
 
 
===== OpenMPI =====

If you want to run jobs on batch nodes, generate a wrapper script ''job_ompi.sh'' for '''OpenMPI''' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when a defined module environment related to OpenMPI is wished
module load mpi/openmpi/<placeholder_for_version>
mpirun --bind-to core --map-by core -report-bindings my_par_program
</source>
'''Attention:''' Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about the number of processes and the node hostnames. '''ALWAYS''' use the MPI options '''''--bind-to core''''' and '''''--map-by core|socket|node'''''. Type ''mpirun --help'' for an explanation of the values of the mpirun option ''--map-by''.
 
To run 4 OpenMPI tasks on a single node, each requiring 2000 MByte, for 1 hour, execute:
<pre>
$ sbatch -p single -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh
</pre>
 

===== Intel MPI =====

Generate a wrapper script for '''Intel MPI''', ''job_impi.sh'' containing the following lines:
<source lang="bash">
#!/bin/bash
# Use when a defined module environment related to Intel MPI is wished
module load mpi/impi/<placeholder_for_version>
mpiexec.hydra -bootstrap slurm my_par_program
</source>
'''Attention:''' 
Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.
 
To launch and run 200 Intel MPI tasks on 5 nodes, each node requiring 80 GByte, for 5 hours, execute:
<pre>
$ sbatch --partition=multiple -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh
</pre>
 
If you want to use 128 or more nodes, you must also set the environment variable as follows: 
export I_MPI_HYDRA_BRANCH_COUNT=-1
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
 
 

==== Multithreaded + MPI parallel Programs ====
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes. '''Because hyperthreading is switched on on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n, if you want to use n threads.'''
 
 
===== OpenMPI with Multithreading =====
Multiple MPI tasks using '''OpenMPI''' must be launched by the MPI parallel program '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
 
'''For OpenMPI''', a job script ''job_ompi_omp.sh'' that runs an MPI program with 4 tasks, each task being the 28-fold threaded program ''ompi_omp_program'' requiring 3000 MByte of physical memory per thread (with 28 threads per MPI task this amounts to 28*3000 MByte = 84000 MByte per MPI task), with a total wall clock time of 3 hours, looks like:

<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=56
#SBATCH --time=03:00:00
#SBATCH --mem=83gb # 84000 MB = 84000/1024 GB = 82.0 GB, rounded up
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings"
export NUM_CORES=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Submit the script '''job_ompi_omp.sh''' with the command sbatch:
<pre>
$ sbatch -p multiple_e ./job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
* With the option ''--map-by node:PE=<value>'' (neighbored) MPI tasks will be attached to different nodes and each MPI task is bound to the first core of a node. <value> must be set to ${OMP_NUM_THREADS}.
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun-options '''--bind-to core''', '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program.
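The memory arithmetic of the hybrid example (threads per task times memory per thread, converted to whole GByte) can be sketched as follows. The function name is a hypothetical helper; the sketch assumes one MPI task per node, so that the per-task value is also the per-node value expected by ''--mem''.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: memory request in whole GB for one MPI task,
# given threads per task and MB per thread, rounded up.
mem_per_task_gb() {
    local threads=$1 mb_per_thread=$2
    local total_mb=$(( threads * mb_per_thread ))
    echo $(( (total_mb + 1023) / 1024 ))
}

mem_per_task_gb 28 3000   # 84000 MB, prints 83
</source>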
 

===== Intel MPI with Multithreading =====

Multiple Intel MPI tasks must be launched by the MPI parallel program '''mpiexec.hydra'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

'''For Intel MPI''', a job script ''job_impi_omp.sh'' that runs an Intel MPI program with 10 tasks, each task being the 40-fold threaded program ''impi_omp_program'' requiring 96000 MByte of total physical memory per task, with a total wall clock time of 1 hour, looks like:


<source lang="bash">
#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=80
#SBATCH --time=60
#SBATCH --mem=96000
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program
#SBATCH --output="parprog_impi_omp_%j.out"

#If using more than one MPI task per node please set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,scatter prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

# Use when a defined module environment related to Intel MPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))
export MPIRUN_OPTIONS="-binding domain=omp:compact -print-rank-map -envall"
export NUM_PROCS=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))
echo "${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}"
echo $startexe
exec $startexe
</source>
When using the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If you only run one MPI task per node please set KMP_AFFINITY=compact,1,0.
 
If you want to use 128 or more nodes, you must also set the environment variable as follows: 
export I_MPI_HYDRA_BRANCH_COUNT=-1
 
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
 
 
Submit the script '''job_impi_omp.sh''' with the command sbatch:
<pre>
$ sbatch -p multiple ./job_impi_omp.sh
</pre>
 
The mpirun option ''-print-rank-map'' shows the bindings between MPI tasks and nodes (not very beneficial). The option ''-binding'' binds MPI tasks (processes) to a particular processor; ''domain=omp'' means that the domain size is determined by the number of threads. If you choose 2 MPI tasks per node, you should use ''-binding "cell=unit;map=bunch"''; this binding maps one MPI process to each socket.
 
 

==== Chain jobs ====
The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem by a job chain. A job chain is a sequence of jobs where each job automatically starts its successor.
<source lang="bash">
#!/bin/bash
##############################################
## simple Slurm submitter script to setup   ##
## a chain of jobs using Slurm              ##
##############################################
## ver. : 2018-11-27, KIT, SCC

## Define maximum number of jobs via positional parameter 1, default is 5
max_nojob=${1:-5}

## Define your jobscript (e.g. "~/chain_job.sh")
chain_link_job=${PWD}/chain_job.sh

## Define type of dependency via positional parameter 2, default is 'afterok'
dep_type="${2:-afterok}"
## -> List of all dependencies:
## https://slurm.schedmd.com/sbatch.html

myloop_counter=1
## Submit loop
while [ ${myloop_counter} -le ${max_nojob} ] ; do
   ##
   ## Differ slurm_opt depending on chain link number
   if [ ${myloop_counter} -eq 1 ] ; then
      slurm_opt=""
   else
      slurm_opt="-d ${dep_type}:${jobID}"
   fi
   ##
   ## Print current iteration number and sbatch command
   echo "Chain job iteration = ${myloop_counter}"
   echo " sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}"
   ## Store job ID for next iteration by stripping the text from the sbatch output
   ## (replace <queue> with the name of the queue to be used)
   jobID=$(sbatch -p <queue> --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2>&1 | sed 's/[S,a-z]* //g')
   ##
   ## Check if an error occurred: a successful submission leaves only the numeric job ID
   if ! [[ "${jobID}" =~ ^[0-9]+$ ]] ; then
      echo " -> submission failed!" ; exit 1
   else
      echo " -> job number = ${jobID}"
   fi
   ##
   ## Increase counter
   let myloop_counter+=1
done
</source>
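The two steps that make the chain work, extracting the job ID from the sbatch message and turning it into a dependency option, can also be written as separate functions and tried without submitting anything. This is only a sketch: both function names are hypothetical, and it assumes that a successful submission prints a line of the form "Submitted batch job <id>".

<source lang="bash">
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from the line that sbatch
# prints on success ("Submitted batch job <id>"); returns non-zero otherwise.
extract_job_id() {
    local line=$1
    local re='^Submitted batch job ([0-9]+)'
    if [[ "${line}" =~ ${re} ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# Hypothetical helper: build the dependency option for a chain link.
# The first link has no predecessor, so the option is empty.
dependency_option() {
    local dep_type=$1 previous_id=$2
    [ -n "${previous_id}" ] && echo "-d ${dep_type}:${previous_id}"
    return 0
}

id=$(extract_job_id "Submitted batch job 18089884")
dependency_option afterok "${id}"    # prints: -d afterok:18089884
</source>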
 

==== GPU jobs ====

The nodes in the gpu_4 and gpu_8 queues have 4 or 8 NVIDIA Tesla V100 GPUs, respectively. Submitting a job to these queues is not enough to allocate one or more GPUs; you have to request them with the "--gres=gpu" parameter and specify how many GPUs your job needs, e.g. "--gres=gpu:2" will request two GPUs.

The GPU nodes are shared between multiple jobs if the jobs don't request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.

a) add after the initial line of your script job.sh the line including the
information about the GPU usage: #SBATCH --gres=gpu:2
<pre>
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --time=02:00:00
#SBATCH --mem=4000
#SBATCH --gres=gpu:2
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh
</pre>
 
If you start an interactive session on one of the GPU nodes, you can use the "nvidia-smi" command to list the GPUs allocated to your job:
<pre>
$ nvidia-smi
Sun Mar 29 15:20:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   29C    P0    39W / 300W |      9MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |      8MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     14228      G   /usr/bin/X                                     8MiB |
|    1     14228      G   /usr/bin/X                                     8MiB |
+-----------------------------------------------------------------------------+
</pre>
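Inside a job script, the number of allocated GPUs can also be checked without nvidia-smi. The following sketch assumes that Slurm exports CUDA_VISIBLE_DEVICES as a comma-separated list of the GPUs allocated via "--gres=gpu"; this is the usual behaviour but depends on the site configuration, and the function name is a hypothetical helper.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: count GPUs listed in CUDA_VISIBLE_DEVICES
# (comma-separated device indices); prints 0 if the variable is empty or unset.
allocated_gpu_count() {
    local devices=${CUDA_VISIBLE_DEVICES:-}
    if [ -z "${devices}" ]; then
        echo 0
        return
    fi
    local -a list
    IFS=, read -r -a list <<< "${devices}"
    echo "${#list[@]}"
}

CUDA_VISIBLE_DEVICES=0,1 allocated_gpu_count   # prints 2
</source>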

 
 

==== LSDF Online Storage ====
On bwUniCluster 2.0 you can use the LSDF Online Storage on the HPC cluster nodes for special cases. Please request this service separately ([https://www.lsdf.kit.edu/os/storagerequest/: LSDF Storage Request]).
To mount the LSDF Online Storage on the compute nodes during the job runtime,
the constraint flag "LSDF" has to be set.

a) add after the initial line of your script job.sh the line including the
information about the LSDF Online Storage usage: #SBATCH --constraint=LSDF
<pre>
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=120
#SBATCH --mem=200
#SBATCH --constraint=LSDF
</pre>

or b) execute:
<pre>
$ sbatch -p <queue> -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh
</pre>
 
For the usage of the LSDF Online Storage
the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
 
 

====BeeOND (BeeGFS On-Demand)====

BeeOND instances are integrated into the prolog and epilog script of the cluster batch system, Slurm. It can be used on the compute nodes during the job runtime with the constraint flag "BEEOND" ([[BwUniCluster_2.0_Slurm_common_Features#sbatch_Command_Parameters|Slurm Command Parameters]])
<pre>
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=BEEOND
</pre>

After your job has started you can find the private on-demand file system in '''/mnt/odfs/$SLURM_JOB_ID''' directory. The mountpoint comes with three pre-configured directories:
<pre>
#for small files (stripe count = 1)
/mnt/odfs/$SLURM_JOB_ID/stripe_1
#stripe count = 4
/mnt/odfs/$SLURM_JOB_ID/stripe_default
#stripe count = 8, 16 or 32, use these directories for medium sized and large files or when using MPI-IO
/mnt/odfs/$SLURM_JOB_ID/stripe_8, /mnt/odfs/$SLURM_JOB_ID/stripe_16 or /mnt/odfs/$SLURM_JOB_ID/stripe_32
</pre>

If you request fewer nodes than the stripe count, the stripe count is limited to the number of nodes. For example, if you only request 8 nodes, the directory with stripe count 16 effectively has a stripe count of 8.

The capacity of the private file system depends on the number of nodes. For each node you get 250 GByte.

!!! Be careful when creating large files: always use the directory with the maximum stripe count for large files.
If you create large files, use a higher stripe count. For example, if your largest file is 1.1 TByte, you have to use a stripe count larger than 4 (4 x 250 GByte = 1 TByte).

If you request 100 nodes for your job, the private file system has a capacity of 100 * 250 GByte = 25 TByte (approx.).
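The sizing rules above can be sketched as simple integer arithmetic. All function names are hypothetical helpers; the sketch assumes 250 GByte per node, decimal units (1 TByte = 1000 GByte), and that a file is spread evenly over its stripes.

<source lang="bash">
#!/bin/bash
# Hypothetical helpers illustrating the BeeOND rules described above.
GB_PER_NODE=250

# Total capacity in GB of the on-demand file system for a number of nodes.
beeond_capacity_gb() {
    echo $(( $1 * GB_PER_NODE ))
}

# Effective stripe count: never larger than the number of nodes.
effective_stripe_count() {
    local nodes=$1 stripe=$2
    echo $(( stripe < nodes ? stripe : nodes ))
}

# Smallest stripe count whose stripes can hold a file of the given size in GB.
min_stripe_count_for_file() {
    local size_gb=$1
    echo $(( (size_gb + GB_PER_NODE - 1) / GB_PER_NODE ))
}

beeond_capacity_gb 100          # prints 25000
effective_stripe_count 8 16     # prints 8
min_stripe_count_for_file 1100  # prints 5
</source>

For the 1.1 TByte file the result is 5, so the next larger pre-configured directory (stripe count 8) would be the one to use.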

'''Recommendation:'''

The private file system uses its own metadata server. This metadata server is started on the first nodes. Depending on your application, the metadata server can consume a considerable amount of CPU power. Adding an extra node to your job may therefore improve the usability of the on-demand file system. Start your application with the MPI option:
<pre>
mpirun -nolocal myapplication
</pre>
With the -nolocal option the node where mpirun is initiated is not used for your application. This node is fully available for the meta data server of your requested on-demand file system.

Example job script:
<pre>
#!/bin/bash
#very simple example on how to use a private on-demand file system
#SBATCH -N 10
#SBATCH --constraint=BEEOND

#create a workspace
ws_allocate myresults-$SLURM_JOB_ID 90
RESULTDIR=`ws_find myresults-$SLURM_JOB_ID`

#Set ENV variable to on-demand file system
ODFSDIR=/mnt/odfs/$SLURM_JOB_ID/stripe_16/

#start application and write results to on-demand file system
mpirun -nolocal myapplication -o $ODFSDIR/results

#Copy back data after your job application ends
rsync -av $ODFSDIR/results $RESULTDIR
</pre>
 
 

== Start time of job or resources : squeue --start ==
The command can be used by any user to display the estimated start time of a job. The estimate is based on a number of analysis types: historical usage, earliest available reservable resources, and priority-based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by '''any user'''.
 
 

== List of your submitted jobs : squeue ==
Displays information about YOUR active, pending and/or recently completed jobs. By default, the command displays all of your own active and pending jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).
 
 
=== Access ===
By default, this command can be run by any user.
 
 

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -l, --long
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.
|}
 

=== Examples ===
''squeue'' example on bwUniCluster 2.0 (Only your own jobs are displayed!).
<pre>
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18088744    single CPV.sbat   ab1234 PD       0:00      1 (Priority)
          18098414  multiple CPV.sbat   ab1234 PD       0:00      2 (Priority)
          18090089  multiple CPV.sbat   ab1234  R       2:27      2 uc2n[127-128]
$ squeue -l
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
          18088654    single CPV.sbat   ab1234 COMPLETI       4:29   2:00:00      1 uc2n374
          18088785    single CPV.sbat   ab1234  PENDING       0:00   2:00:00      1 (Priority)
          18098414  multiple CPV.sbat   ab1234  PENDING       0:00   2:00:00      2 (Priority)
          18088683    single CPV.sbat   ab1234  RUNNING       0:14   2:00:00      1 uc2n413
</pre>
* The output of ''squeue'' shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.
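Counting running and pending jobs from the ''squeue'' output can be scripted. The following is a minimal sketch, assuming the default column layout shown above (state in the fifth column) and job names without spaces; the function name is a hypothetical helper.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: count jobs per state from squeue output read on stdin.
# Assumes the default column layout shown above, where the fifth column is ST.
count_jobs_by_state() {
    awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# Inside a real session this would be: squeue | count_jobs_by_state
count_jobs_by_state <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18088744 single CPV.sbat ab1234 PD 0:00 1 (Priority)
18098414 multiple CPV.sbat ab1234 PD 0:00 2 (Priority)
18090089 multiple CPV.sbat ab1234 R 2:27 2 uc2n[127-128]
EOF
</source>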
 

== Shows free resources : sinfo_t_idle ==
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
 
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
 
 
=== Access ===
By default, this command can be used by any user or administrator.
 
 
=== Example ===
* The following command displays what resources are available for immediate use for the whole partition.
<pre>$ sinfo_t_idle
Partition dev_multiple : 8 nodes idle
Partition multiple : 332 nodes idle
Partition dev_single : 4 nodes idle
Partition single : 76 nodes idle
Partition long : 80 nodes idle
Partition fat : 5 nodes idle
Partition dev_special : 342 nodes idle
Partition special : 342 nodes idle
Partition dev_multiple_e: 7 nodes idle
Partition multiple_e : 335 nodes idle
Partition gpu_4 : 12 nodes idle
Partition gpu_8 : 6 nodes idle
</pre>
* For the above example jobs in all partitions can be run immediately.
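To use this information in a script, the number of idle nodes of a single partition can be extracted from the output. This is a sketch that assumes the line format shown above ("Partition <name> : <n> nodes idle"); the function name is a hypothetical helper and not part of the SCC script.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: print the number of idle nodes of one partition from
# sinfo_t_idle output on stdin (format "Partition <name> : <n> nodes idle").
idle_nodes() {
    local partition=$1
    awk -v p="${partition}" '
        { sub(/:/, " : ") }            # tolerate a name glued to the colon
        $1 == "Partition" && $2 == p { print $4; found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Inside a real session this would be: sinfo_t_idle | idle_nodes multiple
idle_nodes dev_multiple_e <<'EOF'
Partition multiple : 332 nodes idle
Partition dev_multiple_e: 7 nodes idle
EOF
</source>

The function returns a non-zero status if the partition does not appear in the input, so it can be used directly in an ''if'' condition.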
 

== Detailed job information : scontrol show job ==
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol).
 
Display the state of all your jobs in normal mode: scontrol show job
 
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
 
 
=== Access ===
* End users can use scontrol show job to view the status of their '''own jobs''' only.
 

=== Arguments ===
{| width=750px class="wikitable"
|-
! Option !! Default !! Description !! Example
|- style="vertical-align:top;"
|- style="width:12%;"
| -d
| (n/a)
| Detailed mode
| Example: Display the state with jobid 18089884 in detailed mode. <pre>scontrol -d show job 18089884</pre>
|}
 
 

=== Scontrol show job Example ===
Here is an example from bwUniCluster 2.0.
<pre>
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18089884 multiple CPV.sbat bq0742 R 33:44 2 uc2n[165-166]

$
$ # now, see what's up with my running job with jobid 18089884
$
$ scontrol show job 18089884

JobId=18089884 JobName=CPV.sbatch
UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A
Priority=3 Nice=0 Account=kit QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54
AccrueTime=2020-03-16T14:14:54
StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51
Partition=multiple AllocNode:Sid=uc2n995:5064
ReqNodeList=(null) ExcNodeList=(null)
NodeList=uc2n[165-166]
BatchHost=uc2n165
NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=160,mem=96320M,node=2,billing=160
Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch
WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin
StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
StdIn=/dev/null
StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out
Power=
MailUser=(null) MailType=NONE
</pre>
 
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
* Which state is the job in?
<pre>$ scontrol show job 18089884 | grep -i State
JobState=COMPLETED Reason=None Dependency=(null)
</pre>
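When only the value of a single field is needed, the "Key=Value" pairs can be split instead of grepping whole lines. A minimal sketch, assuming that the value itself contains no spaces; the function name is a hypothetical helper.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: print the value of one "Key=Value" field from
# scontrol show job output on stdin.
job_field() {
    local key=$1
    tr ' ' '\n' | awk -F= -v k="${key}" '$1 == k { print substr($0, length(k) + 2); exit }'
}

# Inside a real session: scontrol show job <jobid> | job_field JobState
echo "JobState=RUNNING Reason=None Dependency=(null)" | job_field JobState
</source>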
 

== Cancel Slurm Jobs ==
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).
 
 
=== Canceling own jobs : scancel ===
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:
<pre>
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
</pre>

 
{| width=750px class="wikitable"
! Flag !! Default !! Description !! Example
|- style="vertical-align:top;"
| -i, --interactive
| (n/a)
| Interactive mode.
| Cancel the job 987654 interactively. <pre> scancel -i 987654 </pre>
|-
| -t, --state
| (n/a)
| Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED".
| Cancel all jobs in state "PENDING". <pre> scancel -t "PENDING" </pre>
|}
 

= Resource Managers =
=== Batch Job (Slurm) Variables ===
The following environment variables of Slurm are added to your environment once your job has started
(only an excerpt of the most important ones).
{| width=750px class="wikitable"
! Environment !! Brief explanation
|-
| SLURM_JOB_CPUS_PER_NODE
| Number of CPUs per node dedicated to the job
|-
| SLURM_JOB_NODELIST
| List of nodes dedicated to the job
|-
| SLURM_JOB_NUM_NODES
| Number of nodes dedicated to the job
|-
| SLURM_MEM_PER_NODE
| Memory per node dedicated to the job
|-
| SLURM_NPROCS
| Total number of processes dedicated to the job
|-
| SLURM_CLUSTER_NAME
| Name of the cluster executing the job
|-
| SLURM_CPUS_PER_TASK
| Number of CPUs requested per task
|-
| SLURM_JOB_ACCOUNT
| Account name
|-
| SLURM_JOB_ID
| Job ID
|-
| SLURM_JOB_NAME
| Job Name
|-
| SLURM_JOB_PARTITION
| Partition/queue running the job
|-
| SLURM_JOB_UID
| User ID of the job's owner
|-
| SLURM_SUBMIT_DIR
| Job submit folder. The directory from which sbatch was invoked.
|-
| SLURM_JOB_USER
| User name of the job's owner
|-
| SLURM_RESTART_COUNT
| Number of times job has restarted
|-
| SLURM_PROCID
| Task ID (MPI rank)
|-
| SLURM_NTASKS
| The total number of tasks available for the job
|-
| SLURM_STEP_ID
| Job step ID
|-
| SLURM_STEP_NUM_TASKS
| Task count (number of MPI ranks)
|-
| SLURM_JOB_CONSTRAINT
| Job constraints
|}
See also:
* [https://slurm.schedmd.com/sbatch.html#lbAI Slurm input and output environment variables]
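A typical use of these variables is a short summary line at the top of the job output. The following sketch uses only variables from the table above; the function name is a hypothetical helper, and the default values exist only so that the snippet also runs outside of a Slurm job.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: one-line job summary built from the variables above.
# The ":-" defaults keep the sketch runnable outside of a Slurm job.
job_summary() {
    echo "job ${SLURM_JOB_ID:-n/a} (${SLURM_JOB_NAME:-n/a}) in partition ${SLURM_JOB_PARTITION:-n/a}:" \
         "${SLURM_NTASKS:-1} task(s) on ${SLURM_JOB_NUM_NODES:-1} node(s)"
}

job_summary
</source>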
 

=== Job Exit Codes ===
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
 
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
 
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
 
 
==== Displaying Exit Codes and Signals ====
SLURM displays a job's exit code in the output of the '''scontrol show job''' and the sview utility.
 
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon(:).
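The "<exit code>:<signal>" notation and the mapping of negative values into the range 0 - 255 can be illustrated with two small functions. Both names are hypothetical helpers introduced for this sketch.

<source lang="bash">
#!/bin/bash
# Hypothetical helper: split an "ExitCode" value of the form <code>:<signal>.
decode_exit_code() {
    local code=${1%%:*} signal=${1##*:}
    echo "exit code ${code}, signal ${signal}"
}

# How a negative return value maps into the unsigned 8-bit range.
to_unsigned_8bit() {
    echo $(( $1 & 255 ))
}

decode_exit_code 0:9    # prints: exit code 0, signal 9
to_unsigned_8bit -1     # prints: 255
</source>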
 
 
==== Submitting Termination Signal ====
Here is an example, how to 'save' a Slurm termination signal in a typical jobscript.
<source lang="bash">
[...]
mpirun -np <#cores> <EXE_BIN_DIR>/<executable> ... (options) 2>&1
exit_code=$?
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable <EXE_BIN_DIR>/<executable> finished with exit code ${exit_code}"
[...]
</source>
* Do not use ''''time'''' mpirun! The exit code will be the one submitted by the first (time) program.
* You do not need an '''exit $exit_code''' in the scripts.
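The pattern can be tried without MPI. The sketch below wraps an arbitrary command in a hypothetical function ''run_and_report'' and shows that "$?" has to be saved directly after the command of interest, because every later command replaces it.

<source lang="bash">
#!/bin/bash
# Sketch: "$?" always refers to the most recent command, so it has to be
# saved directly after the program of interest.
run_and_report() {
    "$@"
    local exit_code=$?          # captured immediately after the command
    if [ "${exit_code}" -eq 0 ]; then
        echo "all clean..."
    else
        echo "finished with exit code ${exit_code}"
    fi
    return "${exit_code}"
}

run_and_report true
run_and_report sh -c 'exit 3' || true
</source>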
 
 
----
[[Category:bwUniCluster 2.0|bwUniCluster 2.0]]
[[#top|Back to top]]

BwUniCluster2.0/FAQ - broadwell partition

2020-11-21T16:42:48Z

S Raffeiner:

FAQs concerning best practice of [[BwUniCluster_2.0_Hardware_and_Architecture#Components_of_bwUniCluster|bwUniCluster broadwell partition]] (aka "extension" partition).

__TOC__

= Login =
== Are there separate login nodes for the bwUniCluster broadwell partition? ==
* Yes, but primarily to be used for compiling code.

== How to login to broadwell login nodes? ==
* You can log in directly to the broadwell partition login nodes using
<pre>
$ ssh username@uc1e.scc.kit.edu
</pre>
* Code compiled on the broadwell login nodes will not run optimally on the new "Cascade Lake" nodes.
 

= Compilation =
== How to compile code on broadwell (extension) nodes? ==
To use the code only on the partition multiple_e:
<pre>
$ icc/ifort -xHost [-further_options]
</pre>

== How to compile code to be used on ALL partitions? ==
On uc1e (= extension) login nodes:
<pre>
$ icc/ifort -xCORE-AVX2 -axCORE-AVX512 [-further_options]
</pre>
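
In a build script the two cases can be selected with a small switch. A sketch, assuming a hypothetical variable <code>TARGET</code> that is not part of the cluster environment; only the compiler flags are taken from this FAQ:
<source lang="bash">
TARGET="all"   # hypothetical switch: "extension" or "all"
case "$TARGET" in
    extension) ARCH_FLAGS="-xHost" ;;                      # partition multiple_e only
    *)         ARCH_FLAGS="-xCORE-AVX2 -axCORE-AVX512" ;;  # usable on all partitions
esac
# Print the resulting compiler call instead of running it.
echo "icc ${ARCH_FLAGS} [-further_options]"
</source>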
 

= Job execution =
== How to submit jobs to the broadwell (= extension) partition ==
Jobs are only placed on the broadwell nodes if the partition is specified explicitly, i.e.:
<pre>
$ sbatch -p multiple_e
</pre>

== Can I use my old multinode job script for the new broadwell partition? ==
Yes, but please note that all broadwell nodes have '''28 cores per node'''.
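
A minimal job script header for two broadwell nodes might look as follows. The node count and the final <code>echo</code> are illustrative assumptions; only the partition name and the 28 cores per node come from this FAQ:
<source lang="bash">
#!/bin/bash
#SBATCH --partition=multiple_e
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28

cores_per_node=28
# SLURM_JOB_NUM_NODES is set by Slurm inside a job; fall back to 2 outside of one.
nodes="${SLURM_JOB_NUM_NODES:-2}"
total_tasks=$((nodes * cores_per_node))
echo "running ${total_tasks} MPI tasks on ${nodes} nodes"
</source>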

----
[[Category:bwUniCluster]]

Category:BwUniCluster 2.0

2020-11-10T12:35:49Z

S Raffeiner:

{| style="width: 100%; border-spacing: 5px;"
| style="text-align:center; color:#000;vertical-align:middle;font-size:75%;" |
[[File:BwUniCluster_2.0_Feb2020.jpg|center|border|550px|Close-up of bwUniCluster by Simon Raffeiner, Copyright: KIT (SCC)]]
|-
| style="text-align:center; color:#000;vertical-align:middle;" |Close-up of bwUniCluster © KIT (Simon Raffeiner/SCC)
|}

On 17.03.2020, the Steinbuch Centre for Computing (SCC) at Karlsruhe Institute of Technology (KIT) commissioned a new parallel computer system called "bwUniCluster 2.0+GFB-HPC" as a state service within the bwHPC framework. The bwUniCluster 2.0 replaces the predecessor system [[bwUniCluster]] and also includes the additional compute nodes which were procured as an extension to the bwUniCluster in November 2016.

The modern bwUniCluster 2.0 system consists of more than 840 SMP nodes with 64-bit Intel Xeon processors. It provides the universities of the state of Baden-Württemberg with general compute resources and can be used free of charge by the staff of all universities in Baden-Württemberg. Users who currently have access to bwUniCluster will automatically also have access to bwUniCluster 2.0. There is no need to apply for new entitlements or to re-register.


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:100%; border:1px solid #BBBBBB; background:lightyellow; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{yellow}}| MPI/Software issues
|-
|
After the [[BwUniCluster_2.0_Maintenance/2020-10|last maintenance]] there are currently some issues with Intel MPI and the default settings of a few software modules offered on the cluster. Further information can be found [[BwUniCluster_2.0_Maintenance/2020-10/Software Issues|here]].
|}
|}


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:100%; border:1px solid #BBBBBB; background:#fff5fa; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Red}}| New security measures
|-
|
On 13.08.2020 at 10 AM the following changes to the security policies will take effect:

* For authentication, the use of a second factor (2-factor authentication) in addition to the service password will be mandatory. [[BwUniCluster 2.0 User Access/2FA Tokens|You can find the user documentation for this function here]].

* The use of SSH keys will be possible again. However, these can no longer be managed via the authorized_keys files, but only centrally via bwIDM. [[BwUniCluster 2.0 User Access/SSH Keys|You can find the user documentation for this function here]].

The following restrictions still apply:

* Access is limited to IP addresses from within the campus networks of the respective home institutions of our current users. If you are outside of one of these networks (e.g. in your home office), a VPN connection to your home institution has to be established first (see e.g. [https://www.scc.kit.edu/dienste/openvpn.php] for the KIT).
|}
|}


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:50%; border:1px solid #BBBBBB; background:#f5fffa; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Green}}| Access
|-
|
* bwUniCluster [[BwUniCluster_2.0_User_Access|Registration and Login]]
* Registration [[bwUniCluster 2.0 Support|trouble issues]] & [[BwUniCluster_2.0_User_Access#Deregistration|Deregistration]]
* [[First_Steps_on_bwHPC_cluster|First steps on bwUniCluster]]
* [[Jupyter_at_SCC|Access with Jupyter]]
 
|-
|{{Green}}| Software
|-
|
* [[bwUniCluster_2.0_Software|Software and Environment Modules]]
|}
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Green}}| Hardware
|-
|
* [[bwUniCluster_2.0_Hardware_and_Architecture|Hardware and Architecture]]
* [[BwUniCluster_2.0_Hardware_and_Architecture#File_Systems|File Systems]]
|}

| style="padding:2px;" |

| style="width:50%; border:1px solid #BBBBBB; background:#f5faff; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Blue}}| Batch/Compute Jobs
|-
|
* [[bwUniCluster_2.0_Slurm_common_Features|Slurm common Features]]
* [[BwUniCluster_2.0_Batch_Queues|Batch Queues and interactive Jobs]]
|-
|{{Blue}}| [[BwHPC_Best_Practices_Repository|bwHPC Best Practice Guides]] / FAQs
|-
|




* [[FAQ - bwUniCluster_broadwell_partition|FAQ - bwUniCluster 2.0 Broadwell partition]]
|-
|{{Blue}}| Miscellaneous
|-
|
* [[bwUniCluster_Acknowledgement|Acknowledgement]] of work performed on bwUniCluster (2.0)
* [[BwUniCluster_2.0_File_System_Migration_Guide|File system migration guide]] and [[BwUniCluster_2.0_Batch_System_Migration_Guide|Batch system migration guide]] for users migrating from the former bwUniCluster 1
|}
|}

 
-----
 
 
[[Category:bwHPC_infrastructure]][[Category:bwHPC_Cluster]][[Category:bwCluster]]

Category:BwUniCluster 2.0

2020-11-10T09:09:36Z

S Raffeiner:

{| style="width: 100%; border-spacing: 5px;"
| style="text-align:center; color:#000;vertical-align:middle;font-size:75%;" |
[[File:BwUniCluster_2.0_Feb2020.jpg|center|border|550px|Close-up of bwUniCluster by Simon Raffeiner, Copyright: KIT (SCC)]]
|-
| style="text-align:center; color:#000;vertical-align:middle;" |Close-up of bwUniCluster © KIT (Simon Raffeiner/SCC)
|}

On 17.03.2020, the Steinbuch Centre for Computing (SCC) at Karlsruhe Institute of Technology (KIT) commissioned a new parallel computer system called "bwUniCluster 2.0+GFB-HPC" as a state service within the bwHPC framework. The bwUniCluster 2.0 replaces the predecessor system [[bwUniCluster]] and also includes the additional compute nodes which were procured as an extension to the bwUniCluster in November 2016.

The modern bwUniCluster 2.0 system consists of more than 840 SMP nodes with 64-bit Intel Xeon processors. It provides the universities of the state of Baden-Württemberg with general compute resources and can be used free of charge by the staff of all universities in Baden-Württemberg. Users who currently have access to bwUniCluster will automatically also have access to bwUniCluster 2.0. There is no need to apply for new entitlements or to re-register.


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:100%; border:1px solid #BBBBBB; background:lightyellow; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{yellow}}| MPI/Software issues
|-
|
After the [[BwUniCluster_2.0_Maintenance/2020-10|last maintenance]] there are currently some issues with Intel MPI and the default settings of a few software modules offered on the cluster. Further information can be found [[BwUniCluster_2.0_Maintenance/2020-10/Software Issues|here]].
|}
|}


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:100%; border:1px solid #BBBBBB; background:#fff5fa; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Red}}| New security measures
|-
|
On 13.08.2020 at 10 AM the following changes to the security policies will take effect:

* For authentication, the use of a second factor (2-factor authentication) in addition to the service password will be mandatory. [[BwUniCluster 2.0 User Access/2FA Tokens|You can find the user documentation for this function here]].

* The use of SSH keys will be possible again. However, these can no longer be managed via the authorized_keys files, but only centrally via bwIDM. [[BwUniCluster 2.0 User Access/SSH Keys|You can find the user documentation for this function here]].

The following restrictions still apply:

* Access is limited to IP addresses from within the campus networks of the respective home institutions of our current users. If you are outside of one of these networks (e.g. in your home office), a VPN connection to your home institution has to be established first (see e.g. [https://www.scc.kit.edu/dienste/openvpn.php] for the KIT).
|}
|}


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:50%; border:1px solid #BBBBBB; background:#f5fffa; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Green}}| Access
|-
|
* bwUniCluster [[BwUniCluster_2.0_User_Access|Registration and Login]]
* Registration [[bwUniCluster 2.0 Support|trouble issues]] & [[BwUniCluster_2.0_User_Access#Deregistration|Deregistration]]
* [[First_Steps_on_bwHPC_cluster|First steps on bwUniCluster]]
* [[Jupyter_at_SCC|Access with Jupyter]]
 
|-
|{{Green}}| Software
|-
|
* [[bwUniCluster_2.0_Software|Software and Environment Modules]]
* [[Jupyter at SCC|Jupyter]]
|}
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Green}}| Hardware
|-
|
* [[bwUniCluster_2.0_Hardware_and_Architecture|Hardware and Architecture]]
* [[BwUniCluster_2.0_Hardware_and_Architecture#File_Systems|File Systems]]
|}

| style="padding:2px;" |

| style="width:50%; border:1px solid #BBBBBB; background:#f5faff; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Blue}}| Batch/Compute Jobs
|-
|
* [[bwUniCluster_2.0_Slurm_common_Features|Slurm common Features]]
* [[BwUniCluster_2.0_Batch_Queues|Batch Queues and interactive Jobs]]
|-
|{{Blue}}| [[BwHPC_Best_Practices_Repository|bwHPC Best Practice Guides]] / FAQs
|-
|




* [[FAQ - bwUniCluster_broadwell_partition|FAQ - bwUniCluster 2.0 Broadwell partition]]
|-
|{{Blue}}| Miscellaneous
|-
|
* [[bwUniCluster_Acknowledgement|Acknowledgement]] of work performed on bwUniCluster (2.0)
* [[BwUniCluster_2.0_File_System_Migration_Guide|File system migration guide]] and [[BwUniCluster_2.0_Batch_System_Migration_Guide|Batch system migration guide]] for users migrating from the former bwUniCluster 1
|}
|}

 
-----
 
 
[[Category:bwHPC_infrastructure]][[Category:bwHPC_Cluster]][[Category:bwCluster]]

BwUniCluster 2.0 Jupyter

2020-11-10T09:08:24Z

S Raffeiner: Changed redirect target from Jupyter am SCC to Jupyter at SCC

#REDIRECT [[Jupyter at SCC]]

[[Category:bwUniCluster 2.0]]

BwUniCluster 2.0 Jupyter

2020-11-10T09:08:04Z

S Raffeiner:

#REDIRECT [[Jupyter am SCC]]

[[Category:bwUniCluster 2.0]]

Category:BwUniCluster 2.0

2020-10-27T10:58:58Z

S Raffeiner:

{| style="width: 100%; border-spacing: 5px;"
| style="text-align:center; color:#000;vertical-align:middle;font-size:75%;" |
[[File:BwUniCluster_2.0_Feb2020.jpg|center|border|550px|Close-up of bwUniCluster by Simon Raffeiner, Copyright: KIT (SCC)]]
|-
| style="text-align:center; color:#000;vertical-align:middle;" |Close-up of bwUniCluster © KIT (Simon Raffeiner/SCC)
|}

On 17.03.2020, the Steinbuch Centre for Computing (SCC) at Karlsruhe Institute of Technology (KIT) commissioned a new parallel computer system called "bwUniCluster 2.0+GFB-HPC" as a state service within the bwHPC framework. The bwUniCluster 2.0 replaces the predecessor system [[bwUniCluster]] and also includes the additional compute nodes which were procured as an extension to the bwUniCluster in November 2016.

The modern bwUniCluster 2.0 system consists of more than 840 SMP nodes with 64-bit Intel Xeon processors. It provides the universities of the state of Baden-Württemberg with general compute resources and can be used free of charge by the staff of all universities in Baden-Württemberg. Users who currently have access to bwUniCluster will automatically also have access to bwUniCluster 2.0. There is no need to apply for new entitlements or to re-register.


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:100%; border:1px solid #BBBBBB; background:lightyellow; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{yellow}}| MPI/Software issues
|-
|
After the [[BwUniCluster_2.0_Maintenance/2020-10|last maintenance]] there are currently some issues with Intel MPI and the default settings of a few software modules offered on the cluster. Further information can be found [[BwUniCluster_2.0_Maintenance/2020-10/Software Issues|here]].
|}
|}


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:100%; border:1px solid #BBBBBB; background:#fff5fa; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Red}}| New security measures
|-
|
On 13.08.2020 at 10 AM the following changes to the security policies will take effect:

* For authentication, the use of a second factor (2-factor authentication) in addition to the service password will be mandatory. [[BwUniCluster 2.0 User Access/2FA Tokens|You can find the user documentation for this function here]].

* The use of SSH keys will be possible again. However, these can no longer be managed via the authorized_keys files, but only centrally via bwIDM. [[BwUniCluster 2.0 User Access/SSH Keys|You can find the user documentation for this function here]].

The following restrictions still apply:

* Access is limited to IP addresses from within the campus networks of the respective home institutions of our current users. If you are outside of one of these networks (e.g. in your home office), a VPN connection to your home institution has to be established first (see e.g. [https://www.scc.kit.edu/dienste/openvpn.php] for the KIT).
|}
|}


{| style="width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;"
| style="width:50%; border:1px solid #BBBBBB; background:#f5fffa; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Green}}| Access
|-
|
* bwUniCluster [[BwUniCluster_2.0_User_Access|Registration and Login]]
* Registration [[bwUniCluster 2.0 Support|trouble issues]] & [[BwUniCluster_2.0_User_Access#Deregistration|Deregistration]]
* [[First_Steps_on_bwHPC_cluster|First steps on bwUniCluster]]
* [[Jupyter_at_SCC|Access with Jupyter]]
 
|-
|{{Green}}| Software
|-
|
* [[bwUniCluster_2.0_Software|Software and Environment Modules]]
|}
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Green}}| Hardware
|-
|
* [[bwUniCluster_2.0_Hardware_and_Architecture|Hardware and Architecture]]
* [[BwUniCluster_2.0_Hardware_and_Architecture#File_Systems|File Systems]]
|}

| style="padding:2px;" |

| style="width:50%; border:1px solid #BBBBBB; background:#f5faff; vertical-align:top; color:#000;" |
{| style="width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;" |
|-
|{{Blue}}| Batch/Compute Jobs
|-
|
* [[bwUniCluster_2.0_Slurm_common_Features|Slurm common Features]]
* [[BwUniCluster_2.0_Batch_Queues|Batch Queues and interactive Jobs]]
|-
|{{Blue}}| [[BwHPC_Best_Practices_Repository|bwHPC Best Practice Guides]] / FAQs
|-
|




* [[FAQ - bwUniCluster_broadwell_partition|FAQ - bwUniCluster 2.0 Broadwell partition]]
|-
|{{Blue}}| Miscellaneous
|-
|
* [[bwUniCluster_Acknowledgement|Acknowledgement]] of work performed on bwUniCluster (2.0)
* [[BwUniCluster_2.0_File_System_Migration_Guide|File system migration guide]] and [[BwUniCluster_2.0_Batch_System_Migration_Guide|Batch system migration guide]] for users migrating from the former bwUniCluster 1
|}
|}

 
-----
 
 
[[Category:bwHPC_infrastructure]][[Category:bwHPC_Cluster]][[Category:bwCluster]]

BwUniCluster 2.0 Maintenance/2020-10/Software Issues

2020-10-27T10:18:27Z

S Raffeiner: /* Software modules without known fixes */

After the last regular [[BwUniCluster_2.0_Maintenance/2020-10|maintenance]] interval (from 06.10.2020 to 13.10.2020) the following issues with Intel MPI exist:

* Intel MPI 2018 is incompatible with Red Hat 8.2. Any invocation, even a simple "Hello World" MPI program, will result in a crash. The ''mpi/impi/2018'' module has therefore been removed.

* There is a bug in Intel MPI 2019.x which leads to crashes when multiple MPI applications which are linked against Intel MPI 2019.x are run on the same node (e.g. in the "single" partition). The first application will run normally, but all others will crash. This can be fixed by setting the environment variable '''I_MPI_HYDRA_TOPOLIB="ipl"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable.

* There is a bug in Intel MPI 2019.x which leads to incorrect CPU binding/affinity in conjunction with the Slurm batch system used on the clusters. All MPI ranks will run on the same CPU core instead of being bound to all available CPU cores. This can be fixed by setting the environment variable '''I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--cpu-bind=none"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable.
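
Applications that use Intel MPI 2019.x without loading one of these modules may not pick up the settings automatically. A sketch of setting both variables by hand at the top of a job script, using exactly the values listed above:
<source lang="bash">
export I_MPI_HYDRA_TOPOLIB="ipl"
export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--cpu-bind=none"
# Show what an MPI launcher started from this shell would inherit.
env | grep '^I_MPI_HYDRA_'
</source>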

A number of third-party software modules installed on the cluster come with their own copies of various Intel MPI library versions. These software modules fall into the following categories:

== Corrected software modules ==

The following software modules have been corrected by the HPC software maintainers. They should currently work as expected.

* ''StarCCM+'': The included Intel MPI 2018 library was replaced with a more recent version.

* ''LS-DYNA'': The included Intel MPI library was replaced with a more recent version.

* ''CST'': The license does not allow multi-node jobs, so the problematic code paths cannot be used.

== Software modules with known fixes ==

The following software modules require additional user interaction to work:

* ''ANSYS Mechanical'' and ''Fluent'': The software has to be switched to OpenMPI using the '''-mpi=openmpi''' command line argument.

* ''ANSYS CFX'': The software has to be switched to OpenMPI using the '''-start-method 'Open MPI Distributed Parallel' ''' command line argument.

== Software modules without known fixes ==

For the following software modules there is currently no known fix:

* ''Abaqus'': Comes with Intel MPI 2017, for which there is currently no known fix. We are working on a solution. The ''cae/abaqus/2019'' software module will not be removed because it can still be used for pre-/post-processing and single-node parallelisation using e.g. OpenMP.

BwUniCluster 2.0 Maintenance/2020-10/Software Issues

2020-10-27T10:17:21Z

S Raffeiner: /* Corrected software modules */

After the last regular [[BwUniCluster_2.0_Maintenance/2020-10|maintenance]] interval (from 06.10.2020 to 13.10.2020) the following issues with Intel MPI exist:

* Intel MPI 2018 is incompatible with Red Hat 8.2. Any invocation, even a simple "Hello World" MPI program, will result in a crash. The ''mpi/impi/2018'' module has therefore been removed.

* There is a bug in Intel MPI 2019.x which leads to crashes when multiple MPI applications which are linked against Intel MPI 2019.x are run on the same node (e.g. in the "single" partition). The first application will run normally, but all others will crash. This can be fixed by setting the environment variable '''I_MPI_HYDRA_TOPOLIB="ipl"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable.

* There is a bug in Intel MPI 2019.x which leads to incorrect CPU binding/affinity in conjunction with the Slurm batch system used on the clusters. All MPI ranks will run on the same CPU core instead of being bound to all available CPU cores. This can be fixed by setting the environment variable '''I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--cpu-bind=none"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable.

A number of third-party software modules installed on the cluster come with their own copies of various Intel MPI library versions. These software modules fall into the following categories:

== Corrected software modules ==

The following software modules have been corrected by the HPC software maintainers. They should currently work as expected.

* ''StarCCM+'': The included Intel MPI 2018 library was replaced with a more recent version.

* ''LS-DYNA'': The included Intel MPI library was replaced with a more recent version.

* ''CST'': The license does not allow multi-node jobs, so the problematic code paths cannot be used.

== Software modules with known fixes ==

The following software modules require additional user interaction to work:

* ''ANSYS Mechanical'' and ''Fluent'': The software has to be switched to OpenMPI using the '''-mpi=openmpi''' command line argument.

* ''ANSYS CFX'': The software has to be switched to OpenMPI using the '''-start-method 'Open MPI Distributed Parallel' ''' command line argument.

== Software modules without known fixes ==

For the following software modules there is currently no known fix:

* ''cae/abaqus/2019'' (comes with Intel MPI 2017). We are working on a solution.

Non-working software modules will not be removed because they can still be used for pre-/post-processing and single-node parallelisation using e.g. OpenMP.

BwUniCluster 2.0 Maintenance/2020-10/Software Issues

2020-10-27T10:13:00Z

S Raffeiner:

After the last regular [[BwUniCluster_2.0_Maintenance/2020-10|maintenance]] interval (from 06.10.2020 to 13.10.2020) the following issues with Intel MPI exist:

* Intel MPI 2018 is incompatible with Red Hat 8.2. Any invocation, even a simple "Hello World" MPI program, will result in a crash. The ''mpi/impi/2018'' module has therefore been removed.

* There is a bug in Intel MPI 2019.x which leads to crashes when multiple MPI applications which are linked against Intel MPI 2019.x are run on the same node (e.g. in the "single" partition). The first application will run normally, but all others will crash. This can be fixed by setting the environment variable '''I_MPI_HYDRA_TOPOLIB="ipl"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable.

* There is a bug in Intel MPI 2019.x which leads to incorrect CPU binding/affinity in conjunction with the Slurm batch system used on the clusters. All MPI ranks will run on the same CPU core instead of being bound to all available CPU cores. This can be fixed by setting the environment variable '''I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--cpu-bind=none"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable.

A number of third-party software modules installed on the cluster come with their own copies of various Intel MPI library versions. These software modules fall into the following categories:

== Corrected software modules ==

The following software modules have been corrected by the HPC software maintainers. They should currently work as expected.

* StarCCM+: The included Intel MPI 2018 library was replaced with a more recent version.

* LS-DYNA: The included Intel MPI library was replaced with a more recent version.

* CST: The license does not allow multi-node jobs, so the problematic code paths cannot be used.

== Software modules with known fixes ==

The following software modules require additional user interaction to work:

* ''ANSYS Mechanical'' and ''Fluent'': The software has to be switched to OpenMPI using the '''-mpi=openmpi''' command line argument.

* ''ANSYS CFX'': The software has to be switched to OpenMPI using the '''-start-method 'Open MPI Distributed Parallel' ''' command line argument.

== Software modules without known fixes ==

For the following software modules there is currently no known fix:

* ''cae/abaqus/2019'' (comes with Intel MPI 2017). We are working on a solution.

Non-working software modules will not be removed because they can still be used for pre-/post-processing and single-node parallelisation using e.g. OpenMP.

BwUniCluster 2.0 Maintenance/2020-10/Software Issues

2020-10-27T10:11:02Z

S Raffeiner:

After the last regular [[BwUniCluster_2.0_Maintenance/2020-10|maintenance]] interval (from 06.10.2020 to 13.10.2020) the following issues with Intel MPI exist:

* Intel MPI 2018 is incompatible with Red Hat 8.2. Any invocation, even a simple "Hello World" MPI program, will result in a crash. The ''mpi/impi/2018'' module has therefore been removed.

* There is a bug in Intel MPI 2019.x which leads to crashes when multiple MPI applications which are linked against Intel MPI 2019.x are run on the same node (e.g. in the "single" partition). The first application will run normally, but all others will crash. This can be fixed by setting the environment variable '''I_MPI_HYDRA_TOPOLIB="ipl"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable.

* There is a bug in Intel MPI 2019.x which leads to incorrect CPU binding/affinity in conjunction with the Slurm batch system used on the clusters. All MPI ranks will run on the same CPU core instead of being bound to all available CPU cores. This can be fixed by setting the environment variable '''I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--cpu-bind=none"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable.

A number of third-party software modules installed on the cluster come with their own copies of various Intel MPI library versions. These software modules fall into the following categories:

== Corrected software modules ==

The following software modules have been corrected by the HPC software maintainers. They should currently work as expected.

* StarCCM+: The included Intel MPI 2018 library was replaced with a more recent version.

* LS-DYNA: The included Intel MPI library was replaced with a more recent version.

* CST: The license does not allow multi-node jobs, so the problematic code paths cannot be used.

== Software modules with known fixes ==

The following software modules require additional user interaction to work:

* ''ANSYS Mechanical'' and ''Fluent'': The software has to be switched to OpenMPI using the '''-mpi=openmpi''' command line argument.

* ''ANSYS CFX'': The software has to be switched to OpenMPI using the '''-start-method 'Open MPI Distributed Parallel' ''' command line argument.

== Software modules without known fixes ==

For the following software modules there is currently no known fix:

* ''cae/abaqus/2019'' (comes with Intel MPI 2017). We are working on a solution.

Non-working software modules will not be removed because they can still be used for pre-/post-processing and single-node parallelisation using e.g. OpenMP.

BwUniCluster 2.0 Maintenance/2020-10/Software Issues

2020-10-27T10:10:37Z

S Raffeiner:

After the last regular [[BwUniCluster_2.0_Maintenance/2020-10|maintenance]] interval (from 06.10.2020 to 13.10.2020) the following issues with Intel MPI exist:

* Intel MPI 2018 is incompatible with Red Hat 8.2. Any invocation, even a simple "Hello World" MPI program, will result in a crash. The ''mpi/impi/2018'' module has therefore been removed.

* There is a bug in Intel MPI 2019.x which leads to crashes when multiple MPI applications which are linked against Intel MPI 2019.x are run on the same node (e.g. in the "single" partition). The first application will run normally, but all others will crash. This can be fixed by setting the environment variable '''I_MPI_HYDRA_TOPOLIB="ipl"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable.

* There is a bug in Intel MPI 2019.x which leads to incorrect CPU binding/affinity in conjunction with the Slurm batch system used on the clusters. All MPI ranks will run on the same CPU core instead of being bound to all available CPU cores. This can be fixed by setting the environment variable '''I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--cpu-bind=none"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable.

A number of third-party software modules installed on the cluster come with their own copies of various Intel MPI library versions. These software modules fall into the following categories:

== Corrected software modules ==

The following software modules have been corrected by the HPC software maintainers. They should currently work as expected.

* StarCCM+: The included Intel MPI 2018 library was replaced with a more recent version.

* LS-DYNA: The included Intel MPI library was replaced with a more recent version.

* CST: The license does not allow multi-node jobs, so the problematic code paths cannot be used.

== Software modules with known fixes ==

The following software modules require additional user interaction to work:

* ''ANSYS Mechanical'' and ''Fluent'': The software has to be switched to OpenMPI using the '''-mpi=openmpi''' command line argument.

* ''ANSYS CFX'': The software has to be switched to OpenMPI using the '''-start-method 'Open MPI Distributed Parallel' ''' command line argument.

== Software modules without known fixes ==

For the following software modules there is currently no known fix:

* ''cae/abaqus/2019'' (comes with Intel MPI 2017). We are working on a solution.

Non-working software modules will not be removed because they can still be used for pre-/post-processing and single-node parallelisation using e.g. OpenMP.