BwUniCluster3.0/Hardware and Architecture: Difference between revisions
Latest revision as of 11:58, 17 January 2025
This page is a work in progress.
Architecture of bwUniCluster 3.0
The bwUniCluster 3.0 is a parallel computer with distributed memory. It consists of the newly procured bwUniCluster 3.0 components and also includes the additional compute nodes which were procured as an extension to the bwUniCluster 2.0 in 2022.
Each node of the system consists of two Intel Xeon or AMD EPYC processors, local memory, local storage, network adapters and optional accelerators (NVIDIA A100 and H100, AMD Instinct MI300A). All nodes are connected via a fast InfiniBand interconnect.
The parallel file system (Lustre) is connected to the InfiniBand switch of the compute cluster. This provides a fast and scalable parallel file system.
The operating system on each node is Red Hat Enterprise Linux (RHEL) 9.4.
The individual nodes of the system act in different roles. From an end user's point of view, the nodes fall into two groups: login nodes and compute nodes. File server nodes and administrative server nodes are not accessible to users.
Login Nodes
The login nodes are the only nodes directly accessible by end users. These nodes are used for interactive login, file management, program development, and interactive pre- and post-processing. Two nodes are dedicated to this service, and both can be reached through a single address: a DNS round-robin alias distributes login sessions across the login nodes.
To prevent login nodes from being used for activities that are not permitted there and that affect the user experience of other users, long-running and/or compute-intensive tasks are periodically terminated without any prior warning. Please refer to Allowed Activities on Login Nodes.
Compute Node
The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).
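As an illustration of what such a job looks like, here is a minimal job script sketch. It assumes that the queue name `cpu` from Table 1 is selected with `--partition`; the job name, task count and time limit are placeholders, and the helper `describe_job` is made up for this example.

```shell
#!/bin/bash
# Minimal job script sketch. All #SBATCH values are placeholders;
# "cpu" is one of the queue names listed in Table 1.
#SBATCH --job-name=example
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00

# Slurm sets these variables inside a job. The defaults only make the
# script runnable outside of Slurm for a quick check.
job_id="${SLURM_JOB_ID:-none}"
ntasks="${SLURM_NTASKS:-1}"

# Print a one-line summary of where the job is running.
describe_job() {
    echo "job=${job_id} tasks=${ntasks} host=$(uname -n)"
}

describe_job
```

Saved as, say, `job.sh` (a hypothetical file name), the script would be submitted with `sbatch job.sh`. Slurm reads the `#SBATCH` lines as options, while plain bash treats them as comments.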
Compute Resources
Login nodes
Any compute job running on the login nodes will be terminated without any notice.
Compute nodes
CPU nodes
- Ice Lake: From UC2e
- Standard
- High Memory
GPU nodes
- NVIDIA GPU x4
- AMD GPU x4
- Ice Lake NVIDIA GPU x4
Property | CPU nodes Ice Lake | CPU nodes Standard | CPU nodes High Memory | GPU nodes NVIDIA GPU x4 | GPU node AMD GPU x4 | GPU nodes Ice Lake NVIDIA GPU x4 | Login nodes |
---|---|---|---|---|---|---|---|
Availability in queues | cpu_il, dev_cpu_il | cpu, dev_cpu | highmem, dev_highmem | gpu_h100, dev_gpu_h100 | gpu_mi300 | gpu_a100_il / gpu_h100_il | - |
Number of nodes | 272 | 70 | 4 | 12 | 1 | 15 | 2 |
Processors | Intel Xeon Platinum 8358 | AMD EPYC 9454 | AMD EPYC 9454 | AMD EPYC 9454 | AMD Zen 4 | Intel Xeon Platinum 8358 | AMD EPYC 9454 |
Number of sockets | 2 | 2 | 2 | 2 | 4 | 2 | 2 |
Processor frequency (GHz) | 2.6 | 2.75 | 2.75 | 2.75 | 3.7 | 2.6 | 2.75 |
Total number of cores | 64 | 96 | 96 | 96 | 96 (4x 24) | 64 | 96 |
Main memory | 256 GB | 384 GB | 2.3 TB | 768 GB | 4x 128 GB HBM3 | 512 GB | 384 GB |
Local SSD | 1.8 TB NVMe | 3.84 TB NVMe | 15.36 TB NVMe | 15.36 TB NVMe | 7.68 TB NVMe | 6.4 TB NVMe | 7.68 TB SATA SSD |
Accelerators | - | - | - | 4x NVIDIA H100 | 4x AMD Instinct MI300A | 4x NVIDIA A100 / H100 | - |
Accelerator memory | - | - | - | 94 GB | APU | 80 GB / 94 GB | - |
Interconnect | IB HDR200 | IB 2x NDR200 | IB 2x NDR200 | IB 4x NDR200 | IB 2x NDR200 | IB 2x HDR200 | IB 1x NDR200 |
Table 1: Hardware overview and properties
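The figures in Table 1 can be turned into a rough per-core memory budget when sizing a job. The following sketch is plain arithmetic on the table values. The helper name `mem_per_core_mb` is made up for illustration, 1 GB is treated as 1024 MB, and the memory actually available to jobs is typically somewhat lower than the installed amount.

```shell
#!/bin/bash
# Rough per-core memory for a node type, using integer shell arithmetic.
# Arguments: installed main memory in GB, total number of cores.
mem_per_core_mb() {
    local mem_gb="$1" cores="$2"
    echo $(( mem_gb * 1024 / cores ))
}

# Values taken from Table 1.
echo "CPU nodes Ice Lake: $(mem_per_core_mb 256 64) MB per core"
echo "CPU nodes Standard: $(mem_per_core_mb 384 96) MB per core"
echo "GPU nodes NVIDIA GPU x4: $(mem_per_core_mb 768 96) MB per core"
```

With these inputs both CPU node types come out at 4096 MB per core and the H100 GPU nodes at 8192 MB per core.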
File Systems
- $HOME
- Workspaces
- $TMPDIR
- LSDF Online Storage
- BeeOND (BeeGFS On-Demand)
On bwUniCluster 3.0 the parallel file system Lustre is used for most globally visible user data. An initial HOME directory on a dedicated Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another dedicated Lustre file system for non-permanent data with a limited lifetime. An additional workspace file system based on flash storage is available for special requirements.
Within a batch job further file systems are available:
- The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.
- On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.
- On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.
Which file system to use?
In general, you should separate your data and store each kind on the appropriate file system:
- Permanently needed data such as software or important results should be stored below $HOME, but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME, there is a chance that we can restore it from backup.
- Permanent data which is not needed for months or which exceeds the capacity restrictions should be moved to the LSDF Online Storage or to the archive and deleted from the cluster file systems.
- Temporary data which is only needed on a single node and which does not exceed the disk space shown in Table 1 should be stored below $TMPDIR.
- Data which is read many times on a single node, e.g. for AI training, should be copied to $TMPDIR and read from there.
- Temporary data which is used from many nodes of your batch job and which is only needed during job runtime should be stored on a parallel on-demand file system.
- Temporary data which can be recomputed, or which is the result of one job and the input of another, should be stored in workspaces. The lifetime of data in workspaces is limited by the lifetime of the workspace, which can be several months.
For further details please check the chapters below.
$HOME
The $HOME directories of bwUniCluster 3.0 users are located in the parallel file system Lustre. You have access to your $HOME directory from all nodes of UC3. These directories are backed up to a tape archive automatically at regular intervals. The directory $HOME is intended for files that are needed permanently, such as source code, configuration files and executable programs.
Workspaces
On UC3 workspaces should be used to store large non-permanent data sets, e.g. restart files or output data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is designed for parallel access and for high throughput to large files. It provides data transfer rates of up to 54 GB/s for both writing and reading when data is accessed in parallel.
On UC3 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user.
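To get a feeling for how close a directory tree comes to the inode limit, its files and directories can be counted. This is only a sketch: the helper names and the example path are hypothetical, the quota applies to all of a user's data on the file system rather than to one directory, and walking a very large tree puts load on the file system, so it should be done sparingly.

```shell
#!/bin/bash
# Count files and directories below a path and compare the result with
# the default inode limit stated above (30 million).
INODE_LIMIT=30000000

# Print the number of entries below $1, including $1 itself.
count_entries() {
    find "$1" -print | wc -l | tr -d ' '
}

# Print the share of the limit used by $1 entries, in whole percent.
percent_of_limit() {
    echo $(( $1 * 100 / INODE_LIMIT ))
}

# Example call (the directory is a placeholder):
# n="$(count_entries /path/to/workspace)"
# echo "$n entries, $(percent_of_limit "$n")% of the default limit"
```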
Detailed information on Workspaces
$TMPDIR
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means that different tasks of a parallel application use different directories when they do not utilize the same node. Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the content of this directory path on these nodes is different.
This directory should be used for temporary files being accessed from the local node during job runtime. It should also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.
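The staging pattern described above (copy once at the start of the job, then read locally) might look like the following sketch. The function name `stage_to_tmpdir`, the data set path and the program name are hypothetical, and the fallback to /tmp exists only so the snippet can be tried outside a batch job.

```shell
#!/bin/bash
# Sketch: stage a read-heavy data set to node-local storage once,
# then read it from there. All paths are placeholders.

# Copy the directory $1 into $TMPDIR (or /tmp outside a job)
# and print the location of the local copy.
stage_to_tmpdir() {
    local src="$1"
    local dest="${TMPDIR:-/tmp}/$(basename "$src")"
    mkdir -p "$dest"
    # The trailing "/." copies the directory contents, including dotfiles.
    cp -a "$src/." "$dest/"
    echo "$dest"
}

# Example use inside a job script (hypothetical path and program):
# local_data="$(stage_to_tmpdir /path/to/workspace/dataset)"
# my_program --input "$local_data"
```

Because $TMPDIR is a different physical directory on every node, a multi-node job would have to run the copy once per node rather than once per job.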
Detailed information on $TMPDIR
LSDF Online Storage
In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. For this reason the LSDF Online Storage is mounted on the login and data mover nodes. It can also be used on the compute nodes during job runtime by requesting the constraint flag "LSDF" (Slurm common features). An example of LSDF usage in a batch job is also available: Slurm LSDF example.
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=LSDF
For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
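A job script can check that these variables are set before it touches the storage, so that a missing constraint shows up early. The sketch below is an illustration only: the function name `lsdf_env_ready` is made up, and it tests nothing beyond the three variable names listed above.

```shell
#!/bin/bash
#SBATCH --constraint=LSDF

# Succeed only if all three LSDF variables are set and non-empty.
lsdf_env_ready() {
    [ -n "${LSDF:-}" ] && [ -n "${LSDFPROJECTS:-}" ] && [ -n "${LSDFHOME:-}" ]
}

if lsdf_env_ready; then
    echo "LSDF storage variables are set"
else
    echo "LSDF storage variables are not set" >&2
fi
```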
Please request storage projects in the LSDF Online Storage separately: LSDF Storage Request.
BeeOND (BeeGFS On-Demand)
Users of bwUniCluster 3.0 can request a private BeeOND (BeeGFS On-Demand) parallel file system for each job. The file system is created during job startup and purged after the job ends.
- IMPORTANT: All data on the private file system will be deleted after your job. Make sure you copy your data back to a global file system, e.g. $HOME or a workspace, within the job.
BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out.
For detailed usage see here: Request on-demand file system
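Since everything on the private file system is deleted when the job ends, the copy back to permanent storage has to be part of the job itself. A minimal sketch with a hypothetical helper `save_results` follows. Both directories are placeholders, and the actual location of the on-demand file system is not given here, so it appears only as the variable `SCRATCH_RESULTS`, which is an assumption of this example.

```shell
#!/bin/bash
# Sketch: copy results from a job-private file system back to permanent
# storage before the job ends. Both directories are placeholders.

# Copy everything below $1 into $2, creating $2 if needed.
save_results() {
    local scratch="$1" target="$2"
    mkdir -p "$target"
    cp -a "$scratch/." "$target/"
}

# In a job script, registering the copy as an EXIT handler runs it even
# if an earlier command fails (SCRATCH_RESULTS is a hypothetical variable):
# trap 'save_results "$SCRATCH_RESULTS" "$HOME/results/$SLURM_JOB_ID"' EXIT
```

The handler is a convenience and not a guarantee: if the job is killed before the copy finishes, for example at its time limit, data may still be lost.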
Backup and Archiving
All data in the home directories is backed up regularly; ACLs and extended attributes are not included in the backup.
Please open a ticket if you need data restored from backup.