BwUniCluster2.0/Hardware and Architecture: Difference between revisions

From bwHPC Wiki

Revision as of 18:40, 13 March 2020

Architecture of bwUniCluster 2.0

The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100). All nodes are connected by a fast InfiniBand 4X FDR interconnect. In addition, the Lustre file system is attached to bwUniCluster (uc1) by coupling the InfiniBand of the file server with the InfiniBand switch of the compute cluster, which provides a fast and scalable parallel file system.

The operating system on each node is Red Hat Enterprise Linux (RHEL) 7.x. A number of additional software packages, such as SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly discussed in this document. Others, which are of greater importance to system administrators, are not covered here.

The individual nodes of the system may act in different roles. The nodes are separated into disjoint groups according to the services they supply. From an end user's point of view, the groups are login nodes, compute nodes, file server nodes and administrative server nodes.

Login Nodes

The login nodes are the only nodes that are directly accessible by end users. These nodes are used for interactive login, file management, program development and interactive pre- and postprocessing. Two nodes are dedicated to this service, but they are all accessible via one address: a DNS round-robin alias distributes the login sessions across the login nodes.

Compute Nodes

The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).
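As a rough illustration of what submitting a job involves, the following sketch renders a minimal SLURM batch script as text. The job name, resource values and the program `./my_program` are hypothetical placeholders, and the script deliberately names no partition, since the available queues are not described here.

```python
def render_job_script(job_name, nodes, tasks_per_node, walltime, command):
    """Return the text of a minimal SLURM batch script.

    All argument values are supplied by the caller; nothing here is
    specific to bwUniCluster 2.0.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={walltime}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: one node, 40 tasks, 30 minutes of wall time.
script = render_job_script("demo", 1, 40, "00:30:00", "srun ./my_program")
print(script)
```

The resulting text can be saved to a file and handed to `sbatch`. The job then waits in the queue until the requested resources become available.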

File Server Nodes

The hardware of the parallel file system Lustre incorporates some file server nodes; the file system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter "File Systems").

Administrative Server Nodes

Some other nodes deliver additional services such as resource management, external network connection and administration. Only system administrators can access these nodes directly.

Components of bwUniCluster

| | Compute nodes "Thin" | Compute nodes "HPC" | Compute nodes "HPC Broadwell" | Compute nodes "Fat" | GPU x4 | GPU x8 | Login |
|---|---|---|---|---|---|---|---|
| Number of nodes | 100 | 360 | 352 | 6 | 14 | 10 | 4 + 2 (Broadwell) |
| Processors | Intel Xeon Gold 6230 | Intel Xeon Gold 6230 | Intel Xeon E5-2660 v4 | Intel Xeon Gold 6230 | Intel Xeon Gold 6230 | Intel Xeon Gold 6248 | |
| Number of sockets | 2 | 2 | 2 | 4 | 2 | 2 | 2 |
| Processor frequency (GHz) | 2.1 | 2.1 | 2.0 | 2.1 | 2.1 | 2.1 | |
| Total number of cores (per node) | 40 | 40 | 28 | 80 | 40 | 40 | 40 / 20 (Broadwell) |
| Main memory | 96 GB | 96 GB | 128 GB | 3 TB | 384 GB | 768 GB | 384 GB / 128 GB (Broadwell) |
| Local disk | 960 GB SATA | 960 GB SATA | 480 GB SATA | 4.8 TB NVMe | 3.2 TB NVMe | 6.4 TB NVMe | |
| Accelerators | - | - | - | - | 4x NVIDIA Tesla V100 | 8x NVIDIA Tesla V100 | |
| Interconnect | IB HDR100 (blocking) | IB HDR100 | IB FDR | IB HDR | IB HDR | IB HDR | IB HDR100 (blocking) |
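The table lists node counts and core counts separately. The sketch below multiplies them to estimate the aggregate number of cores per node type. It assumes that the "Total number of cores" row gives cores per node, and it leaves out the login nodes because they do not run batch jobs.

```python
# (number of nodes, cores per node), copied from the table above.
NODE_TYPES = {
    "Thin": (100, 40),
    "HPC": (360, 40),
    "HPC Broadwell": (352, 28),
    "Fat": (6, 80),
    "GPU x4": (14, 40),
    "GPU x8": (10, 40),
}


def total_cores(node_types):
    """Return a dict mapping each node type to nodes * cores-per-node."""
    return {name: count * cores for name, (count, cores) in node_types.items()}


totals = total_cores(NODE_TYPES)
for name, cores in totals.items():
    print(f"{name:14s} {cores:6d}")
print(f"{'All compute':14s} {sum(totals.values()):6d}")
```

Under these assumptions, the "HPC" nodes contribute the largest share of cores (14400), followed by the "HPC Broadwell" nodes (9856).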

File Systems

Details about changes on the file systems between bwUniCluster 1 and bwUniCluster 2.0 are described in the File system migration guide.

On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Many of the largest HPC systems use Lustre. An initial home directory on a Lustre file system is created during the first login, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for data with a limited lifetime.

Within a batch job further file systems are available:

  • The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.
  • On request a parallel on-demand file system is created which uses the SSDs of the nodes which were allocated to the batch job.
  • On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.
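A job can use the node-local directory by reading the environment variable $TMP. The following sketch shows one possible way to do this; the helper names are hypothetical, and the fallback to the system default temporary directory is an assumption added so that the code also runs outside a batch job.

```python
import os
import tempfile


def node_local_scratch():
    """Return $TMP if it names an existing directory, else the system default."""
    tmp = os.environ.get("TMP")
    return tmp if tmp and os.path.isdir(tmp) else tempfile.gettempdir()


def write_scratch_file(data, prefix="job_"):
    """Write bytes to a new, uniquely named file in the scratch directory."""
    fd, path = tempfile.mkstemp(prefix=prefix, dir=node_local_scratch())
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


path = write_scratch_file(b"intermediate result\n")
print(path)
os.remove(path)  # clean up; data below $TMP is only visible on this node
```

Because $TMP is only visible on the local node, any result that other nodes or later jobs need has to be copied to a globally visible file system before the job ends.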

Selecting the appropriate file system

In general, you should separate your data and store it on the appropriate file system:

  • Permanently needed data such as software or important results should be stored below $HOME, but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME, you can usually restore it from backup.
  • Permanent data which is not needed for months or which exceeds the capacity restrictions should be sent to bwFileStorage, to the LSDF Online Storage, or to the archive, and then deleted from the file systems.
  • Temporary data which is only needed on a single node and which does not exceed the disk space shown in the table above should be stored below $TMP.
  • Temporary data which is only needed during job runs should be stored on a parallel on-demand file system.
  • Temporary data which can be recomputed, or which is the result of one job and the input for another job, should be stored in workspaces. The lifetime of data in workspaces is limited and depends on the lifetime of the workspace, which can be several months.
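The rules above can be read as a small decision procedure. The sketch below encodes them as a Python function; the function name, the parameter names and the order in which the rules are checked are assumptions made for illustration and are not part of any cluster tooling.

```python
def choose_file_system(permanent, needed_within_months=True, exceeds_quota=False,
                       single_node=False, fits_local_disk=False, job_only=False):
    """Suggest a storage location according to the guidelines above."""
    if permanent:
        if exceeds_quota or not needed_within_months:
            return "bwFileStorage, LSDF Online Storage, or archive"
        return "$HOME"
    # Everything below concerns temporary data.
    if single_node and fits_local_disk:
        return "$TMP"
    if job_only:
        return "parallel on-demand file system"
    # Recomputable data, or data passed from one job to the next.
    return "workspace"


print(choose_file_system(permanent=True))
print(choose_file_system(permanent=True, needed_within_months=False))
print(choose_file_system(permanent=False, single_node=True, fits_local_disk=True))
print(choose_file_system(permanent=False, job_only=True))
print(choose_file_system(permanent=False))
```

The checks for temporary data run from the most local option to the most global one, so data that fits on a single node's disk is never sent to a shared file system unnecessarily.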

The most efficient way to transfer data to or from other HPC file systems or bwFileStorage is the tool rdata.

For further details please check the chapters below.