
BwForCluster MLS&WISO Production Hardware


1 System Architecture

The production part of the bwForCluster MLS&WISO is a high-performance computing system dedicated to research in Molecular Life Science, Economics, and Social Sciences. It features different types of compute nodes within a common software environment.

[Figure: bwForCluster MLS&WISO Production]

1.1 Basic Software Features

1.2 Compute Nodes

The production part of the bwForCluster MLS&WISO consists of several types of compute nodes. In addition to the "Standard" nodes, "Best" and "Fat" nodes are targeted at compute jobs with special demands on processor power and memory capacity, while "Coprocessor" nodes feature Intel Many Integrated Core (MIC) accelerators or Nvidia Tesla GPUs.

The "Standard" nodes are situated in Mannheim, while the "Best", "Fat", and "Coprocessor" nodes are hosted in Heidelberg. Both sites are connected via a 160 GBit/s Infiniband link.

Standard
  Node Feature: standard
  Quantity: 476
  Processors: 2 x Intel Xeon E5-2630v3 (Haswell)
  Processor Frequency: 2.4 GHz
  Number of Cores: 16
  Working Memory: 64 GB
  Local Disk: 128 GB (SSD)
  Interconnect: QDR
  Coprocessors: none

Best
  Node Feature: best
  Quantity: 184
  Processors: 2 x Intel Xeon E5-2640v3 (Haswell)
  Processor Frequency: 2.6 GHz
  Number of Cores: 16
  Working Memory: 128 GB
  Local Disk: 128 GB (SSD)
  Interconnect: FDR
  Coprocessors: none

Coprocessor (GPU)
  Node Feature: gpu
  Quantity: 18
  Processors: 2 x Intel Xeon E5-2630v3 (Haswell)
  Processor Frequency: 2.4 GHz
  Number of Cores: 16
  Working Memory: 64 GB
  Local Disk: 128 GB (SSD)
  Interconnect: FDR
  Coprocessors: 1 x Nvidia Tesla K80

Coprocessor (MIC)
  Node Feature: mic
  Quantity: 12
  Processors: 2 x Intel Xeon E5-2630v3 (Haswell)
  Processor Frequency: 2.4 GHz
  Number of Cores: 16
  Working Memory: 64 GB
  Local Disk: 128 GB (SSD)
  Interconnect: FDR
  Coprocessors: 2 x Intel Xeon Phi 5110P

Fat
  Node Feature: fat
  Quantity: 8
  Processors: 4 x Intel Xeon E5-4620v3 (Haswell)
  Processor Frequency: 2.0 GHz
  Number of Cores: 40
  Working Memory: 1536 GB
  Local Disk: 9000 GB (SATA)
  Interconnect: FDR
  Coprocessors: none

Fat (Ivy Bridge)
  Node Feature: fat-ivy
  Quantity: 4
  Processors: 4 x Intel Xeon E5-4620v2 (Ivy Bridge)
  Processor Frequency: 2.6 GHz
  Number of Cores: 32
  Working Memory: 1024 GB
  Local Disk: 128 GB (SSD)
  Interconnect: FDR
  Coprocessors: none
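To put the node counts above into perspective, the per-type figures can be combined into aggregate totals. The following short Python sketch only restates the table: it multiplies each node type's quantity by its per-node core count and working memory.

 # Aggregate cores and memory across the production node types,
 # using (quantity, cores per node, working memory in GB) from the table above.
 nodes = {
     "standard": (476, 16, 64),
     "best":     (184, 16, 128),
     "gpu":      (18,  16, 64),
     "mic":      (12,  16, 64),
     "fat":      (8,   40, 1536),
     "fat-ivy":  (4,   32, 1024),
 }

 total_cores = sum(q * c for q, c, _ in nodes.values())
 total_mem_tb = sum(q * m for q, _, m in nodes.values()) / 1024
 print(f"Total cores: {total_cores}")            # 11488
 print(f"Total memory: {total_mem_tb:.1f} TB")   # 70.6 TB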

Due to recent extensions, the cluster also provides some compute nodes based on the newer Intel Skylake architecture:

Best (Skylake)
  Node Feature: best-sky
  Quantity: 24
  Processors: 2 x Intel Xeon Gold 6130 (Skylake)
  Processor Frequency: 2.1 GHz
  Number of Cores: 32
  Working Memory: 192 GB
  Local Disk: 512 GB (SSD)
  Interconnect: EDR
  Coprocessors: none

Coprocessor (GPU Skylake)
  Node Feature: gpu-sky
  Processors: 2 x Intel Xeon Gold 6130 (Skylake)
  Processor Frequency: 2.1 GHz
  Number of Cores: 32
  Local Disk: 512 GB (SSD)
  Interconnect: EDR
  Variants (quantity, hostlist):
    1 node (h08c0101): 192 GB working memory, 4 x Nvidia Titan Xp (12 GB)
    1 node (h09c0101): 384 GB working memory, 4 x Nvidia Tesla V100 (16 GB)
    2 nodes (h09c0201, h09c0301): 384 GB working memory, 4 x Nvidia GeForce GTX 1080Ti (11 GB)
    1 node (h09c0401): 384 GB working memory, 4 x Nvidia GeForce RTX 2080Ti (11 GB)
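Because the gpu-sky hosts differ in GPU model and GPU memory, it can be useful for a job to check at runtime which variant it landed on. A minimal sketch, assuming PyTorch is available in the job's software environment (this page does not state which GPU frameworks are installed):

 # Report the GPU model and memory visible to the current job.
 # Assumes PyTorch; adapt to whatever CUDA-aware framework the job actually uses.
 import torch

 if torch.cuda.is_available():
     for i in range(torch.cuda.device_count()):
         props = torch.cuda.get_device_properties(i)
         mem_gb = props.total_memory / 1024**3
         print(f"GPU {i}: {torch.cuda.get_device_name(i)} ({mem_gb:.0f} GB)")
 else:
     print("No CUDA device visible to this job")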

2 Storage Architecture

There are two separate storage systems, one for $HOME and one for workspaces, both based on the parallel file system BeeGFS. Additionally, each compute node provides high-speed temporary storage on its node-local solid-state disk, accessible via the $TMPDIR environment variable (a short usage sketch follows the table below). For details and best practices see File Systems.

$HOME
  Visibility: global
  Lifetime: permanent
  Capacity: 36 TB
  Quotas: 100 GB
  Backup: no

Workspaces
  Visibility: global
  Lifetime: workspace lifetime
  Capacity: 384 TB
  Quotas: none
  Backup: no

$TMPDIR
  Visibility: node local
  Lifetime: batch job walltime
  Capacity: 128 GB per node (9 TB per fat node)
  Quotas: none
  Backup: no
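As noted above, $TMPDIR points to fast node-local SSD storage that only lives for the duration of the batch job. A minimal usage sketch in Python; the directory and file names are purely illustrative, and results must be copied back to $HOME or a workspace before the job ends:

 # Stage intermediate files on the node-local SSD via $TMPDIR.
 import os
 import shutil
 import tempfile

 # Fall back to the system temp dir when run outside a batch job.
 scratch = os.environ.get("TMPDIR", tempfile.gettempdir())
 workdir = os.path.join(scratch, "my_job_scratch")   # illustrative name
 os.makedirs(workdir, exist_ok=True)

 # ... produce temporary results in workdir during the job ...
 result = os.path.join(workdir, "result.dat")
 with open(result, "w") as fh:
     fh.write("intermediate data\n")

 # $TMPDIR is cleaned after the job, so persist anything worth keeping.
 shutil.copy(result, os.path.expanduser("~/result.dat"))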

3 Network

The components of the cluster at both sites are connected via two independent networks: a management network (Ethernet and IPMI) and an Infiniband fabric for MPI communication and storage access. The Infiniband interconnect in Mannheim is Quad Data Rate (QDR) and fully non-blocking across all "Standard" nodes. The Infiniband fabric in Heidelberg is Fourteen Data Rate (FDR) and likewise fully non-blocking across all "Best", "Fat", and "Coprocessor" nodes.

The two cluster sites in Mannheim and Heidelberg, 28 km apart, are linked via optical fibre and Mellanox MetroX TX6240 LongHaul appliances, which transparently aggregate four 40 GBit/s links into a single 160 GBit/s connection. The latency is only slightly above the hard limit set by the speed of light, effectively merging the two parts into a single high-performance computing resource.
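The latency claim can be checked with a back-of-the-envelope estimate. The fibre refractive index used below is a typical assumed value, not a figure from this page:

 # Lower bound on one-way latency over the 28 km Mannheim-Heidelberg link.
 distance_m = 28_000
 c_vacuum = 299_792_458      # speed of light in vacuum, m/s
 n_fibre = 1.47              # typical silica fibre refractive index (assumed)

 one_way_vacuum_us = distance_m / c_vacuum * 1e6
 one_way_fibre_us = distance_m / (c_vacuum / n_fibre) * 1e6
 print(f"One-way, vacuum: {one_way_vacuum_us:.0f} us")   # ~93 us
 print(f"One-way, fibre:  {one_way_fibre_us:.0f} us")    # ~137 us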