Helix/Hardware
System Architecture
The bwForCluster Helix is a high-performance supercomputer with a high-speed interconnect. The system consists of compute nodes (CPU and GPU nodes), infrastructure nodes for login and administration, and a storage system. All components are connected via a fast InfiniBand network. The login nodes are also connected to the Internet via Baden-Württemberg's extended LAN, BelWü.
Operating System and Software
- Operating system: Red Hat
- Queuing system: Slurm
- Access to application software: Environment Modules (see the sketch below)
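The following minimal sketch shows how these components are typically used from a login node; the module name is a placeholder for illustration and is not taken from this page:

```bash
# List the software provided via Environment Modules
module avail

# Load a module into the current shell environment
# (the module name below is a placeholder, not an actual Helix module)
module load devel/python

# Show the Slurm partitions and node states
sinfo

# Submit a batch job to Slurm
sbatch job.sh
```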
Compute Nodes
The cluster is equipped with the following CPU and GPU nodes.
Node Type | cpu (CPU node) | fat (CPU node) | gpu4 (GPU node) | gpu4 (GPU node) | gpu8 (GPU node) | gpu8 (GPU node, in preparation) |
---|---|---|---|---|---|---|
Quantity | 355 | 15 | 29 | 26 | 4 | 3 |
Processors | 2 x AMD EPYC 7513 | 2 x AMD EPYC 7513 | 2 x AMD EPYC 7513 | 2 x AMD EPYC 7513 | 2 x AMD EPYC 7513 | 2 x AMD EPYC 9334 |
Processor Frequency (GHz) | 2.6 | 2.6 | 2.6 | 2.6 | 2.6 | 2.7 |
Number of Cores per Node | 64 | 64 | 64 | 64 | 64 | 64 |
Installed Working Memory (GB) | 256 | 2048 | 256 | 256 | 2048 | 2304 |
Available Memory for Jobs (GB) | 236 | 2000 | 236 | 236 | 2000 | 2200 |
Interconnect | 1x HDR100 | 1x HDR100 | 2x HDR100 | 2x HDR200 | 4x HDR200 | 4x HDR200 |
Coprocessors | - | - | 4x Nvidia A40 (48 GB) | 4x Nvidia A100 (40 GB) | 8x Nvidia A100 (80 GB) | 8x Nvidia H200 (141 GB) |
Number of GPUs | - | - | 4 | 4 | 8 | 8 |
GPU Type | - | - | A40 | A100 | A100 | H200 |
GPU Memory per GPU (GB) | - | - | 48 | 40 | 80 | 141 |
GPU with FP64 capability | - | - | no | yes | yes | yes |
The number of cores per node, the available memory for jobs, and the GPU resources are the figures relevant for Slurm job requests, as sketched below.
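As an illustration of how the table values translate into a job request, a Slurm batch header for one gpu4 node with A100 GPUs could look like the following; the partition name and the executable are assumptions, while the core, memory, and GPU limits come from the table above:

```bash
#!/bin/bash
#SBATCH --partition=gpu4        # assumed partition name, check the cluster documentation
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16    # at most 64 cores per node (see table)
#SBATCH --mem=200G              # at most 236 GB usable memory on a gpu4 node (see table)
#SBATCH --gres=gpu:4            # 4 GPUs per gpu4 node (see table)
#SBATCH --time=01:00:00

srun ./my_gpu_program           # placeholder executable
```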
Storage Architecture
There is one storage system providing a large parallel file system based on IBM Storage Scale for $HOME, for workspaces, and for temporary job data. The compute nodes do not have local disks.
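Workspaces on such systems are commonly managed with the HPC workspace tools; assuming these are installed on Helix (an assumption, they are not named on this page), a scratch area for job data could be handled as follows:

```bash
# Create a workspace named "run1" for 30 days
# (assumes the HPC workspace tools, e.g. ws_allocate, are installed)
ws_allocate run1 30

# List existing workspaces and their expiry dates
ws_list

# Release the workspace once the data is no longer needed
ws_release run1
```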
Network
The components of the cluster are connected via two independent networks: a management network (Ethernet and IPMI) and an InfiniBand fabric for MPI communication and storage access. The InfiniBand backbone is a fully non-blocking fabric with a data rate of 200 Gb/s. The compute nodes are connected at different data rates according to their configuration (see the Interconnect row in the table above): HDR100 stands for 100 Gb/s and HDR200 for 200 Gb/s.
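To check which InfiniBand link rate a particular node actually has (for example, 100 Gb/s on an HDR100-attached CPU node), the standard InfiniBand diagnostics can be used, assuming they are installed on the node:

```bash
# Print the state and rate of the local InfiniBand ports
# (part of the infiniband-diags tools; look at the "Rate" field)
ibstat
```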