NEMO2/Hardware

Operating System and Software

{| class="wikitable"
|-
!scope="column" | Operating System
| Rocky Linux 9 (similar to RHEL 9)
|-
!scope="column" | Queuing System
| SLURM (see [[NEMO2/Slurm]] for help)
|-
!scope="column" | (Scientific) Libraries and Software
| Environment Modules
|-
!scope="column" | Own Software Modules using EasyBuild and Spack
| EasyBuild
|-
!scope="column" | Own (Python) Environments with Conda
| Conda
|-
!scope="column" | Containers with Apptainer/Singularity (and enroot in the future)
| Development/Containers
|}
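A minimal usage sketch for this software stack on a login node; the module name, environment name and container image below are placeholders, not actual NEMO2 names:

 # list the available software modules (Environment Modules)
 module avail
 # load a module (name and version are placeholders)
 module load devel/python/3.12
 # create and activate a personal Conda environment (name is a placeholder)
 conda create -n myenv numpy
 conda activate myenv
 # run a command inside a container with Apptainer (image path is a placeholder)
 apptainer exec my_image.sif python3 --version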



Compute and Special Purpose Nodes

For researchers from the scientific fields '''N'''euroscience, '''E'''lementary Particle Physics, '''M'''icrosystems Engineering and '''M'''aterials Science, the bwForCluster '''NEMO''' offers about 240 compute nodes plus several special-purpose nodes for login, interactive jobs, visualization, machine learning and AI.

Node specification (see [[NEMO2/Slurm]] for Slurm partitions):

{| class="wikitable" style="text-align:center;"
|-
!
! Genoa Partition
! Milan Partition
! L40S Partition
! MI300A Partition
! Login Nodes
|-
!scope="column" | Quantity
| 106
| 137
| 9
| 4
| 2
|-
!scope="column" | Processors / APU/GPU
| 2x AMD EPYC 9654 (Genoa)
| 2x AMD EPYC 7763 (Milan)
| 2x Intel Xeon Platinum 8562Y+ (5th Gen)
4x NVIDIA L40S
| 4x AMD Instinct MI300A
| 1x AMD EPYC 9354 (Genoa)
|-
!scope="column" | Base Frequency/Boost Frequency (GHz) / APU/GPU Performance (TFLOPs/TOPs)
| 2.4/3.55
| 2.45/3.5
| 2.8/3.8
91.6 (FP32) / 733 (INT8)
| -/3.7
61.3 (FP64) / 122.6 (FP32) / 1960 (INT8)
| 3.25/3.75
|-
!scope="column" | CPU Cores per Node
| 192
'''Usable in Slurm: 190'''
| 128
'''Usable in Slurm: 126'''
| 64
'''Usable in Slurm: 62'''
| 4x 24
'''Usable in Slurm: 92'''
| 32
|-
!scope="column" | CPU Cores per APU/GPU
| ---
| ---
| 15
| 23
| ---
|-
!scope="column" | Memory
| 768 GiB, 4.8 GHz (DDR5)
'''Usable in Slurm: 727 GiB, 745000 MiB, per Core: 3900 MiB'''
| 512 GiB, 3.2 GHz (DDR4)
'''Usable in Slurm: 495 GiB, 507000 MiB, per Core: 4000 MiB'''
| 512 GiB, 5.6 GHz (DDR5)
4x 48 GB, 864 GB/s (GDDR6)
'''Usable in Slurm: 495 GiB, 507000 MiB, per Core: 8100 MiB'''
| 4x 128 GB, 5300 GB/s (HBM3)
'''Usable in Slurm: 495 GiB, 507000 MiB, per Core: 5300 MiB'''
| 384 GiB, 4.8 GHz (DDR5)
|-
!scope="column" | Local NVMe (GB)
| 3840
| 1920
| 3840
| 3840
| 480
|-
!scope="column" | Interconnect
| 100 GbE (RoCEv2)
| Omni-Path 100
100 GbE (RoCEv2)
| 100 GbE (RoCEv2)
| 100 GbE (RoCEv2)
| 100 GbE (RoCEv2)
|-
!scope="column" | Job Example Genoa Partition
|colspan="5" | Maximum resources for a single node job (*):
<code>--partition=genoa --ntasks=190 --mem=727GB # or: --mem=745000MB, or: --mem-per-cpu=3900MB</code>
|-
!scope="column" | Job Example Milan Partition
|colspan="5" | Maximum resources for a single node job (*):
<code>--partition=milan --ntasks=126 --mem=495GB # or: --mem=507000MB, or: --mem-per-cpu=4000MB</code>
|-
!scope="column" | Job Example L40S Partition
|colspan="5" | Maximum resources for a single node job (*):
<code>--partition=l40s --ntasks=62 --gres=gpu:4 --mem=495GB # or: --mem=507000MB, or: --mem-per-cpu=8100MB</code>
|-
!scope="column" | Job Example MI300A Partition
|colspan="5" | Maximum resources for a single node job (*):
<code>--partition=mi300a --ntasks=94 --gres=gpu:4 --mem=495GB # or: --mem=507000MB, or: --mem-per-cpu=5300MB</code>
|}

(*) Slurm internally uses Mebibyte (MiB) and Gibibyte (GiB); multiply or divide by 1024 to convert between M and G.
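For orientation, the job examples above translate into a batch script as in the following minimal sketch for the Genoa partition (walltime, job name and program are placeholders; see [[NEMO2/Slurm]] for details):

 #!/bin/bash
 #SBATCH --partition=genoa      # Genoa partition (see table above)
 #SBATCH --nodes=1              # single-node job
 #SBATCH --ntasks=190           # cores usable in Slurm on a Genoa node
 #SBATCH --mem=727GB            # usable memory per node (alternatively --mem-per-cpu=3900MB)
 #SBATCH --time=01:00:00        # walltime (placeholder)
 #SBATCH --job-name=example     # placeholder job name
 srun ./my_program              # placeholder executable

An interactive test on a GPU partition can be requested in the same way, e.g. <code>srun --partition=l40s --ntasks=15 --gres=gpu:1 --mem-per-cpu=8100MB --pty bash</code> (one GPU with its share of 15 cores on an L40S node).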

File Systems

NEMO2 offers a fast Weka parallel file system, whose throughput is limited only by the uplink to the storage (>90 GB/s). The storage is used for $HOME and workspaces. There will be no backups, but we plan to implement snapshots of the last 7 days in the coming months. Additionally, each compute node provides temporary storage on its node-local NVMe disk.

{| class="wikitable" style="text-align:center;"
|-
!
! $HOME
! Workspaces
! NVMe
|-
!scope="column" | Visibility
|colspan="2" | global (100 GbE)
| node local
|-
!scope="column" | Lifetime
| permanent
| workspace lifetime
(max. 100 days, extensions possible)
| batch job walltime
|-
!scope="column" | Capacity
|colspan="2" | 1 PB
| 1.9 TB or more (depends on node)
|-
!scope="column" | Quotas per $HOME/Workspace
| 100 GB
| 5 TB (per workspace)
| ---
|-
!scope="column" | Snapshots
| daily (7 snapshots) (not yet implemented)
| ---
| ---
|-
!scope="column" | Backups
|colspan="3" | There is NO storage backup!
|}
  global             : all nodes access the same file system
  local              : each node has its own file system
  permanent          : files are stored permanently
                       however, if an account has lost access,
                       the remaining data will be deleted after 6 months
  batch job walltime : files are removed at end of the batch job
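How workspaces and the node-local NVMe scratch are typically used is sketched below, assuming the common HPC workspace tools (<code>ws_allocate</code>, <code>ws_list</code>, <code>ws_find</code>, <code>ws_extend</code>) known from other bwHPC clusters and a job-provided scratch directory in <code>$TMPDIR</code>; both are assumptions, not confirmed NEMO2 specifics:

 # allocate a workspace named "mydata" for 100 days (maximum lifetime, see table above)
 ws_allocate mydata 100
 # list existing workspaces and their remaining lifetime
 ws_list
 # extend a workspace before it expires (if extensions are still available)
 ws_extend mydata 100
 # inside a batch job: stage data to the node-local NVMe scratch (assumed to be $TMPDIR)
 cp "$(ws_find mydata)/input.dat" "$TMPDIR/"
 ./my_program "$TMPDIR/input.dat"
 cp "$TMPDIR/result.dat" "$(ws_find mydata)/"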