BinAC2/Hardware and Architecture

From bwHPC Wiki
== System Architecture ==

The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Astrophysics, and Geosciences.

== Operating System and Software ==
* Queuing System: [https://slurm.schedmd.com/documentation.html Slurm] (see [[BinAC2/Slurm]] for help)
* (Scientific) Libraries and Software: [[Environment Modules]]



=== Compute Nodes ===


BinAC 2 offers compute nodes, high-mem nodes, and two types of GPU nodes.
* 180 compute nodes
* 14 SMP nodes
* 32 GPU nodes (A30)
* 8 GPU nodes (A100)
* plus several special purpose nodes for login, interactive jobs, etc.


Compute node specification:
{| class="wikitable"
|-
!
! Standard
! High-Mem
! GPU (A30)
! GPU (A100)
|-
!scope="column"| Quantity
| 180
| 14
| 32
| 8
|-
!scope="column"| Processors
| 2 x AMD EPYC Milan 7543
| 2 x AMD EPYC Milan 7443
| 2 x AMD EPYC Milan 7543
| 2 x AMD EPYC Milan 7543
|-
!scope="column"| Processor Frequency (GHz)
| 2.80
| 2.85
| 2.80
| 2.80
|-
!scope="column"| Number of Cores
| 64
| 48
| 64
| 64
|-
!scope="column"| Working Memory (GB)
| 512
| 2048
| 512
| 512
|-
!scope="column"| Local Disk (GB)
| 512 (SSD)
| 1920 (SSD)
| 512 (SSD)
| 512 (SSD)
|-
!scope="column"| Interconnect
| HDR IB (80 nodes) / 100GbE
| HDR
| HDR
| HDR
|-
!scope="column"| Coprocessors
| -
| -
| 2 x [https://www.nvidia.com/de-de/data-center/products/a30-gpu/ NVIDIA A30] (24 GB ECC HBM2, NVLink)
| 4 x [https://www.nvidia.com/de-de/data-center/a100/ NVIDIA A100] (80 GB ECC HBM2e)
|}




=== Special Purpose Nodes ===

Besides the classical compute nodes, several nodes serve as login and preprocessing nodes, as nodes for interactive jobs, and as nodes providing a virtual service environment.

== Storage Architecture ==

The bwForCluster [https://www.binac.uni-tuebingen.de BinAC] consists of two separate storage systems: one for the user's home directory <tt>$HOME</tt> and one serving as work space. The home directory is limited in space and parallel access, but offers snapshots of your files and a backup. The work space is a parallel file system based on [https://www.beegfs.com/ BeeGFS], which offers fast parallel file access from many nodes and a larger capacity than the home directory. Additionally, each compute node provides high-speed temporary storage on a node-local solid state disk (SSD), available via the <tt>$TMPDIR</tt> environment variable.

{| class="wikitable"
|-
! style="width:10%"|
! style="width:10%"| <tt>$HOME</tt>
! style="width:10%"| Work Space
! style="width:10%"| <tt>$TMPDIR</tt>
|-
!scope="column" | Visibility
| global
| global
| node local
|-
!scope="column" | Lifetime
| permanent
| work space lifetime (max. 30 days, max. 3 extensions)
| batch job walltime
|-
!scope="column" | Capacity
| unknown
| 482 TB
| 211 GB per node
|-
!scope="column" | [https://en.wikipedia.org/wiki/Disk_quota#Quotas Quotas]
| 40 GB per user
| none
| none
|-
!scope="column" | Backup
| yes
| no
| no
|}

; global : all nodes access the same file system
; local : each node has its own file system
; permanent : files are stored permanently
; batch job walltime : files are removed at the end of the batch job


=== $HOME ===

Home directories are meant for permanent storage of files in continuous use, such as source code, configuration files, and executable programs; the content of home directories is backed up on a regular basis.
<!--
Current disk usage on home directory and quota status can be checked with the '''diskusage''' command:

$ diskusage
User Used (GB) Quota (GB) Used (%)
------------------------------------------------------------------------
<username> 4.38 100.00 4.38

-->


'''Note:''' Compute jobs must not write temporary data to <tt>$HOME</tt>. Instead, they should use the node-local <tt>$TMPDIR</tt> directory for I/O-heavy use cases and work spaces for less I/O-intense multi-node jobs.


<!--
'''Quota is full - what to do'''

In case of 100% usage of the quota user can get some problems with disk writing operations (e.g. error messages during the file copy/edit/save operations). To avoid it - please remove some data that you don't need from the $HOME directory or move it to some temporary place.

As temporary place for the data user can use:

* '''Workspace''' - space on the BeeGFS file system, lifetime up to 90 days (see below)

* '''Scratch on login nodes''' - special directory on every login node (login01..login03):
** Access via variable $TMPDIR (e.g. "cd $TMPDIR")
** Lifetime of data - minimum 7 days (based on the last access time)
** Data is private for every user
** Each login node has own scratch directory (data is NOT shared)
** There is NO backup of the data

To get optimal and comfortable work with the $HOME directory is important to keep the data in order (remove unnecessary and temporary data, archive big files, save large files only on the workspace).
-->


=== Work Space ===

Work spaces can be created with the <tt>workspace</tt> tools, which generate a directory on the parallel storage.

To create a work space, supply a name for your work space area and a lifetime in days.
For more information, read the corresponding help, e.g. <tt>ws_allocate -h</tt>.

Examples:
{| class="wikitable"
|-
!style="width:30%" | Command
!style="width:70%" | Action
|-
|<tt>ws_allocate mywork 30</tt>
|Allocate a work space named "mywork" for 30 days.
|-
|<tt>ws_allocate myotherwork</tt>
|Allocate a work space named "myotherwork" with maximum lifetime.
|-
|<tt>ws_list -a</tt>
|List all your work spaces.
|-
|<tt>ws_find mywork</tt>
|Get absolute path of work space "mywork".
|-
|<tt>ws_extend mywork 30</tt>
|Extend the lifetime of work space "mywork" by 30 days from now. (Not needed; work spaces on BinAC are not limited.)
|-
|<tt>ws_release mywork</tt>
|Manually erase your work space "mywork". Please remove directory content first.
|-
|}
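A typical pattern is to allocate a work space once from a login node (e.g. <tt>ws_allocate mywork 30</tt>) and then resolve its path inside a batch job. The following is a minimal sketch of a Slurm job script; the job name, walltime, and <tt>my_program</tt> are hypothetical placeholders, and only the <tt>ws_find</tt> call comes from the workspace tools described above.

```shell
#!/bin/bash
#SBATCH --job-name=ws-demo
#SBATCH --time=01:00:00

# Resolve the absolute path of the previously allocated work space
# "mywork" on the parallel BeeGFS storage.
WORKSPACE=$(ws_find mywork)

# Run the computation with its working directory on the work space,
# so that large intermediate files land on the parallel file system.
cd "$WORKSPACE"
srun ./my_program
```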

=== Local Disk Space ===

All compute nodes are equipped with a local SSD with 200 GB capacity for job execution. During a job, the environment variable <tt>$TMPDIR</tt> points to this local disk space. The data becomes unavailable as soon as the job has finished.
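For I/O-heavy single-node jobs, a common pattern is to stage input data onto the node-local SSD, compute there, and copy the results back before the job ends, since <tt>$TMPDIR</tt> is removed at the end of the batch job. A minimal sketch of a hypothetical Slurm job script; the file names and <tt>my_program</tt> are placeholders:

```shell
#!/bin/bash
#SBATCH --time=02:00:00

# Stage input data from permanent storage onto the fast node-local SSD.
cp "$HOME/input.dat" "$TMPDIR/"

# Compute on the local disk to keep heavy I/O off the shared file systems.
cd "$TMPDIR"
./my_program input.dat > output.dat

# Copy results back to permanent storage before the job finishes;
# $TMPDIR and its contents vanish when the job ends.
cp output.dat "$HOME/"
```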

=== SDS@hd ===

SDS@hd is mounted only on login03 at <tt>/sds_hd</tt>.
To access your Speichervorhaben (storage project), please see the [[SDS@hd/Access/NFS#access_your_data|SDS@hd documentation]].
If you can't see your Speichervorhaben, you can [[BinAC/Support|open a ticket]].

Latest revision as of 12:21, 29 August 2024
