= Architecture of bwUniCluster 3.0 =

The '''bwUniCluster 3.0''' is a parallel computer with distributed memory.

It consists of the bwUniCluster 3.0 components procured in 2024 and also includes the additional compute nodes which were procured in 2022 as an extension to bwUniCluster 2.0.

Each node of the system consists of two Intel Xeon or AMD EPYC processors, local memory, local storage, network adapters and optional accelerators (NVIDIA A100 and H100, AMD Instinct MI300A). All nodes are connected via a fast InfiniBand interconnect.

The parallel file system (Lustre) is connected to the InfiniBand switch of the compute cluster. This provides a fast and scalable parallel file system.

The operating system on each node is Red Hat Enterprise Linux (RHEL) 9.4.

The individual nodes of the system act in different roles. From an end user's point of view, the relevant groups of nodes are login nodes and compute nodes. File server nodes and administrative server nodes are not accessible to users.

'''Login Nodes'''<br>
The login nodes are the only nodes directly accessible by end users. These nodes are used for interactive login, file management, program development, and interactive pre- and post-processing.

There are two nodes dedicated to this service; both can be reached via a single address: <code>uc3.scc.kit.edu</code>. A DNS round-robin alias distributes login sessions across the login nodes.

To prevent the login nodes from being used for activities that are not permitted there and that affect the user experience of others, '''long-running and/or compute-intensive tasks are periodically terminated without any prior warning'''. Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
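For illustration, the login nodes are typically reached via SSH using the address above. This is a minimal sketch; the placeholder account name is hypothetical, and the supported authentication methods are described on the [[BwUniCluster3.0/Login|Login]] page.

<pre>
# Connect to one of the login nodes; the round-robin alias picks one for you.
# Replace <username> with your own account name.
ssh <username>@uc3.scc.kit.edu
</pre>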
'''Compute Nodes'''<br>
The majority of nodes are compute nodes, which are managed by a batch system. Users submit their jobs to the SLURM batch system; a job is executed when the required resources become available (depending on its fair-share priority).

'''File Systems'''<br>
bwUniCluster 3.0 comprises two parallel file systems based on Lustre.

[[File:uc3.png|center|800px]]
= Compute Resources =

== Login nodes ==

After a successful [[BwUniCluster3.0/Login|login]], users find themselves on one of the so-called login nodes. Technically, these largely correspond to a standard CPU node, i.e. users have two AMD EPYC 9454 processors with a total of 96 cores at their disposal. The login nodes are the gateway for accessing the computing resources: data and software are organized here, compute jobs are initiated and managed, and computing resources allocated for interactive use can also be accessed from here.

{|style="background:#deffee; width:100%;"
|style="padding:5px; background:#ffa500; text-align:left"|
[[Image:Attention.svg|center|25px]]
|style="padding:5px; background:#ffa500; text-align:left"|
'''Any compute-intensive job running on the login nodes will be terminated without any notice.'''<br/>
Please refer to [[BwUniCluster3.0/Login#Allowed_Activities_on_Login_Nodes|Allowed Activities on Login Nodes]].
|}

== Compute nodes ==

All compute activities on bwUniCluster 3.0 have to be performed on the compute nodes. Compute nodes are only available by requesting the corresponding resources via the queuing system. As soon as the requested resources are available, automated tasks are executed via a batch script, or the nodes can be used interactively. Please refer to [[BwUniCluster3.0/Running_Jobs|Running Jobs]] on how to request resources.
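For illustration, an interactive session on a compute node can be requested with SLURM's <code>salloc</code>. This is only a sketch: the queue name is taken from Table 1 below and assumed to map to a SLURM partition, and the resource values are example choices; see [[BwUniCluster3.0/Running_Jobs|Running Jobs]] for the authoritative options.

<pre>
# Request one task on a development CPU node for 30 minutes (example values)
salloc --partition=dev_cpu --ntasks=1 --time=00:30:00
</pre>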
The following compute node types are available:<br>
<b>CPU nodes</b>
* '''Standard''': Two AMD EPYC 9454 processors per node with a total of 96 physical CPU cores or 192 logical cores (Hyper-Threading) per node. These nodes were procured in 2024.
* '''Ice Lake''': Two Intel Xeon Platinum 8358 processors per node with a total of 64 physical CPU cores or 128 logical cores (Hyper-Threading) per node. These nodes were procured in 2022 as an extension to bwUniCluster 2.0.
* '''High Memory''': Similar to the standard nodes, but with six times the memory.

<b>GPU nodes</b>
* '''NVIDIA GPU x4''': Similar to the standard nodes, but with larger memory and four NVIDIA H100 GPUs.
* '''AMD GPU x4''': A node with four AMD Instinct MI300A accelerated processing units (APUs); each APU combines CPU cores and GPU compute units that share the same high-bandwidth memory (HBM).
* '''Ice Lake NVIDIA GPU x4''': Similar to the Ice Lake nodes, but with larger memory and four NVIDIA A100 or H100 GPUs.
* '''Cascade Lake NVIDIA GPU x4''': Nodes with four NVIDIA A100 GPUs.
{| class="wikitable"
|-
! style="width:10%"| Node Type
! style="width:10%"| CPU nodes<br/>Ice Lake
! style="width:10%"| CPU nodes<br/>Standard
! style="width:10%"| CPU nodes<br/>High Memory
! style="width:10%"| GPU nodes<br/>NVIDIA GPU x4
! style="width:10%"| GPU node<br/>AMD GPU x4
! style="width:10%"| GPU nodes<br/>Ice Lake<br/>NVIDIA GPU x4
! style="width:10%"| GPU nodes<br/>Cascade Lake<br/>NVIDIA GPU x4
! style="width:10%"| Login nodes
|-
!scope="column"| Availability in [[BwUniCluster3.0/Running_Jobs#Queues_on_bwUniCluster_3.0| queues]]
| <code>cpu_il</code>, <code>dev_cpu_il</code>
| <code>cpu</code>, <code>dev_cpu</code>
| <code>highmem</code>, <code>dev_highmem</code>
| <code>gpu_h100</code>, <code>dev_gpu_h100</code>
| <code>gpu_mi300</code>
| <code>gpu_a100_il</code> / <code>gpu_h100_il</code>
| <code>gpu_a100_short</code>
| -
|-
!scope="column"| Number of nodes
| 272
| 70
| 4
| 12
| 1
| 15
| 19
| 2
|-
!scope="column"| Processors
| Intel Xeon Platinum 8358
| AMD EPYC 9454
| AMD EPYC 9454
| AMD EPYC 9454
| AMD Zen 4
| Intel Xeon Platinum 8358
| Intel Xeon Gold 6248R
| AMD EPYC 9454
|-
!scope="column"| Number of sockets
| 2
| 2
| 2
| 2
| 4
| 2
| 2
| 2
|-
!scope="column"| Total number of cores
| 64
| 96
| 96
| 96
| 96 (4x 24)
| 64
| 48
| 96
|-
!scope="column"| Main memory
| 256 GiB
| 384 GiB
| 2304 GiB
| 768 GiB
| 4x 128 GiB HBM3
| 512 GiB
| 384 GiB
| 384 GiB
|-
!scope="column"| Local SSD
| 1.8 TB NVMe
| 3.84 TB NVMe
| 15.36 TB NVMe
| 15.36 TB NVMe
| 7.68 TB NVMe
| 6.4 TB NVMe
| 1.92 TB SATA SSD
| 7.68 TB SATA SSD
|-
!scope="column"| Accelerators
| -
| -
| -
| 4x NVIDIA H100
| 4x AMD Instinct MI300A
| 4x NVIDIA A100 / H100
| 4x NVIDIA A100
| -
|-
!scope="column"| Accelerator memory
| -
| -
| -
| 94 GB
| shared HBM3 (see main memory)
| 80 GB / 94 GB
| 40 GB
| -
|-
!scope="column"| Interconnect
| IB HDR200
| IB 2x NDR200
| IB 2x NDR200
| IB 4x NDR200
| IB 2x NDR200
| IB 2x HDR200
| IB 4x EDR
| IB 1x NDR200
|}

Table 1: Hardware overview and properties
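For illustration, a minimal batch script for one of the queues listed in Table 1 could look as follows. This is a sketch only: the queue is assumed to map to a SLURM partition, the resource values are examples, and <code>./my_application</code> is a hypothetical executable; the authoritative submission options are described under [[BwUniCluster3.0/Running_Jobs|Running Jobs]].

<pre>
#!/bin/bash
#SBATCH --partition=cpu          # queue from Table 1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=96     # all physical cores of a standard CPU node
#SBATCH --time=02:00:00          # requested walltime (example value)
#SBATCH --job-name=example

srun ./my_application            # hypothetical executable
</pre>

Such a script would then be submitted with <code>sbatch jobscript.sh</code>.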
= File Systems =

On bwUniCluster 3.0 the following file systems are available:

* '''$HOME'''<br>The HOME directory is created automatically after account activation, and the environment variable $HOME holds its name. HOME is the place where users find themselves after login.
* '''Workspaces'''<br>Users can create so-called workspaces for non-permanent data with a limited lifetime.
* '''Workspaces on flash storage'''<br>A further workspace file system based on flash-only storage is available for special requirements and certain users.
* '''$TMPDIR'''<br>The directory $TMPDIR is only available and visible on the local node during the runtime of a compute job. It is located on fast SSD storage devices.
* '''BeeOND''' (BeeGFS On-Demand)<br>On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes that were allocated to the batch job.
* '''LSDF Online Storage'''<br>On request the external LSDF Online Storage is mounted on the nodes that were allocated to the batch job. On the login nodes, LSDF is automatically mounted.
'''Which file system to use?'''

You should separate your data and store it on the appropriate file system. Permanently needed data like software or important results should be stored in $HOME, but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME, there is a chance that we can restore it from backup. Permanent data which is not needed for months or exceeds the capacity restrictions should be sent to the LSDF Online Storage or to the archive and deleted from the file systems. Temporary data which is only needed on a single node and which does not exceed the disk space shown in Table 1 above should be stored below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes of your batch job and which is only needed during the job runtime should be stored on the parallel on-demand file system BeeOND. Temporary data which can be recomputed or which is the result of one job and input for another job should be stored in workspaces. The lifetime of data in workspaces is limited and depends on the lifetime of the workspace, which can be several months.

For further details please check: [[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details|File System Details]]
== $HOME ==

The $HOME directories of bwUniCluster 3.0 users are located on the parallel file system Lustre. You have access to your $HOME directory from all nodes of UC3. A regular backup of these directories to the tape archive is done automatically. The directory $HOME is used to hold files that are permanently used, like source code, configuration files, executable programs etc.

[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$HOME|Detailed information on $HOME]]
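Since quotas apply to $HOME, it can be useful to check the current usage. A minimal sketch, assuming the standard Lustre quota tools can be used by regular users:

<pre>
# Show your current usage and limits on the $HOME file system (Lustre)
lfs quota -h -u $USER $HOME
</pre>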
== Workspaces ==

On UC3, workspaces should be used to store large non-permanent data sets, e.g. restart files or output data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel.

On UC3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user.
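Workspaces are typically managed with the HLRS workspace tools, as on bwUniCluster 2.0. A minimal sketch, under the assumption that the same commands are available on bwUniCluster 3.0; the workspace name and lifetime are example values, and the details page below is authoritative:

<pre>
# Create a workspace named "run42" with a lifetime of 60 days (example values)
ws_allocate run42 60

# List existing workspaces and their remaining lifetime
ws_list

# Release a workspace that is no longer needed
ws_release run42
</pre>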
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces|Detailed information on Workspaces]]

== Workspaces on flash storage ==

Another workspace file system based on flash-only storage is available for special requirements and certain users. If possible, this file system should be used from the Ice Lake nodes of bwUniCluster 3.0 (queue ''cpu_il''). It provides high IOPS rates and better performance for small files. The quota limits are lower than on the normal workspace file system.

[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#Workspaces_on_flash_storage|Detailed information on Workspaces on flash storage]]
== $TMPDIR ==

The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This directory should be used for temporary files that are accessed from the local node. It should also be used if you read the same data many times from a single node, e.g. if you are doing AI training. Because of the extremely fast local SSD storage devices, performance with small files is much better than on the parallel file systems.
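For example, inside a batch job, a data set that is read many times (e.g. for AI training) can first be staged to the node-local SSD and then read from there; the paths and the program below are hypothetical:

<pre>
# Stage the input data from a workspace to the node-local SSD once ...
cp -r /path/to/workspace/dataset "$TMPDIR/"

# ... and let the application read it from there during the job
./train --data "$TMPDIR/dataset"   # hypothetical program and option
</pre>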
[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#$TMPDIR|Detailed information on $TMPDIR]]

== BeeOND (BeeGFS On-Demand) ==

Users can request a private BeeOND (on-demand BeeGFS) parallel file system for each job. The file system is created during job startup and purged when the job completes.

[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#BeeOND_(BeeGFS_On-Demand)|Detailed information on BeeOND]]
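On bwUniCluster 2.0, a BeeOND file system was requested via a SLURM constraint and mounted under a job-specific path on all nodes of the job. Assuming bwUniCluster 3.0 behaves the same way (the details page above is authoritative), a job script could contain:

<pre>
#SBATCH --constraint=BEEOND            # assumption: request an on-demand BeeGFS for this job

# Assumption: the file system is mounted under a job-specific path such as
# /mnt/odfs/$SLURM_JOB_ID on all nodes allocated to the job
cp -r "$HOME/input" /mnt/odfs/$SLURM_JOB_ID/
</pre>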
== LSDF Online Storage ==

The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. bwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols and is only available for certain users.

[[BwUniCluster3.0/Hardware_and_Architecture/Filesystem_Details#LSDF_Online_Storage|Detailed information on LSDF Online Storage]]
= SDS@hd Registration =

== Registration Steps ==

The registration consists of two steps:

# '''Membership in a storage project''' (Speichervorhaben/SV)
## [[SDS@hd/Registration#Apply_for_new_SV | Apply for new SV]] (only possible for [https://www.bwidm.de/hochschulen.php bwIDM members])
## [[SDS@hd/Registration#Join_existing_SV | Join existing SV]]
# [[SDS@hd/Registration#Registration_for_SDS@hd_Service |'''Registration for SDS@hd Service''']]

After finishing the registration, the next steps are:

* Check your options to manage your SV membership. → '''[[#Manage_SV_Membership(s) | Manage SV Membership(s)]]'''
* See how to access and use your storage space. → '''[[SDS@hd/Access | Access]]'''

=== Step 1: SV Membership ===

You can either start a new SV or join an existing one.

==== Apply for new SV ====
This is typically done only by the leader of a scientific work group or the senior scientist of a research group/collaboration. Any number of co-workers can join your SV without having to register another project; you just need to provide them with the SV acronym and the SV password. You will receive this information via e-mail as soon as the SV application has been successful.

There are two steps:

# '''Get the permission'''. Your own institution has to grant you the permission to start an SDS@hd storage project ("SDS@hd SV entitlement"). Each institution has its own procedure. If you are unsure about the process at your institution, you can write to the SDS@hd support: [mailto:sds-hd-support@urz.uni-heidelberg.de Submit a Ticket]. For more details about the SDS@hd entitlements, see the [https://urz.uni-heidelberg.de/en/sds-hd SDS@hd website].
# '''Apply for a new SV'''. To do so, fill in the form at the [https://sds-hd.urz.uni-heidelberg.de/management SDS@hd Managementtool].

If you register your own SV, you will be:

* ...responsible for providing new members with the information they need to join the SV
*: → '''see [[#SV_Responsible:_Manage_the_SV_and_its_Members | SV Responsible: Manage the SV and its Members]]'''
* ...held accountable for the co-workers in the SV
* ...asked to provide information for the two reports required by the DFG for their funding of SDS@hd
* ...likely asked to contribute to a future DFG grant proposal for an extension of the storage system in your area of research ("wissenschaftliches Beiblatt")
==== Join existing SV ====

Your advisor (the SV responsible) will provide you with the following data on the SV:

* acronym
* password

To become a co-worker of an SV, please log in at the [https://sds-hd.urz.uni-heidelberg.de/management/index.php?mode=mitarbeit SDS@hd Managementtool] and provide the acronym and password. You will be assigned to the SV as a member.

After submitting the request you will receive an e-mail about the further steps. The SV owner and any managers will be notified automatically.
=== Step 2: Registration for SDS@hd Service ===

[[File:bwServices_chooseHomeOrganisation.png|right|300px|thumb|Select your home organization]]

After step 1 you have to register your personal account on the storage system and set a service password. Please visit:

* [https://bwservices.uni-heidelberg.de/ https://bwservices.uni-heidelberg.de]
*# Select your home organization from the list and click '''Proceed'''
*# You will be directed to the ''Identity Provider'' of your home organization
*# Enter your home-organization user ID / username and your home-organization password and click the '''Login''' button
*# You will be redirected back to the registration website [https://bwservices.uni-heidelberg.de/ https://bwservices.uni-heidelberg.de/]
*# Under '''The following services are available''', select the service '''SDS@hd - Scientific Data Storage'''
*# Click '''Register'''
*# Finally, set a service password for authentication on SDS@hd
== Manage SV Membership(s) ==

The SDS@hd Managementtool and the bwServices website offer various services to manage your membership. They are described below.

=== SV Member: View SV and Membership Status ===

At the [https://sds-hd.urz.uni-heidelberg.de/management SDS@hd Managementtool] you can...

* ...view the status of your SV memberships (active/inactive)
* ...view the status of the SV (active/inactive, end date)

=== SV Responsible: Manage the SV and its Members ===

As the SV responsible person, you can...

* ...enable others to join the SV
*# Provide them with the SV acronym. You can look it up in the [https://sds-hd.urz.uni-heidelberg.de/management/shib/info_sv.php SDS@hd Managementtool].
*# Provide them with the SV password. You received this password via e-mail; you can reset it in the [https://sds-hd.urz.uni-heidelberg.de/management/shib/info_sv.php SDS@hd Managementtool]. <u>Caution</u>: Make sure to share the SV password and not your personal service password, which you set in step 2 at bwServices.
*# Afterwards, they can join by following the registration steps on this page.
* ...change the status of SV members
* ...change the SV password
* ...hand over the SV responsibility

=== Change Service Password at bwServices ===

→ '''[[Registration/bwForCluster/Helix#Setting_a_New_Service_Password | Setting a New Service Password]]'''