BwUniCluster2.0/Hardware and Architecture

1 Architecture of bwUniCluster 2.0

The bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and optionally accelerators (NVIDIA Tesla V100, A100 or H100). All nodes are connected by a fast InfiniBand interconnect. In addition, the Lustre file system, which is connected by coupling the InfiniBand fabric of the file servers to the InfiniBand switch of the compute cluster, is part of bwUniCluster 2.0 and provides a fast and scalable parallel file system.

The operating system on each node is Red Hat Enterprise Linux (RHEL) 8.4. A number of additional software packages, such as SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly discussed in this document. Others, which are of greater importance to system administrators, are not covered here.

The individual nodes of the system may act in different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user's point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.

Login Nodes

The login nodes are the only nodes that are directly accessible by end users. They are used for interactive login, file management, program development and interactive pre- and postprocessing. Three nodes are dedicated to this service; they are all reachable via a single address, and a DNS round-robin alias distributes the login sessions across the different login nodes.

Compute Nodes

The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).
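As a minimal illustration (not an official template), a job script could be submitted and monitored as in the following sketch; the queue name single is taken from Table 2 below, and the resource values and the script name are placeholders:

<source lang="bash">
# submit a job script to the SLURM batch system (sketch; values are placeholders)
sbatch --partition=single --ntasks=1 --time=01:00:00 myjob.sh
# show the state of your queued and running jobs
squeue -u $(whoami)
</source>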

File Server Nodes

The parallel file system Lustre is provided by dedicated file server nodes; it is connected to the cluster by coupling the InfiniBand fabric of the file servers to the independent InfiniBand switch of the compute cluster. In addition to the shared file space there is also local storage on the disks of each node (for details see the chapter "File Systems").

Administrative Server Nodes

Some other nodes deliver additional services such as resource management, external network connectivity and administration. These nodes can be accessed directly by system administrators only.

2 Components of bwUniCluster

Compute nodes "Thin" Compute nodes "HPC" Compute nodes "IceLake" Compute nodes "Fat" GPU x4 GPU x8 IceLake + GPU x4 Login
Number of nodes 200 + 60 260 272 6 14 10 15 3
Processors Intel Xeon Gold 6230 Intel Xeon Gold 6230 Intel Xeon Platinum 8358 Intel Xeon Gold 6230 Intel Xeon Gold 6230 Intel Xeon Gold 6248 Intel Xeon Platinum 8358
Number of sockets 2 2 2 4 2 2 2 2
Processor frequency (GHz) 2.1 Ghz 2.1 Ghz 2.6 Ghz 2.1 Ghz 2.1 Ghz 2.6 Ghz 2.5 Ghz
Total number of cores 40 40 64 80 40 40 64 40
Main memory 96 GB / 192 GB 96 GB 256 GB 3 TB 384 GB 768 GB 512 GB 384 GB
Local SSD 960 GB SATA 960 GB SATA 1,8 TB NVMe 4,8 TB NVMe 3,2 TB NVMe 15 TB NVMe 6,4 TB NVMe
Accelerators - - - - 4x NVIDIA Tesla V100 8x NVIDIA Tesla V100 4x NVIDIA A100 / 4x NVIDIA H100 -
Accelerator memory - - - - 32 GB 32 GB 80 GB / 94 GB -
Interconnect IB HDR100 (blocking) IB HDR100 IB HDR200 IB HDR IB HDR IB HDR IB HDR200 IB HDR100 (blocking)

Table 1: Properties of the nodes

3 File Systems

On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors; most of the largest HPC systems use Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with a temporary lifetime. A further workspace file system based on flash storage is available for special requirements.

Within a batch job further file systems are available:

  • The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.
  • On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes allocated to the batch job.
  • On request the external LSDF Online Storage is mounted on the nodes allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.

Some of the characteristics of the file systems are shown in Table 2.

{| class="wikitable"
|-
! Property
! $TMPDIR
! BeeOND
! $HOME
! Workspace
! Workspace on flash
|-
! Visibility
| local node
| nodes of batch job
| global
| global
| global
|-
! Lifetime
| batch job runtime
| batch job runtime
| permanent
| max. 240 days
| max. 240 days
|-
! Disk space
| 960 GB - 6.4 TB <br> (details see Table 1)
| n*750 GB
| 1.2 PiB
| 4.1 PiB
| 236 TiB
|-
! Capacity Quotas
| no
| no
| yes <br> 1 TiB per user, for MA users 256 GiB <br> also per organization
| yes <br> 40 TiB per user
| yes <br> 1 TiB per user
|-
! Inode Quotas
| no
| no
| yes <br> 10 million per user, for MA users 2.5 million
| yes <br> 30 million per user
| yes <br> 5 million per user
|-
! Backup
| no
| no
| yes
| no
| no
|-
! Read perf./node
| 500 MB/s - 6 GB/s <br> depends on type of local SSD / job queue: <br> 520 MB/s @ single / multiple <br> 800 MB/s @ multiple_e <br> 6600 MB/s @ fat <br> 6500 MB/s @ gpu_4 <br> 6500 MB/s @ gpu_8
| 400 MB/s - 500 MB/s <br> depends on type of local SSDs / job queue: <br> 500 MB/s @ multiple <br> 400 MB/s @ multiple_e
| 1 GB/s
| 1 GB/s
| 1 GB/s
|-
! Write perf./node
| 500 MB/s - 4 GB/s <br> depends on type of local SSD / job queue: <br> 500 MB/s @ single / multiple <br> 650 MB/s @ multiple_e <br> 2900 MB/s @ fat <br> 2090 MB/s @ gpu_4 <br> 4060 MB/s @ gpu_8
| 250 MB/s - 350 MB/s <br> depends on type of local SSDs / job queue: <br> 350 MB/s @ multiple <br> 250 MB/s @ multiple_e
| 1 GB/s
| 1 GB/s
| 1 GB/s
|-
! Total read perf.
| n*500-6000 MB/s
| n*400-500 MB/s
| 18 GB/s
| 54 GB/s
| 45 GB/s
|-
! Total write perf.
| n*500-4000 MB/s
| n*250-350 MB/s
| 18 GB/s
| 54 GB/s
| 38 GB/s
|}

 global: all nodes of UniCluster access the same file system;
 local: each node has its own file system;
 permanent: files are stored permanently;
 batch job: files are removed at the end of the batch job.

Table 2: Properties of the file systems

3.1 Selecting the appropriate file system

In general, you should separate your data and store it on the appropriate file system. Permanently needed data like software or important results should be stored below $HOME, but capacity restrictions (quotas) apply. If you accidentally delete data on $HOME you can usually restore it from backup. Permanent data which is not needed for months or which exceeds the capacity restrictions should be sent to the LSDF Online Storage or to the archive and deleted from the file systems. Temporary data which is only needed on a single node and which does not exceed the disk space shown in the table above should be stored below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes of your batch job and which is only needed during job runtime should be stored on a parallel on-demand file system. Temporary data which can be recomputed or which is the result of one job and input for another job should be stored in workspaces. The lifetime of data in workspaces is limited and depends on the lifetime of the workspace, which can be several months.

For further details please check the chapters below.

3.2 $HOME

The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre. You have access to your home directory from all nodes of uc2. A regular backup of these directories to a tape archive is done automatically. The directory $HOME holds files that are permanently used, such as source code, configuration files and executable programs.

On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. For users of the University of Mannheim the limit is 256 GiB and 2.5 million inodes. You can check your current usage and limits with the command

$ lfs quota -uh $(whoami) $HOME

In addition to the user limit there is a limit for your organization (e.g. university) which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:

lfs quota -ph $(grep $(echo $HOME | sed -e "s|/[^/]*/[^/]*$||") /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME

3.3 Workspaces

On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large files. It is able to provide high data transfer rates of up to 54 GB/s write and read performance when data access is parallel.

Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period 3 times, to a total maximum of 240 days after workspace generation. If a workspace has inadvertently expired we can restore the data for a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket.

Creating, deleting, finding and extending workspaces is explained on the [[workspace]] page.
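For orientation, the following sketch shows the typical workspace commands; the workspace name myws is a placeholder and the [[workspace]] page remains the authoritative reference:

<source lang="bash">
# create a workspace named myws with a lifetime of 60 days
ws_allocate myws 60
# print the path of the workspace, e.g. for use in scripts
ws_find myws
# list your workspaces and their remaining lifetimes
ws_list
# release the workspace once it is no longer needed
ws_release myws
</source>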

On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. You can check your current usage and limits with the command

$ lfs quota -uh $(whoami) /pfs/work7

Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).

3.3.1 Reminder for workspace deletion

Normally you will get an email every day, starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:

$ ws_send_ical.sh <workspace> <email>

3.4 Improving Performance on $HOME and workspaces

The following recommendations might help to improve throughput and metadata performance on Lustre filesystems.

3.4.1 Improving Throughput Performance

Depending on your application some adaptations might be necessary if you want to reach the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called the stripe size and the number of used storage subsystems is called the stripe count.

When you are designing your application you should consider that the performance of parallel filesystems is generally better if data is transferred in large blocks and stored in few large files. In more detail, to increase the throughput performance of a parallel application the following aspects should be considered:

  • collect large chunks of data and write them sequentially at once,
  • to exploit the complete filesystem bandwidth use several clients,
  • avoid competitive file access by different tasks or use blocks with boundaries at the stripe size (the default is 1 MB; see the short illustration after this list),
  • if files are small enough for the SSDs and are only used by one process store them on $TMPDIR.
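As a simple illustration of the large-block recommendation, the following sketch copies a file into a workspace using 1 MB blocks, matching the default stripe size; the file name and the workspace name myws are placeholders:

<source lang="bash">
# transfer data in 1 MB blocks so that writes are aligned to the default stripe size
dd if=$HOME/input.dat of=$(ws_find myws)/input.dat bs=1M
</source>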

With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users therefore no longer need to adapt file striping parameters, even for very large files or to reach better performance.

If you know what you are doing you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir you can use the command

$ lfs setstripe -c-1 $HOME/my_output_dir

to change the stripe count to -1 which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory the stripe count of existing files inside this directory is not changed. If you want to change the stripe count of existing files, change the stripe count of the parent directory, copy the files to new files, remove the old files and move the new files back to the old name. In order to check the stripe setting of the file my_file use

$ lfs getstripe my_file
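The copy-and-rename procedure described above could look like the following sketch; the directory name and the *.dat pattern are placeholders:

<source lang="bash">
# let new files in this directory be striped over all storage targets
lfs setstripe -c -1 $HOME/my_output_dir
# re-create each existing file so that it picks up the new striping
for f in $HOME/my_output_dir/*.dat; do
    cp "$f" "$f.restriped"    # the copy is written with the new stripe count
    rm "$f"                   # remove the old file
    mv "$f.restriped" "$f"    # move the copy back to the original name
done
# verify the resulting layout
lfs getstripe $HOME/my_output_dir
</source>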

Also note that changes to the striping parameters (e.g. the stripe count) are not saved in the backup, i.e. if directories have to be recreated this information is lost and the default stripe count will be used. Therefore, you should note down for which directories you changed the striping parameters so that you can repeat these changes if required.

3.4.2 Improving Metadata Performance

Metadata performance on parallel file systems is usually not as good as with local filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore, you should omit metadata operations whenever possible. For example, it is much better to have few large files than lots of small files. In more detail, to increase the metadata performance of a parallel application the following aspects should be considered:

  • avoid creating many small files,
  • avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,
  • if many small files are only used within a batch job and accessed by one process store them on $TMPDIR,
  • change the default colorization setting of the command ls (see below).

On modern Linux systems, the GNU ls command often uses colorization by default to visually highlight the file type; this is especially true if the command is run within a terminal session. This is because the default shell profile initializations usually contain an alias directive similar to the following for the ls command:

$ alias ls="ls --color=tty"

However, running the ls command in this way for files on a Lustre file system requires a stat() call to be used to determine the file type. This can result in a performance overhead, because the stat() call always needs to determine the size of a file, and that in turn means that the client node must query the object size of all the backing objects that make up a file. As a result of the default colorization setting, running a simple ls command on a Lustre file system often takes as much time as running the ls command with the -l option (the same is true if the -F, -p, or the --classify option, or any other option that requires information from a stat() call, is used). To avoid this performance overhead when using ls commands, add an alias directive similar to the following to your shell startup script:

$ alias ls="ls --color=never"

3.5 Workspaces on flash storage

There is another workspace file system available for special requirements. This file system is called full flash pfs and is based on the parallel file system Lustre.

3.5.1 Advantages of this file system

  1. All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.
  2. The file system is mounted on bwUniCluster 2.0 and HoreKa, i.e. it can be used to share data between these clusters.

3.5.2 Access restrictions

Only HoreKa users or KIT users of bwUniCluster 2.0 can use this file system.

3.5.3 Using the file system

As a KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system with the option -F in all commands that manage workspaces. On bwUniCluster 2.0 it is called ffuc, on HoreKa it is ffhk. For example, to create a workspace with the name myws and a lifetime of 60 days on bwUniCluster 2.0 execute:

ws_allocate -F ffuc myws 60

If you want to use the full flash pfs on bwUniCluster 2.0 and HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory is different. However, the path to each workspace is visible and can be used on both clusters.

Other features are similar to normal workspaces. For example, we are able to restore expired workspaces for a few weeks, and you have to open a ticket to request the restore. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with

lfs quota -uh $(whoami) /pfs/work8

3.6 $TMPDIR

The environment variable $TMPDIR contains the name of a directory which is local to each node. This means that different tasks of a parallel application use different directories when they do not utilize the same node. Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the content of this directory path on these nodes is different.

This directory should be used for temporary files being accessed from the local node during job runtime. It should also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there.

The $TMPDIR directory is located on extremely fast SSD local storage devices. This means that performance on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type is different and can be checked in Table 1 above. The capacity of $TMPDIR is at least 800 GB.

Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. $TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique for each job. At the end of the job the subdirectory is removed.

On login nodes $TMPDIR also points to a fast directory on a local SSD, but this directory is not unique. It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the installation of software packages. This means that the software package to be installed should be unpacked, compiled and linked in a subdirectory of $TMPDIR. The real installation of the package (e.g. make install) should be made into the $HOME folder.
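A hypothetical sketch of such an installation on a login node; the package name mytool-1.0, the download path and the configure/make workflow are assumptions for illustration:

<source lang="bash">
# create your own unique build directory below $TMPDIR on the login node
BUILDDIR=$TMPDIR/$USER-build-$$
mkdir -p $BUILDDIR && cd $BUILDDIR
# unpack, configure and compile on the fast local SSD
tar -xzf $HOME/downloads/mytool-1.0.tar.gz
cd mytool-1.0
./configure --prefix=$HOME/sw/mytool-1.0   # install target below $HOME, not on the SSD
make -j 8
# the real installation goes into the $HOME folder
make install
</source>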

3.6.1 Usage example for $TMPDIR

We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR.

If you have a data set with many files which is frequently used by batch jobs you should create a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. Such an archive can be read efficiently from a parallel file system since it is a single huge file. On a login node you can create such an archive with the following steps:

<source lang="bash">
# Create a workspace to store the archive
[ab1234@uc2n997 ~]$ ws_allocate data-ssd 60
# Create the archive from a local dataset folder (example)
[ab1234@uc2n997 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/
</source>

Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR and save the results on a workspace:

<source lang="bash">
#!/bin/bash
# very simple example on how to use local $TMPDIR
#SBATCH -N 1
#SBATCH -t 24:00:00

# Extract compressed input dataset on local SSD
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz

# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results

# Before job completes save results on a workspace
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/
</source>

3.7 LSDF Online Storage

In some cases it is useful to have access to the LSDF Online Storage on the HPC clusters as well. Therefore the LSDF Online Storage is mounted on the login and data mover nodes. Furthermore, it can be used on the compute nodes during the job runtime with the constraint flag "LSDF" (see [[BwUniCluster_2.0_Slurm_common_Features|Slurm common features]]). There is also an example of LSDF batch usage: Slurm LSDF example.

<source lang="bash">
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=LSDF
</source>


For the usage of the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME. Please request storage projects in the LSDF Online Storage separately: [https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request].
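As a small hypothetical sketch, results could be copied from the local SSD to the LSDF Online Storage at the end of such a job; the subdirectory myproject is a placeholder:

<source lang="bash">
# save results to the LSDF Online Storage before the job ends
rsync -av $TMPDIR/results/ $LSDFHOME/myproject/results/
</source>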

3.8 BeeOND (BeeGFS On-Demand)

Users of the UniCluster have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.

IMPORTANT:
All data on the private filesystem will be deleted after your job. Make sure you have copied your data back to the global filesystem (within the job), e.g. $HOME or any workspace.

BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out.
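A heavily hedged sketch of a possible BeeOND workflow is shown below; the constraint name BEEOND and the mount point /mnt/odfs/$SLURM_JOB_ID are assumptions for illustration only, and myapp as well as the workspace name data-ssd are placeholders; the page linked below is the authoritative reference:

<source lang="bash">
#!/bin/bash
#SBATCH -N 4
#SBATCH -t 08:00:00
#SBATCH --constraint=BEEOND        # assumed constraint name for requesting BeeOND

# assumed mount point of the private on-demand file system
ONDEMAND=/mnt/odfs/$SLURM_JOB_ID

# stage input data into the private file system
cp -r $(ws_find data-ssd)/dataset $ONDEMAND/

# run the parallel application against the on-demand file system (myapp is a placeholder)
srun myapp --input $ONDEMAND/dataset --output $ONDEMAND/results

# copy results back before the job ends, because BeeOND is purged afterwards
rsync -av $ONDEMAND/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/
</source>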

For detailed usage see here: [[BwUniCluster_2.0_Slurm_common_Features#BeeOND_.28BeeGFS_On-Demand.29|Request on-demand file system]]

3.9 Backup and Archiving

There are regular backups of all data in the home directories, whereas ACLs and extended attributes are not backed up.

Please open a ticket if you need data restored from backup.