File System Details
On bwUniCluster 3.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors. Nowadays, most of the largest HPC systems use Lustre. An initial home directory on a Lustre file system is created automatically after account activation, and the environment variable $HOME holds its name. Users can create so-called workspaces on another Lustre file system for non-permanent data with a temporary lifetime. There is another workspace file system, based on flash storage, available for special requirements.
Within a batch job further file systems are available:
- The directory $TMPDIR is only available and visible on the local node. It is located on fast SSD storage devices.
- On request a parallel on-demand file system (BeeOND) is created which uses the SSDs of the nodes which were allocated to the batch job.
- On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system IBM Storage Scale.
Some of the characteristics of the file systems are shown in the following table.
Property | $TMPDIR | BeeOND | $HOME | Workspace | Workspace on flash
---|---|---|---|---|---
Visibility | local node | nodes of batch job | global | global | global
Lifetime | batch job runtime | batch job runtime | permanent | max. 240 days | max. 240 days
Disk space | 960 GB - 6.4 TB (details see Table 1) | n*750 GB | 1.1 PiB | 4.6 PiB | 236 TiB
Capacity Quotas | no | no | yes: 500 GiB per user (250 GiB for MA users); also per organization | yes: 40 TiB per user | yes: 1 TiB per user
Inode Quotas | no | no | yes: 5 million per user (2.5 million for MA users) | yes: 20 million per user | yes: 5 million per user
Backup | no | no | yes | no | no
Read perf./node | 500 MB/s - 6 GB/s (depends on type of local SSD / job queue) | 400 MB/s - 500 MB/s (depends on type of local SSDs / job queue) | 5 GB/s | 5 GB/s | 1 GB/s
Write perf./node | 500 MB/s - 4 GB/s (depends on type of local SSD / job queue) | 250 MB/s - 350 MB/s (depends on type of local SSDs / job queue) | 5 GB/s | 5 GB/s | 1 GB/s
Total read perf. | n*500-6000 MB/s | n*400-500 MB/s | 63 GB/s | 45 GB/s | 45 GB/s
Total write perf. | n*500-4000 MB/s | n*250-350 MB/s | 63 GB/s | 40 GB/s | 38 GB/s
global: all nodes of UniCluster access the same file system; local: each node has its own file system; permanent: files are stored permanently; batch job: files are removed at end of the batch job.
Table 2: Properties of the file systems
Selecting the appropriate file system
In general, you should separate your data and store it on the appropriate file system. Permanently needed data like software or important results should be stored below $HOME, but capacity restrictions (quotas) apply. In case you accidentally delete data on $HOME there is a chance that we can restore it from backup. Permanent data which is not needed for months or which exceeds the capacity restrictions should be sent to the LSDF Online Storage or to the archive and deleted from the file systems. Temporary data which is only needed on a single node and which does not exceed the disk space shown in Table 1 should be stored below $TMPDIR. Data which is read many times on a single node, e.g. if you are doing AI training, should be copied to $TMPDIR and read from there. Temporary data which is used from many nodes of your batch job and which is only needed during job runtime should be stored on a parallel on-demand file system. Temporary data which can be recomputed or which is the result of one job and input for another job should be stored in workspaces. The lifetime of data in workspaces is limited and depends on the lifetime of the workspace, which can be several months.
For further details please check the chapters below.
$HOME
The home directories of bwUniCluster 3.0 (uc3) users are located in the parallel file system Lustre. You have access to your home directory from all nodes of uc3. A regular backup of these directories to a tape library is done automatically. The directory $HOME should be used to hold permanently used data like source code, configuration files, executable programs etc.
On uc3 there is a default user quota limit of 500 GiB and 5 million inodes (files and directories) per user. For users of University of Mannheim the limit is 250 GiB and 2.5 million inodes. You can check your current usage and limits with the command
$ lfs quota -uh $(whoami) $HOME
Note that the quota limits mentioned above are soft quota limits. The hard limits (shown in the limits columns) are 10 percent higher. If you are above the soft limit and below the hard limit during the grace period (7 days), your I/O operations will show a warning message. If the grace period has passed or if you are above the hard limit, your I/O operations will abort.
In addition to the user limit there is a limit of your organization (e.g. university) which depends on the financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:
lfs quota -ph $(grep $(echo $HOME | sed -e "s|/[^/]*/[^/]*$||") /pfs/data6/project_ids.txt | cut -f 1 -d\ ) $HOME
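The one-liner above packs several steps into a single command. A step-by-step sketch of the same logic (the variable names are only for illustration; /pfs/data6/project_ids.txt is the mapping file used above):
$ ORG_DIR=$(echo $HOME | sed -e "s|/[^/]*/[^/]*$||")                          # strip the last two path components (group and user) from $HOME
$ PROJECT_ID=$(grep "$ORG_DIR" /pfs/data6/project_ids.txt | cut -f 1 -d' ')   # look up the project ID of your organization
$ lfs quota -ph "$PROJECT_ID" $HOME                                           # show usage and limits of the project quota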
Workspaces
On uc3 workspaces can be used to store large non-permanent data sets, e.g. restart files or output data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for a high throughput to large files. It is able to provide high data transfer rates of up to 40 GB/s write and read performance when data access is parallel.
On uc3 there is a default user quota limit of 40 TiB and 20 million inodes (files and directories) per user. You can check your current usage and limits with the command
$ lfs quota -uh $(whoami) /pfs/work9
Note that the quotas include data and inodes for all of your workspaces and all of your expired workspaces (as long as they are not yet completely removed).
Workspaces have a lifetime and the data on a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace on uc3 is 60 days, but it can be renewed at the end of that period 3 times, up to a total maximum of 240 days after workspace creation.
Creating, deleting, finding, extending and sharing workspaces is explained on the workspace page.
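As a quick orientation, a typical workspace cycle might look like the following sketch (the workspace name myws is an example; see the workspace page for the authoritative syntax and options):
$ ws_allocate myws 60      # create workspace "myws" with a lifetime of 60 days
$ ws_list                  # list your workspaces and their remaining lifetimes
$ ws_find myws             # print the full path of workspace "myws"
$ ws_extend myws 60        # renew "myws" for another 60 days (possible up to 3 times)
$ ws_release myws          # release the workspace once it is no longer needed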
Reminder for workspace deletion
Normally you will get an email every day starting 7 days before a workspace expires. You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:
$ ws_send_ical <workspace> <email>
Restoring expired Workspaces
At expiration time your workspace will be moved to a special, hidden directory. On uc3 expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use
ws_restore -l
to get a list of your expired workspaces, and then restore them into an existing, active workspace (here with name my_restored):
ws_restore <full_name_of_expired_workspace> my_restored
NOTE: The expired workspace has to be specified using the full name as listed by ws_restore -l, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified). The target workspace, on the other hand, must be given with just its short name as listed by ws_list, without the username prefix.
NOTE: ws_restore can only work on the same filesystem. So you have to ensure that the new workspace allocated with ws_allocate is placed on the same filesystem as the expired workspace. Therefore, you can use the -F <filesystem> flag if needed.
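Putting the notes above together, a complete restore might look like the following sketch (here assuming the expired workspace lives on the flash filesystem ffuc; adapt or drop the -F option to match the filesystem reported by ws_restore -l):
$ ws_restore -l                                             # note the full name of the expired workspace
$ ws_allocate -F ffuc my_restored 30                        # allocate a target workspace on the same filesystem
$ ws_restore <full_name_of_expired_workspace> my_restored   # restore the data into the target workspace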
Linking workspaces in Home
It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command
ws_register <DIR>
will create and manage links to all personal workspaces within the directory <DIR>. Calling this command will do the following (see the usage sketch after this list):
- The directory <DIR> will be created if necessary
- Links to all personal workspaces will be managed:
- Creates links to all available workspaces if not already present
- Removes links to released or expired workspaces
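A minimal usage sketch (the directory name ~/workspaces is just an example):
$ ws_register ~/workspaces      # create or update links to all of your workspaces below ~/workspaces
$ ls -l ~/workspaces            # each link points to the corresponding workspace directory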
Improving Performance on $HOME and workspaces
The following recommendations might help to improve throughput and metadata performance on Lustre filesystems.
Improving Throughput Performance
Depending on your application some adaptations might be necessary if you want to reach the full bandwidth of the filesystems.
When you are designing your application you should consider that the performance of parallel filesystems is generally better if data is transferred in large blocks and stored in few large files. In more detail, to increase the throughput performance of a parallel application the following aspects should be considered:
- collect large chunks of data and write them sequentially at once,
- to exploit the complete filesystem bandwidth, use several clients,
- avoid competitive file access by different tasks, or use blocks with boundaries at the stripe size (default is 1 MB),
- if files are small enough for the SSDs and are only used from one node, store them on $TMPDIR.
Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also mentioned as chunks) is called stripe size and the number of used storage subsystems is called stripe count.
With previous Lustre versions adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc3 the Lustre feature Progressive File Layout is used to define file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases users no longer need to adapt file striping parameters, even for very large files or to reach better performance. If you know what you are doing you can still change striping parameters, but further explanation is beyond the scope of this documentation.
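If you only want to inspect how a file is currently striped, you can do so without changing anything, for example (a sketch; the workspace name myws and the file name are placeholders):
$ lfs getstripe $(ws_find myws)/my_large_file      # show stripe count, stripe size and the OSTs used for this file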
Improving Metadata Performance
Metadata performance on parallel file systems is usually not as good as with local filesystems. In addition, it is usually not scalable, i.e. a limited resource. Therefore, you should avoid metadata operations whenever possible. For example, it is much better to have few large files than lots of small files. In more detail, to increase the metadata performance of a parallel application the following aspects should be considered (see the sketch after this list):
- avoid creating many small files
- avoid competitive directory access, e.g. by creating files in separate subdirectories for each task
- if many small files are only used within a batch job and accessed by one node store them on $TMPDIR
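A sketch of the second recommendation for a multi-task Slurm job (the application name myapp and its option are placeholders; SLURM_PROCID is the standard Slurm task rank variable):
# executed by every task, e.g. via srun
MY_DIR=$TMPDIR/task_${SLURM_PROCID}        # one private subdirectory per task avoids competitive directory access
mkdir -p $MY_DIR
myapp --scratch-dir $MY_DIR                # each task creates its temporary files in its own subdirectory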
Workspaces on flash storage
There is another workspace file system for special requirements available. The file system is called full flash pfs and is based on the parallel file system Lustre.
Advantages of this file system
- From the Ice Lake nodes of bwUniCluster 3.0 (queue cpu_il) the network distance and latency is low compared to the normal workspace file system.
- All storage devices are based on flash (no hard disks) with low access times. Hence performance is better compared to other parallel file systems for read and write access with small blocks and with small files, i.e. IOPS rates are improved.
- The file system is mounted on bwUniCluster 3.0 and HoreKa, i.e. it can be used to share data between these clusters.
Access restrictions
Only HoreKa users or KIT users of bwUniCluster 3.0 can use this file system.
Using the file system
As KIT or HoreKa user you can use the file system in the same way as a normal workspace. You just have to specify the name of the flash-based workspace file system using the option -F to all the commands that manage workspaces. On bwUniCluster 3.0 it is called ffuc, on HoreKa it is ffhk. For example, to create a workspace with name myws and a lifetime of 60 days on bwUniCluster 3.0 execute:
ws_allocate -F ffuc myws 60
If you want to use the full flash pfs on bwUniCluster 3.0 and HoreKa at the same time, please note that you only have to manage a particular workspace on one of the clusters since the name of the workspace directory differs between the clusters. However, the path to each workspace is visible and can be used on both clusters.
Other features are similar to normal workspaces. For example, you can restore expired workspaces within 30 days after workspace expiration. There are quota limits with a default limit of 1 TiB capacity and 5 million inodes per user. You can check your current usage with
lfs quota -uh $(whoami) /pfs/work8
$TMPDIR
The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means that different tasks of a parallel application use different directories when they do not utilize the same node. Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the content of this directory path on these nodes are different.
This directory should be used for temporary files being accessed from the local node during job runtime. It should also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.
The $TMPDIR directory is located on extremely fast local SSD storage devices. This means that performance on small files is much better than on the parallel file systems. The capacity of the local SSDs for each node type is different and can be checked in Table 1. The capacity of $TMPDIR is at least 1400 GB.
Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. $TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique for each job. At the end of the job the subdirectory is removed.
All data on $TMPDIR will be deleted when your job completes. Make sure you have copied your results back to a global filesystem, e.g. $HOME or a workspace, within your job.
On login nodes $TMPDIR also points to a fast directory on a local SSD disk but this directory is not unique. It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the installation of software packages. This means that the software package to be installed should be unpacked, compiled and linked in a subdirectory of $TMPDIR. The real installation of the package (e.g. make install) should be made into the $HOME folder.
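A minimal sketch for creating such a unique subdirectory on a login node (mktemp generates a random, collision-free name; the prefix build. is arbitrary):
$ BUILD_DIR=$(mktemp -d $TMPDIR/build.XXXXXX)    # unique directory on the local SSD of the login node
$ cd $BUILD_DIR                                  # unpack, configure and compile the software package here
# install the finished package below $HOME, e.g. ./configure --prefix=$HOME/software/mytool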
Note that you should not use /tmp or /scratch! Please use $TMPDIR instead.
Usage example for $TMPDIR
We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR.
If you have a data set with many files which is frequently used by batch jobs you should create a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. Such an archive can be read efficiently from a parallel file system since it is a single huge file. On a login node you can create such an archive with the following steps:
# Create a workspace to store the archive
[ab1234@uc3n991 ~]$ ws_allocate data-ssd 60
# Create the archive from a local dataset folder (example)
[ab1234@uc3n991 ~]$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/
Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR and save the results on a workspace:
#!/bin/bash
# very simple example on how to use local $TMPDIR
#SBATCH -N 1
#SBATCH -t 24:00:00
# Extract compressed input dataset on local SSD
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz
# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results
# Before job completes save results on a workspace
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/
LSDF Online Storage
The LSDF Online Storage allows dedicated users to store scientific measurement data and simulation results. BwUniCluster 3.0 has an extremely fast network connection to the LSDF Online Storage. This file system provides external access via different protocols. It is only available for certain users. For information on how to request storage projects on the LSDF Online Storage see https://www.scc.kit.edu/en/services/lsdf.
The LSDF Online Storage is mounted on the login nodes. It will also be mounted on the compute nodes of your batch job if you request it with the constraint flag LSDF. You have one of the following options:
1. Add after the initial lines of your job script job.sh the line #SBATCH --constraint=LSDF:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=120
#SBATCH --mem=200
#SBATCH --constraint=LSDF
2. Add the constraint on command line:
$ sbatch -p <queue> -n 1 -t 2:00:00 --mem 200 job.sh -C LSDF
In order to access the LSDF Online Storage the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.
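A sketch of how these variables might be used inside a batch job that requested the LSDF constraint (the project name my_project and the file names are placeholders):
# copy input data from the LSDF Online Storage to the fast local SSD
cp $LSDFPROJECTS/my_project/input.dat $TMPDIR/
# ... run the application reading from and writing to $TMPDIR ...
# copy results back to the LSDF Online Storage before the job ends
cp $TMPDIR/results.dat $LSDFPROJECTS/my_project/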
BeeOND (BeeGFS On-Demand)
Users have the possibility to request a private BeeOND (on-demand BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged when your job completes.
All data on the private BeeOND filesystem will be deleted when your job completes. Make sure you have copied your results back to a global filesystem, e.g. $HOME or a workspace, within your job.
BeeOND/BeeGFS can be used like any other parallel file system. All nodes of the batch job have access to the same data below the same path. Tools like cp or rsync can be used to copy data in and out.
Starting and stopping BeeOND is integrated in the prolog and epilog of the cluster batch system Slurm. It can be used during job runtime if compute nodes are used exclusively. You can request the creation of a BeeOND file system with the constraint flags BEEOND, BEEOND_4MDS or BEEOND_MAXMDS.
- BEEOND: one metadata server is started on the first node
- BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.
- BEEOND_MAXMDS: on every node of your job a metadata server for the on-demand file system is started
As a starting point we recommend using the constraint BEEOND. You have one of the following options to request the constraint:
1. Add the line #SBATCH --constraint=BEEOND to your job script:
#!/bin/bash
#SBATCH ...
#SBATCH --constraint=BEEOND # or BEEOND_4MDS or BEEOND_MAXMDS
2. Add the constraint on command line:
$ sbatch -p <queue> -N <# of nodes> -t <runtime> --mem <mem> -C BEEOND job.sh
After your job has started you can find the private on-demand file system in the /mnt/odfs/${SLURM_JOB_ID} directory. The mountpoint comes with five pre-configured directories:
# For small files (stripe count = 1)
/mnt/odfs/${SLURM_JOB_ID}/stripe_1
# Stripe count = 4
/mnt/odfs/${SLURM_JOB_ID}/stripe_default
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_4
# Stripe count = 8, 16 or 32, use these directories for medium-sized and large files or when using MPI-IO
/mnt/odfs/${SLURM_JOB_ID}/stripe_8
/mnt/odfs/${SLURM_JOB_ID}/stripe_16
# or
/mnt/odfs/${SLURM_JOB_ID}/stripe_32
If you request fewer nodes than the stripe count, the stripe count will be the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 has only a stripe count of 8.
Attention: Always use the directory with the greatest stripe count for large files. E.g. if your largest file is 3.1 TB, you have to use a stripe count greater than 4 (4 x 750 GB), otherwise the available disk space is exceeded.
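A usage sketch for a job that requested BEEOND (the application name myapp, the workspace data-ssd and the file names are placeholders; data is staged in from a workspace and results are copied back before the job completes):
ODFS=/mnt/odfs/${SLURM_JOB_ID}/stripe_8                                  # pick a stripe directory matching your file sizes
cp $(ws_find data-ssd)/big_input.dat $ODFS/                              # stage input data into the on-demand file system
myapp -input $ODFS/big_input.dat -outputdir $ODFS/results                # all nodes of the job see the same data below $ODFS
rsync -av $ODFS/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/     # save results to a workspace before the job ends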
The capacity of the private file system depends on the number of nodes. For each node you get 750 GB. If you request 100 nodes for your job, the private file system has a capacity of approximately 100 * 750 GB, i.e. about 75 TB.
Backup and Archiving
There are regular backups of all data of the home directories, whereas ACLs and extended attributes will not be saved in backups.
Please open a ticket if you need to restore data from backup.