BwUniCluster2.0/Hardware and Architecture
Architecture of bwUniCluster 2.0
bwUniCluster 2.0 is a parallel computer with distributed memory. Each node of the system consists of at least two Intel Xeon processors, local memory, disks, network adapters and, optionally, accelerators (NVIDIA Tesla V100). All nodes are connected by a fast InfiniBand interconnect (see the table below for the interconnect of each node type). In addition, the file system Lustre, which is connected by coupling the InfiniBand of the file servers with the InfiniBand switch of the compute cluster, provides fast and scalable parallel storage.
The operating system on each node is Red Hat Enterprise Linux (RHEL) 7.x. A number of additional software packages, such as SLURM, have been installed on top. Some of these components are of special interest to end users and are briefly discussed in this document. Others, which are of greater importance to system administrators, are not covered here.
The individual nodes of the system may act in different roles. According to the services they supply, the nodes are separated into disjoint groups. From an end user's point of view, the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.
Login Nodes
The login nodes are the only nodes that are directly accessible to end users. These nodes are used for interactive login, file management, program development and interactive pre- and postprocessing. Two nodes are dedicated to this service, but they are all accessible via one address; a DNS round-robin alias distributes the login sessions across the different login nodes.
Compute Nodes
The majority of nodes are compute nodes which are managed by a batch system. Users submit their jobs to the SLURM batch system and a job is executed when the required resources become available (depending on its fair-share priority).
File Server Nodes
The parallel file system Lustre is provided by dedicated file server nodes; it is connected by coupling the InfiniBand of the file servers with the independent InfiniBand switch of the compute cluster. In addition to the shared file space there is also local storage on the disks of each node (for details see the chapter "File Systems").
Administrative Server Nodes
Some other nodes deliver additional services such as resource management, external network connectivity, administration, etc. These nodes can be accessed directly by system administrators only.
Components of bwUniCluster
Compute nodes "Thin" | Compute nodes "HPC" | Compute nodes "HPC Broadwell" | Compute nodes "Fat" | GPU x4 | GPU x8 | Login | |
---|---|---|---|---|---|---|---|
Number of nodes | 100 | 360 | 352 | 6 | 14 | 10 | 4 + 2 (Broadwell) |
Processors | Intel Xeon Gold 6230 | Intel Xeon Gold 6230 | Intel Xeon E5-2660 v4 | Intel Xeon Gold 6230 | Intel Xeon Gold 6230 | Intel Xeon Gold 6248 | |
Number of sockets | 2 | 2 | 2 | 4 | 2 | 2 | 2 |
Processor frequency (GHz) | 2.1 Ghz | 2.1 Ghz | 2.0 GHz | 2.1 Ghz | 2.1 Ghz | 2.1 Ghz | |
Total number of cores | 40 | 40 | 28 | 80 | 40 | 40 | 40 / 20 (Broadwell) |
Main memory | 96 GB | 96 GB | 128 GB | 3 TB | 384 GB | 768 GB | 384 GB / 128 GB (Broadwell) |
Local disk | 960 GB SATA | 960 GB SATA | 480 GB SATA | 4,8 TB NVMe | 3,2 TB NVMe | 6,4 TB NVMe | |
Accelerators | - | - | - | - | 4x NVIDIA Tesla V100 | 8x NVIDIA Tesla V100 | |
Interconnect | IB HDR100 (blocking) | IB HDR100 | IB FDR | IB HDR | IB HDR | IB HDR | IB HDR100 (blocking) |
File Systems
Details about the changes to the file systems between bwUniCluster 1 and bwUniCluster 2.0 are described in the File system migration guide. Note that $WORK is deprecated.
On bwUniCluster 2.0 the parallel file system Lustre is used for most globally visible user data. Lustre is open source, and Lustre solutions and support are available from different vendors; many of the largest HPC systems use Lustre. An initial home directory on a Lustre file system is created during the first login, and the environment variable $HOME holds its path. Users can create so-called workspaces on another Lustre file system for non-permanent data with a temporary lifetime.
Within a batch job further file systems are available:
- The directory $TMP is only available and visible on the local node. It is located on fast SSD storage devices.
- On request a parallel on-demand file system is created which uses the SSDs of the nodes which were allocated to the batch job.
- On request the external LSDF Online Storage is mounted on the nodes which were allocated to the batch job. This file system is based on the parallel file system Spectrum Scale.
Selecting the appropriate file system
In general, you should separate your data and store it on the appropriate file system:
- Permanently needed data like software or important results should be stored below $HOME, but capacity restrictions (quotas) apply. If you accidentally deleted data on $HOME you can usually restore it from the backup.
- Permanent data which is not needed for months or which exceeds the capacity restrictions should be sent to bwFileStorage, to the LSDF Online Storage, or to the archive, and then be deleted from the file systems.
- Temporary data which is only needed on a single node and which does not exceed the disk space shown in the table above should be stored below $TMP.
- Temporary data which is only needed during job runs should be stored on a parallel on-demand file system.
- Temporary data which can be recomputed, or which is the result of one job and input for another job, should be stored in workspaces. The lifetime of data in workspaces is limited and depends on the lifetime of the workspace, which can be several months.
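As an illustration of this guidance, the following job script sketch stages small input files to the node-local $TMP, runs the computation there and copies the result to a workspace before the job ends. The program name my_simulation, the input directory and the workspace name my_workspace are placeholders; ws_find is one of the workspace tools (see the Workspaces chapter below).
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=02:00:00
# stage small input files to the fast node-local SSD
cp $HOME/input/*.dat $TMP/
cd $TMP
# run the computation with node-local input and output
$HOME/bin/my_simulation *.dat > results.out
# copy results you want to keep to a workspace; $TMP is not kept after the job
cp results.out $(ws_find my_workspace)/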
The most efficient way to transfer data to/from other HPC file systems or bwFileStorage is the tool rdata.
For further details please check the chapters below.
$HOME
The home directories of bwUniCluster 2.0 (uc2) users are located in the parallel file system Lustre. You have access to your home directory from all nodes of uc2. A regular backup of these directories to a tape archive is done automatically. The directory $HOME is used to hold files that are permanently needed, such as source code, configuration files, executable programs, etc.
On uc2 there is a default user quota limit of 1 TiB and 10 million inodes (files and directories) per user. You can check your current usage and limits with the command
$ lfs quota -uh $(whoami) $HOME
In addition to the user limit there is a limit for your organization (e.g. your university), which depends on its financial share. This limit is enforced with so-called Lustre project quotas. You can show the current usage and limits of your organization with the following command:
$ lfs quota -ph $(grep $(echo $HOME | sed -e "s|/[^/]*/[^/]*$||") /pfs/data5/project_ids.txt | cut -f 1 -d\ ) $HOME
Workspaces
On uc2 workspaces can be used to store large non-permanent data sets, e.g. restart files or output data that has to be post-processed. The file system used for workspaces is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large files. It can provide data transfer rates of up to 54 GB/s for writing and reading when data access is parallel.
Workspaces have a lifetime, and the data in a workspace expires as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed 3 times at the end of that period, up to a total maximum of 240 days after workspace creation. If a workspace has inadvertently expired, we can restore the data within a limited time (a few weeks). In this case you should create a new workspace and report the names of the new and of the expired workspace in a ticket or in an email to the hotline.
Creating, deleting, finding and extending workspaces is explained on the workspace page.
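A typical sequence with the workspace tools might look as follows; the workspace name my_data and the durations are examples, and the exact options are documented on the workspace page:
$ ws_allocate my_data 60      # create a workspace named my_data with a lifetime of 60 days
$ ws_list                     # list your workspaces and their remaining lifetimes
$ ws_find my_data             # print the full path of the workspace
$ ws_extend my_data 60        # renew the workspace at the end of its lifetime
$ ws_release my_data          # release the workspace once the data is no longer needed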
On uc2 there is a default user quota limit of 40 TiB and 30 million inodes (files and directories) per user. You can check your current usage and limits with the command
$ lfs quota -uh $(whoami) /pfs/work7
Reminder for workspace deletion
Normally you will get an email every day, starting 7 days before a workspace expires. You can also send yourself a calendar entry that reminds you when a workspace will be deleted automatically:
$ ws_send_ical.sh <workspace> <email>
Improving Performance on $HOME and workspaces
The following recommendations might help to improve throughput and metadata performance on Lustre filesystems.
Improving Throughput Performance
Depending on your application, some adaptations might be necessary if you want to reach the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also referred to as chunks) is called stripe size and the number of used storage subsystems is called stripe count.
When you are designing your application you should consider that the performance of parallel filesystems is generally better if data is transferred in large blocks and stored in few large files. In more detail, to increase the throughput performance of a parallel application the following aspects should be considered:
- collect large chunks of data and write them sequentially at once (see the example after this list),
- use several client nodes to exploit the full bandwidth of the filesystem,
- avoid competitive file access by different tasks, or use blocks with boundaries at the stripe size (default is 1 MB),
- if files are small enough for the local SSDs and are only used by one process, store them on $TMP.
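As a simple illustration of transferring data in large blocks, the following command copies a file into a workspace reading and writing 1 MB blocks, which matches the default stripe size (the file name and the workspace name my_workspace are placeholders):
$ dd if=$HOME/input.dat of=$(ws_find my_workspace)/input.dat bs=1M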
With previous Lustre versions, adapting the Lustre stripe count was the most important optimization. However, for the workspaces of uc2 the newer Lustre feature Progressive File Layouts is used to define the file striping parameters. This means that the stripe count is adapted as the file size grows. In normal cases, users therefore no longer need to adapt the file striping parameters themselves, even if they have very huge files or want to reach better performance.
If you know what you are doing, you can still change striping parameters, e.g. the stripe count, of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. For example, if you want to enhance throughput on a single very large file which is created in the directory $HOME/my_output_dir, you can use the command
$ lfs setstripe -c -1 $HOME/my_output_dir
to change the stripe count to -1, which means that all storage subsystems of the file system are used to store that file. If you change the stripe count of a directory, the stripe count of existing files inside this directory is not changed. If you want to change the stripe count of existing files, change the stripe count of the parent directory, copy the files to new files, remove the old files and move the new files back to the old names. To check the stripe settings of the file my_file use
$ lfs getstripe my_file
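The restriping procedure for an existing file described above can be sketched as follows ($HOME/my_output_dir and my_file are placeholders):
$ lfs setstripe -c -1 $HOME/my_output_dir                           # new files in this directory now use all storage targets
$ cp $HOME/my_output_dir/my_file $HOME/my_output_dir/my_file.new    # the copy inherits the new stripe count
$ rm $HOME/my_output_dir/my_file
$ mv $HOME/my_output_dir/my_file.new $HOME/my_output_dir/my_file
$ lfs getstripe $HOME/my_output_dir/my_file                          # verify the new striping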
Also note that changes to the striping parameters (e.g. the stripe count) are not saved in the backup, i.e. if directories have to be recreated this information is lost and the default stripe count will be used. Therefore, you should note down for which directories you changed the striping parameters so that you can repeat these changes if required.
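One simple way to keep such a note is to store the directory's striping parameters in a file below the backed-up $HOME, for example:
$ lfs getstripe -d $HOME/my_output_dir > $HOME/striping_settings.txt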
Improving Metadata Performance
Metadata performance on parallel file systems is usually not as good as with local filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore, you should avoid metadata operations whenever possible. For example, it is much better to have few large files than lots of small files. In more detail, to increase the metadata performance of a parallel application the following aspects should be considered:
- avoid creating many small files,
- avoid competitive directory access, e.g. by creating files in separate subdirectories for each task (see the sketch after this list),
- if many small files are only used within a batch job and accessed by one process, store them on $TMP,
- change the default colorization setting of the command ls (see below).
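A minimal sketch of the per-task subdirectory approach inside a workspace; the workspace name my_workspace and the program name my_task are placeholders, and $SLURM_PROCID is assumed as the per-task identifier (any unique identifier works):
WORKDIR=$(ws_find my_workspace)/task_${SLURM_PROCID}
mkdir -p $WORKDIR
./my_task > $WORKDIR/output.log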
On modern Linux systems, the GNU ls command often uses colorization by default to visually highlight the file type; this is especially true if the command is run within a terminal session. This is because the default shell profile initializations usually contain an alias directive similar to the following for the ls command:
$ alias ls="ls --color=tty"
However, running the ls command in this way for files on a Lustre file system requires a stat() call to determine the file type. This can result in a performance overhead, because the stat() call always needs to determine the size of a file, and that in turn means that the client node must query the object size of all the backing objects that make up a file. As a result of the default colorization setting, running a simple ls command on a Lustre file system often takes as much time as running the ls command with the -l option (the same is true if the -F, -p, or --classify option, or any other option that requires information from a stat() call, is used). To avoid this performance overhead when using ls commands, add an alias directive similar to the following to your shell startup script:
$ alias ls="ls --color=never"
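If you occasionally want the colorized listing despite such an alias, you can bypass the alias for a single invocation; a leading backslash suppresses alias expansion in bash:
$ \ls --color=auto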