Hardware and Architecture (bwForCluster Chemistry)
1 System Architecture
The bwForCluster for computational and theoretical chemistry, JUSTUS, is a high-performance compute resource with a high-speed interconnect. It is intended for chemistry-related jobs with high memory (RAM, disk) needs and medium to low demands on the InfiniBand network that interconnects the nodes.
1.1 Basic Software Features
- Red Hat Enterprise Linux (RHEL) 7
- Queuing System: MOAB/Torque
- Environment Modules system to access software
1.2 Common Hardware Features
A total of 444 compute nodes plus 10 nodes for login, admin and visualization purposes.
- Processor: 2x Intel Xeon E5-2630v3 (Haswell, 8-core, 2.4 GHz)
- Two processors per node (2x8 cores)
- 1x QDR InfiniBand HCA, single port, Intel TrueScale
1.3 Node Types
There are four types of compute nodes, listed below in order of decreasing job scalability and increasing memory demand (RAM and disks).
|Node type||Diskless nodes||SSD nodes||Big SSD nodes||Large Memory/SSD nodes|
|Disk space||-||~1TB||~2TB||~2TB (~7TB on 3 nodes)|
|Disks||-||4x240GB SSD||4x480GB SSD||4x480GB SSD (4x1.8TB on 3 nodes)|
|RAID||-||RAID 0||RAID 0||RAID 0|
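The disk space figures in the table follow from the RAID 0 layout: striping adds no redundancy, so the usable capacity is roughly the number of disks times the size of one disk. The helper below is a small sketch of that arithmetic; `raid0_capacity_gb` is a hypothetical name, and 1.8TB is written as 1800 GB to stay within integer arithmetic.

```shell
# Usable capacity of a RAID 0 set built from equally sized disks.
# $1 = number of disks, $2 = size of one disk in GB
raid0_capacity_gb() {
    echo $(( $1 * $2 ))
}

raid0_capacity_gb 4 240     # SSD nodes:        960 GB (~1TB)
raid0_capacity_gb 4 480     # Big SSD nodes:   1920 GB (~2TB)
raid0_capacity_gb 4 1800    # 4x1.8TB variant: 7200 GB (~7TB)
```

Because RAID 0 has no redundancy, the failure of a single SSD loses the whole local file system, which fits its role as scratch space rather than permanent storage.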
Two nodes are dedicated to remote visualization. Each of them has an NVIDIA K6000 graphics card, 512 GB of RAM and 4 TB of local disk space.
2 Storage Architecture
In the storage concept of the bwForCluster for Chemistry, the disks for the Lustre and ZFS work and home directories are served by two redundant servers. The additional block device storage is meant to extend the local SSDs for problems that do not fit into the local space.
|Storage||$TMPDIR||central block storage||workspaces||$HOME|
|Lifetime||batch job walltime||batch job walltime||< 90 days||permanent|
|Disk space||diskless/1TB/2TB/7TB||480 TB||200 TB||200 TB|
global: all nodes access the same file system; local: each node has its own file system; permanent: files are stored permanently; batch job walltime: files are removed at the end of the batch job.
2.1 $HOME
Home directories are meant for the permanent storage of files that are in continuous use, such as source code, configuration files and executable programs. The content of home directories is backed up on a regular basis. The files in $HOME are stored on a ZFS file system and provided to all nodes via NFS.
The current disk usage of the home directory and the quota status can be checked with the diskusage command:
$ diskusage
User              Used (GB)      Quota (GB)      Used (%)
------------------------------------------------------------------------
<username>        4.38           300.00          1.46
Compute jobs must not write temporary data to $HOME. Instead, they should use the local $TMPDIR directory for I/O-heavy use cases and workspaces for less I/O-intensive multi-node jobs.
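One common way to follow this rule inside a job script is to stage the input into $TMPDIR, run the program there, and copy back only the results. The following is a minimal sketch of that pattern, not a prescribed JUSTUS workflow; `stage_and_run`, `my_program` and all paths are hypothetical names chosen for illustration.

```shell
# stage_and_run <input_file> <result_dir> <command> [args...]
# Copies the input into a private directory below $TMPDIR, runs the
# command there, then copies the directory contents to <result_dir>.
stage_and_run() {
    _input=$1
    _result_dir=$2
    shift 2
    # Falls back to /tmp only so the sketch also runs outside a batch job.
    _workdir=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || return 1
    cp "$_input" "$_workdir/" || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$_workdir" && "$@" ) || return 1
    mkdir -p "$_result_dir" || return 1
    cp -R "$_workdir/." "$_result_dir/" || return 1
    rm -rf "$_workdir"
}

# Example call inside a job script (hypothetical program and paths):
# stage_and_run /path/to/input.dat /path/to/results my_program input.dat
```

All intermediate files stay in local scratch; only the final copy step touches the shared file system.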
Quota is full - what to do
When the quota is 100% used, disk write operations may fail (e.g. with error messages during file copy, edit or save operations). To avoid this, remove data that you no longer need from the $HOME directory or move it to a temporary location.
The following temporary locations are available:
- Workspace - space on the Lustre file system, lifetime up to 90 days (see below)
- Scratch on login nodes - special directory on every login node (login01..login04):
  - Access via variable $TMPDIR (e.g. "cd $TMPDIR")
  - Lifetime of data - minimum 7 days (based on the last access time)
  - Data is private for every user
  - Each login node has its own scratch directory (data is NOT shared)
  - There is NO backup of the data
To work comfortably with the $HOME directory, it is important to keep the data in order: remove unnecessary and temporary data, archive big files, and keep large files only in a workspace. The JUSTUS support team can help with optimising data-usage workflows.
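To notice a filling quota before write operations start to fail, the output of diskusage can be checked against a threshold. The sketch below assumes the column layout shown earlier in this section (user, used GB, quota GB, used %); `check_quota` is a hypothetical helper, and the 90% default is an arbitrary choice.

```shell
# check_quota [threshold_percent]
# Reads diskusage-style output on stdin and prints one line per user:
# "WARN <user> <percent>" at or above the threshold, "OK ..." otherwise.
check_quota() {
    awk -v t="${1:-90}" '
        # Data rows have exactly four fields and a numeric last field;
        # the header and the separator line do not match.
        NF == 4 && $NF ~ /^[0-9]+(\.[0-9]+)?$/ {
            status = ($NF + 0 >= t + 0) ? "WARN" : "OK"
            printf "%s %s %s\n", status, $1, $NF
        }'
}

# Usage on a login node:
# diskusage | check_quota 90
```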
Restoring data with ZFS snapshots - what to do
ZFS snapshots provide point-in-time backups of the home directory. This feature can be used to restore previously deleted data.
Navigate into the home directory and retrieve a list of available snapshots.
$ cd $HOME
$ ls ../../../.zfs/snapshot
Note: The names of the displayed snapshots follow the convention <time interval>-<date and time>.
Looking inside the snapshots helps to locate the one that contains the file to be restored.
$ ls -l ../../../.zfs/snapshot/
The following command restores a file named foo by copying it into the home directory.
$ cp ../../../.zfs/snapshot/
2.2 Lustre filesystem "/work"
The workspace tools can be used to get temporary space on the Lustre file system.
Workspace directories expire as a whole after a fixed period. The maximum lifetime of a workspace is 90 days, but it can be renewed at the end of that period.
Creating, deleting, finding and extending workspaces is explained on the workspace page.
2.3 $TMPDIR and $SCRATCH
The variables $TMPDIR and $SCRATCH always point to local scratch space.
On compute nodes equipped with local SSD devices, $TMPDIR and $SCRATCH point to the file system mounted on those devices. If you want to use SSD scratch space in your compute job, you have to explicitly request a disk space MOAB resource when submitting your job, as described in Batch Jobs - bwForCluster Chemistry Features#Disk Space and Resources.
On the diskless compute nodes $TMPDIR points to a RAM-disk which will automatically provide up to 50% of the RAM capacity of the machine.
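The 50% limit makes it easy to estimate whether a job's scratch data fits on a diskless node. The helper below is a sketch of that check; `fits_in_ramdisk` is a hypothetical name, and the node RAM has to be supplied by the caller because the RAM sizes are not listed in this section.

```shell
# fits_in_ramdisk <needed_mb> <node_ram_mb>
# Succeeds if the needed scratch space does not exceed 50% of the node RAM.
fits_in_ramdisk() {
    _limit_mb=$(( $2 / 2 ))
    [ "$1" -le "$_limit_mb" ]
}

# Usage on a Linux node, reading the installed RAM from /proc/meminfo:
# ram_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
# fits_in_ramdisk 20000 "$ram_mb" && echo "scratch data fits into the RAM disk"
```

Since a RAM disk keeps its files in main memory, space used in $TMPDIR is presumably not available to the application itself, so it is worth leaving some margin.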
On the login nodes $TMPDIR and $SCRATCH point to a local scratch directory on that node. This is located at /scratch/<user> and is not shared across nodes. The data stored in there is private but will be deleted automatically if not accessed for 7 consecutive days. Like any other local scratch space, the data stored in there is NOT included in any backup.
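Because the clean-up is based on the last access time, it can be useful to list files that are getting close to the 7-day limit. The sketch below uses the standard `find -atime` test; `stale_scratch_files` is a hypothetical helper, and the default of 5 days is an arbitrary warning margin.

```shell
# stale_scratch_files [dir] [days]
# Lists regular files below <dir> whose last access lies more than <days>
# full 24-hour periods in the past (find -atime +N semantics).
stale_scratch_files() {
    _dir=${1:-$TMPDIR}
    _days=${2:-5}
    find "$_dir" -type f -atime "+$_days" -print
}

# Usage on a login node:
# stale_scratch_files "$TMPDIR" 5
```

Files that are still needed should be moved to a workspace or to $HOME rather than kept alive in scratch, since scratch data is not backed up.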
2.4 Central Storage via Blockdevices
It will be possible to expand the disk space of the local SSDs using part of the capacity of a central disk repository that is exported in the form of block devices. This service will be made available at a later date, after the cluster has gone into official production operation.
3 Network Architecture
The compute nodes are interconnected with QDR InfiniBand for the communication needs of jobs and with Gigabit Ethernet for login and similar traffic.
The InfiniBand network uses a blocking factor, which means that only the nodes within an island of 32 nodes are fully interconnected. The hostnames of the nodes reflect this structure, e.g. node n0908 is the eighth node on the ninth InfiniBand island, which it shares with nodes n0901 to n0932.
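The naming scheme can be decoded mechanically, for example to check whether two nodes share an island. The sketch below assumes the pattern n<II><NN> (two-digit island number followed by a two-digit position), which is inferred from the single example n0908 and is not an official naming specification; `node_island` and `same_island` are hypothetical helpers.

```shell
# node_island <hostname>
# Prints the island number and the position within the island.
node_island() {
    case "$1" in
        n[0-9][0-9][0-9][0-9]) ;;
        *) echo "unrecognized hostname: $1" >&2; return 1 ;;
    esac
    _digits=${1#n}            # "0908"
    _island=${_digits%??}     # first two digits: "09"
    _pos=${_digits#??}        # last two digits:  "08"
    # Strip one leading zero so the values print as plain decimals.
    echo "island=${_island#0} node=${_pos#0}"
}

# same_island <host1> <host2>
# Succeeds if both hostnames carry the same island digits.
same_island() {
    _a=${1#n}
    _b=${2#n}
    [ "${_a%??}" = "${_b%??}" ]
}

node_island n0908    # prints: island=9 node=8
```

Under this assumption, a multi-node job whose nodes all satisfy `same_island` would communicate only over the fully interconnected part of the network.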