Hardware and Architecture (bwForCluster Chemistry)

From bwHPC Wiki
Revision as of 15:46, 11 December 2015 by V Kushnarenko (talk | contribs) (Storage Architecture)

1 System Architecture

The bwForCluster for computational and theoretical Chemistry, Justus, is a high-performance compute resource with a high-speed interconnect. It is intended for chemistry-related jobs with high memory (RAM, disk) needs and medium to low demands on the node-interconnecting Infiniband network.

Overview of the bwForCluster Chemistry, showing only the connecting Infiniband network. All machines are additionally connected by Gigabit Ethernet.

1.1 Basic Software Features

  • Red Hat Enterprise Linux (RHEL) 7
  • Queuing System: MOAB/Torque
  • Environment Modules system to access software

1.2 Common Hardware Features

A total of 444 compute nodes plus 10 nodes for login, admin and visualization purposes.

  • Processor: 2x Intel Xeon E5-2630v3 (Haswell, 8-core, 2.4 GHz)
  • Two processors per node (2x8 cores)
  • 1x QDR InfiniBand HCA, single port, Intel TrueScale

1.3 Node Types

There are four types of compute nodes, matched to jobs that are progressively less scalable and more memory-intensive (RAM and disks).

                Diskless nodes   SSD nodes                  Big SSD nodes              Large Memory/SSD nodes
Quantity        202              204                        22                         16
RAM (GB)        128              128                        256                        512
Disk Space      -                ~1 TB                      ~2 TB                      ~2 TB
Disks           -                4x 240 GB Enterprise SSD   4x 240 GB Enterprise SSD   4x 480 GB Enterprise SSD
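
The node table can be turned into a small lookup to see which node type a job needs. The following is a minimal sketch, not a cluster tool: the function names and the short labels (diskless, ssd, big-ssd, large-mem-ssd) are made up for illustration, and the disk sizes are the rounded "~1 TB" and "~2 TB" values from the table.

```shell
#!/bin/sh
# Node types from the table: label, quantity, RAM (GB), approx. local disk (GB).
# Labels are hypothetical; disk sizes are rounded values from the table.
node_types() {
    cat <<'EOF'
diskless 202 128 0
ssd 204 128 1000
big-ssd 22 256 2000
large-mem-ssd 16 512 2000
EOF
}

# pick_node_type RAM_GB DISK_GB
# Print the first (smallest) node type that satisfies both requirements,
# or "none" if no node type is large enough.
pick_node_type() {
    node_types | awk -v ram="$1" -v disk="$2" '
        $3 >= ram && $4 >= disk { print $1; found = 1; exit }
        END { if (!found) print "none" }'
}

# total_nodes: sum of the quantity column.
total_nodes() {
    node_types | awk '{ n += $2 } END { print n }'
}

# total_cores: every node has 2 processors with 8 cores each.
total_cores() {
    node_types | awk '{ n += $2 } END { print n * 16 }'
}
```

For example, `pick_node_type 200 500` selects the Big SSD nodes, since the two smaller node types only have 128 GB of RAM.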

2 Storage Architecture

Overview of the bwForCluster Chemistry storage concept.

In the storage concept of the bwForCluster for Chemistry, disks are served by two redundant servers for the Lustre work and ZFS home directories. The additional block device storage is meant to extend the local SSDs for problems that do not fit into the local space.

              $TMPDIR              central block storage   workspaces   $HOME
Visibility    local                on-demand local         global       global
Lifetime      batch job walltime   batch job walltime      < 90 days    permanent
Disk space    diskless/1TB/2TB     480 TB                  200 TB       200 TB
Quotas        no                   no                      no           100 GB
Backup        no                   no                      no           yes
 global   :  all nodes access the same file system;
 local    :  each node has its own file system;
 permanent:  files are stored permanently;
 batch job walltime:  files are removed at end of the batch job.

2.1 $HOME

Home directories are meant for permanent storage of files that are used repeatedly, such as source code, configuration files and executable programs; the content of home directories is backed up on a regular basis. The files in $HOME are stored on a ZFS file system and provided via NFS to all nodes.

The current disk usage of the home directory and the quota status can be checked with the diskusage command:

$ diskusage

User           	   Used (GB)	  Quota (GB)	Used (%)
<username>                4.38               100.00             4.38


Compute jobs on nodes must not write temporary data to $HOME. Instead, they should use the local $TMPDIR directory for I/O-heavy use cases and workspaces for less I/O-intensive multi-node jobs.
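
A common way to follow this rule is to stage input into $TMPDIR, run there, and copy the results back at the end. The sketch below illustrates the pattern only; `run_in_tmpdir` is a hypothetical helper, not a command provided on the cluster, and it falls back to /tmp when $TMPDIR is unset.

```shell
#!/bin/sh
# run_in_tmpdir INPUT RESULT_DIR COMMAND [ARGS...]
# Copy INPUT into a private directory below $TMPDIR, run COMMAND there with
# the input's file name appended as last argument, then copy everything in
# that directory back to RESULT_DIR and clean up.
run_in_tmpdir() {
    input=$1; result_dir=$2; shift 2
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || return 1
    status=0
    cp "$input" "$workdir/" &&
        ( cd "$workdir" && "$@" "$(basename "$input")" ) &&
        mkdir -p "$result_dir" &&
        cp -R "$workdir"/. "$result_dir"/ || status=$?
    rm -rf "$workdir"    # local scratch is removed even if a step failed
    return "$status"
}
```

Inside a job script this could be called as `run_in_tmpdir input.dat /path/to/results my_program`, where `my_program` and both paths are placeholders. Temporary files written by the program then stay on the local scratch space and never touch $HOME.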

Quota is full - what to do

When 100% of the quota is used, disk write operations may fail (e.g. error messages during file copy, edit or save operations). To avoid this, remove data that you no longer need from the $HOME directory or move it to a temporary location.

The following temporary locations are available:

  • Workspace - space on the Lustre file system, lifetime up to 90 days (see below)
  • Scratch on login nodes - special directory on every login node (login01..login04):
    • Access via variable $TMPDIR (e.g. "cd $TMPDIR")
    • Lifetime of data - minimum 7 days (based on the last access time)
    • Data is private for every user
    • Each login node has its own scratch directory (data is NOT shared)
    • There is NO backup of the data

To work comfortably with the $HOME directory, it is important to keep your data in order: remove unnecessary and temporary data, archive big files, and save large files only in a workspace. The JUSTUS support team can help with optimising your data-usage workflow.
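
To notice a filling quota before writes start to fail, the output of diskusage can be checked automatically. This is a minimal sketch that assumes the output always has the layout of the sample shown earlier in this section (a header line followed by one data line whose last column is "Used (%)"); the function names are hypothetical.

```shell
#!/bin/sh
# quota_percent: read `diskusage` output on stdin and print the "Used (%)"
# value, taken as the last field of the last non-empty line.
quota_percent() {
    awk 'NF > 0 { last = $NF } END { print last }'
}

# quota_warning THRESHOLD: read `diskusage` output on stdin and print
# "WARNING" if the usage is at or above THRESHOLD percent, otherwise "ok".
quota_warning() {
    quota_percent | awk -v limit="$1" '{
        if ($1 + 0 >= limit + 0) print "WARNING"; else print "ok"
    }'
}
```

On a login node this could be used as `diskusage | quota_warning 90`.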

2.2 Lustre filesystem "/work"

The workspace tools can be used to get temporary space on the Lustre file system.

To create a workspace you need to supply a name for your workspace area and a lifetime in days. The maximum lifetime is 90 days.

  • allocate a workspace
$ ws_allocate myprojectworkspace 50
Workspace created. Duration is 1200 hours. 
Further extensions available: 9999

More information is available with ws_allocate -h.

  • extend a workspace
$ ws_extend myprojectworkspace 50
Duration of workspace is successfully changed!
New duration is 1200 hours. Further extensions available: 9998

This changes the lifetime of the workspace to the specified number of days.

  • delete a workspace
$ ws_release myprojectworkspace
Info: Workspace was deleted.
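
The workspace tools take the lifetime in days but report the duration in hours, so 50 days show up as 1200 hours in the transcripts above. The helper below is only a sketch for checking a requested lifetime before calling ws_allocate; `ws_hours` is a hypothetical name and not part of the workspace tools.

```shell
#!/bin/sh
# ws_hours DAYS: print the workspace duration in hours for a lifetime of
# DAYS, or fail if DAYS is outside 1..90 (90 days is the stated maximum).
ws_hours() {
    days=$1
    if [ "$days" -lt 1 ] || [ "$days" -gt 90 ]; then
        echo "lifetime must be between 1 and 90 days" >&2
        return 1
    fi
    echo $((days * 24))
}
```

`ws_hours 50` prints 1200, matching the duration reported by ws_allocate and ws_extend in the examples.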


2.3 $TMPDIR

The variables $TMPDIR and $SCRATCH always point to local scratch space.

On compute nodes equipped with local SSD devices, $TMPDIR and $SCRATCH point to the file system mounted on those devices. If you want to use SSD scratch space in your compute job, you have to explicitly request disk space as a MOAB resource when submitting your job, as described in "Batch Jobs - bwForCluster Chemistry Features", section "Disk Space and Resources".

On the diskless compute nodes $TMPDIR points to a RAM-disk which will automatically provide up to 50% of the RAM capacity of the machine.
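
Because the RAM-disk lives in main memory, scratch data on a diskless node competes with the job's own memory. The check below is a rough sketch under two assumptions that the text does not state explicitly: that data in the RAM-disk counts fully against the node's 128 GB of RAM, and that no memory is reserved for the operating system. The function name is hypothetical.

```shell
#!/bin/sh
# fits_on_diskless SCRATCH_GB MEM_GB
# Succeed if a job with SCRATCH_GB of $TMPDIR data and MEM_GB of process
# memory fits on a diskless node (128 GB RAM, RAM-disk up to 50% of RAM).
# Assumption: RAM-disk contents and process memory share the same 128 GB.
fits_on_diskless() {
    scratch=$1; mem=$2; ram=128
    [ "$scratch" -le $((ram / 2)) ] && [ $((scratch + mem)) -le "$ram" ]
}
```

For example, `fits_on_diskless 32 64` succeeds, while `fits_on_diskless 70 10` fails because 70 GB exceeds the 64 GB RAM-disk limit.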

On the login nodes $TMPDIR and $SCRATCH point to a local scratch directory on that node. This is located at /scratch/<user> and is not shared across nodes. The data stored in there is private but will be deleted automatically if not accessed for 7 consecutive days. Like any other local scratch space, the data stored in there is NOT included in any backup.
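
Files that are about to fall under the 7-day rule can be listed with find. This is a sketch using the standard `-atime` test; whether the cleanup on the login nodes evaluates access times in exactly the same way is an assumption, and `stale_files` is a hypothetical helper name.

```shell
#!/bin/sh
# stale_files DIR DAYS: list regular files below DIR whose last access was
# at least DAYS days ago. find rounds the age down to whole days, so
# "-atime +6" matches files last accessed 7 or more days ago.
stale_files() {
    find "$1" -type f -atime +"$(( $2 - 1 ))"
}
```

On a login node, `stale_files "$TMPDIR" 7` would list the files that have not been accessed for at least 7 days.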

2.4 Central Storage via Blockdevices

It will be possible to expand the disk space of the local SSDs using part of the capacity of a central disk repository that is exported in the form of block devices. This service will be made available at a later date, after the cluster has gone into official live operation.

3 Network

The compute nodes are interconnected with QDR Infiniband for the communication needs of jobs and with Gigabit Ethernet for login and similar traffic.

3.1 Infiniband

The Infiniband network uses a blocking factor, which means that only islands of 32 nodes are fully interconnected. The hostnames of the nodes reflect this structure: e.g. node n0908 is the eighth node on the ninth Infiniband island, which it shares with nodes n0901 to n0932.
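
The island structure can be read directly from a hostname. The sketch below assumes that every hostname has the form "n" followed by a two-digit island number and a two-digit position, which is inferred from the single example n0908; the function names are hypothetical.

```shell
#!/bin/sh
# Assumed hostname layout: "n" + two-digit island + two-digit position.

# node_island HOST: print the island number (n0908 -> 9).
node_island() {
    digits=${1#n}        # n0908 -> 0908
    island=${digits%??}  # 0908  -> 09
    echo "${island#0}"   # 09    -> 9 (drop one leading zero)
}

# node_position HOST: print the position within the island (n0908 -> 8).
node_position() {
    digits=${1#n}
    pos=${digits#??}     # 0908 -> 08
    echo "${pos#0}"
}

# same_island HOST_A HOST_B: succeed if both hosts are on the same island,
# i.e. they are connected without blocking.
same_island() {
    [ "$(node_island "$1")" = "$(node_island "$2")" ]
}
```

`same_island n0901 n0932` succeeds, whereas `same_island n0908 n1001` fails because the two nodes sit on islands 9 and 10.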