Hardware and Architecture
The bwForCluster BinAC 2 supports researchers from the broader fields of Bioinformatics, Medical Informatics, Astrophysics, Geosciences and Pharmacy.
Operating System and Software
- Operating System: Rocky Linux 9.6
- Queuing System: Slurm (see BinAC2/Slurm for help)
- (Scientific) Libraries and Software: Environment Modules
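Installed software is accessed through the module command. As a minimal illustration (the module name is a placeholder; check module avail for what is actually installed):
$> module avail                        # list all available software modules
$> module load <software>/<version>    # load a module into your environment
$> module list                         # show currently loaded modules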
Compute Nodes
BinAC 2 offers compute nodes, high-mem nodes, and three types of GPU nodes.
- 180 compute nodes
- 16 SMP (high-mem) nodes
- 32 GPU nodes (2xA30)
- 8 GPU nodes (4xA100)
- 4 GPU nodes (4xH200)
- plus several special purpose nodes for login, interactive jobs, etc.
Compute node specification:
| | Standard | High-Mem | GPU (A30) | GPU (A100) | GPU (H200) |
|---|---|---|---|---|---|
| Quantity | 168 / 12 | 14 / 2 | 32 | 8 | 4 |
| Processors | 2 x AMD EPYC Milan 7543 / 2 x AMD EPYC Milan 75F3 | 2 x AMD EPYC Milan 7443 / 2 x AMD EPYC Milan 75F3 | 2 x AMD EPYC Milan 7543 | 2 x AMD EPYC Milan 7543 | 2 x AMD EPYC Milan 9555 |
| Processor Base Frequency (GHz) | 2.80 / 2.95 | 2.85 / 2.95 | 2.80 | 2.80 | 3.20 |
| Number of Physical Cores / Hyperthreads | 64 / 128 | 48 / 96 (7443) or 64 / 128 (75F3) | 64 / 128 | 64 / 128 | 128 / 256 |
| Working Memory (GB) | 512 | 2048 | 512 | 512 | 1536 |
| Local Disk (GiB) | 450 (NVMe-SSD) | 14000 (NVMe-SSD) | 450 (NVMe-SSD) | 14000 (NVMe-SSD) | 28000 (NVMe-SSD) |
| Interconnect | HDR 100 IB (84 nodes) / 100GbE (96 nodes) | 100GbE | 100GbE | 100GbE | HDR 200 IB + 100GbE |
| Coprocessors | - | - | 2 x NVIDIA A30 (24 GB ECC HBM2, NVLink) | 4 x NVIDIA A100 (80 GB ECC HBM2e) | 4 x NVIDIA H200 NVL (141 GB ECC HBM3e, NVLink) |
Network
The compute nodes and the parallel file system are connected via 100 Gbit/s Ethernet (100GbE).
In contrast to BinAC 1, not all compute nodes are connected via InfiniBand: 84 standard compute nodes are connected via HDR-100 InfiniBand. To get your jobs onto the InfiniBand nodes, submit them with --constraint=ib.
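For example, in a batch script (the rest of the script is omitted here):
#SBATCH --constraint=ib
or directly on the command line, with jobscript.sh standing in for your own script:
$> sbatch --constraint=ib jobscript.sh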
Question:
OpenMPI throws the following warning:
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be used
on a specific port. As such, the openib BTL (OpenFabrics support) will be
disabled for this port.

  Local host:      node1-083
  Local device:    mlx5_0
  Local port:      1
  CPCs attempted:  rdmacm, udcm
--------------------------------------------------------------------------
[node1-083:2137377] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1-083:2137377] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
What should I do?
Answer:
BinAC2 has two (almost) separate networks, a 100GbE network and an InfiniBand network, each connecting a subset of the nodes. The two networks require different cables and switches.
The network cards in the nodes, however, are VPI cards which can be configured to work in either mode (https://docs.nvidia.com/networking/display/connectx6vpi/specifications#src-2487215234_Specifications-MCX653105A-ECATSpecifications).
OpenMPI can use a number of layers for transferring data and messages between processes. At startup, it tests all means of communication that were configured during compilation and then tries to figure out the fastest path between all processes.
If OpenMPI encounters such a VPI card, it will first try to establish a Remote Direct Memory Access (RDMA) channel using the OpenFabrics (OFI) layer.
On nodes with 100GbE, this fails because no RDMA protocol is configured there. OpenMPI will fall back to TCP transport, but not without complaining.
Workaround:
For single-node jobs, or for jobs on the regular compute nodes and the A30 and A100 GPU nodes: add the lines
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
to your job script to disable the OFI transport layer. If you need high-bandwidth, low-latency transport between all processes on all nodes, switch to the InfiniBand nodes (#SBATCH --constraint=ib). Do not turn off the OFI layer on InfiniBand nodes, as it is the best choice between nodes there!
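A minimal single-node job script sketch with this workaround applied; the resource values and the program name my_mpi_program are placeholders, not recommendations:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --time=02:00:00
# disable the OFI/openib transports so that OpenMPI falls back to TCP without warnings
export OMPI_MCA_btl="^ofi,openib"
export OMPI_MCA_mtl="^ofi"
mpirun ./my_mpi_program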
File Systems
The bwForCluster BinAC 2 consists of two separate storage systems, one for the user's home directory $HOME and one serving as a project/work space. The home directory is limited in space and parallel access but offers snapshots of your files and backup.
The project/work storage is a parallel file system (PFS) which offers fast, parallel file access and a larger capacity than the home directory. It is mounted at /pfs/10 on the login and compute nodes. This storage is based on Lustre and can be accessed in parallel from many nodes. The PFS contains the project and the work directory. Each compute project has its own directory at /pfs/10/project that is accessible to all members of the compute project.
Each user can create workspaces under /pfs/10/work using the workspace tools. These directories are only accessible for the user who created the workspace.
Additionally, each compute node provides high-speed temporary storage (SSD) on the node-local solid state disk via the $TMPDIR environment variable.
| | $HOME | project | work | $TMPDIR |
|---|---|---|---|---|
| Visibility | global | global | global | node local |
| Lifetime | permanent | permanent | work space lifetime (max. 30 days, max. 5 extensions) | batch job walltime |
| Capacity | - | 8.1 PB | 1000 TB | 480 GB (compute nodes); 7.7 TB (GPU-A30 nodes); 16 TB (GPU-A100 and SMP nodes); 31 TB (GPU-H200 nodes) |
| File System Type | NFS | Lustre | Lustre | XFS |
| Speed (read) | ≈ 1 GB/s, shared by all nodes | max. 12 GB/s | ≈ 145 GB/s peak, aggregated over 56 nodes, ideal striping | ≈ 3 GB/s (compute) / ≈ 5 GB/s (GPU-A30) / ≈ 26 GB/s (GPU-A100 + SMP) / ≈ 42 GB/s (GPU-H200) per node |
| Quotas | 40 GB per user | not yet, maybe in the future | none | none |
| Backup | yes (nightly) | no | no | no |
- global: all nodes access the same file system
- local: each node has its own file system
- permanent: files are stored permanently
- batch job walltime: files are removed at end of the batch job
Please note that due to the large capacity of work and project and due to frequent file changes on these file systems, no backup can be provided.
Home
Home directories are meant for the permanent storage of files that are used again and again, such as source code, configuration files, executable programs etc.; the content of home directories is backed up on a regular basis. Because the backup space is limited, we enforce a quota of 40 GB on the home directories.
NOTE: Compute jobs must not write temporary data to $HOME. Instead, they should use the local $TMPDIR directory for I/O-heavy use cases and work spaces for less I/O-intensive multi-node jobs.
Project
The data is stored on HDDs. The primary focus of /pfs/10/project is pure capacity, not speed.
Every project gets a dedicated directory located at:
/pfs/10/project/<project_id>/
You can check the project(s) you are a member of via:
$> id $USER | grep -o 'bw[^)]*'
bw16f003
In this case, your project directory would be:
/pfs/10/project/bw16f003/
Check our data organization guide for methods to organize data inside the project directory.
Workspaces
Data on the fast storage pool at /pfs/10/work is stored on SSDs.
The primary focus is speed, not capacity.
In contrast to BinAC 1, we will enforce workspace lifetimes, as the capacity is limited.
We ask you to only store data you actively use for computations on /pfs/10/work.
Please move data to /pfs/10/project when you don't need it on the fast storage any more.
Each user should create workspaces at /pfs/10/work through the workspace tools.
You can find more info on the workspace tools on our general page.
To create a work space you'll need to supply a name for your work space area and a lifetime in days.
For more information read the corresponding help, e.g: ws_allocate -h.
| Command | Action |
|---|---|
| ws_allocate mywork 30 | Allocate a work space named "mywork" for 30 days. |
| ws_allocate myotherwork | Allocate a work space named "myotherwork" with maximum lifetime. |
| ws_list -a | List all your work spaces. |
| ws_find mywork | Get the absolute path of work space "mywork". |
| ws_extend mywork 30 | Extend the lifetime of work space "mywork" by 30 days from now. |
| ws_release mywork | Manually erase your work space "mywork". Please remove the directory content first or use the --delete-data option. |
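A typical pattern, sketched here with the placeholder workspace name mywork, is to allocate the workspace once on the login node and then resolve its path inside job scripts with ws_find:
$> ws_allocate mywork 30      # once, interactively on the login node
and in a job script:
WORKDIR=$(ws_find mywork)     # resolve the absolute path of the workspace
cd "$WORKDIR"                 # run the computation inside the workspace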
Scratch
Please use the fast local scratch space for storing temporary data during your jobs.
For each job a scratch directory will be created on the compute nodes. It is available via the environment variable $TMPDIR, which points to /scratch/<jobID>.
Especially the SMP nodes and the GPU nodes are equipped with large and fast local disks that should be used for temporary data, scratch data or data staging for ML model training.
The Lustre file system (WORK and PROJECT) is unsuited for repetitive random I/O, I/O sizes smaller than the Lustre and ZFS block size (1M) or I/O patterns where files are opened and closed in rapid succession. The XFS file system of the local scratch drives is better suited for typical scratch workloads and access patterns. Moreover, the local scratch drives offer a lower latency and a higher bandwidth than WORK.
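As an illustration of this staging pattern (all paths and the program name are placeholders):
#!/bin/bash
#SBATCH --time=04:00:00
# stage input data from the parallel file system to the fast node-local scratch
cp -r /pfs/10/project/<project_id>/input "$TMPDIR"/
cd "$TMPDIR"
# run the I/O-heavy computation on the local SSD
./my_program input > output.log
# copy the results back before the job ends ($TMPDIR is removed at the end of the job)
cp output.log /pfs/10/project/<project_id>/results/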
SDS@hd
SDS@hd is mounted via NFS on login and compute nodes at /mnt/sds-hd.
To access your Speichervorhaben, the export to BinAC 2 must first be enabled by the SDS@hd-Team. Please contact SDS@hd support and provide the acronym of your Speichervorhaben, along with a request to enable the export to BinAC 2.
Once this has been done, you can access your Speichervorhaben as described in the SDS documentation.
$ kinit $USER
Password for <user>@BWSERVICES.UNI-HEIDELBERG.DE:
The Kerberos ticket store is shared across all nodes. Creating a single ticket is sufficient to access your Speichervorhaben on all nodes.
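For illustration, a possible session once the export has been enabled (sds_example is a placeholder for the acronym of your Speichervorhaben):
$> kinit $USER                  # obtain a Kerberos ticket (asks for your password)
$> klist                        # verify that the ticket exists
$> ls /mnt/sds-hd/sds_example   # access your Speichervorhaben via the NFS mount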
More Details on the Lustre File System
Lustre is a distributed parallel file system.
- The entire logical volume as presented to the user is formed by multiple physical or local drives. Data is distributed over more than one physical or logical volume/hard drive, single files can be larger than the capacity of a single hard drive.
- The file system can be mounted from all nodes ("clients") in parallel at the same time for reading and writing. This also means that technically you can write to the same file from two different compute nodes! Usually, this will create an unpredictable mess! Never ever do this unless you know exactly what you are doing!
- On a single server or client, the bandwidth of multiple network interfaces can be aggregated to increase the throughput ("multi-rail").
Lustre works by chopping files into many small parts ("stripes", file objects) which are then stored on the object storage servers. The information which part of the file is stored where on which object storage server, when it was changed last etc. and the entire directory structure is stored on the metadata servers. Think of the entries on the metadata server as being pointers pointing to the actual file objects on the object storage servers. A Lustre file system can consist of many metadata servers (MDS) and object storage servers (OSS). Each MDS or OSS can again hold one or more so-called object storage targets (OST) or metadata targets (MDT) which can e.g. be simply multiple hard drives. The capacity of a Lustre file system can hence be easily scaled by adding more servers.
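You can inspect this structure yourself. For example (the mount point /pfs/10 is the one from the File Systems section above):
$> lfs df -h /pfs/10    # show all MDTs and OSTs of the file system with their capacity and usage
$> lfs osts /pfs/10     # list the OSTs (object storage targets) the file system consists of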
Useful Lustre Commands
Commands specific to the Lustre file system are divided into user commands (lfs ...) and administrative commands (lctl ...). On BinAC2, users may only execute the user commands, and not even all of those.
- lfs help <command>: Print the built-in help for a command. Alternative: man lfs <command>
- lfs find: Drop-in replacement for the find command, much faster on Lustre file systems as it talks directly to the metadata server
- lfs --list-commands: Print a list of available commands
Moving data between WORK and PROJECT
!! IMPORTANT !! Calling mv on files will not physically move them between the fast and the slow pool of the file system. Instead, only the file metadata, i.e. the path of the file in the directory tree (the data stored on the MDS), is modified. The stripes of the file on the OSS will remain exactly where they were. The only result is the confusing situation that you now have metadata entries under /pfs/10/project that still point to WORK OSTs. When using mv within the same file system, Lustre only renames the files and makes them available from a different path; the pointers to the file objects on the OSS stay identical. This only changes if you either create a copy of the file at a different path (with cp or rsync, e.g.) or if you explicitly instruct Lustre to move the actual file objects to another storage location, e.g. another pool of the same file system.
Proper ways of moving data between the pools
- Copy the data (which will create new files), then delete the old files. Example:
$> cp -ar /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
- Alternative to copy: use rsync to copy data between the workspace and the project directories. Example:
$> rsync -av /pfs/10/work/tu_abcde01-my-precious-ws/simulation/output /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/simulation25/
$> rm -rf /pfs/10/work/tu_abcde01-my-precious-ws/*
$> ws_release --delete-data my-precious-ws
- If there are many subfolders of similar size, you can use xargs to copy them in parallel:
$> find . -maxdepth 1 -mindepth 1 -type d -print | xargs -P4 -I{} rsync -aHAXW --inplace --update {} /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/simulation25/
This will launch four parallel rsync processes at a time, each copying one of the subdirectories.
- First move the metadata with mv, then use lfs migrate or the wrapper lfs_migrate to actually migrate the file stripes. This is also a possible resolution if you already mved data from work to project or vice versa. lfs migrate is the raw Lustre command; it can only operate on one file at a time, but offers access to all options. lfs_migrate is a versatile wrapper script that can work on single files or recursively on entire directories. If available, it will try to use lfs migrate, otherwise it will fall back to rsync (see lfs_migrate --help for all options).
Example with lfs migrate:
$> mv /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> cd /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> lfs find . -type f --pool work -0 | xargs -0 lfs migrate --pool project  # find all files whose file objects are on the work pool and migrate the objects to the project pool
$> ws_release --delete-data my-precious-ws
Example with lfs_migrate:
$> mv /pfs/10/work/tu_abcde01-my-precious-ws/* /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> cd /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/.
$> lfs_migrate --yes -q -p project *  # migrate all file objects in the current directory to the project pool, be quiet (-q) and do not ask for confirmation (--yes)
$> ws_release --delete-data my-precious-ws
Both migration commands can also be combined with options to restripe the files during migration, i.e. you can also change the number of OSTs a file is striped over, the size of a single stripe etc.
Attention! Neither lfs migrate nor lfs_migrate will change the path of the file(s); you must also mv them! If used without mv, the files will still belong to the workspace although their file object stripes are now on the project pool, and a subsequent rm in the workspace will wipe them.
All of the above procedures may take a considerable amount of time depending on the amount of data, so it might be advisable to execute them in a terminal multiplexer like screen or tmux or wrap them into small SLURM jobs with sbatch --wrap="<command>".
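For example, a migration wrapped into a small batch job could look like this (job name, time limit and paths are taken over from the examples above or are placeholders):
$> sbatch --job-name=migrate --time=12:00:00 --wrap="rsync -av /pfs/10/work/tu_abcde01-my-precious-ws/ /pfs/10/project/bw10a001/tu_abcde01/my-precious-research/"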
Question:
I have totally lost track. How do I find out where my files are located?
Answer:
- Use lfs find to find files on a specific pool. Example:
$> lfs find . --pool project # recursively find all files in the current directory whose file objects are on the "project" pool
- Use lfs getstripe to query the striping pattern and the pool (also works recursively if called with a directory). Example:
$> lfs getstripe parameter.h
parameter.h
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 1
lmm_stripe_offset: 44
lmm_pool: project
obdidx objid objid group
44 7991938 0x79f282 0xd80000400
shows that the file is striped over OST 44 (obdidx) which belongs to pool project (lmm_pool).
Why paths and storage pools should match:
There are four different possible scenarios with two subdirectories and two pools:
- File path in /pfs/10/work, file objects on pool work: good.
- File path in /pfs/10/project, file objects on pool project: good.
- File path in /pfs/10/project, file objects on pool work: bad. This will "leak" storage from the fast pool, making it unavailable for workspaces.
- File path in /pfs/10/work, file objects on pool project: bad. Access will be slow, and if (volatile) workspaces are purged, data residing on project will (voluntarily or involuntarily) be deleted.
The latter two situations may arise from moving data between workspaces and project folders with mv.
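To check for such mismatches, you can combine lfs find with the pool option (mywork and <project_id> are placeholders):
$> lfs find /pfs/10/project/<project_id> --pool work    # files under project whose objects still occupy the fast work pool
$> lfs find $(ws_find mywork) --pool project            # workspace files whose objects live on the slow project pool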
More on data striping and how to influence it
!! The default striping patterns on BinAC2 are set for good reasons and should not light-heartedly be changed!
Doing so wrongly will in the best case only hurt your performance.
In the worst case, it will also hurt all other users and endanger the stability of the cluster.
Please talk to the admins first if you think that you need a non-default pattern.
- Reading striping patterns with lfs getstripe
- Setting striping patterns with lfs setstripe for new files and directories
- Restriping files with lfs migrate
- Progressive File Layout
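As a sketch only, and only after consulting the admins as stated above (the directory, file name, stripe count and stripe size are arbitrary placeholders, not recommendations):
$> lfs getstripe -d /pfs/10/work/tu_abcde01-my-precious-ws            # show the current default layout of a directory
$> lfs setstripe -c 4 -S 4M /pfs/10/work/tu_abcde01-my-precious-ws    # new files created in this directory will be striped over 4 OSTs with a 4 MiB stripe size
$> lfs migrate -c 4 -S 4M bigfile.dat                                 # restripe an existing file in place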
Architecture of BinAC2's Lustre File System
Metadata Servers:
- 2 metadata servers
- 1 MDT per server
- MDT Capacity: 31 TB, hardware RAID6 on NVMe drives (flash memory/SSD)
- Networking: 2x 100 GbE, 2x HDR-100 InfiniBand
Object storage servers:
- 8 object storage servers
- 2 fast OSTs per server
- 70 TB per OST, software RAID (raid-z2, 10+2 redundancy)
- NVMe drives, directly attached to the PCIe bus
- 8 slow OSTs per server
- 143 TB per OST, hardware RAID (RAID6, 8+2 redundancy)
- externally attached via SAS
- Networking: 2x 100 GbE, 2x HDR-100 InfiniBand
- All fast OSTs are assigned to the pool work
- All slow OSTs are assigned to the pool project
- All files created under /pfs/10/work are by default stored on the fast pool
- All files created under /pfs/10/project are by default stored on the slow pool
- Metadata is distributed over both MDTs. All subdirectories of a directory (workspace or project folder) are typically on the same MDT. Directory striping/placement on MDTs cannot be influenced by users.
- Default OST striping: stripes have a size of 1 MiB. Files are striped over one OST if possible, i.e. all stripes of a file are on the same OST. New files are created on the OST with the most free space.
Internally, the slow and the fast pool belong to the same Lustre file system and namespace.
More reading: