BwForCluster NEMO Specific Batch Features
This article contains information on features of the batch job system only applicable on the "bwForCluster NEMO" in Freiburg.
1 Submitting Jobs on the bwForCluster NEMO
This page describes the details of the queuing system specific to the bwForCluster NEMO.
A general description of options that should work on all bwHPC clusters can be found on the Batch Jobs page.
Currently all worker nodes have 20 physical cores. Do not request more than 20 processes per node with the ppn flag (e.g. ppn=20).
If the requested ppn count exceeds this limit, the job will remain in the idle state, i.e. it will never start running. Unfortunately, msub does not warn you about invalid resource definitions. Please use checkjob -v to verify that your job definition is valid. Please select a number of cores that divides 20 evenly (e.g. ppn=[1,2,4,5,10,20]). This way the resource usage can be optimized.
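Because msub accepts invalid requests silently, it can help to check a ppn value before submitting. The following is a minimal sketch, not a cluster tool: the function name `check_ppn` and its return labels are hypothetical, and only the limit of 20 cores per node is taken from the text above.

```python
CORES_PER_NODE = 20  # physical cores per worker node, as stated above


def check_ppn(ppn: int, cores_per_node: int = CORES_PER_NODE) -> str:
    """Classify a ppn value (hypothetical helper, not part of Moab).

    Returns:
      "invalid" - outside 1..cores_per_node; such a job would stay idle
      "ok"      - divides the node's core count evenly
      "uneven"  - allowed, but leaves cores that cannot be filled evenly
    """
    if ppn < 1 or ppn > cores_per_node:
        return "invalid"
    if cores_per_node % ppn == 0:
        return "ok"
    return "uneven"


if __name__ == "__main__":
    for ppn in (1, 4, 8, 20, 21):
        print(f"ppn={ppn}: {check_ppn(ppn)}")
```

A value classified as "uneven" (for example ppn=8) is still accepted by the scheduler, but it packs less efficiently onto 20-core nodes.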
Jobs run in single-user mode. This means that a worker node is shared only between jobs of the same user: more than one of your jobs can run on the same worker node if there are spare resources.
1.1 Limits and Queues
Walltime and cores correlate. MAXPS limits the product of cores and walltime (cores x time). If you increase the number of cores, you need to decrease the walltime to run the same number of jobs.
On the bwForCluster NEMO the standard queue should not be explicitly specified.
- The maximum walltime for a job is 96 hours (4 days)
walltime=4:00:00:00 or walltime=96:00:00.
- All nodes have 20 cores and, by default, 128 GB of RAM. Thus each core can use roughly 6 GB of RAM (pmem=6GB).
nodes=1:ppn=20 pmem=6gb
Exceptions: four nodes have 256 GB and four nodes have 512 GB of RAM.
- The maximum number of cores in use at any time is 6000. We use MAXPE, which takes memory into account (see processor equivalent).
MAXPE: 4000 (soft limit), 6000 (hard limit)
- IMPORTANT MAXPS: The maximum number of processor seconds that can be in use at any time is 456192000. The amount in use increases with every job and decreases as time passes (see basic fairness policies).
MAXPS: 304128000 (soft limit), 456192000 (hard limit). The soft limit is calculated as 4 OPA islands * 44 nodes * 20 cores * 60 sec * 60 min * 24 h.
MAXPS = (# of cores) * walltime
Example: If a job uses 3520 cores for 24h, it will reserve 304128000 processor seconds. This is the soft limit for the cluster.
showq -b -v # will show you when your jobs hit the limit
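The MAXPS arithmetic above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, walltime strings are assumed to follow the [[DD:]HH:]MM:SS pattern used on this page, and the estimate ignores that reserved processor seconds decrease as jobs run.

```python
MAXPS_SOFT = 304_128_000  # processor seconds, soft limit quoted above
MAXPS_HARD = 456_192_000  # processor seconds, hard limit quoted above


def walltime_to_seconds(walltime: str) -> int:
    """Convert a walltime such as '4:00:00:00', '96:00:00' or '15:00'.

    Assumes the pattern [[DD:]HH:]MM:SS; a single number is seconds.
    """
    parts = [int(p) for p in walltime.split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"unexpected walltime format: {walltime!r}")
    # Pad on the left so we always have days, hours, minutes, seconds.
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def processor_seconds(nodes: int, ppn: int, walltime: str) -> int:
    """Processor seconds a job reserves: (# of cores) * walltime."""
    return nodes * ppn * walltime_to_seconds(walltime)


def jobs_within_limit(nodes: int, ppn: int, walltime: str,
                      limit: int = MAXPS_SOFT) -> int:
    """Upper bound on identical jobs whose reservations fit the limit."""
    return limit // processor_seconds(nodes, ppn, walltime)


if __name__ == "__main__":
    # The example from the text: 3520 cores (176 nodes x 20) for 24 hours.
    print(processor_seconds(176, 20, "24:00:00"))
    # Full-node jobs at the maximum walltime of four days.
    print(jobs_within_limit(1, 20, "4:00:00:00"))
```

Under these assumptions, a single-node job with ppn=20 and the maximum walltime reserves 6912000 processor seconds, so 44 such jobs fill the soft limit. Halving the walltime doubles that number, which is the correlation between cores and walltime described above.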
|queue||node||default resources||minimum resources||theoretical maximum resources||node access policy|
|do not specify||worker||nodes=1:ppn=1, walltime=01:00:00, pmem=1000mb||nodes=1:ppn=1||nodes=300:ppn=20, walltime=4:00:00:00, pmem=6GB / pmem=12GB (256GB) / pmem=24GB (512GB)||single user|
|express||worker / interactive||nodes=1:ppn=1, walltime=15:00, pmem=1000mb||nodes=1:ppn=1||nodes=44:ppn=20, walltime=15:00, pmem=6GB||single user|
|gpu||gpu||nodes=1:ppn=1, walltime=15:00, pmem=1000mb||nodes=1:ppn=1:gpus=1||nodes=1:ppn=64:gpus=8, walltime=4:00:00:00, pmem=4GB||SHARED (SMT enabled)|
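The pmem maxima for the worker nodes in the table can be sanity-checked against the node memory. The sketch below assumes that pmem is a per-process limit, so a node has to provide ppn * pmem in total; the function name `fits_node` is hypothetical, and memory needed by the operating system is ignored.

```python
def fits_node(ppn: int, pmem_gb: int, node_ram_gb: int) -> bool:
    """Return True if ppn processes with pmem_gb each fit into node RAM.

    Assumption: pmem is requested per process, so the total demand on
    one node is ppn * pmem_gb.
    """
    return ppn * pmem_gb <= node_ram_gb


if __name__ == "__main__":
    # Worker node sizes and pmem maxima taken from the table above.
    for ram, pmem in ((128, 6), (256, 12), (512, 24)):
        print(f"{ram} GB node, ppn=20, pmem={pmem}GB:",
              fits_node(20, pmem, ram))
```

A request such as ppn=20 with pmem=7GB would need 140 GB and therefore cannot be placed on a 128 GB node.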
1.2 Interactive Jobs
Interactive jobs must NOT run on the login nodes. However, resources for interactive jobs can be requested using msub. The following example starts an interactive session on one compute node with one core for one hour:
$ msub -l nodes=1:ppn=1 -l walltime=1:00:00 -I
The option "-I" means "interactive job". After executing this command, wait until the queuing system has granted you the requested resources. Once they are granted, you will be logged in to the allocated compute node automatically.
If you use applications or tools which provide a GUI, enable X-forwarding for your interactive session with:
# use -Y for ssh X-forwarding
$ ssh -l <uid> -Y login.nemo.uni-freiburg.de
# use -X for X-forwarding
$ msub -l nodes=1:ppn=1,walltime=1:00:00 -I -X
Once the walltime limit has been reached you will be automatically logged out from the compute node.
The option "-V" exports all environment variables to the compute node of the interactive session. If you want to test your jobs, please avoid using "-V", since it alters your job environment.
1.2.1 Interactive GPU Jobs
If you add the :gpus flag to your interactive jobs, you will get a node with a GPU:
$ msub -l nodes=1:ppn=1:gpus=1 -I
1.3 Express Jobs
You can use the express queue to test batch jobs:
$ msub -q express -l nodes=1:ppn=20 test.moab
For defaults and maximum usage see table in 1.1 Limits and Queues.
1.4 GPU Jobs
You can use the gpu queue if you need graphics cards for your jobs. The GPU node has 32 cores with simultaneous multithreading (SMT) enabled, so up to 64 processes can be used. This node can be used by multiple users simultaneously (SHARED mode).
$ msub -q gpu -l nodes=1:ppn=8:gpus=1 gpu.moab # minimal job description
$ msub -q gpu -l nodes=1:ppn=64:gpus=8 gpu.moab # maximum resources
For defaults and maximum usage see table in 1.1 Limits and Queues.
1.5 NEW AMD ROME Nodes
Four nodes with AMD Rome processors are available on NEMO for evaluation purposes. Each node has 128 physical cores, 512 GiB of RAM and one Nvidia T4 GPU. SMT is currently disabled. To schedule your jobs on these machines please add
-l feature=amd, for the Tesla card add
$ msub -l nodes=1:ppn=1 amd.moab # minimal job description
$ msub -l nodes=1:ppn=128:gpus=1,pmem=4G amd.moab # maximum resources
1.6 Monitor Running Jobs
Once your jobs are running you can log in to the nodes on which they run. The nodes are listed in the output of checkjob.
$ checkjob 12345
...
Allocated Nodes: [n3101.nemo.privat:20][n3102.nemo.privat:20]
...
Then you can ssh into these nodes. The short host name is sufficient. Please log out when you are finished.
$ ssh n3101
You can also use the program pdsh to monitor your jobs non-interactively. pdsh determines where your job is running and executes a command on all of these nodes.
Non-interactive with ssh:
# run 'ps aux | grep <myjob>' on node n3101
$ ssh n3101 'ps aux | grep <myjob>'
Non-interactive with pdsh:
# run 'ps aux | grep <myjob>' on all nodes corresponding to jobid '12345'
$ pdsh -j 12345 'ps aux | grep <myjob>'
n3101: fr_uid 125068 101 0.0 39040 1684 ? Sl 12:15 0:25 <myjob>
n3102: fr_uid 125068 101 0.0 39040 1684 ? Sl 12:15 0:25 <myjob>
# run 'killall <myjob>' on all nodes corresponding to jobid '12345'
$ pdsh -j 12345 killall <myjob>
# works with array jobs as well
$ pdsh -j 12346 'ps aux | grep <myjob>'
n3103: fr_uid 125068 101 0.0 39040 1684 ? Sl 12:15 0:25 <myjob>
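If you want to process the node list in a script of your own, the short host names can be extracted from the checkjob output. This is a minimal sketch that assumes the "Allocated Nodes" entries look exactly like the sample shown in this section, i.e. `[host.domain:cores]`; the function name `allocated_nodes` is hypothetical.

```python
import re
from typing import List


def allocated_nodes(checkjob_output: str) -> List[str]:
    """Extract short host names from entries like [n3101.nemo.privat:20].

    Assumes the bracketed host:cores format shown in the sample output.
    """
    hosts = re.findall(r"\[([^\]:\s]+):\d+\]", checkjob_output)
    # Keep only the part before the first dot; the short name is
    # sufficient for ssh on the cluster.
    return [host.split(".")[0] for host in hosts]


if __name__ == "__main__":
    sample = "Allocated Nodes: [n3101.nemo.privat:20][n3102.nemo.privat:20]"
    for node in allocated_nodes(sample):
        print(node)
```

For monitoring alone, `pdsh -j <jobid>` as shown above is simpler, since it looks up the nodes for you.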
2 Simple parallel jobs with job arrays
A typical method to create "embarrassingly parallel" compute tasks is to slice a large data set into equally sized partitions and create jobs that work on their respective partition.
The manual management of hundreds of jobs becomes difficult, though. Therefore, using the job array feature is recommended.
2.1 Job array example
Create the directory $HOME/arrayjob and the job description file $HOME/arrayjob/arrayjob.moab
#!/bin/bash
#MOAB -N ARRAYJOB
#
# This is a workaround for a known bug.
# Array jobs need to be given the output directory
cd $HOME/arrayjob
# Now call the program which does the work depending on the job id
python divide-and-conquer.py $MOAB_JOBARRAYINDEX
Create the worker program (Python in this example):
#!/usr/bin/python
import sys

def main(argv):
    print("Executing work according to array index ", argv)

if __name__ == "__main__":
    main(sys.argv)
You can now submit the job for single array indices or index ranges:
$ msub -t 11 arrayjob.moab
$ msub -t 23-42 arrayjob.moab
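To slice a data set into near-equal partitions, the worker can map its array index to a range of items. The following is a sketch only: the function name `partition` and the item count of 103 are illustrative, and the index range 23-42 matches the second msub call above. In the real worker the index would arrive as a command line argument, as in the job script.

```python
from typing import Tuple


def partition(n_items: int, first_index: int, last_index: int,
              index: int) -> Tuple[int, int]:
    """Return the half-open range (start, stop) for one array index.

    The items 0..n_items-1 are split into contiguous slices, one per
    index in first_index..last_index. Slice sizes differ by at most one.
    """
    if not first_index <= index <= last_index:
        raise ValueError(f"index {index} outside {first_index}-{last_index}")
    n_tasks = last_index - first_index + 1
    k = index - first_index              # zero-based task number
    base, extra = divmod(n_items, n_tasks)
    # The first `extra` tasks receive one additional item each.
    start = k * base + min(k, extra)
    stop = start + base + (1 if k < extra else 0)
    return start, stop


if __name__ == "__main__":
    # Hypothetical data set of 103 items, submitted with 'msub -t 23-42'.
    for idx in (23, 24, 42):
        print(idx, partition(103, 23, 42, idx))
```

Passing the first and last index explicitly keeps the worker independent of the range chosen at submission time.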