Batch Jobs Moab

<br>
<div style="text-align:center;font-size:120%;color:red;">'''This article is partly outdated and currently under revision!'''</div>
<br>
= Moab® HPC Workload Manager =

== Specification ==

The Moab Cluster Suite is a '''cluster workload management package''', available from [http://www.adaptivecomputing.com/ Adaptive Computing, Inc.], that integrates the scheduling, managing, monitoring and reporting of cluster workloads. The Moab Cluster Suite simplifies and unifies management across one or multiple hardware, operating system, storage, network, license and resource manager environments.
<br>
Any kind of calculation on the compute nodes of a [[HPC_infrastructure_of_Baden_Wuerttemberg|bwHPC cluster of tier 2 or 3]] requires the user to define the calculation as a sequence of commands or a single command, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the '''batch job''', to a resource and workload managing software. All bwHPC clusters of tier 2 and 3 have the workload managing software MOAB installed. Therefore any job submission by the user has to be carried out with commands of the MOAB software. MOAB queues and runs user jobs based on fair sharing policies.
 
This page only describes options and commands that can be used on all bwHPC clusters. Options specific to a single cluster are described in the following separate articles:
 
 
* [[Batch Jobs - bwUniCluster Features]]
* [[Batch Jobs - bwForCluster Chemistry Features]]
* [[Batch Jobs - ForHLR Features]]
 
 
 
== Moab Commands ==

Overview of the most important MOAB commands:
{| width=750px class="wikitable"
! MOAB commands !! Brief explanation
|-
| [[#Job Submission : msub|msub]] || submits a job and queues it in an input queue
|-
| [[#Detailed job information : checkjob|checkjob]] || displays detailed job state information
|-
| [[#List of your submitted jobs : showq|showq]] || displays information about active, eligible, blocked, and/or recently completed jobs
|-
| [[#Shows free resources : showbf|showbf]] || shows what resources are available for immediate use
|-
| [[#Start time of job or resources : showstart|showstart]] || returns the start time of a submitted job or of requested resources
|-
| [[#Canceling own jobs : canceljob|canceljob]] || cancels a job
|}
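<br>
The typical life cycle of a batch job, using only the commands from the table above, might look like the following sketch (the job ID and queue name are placeholders, for illustration only):
<pre>
$ msub -q singlenode job.sh     # submit the job script, msub prints the job ID
$ showq                         # check whether the job is eligible, blocked or running
$ checkjob <jobID>              # display detailed state information for the job
$ canceljob <jobID>             # cancel the job if it is no longer needed
</pre>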
== Job Submission : msub ==

Batch jobs are submitted with the command '''msub'''. The main purpose of the '''msub''' command is to specify the resources that are needed to run the job. '''msub''' will then queue the batch job. However, when the batch job starts depends on the availability of the requested resources and on the fair sharing value.

=== msub Command Parameters ===

The syntax and use of '''msub''' can be displayed via:
<pre>
$ man msub
</pre>
msub options can be used from the command line or in your job script.

{| width=750px class="wikitable"
! colspan="3" | msub Options
|-
! Command line
! Script
! Purpose
|- style="vertical-align:top;"
| -l ''resources''
| #MSUB -l ''resources''
| Defines the resources that are required by the job.<br>
See the description below for this important flag.
|- style="vertical-align:top;"
| -N ''name''
| #MSUB -N ''name''
| Gives a user specified name to the job.
|- style="vertical-align:top;"
| -o ''filename''
| #MSUB -o ''filename''
| Defines the file name to be used for the standard output stream of the<br>
batch job. By default the file with the defined file name is placed under your<br>
job submit directory. To place it in a different location, expand the<br>
file name by the relative or absolute path of the destination.
|- style="vertical-align:top;"
| -q ''queue''
| #MSUB -q ''queue''
| Defines the queue class.
|- style="vertical-align:top;"
| -v ''variable=arg''
| #MSUB -v ''variable=arg''
| Expands the list of environment variables that are exported to the job.
|- style="vertical-align:top;"
| -S ''Shell''
| #MSUB -S ''Shell''
| Declares the shell (state path+name, e.g. /bin/bash) that interprets<br>
the job script.
|- style="vertical-align:top;"
| -m bea
| #MSUB -m bea
| Send email when the job begins (b), ends (e) or aborts (a).
|- style="vertical-align:top;"
| -M name@uni.de
| #MSUB -M name@uni.de
| Send email to the specified email address "name@uni.de".
<!--
|- style="vertical-align:top;"
| -V
| #MSUB -V
| Declares that all environment variables in the msub environment are exported<br>
to the batch job.
-->
|}
<br>
For cluster specific msub options, read:
* [[Batch_Jobs_-_bwUniCluster_Features#msub Command|bwUniCluster msub options]]
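
For illustration, a minimal job script skeleton that combines several of the options from the table above in script form might look like this (the resource values, the job name and the program call ''./my_program'' are placeholders only, not a recommendation for any particular cluster):
<source lang="bash">
#!/bin/bash
#MSUB -N example_job            # user specified job name
#MSUB -l nodes=1:ppn=1          # one process on one node
#MSUB -l walltime=00:30:00      # 30 minutes of wall clock time
#MSUB -l pmem=1000mb            # 1000 MB of memory per process
#MSUB -m bea                    # email on begin, end and abort
#MSUB -M name@uni.de            # email address (placeholder)

./my_program                    # the actual calculation
</source>
Submitting such a script then only requires ''msub job.sh'', optionally with a queue class selected via ''-q''.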
==== msub -l ''resource_list'' ====

The '''-l''' option is one of the most important msub options. It is used to specify a number of resource requirements for your job. Multiple resource strings are separated by commas.

{| width=750px class="wikitable"
! colspan="2" | msub -l ''resource_list''
|-
! resource
! Purpose
|- style="vertical-align:top;"
<!-- temporarily removed
| -l procs=8
| Number of processes, distribution over nodes will be done by MOAB
|- style="vertical-align:top;"
-->
| -l nodes=2:ppn=16
| Number of nodes and number of processes per node
|- style="vertical-align:top;"
| -l walltime=600<br>-l walltime=01:30:00
| Wall-clock time. Default units are seconds.<br>
HH:MM:SS format is also accepted.
|- style="vertical-align:top;"
| -l pmem=1000mb
| Maximum amount of physical memory used by any single process of the job.<br>
Allowed units are kb, mb, gb. Be aware that '''processes''' are either ''MPI tasks''<br>
if running MPI parallel jobs or ''threads'' if running multi-threaded jobs.
|- style="vertical-align:top;"
| -l mem=1000mb
| Maximum amount of physical memory used by the job.<br>
Allowed units are kb, mb, gb. Be aware that this memory value is the accumulated<br>
memory for all ''MPI tasks'' or all ''threads'' of the job.
|- style="vertical-align:top;"
| -l advres=''res_name''
| Specifies the reservation "res_name" required to run the job.
|- style="vertical-align:top;"
| -l naccesspolicy=''policy''
| Specifies how node resources should be accessed, e.g. ''-l naccesspolicy=singlejob''<br>
reserves all requested nodes for the job exclusively.<br>
Attention, if you request ''nodes=1:ppn=4'' together with ''singlejob'' you will be<br>
charged for the maximum number of cores of the node.
|}
<br>
Note that the compute nodes do not have any SWAP space, thus <span style="color:red;font-size:105%;">DO NOT specify '-l vmem' or '-l pvmem'</span> or your jobs will not start.
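
As a sketch of how these resource strings are combined on the command line, a request for two nodes with 16 processes each, 1000 MB per process and 90 minutes of wall clock time could be written as follows (the values are placeholders only):
<pre>
$ msub -l nodes=2:ppn=16,walltime=01:30:00,pmem=1000mb job.sh
</pre>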
==== msub -q ''queues'' ====

Queue classes define maximum resources such as walltime, nodes and processes per node, and the partition of the compute system. Note that the queue settings of the bwHPC clusters are '''not identical''', but differ due to their different prerequisites, such as HPC performance, scalability and throughput levels. Details can be found here:
* [[Batch_Jobs_-_bwUniCluster_Features#msub_-q_queues|bwUniCluster queue settings]]
* [[Batch_Jobs_-_ForHLR_Features#msub_-q_queues|ForHLR queue settings]]
=== msub Examples ===

''Hint for JUSTUS users:'' in the following examples instead of '''singlenode''' and '''fat''' use '''short''' and '''long''', respectively!
==== Serial Programs ====

To submit a serial job that runs the script '''job.sh''' and that requires 5000 MB of main memory and 3 hours of wall clock time,

a) execute:
<pre>
$ msub -q singlenode -N test -l nodes=1:ppn=1,walltime=3:00:00,pmem=5000mb job.sh
</pre>
or

b) add after the initial line of your script '''job.sh''' the lines (here with a high memory request):
<source lang="bash">
#MSUB -l nodes=1:ppn=1
#MSUB -l walltime=3:00:00
#MSUB -l pmem=200000mb
#MSUB -N test
</source>
and execute the modified script with the command line option -q fat (with -q singlenode a maximum of pmem=64000mb is possible):
<pre>
$ msub -q fat job.sh
</pre>

Note that msub command line options overrule script options.
==== Multithreaded Programs ====

Multithreaded programs operate faster than serial programs on CPUs with multiple cores. Moreover, multiple threads of one process share resources such as memory.

For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

To submit a batch job called ''OpenMP_Test'' that runs a fourfold threaded program ''omp_executable'' which requires 6000 MByte of total physical memory and a total wall clock time of 3 hours:
<!-- 2014-01-29, at the moment submission of executables does not work, SLURM has to be instructed to generate a wrapper
a) execute:
<pre>
$ msub -v OMP_NUM_THREADS=4 -N test -l nodes=1:ppn=4,walltime=3:00:00,mem=6000mb omp_program
</pre>
or
-->
* generate the script '''job_omp.sh''' containing the following lines:
<source lang="bash">
#!/bin/bash
#MSUB -l nodes=1:ppn=4
#MSUB -l walltime=3:00:00
#MSUB -l mem=6000mb
#MSUB -v EXECUTABLE=./omp_executable
#MSUB -v MODULE=<placeholder>
#MSUB -N OpenMP_Test

#Usually you should set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

module load ${MODULE}
export OMP_NUM_THREADS=${MOAB_PROCCOUNT}
echo "Executable ${EXECUTABLE} running on ${MOAB_PROCCOUNT} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe
</source>
With the Intel compiler the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If necessary, replace <placeholder> with the required modulefile to enable the OpenMP environment, then execute the script '''job_omp.sh''' adding the queue class ''singlenode'' as msub option:
<pre>
$ msub -q singlenode job_omp.sh
</pre>

Note that msub command line options overrule script options, e.g.,
<pre>
$ msub -l mem=2000mb -q singlenode job_omp.sh
</pre>
overwrites the script setting of 6000 MByte with 2000 MByte.
==== MPI Parallel Programs ====

MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., '''MPI tasks''', run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.

Multiple MPI tasks can not be launched by the MPI parallel program itself but via '''mpirun''', e.g. 4 MPI tasks of ''my_par_program'':
<pre>
$ mpirun -n 4 my_par_program
</pre>
However, this command can '''not''' be directly included in your '''msub''' command for submitting as a batch job to the compute cluster, [[#Handling job script options and arguments|see below]].

Generate a wrapper script ''job_ompi.sh'' for '''OpenMPI''' containing the following lines:
<source lang="bash">
#!/bin/bash
module load mpi/openmpi/<placeholder_for_version>
# Use when loading OpenMPI in version 1.8.x
mpirun --bind-to core --map-by core -report-bindings my_par_program
# Use when loading OpenMPI in an old version 1.6.x
mpirun -bind-to-core -bycore -report-bindings my_par_program
</source>
Attention: Do NOT add mpirun options -n <number_of_processes> or any other option defining processes or nodes, since MOAB instructs mpirun about the number of processes and the node hostnames. ALWAYS use the mpirun options --bind-to core and --map-by core|socket|node (OpenMPI version 1.8.x). Please type mpirun --help for an explanation of the different arguments of the mpirun option --map-by.

Considering 4 OpenMPI tasks on a single node, each requiring 1000 MByte, and running for 1 hour, execute:
<pre>
$ msub -q singlenode -l nodes=1:ppn=4,pmem=1000mb,walltime=01:00:00 job_ompi.sh
</pre>

The policy on batch jobs with Intel MPI on bwUniCluster can be found here:
* [[Batch_Jobs_-_bwUniCluster_Features#Intel_MPI_without_Multithreading|bwUniCluster: Intel MPI parallel Programs]]
==== Multithreaded + MPI parallel Programs ====

Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary, MPI tasks do not share memory but can be spawned over different nodes.

Multiple MPI tasks using '''OpenMPI''' must be launched by the MPI parallel program '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

'''For OpenMPI''', a job script ''job_ompi_omp.sh'' that submits a batch job running an MPI program with 4 tasks and a fivefold threaded program ''ompi_omp_program'', requiring 6000 MByte of physical memory per process/thread (using 5 threads per MPI task you will get 5*6000 MByte = 30000 MByte per MPI task) and a total wall clock time of 3 hours, looks like:
<source lang="bash">
#!/bin/bash
#MSUB -l nodes=2:ppn=10
#MSUB -l walltime=03:00:00
#MSUB -l pmem=6000mb
#MSUB -v MPI_MODULE=mpi/ompi
#MSUB -v OMP_NUM_THREADS=5
#MSUB -v MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=5 -report-bindings"
#MSUB -v EXECUTABLE=./ompi_omp_program
#MSUB -N test_ompi_omp

module load ${MPI_MODULE}
TASK_COUNT=$((${MOAB_PROCCOUNT}/${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${MOAB_PROCCOUNT} cores with ${TASK_COUNT} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${TASK_COUNT} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Execute the script '''job_ompi_omp.sh''' adding the queue class ''multinode'' to your msub command:
<pre>
$ msub -q multinode job_ompi_omp.sh
</pre>
With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
With the option ''--map-by socket:PE=<value>'' (neighbored) MPI tasks will be attached to different sockets and each MPI task is bound to the number of cpus specified in <value>. <value> must be set to ${OMP_NUM_THREADS}.

Old OpenMPI version 1.6.x:
With the mpirun option ''-bind-to-core'' MPI tasks and OpenMP threads are bound to physical cores.
With the option ''-bysocket'' (neighbored) MPI tasks will be attached to different sockets, and the option ''-cpus-per-proc <value>'' binds each MPI task to the number of cpus specified in <value>. <value> must be set to ${OMP_NUM_THREADS}.

The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.

The mpirun options '''--bind-to core''' and '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program. (OpenMPI version 1.6.x: the mpirun options '''-bind-to-core''', '''-bysocket|-bynode''' and '''-cpus-per-proc <value>''' should always be used when running a multithreaded MPI program.)

The policy on batch jobs with Intel MPI + Multithreading on bwUniCluster can be found here:
* [[Batch_Jobs_-_bwUniCluster_Features#Intel_MPI_with_Multithreading|bwUniCluster: Intel MPI parallel Programs with Multithreading]]
==== Chain jobs ====

A job chain is a sequence of jobs where each job automatically starts its successor. Chain job handling differs on the bwHPC clusters. See the cluster-specific pages:
* [[Batch Jobs - bwUniCluster Features]]
* [[Batch Jobs - bwForCluster Chemistry Features]]
<!--* [[Batch Jobs - ForHLR Features]]-->
==== Interactive Jobs ====

Policies of interactive batch jobs are cluster specific and can be found here:
* [[Batch_Jobs_-_bwUniCluster_Features#Interactive_Jobs|bwUniCluster interactive jobs]]
* [[Batch_Jobs_-_ForHLR_Phase_I_Features#Interactive_Jobs|ForHLR interactive jobs]]
=== Handling job script options and arguments ===

Job script options and arguments such as
<pre>
$ ./job.sh -n 10
</pre>
can not be passed directly to the msub command, since msub would interpret them as its own command line options instead of handing them to ''job.sh'' <small>(as $1 = -n, $2 = 10)</small>.

'''Solution A:'''

Submit a wrapper script, e.g. wrapper.sh:
<pre>
$ msub -q singlenode wrapper.sh
</pre>
which simply contains all options and arguments of job.sh. The script wrapper.sh would at least contain the following lines:
<source lang="bash">
#!/bin/bash
./job.sh -n 10
</source>

'''Solution B:'''

Add after the header of your '''BASH''' script job.sh the following lines:
<source lang="bash">
## check if $SCRIPT_FLAGS is "set"
if [ -n "${SCRIPT_FLAGS}" ] ; then
   ## but if positional parameters are already present
   ## we are going to ignore $SCRIPT_FLAGS
   if [ -z "${*}" ] ; then
      set -- ${SCRIPT_FLAGS}
   fi
fi
</source>

These lines modify your BASH script to read options and arguments from the environment variable $SCRIPT_FLAGS. Now submit your script job.sh as follows:
<pre>
$ msub -q singlenode -v SCRIPT_FLAGS='-n 10' job.sh
</pre>
<!-- For advanced users: [[generalized version of solution B]] if job script arguments contain whitespaces. -->
=== ForHLR Batch-Jobs ===

* [[Batch_Jobs_-_ForHLR_Features|Additional ForHLR batch jobs information]]
=== Moab Environment Variables ===

Once an eligible compute job starts on the compute system, MOAB adds the following variables to the job's environment:
{| width=800px class="wikitable"
! colspan="2" | MOAB variables
|-
! Environment variable
! Description
|-
| MOAB_CLASS
| Class name
|-
| MOAB_GROUP
| Group name
|-
| MOAB_JOBID
| Job ID
|-
| MOAB_JOBNAME
| Job name
|-
| MOAB_NODECOUNT
| Number of nodes allocated to the job
|-
| MOAB_PARTITION
| Partition name the job is running in
|-
| MOAB_PROCCOUNT
| Number of processors allocated to the job
|-
| MOAB_SUBMITDIR
| Directory of job submission
|-
| MOAB_USER
| User name
|}
<font color=red size=+2>Attention!</font>
<br>
<font color=green>Most scientific programs available for HPC systems are able to determine all essential environment settings on their own.
<br>
These programs identify the underlying resource management system ([[#TORQUE Resource Manager|TORQUE]]/[[#Slurm Resource Manager|Slurm]]) and use the correct variables.
<br>
But a few programs still need 'msub' command line parameters like '''-np''' 'number-of-cores...' (for example). In this case use the [[#TORQUE Resource Manager|TORQUE]] or [[#Slurm Resource Manager|Slurm]] environments only.</font>
<br>
<br>
<u>Recapitulating:</u>
* The MOAB environment variables are for your own convenience only!
* It is not guaranteed that the contents of the MOAB variables are always accurate.
* Do not use them in your job scripts!
* '''Hence use the [[#TORQUE Resource Manager|TORQUE]] or [[#Slurm Resource Manager|Slurm]] environments instead.'''
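
If you still want to record the MOAB variables, e.g. for logging or debugging purposes only, a minimal sketch (using only the variables from the table above and not basing any job logic on them) could look like this:
<source lang="bash">
#!/bin/bash
# Log a few MOAB variables for later inspection only; do not control the job with them.
echo "Job ${MOAB_JOBID} (${MOAB_JOBNAME}) submitted from ${MOAB_SUBMITDIR}"
echo "Running as ${MOAB_USER} in class ${MOAB_CLASS} on ${MOAB_PROCCOUNT} processors"
</source>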
== Start time of job or resources : showstart ==

The following command can be used by any user to display the estimated start time of a job, based on a number of analysis types: historical usage, earliest available reservable resources, and priority based backlog. To show the estimated start time of job <job_ID> enter:
<pre>
$ showstart -e all <job_ID>
</pre>
Furthermore, the start time of resource demands, e.g. 16 processes @ 12 h, can be displayed via:
<pre>
$ showstart -e all 16@12:00:00
</pre>
For further options of showstart read the manpage of showstart:
<pre>
$ man showstart
</pre>
== List of your submitted jobs : showq ==

The following command displays information about your active, eligible, blocked, and/or recently completed jobs:
<pre>
$ showq
</pre>
The summary of your active jobs shows how many of your jobs are running, how many processors are in use by your jobs and how many nodes are in use by '''all''' active jobs.

For further options of showq read the manpage of showq:
<pre>
$ man showq
</pre>
== Shows free resources : showbf ==

The following commands display what resources are available for immediate use, either for the whole partition or for the queues "singlenode", "multinode" and "fat" respectively:
<pre>
$ showbf
$ showbf -c singlenode
$ showbf -c multinode
$ showbf -c fat
</pre>
For further options of showbf read the manpage of showbf:
<pre>
$ man showbf
</pre>
== Detailed job information : checkjob ==

''checkjob <jobID>'' displays detailed job state information and diagnostic output for the (finished) job ''<jobID>'':
<pre>
$ checkjob <jobID>
</pre>
The returned output for the finished job ID uc1.000000 reads:
<pre>
job uc1.000000

AName: test.sh
State: Completed
Completion Code: 0  Time: Thu Jul 31 16:03:32
Creds:  user:XXXX  group:YYY  account:ZZZ  class:develop
WallTime:   00:01:06 of 00:10:00
SubmitTime: Thu Jul 31 16:02:18
  (Time Queued  Total: 00:00:08  Eligible: 00:03:41)

TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: uc1
Memory >= 4000M  Disk >= 0  Swap >= 0
Dedicated Resources Per Task: PROCS: 1  MEM: 4000M
NodeSet=ONEOF:FEATURE:[NONE]

Allocated Nodes:
[uc1n459:1]

SystemID:   uc1
SystemJID:  uc1.000000

IWD:            /pfs/data1/home/ZZZ/YYY/XXX/bwUniCluster
SubmitDir:      /pfs/data1/home/ZZZ/YYY/XXX/bwUniCluster
Executable:     /opt/moab/spool/moab.job.jCLed6

StartCount:     1
Execution Partition:  uc1
Flags:          GLOBALQUEUE
StartPriority:  5321
</pre>
For further options of checkjob read the manpage of checkjob:
<pre>
$ man checkjob
</pre>
== Blocked job information : checkjob -v ==

<pre>
$ checkjob -v <jobID>
</pre>
If your job is blocked, do not delete it!

A blocked job has hit a limit and will become idle again when resources become free. The "-v" (verbose) mode of 'checkjob' also shows a message "BLOCK MSG:" with more details, e.g.
<pre>
BLOCK MSG: job <jobID> violates active SOFT MAXPROC limit of 750 for acct mannheim  partition ALL (Req: 160  InUse: 742) (recorded at last scheduling iteration)
</pre>
In this case the job has reached the account limit of mannheim while requesting 160 cores when 742 were already in use.
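
If you only want to list which of your jobs are currently blocked, the blocked jobs view of showq can help; a short sketch, assuming the showq option -b is available in your Moab installation:
<pre>
$ showq -b
</pre>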
   
== Canceling own jobs : canceljob ==

''canceljob <jobID>'' cancels your own job with ''<jobID>''.
<pre>
$ canceljob <jobID>
</pre>
Note that only your own jobs can be cancelled. The command:
<pre>
$ mjobctl -c <jobID>
</pre>
has the same effect as ''canceljob <jobID>''.
<br>
<br>
 
= Resource Managers =

== TORQUE Resource Manager ==

The '''T'''erascale '''O'''pen-source '''R'''esource and '''QUE'''ue Manager ([http://www.adaptivecomputing.com/products/open-source/torque/ TORQUE]) is a distributed resource manager providing control over batch jobs and distributed compute nodes. TORQUE can integrate with the non-commercial Maui Cluster Scheduler or the commercial Moab Workload Manager to improve overall utilization, scheduling and administration on a cluster.

=== Batch Job Variables : bwForCluster (TORQUE) ===
* [[Batch_Jobs_-_bwForCluster_Features|bwForCluster batch job variables (in progress)]]
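
On clusters where TORQUE is the resource manager, job scripts typically rely on the standard TORQUE environment variables instead of the MOAB ones; a minimal sketch (PBS_O_WORKDIR, PBS_JOBID and PBS_NODEFILE are standard TORQUE variables, the program call is a placeholder):
<source lang="bash">
#!/bin/bash
# Standard TORQUE variables, set by the resource manager rather than by MOAB:
cd ${PBS_O_WORKDIR}                 # change to the directory the job was submitted from
echo "Job ${PBS_JOBID} running on the following nodes:"
cat ${PBS_NODEFILE}                 # list of nodes allocated to this job
./my_program                        # placeholder for the actual calculation
</source>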

== Slurm Resource Manager ==

The Slurm Resource and Workload Manager, formerly known as the '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement ([http://slurm.net/about/ SLURM]), or Slurm for short, is a free and open-source job scheduler.

=== Batch Job Variables : bwUniCluster (Slurm) ===
* [[Batch_Jobs_-_bwUniCluster_Features|bwUniCluster batch job variables]]
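
Analogously, on clusters where Slurm is the resource manager, the standard Slurm variables are the ones to use in job scripts; a minimal sketch (SLURM_SUBMIT_DIR, SLURM_JOB_ID and SLURM_NTASKS are standard Slurm variables, the program call is a placeholder):
<source lang="bash">
#!/bin/bash
# Standard Slurm variables, set by the resource manager rather than by MOAB:
cd ${SLURM_SUBMIT_DIR}              # directory the job was submitted from
echo "Job ${SLURM_JOB_ID} running with ${SLURM_NTASKS} tasks"
./my_program                        # placeholder for the actual calculation
</source>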
<br>
<br>
 
----
[[Category:bwUniCluster|Batch Jobs - General Features]][[Category:ForHLR Phase I|Batch Jobs - General Features]][[Category:BwForCluster Chemistry|Batch Jobs - General Features]]
