= Moab® HPC Workload Manager =

== Specification ==
The Moab Cluster Suite is a '''cluster workload management package''', available from [http://www.adaptivecomputing.com/ Adaptive Computing, Inc.], that integrates the scheduling, managing, monitoring and reporting of cluster workloads. Moab Cluster Suite simplifies and unifies management across one or multiple hardware, operating system, storage, network, license and resource manager environments.
<br>
Any kind of calculation on the compute nodes of a [[HPC_infrastructure_of_Baden_Wuerttemberg|bwHPC cluster of tier 2 or 3]] requires the user to define the calculation as a sequence of commands or a single command, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e., the '''batch job''', to a resource and workload managing software. All bwHPC clusters of tier 2 and 3 have the workload managing software MOAB installed. Therefore any job submission by the user is to be executed via commands of the MOAB software. MOAB queues and runs user jobs based on fair sharing policies.
<br>
== Moab Commands (excerpt) ==
Some of the most used Moab commands for non-administrators working on an HPC-C5 cluster.

{| width=750px class="wikitable"
! MOAB commands !! Brief explanation
|-
| [[#Job Submission : msub|msub]] || Submits a job and queues it in an input queue [http://docs.adaptivecomputing.com/mwm/6-1-9/Content/commands/msub.html msub]
|-
| [[#Detailed job information : checkjob|checkjob]] || Displays detailed job state information [http://docs.adaptivecomputing.com/mwm/6-1-9/Content/commands/checkjob.html checkjob]
|-
| [[#List of your submitted jobs : showq|showq]] || Displays information about active, eligible, blocked, and/or recently completed jobs [http://docs.adaptivecomputing.com/mwm/6-1-9/Content/commands/showq.html showq]
|-
| [[#Shows free resources : showbf|showbf]] || Shows what resources are available for immediate use [http://docs.adaptivecomputing.com/mwm/6-1-9/Content/commands/showbf.html showbf]
|-
| [[#Start time of job or resources : showstart|showstart]] || Returns the start time of a submitted job or of requested resources [http://docs.adaptivecomputing.com/mwm/6-1-9/Content/commands/showstart.html showstart]
|-
| [[#Canceling own jobs : canceljob|canceljob]] || Cancels a job (obsolete!) [http://docs.adaptivecomputing.com/mwm/6-1-9/Content/commands/canceljob.html canceljob]
|-
| [[#Moab Job Control : mjobctl|mjobctl]] || Cancels a job and offers more job control options [http://docs.adaptivecomputing.com/mwm/6-1-9/Content/commands/mjobctl.html mjobctl]
|}
* [<font color=#D98E00>'command as a link'</font>] = additional documentation, external link
* [https://computing.llnl.gov/tutorials/moab/#References Using Moab References]
* [http://docs.adaptivecomputing.com/mwm/6-1-9/Content/a.gcommandoverview.html Moab Commands]
== Job Submission : msub ==
Batch jobs are submitted with the command '''msub'''. The main purpose of the '''msub''' command is to specify the resources that are needed to run the job. '''msub''' will then queue the batch job. However, the start of the batch job depends on the availability of the requested resources and the fair sharing value.
<br>

=== msub Command Parameters ===
The syntax and use of '''msub''' can be displayed via:
<pre>
$ man msub
</pre>
'''msub''' options can be used from the command line or in your job script.
{| width=750px class="wikitable"
! colspan="3" | msub Options
|-
! Command line
! Script
! Purpose
|- style="vertical-align:top;"
| -l ''resources''
| #MSUB -l ''resources''
| Defines the resources that are required by the job. See the description below for this important flag.
|- style="vertical-align:top;"
| -N ''name''
| #MSUB -N ''name''
| Gives a user specified name to the job.
|- style="vertical-align:top;"
| -o ''filename''
| #MSUB -o ''filename''
| Defines the file name to be used for the standard output stream of the batch job. By default the file with the defined file name is placed under your job submit directory. To place it under a different location, expand ''filename'' by the relative or absolute path of the destination.
|- style="vertical-align:top;"
| -q ''queue''
| #MSUB -q ''queue''
| Defines the queue class.
|- style="vertical-align:top;"
| -v ''variable=arg''
| #MSUB -v ''variable=arg''
| Expands the list of environment variables that are exported to the job.
|- style="vertical-align:top;"
| -S ''Shell''
| #MSUB -S ''Shell''
| Declares the shell (state path+name, e.g. /bin/bash) that interprets the job script.
|- style="vertical-align:top;"
| -m ''bea''
| #MSUB -m ''bea''
| Send email when the job begins (b), ends (e) or aborts (a).
|- style="vertical-align:top;"
| -M ''name@uni.de''
| #MSUB -M ''name@uni.de''
| Send email to the specified email address "name@uni.de".
|}
For cluster specific msub options, read:
* [[Batch_Jobs_-_bwUniCluster_Features#msub Command|bwUniCluster msub options]]
* [[Batch_Jobs_-_bwForCluster_Chemistry_Features|bwForCluster Chemistry msub options]]
* [[BwForCluster_MLS&WISO_Production_Batch_Jobs|bwForCluster MLS&WISO (Production) msub options]]
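For illustration, a job script header combining several of the options above might look as follows (job name, output file name and mail address are placeholders):
<source lang="bash">
#!/bin/bash
#MSUB -N my_analysis         # user specified job name
#MSUB -o my_analysis.out     # file for the standard output stream of the job
#MSUB -m bea                 # mail on job begin, end and abort
#MSUB -M name@uni.de         # placeholder address, as in the table above
</source>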
==== msub -l ''resource_list'' ====
The '''-l''' option is one of the most important msub options. It is used to specify a number of resource requirements for your job. Multiple resource strings are separated by commas.

{| width=750px class="wikitable"
! colspan="2" | msub -l ''resource_list''
|-
! resource
! Purpose
|- style="vertical-align:top;"
| -l nodes=2:ppn=16
| Number of nodes and number of processes per node
|- style="vertical-align:top;"
| -l walltime=600 <br> -l walltime=01:30:00
| Wall-clock time. Default units are seconds. HH:MM:SS format is also accepted.
|- style="vertical-align:top;"
| -l pmem=1000mb
| Maximum amount of physical memory used by any single process of the job. Allowed units are kb, mb, gb. Be aware that '''processes''' are either ''MPI tasks'' if running MPI parallel jobs or ''threads'' if running multithreaded jobs.
|- style="vertical-align:top;"
| -l mem=1000mb
| Maximum amount of physical memory used by the job. Allowed units are kb, mb, gb. Be aware that this memory value is the accumulated memory for all ''MPI tasks'' or all ''threads'' of the job.
|- style="vertical-align:top;"
| -l advres=''res_name''
| Specifies the reservation "res_name" required to run the job.
|- style="vertical-align:top;"
| -l naccesspolicy=''policy''
| Specifies how node resources should be accessed, e.g. ''-l naccesspolicy=singlejob'' reserves all requested nodes for the job exclusively. Attention: if you request ''nodes=1:ppn=4'' together with ''singlejob'', you will be charged for the maximum number of cores of the node.
|}
Note that the compute nodes do not have SWAP space, thus <span style="color:red;font-size:105%;">DO NOT specify '-l vmem' or '-l pvmem'</span> or your jobs will not start.
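For example, several of the resource strings above can be combined, separated by commas, into a single request (all values are illustrative):
<pre>
$ msub -l nodes=2:ppn=16,walltime=01:30:00,pmem=2000mb job.sh
</pre>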
==== msub -q ''queues'' ====
Queue classes define maximum resources such as walltime, nodes and processes per node, and the partition of the compute system. Note that the queue settings of the bwHPC clusters are '''not identical''' but differ due to their different prerequisites, such as HPC performance, scalability and throughput levels. Details can be found here:
* [[Batch_Jobs_-_bwUniCluster_Features#msub_-q_queues|bwUniCluster queue settings]]
* [[BwForCluster_NEMO_Specific_Batch_Features#msub_-q_queues|bwForCluster NEMO queue settings]]

=== msub Examples ===
<font color=green>'''WARNING: many of these examples use queue names like 'singlenode', 'multinode', 'fat' and 'develop' which are specific to the bwUniCluster only! Please refer to the other clusters' queue definitions for correct usage.'''</font>

''Hint for JUSTUS users:'' in the following examples use '''short''' and '''long''' instead of '''singlenode''' and '''fat''', respectively!
==== Serial Programs ====
To submit a serial job that runs the script '''job.sh''' and that requires 5000 MB of main memory and 3 hours of wall clock time,

a) execute:
<pre>
$ msub -q singlenode -N test -l nodes=1:ppn=1,walltime=3:00:00,pmem=5000mb job.sh
</pre>
or

b) add after the initial line of your script '''job.sh''' the lines (here with a high memory request):
<source lang="bash">
#MSUB -l nodes=1:ppn=1
#MSUB -l walltime=3:00:00
#MSUB -l pmem=200000mb
#MSUB -N test
</source>
and execute the modified script with the command line option ''-q fat'' (with ''-q singlenode'' a maximum of ''pmem=64000mb'' is possible):
<pre>
$ msub -q fat job.sh
</pre>
Note that msub command line options overrule script options.
==== Multithreaded Programs ====
Multithreaded programs operate faster than serial programs on CPUs with multiple cores.<br>
Moreover, multiple threads of one process share resources such as memory.
<br>
For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
<br>
To submit a batch job called ''OpenMP_Test'' that runs a fourfold threaded program ''omp_executable'', which requires 6000 MByte of total physical memory and a total wall clock time of 3 hours:
* generate the script '''job_omp.sh''' containing the following lines:
<source lang="bash">
#!/bin/bash
#MSUB -l nodes=1:ppn=4
#MSUB -l walltime=3:00:00
#MSUB -l mem=6000mb
#MSUB -v EXECUTABLE=./omp_executable
#MSUB -v MODULE=<placeholder>
#MSUB -N OpenMP_Test

# Usually you should set KMP_AFFINITY as follows;
# export KMP_AFFINITY=verbose,compact,1,0 additionally prints messages concerning the supported affinity.
# KMP_AFFINITY description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE
export KMP_AFFINITY=compact,1,0

module load ${MODULE}
export OMP_NUM_THREADS=${MOAB_PROCCOUNT}
echo "Executable ${EXECUTABLE} running on ${MOAB_PROCCOUNT} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe
</source>
Using the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If necessary, replace <placeholder> with the modulefile required to enable the OpenMP environment. Then execute the script '''job_omp.sh''', adding the queue class ''singlenode'' as msub option:
<pre>
$ msub -q singlenode job_omp.sh
</pre>
Note that msub command line options overrule script options, e.g.,
<pre>
$ msub -l mem=2000mb -q singlenode job_omp.sh
</pre>
overwrites the script setting of 6000 MByte with 2000 MByte.
==== MPI Parallel Programs ====
MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., '''MPI tasks''', run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
<br>
Multiple MPI tasks can not be launched by the MPI parallel program itself, but only via '''mpirun''', e.g. 4 MPI tasks of ''my_par_program'':
<pre>
$ mpirun -n 4 my_par_program
</pre>
However, this command can '''not''' be directly included in your '''msub''' command for submission as a batch job to the compute cluster, [[BwUniCluster_Batch_Jobs#Handling_job_script_options_and_arguments|see below]].

Generate a wrapper script ''job_ompi.sh'' for '''OpenMPI''' containing the following lines:
<source lang="bash">
#!/bin/bash
module load mpi/openmpi/<placeholder_for_version>
# Use when loading OpenMPI in version 1.8.x:
mpirun --bind-to core --map-by core -report-bindings my_par_program
# Use when loading OpenMPI in an old version 1.6.x:
mpirun -bind-to-core -bycore -report-bindings my_par_program
</source>
'''Attention:''' Do '''NOT''' add the mpirun option ''-n <number_of_processes>'' or any other option defining processes or nodes, since MOAB instructs mpirun about the number of processes and the node hostnames. '''ALWAYS''' use the MPI options '''''--bind-to core''''' and '''''--map-by core|socket|node''''' (OpenMPI version 1.8.x). Please type ''mpirun --help'' for an explanation of the different settings of the mpirun option ''--map-by''.
<br>
Considering 4 OpenMPI tasks on a single node, each requiring 1000 MByte, and running for 1 hour, execute:
<pre>
$ msub -q singlenode -l nodes=1:ppn=4,pmem=1000mb,walltime=01:00:00 job_ompi.sh
</pre>
The policy on batch jobs with Intel MPI on bwUniCluster can be found here:
* [[Batch_Jobs_-_bwUniCluster_Features#Intel_MPI_without_Multithreading|bwUniCluster: Intel MPI parallel Programs]]
==== Multithreaded + MPI parallel Programs ====
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary, MPI tasks do not share memory but can be spawned over different nodes.
<br>
Multiple MPI tasks using '''OpenMPI''' must be launched via '''mpirun'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
<br>
'''For OpenMPI''' a job script ''job_ompi_omp.sh'' that runs an MPI program with 4 tasks and a fivefold threaded program ''ompi_omp_program'', requiring 6000 MByte of physical memory per process/thread (using 5 threads per MPI task you will get 5*6000 MByte = 30000 MByte per MPI task) and a total wall clock time of 3 hours, looks like:
<source lang="bash">
#!/bin/bash
#MSUB -l nodes=2:ppn=10
#MSUB -l walltime=03:00:00
#MSUB -l pmem=6000mb
#MSUB -v MPI_MODULE=mpi/ompi
#MSUB -v OMP_NUM_THREADS=5
#MSUB -v MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=5 -report-bindings"
#MSUB -v EXECUTABLE=./ompi_omp_program
#MSUB -N test_ompi_omp

module load ${MPI_MODULE}
TASK_COUNT=$((${MOAB_PROCCOUNT}/${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${MOAB_PROCCOUNT} cores with ${TASK_COUNT} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${TASK_COUNT} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
</source>
Execute the script '''job_ompi_omp.sh''' adding the queue class ''multinode'' to your msub command:
<pre>
$ msub -q multinode job_ompi_omp.sh
</pre>
* With the mpirun option ''--bind-to core'' MPI tasks and OpenMP threads are bound to physical cores.
* With the option ''--map-by socket:PE=<value>'' (neighbored) MPI tasks will be attached to different sockets, and each MPI task is bound to the number of cpus specified in <value>. <value> must be set to ${OMP_NUM_THREADS}.
* Old OpenMPI version 1.6.x: with the mpirun option ''-bind-to-core'' MPI tasks and OpenMP threads are bound to physical cores; with the option ''-bysocket'' (neighbored) MPI tasks will be attached to different sockets, and the option ''-cpus-per-proc <value>'' binds each MPI task to the number of cpus specified in <value>. <value> must be set to ${OMP_NUM_THREADS}.
* The option ''-report-bindings'' shows the bindings between MPI tasks and physical cores.
* The mpirun options '''--bind-to core''' and '''--map-by socket|...|node:PE=<value>''' should always be used when running a multithreaded MPI program. (OpenMPI version 1.6.x: the mpirun options '''-bind-to-core''', '''-bysocket|-bynode''' and '''-cpus-per-proc <value>''' should always be used.)
* The policy on batch jobs with Intel MPI + Multithreading on bwUniCluster can be found here: <br>[[Batch_Jobs_-_bwUniCluster_Features#Intel_MPI_with_Multithreading|bwUniCluster: Intel MPI Parallel Programs with Multithreading]]
==== Chain jobs ====
A job chain is a sequence of jobs where each job automatically starts its successor. Chain job handling differs between the bwHPC clusters (see the sketch below and the cluster-specific pages):
* [[Batch Jobs - bwUniCluster Features]]
* [[Batch Jobs - bwForCluster Chemistry Features]]
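As a sketch, on clusters whose resource manager supports TORQUE-style job dependencies, a chain can be built by passing the predecessor's job ID to the successor. The ''-W depend'' syntax is an assumption here; check the cluster-specific pages above for the mechanism actually supported:
<source lang="bash">
#!/bin/bash
# Submit step1.sh; msub prints the ID of the queued job on stdout.
JOBID=$(msub step1.sh | tr -d '[:space:]')
# Queue step2.sh so that it starts only after step1 completed without error
# (assumes TORQUE-style 'depend=afterok' support behind msub -W).
msub -W depend=afterok:${JOBID} step2.sh
</source>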
==== Interactive Jobs ====
Policies of interactive batch jobs are cluster specific; a generic example is sketched below. Details can be found here:
* [[Batch_Jobs_-_bwUniCluster_Features#Interactive_Jobs|bwUniCluster interactive jobs]]
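On clusters that permit interactive jobs, such a job is typically requested via msub's ''-I'' option, e.g. one core for 30 minutes (a sketch; the allowed resources are cluster specific):
<pre>
$ msub -I -l nodes=1:ppn=1,walltime=00:30:00
</pre>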
=== Handling job script options and arguments ===
Job script options and arguments as follows:
<pre>
$ ./job.sh -n 10
</pre>
can not be passed while using the msub command, since msub would interpret them as its own command line options rather than as arguments of ''job.sh'' <small>(like $1 = -n, $2 = 10)</small>.

'''Solution A:'''

Submit a wrapper script, e.g. wrapper.sh:
<pre>
$ msub -q singlenode wrapper.sh
</pre>
which simply contains all options and arguments of job.sh. The script wrapper.sh would at least contain the following lines:
<source lang="bash">
#!/bin/bash
./job.sh -n 10
</source>

'''Solution B:'''

Add after the header of your '''BASH''' script job.sh the following lines:
<source lang="bash">
## check if $SCRIPT_FLAGS is "set"
if [ -n "${SCRIPT_FLAGS}" ] ; then
   ## but if positional parameters are already present
   ## we are going to ignore $SCRIPT_FLAGS
   if [ -z "${*}" ] ; then
      set -- ${SCRIPT_FLAGS}
   fi
fi
</source>
These lines modify your BASH script to read options and arguments from the environment variable $SCRIPT_FLAGS. Now submit your script job.sh as follows:
<pre>
$ msub -q singlenode -v SCRIPT_FLAGS='-n 10' job.sh
</pre>
For advanced users: [[generalised version of solution B]] if job script arguments contain whitespaces.
=== Moab Environment Variables ===
Once an eligible compute job starts on the compute system, MOAB adds the following variables to the job's environment:
{| width=750px class="wikitable"
! colspan="2" | MOAB variables
|-
! Environment variable
! Description
|-
| MOAB_CLASS
| Class name
|-
| MOAB_GROUP
| Group name
|-
| MOAB_JOBID
| Job ID
|-
| MOAB_JOBNAME
| Job name
|-
| MOAB_NODECOUNT
| Number of nodes allocated to the job
|-
| MOAB_PARTITION
| Partition name the job is running in
|-
| MOAB_PROCCOUNT
| Number of processors allocated to the job
|-
| MOAB_SUBMITDIR
| Directory of job submission
|-
| MOAB_USER
| User name
|}
See also:
* [[Batch_Jobs_-_bwUniCluster_Features#Additional_Moab_Environments|Additional Moab Environments for bwUniCluster]]
* [[Batch_Jobs_-_bwForCluster_Chemistry_Features#Environment_Variables_for_Batch_Jobs|Additional Moab Environments for bwForCluster Justus]]
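For illustration, a job script can use these variables, e.g. to tag its output with the job ID (''./my_program'' is a placeholder):
<source lang="bash">
#!/bin/bash
echo "Job ${MOAB_JOBID} (${MOAB_JOBNAME}) of user ${MOAB_USER} runs on ${MOAB_PROCCOUNT} cores"
cd ${MOAB_SUBMITDIR}                      # change into the directory of job submission
./my_program > result_${MOAB_JOBID}.log   # output file named after the job ID
</source>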
=== Interpreting PBS exit codes ===
* The PBS Server logs and accounting logs record the 'exit status' of jobs.
* A zero or positive exit status is the status of the top-level shell.
* Certain negative exit statuses are used internally and will never be reported to the user.
* The positive exit status values indicate which signal killed the job.
* Depending on the system, values greater than 128 (or on some systems 256, see wait(2) or waitpid(2) for more information) are the value of the signal that killed the job.
* To interpret (or 'decode') the signal contained in the exit status value, subtract the base value from the exit status.<br>For example, if a job had an exit status of 143, the job was killed via a SIGTERM (143 - 128 = 15, and signal 15 is SIGTERM).
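The decoding rule can be reproduced in a few lines of shell code (a minimal sketch; 143 is the example value from above):
<source lang="bash">
#!/bin/bash
exit_status=143                      # example exit status reported for a job
if [ ${exit_status} -gt 128 ]; then
    signal=$((exit_status - 128))    # 143 - 128 = 15
    echo "killed by signal ${signal} ($(kill -l ${signal}))"   # prints: killed by signal 15 (TERM)
fi
</source>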
==== Job termination ====
* The exit code from a batch job is a standard Unix termination signal.
* Typically, exit code 0 means successful completion.
* Codes 1-127 are generated by the job calling exit() with a non-zero value to indicate an error.
* Exit codes 129-255 represent jobs terminated by Unix signals.
* Each signal has a corresponding value which is indicated in the job exit code.
==== Job termination signals ====
Specific job exit codes are also supplied by the underlying resource manager of the cluster's batch system, which is either TORQUE or Slurm. More detailed information can be found in the corresponding documentation:
* [http://docs.adaptivecomputing.com/torque/6-1-2/adminGuide/torque.htm#topics/torque/2-jobs/jobExitStatus.htm TORQUE exit codes]
* [https://slurm.schedmd.com/job_exit_code.html Slurm exit codes]
==== Submitting Termination Signal ====
Here is an example of how to 'save' the termination signal of a program call in a typical bwHPC submit script.
<source lang="bash">
[...]
echo "### Calling YOUR_PROGRAM command ..."
mpirun -np 'NUMBER_OF_CORES' $YOUR_PROGRAM_BIN_DIR/runproc ... (options) 2>&1
exit_code=$?   # read the exit code immediately after the program call
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
   echo "Executable ${YOUR_PROGRAM_BIN_DIR}/runproc finished with exit code ${exit_code}"
[...]
</source>
* Do not use '''time mpirun'''! The exit code would then be the one returned by the ''time'' program and not the exit code of your job's program.
* You do not need an '''exit $exit_code''' in the scripts.
== Start time of job or resources : showstart ==
This command displays the estimated start time of a job, based on a number of analysis types using historical usage, the earliest available reservable resources, and the priority based backlog.
<br>

=== Access ===
By default, this command can be run by '''any user'''.
<br>

=== Showstart Parameters ===
{| width=750px class="wikitable"
|- style="vertical-align:top;"
! Parameter !! Description
|- style="vertical-align:top;"
| style="width:12%;" | DURATION
| Duration of pseudo-job to be checked in format [[[DD:]HH:]MM:]SS (default duration is 1 second)
|- style="vertical-align:top;"
| -e
| Estimate method. By default, Moab will use the reservation based estimation method.
|- style="vertical-align:top;"
| -f
| Use feedback. If specified, Moab will apply historical accuracy information to improve the quality of the estimate.
|- style="vertical-align:top;"
| -g
| Grid mode. Obtain showstart information from remote resource managers. If -g is not used and Moab determines that the job has already been migrated, Moab obtains showstart information from the remote Moab where the job was migrated to. All resource managers can be queried by using the keyword "all", which returns all information in a table.
<pre>$ showstart -g all head.1
Estimated Start Times
[ Remote RM ]  [ Reservation ]  [ Priority ]  [ Historical ]
[ c1        ]  [ 00:15:35    ]  [          ]  [            ]
[ c2        ]  [ 3:15:38     ]  [          ]  [            ]</pre>
|- style="vertical-align:top;"
| -l qos=<QOS>
| Specifies what QOS the job must start under, using the same syntax as the msub command. Currently, no other resource manager extensions are supported. This flag only applies to hypothetical jobs, by using the proccount[@duration] syntax.
|- style="vertical-align:top;"
| JOBID
| Job to be checked
|- style="vertical-align:top;"
| PROCCOUNT
| Number of processors in pseudo-job to be checked
|- style="vertical-align:top;"
| S3JOBSPEC
| XML describing the job according to the Dept. of Energy Scalable Systems Software/S3 job specification.
|}
Note: You cannot specify job flags when running showstart. Since a job by default can only run on one partition, showstart fails when querying for a job requiring more nodes than the largest partition available.

=== Showstart Examples ===
* To show the estimated start time of job <job_ID> enter:
<pre>
$ showstart -e all <job_ID>
</pre>
* Furthermore, the start time of resource demands, e.g. 16 processes for 12 hours, can be displayed:
<pre>
$ showstart -e all 16@12:00:00
</pre>
* For a list of all options of ''showstart'' read the manpage of ''showstart''.
== List of your submitted jobs : showq ==
Displays information about active, eligible, blocked, and/or recently completed jobs. Since the resource manager is not actually scheduling jobs, the job ordering it displays is not valid. The showq command displays the actual job ordering under the Moab Workload Manager. When used without flags, this command displays all jobs in active, idle, and non-queued states.
<br>

=== Access ===
By default, this command can be run by any user.<br>
However, the -c, -i, and -r flags can only be used by Moab administrators.
<br>

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -b
| display blocked jobs only
|-
| -c
| display details about recently completed jobs (see example, JOBCPURGETIME)
|-
| -g
| display grid job and system id's for all jobs
|-
| -i
| display extended details about idle jobs
|-
| -l
| display local/remote view. For use in a Grid environment, displays job usage of both local and remote compute resources.
|-
| -p
| display only jobs assigned to the specified partition
|-
| -r
| display extended details about active (running) jobs
|-
| -R
| display only jobs which overlap the specified reservation
|-
| -v
| display local and full resource manager job IDs as well as partitions. If specified with the '-i' option, will display job reservation time.
|-
| -w
| display only jobs associated with the specified constraint. Valid constraints include user, group, acct, class, and qos.
|}

=== Examples ===
These are examples as shown on the Adaptive homepage <small>(external links)</small>.
* [http://docs.adaptivecomputing.com/maui/commands/showq.php#defaultexample Default Report]
* [http://docs.adaptivecomputing.com/maui/commands/showq.php#activeexample Detailed Active/Running Job Report]
* [http://docs.adaptivecomputing.com/maui/commands/showq.php#idleexample Detailed Eligible/Idle Job Report]
* [http://docs.adaptivecomputing.com/maui/commands/showq.php#completedexample Detailed Completed Job Report]
* [http://docs.adaptivecomputing.com/maui/commands/showq.php#whereexample Filtered Job Report]
Showq example on the bwUniCluster for one specific user only <small>(name and pop-ID are fictional; run as MOAB admin)</small>:
<pre>
$ # use the UID as option value for showq
$ showq -u kn_pop332211
active jobs------------------------
JOBID      USERNAME   STATE    PROCS   REMAINING            STARTTIME

8370992    kn_pop33   Running      1  2:05:09:17  Wed Jan 13 15:59:01
8370991    kn_pop33   Running      1  2:05:09:17  Wed Jan 13 15:59:01
8370993    kn_pop33   Running      1  2:05:10:20  Wed Jan 13 16:00:04
[...]
8371040    kn_pop33   Running      1  2:05:11:41  Wed Jan 13 16:01:25

50 active jobs       50 of 7072 processors in use by local jobs (0.71%)
                    434 of 434 nodes active (100.00%)

eligible jobs----------------------
JOBID      USERNAME   STATE    PROCS    WCLIMIT            QUEUETIME

0 eligible jobs

blocked jobs-----------------------
JOBID      USERNAME   STATE    PROCS    WCLIMIT            QUEUETIME

0 blocked jobs

Total jobs: 50
</pre>
* The summary of your active jobs shows how many of your jobs are running, how many processors are in use by your jobs, and how many nodes are in use by '''all''' active jobs.
* Use showq -u $USER for your own jobs.
* For further options of ''showq'' read the manpage of ''showq''.
== Shows free resources : showbf ==
The showbf command can be used by any user to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. This command incorporates down time, reservations, and node state information in determining the available backfill window.
<br>
Note: if specific information is not specified, showbf will return information for the user and group running the command, but with global access for other credentials.

=== Access ===
By default, this command can be used by any user or administrator.

=== Flags ===
{| width=750px class="wikitable"
|-
! Flag !! Description
|-
| -A
| Show resource availability information for all users, groups, and accounts. By default, showbf uses the default user, group, and account ID of the user issuing the command.
|-
| -a
| Show resource availability information only for specified account
|-
| -d
| Show resource availability information for specified duration
|-
| -D
| Display current and future resource availability notation
|-
| -f
| Show resource availability information only for specified feature
|-
| -g
| Show resource availability information only for specified group
|-
| -h
| Help for this command
|-
| -L
| Enforce hard limits when showing available resources
|-
| -m
| Allows user to specify the memory requirements for the backfill nodes of interest. It is important to note that if the optional MEMCMP and MEMORY parameters are used, they MUST be enclosed in single ticks (') to avoid interpretation by the shell. For example, enter showbf -m '==256' to request nodes with 256 MB memory.
|-
| -n
| Show resource availability information for a specified number of nodes. That is, this flag can be used to force showbf to display only blocks of resources with at least this many nodes available.
|-
| -p
| Show resource availability information for the specified partition
|-
| -q
| Show information for the specified QOS
|-
| -u
| Show resource availability information only for specified user
|}

=== Parameters ===
{| width=750px class="wikitable"
|-
! Parameter !! Description
|-
| ACCOUNT
| Account name
|-
| CLASS
| Class/queue required
|-
| DURATION
| Time duration specified as the number of seconds or in [DD:]HH:MM:SS notation
|-
| FEATURELIST
| Colon separated list of node features required
|-
| GROUP
| Specify particular group
|-
| MEMCMP
| Memory comparison used with the -m flag. Valid signs are >, >=, ==, <=, and <.
|-
| MEMORY
| Specifies the amount of required real memory configured on the node (in MB), used with the -m flag.
|-
| NODECOUNT
| Specify number of nodes for inquiry with -n flag
|-
| PARTITION
| Specify partition to check with -p flag
|-
| QOS
| Specify QOS to check with -q flag
|-
| USER
| Specify particular user to check with -u flag
|}

=== Examples ===
* The following command displays what resources are available for immediate use for the whole partition:
<pre>
$ showbf
Partition   Tasks  Nodes      Duration   StartOffset       StartDate
---------   -----  -----  ------------  ------------  --------------
ALL           371    129      INFINITY      00:00:00  11:30:26_01/14
</pre>
* This request for 16 nodes can be run immediately on all partitions:
<pre>
$ showbf -n 16 -d 2:00:00
Partition   Tasks  Nodes      Duration   StartOffset       StartDate
---------   -----  -----  ------------  ------------  --------------
ALL           392    132      INFINITY      00:00:00  11:35:01_01/14
</pre>
* This request for 64 nodes returned nothing, meaning it could not be fulfilled immediately:
<pre>
$ showbf -n 64 -d 2:00:00
Partition   Tasks  Nodes      Duration   StartOffset       StartDate
---------   -----  -----  ------------  ------------  --------------
</pre>
* The last example request displays an error message: too many nodes requested!
<pre>
$ showbf -n 256 -d 2:00:00
resources not available
</pre>
* For further options of ''showbf'' read the manpage of ''showbf''.
== Detailed job information : checkjob ==
  +
Checkjob displays detailed job state information and diagnostic output for a specified job. Detailed information is available for queued, blocked, active, and recently completed jobs.
  +
=== Access ===
  +
* End users can use checkjob to view the status of their '''own jobs''' only.
  +
* JobArrays and Reservations are not available on '''all''' clusters. See specific informations depending the single bwUniCluster and bwForClusters for a list of capabilities.
  +
=== Output ===
  +
{| width=750px class="wikitable"
  +
|- style="vertical-align:top;"
  +
! Attribute !! Value !! Description
  +
|- style="vertical-align:top;"
  +
| Account
  +
| <STRING>
  +
| Name of account associated with job
  +
|- style="vertical-align:top;"
  +
| Actual Run Time
  +
| [[[DD:]HH:]MM:]SS
  +
| Length of time job actually ran. This info is only displayed in simulation mode.
  +
|- style="vertical-align:top;"
  +
| Allocated Nodes
  +
| Square bracket delimited list of node and processor ids
  +
| List of nodes and processors allocated to job
  +
|- style="vertical-align:top;"
  +
| Applied Nodeset<sup>**</sup>
  +
| <STRING>
  +
| Nodeset used for job's node allocation
  +
|- style="vertical-align:top;"
  +
| Arch
  +
| <STRING>
  +
| Node architecture required by job
  +
|- style="vertical-align:top;"
  +
| Attr
  +
| Square bracket delimited list of job attributes
  +
| Job Attributes (i.e. [BACKFILL][PREEMPTEE])
  +
|- style="vertical-align:top;"
  +
| Available Memory<sup>**</sup>
  +
| <INTEGER>
  +
| The available memory requested by job. Moab displays the relative or exact value by returning a comparison symbol (>, <, >=, <=, or ==) with the value (i.e. Available Memory <= 2048).
  +
|- style="vertical-align:top;"
  +
| Available Swap<sup>**</sup>
  +
| <INTEGER>
  +
| The available swap requested by job. Moab displays the relative or exact value by returning a comparison symbol (>, <, >=, <=, or ==) with the value (i.e. Available Swap >= 1024).
  +
|- style="vertical-align:top;"
  +
| Average Utilized Procs<sup>*</sup>
  +
| <FLOAT>
  +
| Average load balance for a job
  +
|- style="vertical-align:top;"
  +
| Avg Util Resources Per Task<sup>*</sup>
  +
| <FLOAT>
  +
|
  +
|- style="vertical-align:top;"
  +
| BecameEligible
  +
| <TIMESTAMP>
  +
| The date and time when the job moved from Blocked to Eligible.
  +
|- style="vertical-align:top;"
  +
| Bypass
  +
| <INTEGER>
  +
| Number of times a lower priority job with a later submit time ran before the job
  +
|- style="vertical-align:top;"
  +
| CheckpointStartTime<sup>**</sup>
  +
| [[[DD:]HH:]MM:]SS
  +
| The time the job was first checkpointed
  +
|- style="vertical-align:top;"
  +
| Class
  +
| [<CLASS NAME> <CLASS COUNT>]
  +
| Name of class/queue required by job and number of class initiators required per task.
  +
|- style="vertical-align:top;"
  +
| Dedicated Resources Per Task<sup>*</sup>
  +
| Space-delimited list of <STRING>:<INTEGER>
  +
| Resources dedicated to a job on a per-task basis
  +
|- style="vertical-align:top;"
  +
| Disk
  +
| <INTEGER>
  +
| Amount of local disk required by job (in MB)
  +
|- style="vertical-align:top;"
  +
| Estimated Walltime
  +
| [[[DD:]HH:]MM:]SS
  +
| The scheduler's estimated walltime. In simulation mode, it is the actual walltime.
  +
|- style="vertical-align:top;"
  +
| EnvVariables<sup>**</sup>
  +
| Comma-delimited list of <STRING>
  +
| List of environment variables assigned to job
  +
|- style="vertical-align:top;"
  +
| Exec Size<sup>*</sup>
  +
| <INTEGER>
  +
| Size of job executable (in MB)
  +
|- style="vertical-align:top;"
  +
| Executable
  +
| <STRING>
  +
| Name of command to run
  +
|- style="vertical-align:top;"
  +
| Features
  +
| Square bracket delimited list of <STRING>s
  +
| Node features required by job
  +
|- style="vertical-align:top;"
  +
| Flags
  +
|
  +
|
  +
|- style="vertical-align:top;"
  +
| Group
  +
| <STRING>
| Name of UNIX group associated with job
|- style="vertical-align:top;"
| Holds
| Zero or more of User, System, and Batch
| Types of job holds currently applied to job
|- style="vertical-align:top;"
| Image Size
| <INTEGER>
| Size of job data (in MB)
|- style="vertical-align:top;"
| IWD (Initial Working Directory)
| <DIR>
| Directory to run the executable in
|- style="vertical-align:top;"
| Job Messages<sup>**</sup>
| <STRING>
| Messages attached to a job
|- style="vertical-align:top;"
| Job Submission<sup>**</sup>
| <STRING>
| Job script submitted to RM
|- style="vertical-align:top;"
| Memory
| <INTEGER>
| Amount of real memory required per node (in MB)
|- style="vertical-align:top;"
| Max Util Resources Per Task<sup>*</sup>
| <FLOAT>
|
|- style="vertical-align:top;"
| NodeAccess<sup>*</sup>
|
|
|- style="vertical-align:top;"
| Nodecount
| <INTEGER>
| Number of nodes required by job
|- style="vertical-align:top;"
| Opsys
| <STRING>
| Node operating system required by job
|- style="vertical-align:top;"
| Partition Mask
| ALL or colon delimited list of partitions
| List of partitions the job has access to
|- style="vertical-align:top;"
| PE
| <FLOAT>
| Number of processor-equivalents requested by job
|- style="vertical-align:top;"
| Per Partition Priority<sup>**</sup>
| Tabular
| Table showing job template priority for each partition
|- style="vertical-align:top;"
| Priority Analysis<sup>**</sup>
| Tabular
| Table showing how the job's priority was calculated, e.g. Job PRIORITY<sup>*</sup>: Cred(User:Group:Class) Serv(QTime)
|- style="vertical-align:top;"
| QOS
| <STRING>
| Quality of Service associated with job
|- style="vertical-align:top;"
| Reservation
| <RSVID> (<TIME1> -> <TIME2> Duration: <TIME3>)
| RSVID specifies the reservation id, TIME1 the relative start time, TIME2 the relative end time and TIME3 the duration of the reservation
|- style="vertical-align:top;"
| Req
| [<INTEGER>] TaskCount: <INTEGER> Partition: <partition>
| A job requirement for a single type of resource, followed by the number of task instances required and the appropriate partition
|- style="vertical-align:top;"
| StartCount
| <INTEGER>
| Number of times job has been started by Moab
|- style="vertical-align:top;"
| StartPriority
| <INTEGER>
| Start priority of job
|- style="vertical-align:top;"
| StartTime
| <TIME>
| Time job was started by the resource management system
|- style="vertical-align:top;"
| State
| One of Idle, Starting, Running, etc.
| Current job state
|- style="vertical-align:top;"
| SubmitTime
| <TIME>
| Time job was submitted to the resource management system
|- style="vertical-align:top;"
| Swap
| <INTEGER>
| Amount of swap disk required by job (in MB)
|- style="vertical-align:top;"
| Task Distribution<sup>*</sup>
| Square bracket delimited list of nodes
|
|- style="vertical-align:top;"
| Time Queued
|
|
|- style="vertical-align:top;"
| Total Requested Nodes<sup>**</sup>
| <INTEGER>
| Number of nodes the job requested
|- style="vertical-align:top;"
| Total Requested Tasks
| <INTEGER>
| Number of tasks requested by job
|- style="vertical-align:top;"
| User
| <STRING>
| Name of user submitting job
|- style="vertical-align:top;"
| Utilized Resources Per Task<sup>*</sup>
| <FLOAT>
|
|- style="vertical-align:top;"
| WallTime
| [[[DD:]HH:]MM:]SS of [[[DD:]HH:]MM:]SS
| Length of time job has been running out of the specified limit
|}
In the above table, fields marked with an asterisk (<sup>*</sup>) are only displayed when set or when the -v flag is specified. Fields marked with two asterisks (<sup>**</sup>) are only displayed when set or when the -v -v flag is specified.
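For example, to display these optional fields as well:
<pre>
$ checkjob -v 8370992      # additionally shows fields marked with one asterisk
$ checkjob -v -v 8370992   # additionally shows fields marked with two asterisks
</pre>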
 
<br>
 
<br>
=== Arguments ===
  +
{| width=750px class="wikitable"
  +
|-
  +
! Argument !! Format !! Default !! Description !! Example
  +
|- style="vertical-align:top;"
  +
| style="width:12%;" | --flags
  +
| --flags=future
  +
| (none)
  +
| Evaluates future eligibility of job (ignore current resource state and usage limitations)
  +
| <pre>$ checkjob -v --flags=future 8370992</pre>Display reasons why idle job is blocked ignoring node state and current node utilization constraints.
  +
|- style="vertical-align:top;"
  +
| -l (Policy level)
  +
| <POLICYLEVEL> HARD, SOFT, or OFF
  +
| (none)
  +
| Reports job start eligibility subject to specified throttling policy level.
  +
| <pre>$ checkjob -l SOFT 8370992
  +
$ checkjob -l HARD 8370992</pre>
  +
|- style="vertical-align:top;"
  +
| -n (NodeID)
  +
| <NODEID>
  +
| (none)
  +
| Checks job access to specified node and preemption status with regards to jobs located on that node.
  +
| <pre>checkjob -n uc1n320 8370992</pre>
  +
|- style="vertical-align:top;"
  +
| -r (Reservation)
  +
| <RSVID>
  +
| (none)
  +
| Checks job access to specified reservation <RSVID>.
  +
| <pre>checkjob -r rainer_kn_resa.1 8370992</pre>
  +
|- style="vertical-align:top;"
  +
| [[#Blocked job information : checkjob -v|-v (Verbose)]]
  +
|
  +
| (n/a)
  +
| Sets verbose mode. If the job is part of an array, the -v option shows pertinent array information before the job-specific information. Specifying the double verbose ("-v -v") displays additional information about the job. [[#Blocked job information : checkjob -v|See more infos here!]]
  +
| <pre>checkjob -v 8370992</pre>
  +
|}
  +
8370992 = JobId <small>(see examples above)</small>
=== Parameters ===
Parameters, descriptions (a lot of them!) and examples can be found on the Adaptive documentation page.
<br>
* [http://docs.adaptivecomputing.com/maui/commands/checkjob.php Adaptive Checkjob Command Reference]<br>&nbsp;<font color=red>Use this information if you'd like to analyze why your job is held or blocked!</font>
* For further options of checkjob see the manual page of checkjob <pre>$ man checkjob</pre>
=== Checkjob Examples ===
Here is an example from the bwUniCluster.
<pre>
showq -u $USER   # show my own jobs
active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
8370992            kn_popnn    Running     1  2:03:56:50  Wed Jan 13 15:59:01
8370991            kn_popnn    Running     1  2:03:56:50  Wed Jan 13 15:59:01
[...]
8371040            kn_popnn    Running     1  2:03:59:14  Wed Jan 13 16:01:25

49 active jobs          49 of 7072 processors in use by local jobs (0.69%)
                        434 of 434 nodes active      (100.00%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
0 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
0 blocked jobs

Total jobs:  49
$
$ # now, see what's up with the first job in my queue
$
$ checkjob 8370992
job 8370992

AName: Nic_cit_09_Apo_zal_07_cl2_07_cl3_07_cl5_07_2.moab
State: Running
Creds:  user:kn_pop'nnnnn'  group:kn_kn  account:konstanz  class:singlenode
WallTime:   20:04:28 of 3:00:00:00
BecameEligible: Wed Jan 13 15:58:11
SubmitTime: Wed Jan 13 15:57:58
  (Time Queued  Total: 00:01:03  Eligible: 00:00:58)

StartTime: Wed Jan 13 15:59:01
TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: uc1
Memory >= 4000M  Disk >= 0  Swap >= 0
Dedicated Resources Per Task: PROCS: 1  MEM: 4000M
NodeSet=ONEOF:FEATURE:[NONE]

Allocated Nodes:
[uc1n320:1]

SystemID:   uc1
SystemJID:  8370992

IWD:            /pfs/data2/home/kn/kn_kn/kn_pop139522/fastsimcoal25/Midas_RAD_anchored/Nic_cit_09_Apo_zal_07_cl2_07_cl3_07_cl5_07/1col_DIV-resize_admix_zal1st_starlike_cl2-base_CL-growths_GL-bottlegrowth_onlyintramig/new_est/run_2
SubmitDir:      /pfs/data2/home/kn/kn_kn/kn_pop139522/fastsimcoal25/Midas_RAD_anchored/Nic_cit_09_Apo_zal_07_cl2_07_cl3_07_cl5_07/1col_DIV-resize_admix_zal1st_starlike_cl2-base_CL-growths_GL-bottlegrowth_onlyintramig/new_est/run_2
Executable:     /opt/moab/spool/moab.job.voNyde

StartCount:     1
BypassCount:    1
Partition List: uc1
Flags:          BACKFILL,FSVIOLATION,GLOBALQUEUE
Attr:           BACKFILL,FSVIOLATION
StartPriority:  -3692
PE:             1.00
Reservation '8370992' (-20:04:34 -> 2:03:55:26  Duration: 3:00:00:00)
[...]
</pre>
You can use standard Linux pipe commands to filter the very detailed checkjob output.
* Is the job still running?
<pre>$ checkjob 8370992 | grep ^State
State: Running
</pre>
* Write your own checkjob wrapper to tailor the checkjob output to your needs. Here's an example <small>(cut/paste it if you'd like to use this one)</small>:
<source lang="bash">
#!/bin/bash
# cj - display Moab job status (checkjob wrapper)
# rainer.rutka@uni-konstanz.de 2015-09-21 ~1.1
# $1 = Moab job number, $2 = sleep time in seconds
shopt -s extglob          # extended globbing for the +([0-9]) integer checks below

# terminal attributes and colors
rev=$(tput rev)
res=$(tput sgr0)
resc=$(tput setf 0)
gruen=$(tput setf 2)      # green
rot=$(tput setf 4)        # red

# argument handling: job id (mandatory, integer) and poll interval (optional, integer)
[ "$1" == "-h" ] && { echo "usage: $(basename "$0") (int)Moab-Job-Number [(int)interval/seconds (default: 30s)]"; exit 0; }
[ "$1" ] || { echo "Moab job number missing..."; exit 1; } && { jobid="$1"; }
[ ! -z "${jobid##+([0-9])}" ] && { echo "<${jobid}> is not a number!"; exit 1; }
[ "$2" ] && { let schlaf="$2"; } || { let schlaf=30; }
[ ! -z "${2##+([0-9])}" ] && { echo "<${2}> is not a number!"; exit 1; }

let sekunden=0            # polling iterations so far
let von=1                 # first column of the progress dots
let bis=73                # last column of the progress dots
let rechts=1              # current column of the progress dots
pos () { tput cup ${1} ${2}; }

# print the accumulated waiting time in s, m/s or h/m/s notation
dauer () {
pos 0 90
let gesamt=$[$sekunden*$schlaf]
pos 0 38
echo 'Waiting:'
[ "$gesamt" -lt 60 ] && { pos 0 49; echo "${gruen}${gesamt}s${resc}"; }
[ "$gesamt" -ge 60 ] && { pos 0 49; echo "${gruen}$[$gesamt / 60]m$[$gesamt % 60]s ${resc}";
[ "$gesamt" -ge 3600 ] && { pos 0 49; echo "${gruen}$[$gesamt / 3600]h$[$gesamt % 3600 / 60]m$[$gesamt % 60]s ${resc}"; }
}

# poll checkjob until the job is completed or removed
checkit () {
tput clear
pos 0 1
echo "${resc}Job: ${gruen}${jobid}${resc} Status: "
pos 0 60
echo "Interval: ${schlaf}s"

pos 1 1
echo '>'

while true
do
status=$(checkjob ${jobid} | grep ^State:)
pos 0 22
echo "<${rot}${status/State:}${resc}>"
dauer
[ "${status/State:}" == "" ] && { let sekunden=0; echo "${rot} ERROR! ${resc}"; exit 1; }
[ "${status/State:}" == " Completed " ] && { echo "${gruen} Job ${jobid} is finished!${resc}"; exit 0; }
[ "${status/State:}" == " Removed " ] && { echo "${rot} Job ${jobid} was deleted!${resc}"; exit 1; }
[ "$rechts" == "$bis" ] && { let rechts="$von"; checkit ${jobid} ${schlaf} ; } || let rechts++
pos 1 $rechts
sleep ${schlaf}
echo -n .
let sekunden++;
done
}
checkit ${*}
</source>
Using the same Job-ID as above, the output of the script (named 'cj') looks like this:
<pre>
$ cj -h
usage: cj (int)Moab-Job-Number [(int)interval/seconds (default: 30s)]
$ cj 8370992 10   # update every 10 seconds
Job: 8370992 Status: < Running > Waiting: 30m50s Interval: 10s
>.........................................
</pre>
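If you only need a simple periodic status display, a one-liner gives a similar effect; a minimal sketch, assuming GNU watch is installed on your login node:
<pre>
$ watch -n 30 "checkjob 8370992 | grep ^State:"   # poll the job state every 30 seconds, Ctrl-C to stop
</pre>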
== Blocked job information : checkjob -v ==
This command allows you to check the detailed status and resource requirements of your active, queued, or recently completed job. Additionally, it performs numerous diagnostic checks and determines if and where the job could potentially run. Diagnostic checks include policy violations, reservation constraints, preemption status, and job-to-resource mapping. If a job cannot run, a textual reason is provided along with a summary of how many nodes are and are not available. If the -v flag is specified, a node-by-node summary of resource availability is displayed for idle jobs.
<br>
<br>
<font color=red>If your job is blocked, do not delete it!</font>
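To find out why a job is blocked, list your blocked jobs and inspect the block message; a short sketch using the commands described in this document (the job ID is a placeholder):
<pre>
$ showq -b -u $USER      # list only your own blocked jobs
$ checkjob -v <jobID>    # look for the BLOCK MSG line (see the Example subsection below)
</pre>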
   
=== Job Eligibility ===
For job-level eligibility issues, one of the following reasons will be given:
<br>
{| width=750px class="wikitable"
! Reason !! Description
|- style="vertical-align:top;"
| job has hold in place
| one or more job holds are currently in place
|- style="vertical-align:top;"
| insufficient idle procs
| there are currently not adequate processor resources available to start the job
|- style="vertical-align:top;"
| idle procs do not meet requirements
| adequate idle processors are available, but these do not meet the job requirements
|- style="vertical-align:top;"
| start date not reached
| job has specified a minimum start date which is still in the future
|- style="vertical-align:top;"
| expected state is not idle
| job is in an unexpected state
|- style="vertical-align:top;"
| state is not idle
| job is not in the idle state
|- style="vertical-align:top;"
| dependency is not met
| job depends on another job reaching a certain state
|- style="vertical-align:top;"
| rejected by policy
| job start is prevented by a throttling policy
|}
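For instance, the reason ''job has hold in place'' corresponds to the Holds field of the checkjob output described above; a hypothetical check (job ID and output are illustrative):
<pre>
$ checkjob -v <jobID> | grep Holds
Holds:    User
</pre>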
If a job cannot run on a particular node, one of the following 'per node' reasons will be given:
{| width=750px class="wikitable"
! Reason !! Description
|- style="vertical-align:top;"
| Class
| Node does not allow required job class/queue
|- style="vertical-align:top;"
| CPU
| Node does not possess required processors
|- style="vertical-align:top;"
| Disk
| Node does not possess required local disk
|- style="vertical-align:top;"
| Features
| Node does not possess required node features
|- style="vertical-align:top;"
| Memory
| Node does not possess required real memory
|- style="vertical-align:top;"
| Network
| Node does not possess required network interface
|- style="vertical-align:top;"
| State
| Node is not Idle or Running
|}

=== Example ===
A '''''blocked''''' job has hit a limit and will become '''''idle''''' once resources become free.
The "-v (verbose)" mode of 'checkjob' also shows a "BLOCK MSG:" line with more details:
<pre>
checkjob -v 8370992
[...]

BLOCK MSG: job <jobID> violates active SOFT MAXPROC limit of 750 for acct mannheim
partition ALL (Req: 160 InUse: 742) (recorded at last scheduling iteration)
</pre>
In this case the job hit the account limit of mannheim: it requested 160 cores while 742 were already in use.
<br>
The most common cause of blocked jobs is a violation of the MAXPROC or MAXPS limits, indicating that your group has scheduled too many outstanding processor seconds at the same time.

=== The Limits imposed by the Scheduler ===
This refers to limits on the number of jobs in the queue which are enforced by the scheduler. The largest factors in determining these limits are the Maximum Processor Seconds (MAXPS) and the Maximum Processors (MAXPROC) of each account. The MAXPS is the total number of processor core seconds (ps) allocated to each (group) account. It is based on fairshare values depending on the configuration for your <OE> (Konstanz, Ulm, etc.).
<br>
Users can submit as many jobs as they like, but the jobs cannot be scheduled to run if their group's MAXPROC or MAXPS value is exceeded; they instead enter a "HOLD" state. If the group's limits are not reached but the resources are not available, the jobs enter the "IDLE" state and will run once the requested resources become available.
<br>
<br>
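To get a feeling for how fast a MAXPS budget is consumed, here is a small worked example (the numbers are illustrative, not an actual account limit):
<pre>
A single job requesting 160 cores for 48 hours of walltime books
  160 cores * 48 h * 3600 s/h = 27,648,000 processor seconds (ps)
against the MAXPS budget of your group account until it completes.
</pre>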
   
== Canceling own jobs : canceljob ==
<font color=red>Caution: This command is '''deprecated'''. Use [[#Canceling own jobs : mjobctl -c|mjobctl -c]] instead!</font>
<br><br>
The canceljob <JobId> command is used to selectively cancel the specified job(s) (active, idle, or non-queued) from the queue.
<br><br>
<font color=red>Note that you can only cancel '''your own jobs'''.</font>
<br>
=== Access ===
This command can be run by any Moab administrator and '''by the owner of the job'''.
{| width=750px class="wikitable"
! Flag !! Name !! Format !! Default !! Description !! Example
|- style="vertical-align:top;"
| -h
| HELP
|
| (n/a)
| Display usage information
| <pre>$ canceljob -h</pre>
|- style="vertical-align:top;"
|
| JOB ID
| <STRING>
| (none)
| A job id, a job expression, or the keyword 'ALL'
| see: [[#Example Use of Canceljob|example use of canceljob]]
|}
=== Example Use of Canceljob ===
Example use of canceljob, run on the bwUniCluster:
<pre>
[...calc_repo-0]$ msub bwhpc-fasta-example.moab
8374356          # this is the JobId
$
$ checkjob 8374356
job 8374356
AName: fasta36_job
State: Idle
Creds:  user:kn_pop235844  group:kn_kn  account:konstanz  class:multinode
WallTime: 00:00:00 of 00:10:00
BecameEligible: Fri Jan 15 12:10:53
SubmitTime: Fri Jan 15 12:10:43
  (Time Queued  Total: 00:00:10  Eligible: 00:00:08)
[...]

$ checkjob 8374356 | grep ^State:
State: Idle      # state is 'Idle'

$ # now cancel the job
$ canceljob 8374356
job '8374356' cancelled

$ checkjob 8374356 | grep ^State:
State: Removed   # state turned into 'Removed'
</pre>
* See: [[#E-Mail notification of mjobctl -c|E-Mail notification after a job was cancelled/removed]]
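Since canceljob also accepts a job expression or the keyword 'ALL' (see the table above), you can remove all of your own jobs at once; use this with care:
<pre>
$ canceljob ALL    # cancels ALL of your own jobs!
</pre>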
== Moab Job Control : mjobctl ==
The mjobctl command controls various aspects of jobs. It is used to submit, cancel, execute, and checkpoint jobs. It can also display diagnostic information about your own jobs.
<br>
<br>
=== Canceling own jobs : mjobctl -c ===
If you want to cancel a job that has been submitted, please do not use the PBS/Torque qdel (n/a) or the deprecated [[#Canceling own jobs : canceljob|canceljob]] commands.
<br>
<font color=green>Instead, use '''mjobctl -c <jobid>'''.</font>
<br>
{| width=750px class="wikitable"
! Flag !! Format !! Default !! Description !! Example
|- style="vertical-align:top;"
| -c
| JobId
| (none)
| Cancel a job.
| see: [[#Example Use of mjobctl -c|example use of mjobctl -c]]
|}
==== Example Use of mjobctl -c ====
Canceling a job on the bwUniCluster:
<pre>
[...-calc_repo-0]$ msub bwhpc-fasta-example.moab
8374426

$ checkjob 8374426 | grep ^State
State: Idle       # job is 'Idle'

$ mjobctl -c 8374426
job '8374426' cancelled      # job is cancelled

$ checkjob 8374426 | grep ^State
State: Removed    # now, job is removed

$ # my own checkjob wrapper (see above)
$ cj 8374426
Job: 8374426 Status: < Removed > Waiting: 1m30s Interval: 30s
Job 8374426 was deleted!
$
</pre>
* [[#Checkjob Examples|checkjob wrapper]]
==== E-Mail notification of mjobctl -c ====
You will receive an e-mail notification like this <small>(example with Slurm as the resource manager)</small>
<source lang=email>
From: slurm@uc1-sv1.scc.kit.edu
Subject: SLURM Job_id= 8374426 Name=fasta36_job Ended, Run time 00:01:30, CANCELLED, ExitCode 0
</source>
from the resource manager shortly after the job was removed from the queue.
 
<br>
<br>
To receive this notification, you need to add these MOAB options to your submit script or pass them as command line parameters to the msub command:
<source lang="bash">
#MSUB -m ae
#MSUB -M e-mail@FQDN     # an e-mail address like: name@domain.de
</source>
{| width=750px class="wikitable"
! colspan="2" | msub -m option(s)
|- style="vertical-align:top;"
! Option !! Description
|- style="vertical-align:top;"
| style="width:12%;" | -m option(s)
| Defines the set of conditions ('''a=abort''' &#124; '''b=begin''' &#124; '''e=end''') under which the server will send a mail message about the job to the user.
|- style="vertical-align:top;"
| -N name
| Gives a user specified name to the job. Note that job names do not appear in all MOAB job info displays, and do not determine how your job's stdout/stderr files are named.
|}
<br>
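Putting it together, a minimal submit script with e-mail notification could look like this sketch (resources, job name and address are placeholders you must adapt):
<source lang="bash">
#!/bin/bash
#MSUB -l nodes=1:ppn=1          # one core on one node
#MSUB -l walltime=00:10:00      # ten minutes of wall clock time
#MSUB -N mail_test              # user specified job name
#MSUB -m ae                     # send mail on abort (a) and end (e)
#MSUB -M name@domain.de         # replace with your own e-mail address

echo "Job ${MOAB_JOBID} was submitted from ${MOAB_SUBMITDIR}"
</source>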
 
   
=== Other Mjobctl-Options ===
See also:
* [http://docs.adaptivecomputing.com/mwm/6-1-9/Content/commands/mjobctl.html Complete list of mjobctl options and parameters].
<font color=red>Not all of the listed options are available for 'normal' users. Some are for MOAB admins only.</font>
<br>
<br>
----
[[Category:bwUniCluster|Batch Jobs - General Features]]
[[Category:BwForCluster Chemistry|Batch Jobs - General Features]]
[[Category:BwForCluster_MLS%26WISO_Production|Batch Jobs - General Features]]
[[Category:BwForCluster NEMO|Batch Jobs - General Features]]
[[Category:BwForCluster_BinAC]]
Revision as of 15:14, 1 October 2019

1 Moab® HPC Workload Manager

1.1 Specification

The Moab Cluster Suite is a cluster workload management package, available from Adaptive Computing, Inc., that integrates the scheduling, managing, monitoring and reporting of cluster workloads. Moab Cluster Suite simplifies and unifies management across one or multiple hardware, operating system, storage, network, license and resource manager environments.

Any kind of calculation on the compute nodes of a bwHPC cluster of tier 2 or 3 requires the user to define calculations as a sequence of commands or single command together with required run time, number of CPU cores and main memory and submit all, i.e., the batch job, to a resource and workload managing software. All bwHPC cluster of tier 2 and 3, including have installed the workload managing software MOAB. Therefore any job submission by the user is to be executed by commands of the MOAB software. MOAB queues and runs user jobs based on fair sharing policies.

1.2 Moab Commands (excerpt)

Some of the most used Moab commands for non-administrators working on a HPC-C5 cluster.

MOAB commands Brief explanation
msub Submits a job and queues it in an input queue [msub]
checkjob Displays detailed job state information [checkjob]
showq Displays information about active, eligible, blocked, and/or recently completed jobs [showq]
showbf Shows what resources are available for immediate use [showbf]
showstart Returns start time of submitted job or requested resources [showstart]
canceljob Cancels a job (opsoleted!) [canceljob]
mjobctl Cancel a job and more job control options [mjobctl]

1.3 Job Submission : msub

Batch jobs are submitted by using the command msub. The main purpose of the msub command is to specify the resources that are needed to run the job. msub will then queue the batch job. However, starting of batch job depends on availability of the requested resources and the fair sharing value.

1.3.1 msub Command Parameters

The syntax and use of msub can be displayed via:

$ man msub

msub options can be used from the command line or in your job script.

msub Options
Command line Script Purpose
-l resources #MSUB -l resources Defines the resources that are required by the job.

See the description below for this important flag.

-N name #MSUB -N name Gives a user specified name to the job.
-o filename #MSUB -o filename Defines the file-name to be used for the standard output stream of the

batch job. By default the file with defined file name is placed under your
job submit directory. To place under a different location, expand
file name by the relative or absolute path of destination.

-q queue #MSUB -q queue Defines the queue class
-v variable=arg #MSUB -v variable=arg Expands the list of environment variables that are exported to the job
-S Shell #MSUB -S Shell Declares the shell (state path+name, e.g. /bin/bash) that interpret

the job script

-m bea #MSUB -m bea Send email when job begins (b), ends (e) or aborts (a).
-M name@uni.de #MSUB -M name@uni.de Send email to the specified email address "name@uni.de".

For cluster specific msub options, read:

1.3.1.1 msub -l resource_list

The -l option is one of the most important msub options. It is used to specify a number of resource requirements for your job. Multiple resource strings are separated by commas.

msub -l resource_list
resource Purpose
-l nodes=2:ppn=16 Number of nodes and number of processes per node
-l walltime=600
-l walltime=01:30:00
Wall-clock time. Default units are seconds.

HH:MM:SS format is also accepted.

-l pmem=1000mb Maximum amount of physical memory used by any single process of the job.

Allowed units are kb, mb, gb. Be aware that processes are either MPI tasks

memory for all MPI tasks or all threads of the job.
-l advres=res_name Specifies the reservation "res_name" required to run the job.
-l naccesspolicy=policy Specifies how node resources should be accessed, e.g. -l naccesspolicy=singlejob

reserves all requested nodes for the job exclusively.
Attention, if you request nodes=1:ppn=4 together with singlejob you will be
charged for the maximum cores of the node.

Note that all compute nodes do not have SWAP space, thus DO NOT specify '-l vmem' or '-l pvmem' or your jobs will not start.

1.3.1.2 msub -q queues

Queue classes define maximum resources such as walltime, nodes and processes per node and partition of the compute system. Note that queue settings of the bwHPC cluster are not identical, but differ due to their different prerequisites, such as HPC performance, scalability and throughput levels. Details can be found here:

1.3.2 msub Examples

WARNING: many of these examples use queue names like 'singlenode', 'multinode', 'fat' and 'develop' which are specific to the bwUniCluster only! Please refer to the other cluster queue definitions for correct usage.

Hint for JUSTUS users: in the following examples instead of singlenode and fat use short and long, respectively!

1.3.2.1 Serial Programs

To submit a serial job that runs the script job.sh and that requires 5000 MB of main memory and 3 hours of wall clock time

a) execute:

$ msub -q singlenode -N test -l nodes=1:ppn=1,walltime=3:00:00,pmem=5000mb   job.sh

or b) add after the initial line of your script job.sh the lines (here with a high memory request):

#MSUB -l nodes=1:ppn=1
#MSUB -l walltime=3:00:00
#MSUB -l pmem=200000mb
#MSUB -N test

and execute the modified script with the command line option -q fat (with -q singlenode maximum pmem=64000mb is possible):

$ msub -q fat job.sh

Note, that msub command line options overrule script options.

1.3.2.2 Multithreaded Programs

Multithreaded programs operate faster than serial programs on CPUs with multiple cores.
Moreover, multiple threads of one process share resources such as memory.
For multithreaded programs based on Open Multi-Processing (OpenMP) number of threads are defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
To submit a batch job called OpenMP_Test that runs a fourfold threaded program omp_executable which requires 6000 MByte of total physical memory and total wall clock time of 3 hours:

  • generate the script job_omp.sh containing the following lines:
#!/bin/bash
#MSUB -l nodes=1:ppn=4
#MSUB -l walltime=3:00:00
#MSUB -l mem=6000mb
#MSUB -v EXECUTABLE=./omp_executable
#MSUB -v MODULE=<placeholder>
#MSUB -N OpenMP_Test

#Usually you should set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

module load ${MODULE}
export OMP_NUM_THREADS=${MOAB_PROCCOUNT}
echo "Executable ${EXECUTABLE} running on ${MOAB_PROCCOUNT} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe

Using Intel compiler the environment variable KMP_AFFINITY switches on binding of threads to specific cores and, if necessary, replace <placeholder> with the required modulefile to enable the OpenMP environment and execute the script job_omp.sh adding the queue class singlenode as msub option:

$ msub -q singlenode job_omp.sh

Note, that msub command line options overrule script options, e.g.,

$ msub -l mem=2000mb -q singlenode job_omp.sh

overwrites the script setting of 6000 MByte with 2000 MByte.

1.3.2.3 MPI Parallel Programs

MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., MPI tasks, run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
Multiple MPI tasks can not be launched by the MPI parallel program itself but via mpirun, e.g. 4 MPI tasks of my_par_program:

$ mpirun -n 4 my_par_program

However, this given command can not be directly included in your msub command for submitting as a batch job to the compute cluster, see above.

Generate a wrapper script job_ompi.sh for OpenMPI containing the following lines:

#!/bin/bash
module load mpi/openmpi/<placeholder_for_version>
# Use when loading OpenMPI in version 1.8.x
mpirun --bind-to core --map-by core -report-bindings my_par_program
# Use when loading OpenMPI in an old version 1.6.x
mpirun -bind-to-core -bycore -report-bindings my_par_program

Attention: Do NOT add mpirun options -n <number_of_processes> or any other option defining processes or nodes, since MOAB instructs mpirun about number of processes and node hostnames. Use ALWAYS the MPI options --bind-to core and --map-by core|socket|node (OpenMPI version 1.8.x). Please type mpirun --help for an explanation of the meaning of the different options of mpirun option --map-by.
Considering 4 OpenMPI tasks on a single node, each requiring 1000 MByte, and running for 1 hour, execute:

$ msub -q singlenode -l nodes=1:ppn=4,pmem=1000mb,walltime=01:00:00 job_ompi.sh

The policy on batch jobs with Intel MPI on bwUniCluster can be found here:

1.3.2.4 Multithreaded + MPI parallel Programs

Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes.
Multiple MPI tasks using OpenMPI must be launched by the MPI parallel program mpirun. For multithreaded programs based on Open Multi-Processing (OpenMP) number of threads are defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
For OpenMPI a job-script to submit a batch job called job_ompi_omp.sh that runs a MPI program with 4 tasks and an fivefold threaded program ompi_omp_program requiring 6000 MByte of physical memory per process/thread (using 5 threads per MPI task you will get 5*6000 MByte = 30000 MByte per MPI task) and total wall clock time of 3 hours looks like:

#!/bin/bash
#MSUB -l nodes=2:ppn=10
#MSUB -l walltime=03:00:00
#MSUB -l pmem=6000mb
#MSUB -v MPI_MODULE=mpi/ompi
#MSUB -v OMP_NUM_THREADS=5
#MSUB -v MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=5 -report-bindings"
#MSUB -v EXECUTABLE=./ompi_omp_program
#MSUB -N test_ompi_omp

module load ${MPI_MODULE}
TASK_COUNT=$((${MOAB_PROCCOUNT}/${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${MOAB_PROCCOUNT} cores with ${TASK_COUNT} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${TASK_COUNT} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe

Execute the script job_ompi_omp.sh adding the queue class multinode to your msub command:

$ msub -q multinode job_ompi_omp.sh
  • With the mpirun option --bind-to core MPI tasks and OpenMP threads are bound to physical cores.
  • With the option --map-by socket:PE=<value> (neighbored) MPI tasks will be attached to different sockets and each MPI task is bound to the (in <value>) specified number of cpus. <value> must be set to ${OMP_NUM_THREADS}.
  • Old OpenMPI version 1.6.x: With the mpirun option -bind-to-core MPI tasks and OpenMP threads are bound to physical cores.
  • With the option -bysocket (neighbored) MPI tasks will be attached to different sockets and the option -cpus-per-proc <value> binds each MPI task to the (in <value>) specified number of cpus. <value> must be set to ${OMP_NUM_THREADS}.
  • The option -report-bindings shows the bindings between MPI tasks and physical cores.
  • The mpirun-options --bind-to core', --map-by socket|...|node:PE=<value> should always be used when running a multithreaded MPI program. (OpenMPI version 1.6.x: The mpirun-options -bind-to-core, -bysocket|-bynode and -cpus-per-proc <value> should always be used when running a multithreaded MPI program.)
  • The policy on batch jobs with Intel MPI + Multithreading on bwUniCluster can be found here:
    bwUniCluster: Intel MPI Parallel Programs with Multithreading

1.3.2.5 Chain jobs

A job chain is a sequence of jobs where each job automatically starts its successor. Chain Job handling differs on the bwHPC Clusters. See the cluster-specific pages:

1.3.2.6 Interactive Jobs

Policies of interactive batch jobs are cluster specific and can be found here:

1.3.3 Handling job script options and arguments

Job script options and arguments as followed:

$ ./job.sh -n 10

can not be passed while using msub command since those will be interpreted as command line options of job.sh (like $1 = -n, $2 = 10).

Solution A:

Submit a wrapper script, e.g. wrapper.sh:

$ msub -q singlenode wrapper.sh

which simply contains all options and arguments of job.sh. The script wrapper.sh would at least contain the following lines:

#!/bin/bash
./job.sh -n 10

Solution B:

Add after the header of your BASH script job.sh the following lines:

## check if $SCRIPT_FLAGS is "set"
if [ -n "${SCRIPT_FLAGS}" ] ; then
   ## but if positional parameters are already present
   ## we are going to ignore $SCRIPT_FLAGS
   if [ -z "${*}"  ] ; then
      set -- ${SCRIPT_FLAGS}
   fi
fi

These lines modify your BASH script to read options and arguments from the environment variable $SCRIPT_FLAGS. Now submit your script job.sh as followed:

$ msub -q singlenode -v SCRIPT_FLAGS='-n 10' job.sh


1.3.4 Moab Environment Variables

Once an eligible compute jobs starts on the compute system, MOAB adds the following variables to the job's environment:

MOAB variables
Environment variables Description
MOAB_CLASS Class name
MOAB_GROUP Group name
MOAB_JOBID Job ID
MOAB_JOBNAME Job name
MOAB_NODECOUNT Number of nodes allocated to job
MOAB_PARTITION Partition name the job is running in
MOAB_PROCCOUNT Number of processors allocated to job
MOAB_SUBMITDIR Directory of job submission
MOAB_USER User name

See also:


1.3.5 Interpreting PBS exit codes

  • The PBS Server logs and accounting logs record the ‘exit status’ of jobs.
  • Zero or positive exit status is the status of the top-level shell.
  • Certain negative exit statuses are used internally and will never be reported to the user.
  • The positive exit status values indicate which signal killed the job.
  • Depending on the system, values greater than 128 (or on some systems 256, see wait(2) or waitpid(2) for more information) are the value of the signal that killed the job.
  • To interpret (or ‘decode’) the signal contained in the exit status value, subtract the base value from the exit status.
    For example, if a job had an exit status of 143, that indicates the jobs was killed via a SIGTERM (e.g. 143 - 128 = 15, signal 15 is SIGTERM).

1.3.5.1 Job termination

  • The exit code from a batch job is a standard Unix termination signal.
  • Typically, exit code 0 means successful completion.
  • Codes 1-127 are generated from the job calling exit() with a non-zero value to indicate an error.
  • Exit codes 129-255 represent jobs terminated by Unix signals.
  • Each signal has a corresponding value which is indicated in the job exit code.

1.3.5.2 Job termination signals

Specific job exit codes are also supplied by the underlying resource manager of the cluster's batch system which is either TORQUE or Slurm. More detailed information can be found in the corresponding documentation:

1.3.5.3 Submitting Termination Signal

Here is an example, how to 'save' a msub termination signal in a typical bwHPC-submit script.

[...]
exit_code=$?
echo "### Calling YOUR_PROGRAM command ..."
mpirun -np 'NUMBER_OF_CORES' $YOUR_PROGRAM_BIN_DIR/runproc ... (options)  2>&1
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
   echo "Executable ${YOUR_PROGRAM_BIN_DIR}/runproc finished with exit code ${$exit_code}"
[...]
  • Do not use 'time' mpirun! The exit code will be the one submitted by the first (time) program and not the msub exit code.
  • You do not need an exit $exit_code in the scripts.

1.4 Start time of job or resources : showstart

The following command can be used by any user to displays the estimated start time of a job based a number of analysis types based on historical usage, earliest available reservable resources, and priority based backlog.

1.4.1 Access

By default, this command can be run by any user.

1.4.2 Showstart Parameters

Parameter Description
DURATION Duration of pseudo-job to be checked in format [[[DD:]HH:]MM:]SS (default duration is 1 second)
-e Estimate method. By default, Moab will use the reservation based estimation method.
-f Use feedback. If specified, Moab will apply historical accuracy information to improve the quality of the estimate.
-g Grid mode. Obtain showstart information from remote resource managers. If -g is not used and Moab determines that job is already migrated, Moab obtains showstart information form the remote Moab where the job was migrated to. All resource managers can be queried by using the keyword "all" which returns all information in a table.
$ showstart -g all head.1
Estimated Start Times
[ Remote RM ] [ Reservation ] [ Priority ] [ Historical ]
[ c1 ] [ 00:15:35 ] [ ] [ ]
[ c2 ] [ 3:15:38 ] [ ] [ ]
-l qos=<QOS> Specifies what QOS the job must start under, using the same syntax as the msub command. Currently, no other resource manager extensions are supported. This flag only applies to hypothetical jobs by using the proccount[@duration] syntax.
JOBID Job to be checked
PROCCOUNT Number of processors in pseudo-job to be checked
S3JOBSPEC XML describing the job according to the Dept. of Energy Scalable Systems Software/S3 job specification.

Note: You cannot specify job flags when running showstart, and since a job by default can only run on one partition. Showstart fails when querying for a job requiring more nodes than the largest partition available.

1.4.3 Showstart Examples

  • To show estimated start time of job <job_ID> enter
$ showstart -e all <job_ID>
  • Furthermore start time of resource demands, e.g. 16 processes @ 12 h, can be displayed
$ showstart -e all 16@12:00:00
  • For a list of all options of showstart read the manpage of showstart

1.5 List of your submitted jobs : showq

Displays information about active, eligible, blocked, and/or recently completed jobs. Since the resource manager is not actually scheduling jobs, the job ordering it displays is not valid. The showq command displays the actual job ordering under the Moab Workload Manager. When used without flags, this command displays all jobs in active, idle, and non-queued states.

1.5.1 Access

By default, this command can be run by any user.
However, the -c, -i, and -r flags can only be used by Moab administrators.

1.5.2 Flags

Flag Description
-b display blocked jobs only
-c display details about recently completed jobs (see example, JOBCPURGETIME)
-g display grid job and system id's for all jobs
-i display extended details about idle jobs
-l display local/remote view. For use in a Grid environment, displays job usage of both local and remote compute resources.
-p display only jobs assigned to the specified partition
-r display extended details about active (running) jobs
-R display only jobs which overlap the specified reservation
-v Display local and full resource manager job IDs as well as partitions. If specified with the '-i' option, will display job reservation time.
-w display only jobs associated with the specified constraint. Valid constraints include user, group, acct, class, and qos.

1.5.3 Examples

These are examples as shown on the Adaptive Homepage (external links).

Showq example on the bwUniCluster for one specific user only (Name and Pop-ID is fiction. Run as MOAB-Admin.).

$ # use UID for option in showq --->
$ showq -u kn_pop332211
active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

8370992            kn_pop33    Running     1  2:05:09:17  Wed Jan 13 15:59:01
8370991            kn_pop33    Running     1  2:05:09:17  Wed Jan 13 15:59:01
8370993            kn_pop33    Running     1  2:05:10:20  Wed Jan 13 16:00:04
[...]
8371040            kn_pop33    Running     1  2:05:11:41  Wed Jan 13 16:01:25

50 active jobs          50 of 7072 processors in use by local jobs (0.71%)
                        434 of 434 nodes active      (100.00%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

0 eligible jobs   

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

0 blocked jobs   

Total jobs:  50
  • The summary of your active jobs shows how many jobs of yours are running, how many processors are in use by your jobs and how many nodes are in use by all active jobs.
  • Use showq -u $USER for your own jobs.
  • For further options of showq read the manpage of showq.

1.6 Shows free resources : showbf

The showbf command can be used by any user to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. This command incorporates down time, reservations, and node state information in determining the available backfill window.
Note If specific information is not specified, showbf will return information for the user and group running but with global access for other credentials.

1.6.1 Access

By default, this command can be used by any user or administrator.

1.6.2 Flags

-A Show resource availability information for all users, groups, and accounts. By default, showbf uses the default user, group, and account ID of the user issuing the command.
Flag Description
-a Show resource availability information only for specified account
-d Show resource availability information for specified duration
-D Display current and future resource availability notation
-f Show resource availability information only for specified feature
-g Show resource availability information only for specified group
-h Help for this command
-L Enforce Hard limits when showing available resources
-m Allows user to specify the memory requirements for the backfill nodes of interest. It is important to note that if the optional MEMCMP and MEMORY parameters are used, they MUST be enclosed in single ticks (') to avoid interpretation by the shell. For example, enter showbf -m '==256' to request nodes with 256 MB memory.
-n Show resource availability information for a specified number of nodes. That is, this flag can be used to force showbf to display only blocks of resources with at least this many nodes available.
-p Show resource availability information for the specified partition
-q Show information for the specified QOS
-u Show resource availability information only for specified user

1.6.3 Parameters

Parameter Description
ACCOUNT Account name
CLASS Class/queue required
DURATION Time duration specified as the number of seconds or in [DD:]HH:MM:SS notation
FEATURELIST Colon separated list of node features required
GROUP Specify particular group
MEMCMP Memory comparison used with the -m flag. Valid signs are >, >=, ==, <=, and <.
MEMORY Specifies the amount of required real memory configured on the node, (in MB), used with the -m flag.
NODECOUNT Specify number of nodes for inquiry with -n flag
PARTITION Specify partition to check with -p flag
QOS Specify QOS to check with -q flag
USER Specify particular user to check with -u flag

1.6.4 Examples

  • The following command displays what resources are available for immediate use for the whole partition.
$ showbf
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  -------------- 
ALL             371    129      INFINITY      00:00:00  11:30:26_01/14
  • The request for 16 nodes can be run immediately on all partitions.
$ showbf -n 16 -d 2:00:00
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL             392    132      INFINITY      00:00:00  11:35:01_01/14
  • This request for 64 nodes returned nothing, meaning it could not be fulfilled immedately.
$ showbf -n 64 -d 2:00:00
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
  • The last example request displays an error message. Too much nodes requested!
$ showbf -n 256 -d 2:00:00
resources not available
  • For further options of showbf read the manpage of showbf.

1.7 Detailed job information : checkjob

Checkjob displays detailed job state information and diagnostic output for a specified job. Detailed information is available for queued, blocked, active, and recently completed jobs.

1.7.1 Access

  • End users can use checkjob to view the status of their own jobs only.
  • JobArrays and Reservations are not available on all clusters. See specific informations depending the single bwUniCluster and bwForClusters for a list of capabilities.

1.7.2 Output

Attribute Value Description
Account <STRING> Name of account associated with job
Actual Run Time [[[DD:]HH:]MM:]SS Length of time job actually ran. This info is only displayed in simulation mode.
Allocated Nodes Square bracket delimited list of node and processor ids List of nodes and processors allocated to job
Applied Nodeset** <STRING> Nodeset used for job's node allocation
Arch <STRING> Node architecture required by job
Attr Square bracket delimited list of job attributes Job Attributes (i.e. [BACKFILL][PREEMPTEE])
Available Memory** <INTEGER> The available memory requested by job. Moab displays the relative or exact value by returning a comparison symbol (>, <, >=, <=, or ==) with the value (i.e. Available Memory <= 2048).
Available Swap** <INTEGER> The available swap requested by job. Moab displays the relative or exact value by returning a comparison symbol (>, <, >=, <=, or ==) with the value (i.e. Available Swap >= 1024).
Average Utilized Procs* <FLOAT> Average load balance for a job
Avg Util Resources Per Task* <FLOAT>
BecameEligible <TIMESTAMP> The date and time when the job moved from Blocked to Eligible.
Bypass <INTEGER> Number of times a lower priority job with a later submit time ran before the job
CheckpointStartTime** [[[DD:]HH:]MM:]SS The time the job was first checkpointed
Class [<CLASS NAME> <CLASS COUNT>] Name of class/queue required by job and number of class initiators required per task.
Dedicated Resources Per Task* Space-delimited list of <STRING>:<INTEGER> Resources dedicated to a job on a per-task basis
Disk <INTEGER> Amount of local disk required by job (in MB)
Estimated Walltime [[[DD:]HH:]MM:]SS The scheduler's estimated walltime. In simulation mode, it is the actual walltime.
EnvVariables** Comma-delimited list of <STRING> List of environment variables assigned to job
Exec Size* <INTEGER> Size of job executable (in MB)
Executable <STRING> Name of command to run
Features Square bracket delimited list of <STRING>s Node features required by job
Flags
Group <STRING> Name of UNIX group associated with job
Holds Zero or more of User, System, and Batch Types of job holds currently applied to job
Image Size <INTEGER> Size of job data (in MB)
IWD (Initial Working Directory) <DIR> Directory to run the executable in
Job Messages** <STRING> Messages attached to a job
Job Submission** <STRING> Job script submitted to RM
Memory <INTEGER> Amount of real memory required per node (in MB)
Max Util Resources Per Task* <FLOAT>
NodeAccess*
Nodecount <INTEGER> Number of nodes required by job
Opsys <STRING> Node operating system required by job
Partition Mask ALL or colon delimited list of partitions List of partitions the job has access to
PE <FLOAT> Number of processor-equivalents requested by job
Per Partition Priority** Tabular Table showing job template priority for each partition
Priority Analysis** Tabular Table showing how job's priority was calculated:
Job PRIORITY* Cred( User:Group:Class) Serv(QTime)
QOS <STRING> Quality of Service associated with job
Reservation <RSVID ( <TIME1 - <TIME2> Duration: <TIME3>) RESID specifies the reservation id, TIME1 is the relative start time, TIME2 the relative end time
TIME3 The duration of the reservation
Req [<INTEGER>] TaskCount: <INTEGER> Partition: <partition> A job requirement for a single type of resource followed by the number of tasks instances required and the appropriate partition
StartCount <INTEGER> Number of times job has been started by Moab
StartPriority <INTEGER> Start priority of job
StartTime Time job was started by the resource management system
State One of Idle, Starting, Running, etc Current Job State
SubmitTime Time job was submitted to resource management system
Swap <INTEGER> Amount of swap disk required by job (in MB)
Task Distribution* Square bracket delimited list of nodes
Time Queued
Total Requested Nodes** <INTEGER> Number of nodes the job requested
Total Requested Tasks <INTEGER> Number of tasks requested by job
User <STRING> Name of user submitting job
Utilized Resources Per Task* <FLOAT>
WallTime [[[DD:]HH:]MM:]SS of [[[DD:]HH:]MM:]SS Length of time job has been running out of the specified limit

In the above table, fields marked with an asterisk (*) are only displayed when set or when the -v flag is specified. Fields marked with two asterisks (**) are only displayed when set or when the -v -v flag is specified.

1.7.3 Arguments

Argument | Format | Default | Description
--flags | --flags=future | (none) | Evaluates the future eligibility of the job (ignores current resource state and usage limitations).
  $ checkjob -v --flags=future 8370992
  Displays the reasons why an idle job is blocked, ignoring node state and current node utilization constraints.
-l (Policy level) | <POLICYLEVEL>: HARD, SOFT, or OFF | (none) | Reports job start eligibility subject to the specified throttling policy level.
  $ checkjob -l SOFT 8370992
  $ checkjob -l HARD 8370992
-n (NodeID) | <NODEID> | (none) | Checks job access to the specified node and the preemption status with regard to jobs located on that node.
  $ checkjob -n uc1n320 8370992
-r (Reservation) | <RSVID> | (none) | Checks job access to the specified reservation <RSVID>.
  $ checkjob -r rainer_kn_resa.1 8370992
-v (Verbose) | (n/a) | | Sets verbose mode. If the job is part of an array, the -v option shows pertinent array information before the job-specific information. Specifying double verbose (-v -v) displays additional information about the job.
  $ checkjob -v 8370992

8370992 = JobId (see examples above)

1.7.4 Parameters

Parameters, their descriptions (there are a lot!) and examples can be found on the Adaptive Computing documentation page.

  • For further options of checkjob see the manual page of checkjob
    $ man checkjob

1.7.5 Checkjob Examples

Here is an example from the bwUniCluster.

$ showq -u $USER   # show my own jobs
active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
8370992            kn_popnn    Running     1  2:03:56:50  Wed Jan 13 15:59:01
8370991            kn_popnn    Running     1  2:03:56:50  Wed Jan 13 15:59:01
[...]
8371040            kn_popnn    Running     1  2:03:59:14  Wed Jan 13 16:01:25

49 active jobs          49 of 7072 processors in use by local jobs (0.69%)
                        434 of 434 nodes active      (100.00%)
eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
0 eligible jobs   

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
0 blocked jobs   

Total jobs:  49
$
$ # now, see what's up with the first job in my queue
$ 
$ checkjob 8370992
job 8370992

AName: Nic_cit_09_Apo_zal_07_cl2_07_cl3_07_cl5_07_2.moab
State: Running 
Creds:  user:kn_pop'nnnnn'  group:kn_kn  account:konstanz  class:singlenode
WallTime:   20:04:28 of 3:00:00:00
BecameEligible: Wed Jan 13 15:58:11
SubmitTime: Wed Jan 13 15:57:58
  (Time Queued  Total: 00:01:03  Eligible: 00:00:58)

StartTime: Wed Jan 13 15:59:01
TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: uc1
Memory >= 4000M  Disk >= 0  Swap >= 0
Dedicated Resources Per Task: PROCS: 1  MEM: 4000M
NodeSet=ONEOF:FEATURE:[NONE]

Allocated Nodes:
[uc1n320:1]

SystemID:   uc1
SystemJID:  8370992

IWD:            /pfs/data2/home/kn/kn_kn/kn_pop139522/fastsimcoal25/Midas_RAD_anchored/Nic_cit_09_Apo_zal_07_cl2_07_cl3_07_cl5_07/1col_DIV-resize_admix_zal1st_starlike_cl2-base_CL-growths_GL-bottlegrowth_onlyintramig/new_est/run_2
SubmitDir:      /pfs/data2/home/kn/kn_kn/kn_pop139522/fastsimcoal25/Midas_RAD_anchored/Nic_cit_09_Apo_zal_07_cl2_07_cl3_07_cl5_07/1col_DIV-resize_admix_zal1st_starlike_cl2-base_CL-growths_GL-bottlegrowth_onlyintramig/new_est/run_2
Executable:     /opt/moab/spool/moab.job.voNyde

StartCount:     1
BypassCount:    1
Partition List: uc1
Flags:          BACKFILL,FSVIOLATION,GLOBALQUEUE
Attr:           BACKFILL,FSVIOLATION
StartPriority:  -3692
PE:             1.00
Reservation '8370992' (-20:04:34 -> 2:03:55:26  Duration: 3:00:00:00)
[...]

You can use standard Linux pipe commands to filter the very detailed checkjob output.

  • Is the job still running?
$ checkjob 8370992 | grep ^State
State: Running 
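  • How much of the requested walltime has been consumed? (The field name and output are taken from the sample checkjob output above.)
$ checkjob 8370992 | grep ^WallTime
WallTime:   20:04:28 of 3:00:00:00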
  • Write your own checkjob wrapper to tailor the checkjob output to your needs. Here's an example (copy/paste it if you'd like to use it):
#!/bin/bash
# cj - display the Moab job status (checkjob wrapper)
# rainer.rutka@uni-konstanz.de, 2015-09-21, ~1.1
# $1 = Moab job number, $2 = sleep time in seconds
shopt -s extglob        # extended globbing for the "+([0-9])" integer checks

rev=$(tput rev)         # reverse video (kept for convenience)
res=$(tput sgr0)        # reset all attributes
resc=$(tput setf 0)     # black ('normal' text colour)
gruen=$(tput setf 2)    # green
rot=$(tput setf 4)      # red

[ "$1" == "-h" ] && { echo "usage: $(basename "$0") (int)Moab-Job-Number [(int)Interval/Seconds(Default:30s)]"; exit 0; }
[ "$1" ] || { echo "Moab job number is missing..."; exit 1; } && { jobid="$1"; }
[ ! -z "${jobid##+([0-9])}" ] && { echo "<${jobid}> is not a number!"; exit 1; }
[ "$2" ] && { let schlaf="$2"; } || { let schlaf=30; }
[ ! -z "${2##+([0-9])}" ] && { echo "<${2}> is not a number!"; exit 1; }

let sekunden=0          # completed polling intervals
let von=1               # first column of the progress dots
let bis=73              # last column of the progress dots
let rechts=1            # current column of the progress dots
pos () { tput cup ${1} ${2}; }   # move the cursor to row $1, column $2

# print the elapsed waiting time as s, m/s or h/m/s
dauer () {
   pos 0 90
   let gesamt=$((sekunden*schlaf))
   pos 0 38
    echo 'Wait time:'
   [ "$gesamt" -lt 60 ] && { pos 0 49; echo "${gruen}${gesamt}s${resc}"; }
   [ "$gesamt" -ge 60 ] && { pos 0 49; echo "${gruen}$((gesamt / 60))m$((gesamt % 60))s ${resc}"; }
   [ "$gesamt" -ge 3600 ] && { pos 0 49; echo "${gruen}$((gesamt / 3600))h$((gesamt % 3600 / 60))m$((gesamt % 60))s ${resc}"; }
}

# poll checkjob until the job is completed or removed
checkit () {
   tput clear
   pos 0 1
   echo "${resc}Job: ${gruen}${jobid}${resc} Status: "
   pos 0 60
   echo "Interval: ${schlaf}s"

   pos 1 1
   echo '>'

   while true
   do
      status=$(checkjob ${jobid} | grep ^State:)
      pos 0 22
      echo "<${rot}${status/State:}${resc}>"
      dauer
      [ "${status/State:}" == "" ] && { let sekunden=0; echo "${rot} ERROR! ${resc}"; exit 1; }
      [ "${status/State:}" == " Completed " ] && { echo "${gruen} Job ${jobid} is done!${resc}"; exit 0; }
      [ "${status/State:}" == " Removed " ] && { echo "${rot} Job ${jobid} was deleted!${resc}"; exit 1; }
      [ "$rechts" == "$bis" ] && { let rechts="$von"; checkit ${jobid} ${schlaf} ; } || let rechts++
      pos 1 $rechts
      sleep ${schlaf}
      echo -n .
      let sekunden++;
   done
}
checkit ${*}

Using the same job ID as above, the output of the script (named 'cj') looks like this:

$ cj -h
usage: cj (int)Moab-Job-Number [(int)Interval/Seconds(Default:30s)]
$ cj 8370992 10 # update every 10 seconds
 Job: 8370992 Status: < Running >     Wait time: 30m50s     Interval: 10s
 >.........................................
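  • If you do not need the formatted display, the standard Linux watch command achieves much the same; a minimal sketch, reusing the job ID and interval from the example above:
$ watch -n 10 'checkjob 8370992 | grep ^State:'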

1.8 Blocked job information : checkjob -v

This command allows you to check the detailed status and resource requirements of your active, queued, or recently completed job. Additionally, it performs numerous diagnostic checks and determines if and where the job could potentially run. Diagnostic checks include policy violations, reservation constraints, preemption status, and job-to-resource mapping. If a job cannot run, a text reason is provided along with a summary of how many nodes are and are not available. If the -v flag is specified, a node-by-node summary of resource availability is displayed for idle jobs.

If your job is blocked, do not delete it!

1.8.1 Job Eligibility

For job-level eligibility issues, one of the following reasons will be given:

Reason | Description
job has hold in place | one or more job holds are currently in place
insufficient idle procs | there are currently not adequate processor resources available to start the job
idle procs do not meet requirements | adequate idle processors are available, but these do not meet the job requirements
start date not reached | the job has specified a minimum start date which is still in the future
expected state is not idle | the job is in an unexpected state
state is not idle | the job is not in the idle state
dependency is not met | the job depends on another job reaching a certain state
rejected by policy | the job start is prevented by a throttling policy

If a job cannot run on a particular node, one of the following 'per node' reasons will be given:

Reason | Description
Class | Node does not allow the required job class/queue
CPU | Node does not possess the required processors
Disk | Node does not possess the required local disk
Features | Node does not possess the required node features
Memory | Node does not possess the required real memory
Network | Node does not possess the required network interface
State | Node is not Idle or Running

1.8.2 Example

A blocked job has hit a limit and will become idle again once resources become free. The "-v" (verbose) mode of checkjob also shows a "BLOCK MSG:" line with more details.

checkjob -v 8370992
[...]

 BLOCK MSG: job <jobID> violates active SOFT MAXPROC limit of 750 for acct mannheim
  partition ALL (Req: 160  InUse: 742) (recorded at last scheduling iteration)

In this case the job has reached the account limit of mannheim: it requested 160 cores while 742 were already in use.
The most common cause of blocked jobs is a violation of the MAXPROC or MAXPS limits, indicating that your group has too many outstanding processor seconds scheduled at the same time.
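To quickly list only your blocked jobs, the blocked-jobs section of showq can be displayed on its own; a minimal sketch, assuming your Moab version supports the -b flag of showq:

$ showq -b -u $USER   # show only my blocked jobs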

1.8.3 The Limits imposed by the Scheduler

This refers to limits on the number of jobs in the queue that are enforced by the scheduler. The largest factors in determining these limits are the Maximum Processor Seconds (MAXPS) and the Maximum Processors (MAXPROC) of each account. MAXPS is the total number of processor core seconds (ps) allocated to each (group) account. It is based on fairshare values, which depend on the values configured for your <OE> (Konstanz, Ulm, etc.).
Users can submit any number of jobs, but the jobs cannot be scheduled to run if their group's MAXPROC or MAXPS value would be exceeded; instead, they enter a "HOLD" state. If the group's limits are not reached but the resources are not available, the jobs enter the "IDLE" state and will run once the requested resources become available.
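As a rough self-check before submitting, you can estimate how many processor seconds a single job books against your group's MAXPS: cores multiplied by walltime in seconds. A minimal sketch, reusing the 160-core request from the blocked-job example above and assuming a 24-hour walltime:

$ procs=160; walltime_hours=24
$ echo "$(( procs * walltime_hours * 3600 )) processor seconds"
13824000 processor seconds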

1.9 Canceling own jobs : canceljob

Caution: This command is deprecated. Use mjobctl -c instead!

The canceljob <JobId> command is used to selectively cancel the specified job(s) (active, idle, or non-queued) from the queue.

Note that you can only cancel your own jobs.

1.9.1 Access

This command can be run by any Moab Administrator and by the owner of the job.

Flag | Name | Format | Default | Description | Example
-h | HELP | n/a | | Display usage information | $ canceljob -h
 | JOB ID | <STRING> | (none) | a job id, a job expression, or the keyword 'ALL' | see: example use of canceljob below
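Because the job argument may be a job expression or the keyword 'ALL', all of your own jobs can be cancelled at once; a minimal sketch (as a regular user, 'ALL' affects only your own jobs):

$ canceljob ALL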

1.9.2 Example Use of Canceljob

Example use of canceljob, run on the bwUniCluster:

[...calc_repo-0]$ msub bwhpc-fasta-example.moab
8374356              # this is the JobId
$
$ checkjob 8374356
job 8374356
AName: fasta36_job
State: Idle 
Creds:  user:kn_pop235844  group:kn_kn  account:konstanz  class:multinode
WallTime:   00:00:00 of 00:10:00
BecameEligible: Fri Jan 15 12:10:53
SubmitTime: Fri Jan 15 12:10:43
  (Time Queued  Total: 00:00:10  Eligible: 00:00:08)
[...]

$ checkjob 8374356 | grep ^State:
State: Idle              # state is 'Idle'

$ # now cancel the job
$ canceljob 8374356
job '8374356' cancelled

$ checkjob 8374356 | grep ^State:
State: Removed      # state turned into 'Removed'

1.10 Moab Job Control : mjobctl

The mjobctl command controls various aspects of jobs. It is used to submit, cancel, execute, and checkpoint jobs. It can also display diagnostic information about your own jobs.

1.10.1 Canceling own jobs : mjobctl -c

If you want to cancel a job that has been submitted, please do not use the PBS/Torque command qdel (not available here) or the deprecated canceljob command.
Instead, use mjobctl -c <jobid>.

Flag | Format | Default | Description | Example
-c | JobId | (none) | Cancel a job. | see: example use of mjobctl -c below

1.10.1.1 Example Use of mjobctl -c

Canceling a job on the bwUniCluster

[...-calc_repo-0]$ msub bwhpc-fasta-example.moab
8374426

$ checkjob 8374426 | grep ^State
State: Idle                # job is 'Idle'

$ mjobctl -c 8374426
job '8374426' cancelled    # job is cancelled

$ checkjob 8374426 | grep ^State
State: Removed             # now, job is removed

$ # my own checkjob wrapper
$ cj 8374426
 Job: 8374426 Status: < Removed >     Wait time: 1m30s         Interval: 30s
 Job 8374426 was deleted!
$ 

1.10.1.2 E-Mail notification of mjobctl -c

You will receive an e-mail notification like this (an example with the Slurm resource manager)

From: slurm@uc1-sv1.scc.kit.edu
Subject: SLURM Job_id= 8374426 Name=fasta36_job Ended, Run time 00:01:30, CANCELLED, ExitCode 0

from the resource manager shortly after the job was removed from the queue.

To enable such notifications, add these MOAB options to your submit script, or pass them as command-line parameters to the msub command:

#MSUB -m ae
#MSUB -M e-mail@FQDN (e-mail address like: name@domain.de)
Related msub options:
Option | Description
-m <option(s)> | Defines the set of conditions (a=abort, b=begin, e=end) under which the server sends a mail message about the job to the user.
-M <e-mail> | Defines the e-mail address the notifications are sent to (see the #MSUB -M line above).
-N <name> | Gives a user-specified name to the job. Note that job names do not appear in all MOAB job info displays and do not determine how your job's stdout/stderr files are named.
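Putting this together, the notification options can be combined in the header of a submit script. The following is a minimal sketch only: the resource requests (-l lines) are illustrative, the job name is borrowed from the fasta example above, and the e-mail address is a placeholder.

#!/bin/bash
#MSUB -l nodes=1:ppn=1          # illustrative resource request
#MSUB -l walltime=00:10:00      # walltime as in the fasta example above
#MSUB -N fasta36_job            # user-defined job name (shown as AName by checkjob)
#MSUB -m ae                     # send mail on job abort (a) and end (e)
#MSUB -M name@domain.de         # placeholder - replace with your own address
# ... your commands follow here ...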

1.10.2 Other Mjobctl-Options

See also the Adaptive Computing documentation page for mjobctl.

Note that not all of the listed options are available to 'normal' users; some are for MOAB admins only.