Revision as of 12:53, 3 February 2025

As of February 1st, 2025, Moab® is no longer licensed. As a consequence, the tools previously provided by the module system/moab/9.1.3 (loaded with module load system/moab/9.1.3), such as checkjob, are no longer available.

Torque scheduler

Any calculation on the bwForCluster BinAC compute nodes must be defined as a batch job: a single command or a sequence of commands, together with the required run time, number of CPU cores, and main memory. The batch job is then submitted to the resource and workload managing software. All job submissions are therefore made with commands of the Torque scheduler. Torque queues and runs user jobs based on fair-share policies.

Torque Commands

Some of the most used Torque commands for non-administrators working on the bwForCluster BinAC

Torque commands	Brief explanation
qsub	Submits a job and places it in an input queue

Job Submission : qsub

Batch jobs are submitted with the command qsub. The main purpose of the qsub command is to specify the resources that are needed to run the job. qsub then queues the batch job. When the batch job actually starts depends on the availability of the requested resources and on the fair-share value.

qsub Command Parameters

The syntax and use of qsub can be displayed via:

$ man qsub

qsub options can be used from the command line or in your job script.

qsub Options
Command line	Script	Purpose
-l resources	#PBS -l resources	Defines the resources that are required by the job. See the description below for this important flag.
-N name	#PBS -N name	Gives a user specified name to the job.
-o filename	#PBS -o filename	Defines the file name used for the standard output stream of the batch job. By default the file is placed in your job submit directory. To place it in a different location, prefix the file name with the relative or absolute path of the destination.
-q queue	#PBS -q queue	Defines the queue class.
-v variable=arg	#PBS -v variable=arg	Expands the list of environment variables that are exported to the job.
-S Shell	#PBS -S Shell	Declares the shell (state path and name, e.g. /bin/bash) that interprets the job script.
-m bea	#PBS -m bea	Sends email when the job begins (b), ends (e), or aborts (a).
-M name@uni.de	#PBS -M name@uni.de	Sends email to the specified email address "name@uni.de".
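
As a sketch of how the script-side options from the table combine, the header below sets a job name, an output file, the shell, and mail notification. The job name example_job, the file name example_job.out, and the small resource request are placeholders chosen for illustration; name@uni.de is the sample address from the table. Outside of a batch job the #PBS lines are ordinary comments, so the script can also be tried locally.

```shell
#!/bin/bash
#PBS -N example_job
#PBS -o example_job.out
#PBS -S /bin/bash
#PBS -m bea
#PBS -M name@uni.de
#PBS -l nodes=1:ppn=1,walltime=00:10:00

# Torque sets PBS_JOBNAME inside a job; the fallback is only for local tests.
job_name="${PBS_JOBNAME:-example_job}"
echo "Job ${job_name} started on $(uname -n)"
```

The same options could be given on the qsub command line instead of in the script header.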

qsub -l resource_list

The -l option is one of the most important qsub options. It is used to specify a number of resource requirements for your job. Multiple resource strings are separated by commas.

qsub -l resource_list
resource	Purpose
-l nodes=2:ppn=16	Number of nodes and number of processes per node
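-l walltime=600 or -l walltime=01:30:00	Wall-clock time. The default unit is seconds; the HH:MM:SS format is also accepted.
-l pmem=1000mb	Maximum amount of physical memory used by any single process of the job. Allowed units are kb, mb, gb. Be aware that processes are either MPI tasks or threads, depending on the job.
-l mem=5000mb	Maximum amount of physical memory used by the job as a whole, i.e. the accumulated memory for all MPI tasks or all threads of the job. Allowed units are kb, mb, gb.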
-l advres=res_name	Specifies the reservation "res_name" required to run the job.
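
To illustrate the two walltime notations, here is a minimal sketch that converts a value in seconds into the HH:MM:SS form. The function name to_hms is a hypothetical helper and not part of Torque.

```shell
#!/bin/bash
# Convert a walltime given in seconds to HH:MM:SS.
# to_hms is a hypothetical helper name, not a Torque command.
to_hms() {
    local seconds=$1
    printf '%02d:%02d:%02d\n' \
        $(( seconds / 3600 )) $(( seconds % 3600 / 60 )) $(( seconds % 60 ))
}

to_hms 600     # prints 00:10:00
to_hms 5400    # prints 01:30:00
```

So -l walltime=600 requests ten minutes, and -l walltime=01:30:00 is equivalent to -l walltime=5400.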

qsub -q queues

Queue classes define maximum resources such as walltime, nodes, and processes per node, as well as the partition of the compute system. Note that the queue settings of the bwHPC clusters are not identical; they differ because of the clusters' different prerequisites, such as HPC performance, scalability, and throughput levels. Details can be found here:

bwForCluster BINAC queue settings

qsub Examples

Serial Programs

To submit a serial job that runs the script job.sh and requires 5000 MB of main memory and 3 hours of wall-clock time:

a) execute:

$ qsub -q short -N test -l nodes=1:ppn=1,walltime=3:00:00,mem=5000mb   job.sh

or b) add after the initial line of your script job.sh the lines (here with a high memory request):

#PBS -l nodes=1:ppn=1
#PBS -l walltime=3:00:00
#PBS -l mem=200gb
#PBS -N test

and submit the modified script with the command line option -q smp, because the standard compute nodes have only 128 GB of memory:

$ qsub -q smp job.sh

Note that qsub command line options override script options.

Multithreaded Programs

Multithreaded programs operate faster than serial programs on CPUs with multiple cores.
Moreover, multiple threads of one process share resources such as memory.
For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
To submit a batch job called OpenMP_Test that runs a four-threaded program omp_executable, which requires 6000 MByte of total physical memory and a total wall-clock time of 3 hours,

generate the script job_omp.sh containing the following lines:

#!/bin/bash
#PBS -l nodes=1:ppn=4
#PBS -l walltime=3:00:00
#PBS -l mem=6000mb
#PBS -v EXECUTABLE=./omp_executable
#PBS -v MODULE=<placeholder>
#PBS -N OpenMP_Test

#Usually you should set
export KMP_AFFINITY=compact,1,0
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

module load ${MODULE}
export OMP_NUM_THREADS=${PBS_NUM_PPN}
echo "Executable ${EXECUTABLE} running on ${PBS_NUM_PPN} cores with ${OMP_NUM_THREADS} threads"
startexe=${EXECUTABLE}
echo $startexe
exec $startexe

With the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If necessary, replace <placeholder> with the modulefile required to enable the OpenMP environment. Then submit the script job_omp.sh, adding the queue class short as a qsub option:

$ qsub -q short job_omp.sh

Note that qsub command line options override script options, e.g.,

$ qsub -l mem=2000mb -q short job_omp.sh

overrides the script setting of 6000 MByte with 2000 MByte.

MPI Parallel Programs

MPI parallel programs run faster than serial programs on multi-CPU and multi-core systems. N-fold spawned processes of the MPI program, i.e., MPI tasks, run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
Multiple MPI tasks cannot be launched by the MPI parallel program itself; they must be started via mpirun, e.g. 4 MPI tasks of my_par_program:

$ mpirun -n 4 my_par_program

Generate a script job_ompi.sh for OpenMPI containing the following lines:

#!/bin/bash
module load mpi/openmpi/<placeholder_for_version>
# Use when loading OpenMPI in version 1.8.x
mpirun --bind-to core --map-by core -report-bindings my_par_program
# Use when loading OpenMPI in an old version 1.6.x
mpirun -bind-to-core -bycore -report-bindings my_par_program

Attention: Do NOT add the mpirun option -n <number_of_processes> or any other option defining processes or nodes, since Torque tells mpirun the number of processes and the node hostnames. ALWAYS use the MPI options --bind-to core and --map-by core|socket|node (OpenMPI version 1.8.x). Type mpirun --help for an explanation of the possible values of the mpirun option --map-by.
Considering 4 OpenMPI tasks on a single node, each requiring 1000 MByte, and running for 1 hour, execute:

$ qsub -q short -l nodes=1:ppn=4,pmem=1000mb,walltime=01:00:00 job_ompi.sh
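
Because pmem is a per-process limit, the total memory of the job follows from the number of MPI tasks. The sketch below assembles the resource string used above from its parts and prints the resulting total; the variable names are chosen for illustration only.

```shell
#!/bin/bash
tasks=4              # MPI tasks, one per requested core (ppn)
pmem_mb=1000         # memory per MPI task in MB
walltime="01:00:00"  # one hour

resources="nodes=1:ppn=${tasks},pmem=${pmem_mb}mb,walltime=${walltime}"
total_mb=$(( tasks * pmem_mb ))

# Print the command instead of running it, so the sketch works without Torque.
echo "qsub -q short -l ${resources} job_ompi.sh"
echo "Memory requested for the whole job: ${total_mb} MB"
```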

Multithreaded + MPI parallel Programs

Multithreaded + MPI parallel programs operate faster than serial programs on multi-CPU systems with multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned over different nodes.
Multiple MPI tasks using OpenMPI must be launched with the MPI launcher mpirun. For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
For OpenMPI, a job script job_ompi_omp.sh that submits a batch job running the MPI program ompi_omp_program with 4 tasks and 5 threads per task, requiring 6000 MByte of physical memory per process/thread (with 5 threads per MPI task you get 5*6000 MByte = 30000 MByte per MPI task) and a total wall-clock time of 3 hours, looks like:

#!/bin/bash
#PBS -l nodes=2:ppn=10
#PBS -l walltime=03:00:00
#PBS -l pmem=6000mb
#PBS -v MPI_MODULE=mpi/ompi
#PBS -v OMP_NUM_THREADS=5
#PBS -v MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=5 -report-bindings"
#PBS -v EXECUTABLE=./ompi_omp_program
#PBS -N test_ompi_omp

module load ${MPI_MODULE}
# PBS_NP is the total number of cores of the job (nodes * ppn = 20), so 20/5 gives 4 MPI tasks
TASK_COUNT=$((${PBS_NP}/${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${PBS_NP} cores with ${TASK_COUNT} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${TASK_COUNT} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe

Execute the script job_ompi_omp.sh adding the queue class multinode to your qsub command:

$ qsub -q multinode job_ompi_omp.sh

With the mpirun option --bind-to core, MPI tasks and OpenMP threads are bound to physical cores.
With the option --map-by socket:PE=<value>, neighboring MPI tasks are attached to different sockets and each MPI task is bound to the number of CPUs given in <value>. <value> must be set to ${OMP_NUM_THREADS}.
Old OpenMPI version 1.6.x: With the mpirun option -bind-to-core, MPI tasks and OpenMP threads are bound to physical cores.
With the option -bysocket, neighboring MPI tasks are attached to different sockets, and the option -cpus-per-proc <value> binds each MPI task to the number of CPUs given in <value>. <value> must be set to ${OMP_NUM_THREADS}.
The option -report-bindings shows the bindings between MPI tasks and physical cores.
The mpirun options --bind-to core and --map-by socket|...|node:PE=<value> should always be used when running a multithreaded MPI program. (OpenMPI version 1.6.x: The mpirun options -bind-to-core, -bysocket|-bynode and -cpus-per-proc <value> should always be used when running a multithreaded MPI program.)
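
The numbers of this hybrid example can be checked with a little shell arithmetic. The sketch below only uses the values from the job script (2 nodes, 10 cores per node, 5 threads per task, 6000 MByte per thread); the variable names are illustrative.

```shell
#!/bin/bash
nodes=2
ppn=10
omp_threads=5
pmem_mb=6000

total_cores=$(( nodes * ppn ))                # 20 cores in total
mpi_tasks=$(( total_cores / omp_threads ))    # 4 MPI tasks
mem_per_task_mb=$(( omp_threads * pmem_mb ))  # 30000 MB per MPI task
mem_per_node_mb=$(( ppn * pmem_mb ))          # 60000 MB per node

echo "${mpi_tasks} MPI tasks with ${omp_threads} threads each on ${total_cores} cores"
echo "${mem_per_task_mb} MB per MPI task, ${mem_per_node_mb} MB per node"
```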

Handling job script options and arguments

Job script options and arguments such as:

$ ./job.sh -n 10

cannot be passed directly when using the qsub command, since qsub does not hand them over to job.sh as positional parameters (like $1 = -n, $2 = 10).

Solution A:

Submit a wrapper script, e.g. wrapper.sh:

$ qsub -q singlenode wrapper.sh

which simply contains all options and arguments of job.sh. The script wrapper.sh would at least contain the following lines:

#!/bin/bash
./job.sh -n 10

Solution B:

Add after the header of your BASH script job.sh the following lines:

## check if $SCRIPT_FLAGS is "set"
if [ -n "${SCRIPT_FLAGS}" ] ; then
   ## but if positional parameters are already present
   ## we are going to ignore $SCRIPT_FLAGS
   if [ -z "${*}"  ] ; then
      set -- ${SCRIPT_FLAGS}
   fi
fi

These lines modify your BASH script to read options and arguments from the environment variable $SCRIPT_FLAGS. Now submit your script job.sh as follows:

$ qsub -q singlenode -v SCRIPT_FLAGS='-n 10' job.sh
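
How the script could then evaluate these parameters is sketched below with the shell builtin getopts. The function parse_job_flags, the option -n, and the variable count are assumptions made for this illustration; the logic is wrapped in a function so that it can be tried without submitting a job.

```shell
#!/bin/bash
# parse_job_flags is a hypothetical helper mirroring the lines of Solution B.
parse_job_flags() {
    # Fall back to SCRIPT_FLAGS only if no positional parameters were given.
    if [ "$#" -eq 0 ] && [ -n "${SCRIPT_FLAGS:-}" ]; then
        # Deliberately unquoted so that the string is split into words.
        set -- ${SCRIPT_FLAGS}
    fi
    local opt
    OPTIND=1
    count=1   # hypothetical default value
    while getopts "n:" opt; do
        case "${opt}" in
            n) count="${OPTARG}" ;;
            *) echo "usage: job.sh [-n count]" >&2; return 1 ;;
        esac
    done
    echo "count=${count}"
}

SCRIPT_FLAGS='-n 10'
parse_job_flags          # prints count=10
parse_job_flags -n 3     # explicit arguments take precedence: prints count=3
```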

Environment Variables

Once an eligible compute job starts on the compute system, PBS (our resource manager) adds the following variables to the job's environment:

PBS variables
Environment variables	Description
PBS_JOBID	Job ID
PBS_JOBNAME	Job name
PBS_NUM_NODES	Number of nodes allocated to job
PBS_QUEUE	Name of the queue the job is running in
PBS_NP	Number of processors allocated to job
PBS_O_WORKDIR	Directory of job submission
PBS_O_LOGNAME	User name
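
A minimal sketch of how a job script might use these variables: it changes to the submit directory and prints a one-line summary of the job. The fallback values after ":-" are assumptions that only serve to make the lines runnable outside a batch job.

```shell
#!/bin/bash
# Inside a job Torque sets the PBS_* variables; the fallbacks are for local tests.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

summary="job=${PBS_JOBNAME:-local} id=${PBS_JOBID:-none} queue=${PBS_QUEUE:-none}"
summary="${summary} nodes=${PBS_NUM_NODES:-1} procs=${PBS_NP:-1}"
echo "${summary}"
```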

Interpreting PBS exit codes

The PBS Server logs and accounting logs record the ‘exit status’ of jobs.
A zero or positive exit status is the status of the top-level shell.
Certain negative exit statuses are used internally and will never be reported to the user.
Positive exit status values above a base value indicate which signal killed the job.
Depending on the system, this base value is 128 (or on some systems 256; see wait(2) or waitpid(2) for more information), and values greater than it encode the signal that killed the job.
To interpret (or ‘decode’) the signal contained in the exit status value, subtract the base value from the exit status.
For example, an exit status of 143 indicates that the job was killed via SIGTERM (143 - 128 = 15, and signal 15 is SIGTERM).
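
The decoding rule can be written as a small shell function. This is a sketch assuming the base value 128; signal_from_status is a hypothetical helper name, and kill -l is the shell builtin that translates a signal number into its name.

```shell
#!/bin/bash
# Print the signal number encoded in an exit status, or 0 if there is none.
# The second argument is the base value and defaults to 128.
signal_from_status() {
    local status=$1 base=${2:-128}
    if [ "${status}" -gt "${base}" ]; then
        echo $(( status - base ))
    else
        echo 0
    fi
}

sig=$(signal_from_status 143)
echo "exit status 143 -> signal ${sig} ($(kill -l "${sig}"))"
```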

Job termination

The exit code of a batch job follows the standard Unix conventions.
Typically, exit code 0 means successful completion.
Codes 1-127 are generated by the job calling exit() with a non-zero value to indicate an error.
Exit codes 129-255 represent jobs terminated by Unix signals.
Each signal has a corresponding value, which is encoded in the job exit code.

Job termination signals

Specific job exit codes are also supplied by the underlying resource manager of the cluster's batch system. More detailed information can be found in the corresponding documentation:

TORQUE exit codes

Submitting Termination Signal

Here is an example of how to 'save' a qsub termination signal in a typical bwHPC submit script.

[...]
echo "### Calling YOUR_PROGRAM command ..."
mpirun -np 'NUMBER_OF_CORES' $YOUR_PROGRAM_BIN_DIR/runproc ... (options)  2>&1
# save the exit code immediately after the command it belongs to
exit_code=$?
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
   echo "Executable ${YOUR_PROGRAM_BIN_DIR}/runproc finished with exit code ${exit_code}"
[...]

Do not use 'time' mpirun! The exit code will be the one submitted by the first (time) program and not the qsub exit code.
You do not need an exit $exit_code in the scripts.
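
The same pattern can be tried without a cluster. In the sketch below, run_and_report is a hypothetical helper, and the commands true and sh -c 'exit 3' merely stand in for a successful and a failing mpirun call.

```shell
#!/bin/bash
# Run a command and report its exit code without terminating the script.
run_and_report() {
    local exit_code=0
    "$@" || exit_code=$?
    if [ "${exit_code}" -eq 0 ]; then
        echo "all clean..."
    else
        echo "$1 finished with exit code ${exit_code}"
    fi
}

run_and_report true              # prints: all clean...
run_and_report sh -c 'exit 3'    # prints: sh finished with exit code 3
```

The exit code is stored directly after the command has run, before any other command can overwrite $?.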

List your jobs and show job details : qstat

Displays information about active, eligible, blocked, and/or recently completed jobs. Since the resource manager is not actually scheduling jobs, the job ordering it displays is not valid. The showq command displays the actual job ordering under the Moab Workload Manager. When used without flags, this command displays all jobs in active, idle, and non-queued states.

Show all your jobs: qstat -u $USER
Show details about a specific job: qstat -f JOBID
For further options of qstat read the manpage of qstat.

Canceling own jobs : canceljob

Caution: This command is deprecated. Use mjobctl -c instead!

The canceljob <JobId> command is used to selectively cancel the specified job(s) (active, idle, or non-queued) from the queue.

Note that only own jobs can be cancelled.

Access

This command can be run by any Moab Administrator and by the owner of the job.

Flag	Name	Format	Default	Description	Example
-h	HELP		n./a.	Display usage information	$ canceljob -h
	JOB ID	<STRING>	(none)	a jobid, a job expression, or the keyword 'ALL'	see: example use of canceljob

Example Use of Canceljob

Example use of canceljob run on the bwUniCluster

[...calc_repo-0]$ qsub bwhpc-fasta-example.moab
8374356              # this is the JobId
$
$ checkjob 8374356
job 8374356
AName: fasta36_job
State: Idle 
Creds:  user:kn_pop235844  group:kn_kn  account:konstanz  class:multinode
WallTime:   00:00:00 of 00:10:00
BecameEligible: Fri Jan 15 12:10:53
SubmitTime: Fri Jan 15 12:10:43
  (Time Queued  Total: 00:00:10  Eligible: 00:00:08)
[...]

$ checkjob 8374356 | grep ^State:
State: Idle              # state is 'Idle'

$ # now cancel the job
$ canceljob 8374356
job '8374356' cancelled

$ checkjob 8374356 | grep ^State:
State: Removed      # state turned into 'Removed'

See: E-Mail notification after a job was cancelled/removed

Moab Job Control : mjobctl

The mjobctl command controls various aspects of jobs. It is used to submit, cancel, execute, and checkpoint jobs. It can also display diagnostic information about your own jobs.

Canceling own jobs : mjobctl -c

If you want to cancel a job that has been submitted, please do not use the PBS/Torque qdel (n./a.) or the deprecated canceljob commands.
Instead, use mjobctl -c <jobid>.

Flag	Format	Default	Description	Example
-c	JobId	(none)	Cancel a job.	see: example use of mjobctl -c

Example Use of mjobctl -c

Canceling a job on the bwUniCluster

[...-calc_repo-0]$ qsub bwhpc-fasta-example.moab
8374426

$ checkjob 8374426 | grep ^State
State: Idle                # job is 'Idle'

$ mjobctl -c 8374426
job '8374426' cancelled    # job is cancelled

$ checkjob 8374426 | grep ^State
State: Removed             # now, job is removed

$ # my own checkjob wrapper
cj 8374426
 Job: 8374426 Status: < Removed >     Waiting time: 1m30s         Interval: 30s
 Job 8374426 was deleted!
$

checkjob wrapper

@@ Line 355: / Line 355: @@
 * Show details about a specific job:  <code>qstat -f JOBID </code>
 * For further options of ''qstat'' read the manpage of ''qstat''.
-== Blocked job information : checkjob -v ==
-This command allows to check the detailed status and resource requirements of your active, queued, or recently completed job. Additionally, this command performs numerous diagnostic checks and determines if and where the job could potentially run. Diagnostic checks include policy violations, reservation constraints, preemption status, and job to resource mapping. If a job cannot run, a text reason is provided along with a summary of how many nodes are and are not available. If the -v flag is specified, a node by node summary of resource availability will be displayed for idle jobs.
-<br>
-<br>
-<font color=red>If your job is blocked do not delete it!</font>
-=== Job Eligibility ===
-If a job cannot run, a text reason is provided along with a summary of how many nodes are and are not available. If the -v flag is specified, a node by node summary of resource availability will be displayed for idle jobs. For job level eligibility issues, one of the following reasons will be given:
-<br>
-{| width=750px class="wikitable"
-! Reason !! Description
-|- style="vertical-align:top;"
-| job has hold in place
-| one or more job holds are currently in place
-|- style="vertical-align:top;"
-| insufficient idle procs
-| there are currently not adequate processor resources available to start the job
-|- style="vertical-align:top;"
-| idle procs do not meet requirements
-| adequate idle processors are available but these do not meet job requirements
-|- style="vertical-align:top;"
-| start date not reached
-| job has specified a minimum start date which is still in the future
-|- style="vertical-align:top;"
-| expected state is not idle
-| job is in an unexpected state
-|- style="vertical-align:top;"
-| state is not idle
-| job is not in the idle state
-|- style="vertical-align:top;"
-| dependency is not met
-| job depends on another job reaching a certain state
-|- style="vertical-align:top;"
-| rejected by policy
-| job start is prevented by a throttling policy
-|}
-If a job cannot run on a particular node, one of the following 'per node' reasons will be given:
-{| width=750px class="wikitable"
-! Description || Reason
-|- style="vertical-align:top;"
-| Class
-| Node does not allow required job class/queue
-|- style="vertical-align:top;"
-| CPU
-| Node does not possess required processors
-|- style="vertical-align:top;"
-| Disk
-| Node does not possess required local disk
-|- style="vertical-align:top;"
-| Features
-| Node does not possess required node features
-|- style="vertical-align:top;"
-| Memory
-| Node does not possess required real memory
-|- style="vertical-align:top;"
-| Network
-| Node does not possess required network interface
-|- style="vertical-align:top;"
-| State
-| Node is not Idle or Running
-|}
-=== Example ===
-A '''''blocked''''' job has hit a limit and will become '''''idle''''' if resource get free.
-The "-v (verbose)" mode of 'checkjob' also shows a message "BLOCK MSG:" for more details.
-<pre>
-checkjob -v 8370992
-[...]
- BLOCK MSG: job <jobID> violates active SOFT MAXPROC limit of 750 for acct mannheim
-  partition ALL (Req: 160  InUse: 742) (recorded at last scheduling iteration)
-</pre>
-In this case the job has reached the account limit of mannheim while requesting 160 core when 742 were already in use.
-<br>
-The most common cause of blocked jobs is a violation of MAXPROC or MAXPS limits, indicating that your group has scheduled too many outstanding processor seconds at the same time.
-=== The Limits imposed by the Scheduler ===
-This refers to limits on the number of jobs in the queue which are enforced by the scheduler. The largest factors in determining limits  in numbers of jobs are the Maximum Processor Seconds (MAXPS) and the Maximum Processors (MAXPROC) for each account. The MAXPS is the total number of processor core seconds (ps) allocated for each (group) account. It is based on fairshare values in dependency of the configured values for your <OE> (Konstanz, Ulm, etc. ...) .
-<br>
-Users can submit as many jobs but they cannot be scheduled to run if their groups MAXPROC or MAXPS value is exceeded. They instead enter into a "HOLD" state. If the limits of the group is not reached but the resources are not available, the jobs enter into "IDLE" state and will run once the requested resources become available.
-<br>
 == Canceling own jobs : canceljob ==

BinAC/Moab: Difference between revisions

Contents

Torque scheduler

Torque Commands

Job Submission : qsub

qsub Command Parameters

qsub -l resource_list

qsub -q queues

qsub Examples

Serial Programs

Multithreaded Programs

MPI Parallel Programs

Multithreaded + MPI parallel Programs

Handling job script options and arguments

Environment Variables

Interpreting PBS exit codes

Job termination

Job termination signals

Submitting Termination Signal

List your jobs and show job details : qstat

Canceling own jobs : canceljob

Access

Example Use of Canceljob

Moab Job Control : mjobctl

Canceling own jobs : mjobctl -c

Example Use of mjobctl -c

Other Mjobctl-Options
