Batch Jobs Moab




Any kind of calculation on the compute nodes of bwUniCluster requires the user to define the calculation as a sequence of commands or a single command, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the batch job, to a resource and workload managing software. All bwHPC clusters, including bwUniCluster, use the workload managing software MOAB. Therefore any job submission by the user is to be executed via commands of the MOAB software. MOAB queues and runs user jobs based on fair-share policies.


Overview of MOAB commands:

MOAB command   Brief explanation
msub           submits a job and queues it in an input queue
checkjob       displays detailed job state information
showq          displays information about active, eligible, blocked, and/or recently completed jobs
showstart      returns start time of submitted job or requested resources
canceljob      cancels a job
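
A typical session combines these commands, e.g. (job.sh and <jobID> are placeholders):

$ msub job.sh          # submit the job; msub prints its <jobID>
$ showstart <jobID>    # display the estimated start time
$ checkjob <jobID>     # inspect the current job state
$ canceljob <jobID>    # cancel the job if it is no longer needed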


1 Job Submission

Batch jobs are submitted with the command msub. The main purpose of the msub command is to specify the resources that are needed to run the job. msub will then queue the batch job. However, when the batch job starts depends on the availability of the requested resources and on your fair-share value.

1.1 msub Command

The syntax and use of msub can be displayed via:

$ man msub

msub options can be used from the command line or in your job script.


msub options

Command line           Script                   Purpose
-l resources           #MSUB -l resources       Defines the resources that are required by the job.
                                                See the description below for this important flag.
-N name                #MSUB -N name            Gives a user specified name to the job.
-I                                              Declares that the job is to be run interactively.
-o filename            #MSUB -o filename        Defines the filename to be used for the standard output
                                                stream of the batch job. By default the file with the
                                                defined filename is placed under your job submit directory.
                                                To place it under a different location, expand filename by
                                                the relative or absolute path of the destination.
-q queue               #MSUB -q queue           Defines the queue class.
-v variable=arg        #MSUB -v variable=arg    Expands the list of environment variables that are
                                                exported to the job.
-S Shell               #MSUB -S Shell           Declares the shell (state path+name, e.g. /bin/bash)
                                                that interprets the job script.
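
For example, the standard output stream can be written to a file in the job submit directory:

#MSUB -o test.out

or to a different location via an absolute path (/home/myuser/logs is a hypothetical directory):

#MSUB -o /home/myuser/logs/test.out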



1.1.1 msub -l resource_list

The -l option is one of the most important msub options. It is used to specify a number of resource requirements for your job. Multiple resource strings are separated by commas.


msub -l resource_list

resource                     Purpose
-l nodes=2:ppn=8             Number of nodes and number of processes per node.
-l walltime=600              Wall-clock time. Default units are seconds;
-l walltime=01:30:00         the HH:MM:SS format is also accepted.
-l pmem=1000mb               Maximum amount of physical memory used by any single process
                             of the job. Allowed units are kb, mb, gb. Be aware that
                             processes are either MPI tasks if running MPI parallel jobs
                             or threads if running multithreaded jobs.
-l mem=1000mb                Maximum amount of physical memory used by the job.
                             Allowed units are kb, mb, gb. Be aware that this memory value
                             is the accumulated memory for all MPI tasks or all threads
                             of the job.
-l advres=res_name           Specifies the reservation "res_name" required to run the job.
-l naccesspolicy=policy      Specifies how node resources should be accessed, e.g.
                             -l naccesspolicy=singlejob reserves all requested nodes for
                             the job exclusively. Attention: if you request nodes=1:ppn=4
                             together with singlejob, you will be charged for all cores
                             (=16) of the node.
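
For example, multiple resource strings combined in one -l option request 2 nodes with 8 processes each, 1000 mb per process and 90 minutes of wall-clock time for the (placeholder) job script job.sh:

$ msub -l nodes=2:ppn=8,pmem=1000mb,walltime=01:30:00 job.sh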


1.1.2 msub -q queues

Queue classes define the maximum resources, such as walltime, number of nodes and processes per node, and the partition of the compute system.

msub -q queue

queue          maximum resources
-q develop     walltime=00:30:00 (i.e. 30 min), nodes=1, processes=16
-q singlenode  walltime=3:00:00:00 (i.e. 3 days), nodes=1, processes=16
-q multinode   walltime=2:00:00:00 (i.e. 2 days), nodes=8
-q verylong    walltime=6:00:00:00 (i.e. 6 days), nodes=1, processes=16
-q fat         walltime=1:00:00:00 (i.e. 1 day), nodes=1, processes=32 on fat nodes

If no queue class is specified explicitly in your msub command, your batch job is automatically assigned to one of the queues develop, singlenode or multinode, based on the requested walltime, nodes and processes.

  • To run your batch job longer than 3 days, please use msub -q verylong.
  • To run your batch job on one of the fat nodes, please use msub -q fat.
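
For example (job.sh is a placeholder for your job script):

$ msub -q verylong -l nodes=1:ppn=16,walltime=5:00:00:00 job.sh
$ msub -q fat -l nodes=1:ppn=32,walltime=12:00:00 job.sh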



1.2 msub Examples

1.2.1 Serial Programs

To submit a serial job that runs the script job.sh and that requires 5000 MB of main memory and 3 hours of wall clock time

a) execute:

$ msub -N test -l nodes=1:ppn=1,walltime=3:00:00,pmem=5000mb   job.sh

or

b) add after the initial line of your script job.sh the lines:

#MSUB -l nodes=1:ppn=1
#MSUB -l walltime=3:00:00
#MSUB -l pmem=5000mb
#MSUB -N test

and execute the modified script without any msub command line options:

$ msub job.sh


Note that msub command line options overrule script options.


1.2.1.1 Handling job script options and arguments

Job script options and arguments, as in:

./job.sh -n 10

cannot be passed when using the msub command, since they will be interpreted as command line options of msub itself.


Solution A:

Submit a wrapper script, e.g. job_msub.sh:

msub job_msub.sh

which simply contains all your job script options and arguments. The script job_msub.sh would at least contain the following lines:

#!/bin/bash
./job.sh -n 10


Solution B:

Add after the header of your BASH script job.sh the following lines:

## check if $SCRIPT_FLAGS is "set"
if [ -n "${SCRIPT_FLAGS}" ] ; then
   ## but if positional parameters are already present
   ## we are going to ignore $SCRIPT_FLAGS
   if [ -z "${*}"  ] ; then
      set -- ${SCRIPT_FLAGS}
   fi
fi

These lines modify your BASH script to read options and arguments from the environment variable $SCRIPT_FLAGS. Now submit your script job.sh as follows:

msub -v SCRIPT_FLAGS='-n 10' job.sh 


For advanced users: a generalised version of solution B is required if the job script arguments contain whitespace.
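
A minimal sketch of such a generalised version, assuming that whitespace-containing arguments are quoted inside SCRIPT_FLAGS, replaces set -- by eval set -- so that the quoting is honoured:

## check if $SCRIPT_FLAGS is "set"
if [ -n "${SCRIPT_FLAGS}" ] ; then
   ## but if positional parameters are already present
   ## we are going to ignore $SCRIPT_FLAGS
   if [ -z "${*}" ] ; then
      ## eval re-evaluates the quoting inside ${SCRIPT_FLAGS}
      eval set -- "${SCRIPT_FLAGS}"
   fi
fi

The submit command then quotes the whitespace, e.g.:

msub -v SCRIPT_FLAGS='-n 10 -f "file with spaces"' job.sh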


1.2.2 Multithreaded Programs

Multithreaded programs operate faster than serial programs on CPUs with multiple cores. Moreover, multiple threads of one process share resources such as memory.

For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

To submit a batch job called test that runs a fourfold threaded program omp_program, which requires 6000 MByte of total physical memory and a total wall clock time of 3 hours:

  • generate the script job_omp.sh containing the following lines:
#!/bin/bash
#MSUB -l nodes=1:ppn=4
#MSUB -l walltime=3:00:00
#MSUB -l mem=6000mb
#MSUB -N test

module load <placeholder>
export OMP_NUM_THREADS=${MOAB_PROCCOUNT}
./omp_program

and, if necessary, replace <placeholder> with the modulefile required to enable the OpenMP environment. Then execute the script job_omp.sh without any msub command line options:

$ msub job_omp.sh


Note that msub command line options overrule script options, e.g.,

$ msub -l mem=2000mb job_omp.sh

overwrites the script setting of 6000 MByte with 2000 MByte.

1.2.3 MPI parallel Programs

MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., MPI tasks, run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.

Multiple MPI tasks cannot be launched by the MPI parallel program itself; they must be started via mpirun, e.g. for 4 MPI tasks of my_par_program:

mpirun -n 4 my_par_program


However, this command cannot be included directly in your msub command when submitting the batch job to the compute cluster (see above).

Generate a wrapper script job_ompi.sh for OpenMPI containing the following lines:

#!/bin/bash
module load mpi/openmpi/<placeholder_for_version>
mpirun -bind-to-core -bycore -report-bindings my_par_program

Attention: Do NOT add the mpirun option -n <number_of_processes> or any other option defining the number of processes or nodes, since MOAB instructs mpirun about the number of processes and the node hostnames. Always use the mpirun option -bind-to-core together with one of -bycore, -bysocket or -bynode!

Generate a wrapper script job_impi.sh for Intel MPI containing the following lines:

#!/bin/bash
module load mpi/impi/<placeholder_for_version>
mpiexec.hydra -bootstrap slurm my_par_program

Attention: Do NOT add the option -n <number_of_processes> or any other option defining the number of processes or nodes, since MOAB instructs mpiexec.hydra about the number of processes and the node hostnames. Moreover, replace <placeholder_for_version> with the desired version of Intel MPI to enable the MPI environment.

Considering 4 MPI tasks on a single node, each requiring 1000 MByte, and running for 1 hour, execute (shown here for the OpenMPI wrapper script):

msub -l nodes=1:ppn=4,pmem=1000mb,walltime=01:00:00 job_ompi.sh


To launch and run 32 MPI tasks on 4 nodes, each task requiring 1000 MByte, running for 5 hours, execute:

msub -l nodes=4:ppn=8,pmem=1000mb,walltime=05:00:00 job_ompi.sh


1.2.4 Multithreaded + MPI parallel Programs

Multithreaded + MPI parallel programs operate faster than serial programs on systems with multiple CPUs and multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned over different nodes.

Multiple MPI tasks must be launched via mpirun (OpenMPI) or mpiexec.hydra (Intel MPI). For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

For OpenMPI, a job script job_ompi_omp.sh that runs an MPI program ompi_omp_program with 4 MPI tasks, each fourfold threaded, requiring 7000 MByte of physical memory per process/thread (with 4 threads per MPI task this amounts to 4*7000 MByte = 28000 MByte per MPI task) and a total wall clock time of 3 hours looks like:

#!/bin/bash
#MSUB -l nodes=2:ppn=8
#MSUB -l walltime=03:00:00
#MSUB -l pmem=7000mb
#MSUB -v MPI_MODULE=mpi/ompi
#MSUB -v OMP_NUM_THREADS=4
#MSUB -v MPIRUN_OPTIONS="-bind-to-core -bynode -cpus-per-proc 4 -report-bindings"
#MSUB -v EXECUTABLE=./ompi_omp_program
#MSUB -N test_ompi_omp
 
module load ${MPI_MODULE}
TASK_COUNT=$((${MOAB_PROCCOUNT}/${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${MOAB_PROCCOUNT} cores with ${TASK_COUNT} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${TASK_COUNT} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe

Execute the script job_ompi_omp.sh without any msub command line options:

$ msub job_ompi_omp.sh


With the mpirun option -bind-to-core, MPI tasks and OpenMP threads are bound to physical cores. The option -bynode, which must be set, attaches (neighbouring) MPI tasks to different nodes, and the value of the option -cpus-per-proc <value> must be set to ${OMP_NUM_THREADS}. The option -report-bindings shows the bindings between MPI tasks and physical cores.

The option -bysocket does not work! The mpirun options -bind-to-core, -bynode and -cpus-per-proc should always be used when running a multithreaded MPI program; otherwise your multithreaded MPI program will run on one node only.

Intel MPI should not be used for this purpose for now! For Intel MPI, a job script job_impi_omp.sh that runs an Intel MPI program impi_omp_program with 8 MPI tasks, each eightfold threaded, requiring 64000 MByte of total physical memory and a total wall clock time of 6 hours looks like:

#!/bin/bash
#MSUB -l nodes=4:ppn=16
#MSUB -l walltime=06:00:00
#MSUB -l mem=64000mb
#MSUB -v MPI_MODULE=mpi/impi
#MSUB -v OMP_NUM_THREADS=8
#MSUB -v MPIRUN_OPTIONS="-print-rank-map -env I_MPI_PIN_DOMAIN socket"
#MSUB -v EXE=./impi_omp_program
#MSUB -N test_impi_omp
 
module load ${MPI_MODULE}
TASK_COUNT=$((${MOAB_PROCCOUNT}/${OMP_NUM_THREADS}))
echo "${EXE} running on ${MOAB_PROCCOUNT} cores with ${TASK_COUNT} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${TASK_COUNT} ${EXE}"
echo $startexe
exec $startexe


Execute the script job_impi_omp.sh without any msub command line options:

$ msub job_impi_omp.sh


The mpirun option -print-rank-map shows the bindings between MPI tasks and nodes (not very useful). With the environment variable I_MPI_PIN_DOMAIN the binding between MPI tasks and physical cores, which is always switched on, can be controlled. Choosing socket as the value means that (neighbouring) MPI tasks run on different sockets. Other values like node and cache are possible: node means that (neighbouring) MPI tasks run on different nodes, and cache means that (neighbouring) MPI tasks run on cores that do not share a common L3 cache.

1.2.5 Interactive Jobs

Interactive jobs must not run on the login nodes; however, resources for interactive jobs can be requested using msub. Consider a serial application with a graphical frontend that requires 5000 MByte of memory, with the interactive run limited to 2 hours. Execute the following:

$ msub -v HOME,TERM,USER,DISPLAY -S /bin/bash -I -l nodes=1:ppn=1,pmem=5000mb -l walltime=0:02:00:00

After execution of this command, DO NOT CLOSE your current terminal session; wait until the queueing system MOAB has granted you the requested resources on the compute system. Once they are granted, you will be logged on to the dedicated resource automatically. You now have an interactive session with 1 core and 5000 MByte of memory on the compute system for 2 hours. Simply execute your application:

$ cd to_path
$ ./application

Note that once the walltime limit has been reached, you will be automatically logged out of the compute system.

2 Status of batch system/jobs

2.1 Start time of job or resources - showstart

The following command can be used by any user to display the estimated start time of a job, based on several analysis types: historical usage, earliest available reservable resources, and priority-based backlog. To show the estimated start time of job <job_ID> enter:

$ showstart -e all <job_ID>


Furthermore, the start time for given resource demands, e.g. 16 processes for 12 hours, can be displayed via:

$ showstart -e all 16@12:00:00


For further options of showstart read the manpage of showstart:

$ man showstart



2.2 List of submitted jobs - showq

The following command displays information about active, eligible, blocked, and/or recently completed jobs:

$ showq
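
To restrict the output to single job categories, showq accepts additional flags (their exact availability may depend on the installed MOAB version):

$ showq -r          # show only active (running) jobs
$ showq -i          # show only eligible (idle) jobs
$ showq -b          # show only blocked jobs
$ showq -u $USER    # show only your own jobs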


For further options of showq read the manpage of showq:

$ man showq



2.3 Detailed job information - checkjob

checkjob <jobID> displays detailed job state information and diagnostic output for the job with ID <jobID>:

$ checkjob <jobID>
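
For more detailed diagnostic output, checkjob accepts a verbose flag (its availability may depend on the installed MOAB version):

$ checkjob -v <jobID>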


For further options of checkjob read the manpage of checkjob:

$ man checkjob



3 Job management

3.1 Canceling own jobs

canceljob <jobID> cancels your own job with the given <jobID>.

$ canceljob <jobID>


Note that you can cancel only your own jobs. The command:

$ mjobctl -c <jobID>

has the same effect as canceljob <jobID>.


4 Environment Variables for Batch Jobs

Once an eligible compute job starts on the compute system, MOAB adds the following variables to the job's environment:

MOAB variables

Environment variable   Description
MOAB_CLASS             Class name
MOAB_GROUP             Group name
MOAB_JOBID             Job ID
MOAB_JOBNAME           Job name
MOAB_NODECOUNT         Number of nodes allocated to the job
MOAB_PARTITION         Partition name the job is running in
MOAB_PROCCOUNT         Number of processors allocated to the job
MOAB_SUBMITDIR         Directory of job submission
MOAB_USER              User name


Further environment variables are added by the resource manager SLURM:

SLURM variables

Environment variable       Description
SLURM_JOB_CPUS_PER_NODE    Number of processes per node dedicated to the job
SLURM_JOB_NODELIST         List of nodes dedicated to the job
SLURM_JOB_NUM_NODES        Number of nodes dedicated to the job
SLURM_MEM_PER_NODE         Memory per node dedicated to the job
SLURM_NPROCS               Total number of processes dedicated to the job


Both MOAB and SLURM environment variables can be used to generalise your job scripts; compare the msub examples above.
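
For example, a job script can use these variables to avoid hard-coded paths and thread counts (a minimal sketch; my_program is a placeholder for your executable):

#!/bin/bash
## change to the directory the job was submitted from
cd ${MOAB_SUBMITDIR}
## use as many threads as processors were allocated to the job
export OMP_NUM_THREADS=${MOAB_PROCCOUNT}
## name the log file after the job ID
./my_program > ${MOAB_JOBID}.log 2>&1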