Batch Jobs - bwUniCluster Features
This article contains information on features of the batch job system only applicable on bwUniCluster.
Job Submission
msub Command
The bwUniCluster supports the following additional msub option(s):
bwUniCluster additional msub Options
| Command line | Script | Purpose |
|---|---|---|
| -I | | Declares the job is to be run interactively. |
msub -l resource_list
No deviations from or additional features beyond the general batch job settings.
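As on the general batch job page, the resource list can be passed on the msub command line or as #MSUB directives inside the job script; a minimal sketch (the requested values are only illustrative):
#MSUB -q singlenode
#MSUB -l nodes=1:ppn=16
#MSUB -l walltime=01:00:00
#MSUB -l pmem=4000mb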
msub -q queues
Compute resources such as walltime, nodes and memory are restricted and must fit into queues. Since requested compute resources are NOT always automatically mapped to the correct queue class, you must add the correct queue class to your msub command. Details:
msub -q queue
| queue | node | default resources | minimum resources | maximum resources | node access policy |
|---|---|---|---|---|---|
| develop | thin | walltime=00:10:00, procs=1, pmem=4000mb | nodes=1 | nodes=1:ppn=16, walltime=00:30:00 | shared |
| singlenode | thin | walltime=00:30:01, procs=1, pmem=4000mb | nodes=1, walltime=00:30:01 | nodes=1:ppn=16, walltime=3:00:00:00 | shared |
| multinode | thin | walltime=00:10:00, procs=1, pmem=4000mb | nodes=2 | nodes=16:ppn=16, walltime=2:00:00:00 | singlejob |
| verylong | thin | walltime=3:00:00:01, procs=1, pmem=4000mb | nodes=1, walltime=3:00:00:01 | nodes=1:ppn=16, walltime=6:00:00:00 | shared |
| fat | fat | walltime=00:10:00, procs=1, pmem=32000mb | nodes=1 | nodes=1:ppn=32, walltime=3:00:00:00 | shared |
Note that node access policy=singlejob means that, irrespective of the requested number of cores, node access is exclusive. The default resources of a queue class define walltime, processes and memory if these are not explicitly given with the msub command. The resource list acronyms walltime, procs, nodes and ppn are described here.
Queue class examples
- To run your batch job longer than 3 days, please use $ msub -q verylong.
- To run your batch job on one of the fat nodes, please use $ msub -q fat.
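A complete submission combining queue class and resource list could, for example, look like this (my_job.sh stands for your own job script, and the requested values are only illustrative):
$ msub -q fat -l nodes=1:ppn=32,pmem=32000mb,walltime=12:00:00 my_job.sh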
Environment Variables for Batch Jobs
The bwUniCluster expands the common set of MOAB environment variables by the following variable(s):
bwUniCluster specific MOAB variables
| Environment variable | Description |
|---|---|
| MOAB_SUBMITDIR | Directory of job submission |
Since the workload manager MOAB on bwUniCluster uses the resource manager SLURM, the following SLURM environment variables are added to your environment once your job has started:
SLURM variables
| Environment variable | Description |
|---|---|
| SLURM_JOB_CPUS_PER_NODE | Number of processes per node dedicated to the job |
| SLURM_JOB_NODELIST | List of nodes dedicated to the job |
| SLURM_JOB_NUM_NODES | Number of nodes dedicated to the job |
| SLURM_MEM_PER_NODE | Memory per node dedicated to the job |
| SLURM_NPROCS | Total number of processes dedicated to the job |
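For example, a job script can print these variables to its output file to record which resources were actually allocated (a minimal sketch using only the variables listed above):
#!/bin/bash
echo "Nodes dedicated to the job:   ${SLURM_JOB_NODELIST}"
echo "Number of nodes:              ${SLURM_JOB_NUM_NODES}"
echo "Processes per node:           ${SLURM_JOB_CPUS_PER_NODE}"
echo "Total number of processes:    ${SLURM_NPROCS}"
echo "Memory per node:              ${SLURM_MEM_PER_NODE}"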
Node Monitoring
By default, nodes are not used exclusively unless they are requested with -l naccesspolicy=singlejob as described here.
If a job runs exclusively on one node, you may log in to that node via ssh. To get the nodes of your job, read the environment variable SLURM_JOB_NODELIST, e.g.
echo $SLURM_JOB_NODELIST > nodelist
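For a job running on a single node, the variable contains just that node's hostname, which can then be used for the login (the hostname uc1n123 below is only a placeholder, not an actual node name):
$ echo $SLURM_JOB_NODELIST
uc1n123
$ ssh uc1n123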
MPI parallel Programs
MPI parallel programs run faster than serial programs on multi-CPU and multi-core systems. N-fold spawned processes of the MPI program, i.e. MPI tasks, run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
Multiple MPI tasks can not be launched by the MPI parallel program itself but only via mpirun, e.g. 4 MPI tasks of my_par_program:
$ mpirun -n 4 my_par_program
However, this command can not be included directly in your msub command for submitting it as a batch job to the compute cluster (see the section on handling job script options and arguments in the general batch jobs article).
Generate a wrapper script job_ompi.sh for OpenMPI containing the following lines:
#!/bin/bash
module load mpi/openmpi/<placeholder_for_version>
# Use when loading OpenMPI in version 1.8.x
mpirun --bind-to core --map-by core -report-bindings my_par_program
# Use when loading OpenMPI in an old version 1.6.x
mpirun -bind-to-core -bycore -report-bindings my_par_program
Attention: Do NOT add the mpirun option -n <number_of_processes> or any other option defining processes or nodes, since MOAB instructs mpirun about the number of processes and the node hostnames. ALWAYS use the MPI options --bind-to core and --map-by core|socket|node (OpenMPI version 1.8.x). Type mpirun --help for an explanation of the different arguments of the mpirun option --map-by.
Generate a wrapper script job_impi.sh for Intel MPI containing the following lines:
#!/bin/bash
module load mpi/impi/<placeholder_for_version>
mpiexec.hydra -bootstrap slurm my_par_program
Attention: Do NOT add the mpirun/mpiexec.hydra option -n <number_of_processes> or any other option defining processes or nodes, since MOAB instructs the launcher about the number of processes and the node hostnames.
Moreover, replace <placeholder_for_version> with the desired version of OpenMPI or Intel MPI to enable the corresponding MPI environment.
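If you are unsure which versions are installed on the cluster, the module system can list them, e.g.:
$ module avail mpi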
To run 4 OpenMPI tasks on a single node, each requiring 1000 MByte of memory, for 1 hour, execute:
$ msub -q singlenode -l nodes=1:ppn=4,pmem=1000mb,walltime=01:00:00 job_ompi.sh
To launch and run 32 Intel MPI tasks on 4 nodes, each requiring 1000 MByte of memory, for 5 hours, execute:
$ msub -q multinode -l nodes=4:ppn=16,pmem=1000mb,walltime=05:00:00 job_impi.sh
Multithreaded + MPI parallel Programs
Multithreaded + MPI parallel programs operate faster than serial programs on multi-CPU systems with multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned over different nodes.
Multiple MPI tasks must be launched via mpirun (OpenMPI) or mpiexec.hydra (Intel MPI), respectively. For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
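To use, for example, five threads per MPI task, set the variable accordingly before the program is launched (in the job scripts below this is done via the #MSUB -v directive):
export OMP_NUM_THREADS=5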
For OpenMPI, a job script job_ompi_omp.sh that submits a batch job running the MPI program ompi_omp_program with 4 MPI tasks and 5 threads per task, requiring 6000 MByte of physical memory per process/thread (with 5 threads per MPI task this yields 5 * 6000 MByte = 30000 MByte per MPI task) and a total wall clock time of 3 hours, looks like:
#!/bin/bash
#MSUB -l nodes=2:ppn=10
#MSUB -l walltime=03:00:00
#MSUB -l pmem=6000mb
#MSUB -v MPI_MODULE=mpi/ompi
#MSUB -v OMP_NUM_THREADS=5
#MSUB -v MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=5 -report-bindings"
#MSUB -v EXECUTABLE=./ompi_omp_program
#MSUB -N test_ompi_omp
module load ${MPI_MODULE}
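# Compute the number of MPI tasks from the total core count (MOAB_PROCCOUNT) and the number of threads per task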
TASK_COUNT=$((${MOAB_PROCCOUNT}/${OMP_NUM_THREADS}))
echo "${EXECUTABLE} running on ${MOAB_PROCCOUNT} cores with ${TASK_COUNT} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${TASK_COUNT} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe
Execute the script job_ompi_omp.sh adding the queue class multinode to your msub command:
$ msub -q multinode job_ompi_omp.sh
With the mpirun option --bind-to core MPI tasks and OpenMP threads are bound to physical cores.
With the option --map-by socket:PE=<value>, neighboring MPI tasks are attached to different sockets and each MPI task is bound to the number of cpus specified in <value>. <value> must be set to ${OMP_NUM_THREADS}.
Old OpenMPI version 1.6.x: With the mpirun option -bind-to-core, MPI tasks and OpenMP threads are bound to physical cores. With the option -bysocket, neighboring MPI tasks are attached to different sockets, and the option -cpus-per-proc <value> binds each MPI task to the number of cpus specified in <value>. <value> must be set to ${OMP_NUM_THREADS}.
The option -report-bindings shows the bindings between MPI tasks and physical cores.
The mpirun options --bind-to core and --map-by socket|...|node:PE=<value> should always be used when running a multithreaded MPI program. (OpenMPI version 1.6.x: the mpirun options -bind-to-core, -bysocket|-bynode and -cpus-per-proc <value> should always be used when running a multithreaded MPI program.)
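For the example above (4 MPI tasks with 5 threads each), the mpirun call assembled by the job script therefore expands to:
mpirun -n 4 --bind-to core --map-by socket:PE=5 -report-bindings ./ompi_omp_program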
For Intel MPI, a job script job_impi_omp.sh that submits a batch job running the Intel MPI program impi_omp_program with 8 MPI tasks and 10 threads per task, requiring 32000 MByte of total physical memory per task (i.e. pmem=3200mb per core times 10 cores per task) and a total wall clock time of 6 hours, looks like:
#!/bin/bash
#MSUB -l nodes=4:ppn=20
#MSUB -l walltime=06:00:00
#MSUB -l pmem=3200mb
#MSUB -v MPI_MODULE=mpi/impi
#MSUB -v OMP_NUM_THREADS=10
#MSUB -v MPIRUN_OPTIONS="-binding domain=omp -print-rank-map -ppn 2 -envall"
#MSUB -v EXE=./impi_omp_program
#MSUB -N test_impi_omp
# If using more than one MPI task per node, please set
export KMP_AFFINITY=scatter
# export KMP_AFFINITY=verbose,scatter prints messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE
module load ${MPI_MODULE}
TASK_COUNT=$((${MOAB_PROCCOUNT}/${OMP_NUM_THREADS}))
echo "${EXE} running on ${MOAB_PROCCOUNT} cores with ${TASK_COUNT} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${TASK_COUNT} ${EXE}"
echo $startexe
exec $startexe
When using the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If you run only one MPI task per node, set KMP_AFFINITY=compact,1,0.
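To summarize the two cases used in this section:
# more than one MPI task per node
export KMP_AFFINITY=scatter
# only one MPI task per node
export KMP_AFFINITY=compact,1,0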
Execute the script job_impi_omp.sh adding the queue class multinode to your msub command:
$ msub -q multinode job_impi_omp.sh
The mpirun option -print-rank-map shows the bindings between MPI tasks and nodes (not very beneficial). The option -binding binds MPI tasks (processes) to a particular processor; domain=omp means that the domain size is determined by the number of threads. In the above examples (2 MPI tasks per node) you could also choose -binding "cell=unit;map=bunch"; this binding maps one MPI process to each socket.
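Expanded to a full launcher call as used in the script above, this alternative binding would read (only the binding argument differs from the example):
mpiexec.hydra -bootstrap slurm -binding "cell=unit;map=bunch" -print-rank-map -ppn 2 -envall -n ${TASK_COUNT} ${EXE}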
Interactive Jobs
Interactive jobs on bwUniCluster must NOT run on the login nodes; however, resources for interactive jobs can be requested using msub. For a serial application with a graphical frontend that requires 5000 MByte of memory, with the interactive run limited to 2 hours, execute the following:
$ msub -I -V -l nodes=1:ppn=1,pmem=5000mb -l walltime=0:02:00:00
The option -V ensures that all environment variables are exported to the compute node of the interactive session. After executing this command, DO NOT CLOSE your current terminal session but wait until the queueing system MOAB has granted you the requested resources on the compute system. Once the resources are granted, you will be logged on to the dedicated resource automatically. You then have an interactive session with 1 core and 5000 MByte of memory on the compute system for 2 hours. Now simply execute your application:
$ cd to_path
$ ./application
Note that, once the walltime limit has been reached, you will be automatically logged out of the compute system.
Chain Jobs
The CPU time requirements of many applications exceed the limits of the job classes. In such situations it is recommended to solve the problem by a job chain: a sequence of jobs in which each job automatically starts its successor. The following example script moab_chain_job.sh resubmits itself until a loop counter reaches its maximum:
#!/bin/bash
##################################################
## simple job run template for bwUniCluster ##
## to run chain jobs with MOAB ##
##################################################
##
## usage :
## msub -v myloop_counter=0 ./moab_chain_job.sh
#MSUB -l nodes=1:ppn=1
#MSUB -l walltime=00:00:05
#MSUB -l pmem=50mb
#MSUB -q develop
#MSUB -N chain
## Defaults
loop_max=10
cmd='sleep 2'
## Check if counter environment variable is set
if [ -z "${myloop_counter}" ] ; then
   echo " ERROR: myloop_counter is undefined, stop chain job"
   exit 1
fi
## only continue if below loop_max
if [ ${myloop_counter} -lt ${loop_max} ] ; then
   ## increase counter
   let myloop_counter+=1
   ## print current Job number
   echo " Chain job iteration = ${myloop_counter}"
   ## Define your command
   cmd='sleep 2'
   echo " -> executing ${cmd}"
   ${cmd}
   if [ $? -eq 0 ] ; then
      ## continue only if last command was successful
      msub -v myloop_counter=${myloop_counter} ./moab_chain_job.sh
   else
      ## Terminate chain
      echo " ERROR: ${cmd} of chain job no. ${myloop_counter} terminated unexpectedly"
      exit 1
   fi
fi
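To start the chain, submit the script with the counter initialized to 0, as given in the usage note at the top of the script:
$ msub -v myloop_counter=0 ./moab_chain_job.sh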