<font color=green size=+2>This article contains information on features of the [[Batch_Jobs|batch job system]] only applicable on bwUniCluster.</font>
 
<br>
 
= Job Submission =
 
== msub Command ==
 
 
 
 
=== msub -l ''resource_list'' ===
 
There are no deviations from or additional features beyond the general [[Batch_Jobs|batch job]] settings.
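For reference, a typical resource request on bwUniCluster looks like the following sketch (''my_job.sh'' is a hypothetical job script; see the queue classes below for the matching -q option):

<pre>
$ msub -q multinode -l nodes=2:ppn=28,pmem=4500mb,walltime=24:00:00 my_job.sh
</pre>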
 
 
=== msub -q ''queues'' ===
 
Compute resources such as walltime, nodes and memory are restricted and must fit into '''queues'''. Since requested compute resources are NOT always automatically mapped to the correct queue class, you must add the correct queue class to your msub command. Details:
 
 
{| width=750px class="wikitable"
 
! colspan="6" style="background-color:#999999;padding:3px"| msub -q ''queue''
 
|- style="width:10%;height=20px; text-align:left;"
 
! style="width:10%;padding:3px"| ''queue''
 
! style="width:5%;padding:3px"| ''node''
 
! style="width:15%;padding:3px"| ''default resources''
 
! style="padding:3px"| ''minimum resources''
 
! style="padding:3px"| ''maximum resources''
 
! style="padding:3px"| ''node access policy''
 
|- style="vertical-align:top; height=20px; text-align:left"
 
| style="padding:3px ; style="color:#00a000"| develop*
 
| style="padding:3px"| thin
 
| style="width:15%;padding:3px"| ''walltime''=00:10:00,''procs''=1, ''pmem''=4000mb
 
| style="padding:3px"| ''nodes''=1
 
| style="padding:3px"| ''nodes''=1:''ppn''=16, ''walltime''=00:30:00
 
| style="padding:3px"| shared
 
|- style="vertical-align:top; height=20px; text-align:left"
 
| style="width:10%;padding:3px ; style="color:#00a000" | singlenode*
 
| style="padding:3px"| thin
 
| style="padding:3px"| ''walltime''=00:30:01,''procs''=1, ''pmem''=4000mb
 
| style="padding:3px"| ''nodes''=1, ''walltime''=00:30:01
 
| style="padding:3px"| ''nodes''=1:''ppn''=16, ''walltime''=3:00:00:00
 
| style="padding:3px"| shared
 
|- style="vertical-align:top; height=20px; text-align:left"
 
| style="width:10%;vertical-align:top;height=20px; text-align:left;padding:3px ; style="color:#00a000" | verylong*
 
| style="padding:3px"| thin
 
| style="padding:3px"| ''walltime''=3:00:00:01,''procs''=1, ''pmem''=4000mb
 
| style="padding:3px"| ''nodes''=1'', walltime''=3:00:00:01
 
| style="padding:3px"| ''nodes=1:''ppn''=16, ''walltime''=6:00:00:00
 
| style="padding:3px"| shared
 
|- style="vertical-align:top; height=20px; text-align:left"
 
| style="width:10%;padding:3px ; style="color:#36c" | extralong**
 
| style="padding:3px"| thin
 
| style="padding:3px"| ''walltime''=6:00:00:01,''procs''=1, ''pmem''=4000mb
 
| style="padding:3px"| ''nodes''=1'', walltime''=6:00:00:01
 
| style="padding:3px"| ''nodes=1:''ppn''=16, ''walltime''=14:00:00:00
 
| style="padding:3px"| singlejob (Only one job can run at a time)
 
|- style="vertical-align:top; height=20px; text-align:left"
 
| style="width:10%;padding:3px" | fat
 
| style="padding:3px"| fat
 
| style="padding:3px"| ''walltime''=00:10:00,''procs''=1, ''pmem''=32000mb
 
| style="padding:3px"| ''nodes''=1
 
| style="padding:3px"| ''nodes''=1:''ppn''=32, ''walltime''=3:00:00:00
 
| style="padding:3px"| shared
 
|- style="vertical-align:top; height=20px; text-align:left"
 
| style="width:10%;padding:3px ; style="color:#00a000" | multinode*
 
| style="padding:3px"| broadwell
 
| style="padding:3px"| ''walltime''=00:10:00,''procs''=1, ''pmem''=4500mb
 
| style="padding:3px"| ''nodes''=2
 
| style="padding:3px"| ''nodes''=128:''ppn''=28, ''walltime''=48:00:00
 
| style="padding:3px"| singlejob
 
|- style="vertical-align:top; height=20px; text-align:left"
 
| style="width:10%;padding:3px" | dev_multinode
 
| style="padding:3px"| broadwell
 
| style="padding:3px"| ''walltime''=00:10:00,''procs''=1, ''pmem''=4500mb
 
| style="padding:3px"| ''nodes''=2
 
| style="padding:3px"| ''nodes''=16:''ppn''=28, ''walltime''=00:30:00
 
| style="padding:3px"| singlejob
 
|- style="vertical-align:top; height=20px; text-align:left"
 
| style="width:10%;padding:3px ; style="color:#b32425" | special**
 
| style="padding:3px"| broadwell
 
| style="padding:3px"| ''walltime''=00:30:00,''procs''=1, ''pmem''=4500mb
 
| style="padding:3px"| ''nodes''=1, ''walltime''=00:30:00
 
| style="padding:3px"| ''nodes''=1:''ppn''=28, ''walltime''=48:00:00
 
| style="padding:3px"| shared
 
|- style="vertical-align:top; height=20px; text-align:left"
 
| style="width:10%;padding:3px ; style="color:#b32425" | dev_special**
 
| style="padding:3px"| broadwell
 
| style="padding:3px"| ''walltime''=00:10:00,''procs''=1, ''pmem''=4500mb
 
| style="padding:3px"| ''nodes''=1, ''walltime''=00:10:00
 
| style="padding:3px"| ''nodes''=1:''ppn''=28, ''walltime''=00:30:00
 
| style="padding:3px"| shared
 
|-
 
|}
 
<span style="color:#00a000"> *Automatic routing.</span><br>
 
<span style="color:#b32425">**Only accessible to predefined user groups. </span><br>
 
<span style="color:#36c"> **It can be accessed only limited amount of time. Also only accessible to predefined user groups. </span>
 
 
Note that ''node access policy''=singlejob means that, irrespective of the requested number of cores, node access is exclusive.
 
The default resources of a queue class define walltime, processes and memory if these are not explicitly given with the msub command. The resource list acronyms ''walltime'', ''procs'', ''nodes'' and ''ppn'' are described [[Batch_Jobs#msub_-l_resource_list|here]].
 
 
==== Queue class examples ====
 
 
* To run your batch job for longer than 3 days, please use <span style="background:#edeae2;margin:10px;padding:1px;border:1px dotted #808080">$ msub -q verylong</span>.
 
 
* To run your batch job on one of the [[BwUniCluster_File_System#Components_of_bwUniCluster|fat nodes]], please use <span style="background:#edeae2;margin:10px;padding:1px;border:1px dotted #808080">$ msub -q fat</span>.
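A complete submission combining a queue class with an explicit resource list could, as a sketch with a hypothetical job script ''my_job.sh'', look like this:

<pre>
$ msub -q verylong -l nodes=1:ppn=16,pmem=4000mb,walltime=4:00:00:00 my_job.sh
</pre>

If no resource list is given, the defaults of the chosen queue class from the table above apply (e.g. for ''develop'': walltime=00:10:00, procs=1, pmem=4000mb).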
 
<br>
 
<br>
 
 
= Environment Variables for Batch Jobs =
 
== Additional Moab Environments ==
 
The bwUniCluster extends the [[Batch_Jobs#Moab Environment Variables|common set of MOAB environment variables]] with the following variable:
 
{| width=700px class="wikitable"
 
! colspan="3" style="background-color:#999999;padding:3px"| bwUniCluster specific MOAB variables
 
|-
 
! Environment variable
 
! Description
 
|-
 
| MOAB_SUBMITDIR
 
| Directory of job submission
 
|}
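A typical use of this variable is to change into the directory from which the job was submitted at the beginning of a job script. A minimal sketch:

<source lang="bash">
## change to the directory the job was submitted from
cd ${MOAB_SUBMITDIR}
</source>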
 
 
== Additional Slurm Environments ==
 
Since the workload manager MOAB on [[bwUniCluster]] uses the resource manager SLURM, the following SLURM environment variables are added to your environment once your job has started:
 
{| width=750px class="wikitable"
 
! colspan="3" style="background-color:#999999;padding:3px"| SLURM variables
 
|- style="width:25%;height=20px; text-align:left;padding:3px"
 
! style="width:20%;height=20px; text-align:left;padding:3px"| Environment variables
 
! style="height=20px; text-align:left;padding:3px"| Description
 
|-
 
| style="width:20%;height=20px; text-align:left;padding:3px" | SLURM_JOB_CPUS_PER_NODE
 
| style="height=20px; text-align:left;padding:3px"| Number of processes per node dedicated to the job
 
|-
 
| style="width:20%;height=20px; text-align:left;padding:3px" | SLURM_JOB_NODELIST
 
| style="height=20px; text-align:left;padding:3px"| List of nodes dedicated to the job
 
|-
 
| style="width:20%;height=20px; text-align:left;padding:3px" | SLURM_JOB_NUM_NODES
 
| style="height=20px; text-align:left;padding:3px"| Number of nodes dedicated to the job
 
|-
 
| style="width:20%;height=20px; text-align:left;padding:3px" | SLURM_MEM_PER_NODE
 
| style="height=20px; text-align:left;padding:3px"| Memory per node dedicated to the job
 
|-
 
| style="width:20%;height=20px; text-align:left;padding:3px" | SLURM_NPROCS
 
| style="height=20px; text-align:left;padding:3px"| Total number of processes dedicated to the job
 
|}
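These variables can be read directly inside a job script, for example to record the allocated resources in the job output. A short sketch:

<source lang="bash">
#!/bin/bash
## print the resources SLURM has dedicated to this job
echo "Running on ${SLURM_JOB_NUM_NODES} node(s): ${SLURM_JOB_NODELIST}"
echo "Processes per node: ${SLURM_JOB_CPUS_PER_NODE}, total processes: ${SLURM_NPROCS}"
echo "Memory per node: ${SLURM_MEM_PER_NODE}"
</source>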
 
See also:
 
* [[Batch_Jobs#Batch_Job_.28Slurm.29_Variables_:_bwUniCluster|List of almost all important Slurm environment variables]]
 
<!--
 
== Interactive Job Monitoring per Node ==
 
By default nodes are not used exclusively unless they are requested with ''-l naccesspolicy=singlejob'' as described [[Batch_Jobs#msub_-l_resource_list|here]]. <br>

If a job runs exclusively on one node you may do an ssh login to that node. The ssh access is limited by the set walltime. To get the nodes of your job you need to read the environment variable SLURM_JOB_NODELIST during the runtime of the job. It contains all nodes in a shortened way, e.g. ''uc1n[344,386]'' or ''uc1n[344-345]''. To expand this string to ''uc1n344 uc1n345'' you can use the command expandnodes like:
 
 
expandnodes $SLURM_JOB_NODELIST > nodelist
 
 
<br>
 
<br>
 
-->
 
 
= Intel MPI parallel Programs =
 
== Intel MPI without Multithreading ==
 
MPI parallel programs run faster than serial programs on multi-CPU and multi-core systems. N-fold spawned processes of the MPI program, i.e. '''MPI tasks''', run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.
 
 
Generate a wrapper script for '''Intel MPI''', ''job_impi.sh'' containing the following lines:
 
<source lang="bash">
 
#!/bin/bash
 
module load mpi/impi/<placeholder_for_version>
 
mpiexec.hydra -bootstrap slurm my_par_program
 
</source>
 
<font color=red>'''Attention:'''</font><br>
 
Do '''NOT''' add mpirun options ''-n <number_of_processes>'' or any other option defining processes or nodes, since MOAB instructs mpirun about the number of processes and the node hostnames.

Moreover, replace <placeholder_for_version> with the desired version of '''Intel MPI''' to enable the MPI environment.
 
<br>
 
To launch and run 64 Intel MPI tasks on 4 nodes (16 tasks per node), each task requiring 1000 MByte of memory, with a walltime of 5 hours, execute:
 
<pre>
 
$ msub -q multinode -l nodes=4:ppn=16,pmem=1000mb,walltime=05:00:00 job_impi.sh
 
</pre>
 
<br>
 
 
== Intel MPI with Multithreading ==
 
Multithreaded + MPI parallel programs run faster than serial programs on multi-CPU systems with multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned over different nodes.
 
 
Multiple Intel MPI tasks must be launched by the MPI parallel program '''mpiexec.hydra'''. For multithreaded programs based on '''Open''' '''M'''ulti-'''P'''rocessing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).
 
 
'''For Intel MPI''', a job script ''job_impi_omp.sh'' that runs the Intel MPI program ''impi_omp_program'' with 8 tasks, each spawning 10 threads, requiring 32000 MByte of total physical memory per task and a total wall clock time of 6 hours looks like this:
 
 
<!--b)-->
 
<source lang="bash">
 
#!/bin/bash
 
#MSUB -l nodes=4:ppn=20
 
#MSUB -l walltime=06:00:00
 
#MSUB -l pmem=3200mb
 
#MSUB -v MPI_MODULE=mpi/impi
 
#MSUB -v OMP_NUM_THREADS=10
 
#MSUB -v MPIRUN_OPTIONS="-binding domain=omp -print-rank-map -ppn 2 -envall"
 
#MSUB -v EXE=./impi_omp_program
 
#MSUB -N test_impi_omp
 
 
#If using more than one MPI task per node please set
 
export KMP_AFFINITY=scatter
 
#export KMP_AFFINITY=verbose,scatter prints messages concerning the supported affinity
 
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE
 
 
module load ${MPI_MODULE}
 
TASK_COUNT=$((${MOAB_PROCCOUNT}/${OMP_NUM_THREADS}))
 
echo "${EXE} running on ${MOAB_PROCCOUNT} cores with ${TASK_COUNT} MPI-tasks and ${OMP_NUM_THREADS} threads"
 
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${TASK_COUNT} ${EXE}"
 
echo $startexe
 
exec $startexe
 
</source>
 
When using the Intel compiler, the environment variable KMP_AFFINITY switches on the binding of threads to specific cores. If you run only one MPI task per node, set KMP_AFFINITY=compact,1,0.
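In the job script above this corresponds to replacing the export line, i.e.:

<source lang="bash">
## binding for one MPI task per node
export KMP_AFFINITY=compact,1,0
</source>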
 
 
<br>
 
Submit the script '''job_impi_omp.sh''' by adding the queue class ''multinode'' to your msub command:
 
<pre>
 
$ msub -q multinode job_impi_omp.sh
 
</pre>
 
<br>
 
The mpirun option ''-print-rank-map'' shows the bindings between MPI tasks and nodes (not very beneficial). The option ''-binding'' binds MPI tasks (processes) to a particular processor; ''domain=omp'' means that the domain size is determined by the number of threads. In the above examples (2 MPI tasks per node) you could also choose ''-binding "cell=unit;map=bunch"''; this binding maps one MPI process to each socket.
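As a sketch, this alternative binding would replace the MPIRUN_OPTIONS line of the job script above; the semicolon-separated specification is passed as a single argument, so no additional inner quotes are required here:

<source lang="bash">
#MSUB -v MPIRUN_OPTIONS="-binding cell=unit;map=bunch -print-rank-map -ppn 2 -envall"
</source>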
 
<br>
 
<br>
 
= Interactive Jobs =
 
Interactive jobs on bwUniCluster [[BwUniCluster_User_Access#Allowed_activities_on_login_nodes|must '''NOT''' run on the login nodes]]; however, resources for interactive jobs can be requested using msub. For a serial application with a graphical front end that requires 5000 MByte of memory, limiting the interactive run to 2 hours, execute the following:
 
<pre>
 
$ msub -I -V -l nodes=1:ppn=1,pmem=5000mb -l walltime=0:02:00:00
 
</pre>
 
The option -V ensures that all environment variables of your current session are exported to the compute node of the interactive session.
 
After executing this command, '''DO NOT CLOSE''' your current terminal session but wait until the queueing system MOAB has granted you the requested resources on the compute system. Once granted, you will be automatically logged on to the dedicated resource. You then have an interactive session with 1 core and 5000 MByte of memory on the compute system for 2 hours. Now simply execute your application:
 
<pre>
 
$ cd to_path
 
$ ./application
 
</pre>
 
Note that once the walltime limit has been reached, you will be automatically logged out of the compute system.
 
<br>

== Single core jobs ==

At the moment only single-core interactive jobs are supported. If you want a whole node for yourself, you must use the following msub option:

<pre>-l naccesspolicy=singlejob</pre>
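For example, an interactive session on an exclusively allocated thin node (16 cores) could be requested as follows; this is a sketch combining the options shown above, so adjust memory and walltime to your needs:

<pre>
$ msub -I -V -l nodes=1:ppn=16 -l naccesspolicy=singlejob -l walltime=0:02:00:00
</pre>

<br>

<br>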
 
= Chain Jobs =
 
The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem with a job chain. A job chain is a sequence of jobs where each job automatically starts its successor. The following submitter script sets up such a chain; a sketch of a matching chain link job script is given after the listing.
 
<source lang="bash">
 
#!/bin/bash
######################################################
##  Simple MOAB submitter script to set up a chain  ##
##  of jobs for bwUniCluster                        ##
######################################################
## ver. : 2015-09-17, KIT, SCC

## Define maximum number of jobs via positional parameter 1, default is 5
max_nojob=${1:-5}

## Define your jobscript (e.g. "~/chain_link_job.sh")
chain_link_job=${PWD}/chain_link_job.sh

## Define type of dependency via positional parameter 2, default is 'afterok'
dep_type="${2:-afterok}"
## -> List of all dependencies:
##    http://docs.adaptivecomputing.com/suite/8-0/enterprise/help.htm#topics/\
##    moabWorkloadManager/topics/jobAdministration/jobdependencies.html

myloop_counter=1
## Submit loop
while [ ${myloop_counter} -le ${max_nojob} ] ; do
   ##
   ## Differ msub_opt depending on chain link number
   if [ ${myloop_counter} -eq 1 ] ; then
      msub_opt=""
   else
      ## Attention: do NOT use '-W depend' together with msub
      msub_opt="-l depend=${dep_type}:${jobID}"
   fi
   ##
   ## Print current iteration number and msub command
   echo "Chain job iteration = ${myloop_counter}"
   echo "   msub -v myloop_counter=${myloop_counter} ${msub_opt} ${chain_link_job}"
   ## Store job ID for the next iteration (msub output with empty lines removed)
   jobID=$(msub -v myloop_counter=${myloop_counter} ${msub_opt} ${chain_link_job} 2>&1 | sed '/^$/d')
   ##
   ## Check if an ERROR occurred
   if [[ "${jobID}" =~ "ERROR" ]] ; then
      echo "   -> submission failed!" ; exit 1
   else
      echo "   -> job number = ${jobID}"
   fi
   ##
   ## Increase counter
   let myloop_counter+=1
done
 
</source>
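The submitter script expects the job script ''chain_link_job.sh'' in the current working directory. A minimal sketch of such a chain link job is shown below; the resource requests, the application name and the restart file handling are hypothetical and must be adapted to your use case:

<source lang="bash">
#!/bin/bash
#MSUB -l nodes=1:ppn=16
#MSUB -l walltime=00:30:00
#MSUB -N chain_link

## myloop_counter is passed in by the submitter script via 'msub -v'
echo "This is chain link number ${myloop_counter}"

## Change to the submit directory and start (or continue) the actual computation,
## e.g. from a restart file written by the previous chain link (hypothetical names)
cd ${MOAB_SUBMITDIR}
./my_application --restart restart_${myloop_counter}.dat
</source>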
 
<br>
 
<br>
 
----
 
[[Category:bwUniCluster|Batch Jobs - bwUniCluster features]]
 
