BwUniCluster2.0/Batch System Migration Guide

While the former bwUniCluster 1 system used a combination of Moab and SLURM as its batch system, bwUniCluster 2.0 uses SLURM only. This means that most job scripts and workflows that relied on Moab-specific pragmas and commands have to be adapted.

=General Overview=

Job parameters can be passed to SLURM in the same ways as with Moab.

* Instead of the #MOAB or #PBS pragmas, the #SBATCH pragma has to be used within job files.

* Instead of the Moab commands, the corresponding SLURM commands have to be used.

A general mapping of Moab to SLURM commands can be found in the following table:

{| class="wikitable"
! Moab command !! SLURM command
|-
| msub || sbatch
|-
| msub -I || salloc
|-
| canceljob || scancel
|-
| showq || squeue
|-
| checkjob $JOBID || scontrol show job $JOBID
|}
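
For illustration, a typical submit-and-monitor workflow translates as follows; the script name '''job.sh''' and the job ID 123456 are placeholders.
<pre>
# Submit a job script (Moab: msub job.sh)
$ sbatch job.sh
Submitted batch job 123456

# List jobs in the queue (Moab: showq)
$ squeue

# Show details of a specific job (Moab: checkjob 123456)
$ scontrol show job 123456

# Cancel a job (Moab: canceljob 123456)
$ scancel 123456
</pre>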


The next table lists the most common MOAB job specification flags and their Slurm counterparts. All MOAB flags and environment variables in your batch scripts have to be replaced by their corresponding Slurm equivalents.

'''Commonly used MOAB job specification flags and their Slurm equivalents'''

{| width=750px class="wikitable"
! Option !! Moab (msub) !! Slurm (sbatch)
|-
| Script directive || #MSUB || #SBATCH
|-
| Job name || -N <name> || --job-name=<name> (-J <name>)
|-
| Account || -A <account> || --account=<account> (-A <account>)
|-
| Queue || -q <queue> || --partition=<partition> (-p <partition>)
|-
| Wall time limit || -l walltime=<hh:mm:ss> || --time=<hh:mm:ss> (-t <hh:mm:ss>)
|-
| Node count || -l nodes=<count> || --nodes=<count> (-N <count>)
|-
| Core count || -l procs=<count> || --ntasks=<count> (-n <count>)
|-
| Process count per node || -l ppn=<count> || --ntasks-per-node=<count>
|-
| Core count per process || || --cpus-per-task=<count>
|-
| Memory limit per node || -l mem=<limit> || --mem=<limit>
|-
| Memory limit per process || -l pmem=<limit> || --mem-per-cpu=<limit>
|-
| Job array || -t <array indices> || --array=<indices> (-a <indices>)
|-
| Node exclusive job || -l naccesspolicy=singlejob || --exclusive
|-
| Initial working directory || -d <directory> (default: $HOME) || --chdir=<directory> (-D <directory>) (default: submission directory)
|-
| Standard output file || -o <file path> || --output=<file> (-o <file>)
|-
| Standard error file || -e <file path> || --error=<file> (-e <file>)
|-
| Combine stdout/stderr to stdout || -j oe || --output=<combined stdout/stderr file>
|-
| Mail notification events || -m <event> || --mail-type=<events> (valid types include: NONE, BEGIN, END, FAIL, ALL)
|-
| Export environment to job || -V || --export=ALL (default)
|-
| Don't export environment to job || (default) || --export=NONE
|-
| Export environment variables to job || -v <var[=value][,var2=value2[, ...]]> || --export=<var[=value][,var2=value2[,...]]>
|}

'''Notes:'''
* The default initial job working directory is $HOME for MOAB. For Slurm, the default working directory is the directory you submit your job from.
* By default, MOAB does not export any environment variables to the job's runtime environment. With Slurm, most of the login environment variables are exported to your job's runtime environment. This includes environment variables from software modules that were loaded at job submission time (and also the $HOSTNAME variable).
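
For illustration, a job file header converted from MOAB to Slurm directives might look like the following sketch; the job name, the resource values, and the program '''my_program''' are placeholders, not part of the original examples.
<pre>
#!/bin/bash
# Former MOAB directives (for comparison):
#   #MSUB -N myjob
#   #MSUB -l walltime=01:00:00
#   #MSUB -l nodes=1,ppn=4
#   #MSUB -l pmem=2000mb
# Slurm equivalents:
#SBATCH --job-name=myjob
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=2000

./my_program
</pre>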



'''Commonly used MOAB script environment variables and their Slurm equivalents'''

{| width=750px class="wikitable"
! Information !! MOAB !! Slurm
|-
| Job name || $MOAB_JOBNAME || $SLURM_JOB_NAME
|-
| Job ID || $MOAB_JOBID || $SLURM_JOB_ID
|-
| Submit directory || $MOAB_SUBMITDIR || $SLURM_SUBMIT_DIR
|-
| Number of nodes allocated || $MOAB_NODECOUNT || $SLURM_JOB_NUM_NODES (and: $SLURM_NNODES)
|-
| Node list || $MOAB_NODELIST || $SLURM_JOB_NODELIST
|-
| Number of processes || $MOAB_PROCCOUNT || $SLURM_NTASKS
|-
| Requested tasks per node || --- || $SLURM_NTASKS_PER_NODE
|-
| Requested CPUs per task || --- || $SLURM_CPUS_PER_TASK
|-
| Job array index || $MOAB_JOBARRAYINDEX || $SLURM_ARRAY_TASK_ID
|-
| Job array range || $MOAB_JOBARRAYRANGE || $SLURM_ARRAY_TASK_COUNT
|-
| Queue name || $MOAB_CLASS || $SLURM_JOB_PARTITION
|-
| QOS name || $MOAB_QOS || $SLURM_JOB_QOS
|-
| Number of processes per node || --- || $SLURM_TASKS_PER_NODE
|-
| Job user || $MOAB_USER || $SLURM_JOB_USER
|-
| Hostname || $MOAB_MACHINE || $SLURMD_NODENAME
|}
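
As a small, hypothetical illustration, such variables can be used inside a job script, for example:
<pre>
#!/bin/bash
#SBATCH --ntasks=4

# Write some job information to the job output file.
echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) on partition ${SLURM_JOB_PARTITION}"
echo "Running ${SLURM_NTASKS} tasks on node(s): ${SLURM_JOB_NODELIST}"
echo "Submitted from: ${SLURM_SUBMIT_DIR}"
</pre>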

<br>

=Serial Programs=

* Use the time option '''-t''' or '''--time''' (instead of '''-l walltime'''). If only one number is entered after '''-t''', the default unit is minutes.
* Use the option '''-n 1''' or '''--ntasks=1''' (instead of '''-l nodes=1,ppn=1''').
* Use the option '''--mem''' (instead of '''-l pmem'''). The default unit is megabytes (MB).
* If you want to use one node exclusively, you must request the whole memory ('''--mem=96327''').
<br>
'''Example for a serial job'''
<pre>
$ sbatch -p single -t 60 -n 1 --mem=96327 ./job.sh
</pre>
The script '''job.sh''' (containing the execution of a serial program) runs for 60 minutes exclusively on one batch node. A sketch of such a script follows.
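A minimal sketch of what '''job.sh''' might contain; the submission options may alternatively be given as directives, and '''my_serial_program''' is a placeholder:
<pre>
#!/bin/bash
# Optional: the same options as on the sbatch command line, given as directives.
#SBATCH --partition=single
#SBATCH --time=60
#SBATCH --ntasks=1
#SBATCH --mem=96327

# Placeholder for your serial executable.
./my_serial_program
</pre>
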
=Multithreaded Programs=


* Use the time option '''-t''' or '''--time''' (instead of '''-l walltime'''). If only one number is entered after '''-t''', the default unit is minutes.
* Use the option '''-N 1''' or '''--nodes=1''' and '''-c ''x''''' or '''--cpus-per-task=''x''''' (instead of '''-l nodes=1,ppn=''x'' '''). '''''x''''' can be a number between 1 and 40 (because of 40 cores within one node); it can also be a number between 41 and 80 (because of active hyperthreading).
* Use the option '''--mem''' (instead of '''-l pmem'''). The default unit is megabytes (MB).
* Use the option '''--export''' to set the required environment variable OMP_NUM_THREADS for the batch job. Adding '''ALL''' passes all interactively set environment variables on to the batch job.
* If you want to use one node exclusively, you must either request the whole memory ('''--mem=96327''') or set the number of threads to more than 39.
<br>
'''Example for a multithreaded job'''
<pre>
$ sbatch -p single -t 1:00:00 -N 1 -c 20 --mem=50gb --export=ALL,OMP_NUM_THREADS=20 ./job_threaded.sh
</pre>
The script '''job_threaded.sh''' (which runs a multithreaded program) runs for 1 hour in shared mode on 20 cores, requesting 50 GB of memory on one batch node. A sketch of such a script is shown below.
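A minimal sketch of what '''job_threaded.sh''' might contain, assuming an OpenMP program; '''my_openmp_program''' is a placeholder:
<pre>
#!/bin/bash
# OMP_NUM_THREADS is already set via --export on the sbatch command line above;
# as a fallback it can be derived from the allocation.
export OMP_NUM_THREADS=${OMP_NUM_THREADS:-${SLURM_CPUS_PER_TASK}}

# Placeholder for your multithreaded executable.
./my_openmp_program
</pre>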
<br>
<br>
<br>

=MPI Parallel Programs within one node=

* Use the time option '''-t''' or '''--time''' (instead of '''-l walltime'''). If only one number is entered after '''-t''', the default unit is minutes.
* Use the option '''-n ''x''''' or '''--ntasks=''x''''' (instead of '''-l nodes=1,ppn=''x'' '''). '''''x''''' can be a number between 1 and 40 (because of 40 cores within one node); you shouldn't utilize hyperthreading.
* Use the option '''--mem''' (instead of '''-l pmem'''). The default unit is megabytes (MB).
* If you want to use one node exclusively, you must either request the whole memory ('''--mem=96327''') or set the number of MPI tasks to more than 39.
* Don't forget to load the appropriate MPI module in your job script.
* If you are using OpenMPI, the options '''--bind-to core --map-by core|socket|node''' of the command mpirun should be used.
<br>
'''Example for an MPI job'''
<pre>
$ sbatch -p single -t 600 -n 10 --mem=40000 ./job_mpi.sh
</pre>
The script '''job_mpi.sh''' (which loads the appropriate MPI module and then runs an MPI program) runs for 10 hours in shared mode on 10 cores, requesting 40000 MB of memory on one batch node. A sketch of such a script is shown below.
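A minimal sketch of what '''job_mpi.sh''' might contain; the module name '''mpi/openmpi''' and '''my_mpi_program''' are placeholders, so use the MPI module actually provided on the cluster:
<pre>
#!/bin/bash
# Load an MPI module (placeholder name).
module load mpi/openmpi

# Start the MPI program; with Open MPI, bind ranks to cores as recommended above.
mpirun --bind-to core --map-by core ./my_mpi_program
</pre>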
<br>
<br>
<br>

=MPI Parallel Programs on many nodes=

* Use the time option '''-t''' or '''--time''' (instead of '''-l walltime'''). If only one number is entered after '''-t''', the default unit is minutes.
* Use the option '''-N ''y''''' or '''--nodes=''y''''' and '''--ntasks-per-node=''x''''' (instead of '''-l nodes=''y'',ppn=''x'' '''). '''''x''''' can be a number between 1 and 40 (28 for Broadwell nodes), because there are 40 (28) cores within one node; you shouldn't utilize hyperthreading.
* You shouldn't use the option '''--mem''' because the nodes are used exclusively.
* You always use the nodes exclusively.
* Don't forget to load the appropriate MPI module in your job script.
* If you are using OpenMPI, the options '''--bind-to core --map-by core|socket|node''' of the command mpirun should be used.
<br>
'''Example for an MPI job'''
<pre>
$ sbatch -p multiple -t 48:00:00 -N 10 --ntasks-per-node=40 ./job_mpi.sh
</pre>
The script '''job_mpi.sh''' (which loads the appropriate MPI module and then runs an MPI program) runs for 2 days on 400 cores on ten batch nodes. A sketch of such a script is shown below.
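For the multi-node case, '''job_mpi.sh''' might look like the following sketch; the directives mirror the sbatch options of the example above, and the module name and '''my_mpi_program''' are placeholders:
<pre>
#!/bin/bash
# Optional: the same options as on the sbatch command line, given as directives.
#SBATCH --partition=multiple
#SBATCH --time=48:00:00
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=40

# Load an MPI module (placeholder name).
module load mpi/openmpi

# Open MPI picks up the Slurm allocation (here 400 ranks on 10 nodes).
mpirun --bind-to core --map-by core ./my_mpi_program
</pre>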
<br>
<br>
<br>

=Multithreaded + MPI Parallel Programs on many nodes=

* Use the time option '''-t''' or '''--time''' (instead of '''-l walltime'''). If only one number is entered after '''-t''', the default unit is minutes.
* Use the option '''-N ''y''''' or '''--nodes=''y''''' and '''--ntasks-per-node=''x''''' and '''-c ''z''''' or '''--cpus-per-task=''z''''' (instead of '''-l nodes=''y'',ppn=''x''·''z'' '''). '''''x''''' usually should be 1 or 2, and '''''x''·''z''''' should usually be 40 (28 on Broadwell nodes); you can utilize hyperthreading if you want.
* You shouldn't use the option '''--mem''' because the nodes are used exclusively.
* You always use the nodes exclusively.
* Don't forget to load the appropriate MPI module in your job script.
* If you are using OpenMPI, the options '''--bind-to core --map-by socket|node:PE=''z''''' of the command mpirun must be used.
<br>
'''Example for a multithreaded MPI job'''
<pre>
$ sbatch -p multiple -t 2-12 -N 10 --ntasks-per-node=2 -c 20 ./job_threaded_mpi.sh
</pre>
The script '''job_threaded_mpi.sh''' (which loads the appropriate MPI module and then runs a multithreaded MPI program) runs for 2.5 days on 400 cores, with 20 MPI tasks and 20 threads per task, on ten batch nodes. Here the options '''--bind-to core --map-by socket:PE=20''' of the command mpirun must be used (PE equals the value given with '''-c'''). A sketch of such a script is shown below.
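A minimal sketch of what '''job_threaded_mpi.sh''' might contain; the module name and '''my_hybrid_program''' are placeholders:
<pre>
#!/bin/bash
# Load an MPI module (placeholder name).
module load mpi/openmpi

# One OpenMP thread per CPU allocated to each MPI task.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Bind each MPI rank to a block of OMP_NUM_THREADS cores within a socket.
mpirun --bind-to core --map-by socket:PE=${SLURM_CPUS_PER_TASK} ./my_hybrid_program
</pre>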
