JUSTUS2/Jobscripts: Running Your Calculations
M Carmesin (talk | contribs) |
{{Justus2}} |
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. |
= JUSTUS 2 Slurm Howto = |
This page presents only a very basic introduction.
Please see the '''[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]''' for many more examples and commands for common tasks. |
= Slurm Command Overview = |
{| width=750px class="wikitable" |
! Slurm commands !! Brief explanation |
|- |
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue |
|- |
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job |
|- |
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs |
|- |
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information |
|- |
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job |
|- |
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job |
|- |
| seff || Shows the "job efficiency" of a job after it has finished |
|} |
= Submitting Jobs on the bwForCluster JUSTUS 2 = |
Batch jobs are submitted with the command: |
<syntaxhighlight lang=bash>$ sbatch <job-script> </syntaxhighlight> |
A job script contains options for Slurm in lines beginning with #SBATCH as well as your commands which you want to execute on the compute nodes. For example: |
<syntaxhighlight lang='bash'> |
#!/bin/bash |
#SBATCH --nodes=1 |
#SBATCH --ntasks-per-node=1 |
#SBATCH --time=00:14:00 |
#SBATCH --mem=1gb |
echo 'Here starts the calculation' |
</syntaxhighlight> |
You can override options from the script on the command-line: |
<syntaxhighlight lang=bash>$ sbatch --time=03:00:00 <job-script> </syntaxhighlight> |
Note: <font color="red"> Compute jobs must not read from or write to the global file systems as a calculation swap file. </font>
Use local storage for this purpose: /tmp in the ramdisk for small files, or /scratch (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVME]]).
To avoid using the central file system for calculations, you often have to configure the program you are using to write its temporary files elsewhere.
If the program looks for files in the current directory, copy the input files to a temporary directory first and copy the results back at the end of the job. Otherwise your results are deleted by the automated cleanup that runs after the job.
The diskless nodes have a RAM disk that can grow to at most half the size of the total RAM. Note that the files created plus the memory requirement of your job must fit into the total memory.
There are more diskless nodes than nodes with disks, so if your job can run on a diskless node, you should choose this option.
Example job script with requesting 700GB disk space and copying files: |
<syntaxhighlight lang='bash'> |
#!/bin/bash |
#SBATCH --nodes=1 |
#SBATCH --ntasks-per-node=1 |
#SBATCH --time=00:14:00 |
#SBATCH --mem=1gb |
#SBATCH --gres=scratch:700 |
# copy input file to the local scratch directory
cp $HOME/inputfiles/myinput.inp $SCRATCH
# switch directory
cd $SCRATCH
echo 'Here starts the calculation'
myprogram --input=$SCRATCH/myinput.inp
# calculation ends
# copy results back to the home directory
cp outfile.out results2.txt $HOME/resultdir/job12345
# clean up
rm myinput.inp outfile.out results2.txt
</syntaxhighlight> |
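The copy-in, compute, copy-out pattern can also be wrapped in a small helper function. The following is only a minimal sketch and not a tool provided on the cluster: <code>run_staged</code> and <code>my_program</code> are hypothetical names, and <code>my_program</code> is merely a stand-in that upper-cases its input so that the sketch runs anywhere. In a real job the working directory would be <code>$SCRATCH</code> or a directory under <code>/tmp</code>.

```bash
#!/bin/bash
# Stand-in for the real application (hypothetical): writes an upper-cased
# copy of its input file to standard output.
my_program() {
    tr 'a-z' 'A-Z' < "$1"
}

# run_staged INPUT_FILE RESULT_DIR WORK_DIR   (hypothetical helper)
# Copies the input to the local working directory, runs the calculation
# there, and copies the output back before the job ends.
run_staged() {
    local input_file=$1 result_dir=$2 work_dir=$3
    local input_name
    input_name=$(basename "$input_file")

    cp "$input_file" "$work_dir/" || return 1
    # Run inside the working directory so that all temporary files stay local.
    ( cd "$work_dir" && my_program "$input_name" > outfile.out ) || return 1
    # Save the result to the permanent location.
    mkdir -p "$result_dir" && cp "$work_dir/outfile.out" "$result_dir/"
}
```

In a job script this could be called as <code>run_staged "$HOME/inputfiles/myinput.inp" "$HOME/resultdir/$SLURM_JOB_ID" "$SCRATCH"</code>, which gives every job its own result directory.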
{| style=" background:#deffee; width:100%;" |
| style="padding:12px; background:#cef2e0; text-align:left" | |
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. <br> To find them, e.g. for LAMMPS: <code> module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la</code>.
|} |
<br/> |
== Resource Requests == |
Important resource request options for the Slurm command sbatch are: |
{| width=750px class="wikitable" |
! Option !! Slurm (sbatch) |
|- |
| #SBATCH|| Script directive |
|- |
| --time=<hh:mm:ss> (-t <hh:mm:ss>)|| Wall time limit |
|- |
| --job-name=<name> (-J <name>)|| Job name |
|- |
| --nodes=<count> (-N <count>)|| Node count |
|- |
| --ntasks=<count> (-n <count>)|| Core count |
|- |
| --ntasks-per-node=<count>|| Process count per node |
|- |
| --mem=<limit>|| Memory limit per node |
|- |
| --mem-per-cpu=<limit>|| Memory limit per allocated core
|- |
| --gres=gpu:<count>|| GPU count (gres = "generic resource") |
|- |
| --gres=scratch:<count> || Disk space of <count> GB per requested task |
|- |
| --exclusive|| Node exclusive job |
|} |
'''Nodes and Cores''' |
Slurm provides a number of options to request nodes and cores. |
Typically, using <code>--nodes=<count></code> and <code>--ntasks-per-node=<count></code> should work for all your jobs. For single-core jobs, the option <code>--ntasks=1</code> is sufficient. Specifying only <code>--ntasks</code> may lead to Slurm distributing the tasks over more than one node, even if you requested only a small number of cores.
'''Memory''' |
Memory can be requested with either the option <code>--mem=<limit></code> (memory per node) or <code>--mem-per-cpu=<limit></code> (memory per allocated core). When looking up the maximum available memory for a certain node type, subtract the amount reserved for the operating system (about 5 GB on the smallest nodes, more on larger ones; see [[#Memory Limits]]). Specify the memory limit as a value-unit pair, for example 500mb or 8gb.
In most cases it is preferable to use the <code>--mem=<limit></code> option. |
'''GPUs''' and '''Scratch''' |
These are requested as "generic resources" with <code>--gres=gpu:<count></code> and <code>--gres=scratch:<count></code>.
== Default Values == |
Default values for jobs are: |
* Runtime: --time=02:00:00 (2 hours) |
* Nodes: --nodes=1 (one node) |
* Tasks: --ntasks-per-node=1 (one task per node)
* Cores: --cpus-per-task=1 (one core per task) |
* Memory: --mem-per-cpu=2gb (2 GB per core) |
== "Exclusive User" Node Access Policy == |
Nodes are exclusively allocated to one single user. However, multiple jobs (up to 48) from the same user can share a node. |
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each. |
The same applies to memory requests (see below). |
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently. |
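The packing arithmetic can be checked quickly before submitting. The following is a minimal sketch, assuming 48 cores per node as stated above and identical jobs of a single user; <code>node_packing</code> is a hypothetical helper name, and memory or other resources are not considered.

```bash
#!/bin/bash
cores_per_node=48   # cores of one JUSTUS 2 node, as stated above

# node_packing JOB_CORES   (hypothetical helper)
# Prints how many identical jobs of JOB_CORES cores fit on one node
# and how many cores of that node stay idle.
node_packing() {
    local job_cores=$1
    local jobs_per_node=$(( cores_per_node / job_cores ))
    local idle=$(( cores_per_node - jobs_per_node * job_cores ))
    echo "$jobs_per_node jobs per node, $idle cores idle"
}
```

For example, <code>node_packing 24</code> prints <code>2 jobs per node, 0 cores idle</code>, while <code>node_packing 32</code> prints <code>1 jobs per node, 16 cores idle</code>.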
== Memory Limits == |
The wait time of a job also depends largely on the amount of requested resources and the available number of nodes providing this amount of resources. This must be taken into account in particular when requesting a certain amount of memory. |
For example a node with 192 GB RAM can only run jobs with up to 187 GB memory requested. The remaining amount is reserved for the operating system, system services and local file systems. |
This means that if a job requests 192 GB RAM per node (e.g. --mem=192gb, or --ntasks-per-node=48 together with --mem-per-cpu=4gb), the job cannot run on one of the 456 "small" nodes but only on one of the "medium", "large" or "fat" nodes. Unnecessarily limiting your jobs to a subset of nodes increases your own wait time and that of others who actually need that amount of memory.
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement: |
{| width=500px class="wikitable" |
! Node type !! Physical RAM on node !! Available RAM on node !! Number of suitable nodes
|-
| small || 192 GB || 187 GB || 692
|-
| medium || 384 GB || 376 GB || 220
|-
| large || 768 GB || 754 GB || 28
|-
| fat || 1536 GB || 1510 GB || 8
|} |
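To see whether a memory request still fits on the smallest node type, the per-node total can be computed before submitting. This is a minimal sketch using the 187 GB value from the table above; <code>per_node_memory_gb</code> and <code>fits_small_node</code> are hypothetical helper names, and one core per task (the default) is assumed.

```bash
#!/bin/bash
# Available RAM in GB on a "small" node, taken from the table above.
small_node_available_gb=187

# per_node_memory_gb TASKS_PER_NODE MEM_PER_CPU_GB   (hypothetical helper)
# Prints the total per-node request, assuming one core per task.
per_node_memory_gb() {
    echo $(( $1 * $2 ))
}

# fits_small_node REQUEST_GB   (hypothetical helper)
# Succeeds if the per-node request fits on a "small" node.
fits_small_node() {
    (( $1 <= small_node_available_gb ))
}
```

For example, <code>per_node_memory_gb 48 4</code> prints 192, and <code>fits_small_node 192</code> fails, whereas 48 tasks with 3 GB each (144 GB) would still fit on a small node.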
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs. |
= Testing Your Jobs = |
JUSTUS 2 has three compute nodes reserved for jobs with a walltime under 15 minutes. To test whether your jobs start properly, specify a short walltime, e.g. --time=00:14:00, and your job should start very quickly.
= Monitoring Your Jobs = |
== squeue == |
After you have submitted a job, you can see it waiting in the queue with the <code>squeue</code> command (see the man page with <code>man squeue</code> for more information on how to use the command):
<syntaxhighlight lang='shell'> |
> squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
6260301  standard r_60_b_2  ul_yxz1 PD  0:00     1 (AssocGrpMemRunMinutes)
</syntaxhighlight> |
Output shows: |
* JOBID: the job ID, a unique number assigned to your job
* PARTITION: the partition the job is assigned to (the cluster can be divided into different types of nodes)
* NAME: the name you gave your job with the --job-name= option
* USER: your username
* ST: the state the job is in. R = running, PD = pending, CD = completed. See the man page for a full list of states.
* TIME: how long the job has been running |
* NODES: how many nodes were requested |
* NODELIST(REASON): either show the node(s) the job is running on, or a reason why it hasn't started |
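If you want to pick out the pending jobs and their reasons in a script, the columns described above can be filtered with <code>awk</code>. This is a minimal sketch that assumes the default column layout shown above and job names without spaces; <code>pending_reasons</code> is a hypothetical helper name.

```bash
#!/bin/bash
# pending_reasons   (hypothetical helper)
# Reads squeue output in the default column layout from standard input and
# prints the job ID and the reason for every pending (PD) job.
pending_reasons() {
    awk '$5 == "PD" {
        reason = $8
        # A reason may contain spaces, so join all remaining fields.
        for (i = 9; i <= NF; i++) reason = reason " " $i
        print $1, reason
    }'
}
```

It could be used as <code>squeue -u "$USER" | pending_reasons</code>. For the example line above it prints <code>6260301 (AssocGrpMemRunMinutes)</code>.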
==scontrol== |
You can show more information on a specific job with the <code>scontrol</code> command, e.g. for the job with ID 6260301 listed above:
<code>
scontrol show job 6260301
</code>
displays detailed information for the job with JobID 6260301.
<code>
scontrol show jobs
</code>
displays detailed information for all your jobs.
<code>
scontrol write batch_script 6260301 -
</code>
displays the job script of a running job. The "-" is a special filename which means "write to the terminal".
== Monitoring a Started Job == |
After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603: |
<code>> ssh n0603 |
</code> |
= Partitions = |
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. amount of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users '''should not''' specify any partition "-p, --partition=<partition_name>" on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2. |
= Job Priorities = |
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors ]: |
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled. |
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed. |
'''Notes:''' |
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age. |
Fairshare does '''not''' introduce a fixed allotment, in that a user's ability to run new jobs is cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources in the long term, which would be unfair to groups who have not used their fair share for quite some time.
Slurm features '''backfilling''', meaning that the scheduler will start lower priority jobs if doing so does not delay the expected start time of '''any''' higher priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This '''[https://youtu.be/OKhWwem1XZg?t=161 video]''' gives an illustrative description of how backfilling works.
In summary, an approximate model of Slurm's behavior for scheduling jobs is this: |
* Step 1: Can the job in position one (highest priority) start now? |
* Step 2: If it can, remove it from the queue, start it and continue with step 1. |
* Step 3: If it can not, look at next job. |
* Step 4: Can it start now, without delaying the start time of any job before it in the queue? |
* Step 5: If it can, remove it from the queue, start it, recalculate what nodes are free, look at next job and continue with step 4. |
* Step 6: If it can not, look at next job, and continue with step 4. |
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1. |
= Usage Limits/Throttling Policies = |
While the fairshare factor ensures fair long term balance of resource utilization between users and groups, there are additional usage limits that constrain the total cumulative resources at a given time. This is to prevent individual users from short term monopolizing large fractions of the whole cluster system. |
* The '''maximum walltime''' for a job is '''14 days''' (336 hours) |
--time=336:00:00 or --time=14-0 |
* The '''maximum amount of cores''' used at any given time from jobs running is '''1920''' per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit for allocated memory does also apply. If this limit is reached new jobs will be queued (with REASON: AssocGrpCpuLimit) but only allowed to run after resources have been relinquished. |
* The maximum amount of '''remaining allocated core-minutes''' per user is '''3300000''' (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time will decrease and eventually allow more jobs to start in a staggered way. This limit also '''correlates the maximum walltime and amount of cores that can be allocated''' for this amount of time. Thus, shorter walltimes for the jobs allow more resources to be allocated at a given time (but capped by the maximum amount of cores limit above). Watch this '''[https://youtu.be/OKhWwem1XZg?t=306 video]''' for an illustrative description. An equivalent limit applies for remaining time of memory allocation in which case jobs may be held back from starting with REASON AssocGrpMemRunMinutes. |
* The '''maximum amount of GPUs''' allocated by running jobs is '''8''' per user. If this limit is reached, new jobs will be queued (with REASON: AssocGrpGRES) but only allowed to run after GPU resources have been relinquished.
'''Note:''' |
Usage limits are subject to change. |
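The core-minutes limit described above can be worked through with a little shell arithmetic. This is a minimal sketch using the limit of 3300000 core-minutes quoted above (which is subject to change); <code>remaining_core_minutes</code> and <code>max_cores_for_hours</code> are hypothetical helper names.

```bash
#!/bin/bash
limit_core_minutes=3300000   # per-user limit quoted above; subject to change

# remaining_core_minutes CORES HOURS [CORES HOURS ...]   (hypothetical helper)
# Sums cores * remaining hours * 60 over all running jobs.
remaining_core_minutes() {
    local total=0
    while (( $# >= 2 )); do
        total=$(( total + $1 * $2 * 60 ))
        shift 2
    done
    echo "$total"
}

# max_cores_for_hours HOURS   (hypothetical helper)
# Largest core count whose jobs may all run for HOURS without exceeding
# the core-minutes limit (rounded down).
max_cores_for_hours() {
    echo $(( limit_core_minutes / ($1 * 60) ))
}
```

For the example above, <code>remaining_core_minutes 4 1 2 6</code> prints 960. <code>max_cores_for_hours 336</code> prints 163, i.e. jobs that all request the maximum walltime of 14 days can together occupy at most 163 cores, far below the cap of 1920 cores.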
= Considerations on Efficiency / Special Use Cases = |
When we speak of poor job efficiency, we usually mean that hardware resources are wasted. |
That means, a similar overall result could have been achieved using less hardware resources, leaving those for other jobs and reducing the wait time for you and everyone. |
Some simple causes for poor overall job efficiency are: |
* poor choice of resources compared to the size of the nodes leaves part of the node blocked, but doing nothing: |
** the number of cores of a node (48) is not a multiple of the --ntasks-per-node value (see section [[#"Exclusive User" Node Access Policy]])
** too much (un-needed) memory or disk space requested |
* more cores requested than are actually used by the job |
* more cores used for a single mpi/openmp parallel computation than useful |
* many small jobs with a short runtime (seconds in extreme cases) |
* one-core jobs with very different run-times (because of single-user policy) |
== Many One or Few-Core Jobs == |
Jobs that use only a few CPU cores can lead to very inefficient node usage: |
# Many very short jobs: you submit 1000 jobs, each running for about 30 s. A job needs up to 30 s to start and finish, which is a huge overhead if the calculation itself only takes 30 seconds. In addition, starting and finishing so many jobs in a short time puts strain on the Slurm scheduler, may cause severe problems for everyone, and clutters the Slurm job database.
# Many few-core jobs with very different run times: the jobs start on many nodes, but at some point all the quicker jobs have finished and only a few remain. Because of the single-user policy on JUSTUS 2, jobs of other users cannot fill the gaps, and the rest of each node sits idle.
To address the problem, you can reduce the amount of jobs and/or the amount of nodes used. |
To limit the amount of jobs, start many calculations within one job (problem 1. and 2.): |
* use a bash loop in your job script |
* use the program GNU parallel to start the processes for you |
To only limit the amount of nodes used: |
* use array jobs |
=== Bash Loop === |
One advantage of this method is that you can run more processes than cores if your calculations are really short and do not use too much RAM. In this way all cores stay busy even while many calculations are still starting up.
It is of course even better if you can combine such short calculations so that, for 1000 calculations, the kernel does not need to start 1000 processes which each need to initialize everything.
This example uses pgrep to count how many jobs are running: |
<syntaxhighlight lang="bash"> |
#!/bin/bash |
#SBATCH --nodes=1 |
#SBATCH --ntasks-per-node=48 |
#SBATCH --time=00:10:00 |
#SBATCH --mem=100gb |
for i in {1..200}
do
    echo starting up $i
    bash my_calculation $i &
    # wait while all 48 cores are busy
    while [ $(pgrep -c -f my_calculation) -ge 48 ] ; do echo sleeping; sleep 5; done
done
wait
</syntaxhighlight> |
The same, but by tracking the PIDs (process IDs) of the started processes. This is more robust, but is more difficult to read: |
<syntaxhighlight lang="bash"> |
#!/bin/bash |
running_jobs=()
for i in {1..200}; do
    echo "Starting job $i"
    sleep "$i" &              # placeholder for the real calculation
    running_jobs+=($!)        # Track PID
    # Wait while 8 or more tracked processes are still running
    while [ "${#running_jobs[@]}" -ge 8 ]; do
        sleep 2               # adjust duration depending on your runtime
        echo running_jobs: ${running_jobs[@]}
        echo pid-out: $(ps -o pid= -p "${running_jobs[@]}" 2>/dev/null | xargs)
        echo -----
        running_jobs=($(ps -o pid= -p "${running_jobs[@]}" 2>/dev/null)) # Remove finished jobs
    done
done
wait # Ensure all jobs complete
</syntaxhighlight> |
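A shorter variant of the same idea lets the shell do the bookkeeping with <code>wait -n</code>, which returns as soon as any one background process has finished. This is only a sketch and assumes bash 4.3 or newer; <code>run_throttled</code>, <code>my_task</code> and <code>RESULT_FILE</code> are hypothetical names, and <code>my_task</code> is a placeholder for your own short calculation.

```bash
#!/bin/bash
# File collecting one line per finished task (hypothetical name).
RESULT_FILE=${RESULT_FILE:-results.txt}

# Placeholder for one short calculation (hypothetical); replace with your own command.
my_task() {
    sleep 0.1
    echo "task $1 done" >> "$RESULT_FILE"
}

# run_throttled MAX_PARALLEL ITEM...   (hypothetical helper)
# Starts my_task for every ITEM, keeping at most MAX_PARALLEL running at once.
run_throttled() {
    local max_parallel=$1; shift
    local running=0 item
    for item in "$@"; do
        if (( running >= max_parallel )); then
            wait -n                     # block until any one background task ends
            running=$(( running - 1 ))
        fi
        my_task "$item" &
        running=$(( running + 1 ))
    done
    wait                                # wait for the remaining tasks
}
```

Inside a job script with 48 cores this could be called as <code>run_throttled 48 {1..200}</code>.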
You may not be able to just use an index number "i" to start many calculations. In this case, if there are not too many files, the for loop can iterate over config files instead. The following shows only the general idea of such a for loop:
<syntaxhighlight lang="bash"> |
for config in config-1980-03-01_1/*; do
    mycalculation -config "$config"
done
</syntaxhighlight> |
This loops over all files in the directory config-1980-03-01_1/ and gives them as an input file to "mycalculation" via a hypothetical "-config" option. Adding a date to the config-dirs (and outputs) would enable you to track different runs in your lab journal more easily. |
=== GNU Parallel ===
GNU Parallel is available on the HPC cluster and comes with its own set of examples. You can access them like this:
<syntaxhighlight lang="bash"> |
$ module load system/parallel |
$ cp $PARALLEL_EXA_DIR/parallel.slurm . |
</syntaxhighlight> |
=== Array Jobs === |
<syntaxhighlight lang="bash">$ sbatch -a 1-500%48 batch_script</syntaxhighlight> |
This will submit 500 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 500, but will limit the number of simultaneously running tasks from this job array to 48 (the number of cores on a JUSTUS 2 node).
The same can be done inside the job script:
<syntaxhighlight lang="bash"> |
#!/bin/bash |
# One task (one core) per individual array task
#SBATCH --ntasks-per-node=1
#SBATCH --array=1-500%48
#SBATCH --mem=3G
#SBATCH --time=1:10:00
#SBATCH --job-name=array_job
#SBATCH --output=array_job-%A_%a.out
#SBATCH --error=array_job-%A_%a.err
# Run the calculation for this task ID and report only the elapsed wall time in seconds
export TIMEFORMAT=%R
time bash mycalculation $SLURM_ARRAY_TASK_ID
</syntaxhighlight> |
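If the calculations are not identified by a plain index, the task ID can be used to pick one line from a list of input files. This is a minimal sketch; <code>input_for_task</code> and the list file <code>configs.txt</code> (one config file name per line) are hypothetical.

```bash
#!/bin/bash
# input_for_task LIST_FILE TASK_ID   (hypothetical helper)
# Prints line number TASK_ID of LIST_FILE, i.e. the input belonging to
# this array task.
input_for_task() {
    sed -n "${2}p" "$1"
}
```

Inside the array job script this could be used as <code>config=$(input_for_task configs.txt "$SLURM_ARRAY_TASK_ID")</code>, followed by <code>mycalculation -config "$config"</code>.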
Also see: |
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]] |
* SchedMD documentation on job arrays: https://slurm.schedmd.com/job_array.html
Latest revision as of 14:09, 12 March 2025
The bwForCluster JUSTUS 2 is a state-wide high-performance compute resource dedicated to Computational Chemistry and Quantum Sciences in Baden-Württemberg, Germany.
The JUSTUS 2 cluster uses Slurm (https://slurm.schedmd.com/) for scheduling compute jobs.
JUSTUS 2 Slurm Howto
This page only presents some very basic introduction.
Please see the JUSTUS 2 Slurm HOWTO for many more examples and commands for common tasks.
Slurm Command Overview
Slurm commands | Brief explanation |
sbatch | Submits a job and queues it in an input queue |
salloc | Request resources for an interactive job |
squeue | Displays information about active, eligible, blocked, and/or recently completed jobs |
scontrol | Displays detailed job state information |
sstat | Displays status information about a running job |
scancel | Cancels a job |
seff | Shows the "job efficiency" of a job after it has finished |
Submitting Jobs on the bwForCluster JUSTUS 2
Batch jobs are submitted with the command:
$ sbatch <job-script>
A job script contains options for Slurm in lines beginning with #SBATCH as well as your commands which you want to execute on the compute nodes. For example:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:14:00
#SBATCH --mem=1gb
echo 'Here starts the calculation'
You can override options from the script on the command-line:
$ sbatch --time=03:00:00 <job-script>
Note: Compute jobs must not write/read from the global file systems as a calculation swap file.
Use local storage /tmp in the ramdisk for small files or /scratch (see How to request NVME) for this purpose
To not use the central file system for calculation, you must often configure the the program you are using to write temporary files elsewhere.
If the program uses the current directory to look for files, you must copy files to a temporary directory - and copy/save the results of the calculation in the end, else your results get deleted by automated cleanup happening after the job.
There diskless nodes have a disk in RAM memory, that can have a maximum of half the size of the total RAM. Note that files created plus memory requirement of your job need to fit into the total memory.
There are more diskless nodes than nodes with disks, so if your job can run on a diskless node, you should choose this option.
Example job script with requesting 700GB disk space and copying files:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:14:00
#SBATCH --mem=1gb
#SBATCH --gres=scratch:700
# copy input file
cp $HOME/inputfiles/myinput.inp $SCRATCH
# switch directory
echo 'Here starts the calculation'
myprogram --input=$SCRATCH/myinput.inp
# calculation ends
# copy result
cp outfile.out results2.txt $HOME/resultdir/job12345
# clean up
rm myinput outfile.out results2.txt
Software examples: Most installed software comes with example job scripts. |
Resource Requests
Important resource request options for the Slurm command sbatch are:
Option | Slurm (sbatch) |
#SBATCH | Script directive |
--time=<hh:mm:ss> (-t <hh:mm:ss>) | Wall time limit |
--job-name=<name> (-J <name>) | Job name |
--nodes=<count> (-N <count>) | Node count |
--ntasks=<count> (-n <count>) | Core count |
--ntasks-per-node=<count> | Process count per node |
--mem=<limit> | Memory limit per node |
--mem-per-cpu=<limit> | Memory limit per process |
--gres=gpu:<count> | GPU count (gres = "generic resource") |
--gres=scratch:<count> | Disk space of <count> GB per requested task |
--exclusive | Node exclusive job |
Nodes and Cores
Slurm provides a number of options to request nodes and cores.
Typically, using --nodes=<count>
and --ntasks-per-node=<count>
should work for all your jobs. For single core jobs, it would be sufficient to use the option --ntasks=1
. Specifying only --ntasks
may lead to slurm trying to distribute tasks over more than one node even if you requested a small amount of cores.
Memory can be requested with either the option --mem=<limit>
(memory per node) or --mem-per-cpu=<limit>
(memory per process). When looking up the maximum available memory for a certain node type subtract about 5 GB for the operating system. Specify the memory limit as a value-unit-pair, for example 500mb or 8gb.
In most cases it is preferable to use the --mem=<limit>
GPUs and Scratch
These are requested as "generic resources" with --gres:gpu:<count>
and --gres:scratch:<count>
Default Values
Default values for jobs are:
- Runtime: --time=02:00:00 (2 hours)
- Nodes: --nodes=1 (one node)
- Tasks: --tasks-per-node=1 (one task per node)
- Cores: --cpus-per-task=1 (one core per task)
- Memory: --mem-per-cpu=2gb (2 GB per core)
"Exclusive User" Node Access Policy
Nodes are exclusively allocated to one single user. However, multiple jobs (up to 48) from the same user can share a node.
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each.
The same applies to memory requests (see below).
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently.
Memory Limits
The wait time of a job also depends largely on the amount of requested resources and the available number of nodes providing this amount of resources. This must be taken into account in particular when requesting a certain amount of memory.
For example a node with 192 GB RAM can only run jobs with up to 187 GB memory requested. The remaining amount is reserved for the operating system, system services and local file systems. This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb or --tasks-per-node=48 and --mem-per-cpu=4gb), the job cannot run on one of the 456 "small" nodes but only on one of the "medium", "large" or "fat" nodes. Unnecessarily limiting your jobs to a sub-set of nodes will increase your wait time and the wait time of others, who actually need the amount of memory.
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:
Physical RAM on node | Available RAM on node | Number of suitable nodes |
192 GB | 187 GB | 692 |
384 GB | 376 GB | 220 |
768 GB | 754 GB | 28 |
1536 GB | 1510 GB | 8 |
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.
Testing Your Jobs
Justus2 has three compute nodes reserved for jobs with a walltime under 15 minutes. You can test if your jobs start properly just by specifying a short walltime, e.g. --time=00:14:00 and your job should start very quickly.
Monitoring Your Jobs
After you submitted the job, you can see it waiting using the squeue
(also read the man page with man squeue
for more information on how to use the command)
> squeue
6260301 standard r_60_b_2 ul_yxz1 PD 0:00 1 (AssocGrpMemRunMinutes)
Output shows:
- JOBID: the jobid is an unique number your job gets
- PARTITION: the cluster can be divided in different types of nodes.
- NAME: the name you gave your job with the --job-name= option
- USER: your username
- ST: the state the job is in. R = running, PD = pending, CD = completed. See man page for a full list on states.
- TIME: how long the job has been running
- NODES: how many nodes were requested
- NODELIST(REASON): either show the node(s) the job is running on, or a reason why it hasn't started
You can then show more info on one specific running job using the scontrol
command, e.g for the job with ID 6260301 listed above:
scontrol show job 6260301
displays detailed information for job with JobID 6260301
scontrol show jobs
displays detailed information for all your jobs
scontrol write batch_script 6260301 -
display job script of a running job. The "-" is a special filename which means "write to the terminal".
Monitoring a Started Job
After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:
> ssh n0603
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. amount of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users should not specify any partition "-p, --partition=<partition_name>" on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.
Job Priorities
Job priorities at JUSTUS 2 depend on multiple factors :
- Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.
- Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.
Fairshare does not impose a fixed allotment under which a user's ability to run new jobs would be cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources over the long term, which would be unfair to groups that have not used their fair share for quite some time.
Slurm features backfilling, meaning that the scheduler will start lower priority jobs if doing so does not delay the expected start time of any higher priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This video gives an illustrative description of how backfilling works.
In summary, an approximate model of Slurm's behavior for scheduling jobs is this:
- Step 1: Can the job in position one (highest priority) start now?
- Step 2: If it can, remove it from the queue, start it and continue with step 1.
- Step 3: If it cannot, look at the next job.
- Step 4: Can it start now, without delaying the start time of any job before it in the queue?
- Step 5: If it can, remove it from the queue, start it, recalculate which nodes are free, look at the next job and continue with step 4.
- Step 6: If it cannot, look at the next job and continue with step 4.
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.
Usage Limits/Throttling Policies
While the fairshare factor ensures a fair long-term balance of resource utilization between users and groups, there are additional usage limits that constrain the total cumulative resources in use at any given time. This prevents individual users from monopolizing large fractions of the whole cluster system in the short term.
- The maximum walltime for a job is 14 days (336 hours)
--time=336:00:00 or --time=14-0
- The maximum number of cores used at any given time by running jobs is 1920 per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit applies to allocated memory. If this limit is reached, new jobs will be queued (with REASON: AssocGrpCpuLimit) and only allowed to run after resources have been relinquished.
- The maximum amount of remaining allocated core-minutes per user is 3300000 (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time decreases and eventually allows more jobs to start in a staggered way. This limit also couples the walltime of jobs to the number of cores that can be allocated for that time. Thus, shorter walltimes allow more resources to be allocated at a given time (but capped by the maximum number of cores above). Watch this video for an illustrative description. An equivalent limit applies to the remaining time of memory allocation, in which case jobs may be held back from starting with REASON AssocGrpMemRunMinutes.
- The maximum number of GPUs allocated by running jobs is 8 per user. If this limit is reached, new jobs will be queued (with REASON: AssocGrpGRES) and only allowed to run after GPU resources have been relinquished.
Usage limits are subject to change.
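The core-minute arithmetic above can be reproduced with a few lines of bash. This is a minimal sketch: the function names remaining_core_minutes and max_concurrent_cores are made up for this example, the limits are the values quoted on this page (which may change), and the second function assumes that all jobs request the same walltime and start at the same moment.

```shell
#!/bin/bash
# Limits as quoted on this page (subject to change)
MAX_CORE_MINUTES=3300000
MAX_CORES=1920

# Sum cores * remaining minutes over arguments of the form
# cores:remaining_hours
remaining_core_minutes() {
  local total=0 pair cores hours
  for pair in "$@"; do
    cores=${pair%%:*}
    hours=${pair##*:}
    total=$(( total + cores * hours * 60 ))
  done
  echo "$total"
}

# Largest number of cores that can be running at once if every job
# requests the given walltime (in hours), capped by the core limit
max_concurrent_cores() {
  local hours=$1 cores
  cores=$(( MAX_CORE_MINUTES / (hours * 60) ))
  if (( cores > MAX_CORES )); then cores=$MAX_CORES; fi
  echo "$cores"
}

remaining_core_minutes 4:1 2:6   # example from the text: prints 960
max_concurrent_cores 336         # 14-day jobs: prints 163
max_concurrent_cores 24          # 1-day jobs: prints 1920 (core limit)
```

Under these assumptions, jobs that request the full 14 days can occupy only about 163 cores at once, while jobs that request one day are limited only by the 1920-core cap.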
Considerations on Efficiency / Special Use Cases
When we speak of poor job efficiency, we usually mean that hardware resources are wasted. That is, a similar overall result could have been achieved with fewer hardware resources, leaving the rest for other jobs and reducing the wait time for you and everyone else.
Some simple causes for poor overall job efficiency are:
- poor choice of resources compared to the size of the nodes leaves part of the node blocked, but doing nothing:
- no multiple of --ntasks-per-node equals the number of cores of a node (see section "Exclusive User" Node Access Policy)
- too much (un-needed) memory or disk space requested
- more cores requested than are actually used by the job
- more cores used for a single MPI/OpenMP parallel computation than useful
- many small jobs with a short runtime (seconds in extreme cases)
- one-core jobs with very different run-times (because of single-user policy)
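The effect of a poorly chosen --ntasks-per-node can be estimated with simple arithmetic. The sketch below assumes single-node jobs of identical size from one user, packed onto 48-core nodes and limited by cores only (memory and disk are ignored); idle_cores is a made-up helper name.

```shell
#!/bin/bash
CORES_PER_NODE=48   # cores of a JUSTUS 2 node as quoted on this page

# Cores left idle on a node that is filled with as many jobs of the
# given task count as fit
idle_cores() {
  local tasks=$1
  echo $(( CORES_PER_NODE % tasks ))
}

for tasks in 12 16 20 32; do
  echo "--ntasks-per-node=$tasks leaves $(idle_cores "$tasks") cores idle per node"
done
```

With 12 or 16 tasks per node nothing is wasted, whereas 20 tasks leave 8 cores and 32 tasks leave 16 cores idle, because jobs of other users cannot use them.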
Many One or Few-Core Jobs
Jobs that use only a few CPU cores can lead to very inefficient node usage:
- many very short jobs: you submit 1000 jobs that each run for about 30 s. Starting and finishing a job can itself take up to 30 s, a huge overhead if the calculation only takes 30 seconds. In addition, starting and finishing so many jobs in a short time puts strain on the Slurm scheduler, may cause severe problems for everyone, and clutters the Slurm job database.
- many few-core jobs with very different run times: the jobs start on many nodes, but at some point all the quicker jobs have finished and only a few remain. Because of the single-user policy on JUSTUS 2, jobs of other users cannot fill the gaps and the rest of the node sits idle.
To address the problem, you can reduce the number of jobs and/or the number of nodes used.
To limit the number of jobs, start many calculations within one job (this addresses both problems above):
- use a bash loop in your job script
- use the program GNU parallel to start the processes for you
To limit only the number of nodes used:
- use array jobs
Bash Loop
One advantage of this method is that you can run more processes than cores if your calculations are really short and do not use too much memory. This keeps all cores busy even while many calculations are still starting up.
It is of course even better if you can combine such short calculations so that, for 1000 calculations, the kernel does not need to start 1000 processes that each have to initialize everything.
This example uses pgrep to count how many jobs are running:
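#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --time=00:10:00
#SBATCH --mem=100gb
for i in {1..200}; do
  echo "starting up $i"
  bash my_calculation "$i" &
  # Throttle: pause while 48 or more calculations are running
  while [ "$(pgrep -c -f my_calculation)" -ge 48 ]; do echo sleeping; sleep 5; done
done
wait # do not let the job end before the last calculations have finished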
The same, but tracking the PIDs (process IDs) of the started processes. This is more robust, but more difficult to read:
running_jobs=() # PIDs of the processes started so far
for i in {1..200}; do
  echo "Starting job $i"
  sleep "$i" & # placeholder for the real calculation
  running_jobs+=($!) # Track PID
  while [ "${#running_jobs[@]}" -ge 8 ]; do
    sleep 2 # adjust duration depending on your runtime
    echo running_jobs: ${running_jobs[@]}
    echo pid-out: $(ps -o pid= -p "${running_jobs[@]}" 2>/dev/null | xargs)
    echo -----
    running_jobs=($(ps -o pid= -p "${running_jobs[@]}" 2>/dev/null)) # Remove finished jobs
  done
done
wait # Ensure all jobs complete
You may not be able to simply use an index number "i" to start many calculations. In that case, as long as there are not too many files, the for loop can iterate over config files instead. The following shows only the general idea of such a for loop:
for config in config-1980-03-01_1/*; do
  mycalculation -config "$config"
done
This loops over all files in the directory config-1980-03-01_1/ and gives them as an input file to "mycalculation" via a hypothetical "-config" option. Adding a date to the config-dirs (and outputs) would enable you to track different runs in your lab journal more easily.
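The config-file loop can also be combined with a limit on the number of simultaneously running calculations. The following is a minimal sketch, not a drop-in job script: it uses the bash builtin "wait -n" (available in bash 4.3 and newer) instead of the pgrep or PID-tracking approaches shown above, mycalculation is replaced by a stand-in shell function, and the directories, file names and the limit of 3 are assumptions made only for this demonstration.

```shell
#!/bin/bash
# Stand-in for the real program (hypothetical): writes one output file
# per config file. $1 is the "-config" option, $2 the config path.
mycalculation() {
  local config=$2
  sleep 0.1
  echo "done: $(basename "$config")" > "$OUT_DIR/$(basename "$config").out"
}

# Demonstration data in temporary directories (assumption)
CONFIG_DIR=$(mktemp -d)
OUT_DIR=$(mktemp -d)
for n in 1 2 3 4 5 6; do echo "param=$n" > "$CONFIG_DIR/run$n.conf"; done

MAX_PARALLEL=3   # in a real job, e.g. the number of requested cores

for config in "$CONFIG_DIR"/*; do
  # While MAX_PARALLEL calculations are running, block until any one
  # of them finishes
  while (( $(jobs -rp | wc -l) >= MAX_PARALLEL )); do
    wait -n
  done
  mycalculation -config "$config" &
done
wait   # ensure the last calculations complete before the job ends

echo "outputs written: $(ls "$OUT_DIR" | wc -l)"
```

Unlike the sleep-and-poll loops, "wait -n" returns as soon as a slot becomes free, so no polling interval has to be tuned to the runtime of the calculations.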
GNU Parallel
GNU Parallel is available on the HPC cluster and comes with its own set of examples. You can access them like this:
$ module load system/parallel
$ cp $PARALLEL_EXA_DIR/parallel.slurm .
Array Jobs
$ sbatch -a 1-500%48 batch_script
This will submit 500 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 500, but will limit the number of simultaneously running tasks from this job array to 48 (the number of cores on a JUSTUS 2 node).
The same can be done inside the job script:
#!/bin/bash
# Number of cores per individual array task
#SBATCH --ntasks-per-node=1
#SBATCH --array=1-500%48
#SBATCH --mem=3G
#SBATCH --time=1:10:00
#SBATCH --job-name=array_job
#SBATCH --output=array_job-%A_%a.out
#SBATCH --error=array_job-%A_%a.err
# Print the task id.
echo "My SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
# Report only the elapsed wall-clock time of the calculation.
export TIMEFORMAT=%R ;
time bash mycalculation $SLURM_ARRAY_TASK_ID
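If the calculations are driven by config files rather than by a plain index, each array task can pick its file by position in a sorted list. The sketch below illustrates this idea only: the directory and file names are made up, and SLURM_ARRAY_TASK_ID falls back to 3 so that the snippet can also be tried outside of Slurm.

```shell
#!/bin/bash
# Inside an array job Slurm sets SLURM_ARRAY_TASK_ID; the default of 3
# is only for trying this sketch outside of Slurm (assumption)
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-3}

# Demonstration config files in a temporary directory (assumption)
CONFIG_DIR=$(mktemp -d)
for n in 1 2 3 4 5; do echo "param=$n" > "$CONFIG_DIR/run$n.conf"; done

# Sorted list of config files; array task N uses the N-th file
mapfile -t configs < <(find "$CONFIG_DIR" -type f -name '*.conf' | sort)
index=$(( SLURM_ARRAY_TASK_ID - 1 ))
if (( index < 0 || index >= ${#configs[@]} )); then
  echo "no config file for task $SLURM_ARRAY_TASK_ID" >&2
  exit 1
fi
config=${configs[index]}
echo "task $SLURM_ARRAY_TASK_ID uses $(basename "$config")"
```

The array range given to --array should then match the number of config files, and the selected file can be passed on to the calculation, e.g. via the hypothetical "-config" option used earlier.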
Also see:
- Slurm-Howto entry: BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?
- Schedmd documentations on Job Arrays: https://slurm.schedmd.com/job_array.html