Difference between revisions of "JUSTUS2/Slurm"

From bwHPC Wiki

Revision as of 12:07, 23 June 2021

The bwForCluster JUSTUS 2 is a state-wide high-performance compute resource dedicated to Computational Chemistry and Quantum Sciences in Baden-Württemberg, Germany.

1 Justus 2 Slurm Howto and External Information

The JUSTUS 2 cluster uses Slurm (https://slurm.schedmd.com/) for scheduling compute jobs. For help with many common Slurm use cases at JUSTUS 2, please visit our Slurm HOWTO for JUSTUS 2.

There is also a Slurm quickstart guide (https://slurm.schedmd.com/quickstart.html) from the vendor of Slurm.


2 Submitting Jobs on the bwForCluster JUSTUS 2

Batch jobs are submitted with the command:

$ sbatch <job-script>

A job script contains options for Slurm in lines beginning with #SBATCH, followed by the commands you want to execute on the compute nodes. For example:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:20:00
#SBATCH --mem=1gb
#SBATCH --export=NONE
echo 'Here starts the calculation'

You can override options from the script on the command line:

$ sbatch --time=03:00:00 <job-script>

3 Slurm Command Overview

Slurm command   Description
sbatch          Submits a job and queues it in an input queue
squeue          Displays information about active, eligible, blocked, and/or recently completed jobs
scontrol        Displays detailed job state information
scancel         Cancels a job
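
The four commands are typically used together. The following is a minimal sketch, not an official JUSTUS 2 tool: the helper name submit_and_inspect is hypothetical, and it assumes that sbatch is called with --parsable so that it prints only the job ID (optionally followed by ";" and a cluster name).

```bash
#!/bin/bash
# Hypothetical helper: submit a job script, then show its queue entry and details.
submit_and_inspect() {
    local script=$1 jobid
    # --parsable makes sbatch print "jobid" or "jobid;cluster" instead of a sentence.
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}              # strip an optional ";cluster" suffix
    squeue --jobs="$jobid"          # one-line status of the job in the queue
    scontrol show job "$jobid"      # detailed job state information
    echo "To cancel: scancel $jobid"
}
```

On a login node this could be called as "submit_and_inspect <job-script>"; the function only prints the scancel command rather than running it.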

4 Partitions

Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. number of cores, memory, local scratch space). This prevents fragmentation of the cluster system and ensures the most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users should not specify any partition "-p, --partition=<partition_name>" on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.

5 Job Priorities

Job priorities at JUSTUS 2 depend on multiple factors (https://slurm.schedmd.com/priority_multifactor.html):

  • Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.
  • Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.

Notes:

Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.

Fairshare does not introduce a fixed allotment in which a user's ability to run new jobs is cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources in the long term, which would be unfair to groups that have not used their fair share for quite some time.

Slurm features backfilling, meaning that the scheduler will start lower priority jobs if doing so does not delay the expected start time of any higher priority job. Since the expected start time of pending jobs depends on the expected completion time of running jobs, reasonably accurate time limits are important for backfill scheduling to work well. This video (https://youtu.be/OKhWwem1XZg?t=161) gives an illustrative description of how backfilling works.

In summary, an approximate model of Slurm's behavior for scheduling jobs is this:

  • Step 1: Can the job in position one (highest priority) start now?
  • Step 2: If it can, remove it from the queue, start it and continue with step 1.
  • Step 3: If it cannot, look at the next job.
  • Step 4: Can it start now, without delaying the start time of any job before it in the queue?
  • Step 5: If it can, remove it from the queue, start it, recalculate which nodes are free, look at the next job and continue with step 4.
  • Step 6: If it cannot, look at the next job and continue with step 4.

As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.
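
The steps above can be illustrated with a toy model. This is a deliberately simplified sketch and not how Slurm is implemented: it counts cores only (no nodes or memory), protects only the first blocked job, and backfills a job only if it would finish before that blocked job's earliest possible start. Real backfilling is less strict, since a longer job may also run on cores the blocked job will not need. The function name, job names and all numbers are hypothetical.

```bash
#!/bin/bash
# Toy single-pass backfill model (cores only); all names and numbers are made up.
# $1: idle cores now
# $2: running jobs as "cores:minutes_left ..."
# $3: pending queue as "name:cores:minutes ...", highest priority first
backfill_pass() {
    local free=$1 running=$2 queue=$3
    local shadow=-1 started="" job name cores minutes r avail
    for job in $queue; do
        IFS=: read -r name cores minutes <<< "$job"
        if (( shadow < 0 )); then
            if (( cores <= free )); then
                # Steps 1-2: the job at the head of the queue fits, so start it.
                started+="$name "
                free=$(( free - cores ))
                running+=" $cores:$minutes"
            else
                # Step 3: first blocked job. Find the earliest time at which
                # enough cores will have been released for it (0 = never).
                shadow=0
                avail=$free
                for r in $(tr ' ' '\n' <<< "$running" | sort -t: -k2,2n); do
                    avail=$(( avail + ${r%%:*} ))
                    if (( avail >= cores )); then shadow=${r##*:}; break; fi
                done
            fi
        elif (( cores <= free && minutes <= shadow )); then
            # Steps 4-5: fits now and ends before the blocked job could start.
            started+="$name "
            free=$(( free - cores ))
        fi
        # Step 6: otherwise the job stays pending and the next one is examined.
    done
    echo "${started% }"
}

# 12 idle cores; running jobs release 16 cores in 120 min and 8 cores in 30 min.
backfill_pass 12 "16:120 8:30" "A:4:600 B:20:60 C:4:20 D:4:600"   # prints "A C"
```

In this example A starts immediately, B has to wait until 120 minutes from now, C is backfilled because it finishes well before that, and D is held back by the strict rule of this sketch because its time limit reaches past B's expected start. This is also why a realistic --time request improves a job's chances of being backfilled.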

6 Usage Limits/Throttling Policies

While the fairshare factor ensures a fair long-term balance of resource utilization between users and groups, there are additional usage limits that constrain the total resources a user can hold at any given time. This prevents individual users from monopolizing large fractions of the whole cluster system in the short term.

  • The maximum walltime for a job is 14 days (336 hours): --time=336:00:00 or --time=14-0
  • The maximum number of cores used by running jobs at any given time is 1920 per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit applies to allocated memory. If this limit is reached, new jobs will be queued (with REASON: AssocGrpCpuLimit) and only allowed to run after resources have been relinquished.
  • The maximum amount of remaining allocated core-minutes per user is 3300000 (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 240 + 720 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time decreases and eventually allows more jobs to start in a staggered way. This limit also couples the maximum walltime to the number of cores that can be allocated for that amount of time. Thus, shorter walltimes allow more resources to be allocated at a given time (capped by the maximum number of cores above). Watch this video (https://youtu.be/OKhWwem1XZg?t=306) for an illustrative description. An equivalent limit applies to the remaining time of memory allocation.
  • The maximum number of GPUs allocated by running jobs is 4 per user. If this limit is reached, new jobs will be queued (with REASON: AssocGrpGRES) and only allowed to run after GPU resources have been relinquished.
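
The arithmetic behind the core and core-minute limits can be sketched as follows. The limit values are taken from the list above and are subject to change. The function names are hypothetical, and the second function assumes that all of a user's jobs request the same walltime and have only just started.

```bash
#!/bin/bash
MAX_CORES=1920             # per-user core limit from the list above
MAX_CORE_MINUTES=3300000   # per-user remaining core-minutes limit from the list above

# Sum cores * remaining minutes over running jobs given as "cores:minutes".
remaining_core_minutes() {
    local total=0 job
    for job in "$@"; do
        total=$(( total + ${job%%:*} * ${job##*:} ))
    done
    echo "$total"
}

# Largest number of cores that fits under both limits for a given walltime in minutes.
max_cores_for_walltime() {
    local minutes=$1
    local cores=$(( MAX_CORE_MINUTES / minutes ))
    if (( cores > MAX_CORES )); then
        cores=$MAX_CORES
    fi
    echo "$cores"
}

remaining_core_minutes 4:60 2:360            # prints 960
max_cores_for_walltime $(( 14 * 24 * 60 ))   # 14 days: prints 163
max_cores_for_walltime $(( 24 * 60 ))        # 1 day: prints 1920
```

Under these assumptions, jobs that request the full 14 days can occupy only about 163 cores at once, while jobs limited to one day can use the full 1920 cores.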

Note:

Usage limits are subject to change.

7 Other Considerations

7.1 Default Values

Default values for jobs are:

  • Runtime: --time=02:00:00 (2 hours)
  • Nodes: --nodes=1 (one node)
  • Tasks: --tasks-per-node=1 (one task per node)
  • Cores: --cpus-per-task=1 (one core per task)
  • Memory: --mem-per-cpu=2gb (2 GB per core)

7.2 Node Access Policy

Node access policy for jobs is "exclusive user". Nodes will be exclusively allocated to users. Multiple jobs (up to 48) of the same user can run on a single node at any time.

Note: This implies that for sub-node jobs, it is advisable for efficient resource utilization and maximum job throughput to adjust the number of cores to be an integer divisor of 48 (total number of cores on each node). For example, two 24-core jobs can run simultaneously on one and the same node, while two 32-core jobs will always have to allocate two separate nodes, but leave 16 cores unused on each of them. Users must therefore always think carefully about how many cores to request and whether their applications really benefit from allocating more cores for their jobs. Similar considerations apply - at the same time - to the requested amount of memory per job.

Think of it as the scheduler playing a game of multi-dimensional Tetris, where the dimensions are number of cores, amount of memory and other resources. Users can support this by making resource allocations that allow the scheduler to pack their jobs as densely as possible on the nodes.
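
The effect of the core count on node packing can be checked with a little integer arithmetic. This is a sketch that considers cores only and ignores memory and other resources; the function name is hypothetical.

```bash
#!/bin/bash
CORES_PER_NODE=48   # total number of cores on each node, as stated above

# Report how many jobs of a given core count fit on one node and what is left over.
node_packing() {
    local cores=$1
    local jobs=$(( CORES_PER_NODE / cores ))
    local unused=$(( CORES_PER_NODE % cores ))
    echo "${cores}-core jobs: $jobs per node, $unused cores unused"
}

for c in 12 16 24 32; do
    node_packing "$c"
done
# 12-core jobs: 4 per node, 0 cores unused
# 16-core jobs: 3 per node, 0 cores unused
# 24-core jobs: 2 per node, 0 cores unused
# 32-core jobs: 1 per node, 16 cores unused
```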

7.3 Memory Management

The wait time of a job also depends largely on the amount of requested resources and the available number of nodes providing this amount of resources. This must be taken into account in particular when requesting a certain amount of memory.

For example, there is a total of 692 compute nodes in JUSTUS 2, of which 456 nodes have 192 GB RAM. However, not all of the physical RAM is available to user jobs, because the operating system, system services and local file systems also require a certain amount of RAM. This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb or --tasks-per-node=48 and --mem-per-cpu=4gb), Slurm will rule out 456 of the 692 nodes as unsuitable for this job and consider only 220 of the 692 nodes eligible to run it.

The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:

Physical RAM on node   Available RAM on node   Number of suitable nodes
192 GB                 187 GB                  692
384 GB                 376 GB                  220
768 GB                 754 GB                  28
1536 GB                1510 GB                 8
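
The table can be turned into a quick lookup that shows how a memory request narrows the pool of candidate nodes. This is a sketch with a hypothetical function name; the thresholds and node counts are copied from the table and may change with the cluster configuration.

```bash
#!/bin/bash
# Number of nodes that can serve a per-node memory request given in GB.
suitable_nodes() {
    local mem_gb=$1
    if   (( mem_gb <= 187 ));  then echo 692
    elif (( mem_gb <= 376 ));  then echo 220
    elif (( mem_gb <= 754 ));  then echo 28
    elif (( mem_gb <= 1510 )); then echo 8
    else echo 0
    fi
}

suitable_nodes 187   # prints 692
suitable_nodes 192   # prints 220
```

Requesting 187 GB instead of 192 GB per node therefore keeps all 692 nodes eligible for the job.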

Also note that allocated memory is factored into the resource usage accounting for fairshare. This means that over-requesting memory may have a negative impact on the priority of subsequent jobs.