BwUniCluster2.0/Batch Queues
Latest revision as of 15:27, 17 January 2025
Partitions, Queues and Jobs
Slurm manages job queues for different partitions. Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.
On bwUniCluster 2.0 there are different partitions:
- CPU-only nodes
  - 2-socket nodes, consisting of 2 Intel processors with 20 (Cascade Lake) or 32 (Ice Lake) cores each
  - 4-socket nodes with very high RAM capacity, consisting of 4 Intel processors with 20 cores each
- GPU-accelerated nodes
  - 2-socket nodes with 4x NVIDIA Tesla V100, 4x NVIDIA A100 or 4x NVIDIA H100
  - 2-socket nodes with 8x NVIDIA Tesla V100
Job queues are used to manage jobs that request access to shared but limited computing resources of a certain kind (partition).
On bwUniCluster 2.0 there are different main types of queues:
- Regular queues
  - single: Jobs that request at most one node or even only single cores of a node.
  - multiple: Jobs that request more than one node.
  - gpu: Jobs that request GPU accelerators on one or more than one node.
- Development queues (dev)
  - Short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind development queues is to provide users with immediate access to compute resources without having to wait. This is the place to run heavy computations right away without affecting other users, as would be the case on the login nodes.
Requested compute resources such as (wall-)time, number of nodes and amount of memory are restricted and must fit into the boundaries imposed by the queues. The request for compute resources on the bwUniCluster 2.0 requires at least the specification of the queue and the time.
Jobs can be run non-interactively as batch jobs or as interactive jobs. Submitting a batch job means that all steps of a compute project are defined in a bash script. This script is queued and executed as soon as the compute resources are available and allocated. Jobs are enqueued with the sbatch command.
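A minimal sketch of such a batch script may help. The file name job.sh, the memory value and the placeholder workload are assumptions made for illustration; the queue name and time limit are taken from the queue table on this page, and the #SBATCH lines use standard Slurm options.

```shell
#!/bin/bash
# job.sh (hypothetical name): minimal batch job sketch.
#SBATCH --partition=single      # queue; must be specified on bwUniCluster 2.0
#SBATCH --time=30               # wall time in minutes; must be specified
#SBATCH --ntasks=1              # one task (a serial job)
#SBATCH --mem=5000mb            # memory for the job (illustrative value)

# Slurm sets SLURM_JOB_ID inside a job; fall back to a label otherwise
# so the script can also be dry-run outside the batch system.
job_label="${SLURM_JOB_ID:-dry-run}"
echo "Job ${job_label} started on $(uname -n)"

# Placeholder for the real steps of the compute project.
echo "Running workload ..."

echo "Job ${job_label} finished"
```

Such a script would be submitted with sbatch job.sh. Options given on the sbatch command line take precedence over the #SBATCH lines in the script.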
For interactive jobs, the resources are requested with the salloc command. As soon as the computing resources are available and allocated, a command line prompt is returned on a compute node and the user can freely use the allocated resources.
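Please remember:
- Heavy computations are not allowed on the login nodes. Use a development or a regular job queue instead! Please refer to Allowed Activities on Login Nodes.
- Development queues are meant for development tasks. Do not misuse this queue for regular, short-running jobs or chain jobs! Only one running job at a time is allowed.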
Batch Jobs: sbatch
sbatch -p queue
Details are:
bwUniCluster 2.0: sbatch -p queue

| queue | node | default resources | minimum resources | maximum resources |
|---|---|---|---|---|
| dev_single | Thin | time=10, mem-per-cpu=1125mb | | time=30, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2). 6 nodes are reserved for this queue. Only for development, i.e. debugging or performance optimization ... |
| single | Thin | time=30, mem-per-cpu=1125mb | | time=72:00:00, nodes=1, mem=180000mb, ntasks-per-node=40, (threads-per-core=2) |
| dev_multiple | HPC | time=10, mem-per-cpu=1125mb | nodes=2 | time=30, nodes=4, mem=90000mb, ntasks-per-node=40, (threads-per-core=2). 8 nodes are reserved for this queue. Only for development, i.e. debugging or performance optimization ... |
| multiple | HPC | time=30, mem-per-cpu=1125mb | nodes=2 | time=72:00:00, mem=90000mb, nodes=80, ntasks-per-node=40, (threads-per-core=2) |
| dev_multiple_il | Ice Lake | time=10, mem-per-cpu=1950mb | nodes=2 | time=30, nodes=8, mem=249600mb, ntasks-per-node=64, (threads-per-core=2). 8 nodes are reserved for this queue. Only for development, i.e. debugging or performance optimization ... |
| multiple_il | Ice Lake | time=10, mem-per-cpu=1950mb | nodes=2 | time=72:00:00, nodes=80, mem=249600mb, ntasks-per-node=64, (threads-per-core=2) |
| dev_gpu_4_a100 | Ice Lake + A100 | time=10, mem-per-gpu=127500mb, cpus-per-gpu=16 | | time=30, nodes=1, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) |
| gpu_4_a100 | Ice Lake + A100 | time=10, mem-per-gpu=127500mb, cpus-per-gpu=16 | | time=48:00:00, nodes=9, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) |
| gpu_4_h100 | Ice Lake + H100 | time=10, mem-per-gpu=127500mb, cpus-per-gpu=16 | | time=48:00:00, nodes=5, mem=510000mb, ntasks-per-node=64, (threads-per-core=2) |
| Fat | Fat | time=10, mem-per-cpu=18750mb | mem=180001mb | time=72:00:00, nodes=1, mem=3000000mb, ntasks-per-node=80, (threads-per-core=2) |
| dev_gpu_4 | gpu4 | time=10, mem-per-gpu=94000mb, cpus-per-gpu=10 | | time=30, nodes=1, mem=376000mb, ntasks-per-node=40, (threads-per-core=2). 1 node is reserved for this queue. Only for development, i.e. debugging or performance optimization ... |
| gpu_4 | gpu4 | time=10, mem-per-gpu=94000mb, cpus-per-gpu=10 | | time=48:00:00, mem=376000mb, nodes=14, ntasks-per-node=40, (threads-per-core=2) |
| gpu_8 | gpu8 | time=10, mem-per-cpu=94000mb, cpus-per-gpu=10 | | time=48:00:00, mem=752000mb, nodes=10, ntasks-per-node=40, (threads-per-core=2) |
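The default resources of a queue class define the time, number of tasks and memory that apply if they are not explicitly given with the sbatch command. A time value without units is in minutes. The resource options --time, --ntasks, --nodes, --mem and --mem-per-cpu are described in BwUniCluster 2.0 Slurm common Features.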
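Because a request must fit into the limits of the chosen queue, it can be useful to check a wall time before submitting. The following is only a sketch: the function names max_minutes and fits_queue are hypothetical helpers, not part of Slurm, and the limits are copied by hand from the table above for a subset of the queues.

```shell
#!/bin/bash
# Maximum wall time in minutes for some queues, copied from the table above.
# Only a subset of the queues is listed; extend the case statement as needed.
max_minutes() {
    case "$1" in
        dev_single|dev_multiple|dev_multiple_il|dev_gpu_4|dev_gpu_4_a100) echo 30 ;;
        single|multiple|multiple_il) echo $((72 * 60)) ;;
        gpu_4|gpu_8|gpu_4_a100|gpu_4_h100) echo $((48 * 60)) ;;
        *) return 1 ;;
    esac
}

# fits_queue QUEUE MINUTES: succeed if the request is within the queue limit.
fits_queue() {
    local limit
    limit=$(max_minutes "$1") || { echo "unknown queue: $1" >&2; return 2; }
    [ "$2" -le "$limit" ]
}

if fits_queue single 120; then
    echo "120 minutes fit into queue 'single'"
fi
if ! fits_queue dev_single 120; then
    echo "120 minutes exceed the limit of queue 'dev_single'"
fi
```

Slurm itself rejects or holds requests that exceed a queue limit, so a check like this only serves to catch mistakes earlier.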
Queue class examples
To run your batch job on one of the thin nodes, please use:
$ sbatch --partition=dev_multiple
or
$ sbatch -p dev_multiple
Interactive Jobs: salloc
On bwUniCluster 2.0 you are only allowed to run short jobs (<< 1 hour) with low memory requirements (<< 8 GByte) on the login nodes. If you want to run longer jobs and/or jobs that need more than 8 GByte of memory, you must allocate resources for a so-called interactive job by running the salloc command on a login node. For a serial application that requires 5000 MByte of memory, with the interactive run limited to 2 hours, execute the following command:
$ salloc -p single -n 1 -t 120 --mem=5000
Then you will get one core on a compute node within the partition "single". After execution of this command DO NOT CLOSE your current terminal session but wait until the queueing system Slurm has granted you the requested resources on the compute system. You will be logged in automatically on the granted core! To run a serial program on the granted core you only have to type the name of the executable.
$ ./<my_serial_program>
Please be aware that your serial job must finish within 2 hours in this example; otherwise it will be killed by the system during runtime.
You can now also start a graphical X11 terminal connected to the dedicated resource, which is available for 2 hours. Start it with the command:
$ xterm
Note that once the walltime limit has been reached, the resources (i.e. the compute node) are automatically revoked.
An interactive parallel application running on one or on many compute nodes (here, e.g., 5 nodes with 40 cores each) usually requires a certain amount of memory (e.g. 50 GByte) and a maximum run time (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:
$ salloc -p multiple -N 5 --ntasks-per-node=40 -t 01:00:00 --mem=50gb
Now you can run parallel jobs on 200 cores requiring 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node. If you want to access another node, open a new terminal, connect it to bwUniCluster 2.0 as well and type the following commands to connect to the running interactive job and then to a specific node:
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc2nXXX --pty /bin/bash
With the command:
$ squeue
the jobid and the nodelist can be shown.
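To restrict the squeue output to your own jobs and to the columns needed here, a format string can be passed. The sketch below assumes the standard squeue format specifiers %i (job id), %P (partition), %T (state) and %N (node list); the wrapper function show_my_jobs is a hypothetical helper.

```shell
#!/bin/bash
# List job id, partition, state and node list of your own jobs.
show_my_jobs() {
    if command -v squeue >/dev/null 2>&1; then
        squeue -u "${USER:-$(id -un)}" -o "%i %P %T %N"
    else
        # Outside the cluster there is no squeue; report this instead of failing hard.
        echo "squeue not found; run this on a cluster login node" >&2
        return 127
    fi
}

show_my_jobs || true
```

The first column of the output is the value to use for --jobid, and the last column contains the node names that can be passed to --nodelist.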
If you want to run MPI programs, simply type mpirun <program_name>. Your program will then run on 200 cores. A very simple example for starting a parallel job is:
$ mpirun <my_mpi_program>
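The figure of 200 cores follows from the allocation in the salloc example: 5 nodes times 40 tasks per node. As a small sketch, the same number can be computed from the environment variables SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE that Slurm exports inside such an allocation; the fallback values 5 and 40 are only the numbers from the example and apply when the snippet is run outside a job.

```shell
#!/bin/bash
# Values from the salloc example above (-N 5 --ntasks-per-node=40).
# Inside an allocation Slurm exports these variables; the defaults
# after ":-" are used when the snippet is run elsewhere.
nodes="${SLURM_JOB_NUM_NODES:-5}"
tasks_per_node="${SLURM_NTASKS_PER_NODE:-40}"

total_cores=$((nodes * tasks_per_node))
echo "The allocation provides ${total_cores} task slots"
```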
You can also start the debugger ddt by the commands:
$ module add devel/ddt
$ ddt <my_mpi_program>
The above commands will execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of cores; an example for this can be:
$ mpirun -n 50 <my_mpi_program>
If you are using Intel MPI you must start <my_mpi_program> by the command mpiexec.hydra (instead of mpirun).
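The choice between the two start commands can be wrapped in a small helper. This is a sketch under one assumption that is not stated on this page: that a loaded Intel MPI environment sets the variable I_MPI_ROOT. The function name pick_launcher is hypothetical, and the command is only printed, not executed.

```shell
#!/bin/bash
# pick_launcher: print the MPI start command to use.
# Assumption: an Intel MPI environment sets I_MPI_ROOT.
pick_launcher() {
    if [ -n "${I_MPI_ROOT:-}" ]; then
        echo "mpiexec.hydra"
    else
        echo "mpirun"
    fi
}

launcher=$(pick_launcher)
# Print the command instead of running it; <my_mpi_program> is a placeholder.
echo "${launcher} -n 50 <my_mpi_program>"
```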