BwUniCluster3.0/Slurm

= Slurm HPC Workload Manager =

== Specification ==
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Any kind of calculation on the compute nodes of bwUniCluster 3.0 requires the user to define the calculation as a single command or a sequence of commands, together with the required run time, number of CPU cores, and main memory, and to submit all of this, i.e. the batch job, to a resource and workload managing software. bwUniCluster 3.0 uses the workload manager Slurm, so every job submission is done with Slurm commands. Slurm queues and runs user jobs based on fair-share policies.
<br>
<br>
== Slurm Commands (excerpt) ==
Important Slurm commands for non-administrators working on bwUniCluster 3.0.
{| width=850px class="wikitable"
! Slurm commands !! Brief explanation
|-
| [[#Job submission : sbatch|sbatch]] || Submits a job and puts it into the queue [[https://slurm.schedmd.com/sbatch.html sbatch]]
|-
| [[#Interactive job : salloc|salloc]] || Requests resources for an interactive job [[https://slurm.schedmd.com/salloc.html salloc]]
|-
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]
|-
| squeue || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| squeue --start || Returns the expected start time of a submitted job [[https://slurm.schedmd.com/squeue.html squeue]]
|-
| sinfo_t_idle || Shows which resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]
|-
| scancel || Cancels a job [[https://slurm.schedmd.com/scancel.html scancel]]
|}
* [https://slurm.schedmd.com/man_index.html Slurm Commands]
<br>
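
A typical minimal workflow with these commands might look as follows; this is only a sketch, and the script name my_job_script.sh and the job ID 1234567 are placeholders:
<pre>
$ sinfo_t_idle                  # show resources that are free for immediate use
$ sbatch my_job_script.sh       # submit the batch job; sbatch prints the new job ID
$ squeue                        # list your active and pending jobs
$ squeue --start                # show the expected start time of pending jobs
$ scontrol show job 1234567     # detailed state information for one job
$ scancel 1234567               # cancel the job if necessary
</pre>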

== Job submission : sbatch ==
Batch jobs are submitted with the command '''sbatch'''. The main purpose of the '''sbatch''' command is to specify the resources that are needed to run the job. '''sbatch''' will then queue the batch job. When the job actually starts, however, depends on the availability of the requested resources and on the fair-share value.
<br>
<br>
=== Command parameters sbatch ===
The syntax and use of '''sbatch''' can be displayed via:
<pre>
$ man sbatch
</pre>
'''sbatch''' options can be given on the command line or inside your job script. Different defaults for some of these options are set per queue and can be found [https://wiki.bwhpc.de/e/BwUniCluster3.0/Batch_Queues#sbatch_-p_queue here]. An example job script is shown below the table.

{| class="wikitable"
! colspan="3" | sbatch Options
|-
! style="width:8%"| Command line
! style="width:9%"| Script
! style="width:13%"| Purpose
|- style="vertical-align:top;"
| -t, --time=''time''
| #SBATCH --time=''time''
| Wall clock time limit.
|-
|- style="vertical-align:top;"
| -N, --nodes=''count''
| #SBATCH --nodes=''count''
| Number of nodes to be used.
|- style="vertical-align:top;"
| -n, --ntasks=''count''
| #SBATCH --ntasks=''count''
| Number of tasks to be launched.
|-
|- style="vertical-align:top;"
| --ntasks-per-node=''count''
| #SBATCH --ntasks-per-node=''count''
| Maximum count of tasks per node.
|-
|- style="vertical-align:top;"
| -c, --cpus-per-task=''count''
| #SBATCH --cpus-per-task=''count''
| Number of CPUs required per (MPI-)task.
|-
|- style="vertical-align:top;"
| --mem=''value_in_MB''
| #SBATCH --mem=''value_in_MB''
| Memory per node in megabytes (MB). (You should normally omit this option.)
|-
|- style="vertical-align:top;"
| --mem-per-cpu=''value_in_MB''
| #SBATCH --mem-per-cpu=''value_in_MB''
| Minimum memory per allocated CPU in megabytes (MB). (You should normally omit this option.)
|-
|- style="vertical-align:top;"
| --mail-type=''type''
| #SBATCH --mail-type=''type''
| Notify user by email when certain event types occur.<br>Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
|-
|- style="vertical-align:top;"
| --mail-user=''mail-address''
| #SBATCH --mail-user=''mail-address''
| The specified mail-address receives email notification of state changes as defined by --mail-type.
|-
|- style="vertical-align:top;"
| --output=''name''
| #SBATCH --output=''name''
| File in which job output is stored.
|-
|- style="vertical-align:top;"
| --error=''name''
| #SBATCH --error=''name''
| File in which job error messages are stored.
|-
|- style="vertical-align:top;"
| -J, --job-name=''name''
| #SBATCH --job-name=''name''
| Job name.
|-
|- style="vertical-align:top;"
| --export=[ALL,] ''env-variables''
| #SBATCH --export=[ALL,] ''env-variables''
| Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL.
|-
|- style="vertical-align:top;"
| -A, --account=''group-name''
| #SBATCH --account=''group-name''
| Charge the resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The project group a job is accounted on is shown behind "Account=" in the output of "scontrol show job".
|-
|- style="vertical-align:top;"
| -p, --partition=''queue-name''
| #SBATCH --partition=''queue-name''
| Request a specific queue for the resource allocation.
|-
|- style="vertical-align:top;"
| --reservation=''reservation-name''
| #SBATCH --reservation=''reservation-name''
| Use a specific reservation for the resource allocation.
|-
|- style="vertical-align:top;"
| -C, --constraint=''LSDF''
| #SBATCH --constraint=LSDF
| Constrain the job to nodes with access to the LSDF filesystems.
|-
|- style="vertical-align:top;"
| -C, --constraint=''BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)''
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)
| Request a BeeOND on-demand filesystem for the job.
|-
|}
<br>
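
As an illustration of how these options are combined, the following is a minimal sketch of a job script for a serial job; the partition name cpu, the resource values, and all placeholder names are examples and should be adapted to your own requirements and to the queue limits:
<pre>
#!/bin/bash
#SBATCH --partition=cpu                 # queue to submit to
#SBATCH --ntasks=1                      # a single task, i.e. a serial job
#SBATCH --time=02:00:00                 # wall clock time limit of 2 hours
#SBATCH --mem=5000                      # 5000 MB of memory on the node
#SBATCH --job-name=my_serial_job        # job name shown by squeue
#SBATCH --output=my_serial_job_%j.out   # job output file; %j is replaced by the job ID
#SBATCH --mail-type=END,FAIL            # send mail when the job ends or fails
#SBATCH --mail-user=<your_mail_address>

./<my_serial_program>
</pre>
The script is then submitted from a login node:
<pre>
$ sbatch my_job_script.sh
</pre>
<br>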

== Interactive job : salloc ==

If you want to run an interactive job, you can request one with the command '''salloc''' on a login node.<br>For example, for a serial application on a compute node that requires 5000 MByte of memory, with the interactive run limited to 2 hours, execute the following command:

<pre>
$ salloc --partition=cpu --ntasks=1 --time=120 --mem=5000
</pre>

You will then be granted one core on a compute node in the partition "cpu". After executing this command, '''DO NOT CLOSE''' your current terminal session; wait until the queueing system Slurm has granted you the requested resources on the compute node. You will be logged in automatically on the granted core. To run a serial program on this core, simply type the name of the executable:

<pre>
$ ./<my_serial_program>
</pre>

Please be aware that in this example your serial job must finish within 2 hours; otherwise it will be killed by the system when the time limit is reached.


You can also start a graphical X11 terminal that connects you to the dedicated resource, which is available for 2 hours, with the command:

<pre>
$ xterm
</pre>

Note that once the walltime limit has been reached, the resources, i.e. the compute node, will automatically be revoked.


An interactive parallel application may run on one compute node or on several compute nodes (here, for example, 5 nodes with 40 cores each) and usually requires a certain amount of memory in GByte (e.g. 50 GByte per node) and a maximum runtime (e.g. 1 hour). For example, 5 nodes can be allocated with the following command:

<pre>
$ salloc --partition=cpu --nodes=5 --ntasks-per-node=40 --time=01:00:00 --mem=50gb
</pre>

Now you can run parallel jobs on 200 cores with 50 GByte of memory per node. Please be aware that you will be logged in on core 0 of the first node.
If you want to have access to another node, open a new terminal, connect it to bwUniCluster 3.0 as well, and type the following commands to
connect to the running interactive job and then to a specific node:

<pre>
$ srun --jobid=XXXXXXXX --pty /bin/bash
$ srun --nodelist=uc3nXXX --pty /bin/bash
</pre>

The job ID and the node list of your jobs can be displayed with the command:

<pre>
$ squeue
</pre>

If you want to run MPI programs, you can do so by simply typing mpirun <program_name>; your program will then run on all 200 allocated cores. A very simple example of starting a parallel job is:

<pre>
$ mpirun <my_mpi_program>
</pre>

You can also start the debugger ddt with the commands:

<pre>
$ module add devel/ddt
$ ddt <my_mpi_program>
</pre>

The above commands execute the parallel program <my_mpi_program> on all available cores. You can also start parallel programs on a subset of the cores, for example:

<pre>
$ mpirun -n 50 <my_mpi_program>
</pre>

If you are using Intel MPI, you must start <my_mpi_program> with the command mpiexec.hydra instead of mpirun.
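
For example, with the allocation above, an Intel MPI program could be launched on all 200 cores or on a subset of 50 cores as follows (the program name is a placeholder):
<pre>
$ mpiexec.hydra <my_mpi_program>
$ mpiexec.hydra -n 50 <my_mpi_program>
</pre>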
