The bwForCluster NEMO 2 uses Slurm (https://slurm.schedmd.com/) for scheduling compute jobs.

Slurm Command Overview

Slurm commands	Brief explanation
sbatch	Submits a job and queues it in an input queue
salloc	Request resources for an interactive job
squeue	Displays information about active, eligible, blocked, and/or recently completed jobs
scontrol	Displays detailed job state information
sstat	Displays status information about a running job
scancel	Cancels a job
seff	Shows the "job efficiency" of a job after it has finished

Submitting Jobs on the bwForCluster NEMO 2

Batch jobs are submitted with the command:

$ sbatch <job-script>

A job script contains options for Slurm in lines beginning with #SBATCH as well as your commands which you want to execute on the compute nodes. For example:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:14:00
#SBATCH --mem=1gb
...

You can override options from the script on the command-line:

$ sbatch --time=03:00:00 <job-script>

Resource Requests

Important resource request options for the Slurm command sbatch are:

Option	Slurm (sbatch)
#SBATCH	Script directive
--time=<hh:mm:ss> (-t <hh:mm:ss>)	Wall time limit
--job-name=<name> (-J <name>)	Job name
--nodes=<count> (-N <count>)	Node count
--ntasks=<count> (-n <count>)	Core count
--ntasks-per-node=<count>	Process count per node
--mem=<limit>	Memory limit per node
--mem-per-cpu=<limit>	Memory limit per process
--gpus=<count>	GPU count
--gres=gpu:<count>	GPU count (gres = "generic resource")
--exclusive	Node exclusive job

Nodes and Cores

Slurm provides a number of options to request nodes and cores. Typically, using --nodes=<count> and --ntasks-per-node=<count> should work for all your jobs. For single core jobs, it would be sufficient to use the option --ntasks=1. Specifying only --ntasks may lead to slurm trying to distribute tasks over more than one node even if you requested a small amount of cores.

Memory

Memory can be requested with either the option --mem=<limit> (memory per node) or --mem-per-cpu=<limit> (memory per process). When looking up the maximum available memory for a certain node type subtract about 5 GB for the operating system. Specify the memory limit as a value-unit-pair, for example 500mb or 8gb.

In most cases it is preferable to use the --mem=<limit> option.

GPUs

GPUs are requested as "generic resources" with --gres:gpu:<count> or --gpus:<count>.

Default Values

Default values for jobs are:

Runtime: --time=01:00:00 (1 hour)
Nodes: --nodes=1 (one node)
Tasks: --tasks-per-node=1 (one task per node)
Cores: --cpus-per-task=1 (one core per task)
Memory: --mem-per-cpu=1gb (1 GB per core)

Partitions

On bwForCluster NEMO 2 it is optional to request a partition with --partition=<partition_name> on job submission. Within a partition job allocations are routed automatically to the most suitable compute node(s) for the requested resources (e.g. amount of nodes and cores, memory, number of GPUs). The 'cpu' partition is the default partition, if no partition is requested. It will start jobs either on 'milan' or 'genoa' nodes.

The partitions 'cpu', 'l40s', 'mi300a' and 'h200' only allow the use of single nodes. You cannot start jobs that use more than one node, either by specifying --nodes (>1) or --ntasks (more than available on one node). If you do so, your jobs will be blocked with the reason (PartitionNodeLimit).

All partitions cpu, milan, genoa, l40s, mi300a and h200 are operated in shared mode, i.e. jobs from different users can run on the same node. Jobs can get exclusive access to compute nodes in these partitions with the --exclusive option.

The login partition is intended for quick tests on the login lodes with the Slurm environment.

See NEMO2/Hardware for more information on the exact hardware specifications.

Partition	Multi Node	Nodes	APU/GPU	Cores per APU/GPU	Defaults	Limits
cpu*	single node	2x AMD EPYC 7763/9654 (Milan/Genoa)	---	---	ntasks=1, time=01:00:00, mem-per-cpu=1gb	time=96:00:00, nodes=1
genoa	multi node	2x AMD EPYC 9654 (Genoa)	---	---	ntasks=1, time=01:00:00, mem-per-cpu=1gb	time=96:00:00
milan	multi node	2x AMD EPYC 7763 (Milan)	---	---	ntasks=1, time=01:00:00, mem-per-cpu=1gb	time=96:00:00
l40s	single node	2x Intel Xeon Platinum 8562Y+ (5th Gen)	4x NVIDIA L40S	15	ntasks=1, time=01:00:00, mem-per-cpu=1gb	time=48:00:00, nodes=1, gpus=4
mi300a	single node	4x AMD Instinct MI300A	4x AMD Instinct MI300A	23	ntasks=1, time=01:00:00, mem-per-cpu=1gb	time=48:00:00, nodes=1, gpus=4
h200 (August 2025)	single node	2x AMD EPYC 9654 (Genoa)	8x NVIDIA H200	23	ntasks=1, time=01:00:00, mem-per-cpu=1gb	time=48:00:00, nodes=1, gpus=8
login	single node	1x AMD EPYC 9354 (Genoa) (SMT on)	---	---	ntasks=1, time=15:00, mem-per-cpu=1gb	time=15:00, nodes=1, ntasks=4, mem=50gb

* default partition

Show free nodes

The Slurm command sinfo is used to view partition and node information for a system running Slurm, unfortunately it can only be used by the administrators. For users there is a special binary sinfo_t_idle to find out how many nodes are available on the system for immediate use.

$ sinfo_t_idle
Partition cpu*                    :    127 nodes idle
Partition genoa                   :     30 nodes idle
Partition milan                   :     97 nodes idle
Partition l40s                    :      6 nodes idle
Partition mi300a                  :      3 nodes idle
Partition login                   :      2 nodes idle

Monitoring Jobs with squeue

After you submitted the job, you can see it waiting using the squeue command:

(also read the man page with man squeue for more information on how to use the command)

> squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               426       cpu      20k   fr_0123 R       0:02      2 n[4101-4102]

Output shows:

JOBID: the jobid is an unique number your job gets
PARTITION: the cluster can be divided in different types of nodes.
NAME: the name you gave your job with the --job-name= option
USER: your username
ST: the state the job is in. R = running, PD = pending, CD = completed. See man page for a full list on states.
TIME: how long the job has been running
NODES: how many nodes were requested
NODELIST(REASON): either show the node(s) the job is running on, or a reason why it hasn't started

scontrol

You can then show more info on one specific running job using the scontrol command, e.g for the job with ID 426 listed above:

scontrol show job 426

displays detailed information for job with JobID 426

scontrol show jobs

displays detailed information for all your jobs

scontrol write batch_script 426 - display job script of a running job. The - is a special filename which means "write to the terminal".

Monitoring a Started Job

After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n4101:

> ssh n4101

Job Examples

Here you can find some example scripts for batch jobs.

Serial Programs

#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --time=20:00:00
#SBATCH --mem=3800mb
./my_serial_program

Notes:

Jobs with --mem up to 500gb can run on all node types associated with the cpu partition.

Multi-threaded Programs

#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --time=01:30:00
#SBATCH --mem=50gb
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./my_multithreaded_program

Notes:

Jobs with --ntasks-per-node up to 126 and --mem up to 500gb can run on all node types associated with the milan partition.
With export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK} you can set the number of threads according to the number of resources requested.

MPI Programs

#!/bin/bash
#SBATCH --partition=milan
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=126
#SBATCH --time=12:00:00
module load mpi/openmpi   #  loads dependencies
srun ./my_mpi_program

Notes:

--mem requests the memory per node. The maximum is 500gb for partition milan.
The Compiler and MPI modules used for the compilation must be loaded before the start of the program.
It is recommended to start MPI programs with srun.

Interactive Jobs

Interactive jobs must NOT run on the login nodes, however resources for interactive jobs can be requested using srun. The following examples requests an interactive session on 1 core for 2 hours.

$ salloc --partition=cpu --ntasks=1 --time=2:00:00

After execution of this command wait until the queueing system has granted you the requested resources. Once granted you will be automatically logged on the allocated compute node.

If you use applications or tools which provide a GUI, enable X-forwarding for your interactive session with. We suggest using VNC instead of X-forwarding:

$ salloc --partition=cpu --ntasks=1 --time=2:00:00 --x11

Once the walltime limit has been reached you will be automatically logged out from the compute node.

Interactive Tests on the Login Nodes

The login partition is intended for quick tests on the login lodes with the Slurm environment. The limits are 4 cores and 15 minutes of walltime.

$ salloc --partition=login --ntasks=4
$ srun -p login <application>

Job Monitoring

Information about submitted jobs

For an overview of your submitted jobs use the command:

$ squeue

To get detailed information about a specific job use the command:

$ scontrol show job <jobid>

Notes:
A job start may be delayed for various reasons:

(QOSMaxCpuPerUserLimit) - There is a limit to how many CPU cores a user can use at the same time. The job exceeds this limit.
(QOSMaxGRESPerUser) - There is a limit to how many GPUs a user can use at the same time. The job exceeds this limit.
(QOSMinGRES) - The job was submitted to a gpu partition without requesting a GPU.
(launch failed requeued held) - The job has failed to start. You may be able to resume it using scontrol. Alternatively you can cancel it and submit it again.

For further reasons please refer to: https://slurm.schedmd.com/job_reason_codes.html

Information about resource usage of running jobs

You can monitor the resource usage of running jobs with the sstat command. For example:

$ sstat --format=JobId,AveCPU,AveRSS,MaxRSS -j <jobid>

This will show average CPU time, average and maximum memory consumption of all tasks in the running job.

sstat -e command shows a list of fields that can be specified with the --format option.

Interactive access to running jobs

It is also possible to attach an interactive shell to a running job with command:

$ srun --jobid=<jobid> --overlap --pty /bin/bash

Commands like htop show you the most busy processes on the node. To exit htop type q.

To monitor your GPU processes use the command nvidia-smi on AMD APUs rocm-smi.

Job Feedback

You get feedback on resource usage and job efficiency for completed jobs with the command:

$ seff <jobid>

Example Output:

Job ID: 426
Cluster: nemo
User/Group: fr_ab0123/fr_fr
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 190
CPU Utilized: 19:13:22
CPU Efficiency: 91.06% of 21:06:40 core-walltime
Job Wall-clock time: 00:03:20
Memory Utilized: 15.48 GB
Memory Efficiency: 1.10% of 1.38 TB

Explanation:

Nodes: Number of allocated nodes for the job.
Cores per node: Number of physical cores per node allocated for the job.
CPU Utilized: Sum of utilized core time.
CPU Efficiency: 'CPU Utilized' with respect to core-walltime (= 'Nodes' x 'Cores per node' x 'Job Wall-clock time') in percent.
Job Wall-clock time: runtime of the job.
Memory Utilized: Sum of memory used. For multi node MPI jobs the sum is only correct when srun is used instead of mpirun.
Memory Efficiency: 'Memory Utilized' with respect to total allocated memory for the job.

Queue / Partition Limits

Currently, each user can use 8000 cores (QOSMaxCpuPerUserLimit) and 7,2 million core minutes (MaxCpuRunMinsPerUser) at any time.

The core minutes can be calculated by (running CPU cores) * (remaining runtime of all jobs in minutes) = CpuRunMinsPerUser, example: 5000cores * 24h * 60min = 7.200.000 CpuRunMinsPerUser

This is constantly evaluated, you do not have to do anything. This value decreases after a few minutes (running CPU cores * remaining runtime of all jobs in minutes) or when a job ends.

The partitions 'cpu', 'l40s', 'mi300a' and 'h200' only allow the use of single nodes. You cannot start jobs that use more than one node, either by specifying --nodes (>1) or --ntasks (more than available on one node). If you do so, your jobs will be blocked with the reason (PartitionNodeLimit).

Common NEMO2 job codes:

Queue Job Code	Reason
MaxCpuRunMinsPerUser	You are currently running more core minutes than permitted. This value decreases automatically over time.
PartitionNodeLimit	You try to run a multi node job on a single node partition.
QOSMaxCpuPerUserLimit	Your job cannot be started as your user would otherwise exceed the maximum permitted cores for one user.

Check additional reason codes at Schedmd.

Accounting (not implemented)

Jobs are billed for allocated CPU cores, memory and GPUs.

To see the accounting data of a specific job:

$ sacct -j <jobid> --format=user,jobid,account,nnodes,ncpus,time,elapsed,AllocTRES%50

To retrive the job history for a specific user for a certain time frame:

$ sacct -u <user> -S 2025-03-01 -E 2025-03-02 --format=user,jobid,account,nnodes,ncpus,time,elapsed,AllocTRES%50

NEMO2/Slurm

Contents

Slurm Command Overview

Submitting Jobs on the bwForCluster NEMO 2

Resource Requests

Default Values

Partitions

Show free nodes

Monitoring Jobs with squeue

scontrol

Monitoring a Started Job

Job Examples

Serial Programs

Multi-threaded Programs

MPI Programs

Interactive Jobs

Interactive Tests on the Login Nodes

Job Monitoring

Information about submitted jobs

Information about resource usage of running jobs

Interactive access to running jobs

Job Feedback

Queue / Partition Limits

Accounting (not implemented)

Navigation menu

NEMO2/Slurm

Slurm Command Overview

Submitting Jobs on the bwForCluster NEMO 2

Resource Requests

Default Values

Partitions

Show free nodes

Monitoring Jobs with squeue

scontrol

Monitoring a Started Job

Job Examples

Serial Programs

Multi-threaded Programs

MPI Programs

Interactive Jobs

Interactive Tests on the Login Nodes

Job Monitoring

Information about submitted jobs

Information about resource usage of running jobs

Interactive access to running jobs

Job Feedback

Queue / Partition Limits

Accounting (not implemented)

Navigation menu

Search