<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.bwhpc.de/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=M+Carmesin</id>
	<title>bwHPC Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.bwhpc.de/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=M+Carmesin"/>
	<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/e/Special:Contributions/M_Carmesin"/>
	<updated>2026-04-06T20:05:13Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.17</generator>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Status&amp;diff=15837</id>
		<title>Status</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Status&amp;diff=15837"/>
		<updated>2026-03-17T08:38:10Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Current Status */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= bwHPC Cluster and Service Status Page =&lt;br /&gt;
&lt;br /&gt;
== Current Status ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- OK/green --&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
{| style=&amp;quot;  background:#B8FFB8; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#8AFF8A; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Status&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Normal operation of all systems.&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- warning/yellow --&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
{| style=&amp;quot;  background:#FFD28A; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFC05C; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Status&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
[[bwUniCluster3.0]] Maintenance&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- alert/red --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FF8A8A; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FF5C5C; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Status&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&amp;lt;strong&amp;gt;14.3.2026: bwUniCluster: Login Not Possible &amp;lt;/strong&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Due to a technical issue, users have been unable to log in to bwUniCluster 3.0 since Saturday.&lt;br /&gt;
&lt;br /&gt;
We are already working to resolve the issue as quickly as possible. At this time, we do not yet have an estimate of when the problem will be resolved.&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Old Messages ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;strong&amp;gt;15.10.2025: central application page (ZAS) [https://zas.bwhpc.de/] currently down&amp;lt;/strong&amp;gt;&lt;br /&gt;
* Renewal and application of new projects (Rechenvorhaben/RV) and registration of new RV members not possible.&lt;br /&gt;
* Filling out the bwUniCluster3.0 questionnaire not possible.&lt;br /&gt;
* Login and compute activities are &#039;&#039;&#039;not&#039;&#039;&#039; affected&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;strong&amp;gt;09.10.2025: central application page (ZAS) [https://zas.bwhpc.de/] currently down&amp;lt;/strong&amp;gt;&lt;br /&gt;
* Renewal and application of new projects (Rechenvorhaben/RV) and registration of new RV members not possible.&lt;br /&gt;
* Filling out the bwUniCluster3.0 questionnaire not possible.&lt;br /&gt;
* 10.10.: ongoing issue&lt;br /&gt;
* Login and compute activities are &#039;&#039;&#039;not&#039;&#039;&#039; affected&lt;br /&gt;
* 2025-10-10T16: Some sites report normal operation. Identity providers need to update DFN information, expected within 24h.&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Status&amp;diff=15835</id>
		<title>Status</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Status&amp;diff=15835"/>
		<updated>2026-03-17T08:37:11Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Current Status */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= bwHPC Cluster and Service Status Page =&lt;br /&gt;
&lt;br /&gt;
== Current Status ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- OK/green --&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
{| style=&amp;quot;  background:#B8FFB8; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#8AFF8A; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Status&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Normal operation of all systems.&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- warning/yellow --&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
{| style=&amp;quot;  background:#FFD28A; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFC05C; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Status&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
[[bwUniCluster3.0]] Maintenance&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- alert/red --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FF8A8A; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FF5C5C; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Status&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&amp;lt;strong&amp;gt;14.3.2026: bwUniCluster: Login Not Possible&amp;lt;/strong&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Due to a technical issue, users have been unable to log in to bwUniCluster 3.0 since Saturday.&lt;br /&gt;
&lt;br /&gt;
We are already working to resolve the issue as quickly as possible. At this time, we do not yet have an estimate of when the problem will be resolved.&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Old Messages ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;strong&amp;gt;15.10.2025: central application page (ZAS) [https://zas.bwhpc.de/] currently down&amp;lt;/strong&amp;gt;&lt;br /&gt;
* Renewal and application of new projects (Rechenvorhaben/RV) and registration of new RV members not possible.&lt;br /&gt;
* Filling out the bwUniCluster3.0 questionnaire not possible.&lt;br /&gt;
* Login and compute activities are &#039;&#039;&#039;not&#039;&#039;&#039; affected&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;strong&amp;gt;09.10.2025: central application page (ZAS) [https://zas.bwhpc.de/] currently down&amp;lt;/strong&amp;gt;&lt;br /&gt;
* Renewal and application of new projects (Rechenvorhaben/RV) and registration of new RV members not possible.&lt;br /&gt;
* Filling out the bwUniCluster3.0 questionnaire not possible.&lt;br /&gt;
* 10.10.: ongoing issue&lt;br /&gt;
* Login and compute activities are &#039;&#039;&#039;not&#039;&#039;&#039; affected&lt;br /&gt;
* 2025-10-10T16: Some sites report normal operation. Identity providers need to update DFN information, expected within 24h.&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Status&amp;diff=15833</id>
		<title>Status</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Status&amp;diff=15833"/>
		<updated>2026-03-17T08:36:55Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= bwHPC Cluster and Service Status Page =&lt;br /&gt;
&lt;br /&gt;
== Current Status ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- OK/green --&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
{| style=&amp;quot;  background:#B8FFB8; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#8AFF8A; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Status&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Normal operation of all systems.&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- warning/yellow --&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
{| style=&amp;quot;  background:#FFD28A; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FFC05C; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Status&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
[[bwUniCluster3.0]] Maintenance&lt;br /&gt;
|}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- alert/red --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#FF8A8A; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#FF5C5C; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Status&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
&amp;lt;strong&amp;gt;14.3.2026: bwUniCluster: Login Not Possible&amp;lt;/strong&amp;gt;&lt;br /&gt;
Due to a technical issue, users have been unable to log in to bwUniCluster 3.0 since Saturday.&lt;br /&gt;
&lt;br /&gt;
We are already working to resolve the issue as quickly as possible. At this time, we do not yet have an estimate of when the problem will be resolved.&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Old Messages ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;strong&amp;gt;15.10.2025: central application page (ZAS) [https://zas.bwhpc.de/] currently down&amp;lt;/strong&amp;gt;&lt;br /&gt;
* Renewal and application of new projects (Rechenvorhaben/RV) and registration of new RV members not possible.&lt;br /&gt;
* Filling out the bwUniCluster3.0 questionnaire not possible.&lt;br /&gt;
* Login and compute activities are &#039;&#039;&#039;not&#039;&#039;&#039; affected&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;strong&amp;gt;09.10.2025: central application page (ZAS) [https://zas.bwhpc.de/] currently down&amp;lt;/strong&amp;gt;&lt;br /&gt;
* Renewal and application of new projects (Rechenvorhaben/RV) and registration of new RV members not possible.&lt;br /&gt;
* Filling out the bwUniCluster3.0 questionnaire not possible.&lt;br /&gt;
* 10.10.: ongoing issue&lt;br /&gt;
* Login and compute activities are &#039;&#039;&#039;not&#039;&#039;&#039; affected&lt;br /&gt;
* 2025-10-10T16: Some sites report normal operation. Identity providers need to update DFN information, expected within 24h.&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=15548</id>
		<title>Development/Julia</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=15548"/>
		<updated>2025-12-02T08:35:16Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Julia is a high-level, high-performance, dynamic programming language designed with scientific computing in mind. Parallel programming features such as multi-threading are included in the core language, and there are also packages leveraging the power of MPI and CUDA.&lt;br /&gt;
&lt;br /&gt;
No packages are preinstalled besides the Julia language core; please use the Julia package manager to install any required Julia package.&lt;br /&gt;
&lt;br /&gt;
The Julia modules on JUSTUS 2 and bwUniCluster 3.0 load suitable versions of CUDA and OpenMPI, and the corresponding Julia packages CUDA.jl and MPI.jl are automatically configured to use these libraries once installed by the user. Any changes, such as loading modules with different MPI and/or CUDA versions or using the versions that ship as Julia artifacts, are likely to lead to errors.&lt;br /&gt;
&lt;br /&gt;
== Availability ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 and JUSTUS 2, Julia is available as a module. Check &amp;lt;code&amp;gt;module avail math/julia&amp;lt;/code&amp;gt; for the provided versions. If there is no suitable version, you can install Julia in your home directory using the [https://julialang.org/install/ JuliaUP] installer.&lt;br /&gt;
&lt;br /&gt;
== Environments and Package Installation ==&lt;br /&gt;
&lt;br /&gt;
It is highly recommended to use a separate Julia environment for every project. If Julia is started with the option &amp;lt;code&amp;gt;--project=.&amp;lt;/code&amp;gt;, the current folder is used as the environment, and the &amp;lt;code&amp;gt;Project.toml&amp;lt;/code&amp;gt; file containing the information on the installed packages is created if not yet present.&lt;br /&gt;
&lt;br /&gt;
In an interactive Julia session, the [https://pkgdocs.julialang.org/v1/getting-started/#Basic-Usage package manager] is activated by entering &amp;lt;code&amp;gt;]&amp;lt;/code&amp;gt;. The most important commands are:&lt;br /&gt;
* &amp;lt;code&amp;gt;add PACKAGENAME&amp;lt;/code&amp;gt;: install the package PACKAGENAME in the current environment&lt;br /&gt;
* &amp;lt;code&amp;gt;instantiate&amp;lt;/code&amp;gt;: install all packages with dependencies as stated in Project.toml and Manifest.toml, e.g. after copying existing code to the cluster&lt;br /&gt;
* &amp;lt;code&amp;gt;activate PATH_TO_ENV&amp;lt;/code&amp;gt;: use the environment located at the path &amp;lt;code&amp;gt;PATH_TO_ENV&amp;lt;/code&amp;gt; and initialize it if necessary.&lt;br /&gt;
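&lt;br /&gt;
For example, a typical first-time setup after copying a project to the cluster might look like this (a sketch; the directory name myproject is illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ cd ~/myproject&lt;br /&gt;
$ julia --project=.&lt;br /&gt;
julia&amp;gt; ]&lt;br /&gt;
(myproject) pkg&amp;gt; instantiate&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;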
&lt;br /&gt;
&lt;br /&gt;
== Interactive Example ==&lt;br /&gt;
&lt;br /&gt;
Load the Julia module and start an interactive REPL session with 8 threads, using the environment in the current directory:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load math/julia&lt;br /&gt;
$ julia -t 8 --project=.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Enter &#039;]&#039; to switch to the package manager and install the package [https://github.com/JuliaPlots/UnicodePlots.jl?tab=readme-ov-file &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt;]:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
add UnicodePlots&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Leave the package manager with the backspace key.&lt;br /&gt;
&lt;br /&gt;
Create a vector with 64 elements set to 0 and fill it, using all 8 threads, with the corresponding thread ID:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
vec = zeros(64)&lt;br /&gt;
Threads.@threads for i in eachindex(vec)&lt;br /&gt;
    vec[i]= Threads.threadid()&lt;br /&gt;
end&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Load the &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt; package and draw a scatter plot of the contents of &amp;lt;code&amp;gt;vec&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using UnicodePlots&lt;br /&gt;
scatterplot(vec)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Recommended Packages ==&lt;br /&gt;
&lt;br /&gt;
Depending on your specific problem, you might speed up your calculations by using the drop-in replacements for the LinearAlgebra routines from the Intel oneAPI MKL:&lt;br /&gt;
* [https://github.com/JuliaLinearAlgebra/MKL.jl MKL.jl]: dense linear algebra&lt;br /&gt;
* [https://github.com/JuliaSparse/MKLSparse.jl MKLSparse.jl]: sparse linear algebra&lt;br /&gt;
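&lt;br /&gt;
A minimal sketch of the drop-in usage, assuming MKL.jl has already been added to the active environment; loading it replaces the default OpenBLAS backend:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using MKL            # load first, replaces OpenBLAS as BLAS/LAPACK backend&lt;br /&gt;
using LinearAlgebra&lt;br /&gt;
BLAS.get_config()    # should now report MKL&lt;br /&gt;
A = rand(1000, 1000)&lt;br /&gt;
F = lu(A)            # LU factorization now runs on MKL&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;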
&lt;br /&gt;
If you are developing low-level numerical code, you could profit from the package [https://github.com/JuliaSIMD/LoopVectorization.jl LoopVectorization.jl].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Further documentation ==&lt;br /&gt;
&lt;br /&gt;
* [https://modernjuliaworkflows.org Modern Julia Workflows]: A collection of best practices &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/carstenbauer/JuliaHLRS25 Julia Workshop at HLRS]: The material of this workshop is in large parts also valid for the bwHPC clusters (on bwUniCluster and JUSTUS2 you only need the module math/julia).&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/JuliaPDE/SurveyofPDEPackages Survey of PDE packages]&lt;br /&gt;
&lt;br /&gt;
* [https://book.sciml.ai/ Parallel Computing and Scientific Machine Learning (SciML): Methods and Applications ]&lt;br /&gt;
&lt;br /&gt;
== Tips &amp;amp; Tricks ==&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Julia/Parallel_Programming|Parallel Programming]]&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Julia/Parallel_Programming&amp;diff=15545</id>
		<title>JUSTUS2/Software/Julia/Parallel Programming</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Julia/Parallel_Programming&amp;diff=15545"/>
		<updated>2025-12-02T08:21:51Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Parallel Programming in Julia =&lt;br /&gt;
&lt;br /&gt;
Julia supports several paradigms of parallel programming:&lt;br /&gt;
&lt;br /&gt;
# Implicit multi-threading by math libraries (OpenBLAS, MKL)&lt;br /&gt;
# Explicit multi-threading using Julia threads (e.g. &amp;lt;code&amp;gt;Threads.@threads for&amp;lt;/code&amp;gt;) or [https://github.com/JuliaSIMD/Polyester.jl Polyester.jl]&lt;br /&gt;
# Multiple processes on one or more nodes&lt;br /&gt;
#* &amp;lt;code&amp;gt;Distributed.jl&amp;lt;/code&amp;gt; package and &amp;lt;code&amp;gt;SlurmManager&amp;lt;/code&amp;gt; from the [https://github.com/JuliaParallel/SlurmClusterManager.jl &amp;lt;code&amp;gt;SlurmClusterManager.jl&amp;lt;/code&amp;gt;] package (e.g. &amp;lt;code&amp;gt;@distributed for&amp;lt;/code&amp;gt; loops)&lt;br /&gt;
#* [https://github.com/JuliaParallel/MPI.jl &amp;lt;code&amp;gt;MPI.jl&amp;lt;/code&amp;gt;]&lt;br /&gt;
# Execution on GPUs/CUDA using [https://cuda.juliagpu.org/stable/ &amp;lt;code&amp;gt;CUDA.jl&amp;lt;/code&amp;gt; ]&lt;br /&gt;
&lt;br /&gt;
All paradigms may be used at the same time, but must be chosen carefully to obtain the desired performance.&lt;br /&gt;
&lt;br /&gt;
== Implicit Multi-Threading ==&lt;br /&gt;
&lt;br /&gt;
The number of threads used by the mathematical linear algebra libraries can be configured with &amp;lt;code&amp;gt;BLAS.set_num_threads()&amp;lt;/code&amp;gt; from the &amp;lt;code&amp;gt;LinearAlgebra&amp;lt;/code&amp;gt; package. Alternatively, you can set the environment variable &amp;lt;code&amp;gt;OPENBLAS_NUM_THREADS&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;MKL_NUM_THREADS&amp;lt;/code&amp;gt; if you use MKL.&lt;br /&gt;
&lt;br /&gt;
If your code is already multi-threaded, you probably want to set the number of BLAS threads to 1 in order to avoid running too many competing threads, as every Julia thread comes with its own set of BLAS threads.&lt;br /&gt;
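&lt;br /&gt;
For example (a sketch):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using LinearAlgebra&lt;br /&gt;
BLAS.set_num_threads(1)    # one BLAS thread, e.g. when all Julia threads are already busy&lt;br /&gt;
BLAS.get_num_threads()     # verify the setting&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;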
&lt;br /&gt;
== Explicit Multi-Threading ==&lt;br /&gt;
Start Julia with the option &amp;lt;code&amp;gt;-t x&amp;lt;/code&amp;gt;, where x is the number of (Julia) threads or the keyword &amp;lt;code&amp;gt;auto&amp;lt;/code&amp;gt;. Note, however, that &amp;lt;code&amp;gt;auto&amp;lt;/code&amp;gt; does not correctly determine the number of threads requested from Slurm with the option &amp;lt;code&amp;gt;--cpus-per-task&amp;lt;/code&amp;gt;. Alternatively, you can set the environment variable &amp;lt;code&amp;gt;JULIA_NUM_THREADS&amp;lt;/code&amp;gt;. See the [https://docs.julialang.org/en/v1/manual/multi-threading/ Julia documentation] for more details.&lt;br /&gt;
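&lt;br /&gt;
A minimal job script sketch that matches the Julia thread count to the Slurm allocation (the resource values and the script name myscript.jl are illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --cpus-per-task=8&lt;br /&gt;
#SBATCH --time=00:30:00&lt;br /&gt;
module load math/julia&lt;br /&gt;
# pass the allocated core count explicitly instead of relying on auto&lt;br /&gt;
julia -t $SLURM_CPUS_PER_TASK --project=. myscript.jl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;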
&lt;br /&gt;
== Multiple Processes ==&lt;br /&gt;
Julia has native support for distributed computing with multiple processes on different nodes via the [https://docs.julialang.org/en/v1/manual/distributed-computing/ Distributed package]. To integrate well with Slurm, it is advised to spawn the worker processes with the &amp;lt;code&amp;gt;addprocs_slurm()&amp;lt;/code&amp;gt; function provided by [https://github.com/JuliaParallel/ClusterManagers.jl &amp;lt;code&amp;gt;ClusterManagers.jl&amp;lt;/code&amp;gt;].&lt;br /&gt;
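&lt;br /&gt;
A minimal sketch, assuming ClusterManagers.jl is installed and the script runs inside a Slurm allocation (so that SLURM_NTASKS is set):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using Distributed, ClusterManagers&lt;br /&gt;
# spawn one worker per task of the current Slurm allocation&lt;br /&gt;
addprocs_slurm(parse(Int, ENV[&amp;quot;SLURM_NTASKS&amp;quot;]))&lt;br /&gt;
@everywhere println(&amp;quot;hello from worker $(myid())&amp;quot;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;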
&lt;br /&gt;
== MPI ==&lt;br /&gt;
&lt;br /&gt;
Distributed computing using MPI can be performed with the [https://github.com/JuliaParallel/MPI.jl &amp;lt;code&amp;gt;MPI.jl&amp;lt;/code&amp;gt;] package, which provides Julia wrappers for most of the standard MPI functions.&lt;br /&gt;
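&lt;br /&gt;
A minimal MPI hello-world sketch (the file name hello.jl and the launcher invocation are illustrative; MPI.jl must be installed in the active environment):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# hello.jl -- launch e.g. with: srun -n 4 julia --project=. hello.jl&lt;br /&gt;
using MPI&lt;br /&gt;
MPI.Init()&lt;br /&gt;
comm = MPI.COMM_WORLD&lt;br /&gt;
println(&amp;quot;rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))&amp;quot;)&lt;br /&gt;
MPI.Finalize()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;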
&lt;br /&gt;
== CUDA ==&lt;br /&gt;
Julia supports computations on NVIDIA GPUs using the [https://cuda.juliagpu.org/stable/ CUDA.jl] package. It lets you write your own kernels and provides wrappers for libraries like cuBLAS and cuFFT, which contain implementations of standard numerical routines optimized for GPUs.&lt;br /&gt;
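&lt;br /&gt;
A minimal sketch of array-based GPU computing (assuming CUDA.jl is installed and a GPU was requested for the job):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using CUDA&lt;br /&gt;
a = CUDA.rand(1024)    # array allocated on the GPU&lt;br /&gt;
b = 2 .* a .+ 1        # broadcast compiles to a GPU kernel&lt;br /&gt;
sum(b)                 # reduction executed on the device&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;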
&lt;br /&gt;
== Higher Level Packages ==&lt;br /&gt;
&lt;br /&gt;
There are several Julia packages that allow mixing or changing the different parallel computing paradigms with minimal code changes:&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/JuliaFolds2/FLoops.jl &amp;lt;code&amp;gt;FLoops.jl&amp;lt;/code&amp;gt;] and its backend [https://juliafolds2.github.io/Folds.jl/dev/ &amp;lt;code&amp;gt;Folds.jl&amp;lt;/code&amp;gt;] &lt;br /&gt;
* [https://juliaparallel.org/Dagger.jl/stable/ Dagger.jl] (still quite experimental)&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=15544</id>
		<title>Development/Julia</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=15544"/>
		<updated>2025-12-02T08:18:53Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Julia is a high-level, high-performance, dynamic programming language designed with scientific computing in mind. Parallel programming features such as multi-threading are included in the core language, and there are also packages leveraging the power of MPI and CUDA.&lt;br /&gt;
&lt;br /&gt;
No packages are preinstalled besides the Julia language core; please use the Julia package manager to install any required Julia package.&lt;br /&gt;
&lt;br /&gt;
The Julia modules on JUSTUS 2 and bwUniCluster 3.0 load suitable versions of CUDA and OpenMPI, and the corresponding Julia packages CUDA.jl and MPI.jl are automatically configured to use these libraries once installed by the user. Any changes, such as loading modules with different MPI and/or CUDA versions or using the versions that ship as Julia artifacts, are likely to lead to errors.&lt;br /&gt;
&lt;br /&gt;
== Availability ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 and JUSTUS 2, Julia is available as a module. Check &amp;lt;code&amp;gt;module avail math/julia&amp;lt;/code&amp;gt; for the provided versions. If there is no suitable version, you can install Julia in your home directory using the [https://julialang.org/install/ JuliaUP] installer.&lt;br /&gt;
&lt;br /&gt;
== Environments and Package Installation ==&lt;br /&gt;
&lt;br /&gt;
It is highly recommended to use a separate Julia environment for every project. If Julia is started with the option &amp;lt;code&amp;gt;--project=.&amp;lt;/code&amp;gt;, the current folder is used as the environment, and the &amp;lt;code&amp;gt;Project.toml&amp;lt;/code&amp;gt; file containing the information on the installed packages is created if not yet present.&lt;br /&gt;
&lt;br /&gt;
In an interactive Julia session, the [https://pkgdocs.julialang.org/v1/getting-started/#Basic-Usage package manager] is activated by entering &amp;lt;code&amp;gt;]&amp;lt;/code&amp;gt;. The most important commands are:&lt;br /&gt;
* &amp;lt;code&amp;gt;add PACKAGENAME&amp;lt;/code&amp;gt;: install the package PACKAGENAME in the current environment&lt;br /&gt;
* &amp;lt;code&amp;gt;instantiate&amp;lt;/code&amp;gt;: install all packages with dependencies as stated in Project.toml and Manifest.toml, e.g. after copying existing code to the cluster&lt;br /&gt;
* &amp;lt;code&amp;gt;activate PATH_TO_ENV&amp;lt;/code&amp;gt;: use the environment located at the path &amp;lt;code&amp;gt;PATH_TO_ENV&amp;lt;/code&amp;gt; and initialize it if necessary.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Interactive Example ==&lt;br /&gt;
&lt;br /&gt;
Load the Julia module and start an interactive REPL session with 8 threads, using the environment in the current directory:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load math/julia&lt;br /&gt;
$ julia -t 8 --project=.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Enter &#039;]&#039; to switch to the package manager and install the package [https://github.com/JuliaPlots/UnicodePlots.jl?tab=readme-ov-file &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt;]:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
add UnicodePlots&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Leave the package manager with the backspace key.&lt;br /&gt;
&lt;br /&gt;
Create a vector with 64 elements set to 0 and fill it, using all 8 threads, with the corresponding thread ID:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
vec = zeros(64)&lt;br /&gt;
Threads.@threads for i in eachindex(vec)&lt;br /&gt;
    vec[i]= Threads.threadid()&lt;br /&gt;
end&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Load the &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt; package and draw a scatter plot of the contents of &amp;lt;code&amp;gt;vec&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using UnicodePlots&lt;br /&gt;
scatterplot(vec)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Further documentation ==&lt;br /&gt;
&lt;br /&gt;
* [https://modernjuliaworkflows.org Modern Julia Workflows]: A collection of best practices &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/carstenbauer/JuliaHLRS25 Julia Workshop at HLRS]: The material of this workshop is in large parts also valid for the bwHPC clusters (on bwUniCluster and JUSTUS2 you only need the module math/julia).&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/JuliaPDE/SurveyofPDEPackages Survey of PDE packages]&lt;br /&gt;
&lt;br /&gt;
* [https://book.sciml.ai/ Parallel Computing and Scientific Machine Learning (SciML): Methods and Applications ]&lt;br /&gt;
&lt;br /&gt;
== Tips &amp;amp; Tricks ==&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Julia/Parallel_Programming|Parallel Programming]]&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=15543</id>
		<title>Development/Julia</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=15543"/>
		<updated>2025-12-02T08:14:59Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Julia is a high-level, high-performance, dynamic programming language designed with scientific computing in mind. Parallel programming features such as multi-threading are included in the core language, and there are also packages leveraging the power of MPI and CUDA.&lt;br /&gt;
&lt;br /&gt;
No packages are preinstalled besides the Julia language core; please use the Julia package manager to install any required Julia package.&lt;br /&gt;
&lt;br /&gt;
The Julia module on JUSTUS loads suitable versions of CUDA and OpenMPI, and the corresponding Julia packages CUDA.jl and MPI.jl are automatically configured to use these libraries once installed by the user. Any changes, such as loading modules with different MPI and/or CUDA versions or using the versions that ship as Julia artifacts, are likely to lead to errors.&lt;br /&gt;
&lt;br /&gt;
== Availability ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 and JUSTUS 2, Julia is available as a module. Check &amp;lt;code&amp;gt;module avail math/julia&amp;lt;/code&amp;gt; for the provided versions. If there is no suitable version, you can install Julia in your home directory using the [https://julialang.org/install/ JuliaUP] installer.&lt;br /&gt;
&lt;br /&gt;
== Environments and Package Installation ==&lt;br /&gt;
&lt;br /&gt;
It is highly recommended to use a separate Julia environment for every project. If Julia is started with the option &amp;lt;code&amp;gt;--project=.&amp;lt;/code&amp;gt;, the current folder is used as the environment, and the &amp;lt;code&amp;gt;Project.toml&amp;lt;/code&amp;gt; file containing the information on the installed packages is created if not yet present.&lt;br /&gt;
&lt;br /&gt;
In an interactive Julia session, the [https://pkgdocs.julialang.org/v1/getting-started/#Basic-Usage package manager] is activated by entering &amp;lt;code&amp;gt;]&amp;lt;/code&amp;gt;. The most important commands are:&lt;br /&gt;
* &amp;lt;code&amp;gt;add PACKAGENAME&amp;lt;/code&amp;gt;: install the package PACKAGENAME in the current environment&lt;br /&gt;
* &amp;lt;code&amp;gt;instantiate&amp;lt;/code&amp;gt;: install all packages with dependencies as stated in Project.toml and Manifest.toml, e.g. after copying existing code to the cluster&lt;br /&gt;
* &amp;lt;code&amp;gt;activate PATH_TO_ENV&amp;lt;/code&amp;gt;: use the environment located at the path &amp;lt;code&amp;gt;PATH_TO_ENV&amp;lt;/code&amp;gt; and initialize it if necessary.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Interactive Example ==&lt;br /&gt;
&lt;br /&gt;
Load the Julia module and start an interactive REPL session with 8 threads, using the environment in the current directory:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load math/julia&lt;br /&gt;
$ julia -t 8 --project=.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Enter &#039;]&#039; to switch to the package manager and install the package [https://github.com/JuliaPlots/UnicodePlots.jl?tab=readme-ov-file &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt;]:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
add UnicodePlots&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Leave the package manager with the backspace key.&lt;br /&gt;
&lt;br /&gt;
Create a vector with 64 elements set to 0 and fill it, using all 8 threads, with the corresponding thread ID:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
vec = zeros(64)&lt;br /&gt;
Threads.@threads for i in eachindex(vec)&lt;br /&gt;
    vec[i]= Threads.threadid()&lt;br /&gt;
end&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Load the &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt; package and draw a scatter plot of the contents of &amp;lt;code&amp;gt;vec&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using UnicodePlots&lt;br /&gt;
scatterplot(vec)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Further documentation ==&lt;br /&gt;
&lt;br /&gt;
* [https://modernjuliaworkflows.org Modern Julia Workflows]: A collection of best practices &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/carstenbauer/JuliaHLRS25 Julia Workshop at HLRS]: The material of this workshop is in large parts also valid for the bwHPC clusters (on bwUniCluster and JUSTUS2 you only need the module math/julia).&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/JuliaPDE/SurveyofPDEPackages Survey of PDE packages]&lt;br /&gt;
&lt;br /&gt;
* [https://book.sciml.ai/ Parallel Computing and Scientific Machine Learning (SciML): Methods and Applications ]&lt;br /&gt;
&lt;br /&gt;
== Tips &amp;amp; Tricks ==&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Julia/Parallel_Programming|Parallel Programming]]&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=15542</id>
		<title>Development/Julia</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=15542"/>
		<updated>2025-12-02T07:59:12Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Further documentation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Julia is a high-level, high-performance, dynamic programming language designed with scientific computing in mind. Parallel programming features such as multi-threading are included in the core language, and there are also packages leveraging the power of MPI and CUDA.&lt;br /&gt;
&lt;br /&gt;
No packages are preinstalled besides the Julia language core; please use the Julia package manager to install any required Julia package.&lt;br /&gt;
&lt;br /&gt;
The Julia module on JUSTUS loads suitable versions of CUDA and OpenMPI, and the corresponding Julia packages CUDA.jl and MPI.jl are automatically configured to use these libraries once installed by the user. Any changes, such as loading modules with different MPI and/or CUDA versions or using the versions that ship as Julia artifacts, are likely to lead to errors.&lt;br /&gt;
&lt;br /&gt;
== Availability ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 and JUSTUS 2, Julia is available as a module. Check &amp;lt;code&amp;gt;module avail math/julia&amp;lt;/code&amp;gt; for the provided versions. If there is no suitable version, you can install Julia in your home directory using the [https://julialang.org/install/ JuliaUP] installer.&lt;br /&gt;
&lt;br /&gt;
== Environments and Package Installation ==&lt;br /&gt;
&lt;br /&gt;
It is highly recommended to use a separate Julia environment for every project. If Julia is started with the option &amp;lt;code&amp;gt;--project=.&amp;lt;/code&amp;gt;, the current folder is used as the environment, and the &amp;lt;code&amp;gt;Project.toml&amp;lt;/code&amp;gt; file containing the information on the installed packages is created if not yet present.&lt;br /&gt;
&lt;br /&gt;
In an interactive Julia session, the [https://pkgdocs.julialang.org/v1/getting-started/#Basic-Usage package manager] is activated by entering &amp;lt;code&amp;gt;]&amp;lt;/code&amp;gt;. The most important commands are:&lt;br /&gt;
* &amp;lt;code&amp;gt;add PACKAGENAME&amp;lt;/code&amp;gt;: install the package PACKAGENAME in the current environment&lt;br /&gt;
* &amp;lt;code&amp;gt;instantiate&amp;lt;/code&amp;gt;: install all packages with dependencies as stated in Project.toml and Manifest.toml, e.g. after copying existing code to the cluster&lt;br /&gt;
* &amp;lt;code&amp;gt;activate PATH_TO_ENV&amp;lt;/code&amp;gt;: use the environment located at the path &amp;lt;code&amp;gt;PATH_TO_ENV&amp;lt;/code&amp;gt; and initialize it if necessary.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Interactive Example ==&lt;br /&gt;
&lt;br /&gt;
Load the Julia module and start an interactive REPL session with 8 threads, using the environment in the current directory:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load math/julia&lt;br /&gt;
$ julia -t 8 --project=.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Enter &#039;]&#039; to switch to the package manager and install the package [https://github.com/JuliaPlots/UnicodePlots.jl?tab=readme-ov-file &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt;]:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
add UnicodePlots&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Leave the package manager with the backspace key.&lt;br /&gt;
&lt;br /&gt;
Create a vector with 64 elements set to 0 and fill it, using all 8 threads, with the corresponding thread ID:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
vec = zeros(64)&lt;br /&gt;
Threads.@threads for i in eachindex(vec)&lt;br /&gt;
    vec[i]= Threads.threadid()&lt;br /&gt;
end&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Load the &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt; package and draw a scatter plot of the contents of &amp;lt;code&amp;gt;vec&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using UnicodePlots&lt;br /&gt;
scatterplot(vec)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Further documentation ==&lt;br /&gt;
&lt;br /&gt;
* [https://modernjuliaworkflows.org Modern Julia Workflows]: A collection of best practices &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/carstenbauer/JuliaHLRS25 Julia Workshop at HLRS]: The material of this workshop is in large parts also valid for the bwHPC clusters (on bwUniCluster and JUSTUS2 you only need the module math/julia).&lt;br /&gt;
&lt;br /&gt;
== Tips &amp;amp; Tricks ==&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Julia/Parallel_Programming|Parallel Programming]]&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=15383</id>
		<title>JUSTUS2/Jobscripts: Running Your Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=15383"/>
		<updated>2025-11-06T07:41:00Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Monitoring a Started Job */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Justus2}}&lt;br /&gt;
&lt;br /&gt;
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. &lt;br /&gt;
&lt;br /&gt;
= JUSTUS 2 Slurm Howto =&lt;br /&gt;
&lt;br /&gt;
This page presents only a basic introduction.&lt;br /&gt;
&lt;br /&gt;
Please see the &#039;&#039;&#039;[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]&#039;&#039;&#039; for many more examples and commands for common tasks, or the official Slurm documentation.&lt;br /&gt;
&lt;br /&gt;
= Slurm Command Overview =&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs &lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job&lt;br /&gt;
|- &lt;br /&gt;
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job&lt;br /&gt;
|- &lt;br /&gt;
| seff  || Shows the &amp;quot;job efficiency&amp;quot; of a job after it has finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Submitting Jobs on the bwForCluster JUSTUS 2 =&lt;br /&gt;
Batch jobs are submitted with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job script contains options for Slurm in lines beginning with #SBATCH, as well as the commands you want to execute on the compute nodes. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=06:00:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can override options from the script on the command-line:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch --time=00:14:00 &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: --time=00:14:00 should start your job very quickly, see [[#Testing Your Jobs]].&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:12px; background:#cef2e0;  text-align:left&amp;quot; |&lt;br /&gt;
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. &amp;lt;br&amp;gt; To find them, e.g. for LAMMPS: &amp;lt;code&amp;gt;module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la&amp;lt;/code&amp;gt;.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== File Access ==&lt;br /&gt;
&lt;br /&gt;
Note: &amp;lt;font color=&amp;quot;red&amp;quot;&amp;gt; Compute jobs must not write/read temporary files, such as calculation swap files, to/from the [[JUSTUS2/Hardware#Storage_Architecture|global file systems]] (HOME and WORK). &amp;lt;/font&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Use local storage for this purpose: /tmp in the ramdisk for small files, or /scratch on disk (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVMe]]).&lt;br /&gt;
&lt;br /&gt;
Often, you must configure the program you are using so that it does not write its temporary files to the global file systems.&lt;br /&gt;
If the program looks for files in the current directory, you must copy these files to a temporary directory, start the program there, and copy/save the results of the calculation at the end. The contents of /tmp and /scratch are deleted by the automated cleanup that runs after the job.&lt;br /&gt;
&lt;br /&gt;
Each node has a file system in memory (“ram disk”) that can occupy at most half of the total RAM. Note that the files created there plus the memory requirements of your job must fit into the total memory.&lt;br /&gt;
&lt;br /&gt;
There are more diskless nodes than nodes with scratch disks, so if your job can run on a diskless node (i.e. without requesting scratch), you should choose this option.&lt;br /&gt;
&lt;br /&gt;
Example job script requesting 700 GB of disk space and copying files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
#SBATCH --gres=scratch:700 &lt;br /&gt;
mkdir -p $SCRATCH/mycalculation&lt;br /&gt;
# copy input file&lt;br /&gt;
cp $HOME/inputfiles/myinput.inp $SCRATCH/mycalculation&lt;br /&gt;
# switch directory&lt;br /&gt;
cd $SCRATCH/mycalculation&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
myprogram --input=$SCRATCH/mycalculation/myinput.inp&lt;br /&gt;
# calculation ends&lt;br /&gt;
# copy result&lt;br /&gt;
cp outfile.out results2.txt $HOME/resultdir/job12345&lt;br /&gt;
# clean up&lt;br /&gt;
rm myinput.inp outfile.out results2.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are many files or you don&#039;t know exactly what the output files are named, you can simply create a tar archive of the whole directory (in HOME) instead of using cp:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
tar -cvzf $HOME/resultdir/mycalculation-${SLURM_JOB_ID}.tgz .  # archive the current directory&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
$SLURM_JOB_ID contains the Slurm job ID during the run on the node, which ensures that the filename is unique.&lt;br /&gt;
&lt;br /&gt;
== Resource Requests ==&lt;br /&gt;
&lt;br /&gt;
Important resource request options for the Slurm command sbatch are:&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !!  Slurm (sbatch)&lt;br /&gt;
|-&lt;br /&gt;
| #SBATCH|| Script directive&lt;br /&gt;
|-&lt;br /&gt;
| --time=&amp;lt;hh:mm:ss&amp;gt; (-t &amp;lt;hh:mm:ss&amp;gt;)|| Wall time limit&lt;br /&gt;
|-&lt;br /&gt;
| --job-name=&amp;lt;name&amp;gt;  (-J &amp;lt;name&amp;gt;)|| Job name&lt;br /&gt;
|-&lt;br /&gt;
| --nodes=&amp;lt;count&amp;gt; (-N &amp;lt;count&amp;gt;)|| Node count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks=&amp;lt;count&amp;gt; (-n &amp;lt;count&amp;gt;)|| Core count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks-per-node=&amp;lt;count&amp;gt;|| Process count per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem=&amp;lt;limit&amp;gt;|| Memory limit per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem-per-cpu=&amp;lt;limit&amp;gt;|| Memory limit per process&lt;br /&gt;
|-&lt;br /&gt;
| --gres=gpu:&amp;lt;count&amp;gt;|| GPU count (gres = &amp;quot;generic resource&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| --gres=scratch:&amp;lt;count&amp;gt; || Disk space of &amp;lt;count&amp;gt; GB per requested task&lt;br /&gt;
|-&lt;br /&gt;
| --exclusive|| Node exclusive job&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Nodes and Cores&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm provides a number of options to request nodes and cores.&lt;br /&gt;
Typically, using &amp;lt;code&amp;gt;--nodes=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--ntasks-per-node=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; should work for all your jobs. For single-core jobs, it is sufficient to use the option &amp;lt;code&amp;gt;--ntasks=1&amp;lt;/code&amp;gt;. Specifying only &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; may lead to Slurm trying to distribute tasks over more than one node even if you requested a small number of cores.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Memory&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Memory can be requested with either the option &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per node) or &amp;lt;code&amp;gt;--mem-per-cpu=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per process). When looking up the maximum available memory for a certain node type, subtract about 5 GB for the operating system. Specify the memory limit as a value-unit pair, for example 500mb or 8gb.&lt;br /&gt;
&lt;br /&gt;
In most cases it is preferable to use the &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; option.&lt;br /&gt;
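&lt;br /&gt;
As a sketch, the following two fragments request the same 8 GB total for a four-task, single-node job (with one core per task), once per node and once per process; the values are illustrative:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks-per-node=4&lt;br /&gt;
#SBATCH --mem=8gb          # 8 GB for the whole node&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks-per-node=4&lt;br /&gt;
#SBATCH --mem-per-cpu=2gb  # 2 GB per process, 4 x 2 = 8 GB in total&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;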
&lt;br /&gt;
&#039;&#039;&#039;GPUs&#039;&#039;&#039; and &#039;&#039;&#039;Scratch&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
These are requested as &amp;quot;generic resources&amp;quot; with &amp;lt;code&amp;gt;--gres=gpu:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--gres=scratch:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
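&lt;br /&gt;
For example, a sketch of a job requesting one GPU and 100 GB of local scratch (multiple generic resources are combined in one comma-separated list):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --gres=gpu:1,scratch:100&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;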
&lt;br /&gt;
=== Default Values ===&lt;br /&gt;
Some values will be set by default if you do not specify them for your job. &lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !! Equivalent To !! Meaning&lt;br /&gt;
|-&lt;br /&gt;
|Runtime: || --time=02:00:00 || 2 hours&lt;br /&gt;
|-&lt;br /&gt;
|Nodes:  ||--nodes=1  ||one node&lt;br /&gt;
|-&lt;br /&gt;
|Tasks: || --ntasks-per-node=1  ||one task per node&lt;br /&gt;
|-&lt;br /&gt;
|Cores: || --cpus-per-task=1 ||one core per task&lt;br /&gt;
|-&lt;br /&gt;
|Memory: || --mem-per-cpu=2gb  ||2 GB per core&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==  &amp;quot;Exclusive User&amp;quot; Node Access Policy ==&lt;br /&gt;
&lt;br /&gt;
Nodes are exclusively allocated to one single user. However, multiple jobs (up to 48) from the same user can share a node.&lt;br /&gt;
&lt;br /&gt;
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each. &lt;br /&gt;
&lt;br /&gt;
The same applies to memory requests (see below).&lt;br /&gt;
&lt;br /&gt;
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
== Memory Limits ==&lt;br /&gt;
&lt;br /&gt;
The wait time of a job also depends largely on the amount of requested resources and the available number of nodes providing this amount of resources. This must be taken into account in particular when requesting a certain amount of memory.&lt;br /&gt;
&lt;br /&gt;
For example, a node with 192 GB RAM can only run jobs that request up to 187 GB of memory. The remaining amount is reserved for the operating system, system services and local file systems.&lt;br /&gt;
This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb or --ntasks-per-node=48 and --mem-per-cpu=4gb), it cannot run on one of the 456 &amp;quot;small&amp;quot; nodes but only on one of the &amp;quot;medium&amp;quot;, &amp;quot;large&amp;quot; or &amp;quot;fat&amp;quot; nodes. Unnecessarily limiting your jobs to a subset of nodes will increase your wait time and the wait time of others who actually need that amount of memory.&lt;br /&gt;
&lt;br /&gt;
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:&lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Node type !! Physical RAM on node !! Available RAM on node !! Number of suitable nodes&lt;br /&gt;
|-&lt;br /&gt;
|small|| 192 GB || 187 GB || 692 &lt;br /&gt;
|-&lt;br /&gt;
|medium|| 384 GB || 376 GB || 220&lt;br /&gt;
|-&lt;br /&gt;
|large|| 768 GB || 754 GB || 28&lt;br /&gt;
|-&lt;br /&gt;
|fat|| 1536 GB || 1510 GB || 8&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.&lt;br /&gt;
&lt;br /&gt;
= Testing Your Jobs = &lt;br /&gt;
&lt;br /&gt;
JUSTUS 2 has three compute nodes reserved for jobs with a walltime under 15 minutes. You can test whether your jobs start properly by specifying a short walltime, e.g. --time=00:14:00; your job should then start very quickly.&lt;br /&gt;
&lt;br /&gt;
= Monitoring Your Jobs =&lt;br /&gt;
&lt;br /&gt;
Always test things first with a few jobs before you roll out hundreds of jobs!&lt;br /&gt;
&lt;br /&gt;
Please check at a minimum:&lt;br /&gt;
* Are my jobs using the number of cores I requested?&lt;br /&gt;
* Is my job using close to the amount of memory I requested?&lt;br /&gt;
&lt;br /&gt;
If you are running more than 1-10 jobs, also check:&lt;br /&gt;
&lt;br /&gt;
* Are my jobs running for at least 10 minutes?&lt;br /&gt;
* Do my jobs scale reasonably well? &amp;amp;rarr; [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== squeue ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;| Do not run squeue and other Slurm commands in loops or with &amp;quot;watch&amp;quot;, so as not to saturate the Slurm daemon with RPC requests.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
After you submitted the job, you can see it waiting using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
(also read the man page with &amp;lt;code&amp;gt;man squeue&amp;lt;/code&amp;gt; for more information on how to use the command)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
             6260301  standard r_60_b_2 ul_yxz1 PD       0:00      1 (AssocGrpMemRunMinutes)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output shows:&lt;br /&gt;
* JOBID: a unique number your job gets&lt;br /&gt;
* PARTITION: the cluster can be divided into different types of nodes&lt;br /&gt;
* NAME: the name you gave your job with the --job-name= option&lt;br /&gt;
* USER: your username&lt;br /&gt;
* ST: the state the job is in. R = running, PD = pending, CD = completed. See the man page for a full list of states.&lt;br /&gt;
* TIME: how long the job has been running&lt;br /&gt;
* NODES: how many nodes were requested&lt;br /&gt;
* NODELIST(REASON): either the node(s) the job is running on, or the reason why it hasn&#039;t started&lt;br /&gt;
&lt;br /&gt;
== scontrol ==&lt;br /&gt;
&lt;br /&gt;
You can show more information on one specific running job using the &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; command, e.g. for the job with ID 6260301 listed above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show job 6260301&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for the job with JobID 6260301.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show jobs&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for all your jobs.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol write batch_script 6260301 -&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays the job script of a running job. The &amp;quot;-&amp;quot; is a special filename which means &amp;quot;write to the terminal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
This does not work for jobs that have already completed. If enabled in Slurm, you can view those job scripts with &amp;lt;code&amp;gt;sacct -B -j 6260301&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Monitoring a Started Job ==&lt;br /&gt;
&lt;br /&gt;
After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; ssh n0603 &lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To get a live overview of the current resource usage on the node, use the command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;htop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
On the GPU nodes, the usage of the GPU(s) can be visualized using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;nvtop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Furthermore, we provide the tool jobreport (only on the login nodes), which generates plots of the resource usage over time for a given job.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
creates an HTML file with these plots in the current directory. For convenience, the report can alternatively be sent by email using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport -E max.mustermann@uni-ulm.de 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Partitions =&lt;br /&gt;
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. amount of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users &#039;&#039;&#039;should not&#039;&#039;&#039; specify any partition &amp;quot;-p, --partition=&amp;lt;partition_name&amp;gt;&amp;quot; on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.&lt;br /&gt;
&lt;br /&gt;
= Job Priorities =&lt;br /&gt;
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors ]:&lt;br /&gt;
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.&lt;br /&gt;
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Notes:&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.  &lt;br /&gt;
&lt;br /&gt;
Fairshare does &#039;&#039;&#039;not&#039;&#039;&#039; introduce a fixed allotment, in the sense that a user&#039;s ability to run new jobs would be cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources over the long term, which would be unfair to groups that have not used their fair share for quite some time.&lt;br /&gt;
&lt;br /&gt;
Slurm features &#039;&#039;&#039;backfilling&#039;&#039;&#039;, meaning that the scheduler will start lower priority jobs if doing so does not delay the expected start time of &#039;&#039;&#039;any&#039;&#039;&#039; higher priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=161 video]&#039;&#039;&#039; gives an illustrative description to how backfilling works.&lt;br /&gt;
&lt;br /&gt;
In summary, an approximate model of Slurm&#039;s behavior for scheduling jobs is this:&lt;br /&gt;
&lt;br /&gt;
* Step 1: Can the job in position one (highest priority) start now?&lt;br /&gt;
* Step 2: If it can, remove it from the queue, start it and continue with step 1.&lt;br /&gt;
* Step 3: If it cannot, look at the next job.&lt;br /&gt;
* Step 4: Can it start now, without delaying the start time of any job before it in the queue?&lt;br /&gt;
* Step 5: If it can, remove it from the queue, start it, recalculate which nodes are free, look at the next job and continue with step 4.&lt;br /&gt;
* Step 6: If it cannot, look at the next job and continue with step 4.&lt;br /&gt;
&lt;br /&gt;
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.&lt;br /&gt;
&lt;br /&gt;
= Usage Limits/Throttling Policies =&lt;br /&gt;
&lt;br /&gt;
While the fairshare factor ensures a fair long-term balance of resource utilization between users and groups, additional usage limits constrain the total resources a single user can occupy at any given time. This prevents individual users from monopolizing large fractions of the whole cluster system in the short term.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum walltime&#039;&#039;&#039; for a job is &#039;&#039;&#039;14 days&#039;&#039;&#039; (336 hours)&lt;br /&gt;
  --time=336:00:00 or --time=14-0&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum amount of cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920&#039;&#039;&#039; per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit for allocated memory also applies. If this limit is reached, new jobs will be queued (with REASON: AssocGrpCpuLimit) but only allowed to run after resources have been relinquished.&lt;br /&gt;
&lt;br /&gt;
* The maximum amount of &#039;&#039;&#039;remaining allocated core-minutes&#039;&#039;&#039; per user is &#039;&#039;&#039;3300000&#039;&#039;&#039; (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time decreases and eventually allows more jobs to start in a staggered way. This limit also &#039;&#039;&#039;correlates the maximum walltime and the amount of cores that can be allocated&#039;&#039;&#039; for this amount of time. Thus, shorter walltimes allow more resources to be allocated at a given time (capped by the maximum amount of cores limit above). Watch this &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=306 video]&#039;&#039;&#039; for an illustrative description. An equivalent limit applies to the remaining time of memory allocations, in which case jobs may be held back from starting with REASON: AssocGrpMemRunMinutes.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum amount of GPUs&#039;&#039;&#039; allocated by running jobs is &#039;&#039;&#039;8&#039;&#039;&#039; per user. If this limit is reached, new jobs will be queued (with REASON: AssocGrpGRES) but only allowed to run after GPU resources have been relinquished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Usage limits are subject to change.&lt;br /&gt;
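&lt;br /&gt;
You can check which limit is currently holding a job back by listing the REASON codes of your pending jobs. The following is a minimal sketch (the squeue format strings are just one possible choice); multiplying the core count of each running job by its remaining walltime gives a rough estimate of your remaining allocated core-minutes:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# list pending jobs with their reason code (e.g. AssocGrpCpuLimit)&lt;br /&gt;
squeue -u $USER -t PENDING -o &amp;quot;%.12i %.20j %.10T %.30r&amp;quot;&lt;br /&gt;
# list running jobs with their core count (%C) and remaining walltime (%L)&lt;br /&gt;
squeue -u $USER -t RUNNING -o &amp;quot;%.12i %.6C %.12L&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;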
&lt;br /&gt;
= Efficiency / Use Cases =&lt;br /&gt;
&lt;br /&gt;
When we speak of poor job efficiency, we usually mean that hardware resources are wasted.&lt;br /&gt;
That means a similar overall result could have been achieved using fewer hardware resources, leaving those free for other jobs and reducing the wait time for you and everyone else.&lt;br /&gt;
&lt;br /&gt;
The more resources you use, the more important efficiency becomes. If you run only 3-5 jobs that take under a day, just go ahead and choose roughly sane defaults. If you submit hundreds or thousands of jobs that will accumulate years of CPU compute time (by using many CPU cores), then think very carefully about your jobs and take some time for trial runs until you are sure your calculations run well.&lt;br /&gt;
&lt;br /&gt;
Also consider these non-technical things:&lt;br /&gt;
* Does the calculation give me all the results I need?&lt;br /&gt;
&amp;amp;rarr; rerunning calculations is extremely wasteful&lt;br /&gt;
* Am I using the most efficient algorithm?&lt;br /&gt;
&amp;amp;rarr; a better algorithm can reduce the CPU time needed by an order of magnitude or two. Sometimes this is as simple as arranging loops in a more clever way or avoiding slow storage.&lt;br /&gt;
&lt;br /&gt;
Some simple causes for poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
* using $HOME or work directories for scratch space (expressly forbidden for $HOME, discouraged for work directories, except for multi-node jobs that specifically need this for communication)&lt;br /&gt;
* poor choice of resources compared to the size of the nodes, leaving part of a node blocked but idle:&lt;br /&gt;
** the number of cores of a node is not a multiple of --ntasks-per-node (see section [[#&amp;quot;Exclusive User&amp;quot; Node Access Policy]])&lt;br /&gt;
** too much (un-needed) memory or disk space requested&lt;br /&gt;
* more cores requested than are actually used by the job&lt;br /&gt;
* more cores used for a single MPI/OpenMP parallel computation than useful&lt;br /&gt;
* many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
* one-core jobs with very different run times (because of the single-user policy)&lt;br /&gt;
* not using the full node capacity&lt;br /&gt;
* using more cores than what your computational problem can be split into &amp;amp;rarr; see [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== User-exclusive Nodes on Justus2 ==&lt;br /&gt;
&lt;br /&gt;
For several reasons, Justus2 nodes are assigned to one user exclusively. That means that you are responsible for using the full compute node efficiently, as no jobs from other users can fill gaps!&lt;br /&gt;
&lt;br /&gt;
Several key points to accomplish that:&lt;br /&gt;
&lt;br /&gt;
* Use divisors of the core count:&lt;br /&gt;
&lt;br /&gt;
The Justus2 nodes have 48 cores: two sockets with 24 cores each. Use divisors of 48 (e.g. 8) so that all cores of a node can be used. Be aware that when you choose 16, one of the jobs will be executed half on one CPU and half on the other, which might be suboptimal.&lt;br /&gt;
* Be aware of memory resources: &lt;br /&gt;
&lt;br /&gt;
When you request more memory per core than the &amp;quot;small&amp;quot; nodes on Justus2 provide per core, your jobs will not be able to use all cores on the small nodes, or will have to wait for the rarer slots on the nodes with more memory. Try to estimate your memory requirements well, and if you need more than 3.8 GB per core, consider mixing in jobs with lower memory requirements to fill the nodes.&lt;br /&gt;
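&lt;br /&gt;
For example, a full-node job that stays eligible for the small nodes could request 48 cores and a bit less than 48 * 3.8 GB of memory. A sketch of such a request header:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
# 48 cores * 3.8 GB/core = 182.4 GB; stay below that to fit the small nodes&lt;br /&gt;
#SBATCH --mem=180gb&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;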
&lt;br /&gt;
== Many One or Few-Core Jobs ==&lt;br /&gt;
&lt;br /&gt;
Jobs that use only a few CPU cores can lead to very inefficient node usage:&lt;br /&gt;
&lt;br /&gt;
# You submit 1000 jobs, each of which runs for ~30s. Every job needs up to 30s to start and finish - a huge waste if the job itself only takes 30 seconds. Additionally, starting and finishing so many jobs in a short time puts strain on the scheduler Slurm, may cause severe problems for everyone and clutters the Slurm job database.&lt;br /&gt;
# You submit many few-core jobs with very different run times. The jobs will start on many nodes, but at some point all the quicker jobs have finished their calculations and only a few remain. Because of the single-user policy on JUSTUS2, jobs of other users cannot fill in the gaps and the rest of each node sits idle.&lt;br /&gt;
&lt;br /&gt;
To address the problem, you can reduce the number of jobs and/or the number of nodes used.&lt;br /&gt;
&lt;br /&gt;
To limit the number of jobs, start many calculations within one job (addresses problems 1 and 2):&lt;br /&gt;
&lt;br /&gt;
* use a bash loop in your job script&lt;br /&gt;
* use the program GNU parallel to start the processes for you&lt;br /&gt;
&lt;br /&gt;
To limit only the number of nodes used:&lt;br /&gt;
&lt;br /&gt;
* use array jobs&lt;br /&gt;
&lt;br /&gt;
=== Bash Loop ===&lt;br /&gt;
&lt;br /&gt;
One advantage of this method is that you can run more threads than cores if your calculations are really short and do not use too much RAM. This way all cores stay busy even while many calculations are still starting up.&lt;br /&gt;
&lt;br /&gt;
It is of course even better if you can combine such short calculations, so that for 1000 calculations the kernel does not need to start 1000 processes, each of which needs to initialize everything.&lt;br /&gt;
&lt;br /&gt;
This example uses pgrep to count how many calculation processes are currently running:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}&lt;br /&gt;
do&lt;br /&gt;
  echo &amp;quot;starting up $i&amp;quot;&lt;br /&gt;
  bash my_calculation &amp;quot;$i&amp;quot; &amp;amp;&lt;br /&gt;
  # throttle: wait while 48 or more calculation processes are still running&lt;br /&gt;
  while [ &amp;quot;$(pgrep -c -f my_calculation)&amp;quot; -ge 48 ]; do echo sleeping; sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait  # wait for the remaining background processes to finish&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The same, but tracking the PIDs (process IDs) of the started processes. This is more robust, but more difficult to read:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
running_jobs=()&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}; do&lt;br /&gt;
  echo &amp;quot;Starting job $i&amp;quot;&lt;br /&gt;
  sleep &amp;quot;$i&amp;quot; &amp;amp;  # placeholder for the real calculation&lt;br /&gt;
  running_jobs+=($!)  # Track PID&lt;br /&gt;
&lt;br /&gt;
  while [ &amp;quot;${#running_jobs[@]}&amp;quot; -ge 48 ]; do&lt;br /&gt;
    sleep 5 # adjust duration depending on your runtime&lt;br /&gt;
    echo running_jobs: ${running_jobs[@]} &lt;br /&gt;
    echo pid-out: $(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null | xargs)&lt;br /&gt;
    echo -----&lt;br /&gt;
    running_jobs=($(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null))  # Remove finished jobs&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
wait  # Ensure all jobs complete&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may not be able to just use an index number &amp;quot;i&amp;quot; to start many calculations. In this case, for a moderate number of files, the for loop can be used to read in config files. Here is just the general idea for the loop, without the throttling shown above:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
This loops over all files in the directory config-1980-03-01_1/ and passes each of them as an input file to &amp;quot;mycalculation&amp;quot; via a hypothetical &amp;quot;-config&amp;quot; option. Adding a date to the config directories (and outputs) would enable you to track different runs in your lab journal more easily.&lt;br /&gt;
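&lt;br /&gt;
Combined with the throttling pattern from above, a sketch of the full loop might look like this (mycalculation and the config directory are the hypothetical examples from before):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot; &amp;amp;&lt;br /&gt;
  # throttle to at most 48 concurrent calculations&lt;br /&gt;
  while [ &amp;quot;$(pgrep -c -f mycalculation)&amp;quot; -ge 48 ]; do sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;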
&lt;br /&gt;
=== GNU Parallel ===&lt;br /&gt;
&lt;br /&gt;
GNU Parallel is available on the HPC cluster and comes with its own set of examples. You can access them like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ module load system/parallel&lt;br /&gt;
$ cp $PARALLEL_EXA_DIR/parallel.slurm .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
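&lt;br /&gt;
As a minimal sketch (reusing the hypothetical my_calculation script from above), the body of a job script could then be as simple as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
module load system/parallel&lt;br /&gt;
# run my_calculation for the arguments 1..200, at most 48 at a time&lt;br /&gt;
parallel -j 48 bash my_calculation ::: {1..200}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;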
=== Array Jobs ===&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;$ sbatch -a 1-500%48 batch_script&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will submit 500 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 500, but will limit the number of simultaneously running tasks from this job array to 48 (the number of cores on a Justus2 node).&lt;br /&gt;
&lt;br /&gt;
The same can be done inside the job script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Number of cores per individual array task&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --array=1-500%48&lt;br /&gt;
#SBATCH --mem=3G&lt;br /&gt;
#SBATCH --time=1:10:00&lt;br /&gt;
#SBATCH --job-name=array_job&lt;br /&gt;
#SBATCH --output=array_job-%A_%a.out&lt;br /&gt;
#SBATCH --error=array_job-%A_%a.err&lt;br /&gt;
 &lt;br /&gt;
# Print the task id.&lt;br /&gt;
echo &amp;quot;My SLURM_ARRAY_TASK_ID: &amp;quot; $SLURM_ARRAY_TASK_ID&lt;br /&gt;
 &lt;br /&gt;
export TIMEFORMAT=%R&lt;br /&gt;
time bash mycalculation $SLURM_ARRAY_TASK_ID&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
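&lt;br /&gt;
If your inputs are files rather than index numbers, the task id can be mapped to a file name inside the script. A sketch, reusing the hypothetical config directory from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# pick the N-th config file for this array task&lt;br /&gt;
config=$(ls config-1980-03-01_1/* | sed -n &amp;quot;${SLURM_ARRAY_TASK_ID}p&amp;quot;)&lt;br /&gt;
mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;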
&lt;br /&gt;
Also see: &lt;br /&gt;
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]]&lt;br /&gt;
* SchedMD documentation on Job Arrays: https://slurm.schedmd.com/job_array.html&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software&amp;diff=15063</id>
		<title>JUSTUS2/Software</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software&amp;diff=15063"/>
		<updated>2025-07-10T08:53:06Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Available Software */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Environment Modules ==&lt;br /&gt;
Most software is provided as Modules.&lt;br /&gt;
&lt;br /&gt;
Required reading to use: [[Environment Modules]]&lt;br /&gt;
&lt;br /&gt;
== Available Software ==&lt;br /&gt;
&lt;br /&gt;
* Web: Visit [https://www.bwhpc.de/software.php https://www.bwhpc.de/software.php], select &amp;lt;code&amp;gt;Cluster → bwForCluster JUSTUS2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* On the cluster: &amp;lt;code&amp;gt;module avail&amp;lt;/code&amp;gt;[[Environment_Modules#module_help|(→module avail)]]&lt;br /&gt;
&lt;br /&gt;
* Software in Containers: Instructions for loading software in containers: [[JUSTUS2/Software/Singularity|Singularity]]&lt;br /&gt;
* Instructions for using [[JUSTUS2/Software/Python|Python]] on JUSTUS2&lt;br /&gt;
&lt;br /&gt;
== Documentation ==&lt;br /&gt;
=== Main Documentation on The Cluster ===&lt;br /&gt;
Documentation for environment modules available on the cluster (shown for a chemistry software called &amp;quot;softname&amp;quot;):&lt;br /&gt;
&lt;br /&gt;
* with command &amp;lt;code&amp;gt;module help chem/softname&amp;lt;/code&amp;gt; [[Environment_Modules#module_help|(→module help)]]&lt;br /&gt;
* examples in &amp;lt;code&amp;gt;$SOFTNAME_EXA_DIR&amp;lt;/code&amp;gt; [[Environment_Modules#Software_job_examples|(→job examples)]]&lt;br /&gt;
&lt;br /&gt;
=== Sometimes Additional Documentation in the Wiki ===&lt;br /&gt;
For some environment modules additional documentation is provided here.&lt;br /&gt;
&amp;lt;!-- this list could be generated via {{Special:PrefixIndex/JUSTUS2/Software/|stripprefix=yes}} &lt;br /&gt;
but then these Pages become orphaned if there is no other link to them&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
* [[JUSTUS2/Software/ADF|ADF]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Dalton|Dalton]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Gaussian|Gaussian]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Gaussview|Gaussview]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Molden|Molden]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/NAMD|NAMD]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Orca|Orca]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Quantum ESPRESSO|Quantum ESPRESSO]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/SIESTA|SIESTA]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Schrodinger|Schrodinger]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Turbomole|Turbomole]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/VASP|VASP]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Julia|Julia]]&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software&amp;diff=15062</id>
		<title>JUSTUS2/Software</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software&amp;diff=15062"/>
		<updated>2025-07-10T08:52:48Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Environment Modules ==&lt;br /&gt;
Most software is provided as Modules.&lt;br /&gt;
&lt;br /&gt;
Required reading to use: [[Environment Modules]]&lt;br /&gt;
&lt;br /&gt;
== Available Software ==&lt;br /&gt;
&lt;br /&gt;
* Web: Visit [https://www.bwhpc.de/software.php https://www.bwhpc.de/software.php], select &amp;lt;code&amp;gt;Cluster → bwForCluster JUSTUS2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* On the cluster: &amp;lt;code&amp;gt;module avail&amp;lt;/code&amp;gt;[[Environment_Modules#module_help|(→module avail)]]&lt;br /&gt;
&lt;br /&gt;
* Software in Containers: Instructions for loading software in containers: [[JUSTUS2/Software/Singularity|Singularity]]&lt;br /&gt;
* Instructions for using [[JUSTUS2/Software/Python]] on JUSTUS2&lt;br /&gt;
&lt;br /&gt;
== Documentation ==&lt;br /&gt;
=== Main Documentation on The Cluster ===&lt;br /&gt;
Documentation for environment modules available on the cluster (shown for a chemistry software called &amp;quot;softname&amp;quot;):&lt;br /&gt;
&lt;br /&gt;
* with command &amp;lt;code&amp;gt;module help chem/softname&amp;lt;/code&amp;gt; [[Environment_Modules#module_help|(→module help)]]&lt;br /&gt;
* examples in &amp;lt;code&amp;gt;$SOFTNAME_EXA_DIR&amp;lt;/code&amp;gt; [[Environment_Modules#Software_job_examples|(→job examples)]]&lt;br /&gt;
&lt;br /&gt;
=== Sometimes Additional Documentation in the Wiki ===&lt;br /&gt;
For some environment modules additional documentation is provided here.&lt;br /&gt;
&amp;lt;!-- this list could be generated via {{Special:PrefixIndex/JUSTUS2/Software/|stripprefix=yes}} &lt;br /&gt;
but then these Pages become orphaned if there is no other link to them&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
* [[JUSTUS2/Software/ADF|ADF]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Dalton|Dalton]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Gaussian|Gaussian]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Gaussview|Gaussview]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Molden|Molden]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/NAMD|NAMD]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Orca|Orca]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Quantum ESPRESSO|Quantum ESPRESSO]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/SIESTA|SIESTA]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Schrodinger|Schrodinger]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Turbomole|Turbomole]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/VASP|VASP]]&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Julia|Julia]]&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15061</id>
		<title>JUSTUS2/Software/Python</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15061"/>
		<updated>2025-07-10T08:45:52Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Recommendations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page covers information on Python specific to JUSTUS2. For general information valid on all clusters see [[Development/Python]].&lt;br /&gt;
&lt;br /&gt;
=Recommendations=&lt;br /&gt;
&lt;br /&gt;
* Don&#039;t use conda (see [[#Conda|below]])&lt;br /&gt;
* Don&#039;t use the system python for computation intensive work&lt;br /&gt;
* &amp;lt;b&amp;gt;Never, ever activate a Python environment (neither venv nor conda) in your .bashrc! This may break various things.&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Use optimized numerical libraries (SciPy, NumPy) provided as [[Environment Modules|environment modules]]&lt;br /&gt;
* Use [[Development/Python#Virtual Environments (venv)|virtual environments (venv)]]&lt;br /&gt;
* Use [[Development/Python#Package Manager (pip)|pip]] for installing further packages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Always load the environment modules before activating your Python environments! &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Optimized Libraries=&lt;br /&gt;
&lt;br /&gt;
We provide versions of SciPy and NumPy that are optimized for the JUSTUS2 CPUs and make use of the highly optimized linear algebra routines provided by [[Development/MKL| Intel MKL]].&lt;br /&gt;
&lt;br /&gt;
For available versions see&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_scipy&lt;br /&gt;
&lt;br /&gt;
or if you don&#039;t need SciPy&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_numpy&lt;br /&gt;
&lt;br /&gt;
Note that each SciPy module also loads the corresponding NumPy and Python modules. Please don&#039;t try to mix with other Python versions!&lt;br /&gt;
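&lt;br /&gt;
A minimal sketch of setting up a venv on top of the module-provided libraries (the module version is omitted so the default is loaded; the venv path and package name are just examples):&lt;br /&gt;
&lt;br /&gt;
 module load numlib/python_scipy&lt;br /&gt;
 # --system-site-packages makes the optimized NumPy/SciPy visible inside the venv&lt;br /&gt;
 python3 -m venv --system-site-packages $HOME/venvs/myproject&lt;br /&gt;
 source $HOME/venvs/myproject/bin/activate&lt;br /&gt;
 pip install some_package&lt;br /&gt;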
&lt;br /&gt;
==Advanced Users: Building your own optimized libraries==&lt;br /&gt;
If you need more Python packages that depend on C/C++ or Fortran code for numerical calculations, we recommend building them manually with optimizations and linking them to the Intel MKL. How to pass the compilation/linking options depends on the package, so see its documentation; there is usually a section like “Building From Source”.&lt;br /&gt;
&lt;br /&gt;
A typical workflow might be&lt;br /&gt;
&lt;br /&gt;
 module load numlib/mkl/2024.2.1&lt;br /&gt;
 module load compiler/gnu/14.2&lt;br /&gt;
 export CFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export FFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export CXXFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 pip install --no-binary some_package some_package==VERSION BUILD_OPTIONS&lt;br /&gt;
&lt;br /&gt;
=Conda=&lt;br /&gt;
&lt;br /&gt;
There are several reasons for not using [[Conda]] on the cluster:&lt;br /&gt;
&lt;br /&gt;
* legal: unclear license situation for research with the official Anaconda channel&lt;br /&gt;
* free conda-forge channel provides mostly unoptimized packages &lt;br /&gt;
* conflicting libraries: Conda installs its own versions of low-level libraries such as OpenMPI, which do not work well together with Slurm.&lt;br /&gt;
&lt;br /&gt;
However, there might be some valid use cases:&lt;br /&gt;
&lt;br /&gt;
* some packages are only available via conda&lt;br /&gt;
* simple installation for testing some software before doing an optimized build&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15060</id>
		<title>JUSTUS2/Software/Python</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15060"/>
		<updated>2025-07-10T08:44:58Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Recommendations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page covers information on Python specific to JUSTUS2. For general information valid on all clusters see [[Development/Python]].&lt;br /&gt;
&lt;br /&gt;
=Recommendations=&lt;br /&gt;
&lt;br /&gt;
* Don&#039;t use conda (see [[#Conda|below]])&lt;br /&gt;
* Don&#039;t use the system python for computation intensive work&lt;br /&gt;
* &amp;lt;b&amp;gt;Never, ever activate a Python environment (neither venv nor conda) in your .bashrc! This may break various things.&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Use optimized numerical libraries (SciPy, NumPy) provided as [[Environment Modules|environment modules]]&lt;br /&gt;
* Use [[Development/Python#Virtual Environments (venv)|virtual environments (venv)]]&lt;br /&gt;
* Use [[Development/Python#Package Manager (pip)|pip]] for installing further packages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Always load the environment modules before activating your environments! &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Optimized Libraries=&lt;br /&gt;
&lt;br /&gt;
We provide versions of SciPy and NumPy that are optimized for the JUSTUS2 CPUs and make use of the highly optimized linear algebra routines provided by [[Development/MKL| Intel MKL]].&lt;br /&gt;
&lt;br /&gt;
For available versions see&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_scipy&lt;br /&gt;
&lt;br /&gt;
or if you don&#039;t need SciPy&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_numpy&lt;br /&gt;
&lt;br /&gt;
Note that each SciPy module also loads the corresponding NumPy and Python modules. Please don&#039;t try to mix with other Python versions!&lt;br /&gt;
&lt;br /&gt;
==Advanced Users: Building your own optimized libraries==&lt;br /&gt;
If you need more Python packages that depend on C/C++ or Fortran code for numerical calculations, we recommend building them manually with optimizations and linking them to the Intel MKL. How to pass the compilation/linking options depends on the package, so see its documentation; there is usually a section like “Building From Source”.&lt;br /&gt;
&lt;br /&gt;
A typical workflow might be&lt;br /&gt;
&lt;br /&gt;
 module load numlib/mkl/2024.2.1&lt;br /&gt;
 module load compiler/gnu/14.2&lt;br /&gt;
 export CFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export FFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export CXXFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 pip install --no-binary some_package some_package==VERSION BUILD_OPTIONS&lt;br /&gt;
&lt;br /&gt;
=Conda=&lt;br /&gt;
&lt;br /&gt;
There are several reasons for not using [[Conda]] on the cluster:&lt;br /&gt;
&lt;br /&gt;
* legal: unclear license situation for research with the official Anaconda channel&lt;br /&gt;
* free conda-forge channel provides mostly unoptimized packages &lt;br /&gt;
* conflicting libraries: Conda installs its own versions of low-level libraries such as OpenMPI, which do not work well together with Slurm.&lt;br /&gt;
&lt;br /&gt;
However, there might be some valid use cases:&lt;br /&gt;
&lt;br /&gt;
* some packages are only available via conda&lt;br /&gt;
* simple installation for testing some software before doing an optimized build&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15059</id>
		<title>JUSTUS2/Software/Python</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15059"/>
		<updated>2025-07-10T08:44:22Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Advanced Users: Building your own optimized libraries */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page covers information on Python specific to JUSTUS2. For general information valid on all clusters see [[Development/Python]].&lt;br /&gt;
&lt;br /&gt;
=Recommendations=&lt;br /&gt;
&lt;br /&gt;
* Don&#039;t use conda (see below)&lt;br /&gt;
* Don&#039;t use the system python for computation intensive work&lt;br /&gt;
* &amp;lt;b&amp;gt;Never, ever activate a Python environment (neither venv nor conda) in your .bashrc! This may break various things.&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Use optimized numerical libraries (SciPy, NumPy) provided as [[Environment Modules|environment modules]]&lt;br /&gt;
* Use [[Development/Python#Virtual Environments (venv)|virtual environments (venv)]]&lt;br /&gt;
* Use [[Development/Python#Package Manager (pip)|pip]] for installing further packages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Always load the environment modules before activating your environments! &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Optimized Libraries=&lt;br /&gt;
&lt;br /&gt;
We provide versions of SciPy and NumPy that are optimized for the JUSTUS2 CPUs and make use of the highly optimized linear algebra routines provided by [[Development/MKL| Intel MKL]].&lt;br /&gt;
&lt;br /&gt;
For available versions see&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_scipy&lt;br /&gt;
&lt;br /&gt;
or if you don&#039;t need SciPy&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_numpy&lt;br /&gt;
&lt;br /&gt;
Note that each SciPy module also loads the corresponding NumPy and Python modules. Please don&#039;t try to mix with other Python versions!&lt;br /&gt;
&lt;br /&gt;
==Advanced Users: Building your own optimized libraries==&lt;br /&gt;
If you need more Python packages that depend on C/C++ or Fortran code for numerical calculations, we recommend building them manually with optimizations and linking them to the Intel MKL. How to pass the compilation/linking options depends on the package, so see its documentation; there is usually a section like “Building From Source”.&lt;br /&gt;
&lt;br /&gt;
A typical workflow might be&lt;br /&gt;
&lt;br /&gt;
 module load numlib/mkl/2024.2.1&lt;br /&gt;
 module load compiler/gnu/14.2&lt;br /&gt;
 export CFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export FFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export CXXFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 pip install --no-binary some_package some_package==VERSION BUILD_OPTIONS&lt;br /&gt;
&lt;br /&gt;
=Conda=&lt;br /&gt;
&lt;br /&gt;
There are several reasons for not using [[Conda]] on the cluster:&lt;br /&gt;
&lt;br /&gt;
* legal: unclear license situation for research with the official Anaconda channel&lt;br /&gt;
* free conda-forge channel provides mostly unoptimized packages &lt;br /&gt;
* conflicting libraries: Conda installs its own versions of low-level libraries such as OpenMPI, which do not work well together with Slurm.&lt;br /&gt;
&lt;br /&gt;
However, there might be some valid use cases:&lt;br /&gt;
&lt;br /&gt;
* some packages are only available via conda&lt;br /&gt;
* simple installation for testing some software before doing an optimized build&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15058</id>
		<title>JUSTUS2/Software/Python</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15058"/>
		<updated>2025-07-10T08:43:08Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Advanced Users: Building your own optimized libraries */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page covers information on Python specific to JUSTUS2. For general information valid on all clusters see [[Development/Python]].&lt;br /&gt;
&lt;br /&gt;
=Recommendations=&lt;br /&gt;
&lt;br /&gt;
* Don&#039;t use conda (see below)&lt;br /&gt;
* Don&#039;t use the system python for computation intensive work&lt;br /&gt;
* &amp;lt;b&amp;gt;Never, ever activate a Python environment (neither venv nor conda) in your .bashrc! This may break various things.&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Use optimized numerical libraries (SciPy, NumPy) provided as [[Environment Modules|environment modules]]&lt;br /&gt;
* Use [[Development/Python#Virtual Environments (venv)|virtual environments (venv)]]&lt;br /&gt;
* Use [[Development/Python#Package Manager (pip)|pip]] for installing further packages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Always load the environment modules before activating your environments! &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Optimized Libraries=&lt;br /&gt;
&lt;br /&gt;
We provide versions of SciPy and NumPy that are optimized for the JUSTUS2 CPUs and make use of the highly optimized linear algebra routines provided by [[Development/MKL| Intel MKL]].&lt;br /&gt;
&lt;br /&gt;
For available versions see&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_scipy&lt;br /&gt;
&lt;br /&gt;
or if you don&#039;t need SciPy&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_numpy&lt;br /&gt;
&lt;br /&gt;
Note that each SciPy module also loads the corresponding NumPy and Python modules. Please don&#039;t try to mix with other Python versions!&lt;br /&gt;
&lt;br /&gt;
==Advanced Users: Building your own optimized libraries==&lt;br /&gt;
If you need more Python packages that depend on C/C++ or Fortran code for numerical calculations, we recommend building them manually with optimizations and linking them to the MKL. How to pass the compilation/linking options depends on the package, so see its documentation; there is usually a section like “Building From Source”.&lt;br /&gt;
&lt;br /&gt;
A typical workflow might be&lt;br /&gt;
&lt;br /&gt;
 module load numlib/mkl/2024.2.1&lt;br /&gt;
 module load compiler/gnu/14.2&lt;br /&gt;
 export CFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export FFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export CXXFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 pip install --no-binary my_package my_package==VERSION BUILD_OPTIONS&lt;br /&gt;
&lt;br /&gt;
=Conda=&lt;br /&gt;
&lt;br /&gt;
There are several reasons for not using [[Conda]] on the cluster:&lt;br /&gt;
&lt;br /&gt;
* legal: unclear license situation for research with the official Anaconda channel&lt;br /&gt;
* free conda-forge channel provides mostly unoptimized packages &lt;br /&gt;
* conflicting libraries: Conda installs its own versions of low-level libraries such as OpenMPI, which do not work well together with Slurm.&lt;br /&gt;
&lt;br /&gt;
However, there might be some valid use cases:&lt;br /&gt;
&lt;br /&gt;
* some packages are only available via conda&lt;br /&gt;
* simple installation for testing some software before doing an optimized build&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15054</id>
		<title>JUSTUS2/Software/Python</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15054"/>
		<updated>2025-07-09T08:52:50Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page covers information on Python specific to JUSTUS2. For general information valid on all clusters see [[Development/Python]].&lt;br /&gt;
&lt;br /&gt;
=Recommendations=&lt;br /&gt;
&lt;br /&gt;
* Don&#039;t use conda (see below)&lt;br /&gt;
* Don&#039;t use the system python for computation intensive work&lt;br /&gt;
* &amp;lt;b&amp;gt;Never, ever activate a Python environment (neither venv nor conda) in your .bashrc! This may break various things.&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Use optimized numerical libraries (SciPy, NumPy) provided as [[Environment Modules|environment modules]]&lt;br /&gt;
* Use [[Development/Python#Virtual Environments (venv)|virtual environments (venv)]]&lt;br /&gt;
* Use [[Development/Python#Package Manager (pip)|pip]] for installing further packages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Always load the environment modules before activating your environments! &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Optimized Libraries=&lt;br /&gt;
&lt;br /&gt;
We provide versions of SciPy and NumPy that are optimized for the JUSTUS2 CPUs and make use of the highly optimized linear algebra routines provided by [[Development/MKL| Intel MKL]].&lt;br /&gt;
&lt;br /&gt;
For available versions see&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_scipy&lt;br /&gt;
&lt;br /&gt;
or if you don&#039;t need SciPy&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_numpy&lt;br /&gt;
&lt;br /&gt;
Note that each SciPy module also loads the corresponding NumPy and Python modules. Please don&#039;t try to mix with other Python versions!&lt;br /&gt;
&lt;br /&gt;
==Advanced Users: Building your own optimized libraries==&lt;br /&gt;
If you need more Python packages that depend on C or Fortran code for numerical calculations, we recommend building them manually with optimizations and linking them to the MKL. How to pass the compilation/linking options depends on the package, so see its documentation; there is usually a section like “Building From Source”.&lt;br /&gt;
&lt;br /&gt;
A typical workflow might be&lt;br /&gt;
&lt;br /&gt;
 module load numlib/mkl/2024.2.1&lt;br /&gt;
 module load compiler/gnu/14.2&lt;br /&gt;
 export CFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export FFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export CXXFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 pip install --no-binary my_package my_package==VERSION BUILD_OPTIONS&lt;br /&gt;
&lt;br /&gt;
=Conda=&lt;br /&gt;
&lt;br /&gt;
There are several reasons for not using [[Conda]] on the cluster:&lt;br /&gt;
&lt;br /&gt;
* legal: unclear license situation for research with the official Anaconda channel&lt;br /&gt;
* free conda-forge channel provides mostly unoptimized packages &lt;br /&gt;
* conflicting libraries: Conda installs its own versions of low-level libraries such as OpenMPI, which do not work well together with Slurm.&lt;br /&gt;
&lt;br /&gt;
However, there might be some valid use cases:&lt;br /&gt;
&lt;br /&gt;
* some packages are only available via conda&lt;br /&gt;
* simple installation for testing some software before doing an optimized build&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15053</id>
		<title>JUSTUS2/Software/Python</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15053"/>
		<updated>2025-07-09T08:26:47Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Recommendations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page covers information on Python specific to JUSTUS2. For general information valid on all clusters see [[Development/Python]].&lt;br /&gt;
&lt;br /&gt;
=Recommendations=&lt;br /&gt;
&lt;br /&gt;
* Don&#039;t use conda (see below)&lt;br /&gt;
* Don&#039;t use the system python for computation intensive work&lt;br /&gt;
* &amp;lt;b&amp;gt;Never, ever activate a Python environment (neither venv nor conda) in your .bashrc! This may break various things.&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Use optimized numerical libraries (SciPy, NumPy) provided as [[Environment Modules|environment modules]]&lt;br /&gt;
* Use [[Development/Python#Virtual Environments (venv)|virtual environments (venv)]]&lt;br /&gt;
* Use [[Development/Python#Package Manager (pip)|pip]] for installing further packages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Always load the environment modules before activating your environments! &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Optimized Libraries=&lt;br /&gt;
&lt;br /&gt;
We provide versions of SciPy and NumPy that are optimized for the JUSTUS2 CPUs and make use of the highly optimized linear algebra routines provided by [[Development/MKL| Intel MKL]].&lt;br /&gt;
&lt;br /&gt;
For available versions see&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_scipy&lt;br /&gt;
&lt;br /&gt;
or if you don&#039;t need SciPy&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_numpy&lt;br /&gt;
&lt;br /&gt;
Note that each SciPy module also loads the corresponding NumPy and Python modules. Please don&#039;t try to mix with other Python versions!&lt;br /&gt;
&lt;br /&gt;
==Advanced Users: Building your own optimized libraries==&lt;br /&gt;
If you need more Python packages that depend on C or Fortran code for numerical calculations, we recommend building them manually with optimizations and linking them to the MKL. How to pass the compilation/linking options depends on the package, so see its documentation; there is usually a section like “Building From Source”.&lt;br /&gt;
&lt;br /&gt;
A typical workflow might be&lt;br /&gt;
&lt;br /&gt;
 module load numlib/mkl/2024.2.1&lt;br /&gt;
 module load compiler/gnu/14.2&lt;br /&gt;
 export CFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export FFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export CXXFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 pip install --no-binary my_package my_package==VERSION BUILD_OPTIONS&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15052</id>
		<title>JUSTUS2/Software/Python</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15052"/>
		<updated>2025-07-09T08:24:11Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page covers information on Python specific to JUSTUS2. For general information valid on all clusters see [[Development/Python]].&lt;br /&gt;
&lt;br /&gt;
=Recommendations=&lt;br /&gt;
&lt;br /&gt;
* Don&#039;t use conda (see below)&lt;br /&gt;
* Don&#039;t use the system python for computation intensive work&lt;br /&gt;
* &amp;lt;b&amp;gt;Never, ever activate a Python environment (neither venv nor conda) in your .bashrc! This may break various things.&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Use optimized numerical libraries (SciPy, NumPy) provided as [[Environment Modules|environment modules]]&lt;br /&gt;
* Use [[Development/Python#Virtual Environments (venv)|virtual environments (venv)]]&lt;br /&gt;
* Use [[Development/Python#Package Manager (pip)|pip]] for installing further packages&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Always load the environment modules before activating your environments! &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Optimized Libraries=&lt;br /&gt;
&lt;br /&gt;
We provide versions of SciPy and NumPy that are optimized for the JUSTUS2 CPUs and make use of the highly optimized linear algebra routines provided by [[Development/MKL| Intel MKL]].&lt;br /&gt;
&lt;br /&gt;
For available versions see&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_scipy&lt;br /&gt;
&lt;br /&gt;
or if you don&#039;t need SciPy&lt;br /&gt;
&lt;br /&gt;
 module avail numlib/python_numpy&lt;br /&gt;
&lt;br /&gt;
Note that each SciPy module also loads the corresponding NumPy and Python modules. Please don&#039;t try to mix with other Python versions!&lt;br /&gt;
&lt;br /&gt;
==Advanced Users: Building your own optimized libraries==&lt;br /&gt;
If you need more Python packages that depend on C or Fortran code for numerical calculations, we recommend building them manually with optimizations and linking them to the MKL. How to pass the compilation/linking options depends on the package, so see its documentation; there is usually a section like “Building From Source”.&lt;br /&gt;
&lt;br /&gt;
A typical workflow might be&lt;br /&gt;
&lt;br /&gt;
 module load numlib/mkl/2024.2.1&lt;br /&gt;
 module load compiler/gnu/14.2&lt;br /&gt;
 export CFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export FFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 export CXXFLAGS=&amp;quot;-O2 -march=native&amp;quot;&lt;br /&gt;
 pip install --no-binary my_package my_package==VERSION BUILD_OPTIONS&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15051</id>
		<title>JUSTUS2/Software/Python</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Python&amp;diff=15051"/>
		<updated>2025-07-09T07:49:17Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: Created page with &amp;quot;This page covers information on Python specific to JUSTUS2. For general information valid on all clusters see Development/Python.  =Recommendations=  * Don&amp;#039;t use conda (see below) * Don&amp;#039;t use the system python for computation intensive work  * Use optimized numerical libraries (SciPy, NumPy) provided as environment modules * Use virtual environments (venv) * Use pip for in...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page covers information on Python specific to JUSTUS2. For general information valid on all clusters see [[Development/Python]].&lt;br /&gt;
&lt;br /&gt;
=Recommendations=&lt;br /&gt;
&lt;br /&gt;
* Don&#039;t use conda (see below)&lt;br /&gt;
* Don&#039;t use the system python for computation intensive work&lt;br /&gt;
&lt;br /&gt;
* Use optimized numerical libraries (SciPy, NumPy) provided as environment modules&lt;br /&gt;
* Use [[Development/Python#Virtual Environments (venv)|virtual environments (venv)]]&lt;br /&gt;
* Use [[Development/Python#Package Manager (pip)|pip]] for installing further packages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Optimized Libraries=&lt;br /&gt;
&lt;br /&gt;
We provide versions of SciPy and NumPy that are optimized for the JUSTUS2 CPUs and make use of the highly optimized linear algebra routines provided by [[Development/MKL| Intel MKL]].&lt;br /&gt;
&lt;br /&gt;
For available versions see&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;module avail numlib/python_scipy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that each SciPy module also loads the corresponding NumPy and Python modules. Please don&#039;t try to mix with other python versions!&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14982</id>
		<title>JUSTUS2/Jobscripts: Running Your Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14982"/>
		<updated>2025-06-23T13:22:50Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* File Access */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Justus2}}&lt;br /&gt;
&lt;br /&gt;
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. &lt;br /&gt;
&lt;br /&gt;
= JUSTUS 2 Slurm Howto =&lt;br /&gt;
&lt;br /&gt;
This page only presents some basic introduction. &lt;br /&gt;
&lt;br /&gt;
Please see the &#039;&#039;&#039;[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]&#039;&#039;&#039; for many more examples and commands for common tasks, or the original Slurm documentation.&lt;br /&gt;
&lt;br /&gt;
= Slurm Command Overview =&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs &lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job&lt;br /&gt;
|- &lt;br /&gt;
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job&lt;br /&gt;
|- &lt;br /&gt;
| seff  || Shows the &amp;quot;job efficiency&amp;quot; of a job after it has finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Submitting Jobs on the bwForCluster JUSTUS 2 =&lt;br /&gt;
Batch jobs are submitted with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job script contains options for Slurm in lines beginning with #SBATCH as well as your commands which you want to execute on the compute nodes. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=06:00:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can override options from the script on the command-line:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch --time=00:14:00 &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: --time=00:14:00 should start your job very quickly, see [[#Testing Your Jobs]].&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:12px; background:#cef2e0;  text-align:left&amp;quot; |&lt;br /&gt;
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. &amp;lt;br&amp;gt; To find it e.g. for lammps: &amp;lt;code&amp;gt; module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la&amp;lt;/code&amp;gt;.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== File Access ==&lt;br /&gt;
&lt;br /&gt;
Note: &amp;lt;font color=&amp;quot;red&amp;quot;&amp;gt; Compute jobs must not write/read temporary files, such as calculation swap files, to/from the [[JUSTUS2/Hardware#Storage_Architecture|global file systems]] (HOME and WORK). &amp;lt;/font&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Use local storage /tmp in the ramdisk for small files or /scratch on disk (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVME]]) for this purpose.&lt;br /&gt;
&lt;br /&gt;
Often, you must configure the program you are using so that it does not write its temporary files to the global file systems.&lt;br /&gt;
If the program uses the current directory to look for files, you must copy these files to a temporary directory, start the program there, and copy/save the results of the calculation at the end. The contents of /tmp and /scratch are deleted by the automated cleanup that happens after the job.&lt;br /&gt;
&lt;br /&gt;
Each node has a file system in memory (“ram disk”) that can grow to at most half of the node&#039;s total RAM. Note that the files created there plus the memory requirements of your job need to fit into the total memory.&lt;br /&gt;
&lt;br /&gt;
There are more diskless nodes than nodes with scratch disks, so if your job can run on a diskless node, you should choose this option. &lt;br /&gt;
&lt;br /&gt;
Example job script requesting 700 GB of scratch disk space and copying files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
#SBATCH --gres=scratch:700 &lt;br /&gt;
mkdir -p $SCRATCH/mycalculation&lt;br /&gt;
# copy input file&lt;br /&gt;
cp $HOME/inputfiles/myinput.inp $SCRATCH/mycalculation&lt;br /&gt;
# switch directory&lt;br /&gt;
cd $SCRATCH/mycalculation&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
myprogram --input=$SCRATCH/mycalculation/myinput.inp&lt;br /&gt;
# calculation ends&lt;br /&gt;
# copy result&lt;br /&gt;
cp outfile.out results2.txt $HOME/resultdir/job12345&lt;br /&gt;
# clean up&lt;br /&gt;
rm myinput.inp outfile.out results2.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are many files or you don&#039;t know exactly what the output files are called, you can just create a tar archive of the whole directory (in HOME) instead of using cp:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
# archive the current directory (here: $SCRATCH/mycalculation)&lt;br /&gt;
tar -cvzf $HOME/resultdir/mycalculation-${SLURM_JOB_ID}.tgz .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The variable $SLURM_JOB_ID contains the Slurm job id during the run on the node and thus makes sure the filename is unique.&lt;br /&gt;
&lt;br /&gt;
== Resource Requests ==&lt;br /&gt;
&lt;br /&gt;
Important resource request options for the Slurm command sbatch are:&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !!  Slurm (sbatch)&lt;br /&gt;
|-&lt;br /&gt;
| #SBATCH|| Script directive&lt;br /&gt;
|-&lt;br /&gt;
| --time=&amp;lt;hh:mm:ss&amp;gt; (-t &amp;lt;hh:mm:ss&amp;gt;)|| Wall time limit&lt;br /&gt;
|-&lt;br /&gt;
| --job-name=&amp;lt;name&amp;gt;  (-J &amp;lt;name&amp;gt;)|| Job name&lt;br /&gt;
|-&lt;br /&gt;
| --nodes=&amp;lt;count&amp;gt; (-N &amp;lt;count&amp;gt;)|| Node count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks=&amp;lt;count&amp;gt; (-n &amp;lt;count&amp;gt;)|| Core count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks-per-node=&amp;lt;count&amp;gt;|| Process count per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem=&amp;lt;limit&amp;gt;|| Memory limit per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem-per-cpu=&amp;lt;limit&amp;gt;|| Memory limit per process&lt;br /&gt;
|-&lt;br /&gt;
| --gres=gpu:&amp;lt;count&amp;gt;|| GPU count (gres = &amp;quot;generic resource&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| --gres=scratch:&amp;lt;count&amp;gt; || Disk space of &amp;lt;count&amp;gt; GB per requested task&lt;br /&gt;
|-&lt;br /&gt;
| --exclusive|| Node exclusive job&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Nodes and Cores&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm provides a number of options to request nodes and cores.&lt;br /&gt;
Typically, using &amp;lt;code&amp;gt;--nodes=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--ntasks-per-node=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; should work for all your jobs. For single core jobs, it is sufficient to use the option &amp;lt;code&amp;gt;--ntasks=1&amp;lt;/code&amp;gt;. Specifying only &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; may lead to Slurm trying to distribute tasks over more than one node, even if you requested a small number of cores.&lt;br /&gt;
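&lt;br /&gt;
For example, a sketch of requesting 8 cores guaranteed to be on a single node:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=8&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;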
&lt;br /&gt;
&#039;&#039;&#039;Memory&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Memory can be requested with either the option &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per node) or &amp;lt;code&amp;gt;--mem-per-cpu=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per process). When looking up the maximum available memory for a certain node type subtract about 5 GB for the operating system. Specify the memory limit as a value-unit-pair, for example 500mb or 8gb.&lt;br /&gt;
&lt;br /&gt;
In most cases it is preferable to use the &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPUs&#039;&#039;&#039; and &#039;&#039;&#039;Scratch&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
These are requested as &amp;quot;generic resources&amp;quot; with &amp;lt;code&amp;gt;--gres=gpu:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--gres=scratch:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Default Values ===&lt;br /&gt;
Some values will be set by default if you do not specify them for your job. &lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !! Equivalent To !! Meaning&lt;br /&gt;
|-&lt;br /&gt;
|Runtime: || --time=02:00:00 || 2 hours&lt;br /&gt;
|-&lt;br /&gt;
|Nodes:  ||--nodes=1  ||one node&lt;br /&gt;
|-&lt;br /&gt;
|Tasks: || --ntasks-per-node=1  ||one task per node&lt;br /&gt;
|-&lt;br /&gt;
|Cores: || --cpus-per-task=1 ||one core per task&lt;br /&gt;
|-&lt;br /&gt;
|Memory: || --mem-per-cpu=2gb  ||2 GB per core&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==  &amp;quot;Exclusive User&amp;quot; Node Access Policy ==&lt;br /&gt;
&lt;br /&gt;
Nodes are exclusively allocated to one single user. However, multiple jobs (up to 48) from the same user can share a node.&lt;br /&gt;
&lt;br /&gt;
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each. &lt;br /&gt;
&lt;br /&gt;
The same applies to memory requests (see below).&lt;br /&gt;
&lt;br /&gt;
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
== Memory Limits ==&lt;br /&gt;
&lt;br /&gt;
The wait time of a job also depends largely on the amount of requested resources and the available number of nodes providing this amount of resources. This must be taken into account in particular when requesting a certain amount of memory.&lt;br /&gt;
&lt;br /&gt;
For example, a node with 192 GB RAM can only run jobs that request up to 187 GB of memory. The remaining amount is reserved for the operating system, system services and local file systems.&lt;br /&gt;
This means that if a job requests 192 GB RAM per node (e.g. --mem=192gb, or --ntasks-per-node=48 and --mem-per-cpu=4gb), the job cannot run on one of the &amp;quot;small&amp;quot; nodes but only on one of the &amp;quot;medium&amp;quot;, &amp;quot;large&amp;quot; or &amp;quot;fat&amp;quot; nodes. Unnecessarily limiting your jobs to a subset of nodes will increase your wait time and the wait time of others who actually need that amount of memory.&lt;br /&gt;
&lt;br /&gt;
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:&lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Node type !! Physical RAM on node !! Available RAM on node !! Number of suitable nodes&lt;br /&gt;
|-&lt;br /&gt;
|small|| 192 GB || 187 GB || 692 &lt;br /&gt;
|-&lt;br /&gt;
|medium|| 384 GB || 376 GB || 220&lt;br /&gt;
|-&lt;br /&gt;
|large|| 768 GB || 754 GB || 28&lt;br /&gt;
|-&lt;br /&gt;
|fat|| 1536 GB || 1510 GB || 8&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.&lt;br /&gt;
&lt;br /&gt;
= Testing Your Jobs = &lt;br /&gt;
&lt;br /&gt;
Justus2 has three compute nodes reserved for jobs with a walltime under 15 minutes. You can test whether your jobs start properly simply by specifying a short walltime, e.g. --time=00:14:00, and your job should start very quickly.&lt;br /&gt;
&lt;br /&gt;
= Monitoring Your Jobs =&lt;br /&gt;
&lt;br /&gt;
Always test things first with few jobs before you roll out hundreds of jobs!&lt;br /&gt;
&lt;br /&gt;
Please check at minimum:&lt;br /&gt;
* Are my jobs using the number of cores I requested?&lt;br /&gt;
* Are my jobs using close to the amount of memory I requested?&lt;br /&gt;
&lt;br /&gt;
If you are running more than about 10 jobs:&lt;br /&gt;
&lt;br /&gt;
* Do my jobs run for at least 10 minutes each?&lt;br /&gt;
* Do my jobs scale reasonably well? &amp;amp;rarr; [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== squeue ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;| Do not run squeue or other Slurm commands in loops or via &amp;quot;watch&amp;quot;, so as not to saturate the Slurm daemon with RPC requests&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
After you have submitted the job, you can see it waiting in the queue with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
(also read the man page with &amp;lt;code&amp;gt;man squeue&amp;lt;/code&amp;gt; for more information on how to use the command)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
             6260301  standard r_60_b_2 ul_yxz1 PD       0:00      1 (AssocGrpMemRunMinutes)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output shows: &lt;br /&gt;
* JOBID: a unique number assigned to your job&lt;br /&gt;
* PARTITION: the partition the job was routed to; the cluster can be divided into different types of nodes&lt;br /&gt;
* NAME: the name you gave your job with the --job-name= option&lt;br /&gt;
* USER: your username&lt;br /&gt;
* ST: the state the job is in. R = running, PD = pending, CD = completed. See the man page for a full list of states.&lt;br /&gt;
* TIME: how long the job has been running&lt;br /&gt;
* NODES: how many nodes were requested&lt;br /&gt;
* NODELIST(REASON): either the node(s) the job is running on, or a reason why it hasn&#039;t started yet&lt;br /&gt;
&lt;br /&gt;
== scontrol ==&lt;br /&gt;
&lt;br /&gt;
You can then show more info on one specific job using the &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; command, e.g. for the job with ID 6260301 listed above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show job 6260301&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for the job with JobID 6260301&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show jobs&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for all your jobs&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol write batch_script 6260301 -&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays the job script of a running job. The &amp;quot;-&amp;quot; is a special filename which means &amp;quot;write to the terminal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
This does not work for jobs that have already completed. If enabled in Slurm, job scripts of completed jobs can be shown with &amp;lt;code&amp;gt;sacct -B -j 6260301&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Monitoring a Started Job ==&lt;br /&gt;
&lt;br /&gt;
After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; ssh n0603 &lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To get a live overview of the current resource usage on the node, use the command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;htop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
On the GPU nodes, the usage of the GPU(s) can be visualized using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;nvtop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Further, we provide the tool jobreport, which generates plots of the resource usage over time for a given job.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
creates an HTML file with these plots in the current directory. For convenience, the report can alternatively be sent by email using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport -E max.mustermann@uni-ulm.de 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Partitions =&lt;br /&gt;
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. number of cores, memory, local scratch space). This prevents fragmentation of the cluster system and ensures the most efficient usage of the available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users &#039;&#039;&#039;should not&#039;&#039;&#039; specify any partition &amp;quot;-p, --partition=&amp;lt;partition_name&amp;gt;&amp;quot; on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.&lt;br /&gt;
&lt;br /&gt;
= Job Priorities =&lt;br /&gt;
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors ]:&lt;br /&gt;
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.&lt;br /&gt;
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Notes:&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.  &lt;br /&gt;
&lt;br /&gt;
Fairshare does &#039;&#039;&#039;not&#039;&#039;&#039; introduce a fixed allotment, in the sense that a user&#039;s ability to run new jobs would be cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources over the long term, which would be unfair to groups that have not used their fair share for quite some time.&lt;br /&gt;
&lt;br /&gt;
Slurm features &#039;&#039;&#039;backfilling&#039;&#039;&#039;, meaning that the scheduler will start lower priority jobs if doing so does not delay the expected start time of &#039;&#039;&#039;any&#039;&#039;&#039; higher priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=161 video]&#039;&#039;&#039; gives an illustrative description of how backfilling works.&lt;br /&gt;
&lt;br /&gt;
In summary, an approximate model of Slurm&#039;s behavior for scheduling jobs is this:&lt;br /&gt;
&lt;br /&gt;
* Step 1: Can the job in position one (highest priority) start now?&lt;br /&gt;
* Step 2: If it can, remove it from the queue, start it and continue with step 1.&lt;br /&gt;
* Step 3: If it cannot, look at the next job.&lt;br /&gt;
* Step 4: Can it start now, without delaying the start time of any job before it in the queue?&lt;br /&gt;
* Step 5: If it can, remove it from the queue, start it, recalculate which nodes are free, look at the next job and continue with step 4.&lt;br /&gt;
* Step 6: If it cannot, look at the next job and continue with step 4.&lt;br /&gt;
&lt;br /&gt;
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.&lt;br /&gt;
&lt;br /&gt;
= Usage Limits/Throttling Policies =&lt;br /&gt;
&lt;br /&gt;
While the fairshare factor ensures a fair long-term balance of resource utilization between users and groups, there are additional usage limits that constrain the total cumulative resources in use at any given time. This is to prevent individual users from monopolizing large fractions of the whole cluster system in the short term.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum walltime&#039;&#039;&#039; for a job is &#039;&#039;&#039;14 days&#039;&#039;&#039; (336 hours)&lt;br /&gt;
  --time=336:00:00 or --time=14-0&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum number of cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920&#039;&#039;&#039; per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit for allocated memory also applies. If this limit is reached, new jobs will be queued (with REASON: AssocGrpCpuLimit) but only allowed to run after resources have been relinquished.&lt;br /&gt;
&lt;br /&gt;
* The maximum amount of &#039;&#039;&#039;remaining allocated core-minutes&#039;&#039;&#039; per user is &#039;&#039;&#039;3300000&#039;&#039;&#039; (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time decreases and eventually allows more jobs to start in a staggered way. This limit also &#039;&#039;&#039;couples the maximum walltime and the number of cores that can be allocated&#039;&#039;&#039; for that time. Thus, shorter walltimes allow more resources to be allocated at a given time (capped by the maximum core count limit above). Watch this &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=306 video]&#039;&#039;&#039; for an illustrative description. An equivalent limit applies to the remaining time of memory allocations, in which case jobs may be held back from starting with REASON: AssocGrpMemRunMinutes.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum number of GPUs&#039;&#039;&#039; allocated by running jobs is &#039;&#039;&#039;8&#039;&#039;&#039; per user. If this limit is reached, new jobs will be queued (with REASON: AssocGrpGRES) but only allowed to run after GPU resources have been relinquished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Usage limits are subject to change.&lt;br /&gt;
&lt;br /&gt;
= Efficiency / Use Cases =&lt;br /&gt;
&lt;br /&gt;
When we speak of poor job efficiency, we usually mean that hardware resources are wasted.&lt;br /&gt;
That means a similar overall result could have been achieved using fewer hardware resources, leaving those free for other jobs and reducing the wait time for you and everyone else.&lt;br /&gt;
&lt;br /&gt;
The more resources you use, the more important efficiency becomes. If you run just 3-5 jobs that take under a day, go ahead and choose roughly sane defaults. If you submit hundreds or thousands of jobs - jobs that will accumulate years of CPU time by using many CPU cores - then think very carefully about your jobs and take some time for trial runs until you are sure your calculations run well.&lt;br /&gt;
&lt;br /&gt;
Also consider these non-technical things:&lt;br /&gt;
* does the calculation give me all the results I need? &lt;br /&gt;
&amp;amp;rarr; rerunning calculations is extremely wasteful&lt;br /&gt;
* am I using the most efficient algorithm? &lt;br /&gt;
&amp;amp;rarr; using better algorithms can reduce the CPU time needed by an order of magnitude or two. And this can sometimes be something as simple as arranging loops in a more clever way or avoiding slow storage. &lt;br /&gt;
&lt;br /&gt;
Some simple causes for poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
* using $HOME or work directories for scratch space (expressly forbidden for $HOME, discouraged for work directories, except for multi-node jobs that specifically need this for communication)&lt;br /&gt;
* poor choice of resources compared to the size of the nodes leaves part of the node blocked but doing nothing:&lt;br /&gt;
** --ntasks-per-node does not evenly divide the number of cores of a node (see section [[#&amp;quot;Exclusive User&amp;quot; Node Access Policy]])&lt;br /&gt;
** too much (un-needed) memory or disk space requested&lt;br /&gt;
* more cores requested than are actually used by the job&lt;br /&gt;
* more cores used for a single MPI/OpenMP parallel computation than useful&lt;br /&gt;
* many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
* one-core jobs with very different run times (because of the single-user policy)&lt;br /&gt;
* not using the full node capacity&lt;br /&gt;
* using more cores than your computational problem can be split into &amp;amp;rarr; see [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== User-exclusive Nodes on Justus2 ==&lt;br /&gt;
&lt;br /&gt;
For several reasons, Justus2 nodes are assigned to one user exclusively. That means that you are responsible for using the full compute node efficiently, as no jobs from other users can fill gaps!&lt;br /&gt;
&lt;br /&gt;
Several key points to accomplishing that:&lt;br /&gt;
&lt;br /&gt;
* Use divisors of the core number:&lt;br /&gt;
&lt;br /&gt;
The Justus2 nodes have 48 cores: two sockets with 24 cores each. Use divisors of 48 (e.g. 8) to be able to use all cores of the node. Be aware that when you choose 16, one job will be executed half on one CPU and half on the other. This might be suboptimal.&lt;br /&gt;
* Be aware of memory resources: &lt;br /&gt;
&lt;br /&gt;
When you request more memory per core than the &amp;quot;small&amp;quot; nodes on Justus2 have per core, your jobs will not be able to use all cores on the small nodes - or will have to wait for the rarer slots on the nodes with more memory. Try to estimate your memory requirements well, and if you need more than 3.8 GB per core, consider mixing in jobs with lower memory requirements to fill the nodes.&lt;br /&gt;
&lt;br /&gt;
== Many One or Few-Core Jobs ==&lt;br /&gt;
&lt;br /&gt;
Jobs that use only a few CPU cores can lead to very inefficient node usage:&lt;br /&gt;
&lt;br /&gt;
# You submit 1000 jobs, each running for ~30s. Jobs need up to 30s to start and finish - a huge waste if the job itself only takes 30 seconds. Additionally, starting and finishing so many jobs in a short time puts strain on the Slurm scheduler, may cause severe problems for everyone and clutters the Slurm job database.&lt;br /&gt;
# Many few-core jobs with very different run times: the jobs will start on many nodes, but at some point all quicker jobs have finished their calculation and only a few remain. Because of the single-user policy on JUSTUS2, jobs of other users cannot fill in the gaps and the rest of the node sits idle.&lt;br /&gt;
&lt;br /&gt;
To address the problem, you can reduce the number of jobs and/or the number of nodes used.&lt;br /&gt;
&lt;br /&gt;
To limit the number of jobs, start many calculations within one job (addresses problems 1 and 2):&lt;br /&gt;
&lt;br /&gt;
* use a bash loop in your job script&lt;br /&gt;
* use the program GNU parallel to start the processes for you&lt;br /&gt;
&lt;br /&gt;
To limit only the number of nodes used:&lt;br /&gt;
&lt;br /&gt;
* use array jobs&lt;br /&gt;
&lt;br /&gt;
=== Bash Loop ===&lt;br /&gt;
&lt;br /&gt;
One advantage of this method is that you can run more threads than cores if your jobs are really short and do not use too much RAM, and in this way keep all cores busy even while many calculations are still starting up.&lt;br /&gt;
&lt;br /&gt;
It is of course even better if you can combine such short calculations so that, for 1000 calculations, the kernel does not need to start 1000 processes which in turn each need to initialize everything.&lt;br /&gt;
&lt;br /&gt;
This example uses pgrep to count how many jobs are running: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
 &lt;br /&gt;
for i in {1..200}&lt;br /&gt;
do&lt;br /&gt;
  echo starting up $i&lt;br /&gt;
  bash my_calculation $i &amp;amp;&lt;br /&gt;
  # wait while more than 48 calculations are running&lt;br /&gt;
  while [ $(pgrep -c -f my_calculation) -gt 48 ] ; do echo sleeping; sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The same, but tracking the PIDs (process IDs) of the started processes. This is more robust, but more difficult to read:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
running_jobs=()&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}; do&lt;br /&gt;
  echo &amp;quot;Starting job $i&amp;quot;&lt;br /&gt;
  sleep &amp;quot;$i&amp;quot; &amp;amp;  # placeholder for the real calculation&lt;br /&gt;
  running_jobs+=($!)  # Track PID&lt;br /&gt;
&lt;br /&gt;
  while [ &amp;quot;${#running_jobs[@]}&amp;quot; -ge 8 ]; do&lt;br /&gt;
    sleep 2 # adjust duration depending on your runtime&lt;br /&gt;
    echo running_jobs: ${running_jobs[@]} &lt;br /&gt;
    echo pid-out: $(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null | xargs)&lt;br /&gt;
    echo -----&lt;br /&gt;
    running_jobs=($(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null))  # Remove finished jobs&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
wait  # Ensure all jobs complete&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may not be able to just use an index number &amp;quot;i&amp;quot; to start many calculations. In this case, for not too many files, the for loop could be used to read in config files. Here is just the general idea for the for loop, without the throttling shown above:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
This loops over all files in the directory config-1980-03-01_1/ and passes each of them as an input file to &amp;quot;mycalculation&amp;quot; via a hypothetical &amp;quot;-config&amp;quot; option. Adding a date to the config dirs (and outputs) would enable you to track different runs in your lab journal more easily.&lt;br /&gt;
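&lt;br /&gt;
If the individual runs are short, this loop can be combined with the throttling from above; a sketch, again using the hypothetical mycalculation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot; &amp;amp;&lt;br /&gt;
  # throttle: allow at most 48 background runs at a time&lt;br /&gt;
  while [ &amp;quot;$(jobs -rp | wc -l)&amp;quot; -ge 48 ]; do sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;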
&lt;br /&gt;
=== GNU Parallel ===&lt;br /&gt;
&lt;br /&gt;
GNU Parallel is available on the HPC cluster and comes with its own set of examples; you can access them like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ module load system/parallel&lt;br /&gt;
$ cp $PARALLEL_EXA_DIR/parallel.slurm .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
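&lt;br /&gt;
As a sketch, the hypothetical my_calculation script from above could be run for the arguments 1 to 200, at most 48 at a time, with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
module load system/parallel&lt;br /&gt;
# one my_calculation run per argument, at most 48 in parallel&lt;br /&gt;
parallel -j 48 bash my_calculation ::: {1..200}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;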
=== Array Jobs ===&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;$ sbatch -a 1-500%48 batch_script&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will submit 500 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 500, but will limit the number of simultaneously running tasks from this job array to 48 (the number of cores on a Justus2 node).&lt;br /&gt;
&lt;br /&gt;
The same can be done inside the job script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Number of cores per individual array task&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --array=1-500%48&lt;br /&gt;
#SBATCH --mem=3G&lt;br /&gt;
#SBATCH --time=1:10:00&lt;br /&gt;
#SBATCH --job-name=array_job&lt;br /&gt;
#SBATCH --output=array_job-%A_%a.out&lt;br /&gt;
#SBATCH --error=array_job-%A_%a.err&lt;br /&gt;
 &lt;br /&gt;
# Print the task id.&lt;br /&gt;
echo &amp;quot;My SLURM_ARRAY_TASK_ID: &amp;quot; $SLURM_ARRAY_TASK_ID&lt;br /&gt;
 &lt;br /&gt;
# print only the elapsed real time of each calculation&lt;br /&gt;
export TIMEFORMAT=%R&lt;br /&gt;
time bash mycalculation $SLURM_ARRAY_TASK_ID&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
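&lt;br /&gt;
If your calculations are driven by config files rather than by an index, the task ID can be mapped to a file; a sketch using the hypothetical mycalculation from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# pick the n-th config file for this array task&lt;br /&gt;
configs=(config-1980-03-01_1/*)&lt;br /&gt;
mycalculation -config &amp;quot;${configs[$((SLURM_ARRAY_TASK_ID - 1))]}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;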
&lt;br /&gt;
Also see: &lt;br /&gt;
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]]&lt;br /&gt;
* SchedMD documentation on Job Arrays: https://slurm.schedmd.com/job_array.html&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14981</id>
		<title>JUSTUS2/Jobscripts: Running Your Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14981"/>
		<updated>2025-06-23T13:13:33Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Submitting Jobs on the bwForCluster JUSTUS 2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Justus2}}&lt;br /&gt;
&lt;br /&gt;
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. &lt;br /&gt;
&lt;br /&gt;
= JUSTUS 2 Slurm Howto =&lt;br /&gt;
&lt;br /&gt;
This page only presents some basic introduction. &lt;br /&gt;
&lt;br /&gt;
Please see  the &#039;&#039;&#039;[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]&#039;&#039;&#039; for many more examples and commands for common tasks or the original slurm documentation.&lt;br /&gt;
&lt;br /&gt;
= Slurm Command Overview =&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs &lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job&lt;br /&gt;
|- &lt;br /&gt;
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job&lt;br /&gt;
|- &lt;br /&gt;
| seff  || Shows the &amp;quot;job efficiency&amp;quot; of a job after it has finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Submitting Jobs on the bwForCluster JUSTUS 2 =&lt;br /&gt;
Batch jobs are submitted with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job script contains options for Slurm in lines beginning with #SBATCH as well as your commands which you want to execute on the compute nodes. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=06:00:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can override options from the script on the command-line:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch --time=00:14:00 &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note:  --time=00:14:00 should start your job very quickly. see [[#Testing Your Jobs]]&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:12px; background:#cef2e0;  text-align:left&amp;quot; |&lt;br /&gt;
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. &amp;lt;br&amp;gt; To find it e.g. for lammps: &amp;lt;code&amp;gt; module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la&amp;lt;/code&amp;gt;.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== File Access ==&lt;br /&gt;
&lt;br /&gt;
Note: &amp;lt;font color=&amp;quot;red&amp;quot;&amp;gt; Compute jobs must not write/read temporary files from the [[JUSTUS2/Hardware#Storage_Architecture|global file systems]] (HOME and WORK) such as a calculation swap files. &amp;lt;/font&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Use local storage /tmp in the ramdisk for small files or /scratch on disk (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVME]]) for this purpose.&lt;br /&gt;
&lt;br /&gt;
Often, you must configure the the program you are using to write temporary files not to use the global file systems.&lt;br /&gt;
If the program uses the current directory to look for files, you must copy these files to a temporary directory, start the program there and copy/save the results of the calculation in the end.  The contents of the of /tmp and /scratch are deleted by the automated cleanup happening after the job.&lt;br /&gt;
&lt;br /&gt;
Each node has a file system in memory (“ram disk”), that can have a maximum of half the size of the total RAM. Note that files created plus the memory requirements of your job need to fit into the total memory. &lt;br /&gt;
&lt;br /&gt;
There are more diskless nodes than nodes with scratch disks, so if your job can run on a diskless node, you should choose this option. &lt;br /&gt;
&lt;br /&gt;
Example job script with requesting 700GB disk space and copying files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
#SBATCH --gres=scratch:700 &lt;br /&gt;
mkdir -p $SCRATCH/mycalculation&lt;br /&gt;
# copy input file&lt;br /&gt;
cp $HOME/inputfiles/myinput.inp $SCRATCH/mycalculation&lt;br /&gt;
# switch directory&lt;br /&gt;
cd $SCRATCH/mycalculation&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
myprogram --input=$SCRATCH/mycalculation/myinput.inp&lt;br /&gt;
# calculation ends&lt;br /&gt;
# copy result&lt;br /&gt;
cp outfile.out results2.txt $HOME/resultdir/job12345&lt;br /&gt;
# clean up&lt;br /&gt;
rm myinput outfile.out results2.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are many files or you don&#039;t know exactly how the output files are called, you can just create a tar archive of the whole hole directory (in HOME) instead of using cp:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
tar -cvzf $HOME/resultdir/mycalculation-${SLURM_JOB_ID}.tgz&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
$SLURM_JOB_ID contains the jobid in slurm during the run on the node and so makes sure the filename is unique. &lt;br /&gt;
&lt;br /&gt;
== Resource Requests ==&lt;br /&gt;
&lt;br /&gt;
Important resource request options for the Slurm command sbatch are:&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !!  Slurm (sbatch)&lt;br /&gt;
|-&lt;br /&gt;
| #SBATCH|| Script directive&lt;br /&gt;
|-&lt;br /&gt;
| --time=&amp;lt;hh:mm:ss&amp;gt; (-t &amp;lt;hh:mm:ss&amp;gt;)|| Wall time limit&lt;br /&gt;
|-&lt;br /&gt;
| --job-name=&amp;lt;name&amp;gt;  (-J &amp;lt;name&amp;gt;)|| Job name&lt;br /&gt;
|-&lt;br /&gt;
| --nodes=&amp;lt;count&amp;gt; (-N &amp;lt;count&amp;gt;)|| Node count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks=&amp;lt;count&amp;gt; (-n &amp;lt;count&amp;gt;)|| Core count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks-per-node=&amp;lt;count&amp;gt;|| Process count per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem=&amp;lt;limit&amp;gt;|| Memory limit per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem-per-cpu=&amp;lt;limit&amp;gt;|| Memory limit per process&lt;br /&gt;
|-&lt;br /&gt;
| --gres=gpu:&amp;lt;count&amp;gt;|| GPU count (gres = &amp;quot;generic resource&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| --gres=scratch:&amp;lt;count&amp;gt; || Disk space of &amp;lt;count&amp;gt; GB per requested task&lt;br /&gt;
|-&lt;br /&gt;
| --exclusive|| Node exclusive job&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Nodes and Cores&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm provides a number of options to request nodes and cores.&lt;br /&gt;
Typically, using &amp;lt;code&amp;gt;--nodes=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--ntasks-per-node=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; should work for all your jobs. For single core jobs, it would be sufficient to use the option &amp;lt;code&amp;gt;--ntasks=1&amp;lt;/code&amp;gt;. Specifying only &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; may lead to slurm trying to distribute tasks over more than one node even if you requested a small amount of cores.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Memory&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Memory can be requested with either the option &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per node) or &amp;lt;code&amp;gt;--mem-per-cpu=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per process). When looking up the maximum available memory for a certain node type subtract about 5 GB for the operating system. Specify the memory limit as a value-unit-pair, for example 500mb or 8gb.&lt;br /&gt;
&lt;br /&gt;
In most cases it is preferable to use the &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPUs&#039;&#039;&#039; and &#039;&#039;&#039;Scratch&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
These are requested as &amp;quot;generic resources&amp;quot; with &amp;lt;code&amp;gt;--gres:gpu:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--gres:scratch:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Default Values ===&lt;br /&gt;
Some values will be set by default if you do not specify them for your job. &lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Option || Equivalient To || Meaning&lt;br /&gt;
|-&lt;br /&gt;
|Runtime: || --time=02:00:00 || 2 hours&lt;br /&gt;
|-&lt;br /&gt;
|Nodes:  ||--nodes=1  ||one node&lt;br /&gt;
|-&lt;br /&gt;
|Tasks: || --tasks-per-node=1  ||one task per node&lt;br /&gt;
|-&lt;br /&gt;
|Cores: || --cpus-per-task=1 ||one core per task&lt;br /&gt;
|-&lt;br /&gt;
|Memory: || --mem-per-cpu=2gb  ||2 GB per core&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==  &amp;quot;Exclusive User&amp;quot; Node Access Policy ==&lt;br /&gt;
&lt;br /&gt;
Nodes are exclusively allocated to one single user. However, multiple jobs (up to 48) from the same user can share a node.&lt;br /&gt;
&lt;br /&gt;
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each. &lt;br /&gt;
&lt;br /&gt;
The same applies to memory requests (see below).&lt;br /&gt;
&lt;br /&gt;
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
== Memory Limits ==&lt;br /&gt;
&lt;br /&gt;
The wait time of a job also depends largely on the amount of requested resources and the available number of nodes providing this amount of resources. This must be taken into account in particular when requesting a certain amount of memory.&lt;br /&gt;
&lt;br /&gt;
For example a node with 192 GB RAM can only run jobs with up to 187 GB memory requested. The remaining amount is reserved for the operating system, system services and local file systems.&lt;br /&gt;
This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb or --tasks-per-node=48 and --mem-per-cpu=4gb), the job cannot run on one of the 456 &amp;quot;small&amp;quot; nodes but only on one of the  &amp;quot;medium&amp;quot;, &amp;quot;large&amp;quot; or &amp;quot;fat&amp;quot; nodes. Unnecessarily limiting your jobs to a sub-set of nodes will increase your wait time and the wait time of others, who actually need the amount of memory.&lt;br /&gt;
&lt;br /&gt;
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:&lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!|| Physical RAM on node !! Available RAM on node !! Number of suitable nodes &lt;br /&gt;
|-&lt;br /&gt;
|small|| 192 GB || 187 GB || 692 &lt;br /&gt;
|-&lt;br /&gt;
|medium|| 384 GB || 376 GB || 220&lt;br /&gt;
|-&lt;br /&gt;
|large|| 768 GB || 754 GB || 28&lt;br /&gt;
|-&lt;br /&gt;
|fat|| 1536 GB || 1510 GB || 8&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.&lt;br /&gt;
&lt;br /&gt;
= Testing Your Jobs = &lt;br /&gt;
&lt;br /&gt;
Justus2 has three compute nodes reserved for jobs with a walltime under 15 minutes. You can test if your jobs start properly just by specifying a short walltime, e.g. --time=00:14:00 and your job should start very quickly. &lt;br /&gt;
&lt;br /&gt;
= Monitoring Your Jobs =&lt;br /&gt;
&lt;br /&gt;
Always test things first with few jobs before you roll out hundreds of jobs!&lt;br /&gt;
&lt;br /&gt;
Please ensure at minimum: &lt;br /&gt;
* are my jobs using the amount of cores I requested&lt;br /&gt;
* is my job using near to the amount of memory I requested&lt;br /&gt;
&lt;br /&gt;
If you are running more than 1-10 jobs: &lt;br /&gt;
&lt;br /&gt;
* are my jobs running at the very least over 10 minutes&lt;br /&gt;
* do my jobs scale reasonably well &amp;amp;rarr; [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== squeue ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;| Do not run squeue and other slurm commands in loops or &amp;quot;watch&amp;quot; as not to saturate up the slurm daemon with rpc requests&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
After you submitted the job, you can see it waiting using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
(also read the man page with &amp;lt;code&amp;gt;man squeue&amp;lt;/code&amp;gt; for more information on how to use the command)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
             6260301  standard r_60_b_2 ul_yxz1 PD       0:00      1 (AssocGrpMemRunMinutes)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output shows: &lt;br /&gt;
* JOBID: the jobid is an unique number your job gets&lt;br /&gt;
* PARTITION: the cluster can be divided in different types of nodes.&lt;br /&gt;
* NAME: the name you gave your job with the --job-name= option&lt;br /&gt;
* USER: your username&lt;br /&gt;
* ST: the state the job is in. R = running, PD = pending, CD = completed. See man page for a full list on states. &lt;br /&gt;
* TIME: how long the job has been running&lt;br /&gt;
* NODES: how many nodes were requested&lt;br /&gt;
* NODELIST(REASON): either show the node(s) the job is running on, or a reason why it hasn&#039;t started&lt;br /&gt;
&lt;br /&gt;
==scontrol==&lt;br /&gt;
&lt;br /&gt;
You can then show more info on one specific running job using the &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; command, e.g for the job with ID 6260301 listed above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show job 6260301&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for job with JobID 6260301&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show jobs&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for all your jobs&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
scontrol write batch_script 6260301 -&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
display job script of a running job. The &amp;quot;-&amp;quot; is a special filename which means &amp;quot;write to the terminal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
This does not work for already completed jobs. When enabled in slurm, one can see those job scripts with &amp;lt;code&amp;gt;sacct -B -j 6260301&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring a Started Job ==&lt;br /&gt;
&lt;br /&gt;
After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; ssh n0603 &lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To get a live overview of the current resource usage on the node, use the command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;htop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
On the GPU nodes, the usage of the GPU(s) can be visualized using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;nvtop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Further, we provide the tool jobreport, that generates plots for the resource usage over time of a given job.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
creates an HTML file with these plots in the current directory. For convenience, the report may alternatively be sent as email using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport -E max.mustermann@uni-ulm.de 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Partitions =&lt;br /&gt;
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. amount of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users &#039;&#039;&#039;should not&#039;&#039;&#039; specify any partition &amp;quot;-p, --partition=&amp;lt;partition_name&amp;gt;&amp;quot; on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.&lt;br /&gt;
&lt;br /&gt;
= Job Priorities =&lt;br /&gt;
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors ]:&lt;br /&gt;
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.&lt;br /&gt;
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Notes:&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.  &lt;br /&gt;
&lt;br /&gt;
Fairshare does &#039;&#039;&#039;not&#039;&#039;&#039; introduce a fixed allotment, in that a user&#039;s ability to run new jobs is cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from long term monopolizing the resources, thus making it unfair to groups who have not used their fairshare for quite some time.&lt;br /&gt;
&lt;br /&gt;
Slurm features &#039;&#039;&#039;backfilling&#039;&#039;&#039;, meaning that the scheduler will start lower priority jobs if doing so does not delay the expected start time of &#039;&#039;&#039;any&#039;&#039;&#039; higher priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=161 video]&#039;&#039;&#039; gives an illustrative description to how backfilling works.&lt;br /&gt;
&lt;br /&gt;
In summary, an approximate model of Slurm&#039;s behavior for scheduling jobs is this:&lt;br /&gt;
&lt;br /&gt;
* Step 1: Can the job in position one (highest priority) start now?&lt;br /&gt;
* Step 2: If it can, remove it from the queue, start it and continue with step 1.&lt;br /&gt;
* Step 3: If it can not, look at next job.&lt;br /&gt;
* Step 4: Can it start now, without delaying the start time of any job before it in the queue?&lt;br /&gt;
* Step 5: If it can, remove it from the queue, start it, recalculate what nodes are free, look at next job and continue with step 4.&lt;br /&gt;
* Step 6: If it can not, look at next job, and continue with step 4.&lt;br /&gt;
&lt;br /&gt;
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.&lt;br /&gt;
&lt;br /&gt;
= Usage Limits/Throttling Policies =&lt;br /&gt;
&lt;br /&gt;
While the fairshare factor ensures fair long term balance of resource utilization between users and groups, there are additional usage limits that constrain the total cumulative resources at a given time. This is to prevent individual users from short term monopolizing large fractions of the whole cluster system.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum walltime&#039;&#039;&#039; for a job is &#039;&#039;&#039;14 days&#039;&#039;&#039; (336 hours)&lt;br /&gt;
  --time=336:00:00 or --time=14-0&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum amount of cores&#039;&#039;&#039; used at any given time from jobs running is &#039;&#039;&#039;1920&#039;&#039;&#039; per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit for allocated memory does also apply. If this limit is reached new jobs will be queued (with REASON: AssocGrpCpuLimit) but only allowed to run after resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
* The maximum amount of &#039;&#039;&#039;remaining allocated core-minutes&#039;&#039;&#039; per user is &#039;&#039;&#039;3300000&#039;&#039;&#039; (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time will decrease and eventually allow more jobs to start in a staggered way. This limit also &#039;&#039;&#039;correlates the maximum walltime and amount of cores that can be allocated&#039;&#039;&#039; for this amount of time. Thus, shorter walltimes for the jobs allow more resources to be allocated at a given time (but capped by the maximum amount of cores limit above). Watch this &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=306 video]&#039;&#039;&#039; for an illustrative description. An equivalent limit applies for remaining time of memory allocation in which case jobs may be held back from starting with REASON AssocGrpMemRunMinutes.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum amount of GPUs&#039;&#039;&#039; allocated by running jobs is &#039;&#039;&#039;8&#039;&#039;&#039; per user. If this limit it reached new jobs will be queued (with REASON: AssocGrpGRES) but only allowed to run after GPU resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Usage limits are subject to change.&lt;br /&gt;
&lt;br /&gt;
= Efficiency / Use Cases =&lt;br /&gt;
&lt;br /&gt;
When we speak of poor job efficiency, we usually mean that hardware resources are wasted.&lt;br /&gt;
That means, a similar overall result could have been achieved using less hardware resources, leaving those for other jobs and reducing the wait time for you and everyone.&lt;br /&gt;
&lt;br /&gt;
The more resources you use, the more important efficiency becomes. If you just run 3-5 jobs that take under a day, just go ahead and choose roughly sane defaults. If you submit hundreds or thousands of jobs - jobs that will accumulate years of CPU compute time (by using many CPU cores), then think very carefully about your jobs and take some time to do trial runs until you are sure your calculations are run well. &lt;br /&gt;
&lt;br /&gt;
Also consider these non-technical things:&lt;br /&gt;
* does the calculation give me all the results I need? &lt;br /&gt;
&amp;amp;rarr; rerunning calculations is extremely wasteful&lt;br /&gt;
* am I using the most efficient algorithm? &lt;br /&gt;
&amp;amp;rarr; using better algorithms can reduce the CPU time needed by an order of magnitude or two. And this can sometimes be something as simple as arranging loops in a more clever way or avoiding slow storage. &lt;br /&gt;
&lt;br /&gt;
Some simple causes for poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
* using $HOME or work directories for scratch space (expressively forbidden for $HOME, discouraged for work directories, except for multinode jobs that specifically need this for communication)&lt;br /&gt;
*    poor choice of resources compared to the size of the nodes leaves part of the node blocked, but doing nothing:&lt;br /&gt;
** multiple of --ntasks-per-node is not the number of cores of a node (see section [[#&amp;quot;Exclusive User&amp;quot; Node Access Policy]])&lt;br /&gt;
** too much (un-needed) memory or disk space requested&lt;br /&gt;
*    more cores requested than are actually used by the job&lt;br /&gt;
*    more cores used for a single mpi/openmp parallel computation than useful&lt;br /&gt;
*    many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
*    one-core jobs with very different run-times (because of single-user policy)&lt;br /&gt;
*    not using full node capacity&lt;br /&gt;
*    using more cores than what your computational problem can be split into &amp;amp;rarr; see [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== User-exclusive Nodes on Justus2 ==&lt;br /&gt;
&lt;br /&gt;
For several reasons, Justus2 nodes are assigned to one user exclusively. That means that you are responsible for using the full compute node efficiently, as no jobs from other users can fill gaps!&lt;br /&gt;
&lt;br /&gt;
Several key points to accomplishing that:&lt;br /&gt;
&lt;br /&gt;
* Use dividers of the core number:&lt;br /&gt;
&lt;br /&gt;
The Justus2 nodes have 48 cores, two sockets with 24 cores each. Use dividers of 48 to be able to use all cores of the node (e.g. 8). Be aware, that when you choose 16, one job will be executed half on one of the CPUs and the other half on the other. This might be suboptimal.&lt;br /&gt;
* Be aware of memory resources: &lt;br /&gt;
&lt;br /&gt;
When you request more memory per core than the &amp;quot;small&amp;quot; nodes on Justus2 have per core, your jobs will not be able to use all cores on the small nodes - or will have to wait for the rarer spaces on the nodes with more memory. Try to estimate your memory requirements well and if you need more than 3.8GB per core, consider mixing in jobs with lower memory requirements to fill the nodes&lt;br /&gt;
&lt;br /&gt;
== Many One or Few-Core Jobs ==&lt;br /&gt;
&lt;br /&gt;
Jobs that use only a few CPU cores can lead to very inefficient node usage:&lt;br /&gt;
&lt;br /&gt;
# You submit 1000 jobs, each runs for ~30s.  Jobs need up to 30s to start and finish - a huge waste if the job only takes 30 seconds. Additionally, the starting and finishing of so many jobs in a short time causes strain on the scheduler SLURM and may cause severe problems for everyone and clutter the SLURM job database. &lt;br /&gt;
# many few-core jobs with very different run times. The jobs will start on many nodes, but at some time all quicker jobs have finished the calculation and only a few remain. Because of the single-user policy on JUSTUS2, jobs of other users cannot fill in the gaps and the rest of the node is idle. &lt;br /&gt;
&lt;br /&gt;
To address the problem, you can reduce the amount of jobs and/or the amount of nodes used.&lt;br /&gt;
&lt;br /&gt;
To limit the amount of jobs, start many calculations within one job (problem 1. and 2.):&lt;br /&gt;
&lt;br /&gt;
* use a bash loop in your job script&lt;br /&gt;
* use the program GNU parallel to start the processes for you&lt;br /&gt;
&lt;br /&gt;
To only limit the amount of nodes used:&lt;br /&gt;
&lt;br /&gt;
* use array jobs&lt;br /&gt;
&lt;br /&gt;
=== Bash Loop ===&lt;br /&gt;
&lt;br /&gt;
One advantage of this method is, that you can run more threads than cores if your jobs are really short and do not use too much RAM memory and in this way keep all cores busy even if many calculations are still starting up.&lt;br /&gt;
&lt;br /&gt;
It is of course even better, if you can combine such short calculations in a way that for 1000 calculations the kernel does not need to start 1000 processes which in turn need to initialize everything. &lt;br /&gt;
&lt;br /&gt;
This example uses pgrep to count how many jobs are running: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
 &lt;br /&gt;
for i in {1..200}&lt;br /&gt;
do&lt;br /&gt;
  echo starting up $i&lt;br /&gt;
   bash my_calculation $i  &amp;amp;&lt;br /&gt;
   while [ $(pgrep -c -f my_calculation) -gt 48 ] ; do echo sleeping; sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The same, but by tracking the PIDs (process IDs) of the started processes. This is more robust, but is more difficult to read:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
running_jobs=()&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}; do&lt;br /&gt;
  echo &amp;quot;Starting job $i&amp;quot;&lt;br /&gt;
  sleep &amp;quot;$i&amp;quot; &amp;amp;  &lt;br /&gt;
  running_jobs+=($!)  # Track PID&lt;br /&gt;
&lt;br /&gt;
  while [ &amp;quot;${#running_jobs[@]}&amp;quot; -ge 8 ]; do&lt;br /&gt;
    sleep 2 # adjust duration depending on your runtime&lt;br /&gt;
    echo running_jobs: ${running_jobs[@]} &lt;br /&gt;
    echo pid-out: $(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null | xargs)&lt;br /&gt;
    echo -----&lt;br /&gt;
    running_jobs=($(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null))  # Remove finished jobs&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
wait  # Ensure all jobs complete&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may not be able to just use an index number &amp;quot;i&amp;quot; to start many calculations. In this case, for not too many files, the for loop could be used to read in config files. Here just the general idea for the for loop without &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
This loops over all files in the directory config-1980-03-01_1/ and gives them as an input file to &amp;quot;mycalculation&amp;quot; via a hypothetical &amp;quot;-config&amp;quot; option. Adding a date to the config-dirs (and outputs) would enable you to track different runs in your lab journal more easily.&lt;br /&gt;
&lt;br /&gt;
=== Gnu Parallel ===&lt;br /&gt;
&lt;br /&gt;
Gnu Parallel is available on the HPC Cluster and comes with its own set of examples, you can access them like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ module load system/parallel&lt;br /&gt;
$ cp $PARALLEL_EXA_DIR/parallel.slurm .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
=== Array Jobs ===&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;$ sbatch -a 1-500%48 batch_script&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will submit 500 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 200, but will limit the number of simultaneously running tasks from this job array to 48 (number of cores on a Justus2 node).&lt;br /&gt;
&lt;br /&gt;
Thee same can be done inside the job script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Number of cores per individual array task&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --array=1-500%48&lt;br /&gt;
#SBATCH --mem=3G&lt;br /&gt;
#SBATCH --time=1:10:00&lt;br /&gt;
#SBATCH --job-name=array_job&lt;br /&gt;
#SBATCH --output=array_job-%A_%a.out&lt;br /&gt;
#SBATCH --error=array_job-%A_%a.err&lt;br /&gt;
 &lt;br /&gt;
# Print the task id.&lt;br /&gt;
echo &amp;quot;My SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
export TIMEFORMAT=%R  # print only the elapsed real time&lt;br /&gt;
time bash mycalculation $SLURM_ARRAY_TASK_ID&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see: &lt;br /&gt;
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]]&lt;br /&gt;
* Schedmd documentations on Job Arrays: https://slurm.schedmd.com/job_array.html&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14980</id>
		<title>JUSTUS2/Jobscripts: Running Your Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14980"/>
		<updated>2025-06-23T13:02:33Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Submitting Jobs on the bwForCluster JUSTUS 2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Justus2}}&lt;br /&gt;
&lt;br /&gt;
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. &lt;br /&gt;
&lt;br /&gt;
= JUSTUS 2 Slurm Howto =&lt;br /&gt;
&lt;br /&gt;
This page presents only a basic introduction. &lt;br /&gt;
&lt;br /&gt;
Please see the &#039;&#039;&#039;[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]&#039;&#039;&#039; for many more examples and commands for common tasks, or consult the original Slurm documentation.&lt;br /&gt;
&lt;br /&gt;
= Slurm Command Overview =&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs &lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job&lt;br /&gt;
|- &lt;br /&gt;
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job&lt;br /&gt;
|- &lt;br /&gt;
| seff  || Shows the &amp;quot;job efficiency&amp;quot; of a job after it has finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Submitting Jobs on the bwForCluster JUSTUS 2 =&lt;br /&gt;
Batch jobs are submitted with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job script contains options for Slurm in lines beginning with #SBATCH, as well as the commands you want to execute on the compute nodes. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=06:00:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can override options from the script on the command-line:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch --time=00:14:00 &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: --time=00:14:00 should start your job very quickly. See [[#Testing Your Jobs]].&lt;br /&gt;
&lt;br /&gt;
== File Access ==&lt;br /&gt;
&lt;br /&gt;
Note: &amp;lt;font color=&amp;quot;red&amp;quot;&amp;gt; Compute jobs must not write/read temporary files, such as calculation swap files, on the [[JUSTUS2/Hardware#Storage_Architecture|global file systems]] (HOME and WORK). &amp;lt;/font&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Use local storage /tmp in the ramdisk for small files or /scratch on disk (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVME]]) for this purpose.&lt;br /&gt;
&lt;br /&gt;
Often, you must configure the program you are using so that it does not write temporary files to the global file systems.&lt;br /&gt;
If the program looks for files in the current directory, copy these files to a temporary directory, start the program there, and copy/save the results of the calculation at the end. The contents of /tmp and /scratch are deleted by the automated cleanup after the job.&lt;br /&gt;
&lt;br /&gt;
Each node has a file system in memory (“ram disk”) that can have a maximum size of half the total RAM. Note that the files created plus the memory requirements of your job need to fit into the total memory. &lt;br /&gt;
&lt;br /&gt;
There are more diskless nodes than nodes with scratch disks, so if your job can run on a diskless node, you should choose this option. &lt;br /&gt;
&lt;br /&gt;
Example job script requesting 700 GB of scratch space and copying files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
#SBATCH --gres=scratch:700 &lt;br /&gt;
mkdir -p $SCRATCH/mycalculation&lt;br /&gt;
# copy input file&lt;br /&gt;
cp $HOME/inputfiles/myinput.inp $SCRATCH/mycalculation&lt;br /&gt;
# switch directory&lt;br /&gt;
cd $SCRATCH/mycalculation&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
myprogram --input=$SCRATCH/mycalculation/myinput.inp&lt;br /&gt;
# calculation ends&lt;br /&gt;
# copy result&lt;br /&gt;
cp outfile.out results2.txt $HOME/resultdir/job12345&lt;br /&gt;
# clean up&lt;br /&gt;
rm myinput.inp outfile.out results2.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are many files or you don&#039;t know exactly what the output files are named, you can just create a tar archive of the whole directory (in HOME) instead of using cp:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
tar -cvzf $HOME/resultdir/mycalculation-${SLURM_JOB_ID}.tgz .  # archive the current directory&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
$SLURM_JOB_ID contains the Slurm job ID during the run on the node and thus makes sure the filename is unique. &lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:12px; background:#cef2e0;  text-align:left&amp;quot; |&lt;br /&gt;
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. &amp;lt;br&amp;gt; To find them, e.g. for LAMMPS: &amp;lt;code&amp;gt; module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la&amp;lt;/code&amp;gt;.&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Resource Requests ==&lt;br /&gt;
&lt;br /&gt;
Important resource request options for the Slurm command sbatch are:&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !!  Slurm (sbatch)&lt;br /&gt;
|-&lt;br /&gt;
| #SBATCH|| Script directive&lt;br /&gt;
|-&lt;br /&gt;
| --time=&amp;lt;hh:mm:ss&amp;gt; (-t &amp;lt;hh:mm:ss&amp;gt;)|| Wall time limit&lt;br /&gt;
|-&lt;br /&gt;
| --job-name=&amp;lt;name&amp;gt;  (-J &amp;lt;name&amp;gt;)|| Job name&lt;br /&gt;
|-&lt;br /&gt;
| --nodes=&amp;lt;count&amp;gt; (-N &amp;lt;count&amp;gt;)|| Node count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks=&amp;lt;count&amp;gt; (-n &amp;lt;count&amp;gt;)|| Core count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks-per-node=&amp;lt;count&amp;gt;|| Process count per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem=&amp;lt;limit&amp;gt;|| Memory limit per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem-per-cpu=&amp;lt;limit&amp;gt;|| Memory limit per process&lt;br /&gt;
|-&lt;br /&gt;
| --gres=gpu:&amp;lt;count&amp;gt;|| GPU count (gres = &amp;quot;generic resource&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| --gres=scratch:&amp;lt;count&amp;gt; || Disk space of &amp;lt;count&amp;gt; GB per requested task&lt;br /&gt;
|-&lt;br /&gt;
| --exclusive|| Node exclusive job&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Nodes and Cores&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm provides a number of options to request nodes and cores.&lt;br /&gt;
Typically, using &amp;lt;code&amp;gt;--nodes=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--ntasks-per-node=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; should work for all your jobs. For single-core jobs, it is sufficient to use the option &amp;lt;code&amp;gt;--ntasks=1&amp;lt;/code&amp;gt;. Specifying only &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; may lead to Slurm distributing tasks over more than one node even if you requested a small number of cores.&lt;br /&gt;
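&lt;br /&gt;
A typical request for a parallel single-node job could therefore look like this (a minimal sketch):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=24&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;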
&lt;br /&gt;
&#039;&#039;&#039;Memory&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Memory can be requested with either the option &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per node) or &amp;lt;code&amp;gt;--mem-per-cpu=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per process). When looking up the maximum available memory for a certain node type, subtract about 5 GB for the operating system. Specify the memory limit as a value-unit pair, for example 500mb or 8gb.&lt;br /&gt;
&lt;br /&gt;
In most cases it is preferable to use the &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPUs&#039;&#039;&#039; and &#039;&#039;&#039;Scratch&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
These are requested as &amp;quot;generic resources&amp;quot; with &amp;lt;code&amp;gt;--gres=gpu:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--gres=scratch:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
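&lt;br /&gt;
For example, a job that needs two GPUs and 100 GB of local scratch could combine both in one comma-separated list (a minimal sketch):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --gres=gpu:2,scratch:100&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;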
&lt;br /&gt;
=== Default Values ===&lt;br /&gt;
Some values will be set by default if you do not specify them for your job. &lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Option || Equivalent To || Meaning&lt;br /&gt;
|-&lt;br /&gt;
|Runtime: || --time=02:00:00 || 2 hours&lt;br /&gt;
|-&lt;br /&gt;
|Nodes:  ||--nodes=1  ||one node&lt;br /&gt;
|-&lt;br /&gt;
|Tasks: || --ntasks-per-node=1  ||one task per node&lt;br /&gt;
|-&lt;br /&gt;
|Cores: || --cpus-per-task=1 ||one core per task&lt;br /&gt;
|-&lt;br /&gt;
|Memory: || --mem-per-cpu=2gb  ||2 GB per core&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==  &amp;quot;Exclusive User&amp;quot; Node Access Policy ==&lt;br /&gt;
&lt;br /&gt;
Nodes are exclusively allocated to one single user. However, multiple jobs (up to 48) from the same user can share a node.&lt;br /&gt;
&lt;br /&gt;
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each. &lt;br /&gt;
&lt;br /&gt;
The same applies to memory requests (see below).&lt;br /&gt;
&lt;br /&gt;
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
== Memory Limits ==&lt;br /&gt;
&lt;br /&gt;
The wait time of a job also depends largely on the amount of requested resources and on the number of nodes that can provide this amount of resources. This must be taken into account in particular when requesting a certain amount of memory.&lt;br /&gt;
&lt;br /&gt;
For example, a node with 192 GB RAM can only run jobs that request up to 187 GB of memory. The remaining amount is reserved for the operating system, system services and local file systems.&lt;br /&gt;
This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb, or --ntasks-per-node=48 and --mem-per-cpu=4gb), the job cannot run on one of the 456 &amp;quot;small&amp;quot; nodes but only on one of the &amp;quot;medium&amp;quot;, &amp;quot;large&amp;quot; or &amp;quot;fat&amp;quot; nodes. Unnecessarily limiting your jobs to a subset of nodes will increase your wait time and the wait time of others who actually need that amount of memory.&lt;br /&gt;
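&lt;br /&gt;
To stay eligible for the small nodes, a full-node job should therefore request at most 187 GB (a minimal sketch):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --mem=187gb&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;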
&lt;br /&gt;
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:&lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Node type !! Physical RAM on node !! Available RAM on node !! Number of suitable nodes &lt;br /&gt;
|-&lt;br /&gt;
|small|| 192 GB || 187 GB || 692 &lt;br /&gt;
|-&lt;br /&gt;
|medium|| 384 GB || 376 GB || 220&lt;br /&gt;
|-&lt;br /&gt;
|large|| 768 GB || 754 GB || 28&lt;br /&gt;
|-&lt;br /&gt;
|fat|| 1536 GB || 1510 GB || 8&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.&lt;br /&gt;
&lt;br /&gt;
= Testing Your Jobs = &lt;br /&gt;
&lt;br /&gt;
Justus2 has three compute nodes reserved for jobs with a walltime under 15 minutes. You can test whether your jobs start properly simply by specifying a short walltime, e.g. --time=00:14:00, and your job should start very quickly. &lt;br /&gt;
&lt;br /&gt;
= Monitoring Your Jobs =&lt;br /&gt;
&lt;br /&gt;
Always test things first with a few jobs before you roll out hundreds of jobs!&lt;br /&gt;
&lt;br /&gt;
Please ensure at minimum: &lt;br /&gt;
* Are my jobs using the number of cores I requested?&lt;br /&gt;
* Is my job using close to the amount of memory I requested?&lt;br /&gt;
&lt;br /&gt;
If you are running more than 1-10 jobs: &lt;br /&gt;
&lt;br /&gt;
* Are my jobs running for at least 10 minutes?&lt;br /&gt;
* Do my jobs scale reasonably well? &amp;amp;rarr; [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== squeue ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;| Do not run squeue and other Slurm commands in loops or via &amp;quot;watch&amp;quot;, so as not to saturate the Slurm daemon with RPC requests.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
After you have submitted the job, you can see it waiting using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
(also read the man page with &amp;lt;code&amp;gt;man squeue&amp;lt;/code&amp;gt; for more information on how to use the command)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
             6260301  standard r_60_b_2 ul_yxz1 PD       0:00      1 (AssocGrpMemRunMinutes)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output shows: &lt;br /&gt;
* JOBID: a unique number your job gets&lt;br /&gt;
* PARTITION: the cluster can be divided into partitions with different types of nodes&lt;br /&gt;
* NAME: the name you gave your job with the --job-name= option&lt;br /&gt;
* USER: your username&lt;br /&gt;
* ST: the state the job is in. R = running, PD = pending, CD = completed. See the man page for a full list of states. &lt;br /&gt;
* TIME: how long the job has been running&lt;br /&gt;
* NODES: how many nodes were requested&lt;br /&gt;
* NODELIST(REASON): either shows the node(s) the job is running on, or a reason why it hasn&#039;t started&lt;br /&gt;
&lt;br /&gt;
==scontrol==&lt;br /&gt;
&lt;br /&gt;
You can then show more info on one specific running job using the &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; command, e.g. for the job with ID 6260301 listed above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show job 6260301&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for the job with JobID 6260301&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show jobs&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for all your jobs&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol write batch_script 6260301 -&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays the job script of a running job. The &amp;quot;-&amp;quot; is a special filename which means &amp;quot;write to the terminal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
This does not work for jobs that have already completed. If enabled in Slurm, one can see those job scripts with &amp;lt;code&amp;gt;sacct -B -j 6260301&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Monitoring a Started Job ==&lt;br /&gt;
&lt;br /&gt;
After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; ssh n0603 &lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To get a live overview of the current resource usage on the node, use the command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;htop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
On the GPU nodes, the usage of the GPU(s) can be visualized using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;nvtop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Further, we provide the tool jobreport, which generates plots of the resource usage over time for a given job.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
creates an HTML file with these plots in the current directory. For convenience, the report can alternatively be sent by email using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport -E max.mustermann@uni-ulm.de 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Partitions =&lt;br /&gt;
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. amount of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users &#039;&#039;&#039;should not&#039;&#039;&#039; specify any partition &amp;quot;-p, --partition=&amp;lt;partition_name&amp;gt;&amp;quot; on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.&lt;br /&gt;
&lt;br /&gt;
= Job Priorities =&lt;br /&gt;
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors ]:&lt;br /&gt;
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.&lt;br /&gt;
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Notes:&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.  &lt;br /&gt;
&lt;br /&gt;
Fairshare does &#039;&#039;&#039;not&#039;&#039;&#039; introduce a fixed allotment, in the sense that a user&#039;s ability to run new jobs would be cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources in the long term, which would be unfair to groups that have not used their fair share for quite some time.&lt;br /&gt;
&lt;br /&gt;
Slurm features &#039;&#039;&#039;backfilling&#039;&#039;&#039;, meaning that the scheduler will start lower-priority jobs if doing so does not delay the expected start time of &#039;&#039;&#039;any&#039;&#039;&#039; higher-priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=161 video]&#039;&#039;&#039; gives an illustrative description of how backfilling works.&lt;br /&gt;
&lt;br /&gt;
In summary, an approximate model of Slurm&#039;s behavior for scheduling jobs is this:&lt;br /&gt;
&lt;br /&gt;
* Step 1: Can the job in position one (highest priority) start now?&lt;br /&gt;
* Step 2: If it can, remove it from the queue, start it, and continue with step 1.&lt;br /&gt;
* Step 3: If it cannot, look at the next job.&lt;br /&gt;
* Step 4: Can it start now, without delaying the start time of any job before it in the queue?&lt;br /&gt;
* Step 5: If it can, remove it from the queue, start it, recalculate which nodes are free, look at the next job, and continue with step 4.&lt;br /&gt;
* Step 6: If it cannot, look at the next job and continue with step 4.&lt;br /&gt;
&lt;br /&gt;
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.&lt;br /&gt;
&lt;br /&gt;
= Usage Limits/Throttling Policies =&lt;br /&gt;
&lt;br /&gt;
While the fairshare factor ensures fair long term balance of resource utilization between users and groups, there are additional usage limits that constrain the total cumulative resources at a given time. This is to prevent individual users from short term monopolizing large fractions of the whole cluster system.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum walltime&#039;&#039;&#039; for a job is &#039;&#039;&#039;14 days&#039;&#039;&#039; (336 hours)&lt;br /&gt;
  --time=336:00:00 or --time=14-0&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum number of cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920&#039;&#039;&#039; per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit for allocated memory also applies. If this limit is reached, new jobs will be queued (with REASON: AssocGrpCpuLimit) but only allowed to run after resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
* The maximum amount of &#039;&#039;&#039;remaining allocated core-minutes&#039;&#039;&#039; per user is &#039;&#039;&#039;3300000&#039;&#039;&#039; (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time decreases and eventually allows more jobs to start in a staggered way. This limit also &#039;&#039;&#039;correlates the maximum walltime and the number of cores that can be allocated&#039;&#039;&#039; for this amount of time. Thus, shorter walltimes for the jobs allow more resources to be allocated at a given time (but capped by the maximum number of cores limit above). Watch this &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=306 video]&#039;&#039;&#039; for an illustrative description. An equivalent limit applies for the remaining time of memory allocations, in which case jobs may be held back from starting with REASON AssocGrpMemRunMinutes.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum number of GPUs&#039;&#039;&#039; allocated by running jobs is &#039;&#039;&#039;8&#039;&#039;&#039; per user. If this limit is reached, new jobs will be queued (with REASON: AssocGrpGRES) but only allowed to run after GPU resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Usage limits are subject to change.&lt;br /&gt;
&lt;br /&gt;
= Efficiency / Use Cases =&lt;br /&gt;
&lt;br /&gt;
When we speak of poor job efficiency, we usually mean that hardware resources are wasted.&lt;br /&gt;
That means a similar overall result could have been achieved using fewer hardware resources, leaving those for other jobs and reducing the wait time for you and everyone else.&lt;br /&gt;
&lt;br /&gt;
The more resources you use, the more important efficiency becomes. If you run just 3-5 jobs that take under a day, go ahead and choose roughly sane defaults. If you submit hundreds or thousands of jobs - jobs that will accumulate years of CPU compute time (by using many CPU cores) - then think very carefully about your jobs and take some time to do trial runs until you are sure your calculations run well. &lt;br /&gt;
&lt;br /&gt;
Also consider these non-technical points:&lt;br /&gt;
* Does the calculation give me all the results I need? &lt;br /&gt;
&amp;amp;rarr; rerunning calculations is extremely wasteful&lt;br /&gt;
* Am I using the most efficient algorithm? &lt;br /&gt;
&amp;amp;rarr; using better algorithms can reduce the CPU time needed by an order of magnitude or two. Sometimes this can be as simple as arranging loops in a more clever way or avoiding slow storage. &lt;br /&gt;
&lt;br /&gt;
Some simple causes for poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
* using $HOME or work directories for scratch space (expressly forbidden for $HOME, discouraged for work directories, except for multinode jobs that specifically need this for communication)&lt;br /&gt;
* a poor choice of resources compared to the size of the nodes leaves part of the node blocked but doing nothing:&lt;br /&gt;
** no multiple of --ntasks-per-node matches the number of cores of a node (see section [[#&amp;quot;Exclusive User&amp;quot; Node Access Policy]])&lt;br /&gt;
** too much (un-needed) memory or disk space requested&lt;br /&gt;
* more cores requested than are actually used by the job&lt;br /&gt;
* more cores used for a single MPI/OpenMP parallel computation than useful&lt;br /&gt;
* many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
* one-core jobs with very different run times (because of the single-user policy)&lt;br /&gt;
* not using the full node capacity&lt;br /&gt;
* using more cores than your computational problem can be split into &amp;amp;rarr; see [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== User-exclusive Nodes on Justus2 ==&lt;br /&gt;
&lt;br /&gt;
For several reasons, Justus2 nodes are assigned to one user exclusively. That means that you are responsible for using the full compute node efficiently, as no jobs from other users can fill gaps!&lt;br /&gt;
&lt;br /&gt;
Several key points for accomplishing that:&lt;br /&gt;
&lt;br /&gt;
* Use divisors of the core number (see the sketch after this list):&lt;br /&gt;
&lt;br /&gt;
The Justus2 nodes have 48 cores: two sockets with 24 cores each. Use divisors of 48 so that all cores of the node can be used (e.g. 8). Be aware that when you choose 16, one job will be executed half on one CPU and half on the other. This might be suboptimal.&lt;br /&gt;
* Be aware of memory resources: &lt;br /&gt;
&lt;br /&gt;
When you request more memory per core than the &amp;quot;small&amp;quot; nodes on Justus2 have per core, your jobs will not be able to use all cores on the small nodes - or will have to wait for the rarer slots on the nodes with more memory. Try to estimate your memory requirements well, and if you need more than 3.8 GB per core, consider mixing in jobs with lower memory requirements to fill the nodes.&lt;br /&gt;
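&lt;br /&gt;
For example, four 12-core jobs fill a 48-core node exactly (a minimal sketch):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=12  # 48 / 12 = 4 such jobs fit on one node&lt;br /&gt;
#SBATCH --mem=45gb            # roughly a quarter of the usable 187 GB&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;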
&lt;br /&gt;
== Many One or Few-Core Jobs ==&lt;br /&gt;
&lt;br /&gt;
Jobs that use only a few CPU cores can lead to very inefficient node usage:&lt;br /&gt;
&lt;br /&gt;
# You submit 1000 jobs, each of which runs for ~30s. Jobs need up to 30s to start and finish - a huge waste if the job itself only takes 30 seconds. Additionally, starting and finishing so many jobs in a short time puts strain on the scheduler Slurm, may cause severe problems for everyone, and clutters the Slurm job database. &lt;br /&gt;
# Many few-core jobs with very different run times. The jobs will start on many nodes, but at some point all quicker jobs have finished their calculation and only a few remain. Because of the single-user policy on JUSTUS2, jobs of other users cannot fill in the gaps and the rest of the node is idle. &lt;br /&gt;
&lt;br /&gt;
To address the problem, you can reduce the number of jobs and/or the number of nodes used.&lt;br /&gt;
&lt;br /&gt;
To limit the number of jobs, start many calculations within one job (problems 1 and 2):&lt;br /&gt;
&lt;br /&gt;
* use a bash loop in your job script&lt;br /&gt;
* use the program GNU parallel to start the processes for you&lt;br /&gt;
&lt;br /&gt;
To limit only the number of nodes used:&lt;br /&gt;
&lt;br /&gt;
* use array jobs&lt;br /&gt;
&lt;br /&gt;
=== Bash Loop ===&lt;br /&gt;
&lt;br /&gt;
One advantage of this method is that you can run more threads than cores if your jobs are really short and do not use too much RAM, and in this way keep all cores busy even while many calculations are still starting up.&lt;br /&gt;
&lt;br /&gt;
It is of course even better if you can combine such short calculations so that, for 1000 calculations, the kernel does not need to start 1000 processes, each of which needs to initialize everything. &lt;br /&gt;
&lt;br /&gt;
This example uses pgrep to count how many jobs are running: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
 &lt;br /&gt;
for i in {1..200}&lt;br /&gt;
do&lt;br /&gt;
  echo &amp;quot;starting up $i&amp;quot;&lt;br /&gt;
  bash my_calculation $i &amp;amp;  # start the calculation in the background&lt;br /&gt;
  # throttle: wait while more than 48 instances are running&lt;br /&gt;
  while [ $(pgrep -c -f my_calculation) -gt 48 ]; do echo sleeping; sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The same, but tracking the PIDs (process IDs) of the started processes. This is more robust, but harder to read:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
running_jobs=()&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}; do&lt;br /&gt;
  echo &amp;quot;Starting job $i&amp;quot;&lt;br /&gt;
  sleep &amp;quot;$i&amp;quot; &amp;amp;  # placeholder for the real calculation&lt;br /&gt;
  running_jobs+=($!)  # Track PID&lt;br /&gt;
&lt;br /&gt;
  # throttle: wait while 8 or more jobs are running (use 48 on a full node)&lt;br /&gt;
  while [ &amp;quot;${#running_jobs[@]}&amp;quot; -ge 8 ]; do&lt;br /&gt;
    sleep 2  # adjust duration depending on your runtime&lt;br /&gt;
    echo &amp;quot;running_jobs: ${running_jobs[@]}&amp;quot;&lt;br /&gt;
    echo &amp;quot;pid-out: $(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null | xargs)&amp;quot;&lt;br /&gt;
    echo -----&lt;br /&gt;
    running_jobs=($(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null))  # Remove finished jobs&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
wait  # Ensure all jobs complete&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may not be able to just use an index number &amp;quot;i&amp;quot; to start many calculations. In that case, and if there are not too many files, a for loop can be used to read in config files. Here is just the general idea for the loop, without the throttling logic shown above: &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
This loops over all files in the directory config-1980-03-01_1/ and passes each one as an input file to &amp;quot;mycalculation&amp;quot; via a hypothetical &amp;quot;-config&amp;quot; option. Adding a date to the config directories (and outputs) makes it easier to track different runs in your lab journal.&lt;br /&gt;
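&lt;br /&gt;
To run such config-driven calculations in parallel within one job, the loop can be combined with the backgrounding and throttling shown above. A minimal sketch (the program and its &amp;quot;-config&amp;quot; option are hypothetical; &amp;lt;code&amp;gt;wait -n&amp;lt;/code&amp;gt; requires bash 4.3 or newer):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=01:00:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
&lt;br /&gt;
max_jobs=48  # at most one process per requested core&lt;br /&gt;
&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot; &amp;amp;&lt;br /&gt;
  # throttle: once max_jobs processes are running, wait for any one to finish&lt;br /&gt;
  while [ &amp;quot;$(jobs -rp | wc -l)&amp;quot; -ge &amp;quot;$max_jobs&amp;quot; ]; do&lt;br /&gt;
    wait -n&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
wait  # wait for the remaining calculations&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;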
&lt;br /&gt;
=== Gnu Parallel ===&lt;br /&gt;
&lt;br /&gt;
GNU Parallel is available on the HPC cluster and comes with its own set of examples; you can access them like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ module load system/parallel&lt;br /&gt;
$ cp $PARALLEL_EXA_DIR/parallel.slurm .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
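&lt;br /&gt;
For example, a job script could let GNU Parallel keep 48 calculations running at a time (a minimal sketch, assuming the hypothetical &amp;quot;my_calculation&amp;quot; script from the loop examples above):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
&lt;br /&gt;
module load system/parallel&lt;br /&gt;
# run &amp;quot;bash my_calculation 1&amp;quot; ... &amp;quot;bash my_calculation 200&amp;quot;, at most 48 at a time&lt;br /&gt;
parallel -j 48 bash my_calculation {} ::: {1..200}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;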
=== Array Jobs ===&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;$ sbatch -a 1-500%48 batch_script&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will submit 500 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 500, but will limit the number of simultaneously running tasks from this job array to 48 (the number of cores on a Justus2 node).&lt;br /&gt;
&lt;br /&gt;
The same can be done inside the job script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Number of cores per individual array task&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --array=1-500%48&lt;br /&gt;
#SBATCH --mem=3G&lt;br /&gt;
#SBATCH --time=1:10:00&lt;br /&gt;
#SBATCH --job-name=array_job&lt;br /&gt;
#SBATCH --output=array_job-%A_%a.out&lt;br /&gt;
#SBATCH --error=array_job-%A_%a.err&lt;br /&gt;
 &lt;br /&gt;
# Print the task id.&lt;br /&gt;
echo &amp;quot;My SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
export TIMEFORMAT=%R  # print only the elapsed real time&lt;br /&gt;
time bash mycalculation $SLURM_ARRAY_TASK_ID&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see: &lt;br /&gt;
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]]&lt;br /&gt;
* Schedmd documentations on Job Arrays: https://slurm.schedmd.com/job_array.html&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14889</id>
		<title>JUSTUS2/Jobscripts: Running Your Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14889"/>
		<updated>2025-06-04T07:57:25Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Monitoring a Started Job */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Justus2}}&lt;br /&gt;
&lt;br /&gt;
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. &lt;br /&gt;
&lt;br /&gt;
= JUSTUS 2 Slurm Howto =&lt;br /&gt;
&lt;br /&gt;
This page presents only a basic introduction. &lt;br /&gt;
&lt;br /&gt;
Please see the &#039;&#039;&#039;[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]&#039;&#039;&#039; for many more examples and commands for common tasks, or consult the original Slurm documentation.&lt;br /&gt;
&lt;br /&gt;
= Slurm Command Overview =&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs &lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job&lt;br /&gt;
|- &lt;br /&gt;
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job&lt;br /&gt;
|- &lt;br /&gt;
| seff  || Shows the &amp;quot;job efficiency&amp;quot; of a job after it has finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Submitting Jobs on the bwForCluster JUSTUS 2 =&lt;br /&gt;
Batch jobs are submitted with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job script contains options for Slurm in lines beginning with #SBATCH, as well as the commands you want to execute on the compute nodes. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=06:00:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can override options from the script on the command-line:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch --time=00:14:00 &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: --time=00:14:00 should start your job very quickly. See [[#Testing Your Jobs]].&lt;br /&gt;
&lt;br /&gt;
Note: &amp;lt;font color=&amp;quot;red&amp;quot;&amp;gt; Compute jobs must not write/read calculation swap files on the global file systems (HOME and WORK). &amp;lt;/font&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Use local storage /tmp in the ramdisk for small files or /scratch (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVME]]) for this purpose.&lt;br /&gt;
&lt;br /&gt;
To keep the calculation off the central file systems, you must often configure the program you are using to write temporary files elsewhere. &lt;br /&gt;
&lt;br /&gt;
If the program uses the current directory to look for files, you must copy the files to a temporary directory - and copy/save the results of the calculation at the end, otherwise your results get deleted by the automated cleanup happening after the job.&lt;br /&gt;
&lt;br /&gt;
The diskless nodes have a disk in RAM, which can have a maximum size of half the total RAM. Note that the files created plus the memory requirements of your job need to fit into the total memory. &lt;br /&gt;
&lt;br /&gt;
There are more diskless nodes than nodes with disks, so if your job can run on a diskless node, you should choose this option. &lt;br /&gt;
&lt;br /&gt;
Example job script requesting 700 GB of scratch space and copying files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
#SBATCH --gres=scratch:700 &lt;br /&gt;
mkdir -p $SCRATCH/mycalculation&lt;br /&gt;
# copy input file&lt;br /&gt;
cp $HOME/inputfiles/myinput.inp $SCRATCH/mycalculation&lt;br /&gt;
# switch directory&lt;br /&gt;
cd $SCRATCH/mycalculation&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
myprogram --input=$SCRATCH/mycalculation/myinput.inp&lt;br /&gt;
# calculation ends&lt;br /&gt;
# copy result&lt;br /&gt;
cp outfile.out results2.txt $HOME/resultdir/job12345&lt;br /&gt;
# clean up&lt;br /&gt;
rm myinput.inp outfile.out results2.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are many files or you don&#039;t know exactly what the output files are named, you can just tar the whole directory instead of using cp:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
tar -cvzf $HOME/resultdir/mycalculation-${SLURM_JOB_ID}.tgz .  # archive the current directory&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
$SLURM_JOB_ID contains the Slurm job ID during the run on the node and thus makes sure the filename is unique. &lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:12px; background:#cef2e0;  text-align:left&amp;quot; |&lt;br /&gt;
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. &amp;lt;br&amp;gt; To find them, e.g. for LAMMPS: &amp;lt;code&amp;gt; module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la&amp;lt;/code&amp;gt;.&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Resource Requests ==&lt;br /&gt;
&lt;br /&gt;
Important resource request options for the Slurm command sbatch are:&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !!  Slurm (sbatch)&lt;br /&gt;
|-&lt;br /&gt;
| #SBATCH|| Script directive&lt;br /&gt;
|-&lt;br /&gt;
| --time=&amp;lt;hh:mm:ss&amp;gt; (-t &amp;lt;hh:mm:ss&amp;gt;)|| Wall time limit&lt;br /&gt;
|-&lt;br /&gt;
| --job-name=&amp;lt;name&amp;gt;  (-J &amp;lt;name&amp;gt;)|| Job name&lt;br /&gt;
|-&lt;br /&gt;
| --nodes=&amp;lt;count&amp;gt; (-N &amp;lt;count&amp;gt;)|| Node count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks=&amp;lt;count&amp;gt; (-n &amp;lt;count&amp;gt;)|| Core count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks-per-node=&amp;lt;count&amp;gt;|| Process count per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem=&amp;lt;limit&amp;gt;|| Memory limit per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem-per-cpu=&amp;lt;limit&amp;gt;|| Memory limit per process&lt;br /&gt;
|-&lt;br /&gt;
| --gres=gpu:&amp;lt;count&amp;gt;|| GPU count (gres = &amp;quot;generic resource&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| --gres=scratch:&amp;lt;count&amp;gt; || Disk space of &amp;lt;count&amp;gt; GB per requested task&lt;br /&gt;
|-&lt;br /&gt;
| --exclusive|| Node exclusive job&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Nodes and Cores&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm provides a number of options to request nodes and cores.&lt;br /&gt;
Typically, using &amp;lt;code&amp;gt;--nodes=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--ntasks-per-node=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; should work for all your jobs. For single-core jobs, it is sufficient to use the option &amp;lt;code&amp;gt;--ntasks=1&amp;lt;/code&amp;gt;. Specifying only &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; may lead to Slurm distributing tasks over more than one node even if you requested a small number of cores.&lt;br /&gt;
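&lt;br /&gt;
A typical request for a parallel single-node job could therefore look like this (a minimal sketch):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=24&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;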
&lt;br /&gt;
&#039;&#039;&#039;Memory&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Memory can be requested with either the option &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per node) or &amp;lt;code&amp;gt;--mem-per-cpu=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per process). When looking up the maximum available memory for a certain node type, subtract about 5 GB for the operating system. Specify the memory limit as a value-unit pair, for example 500mb or 8gb.&lt;br /&gt;
&lt;br /&gt;
In most cases it is preferable to use the &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPUs&#039;&#039;&#039; and &#039;&#039;&#039;Scratch&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
These are requested as &amp;quot;generic resources&amp;quot; with &amp;lt;code&amp;gt;--gres=gpu:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--gres=scratch:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
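&lt;br /&gt;
For example, a job that needs two GPUs and 100 GB of local scratch could combine both in one comma-separated list (a minimal sketch):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --gres=gpu:2,scratch:100&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;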
&lt;br /&gt;
=== Default Values ===&lt;br /&gt;
Some values will be set by default if you do not specify them for your job. &lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Option || Equivalent To || Meaning&lt;br /&gt;
|-&lt;br /&gt;
|Runtime: || --time=02:00:00 || 2 hours&lt;br /&gt;
|-&lt;br /&gt;
|Nodes:  ||--nodes=1  ||one node&lt;br /&gt;
|-&lt;br /&gt;
|Tasks: || --ntasks-per-node=1  ||one task per node&lt;br /&gt;
|-&lt;br /&gt;
|Cores: || --cpus-per-task=1 ||one core per task&lt;br /&gt;
|-&lt;br /&gt;
|Memory: || --mem-per-cpu=2gb  ||2 GB per core&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==  &amp;quot;Exclusive User&amp;quot; Node Access Policy ==&lt;br /&gt;
&lt;br /&gt;
Nodes are exclusively allocated to one single user. However, multiple jobs (up to 48) from the same user can share a node.&lt;br /&gt;
&lt;br /&gt;
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each. &lt;br /&gt;
&lt;br /&gt;
The same applies to memory requests (see below).&lt;br /&gt;
&lt;br /&gt;
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
== Memory Limits ==&lt;br /&gt;
&lt;br /&gt;
The wait time of a job also depends largely on the amount of requested resources and on the number of nodes that can provide this amount of resources. This must be taken into account in particular when requesting a certain amount of memory.&lt;br /&gt;
&lt;br /&gt;
For example, a node with 192 GB RAM can only run jobs that request up to 187 GB of memory. The remaining amount is reserved for the operating system, system services and local file systems.&lt;br /&gt;
This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb, or --ntasks-per-node=48 and --mem-per-cpu=4gb), the job cannot run on one of the 456 &amp;quot;small&amp;quot; nodes but only on one of the &amp;quot;medium&amp;quot;, &amp;quot;large&amp;quot; or &amp;quot;fat&amp;quot; nodes. Unnecessarily limiting your jobs to a subset of nodes will increase your wait time and the wait time of others who actually need that amount of memory.&lt;br /&gt;
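&lt;br /&gt;
To stay eligible for the small nodes, a full-node job should therefore request at most 187 GB (a minimal sketch):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --mem=187gb&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;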
&lt;br /&gt;
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:&lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Node type !! Physical RAM on node !! Available RAM on node !! Number of suitable nodes &lt;br /&gt;
|-&lt;br /&gt;
|small|| 192 GB || 187 GB || 692 &lt;br /&gt;
|-&lt;br /&gt;
|medium|| 384 GB || 376 GB || 220&lt;br /&gt;
|-&lt;br /&gt;
|large|| 768 GB || 754 GB || 28&lt;br /&gt;
|-&lt;br /&gt;
|fat|| 1536 GB || 1510 GB || 8&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.&lt;br /&gt;
&lt;br /&gt;
= Testing Your Jobs = &lt;br /&gt;
&lt;br /&gt;
Justus2 has three compute nodes reserved for jobs with a walltime under 15 minutes. You can test whether your jobs start properly simply by specifying a short walltime, e.g. --time=00:14:00, and your job should start very quickly. &lt;br /&gt;
&lt;br /&gt;
= Monitoring Your Jobs =&lt;br /&gt;
&lt;br /&gt;
Test things first with a few jobs before you roll out hundreds of jobs!&lt;br /&gt;
&lt;br /&gt;
Please ensure at minimum: &lt;br /&gt;
* Are my jobs using the number of cores I requested?&lt;br /&gt;
* Is my job using close to the amount of memory I requested?&lt;br /&gt;
&lt;br /&gt;
If you are running more than 1-5 jobs: &lt;br /&gt;
&lt;br /&gt;
* Are my jobs running for at least 10 minutes?&lt;br /&gt;
* Do my jobs scale reasonably well? &amp;amp;rarr; [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== squeue ==&lt;br /&gt;
&lt;br /&gt;
After you have submitted the job, you can see it waiting using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
(also read the man page with &amp;lt;code&amp;gt;man squeue&amp;lt;/code&amp;gt; for more information on how to use the command)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
             6260301  standard r_60_b_2 ul_yxz1 PD       0:00      1 (AssocGrpMemRunMinutes)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output shows: &lt;br /&gt;
* JOBID: a unique number your job gets&lt;br /&gt;
* PARTITION: the cluster can be divided into partitions with different types of nodes&lt;br /&gt;
* NAME: the name you gave your job with the --job-name= option&lt;br /&gt;
* USER: your username&lt;br /&gt;
* ST: the state the job is in. R = running, PD = pending, CD = completed. See the man page for a full list of states. &lt;br /&gt;
* TIME: how long the job has been running&lt;br /&gt;
* NODES: how many nodes were requested&lt;br /&gt;
* NODELIST(REASON): either shows the node(s) the job is running on, or a reason why it hasn&#039;t started&lt;br /&gt;
&lt;br /&gt;
==scontrol==&lt;br /&gt;
&lt;br /&gt;
You can then show more info on one specific running job using the &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; command, e.g. for the job with ID 6260301 listed above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show job 6260301&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for the job with JobID 6260301&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show jobs&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for all your jobs&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol write batch_script 6260301 -&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays the job script of a running job. The &amp;quot;-&amp;quot; is a special filename which means &amp;quot;write to the terminal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
This does not work for jobs that have already completed. If enabled in Slurm, one can see those job scripts with &amp;lt;code&amp;gt;sacct -B -j 6260301&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Monitoring a Started Job ==&lt;br /&gt;
&lt;br /&gt;
After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; ssh n0603 &lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To get a live overview of the current resource usage on the node, use the command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;htop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
On the GPU nodes, the usage of the GPU(s) can be visualized using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;nvtop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Further, we provide the tool jobreport, which generates plots of the resource usage over time for a given job.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
creates an HTML file with these plots in the current directory. For convenience, the report can alternatively be sent by email using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport -E max.mustermann@uni-ulm.de 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Partitions =&lt;br /&gt;
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. amount of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users &#039;&#039;&#039;should not&#039;&#039;&#039; specify any partition &amp;quot;-p, --partition=&amp;lt;partition_name&amp;gt;&amp;quot; on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.&lt;br /&gt;
&lt;br /&gt;
= Job Priorities =&lt;br /&gt;
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors ]:&lt;br /&gt;
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.&lt;br /&gt;
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Notes:&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.  &lt;br /&gt;
&lt;br /&gt;
Fairshare does &#039;&#039;&#039;not&#039;&#039;&#039; introduce a fixed allotment, in the sense that a user&#039;s ability to run new jobs would be cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources in the long term, which would be unfair to groups that have not used their fair share for quite some time.&lt;br /&gt;
&lt;br /&gt;
Slurm features &#039;&#039;&#039;backfilling&#039;&#039;&#039;, meaning that the scheduler will start lower-priority jobs if doing so does not delay the expected start time of &#039;&#039;&#039;any&#039;&#039;&#039; higher-priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=161 video]&#039;&#039;&#039; gives an illustrative description of how backfilling works.&lt;br /&gt;
&lt;br /&gt;
In summary, an approximate model of Slurm&#039;s behavior for scheduling jobs is this:&lt;br /&gt;
&lt;br /&gt;
* Step 1: Can the job in position one (highest priority) start now?&lt;br /&gt;
* Step 2: If it can, remove it from the queue, start it, and continue with step 1.&lt;br /&gt;
* Step 3: If it cannot, look at the next job.&lt;br /&gt;
* Step 4: Can it start now, without delaying the start time of any job before it in the queue?&lt;br /&gt;
* Step 5: If it can, remove it from the queue, start it, recalculate which nodes are free, look at the next job, and continue with step 4.&lt;br /&gt;
* Step 6: If it cannot, look at the next job and continue with step 4.&lt;br /&gt;
&lt;br /&gt;
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.&lt;br /&gt;
&lt;br /&gt;
= Usage Limits/Throttling Policies =&lt;br /&gt;
&lt;br /&gt;
While the fairshare factor ensures fair long term balance of resource utilization between users and groups, there are additional usage limits that constrain the total cumulative resources at a given time. This is to prevent individual users from short term monopolizing large fractions of the whole cluster system.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum walltime&#039;&#039;&#039; for a job is &#039;&#039;&#039;14 days&#039;&#039;&#039; (336 hours)&lt;br /&gt;
  --time=336:00:00 or --time=14-0&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum number of cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920&#039;&#039;&#039; per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit for allocated memory also applies. If this limit is reached, new jobs will be queued (with REASON: AssocGrpCpuLimit) but only allowed to run after resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
* The maximum amount of &#039;&#039;&#039;remaining allocated core-minutes&#039;&#039;&#039; per user is &#039;&#039;&#039;3300000&#039;&#039;&#039; (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time decreases and eventually allows more jobs to start in a staggered way. This limit also &#039;&#039;&#039;correlates the maximum walltime and the number of cores that can be allocated&#039;&#039;&#039; for this amount of time. Thus, shorter walltimes for the jobs allow more resources to be allocated at a given time (but capped by the maximum number of cores limit above). Watch this &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=306 video]&#039;&#039;&#039; for an illustrative description. An equivalent limit applies for the remaining time of memory allocations, in which case jobs may be held back from starting with REASON AssocGrpMemRunMinutes.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum amount of GPUs&#039;&#039;&#039; allocated by running jobs is &#039;&#039;&#039;8&#039;&#039;&#039; per user. If this limit it reached new jobs will be queued (with REASON: AssocGrpGRES) but only allowed to run after GPU resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Usage limits are subject to change.&lt;br /&gt;
&lt;br /&gt;
= Efficiency / Use Cases =&lt;br /&gt;
&lt;br /&gt;
When we speak of poor job efficiency, we usually mean that hardware resources are wasted.&lt;br /&gt;
That means, a similar overall result could have been achieved using less hardware resources, leaving those for other jobs and reducing the wait time for you and everyone.&lt;br /&gt;
&lt;br /&gt;
The more resources you use, the more important efficiency becomes. If you just run 3-5 jobs that take under a day, just go ahead and choose roughly sane defaults. If you submit hundreds or thousands of jobs - jobs that will accumulate years of CPU compute time (by using many CPU cores), then think very carefully about your jobs and take some time to do trial runs until you are sure your calculations are run well. &lt;br /&gt;
&lt;br /&gt;
Also consider these non-technical things:&lt;br /&gt;
* does the calculation give me all the results I need? &lt;br /&gt;
&amp;amp;rarr; rerunning calculations is extremely wasteful&lt;br /&gt;
* am I using the most efficient algorithm? &lt;br /&gt;
&amp;amp;rarr; using better algorithms can reduce the CPU time needed by an order of magnitude or two. And this can sometimes be something as simple as arranging loops in a more clever way or avoiding slow storage. &lt;br /&gt;
&lt;br /&gt;
Some simple causes for poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
* using $HOME or work directories for scratch space (expressively forbidden for $HOME, discouraged for work directories, except for multinode jobs that specifically need this for communication)&lt;br /&gt;
*    poor choice of resources compared to the size of the nodes leaves part of the node blocked, but doing nothing:&lt;br /&gt;
** multiple of --ntasks-per-node is not the number of cores of a node (see section [[#&amp;quot;Exclusive User&amp;quot; Node Access Policy]])&lt;br /&gt;
** too much (un-needed) memory or disk space requested&lt;br /&gt;
*    more cores requested than are actually used by the job&lt;br /&gt;
*    more cores used for a single mpi/openmp parallel computation than useful&lt;br /&gt;
*    many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
*    one-core jobs with very different run-times (because of single-user policy)&lt;br /&gt;
*    not using full node capacity&lt;br /&gt;
*    using more cores than what your computational problem can be split into &amp;amp;rarr; see [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== User-exclusive Nodes on Justus2 ==&lt;br /&gt;
&lt;br /&gt;
For several reasons, Justus2 nodes are assigned to one user exclusively. That means that you are responsible for using the full compute node efficiently, as no jobs from other users can fill gaps!&lt;br /&gt;
&lt;br /&gt;
Several key points to accomplishing that:&lt;br /&gt;
&lt;br /&gt;
* Use dividers of the core number:&lt;br /&gt;
&lt;br /&gt;
The Justus2 nodes have 48 cores, two sockets with 24 cores each. Use dividers of 48 to be able to use all cores of the node (e.g. 8). Be aware, that when you choose 16, one job will be executed half on one of the CPUs and the other half on the other. This might be suboptimal.&lt;br /&gt;
* Be aware of memory resources: &lt;br /&gt;
&lt;br /&gt;
When you request more memory per core than the &amp;quot;small&amp;quot; nodes on Justus2 have per core, your jobs will not be able to use all cores on the small nodes - or will have to wait for the rarer spaces on the nodes with more memory. Try to estimate your memory requirements well and if you need more than 3.8GB per core, consider mixing in jobs with lower memory requirements to fill the nodes&lt;br /&gt;
&lt;br /&gt;
== Many One or Few-Core Jobs ==&lt;br /&gt;
&lt;br /&gt;
Jobs that use only a few CPU cores can lead to very inefficient node usage:&lt;br /&gt;
&lt;br /&gt;
# You submit 1000 jobs, each runs for ~30s.  Jobs need up to 30s to start and finish - a huge waste if the job only takes 30 seconds. Additionally, the starting and finishing of so many jobs in a short time causes strain on the scheduler SLURM and may cause severe problems for everyone and clutter the SLURM job database. &lt;br /&gt;
# many few-core jobs with very different run times. The jobs will start on many nodes, but at some time all quicker jobs have finished the calculation and only a few remain. Because of the single-user policy on JUSTUS2, jobs of other users cannot fill in the gaps and the rest of the node is idle. &lt;br /&gt;
&lt;br /&gt;
To address the problem, you can reduce the amount of jobs and/or the amount of nodes used.&lt;br /&gt;
&lt;br /&gt;
To limit the amount of jobs, start many calculations within one job (problem 1. and 2.):&lt;br /&gt;
&lt;br /&gt;
* use a bash loop in your job script&lt;br /&gt;
* use the program GNU parallel to start the processes for you&lt;br /&gt;
&lt;br /&gt;
To only limit the amount of nodes used:&lt;br /&gt;
&lt;br /&gt;
* use array jobs&lt;br /&gt;
&lt;br /&gt;
=== Bash Loop ===&lt;br /&gt;
&lt;br /&gt;
One advantage of this method is, that you can run more threads than cores if your jobs are really short and do not use too much RAM memory and in this way keep all cores busy even if many calculations are still starting up.&lt;br /&gt;
&lt;br /&gt;
It is of course even better, if you can combine such short calculations in a way that for 1000 calculations the kernel does not need to start 1000 processes which in turn need to initialize everything. &lt;br /&gt;
&lt;br /&gt;
This example uses pgrep to count how many jobs are running: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
 &lt;br /&gt;
for i in {1..200}&lt;br /&gt;
do&lt;br /&gt;
  echo starting up $i&lt;br /&gt;
   bash my_calculation $i  &amp;amp;&lt;br /&gt;
   while [ $(pgrep -c -f my_calculation) -gt 48 ] ; do echo sleeping; sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The same, but by tracking the PIDs (process IDs) of the started processes. This is more robust, but is more difficult to read:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
running_jobs=()&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}; do&lt;br /&gt;
  echo &amp;quot;Starting job $i&amp;quot;&lt;br /&gt;
  sleep &amp;quot;$i&amp;quot; &amp;amp;  &lt;br /&gt;
  running_jobs+=($!)  # Track PID&lt;br /&gt;
&lt;br /&gt;
  while [ &amp;quot;${#running_jobs[@]}&amp;quot; -ge 8 ]; do&lt;br /&gt;
    sleep 2 # adjust duration depending on your runtime&lt;br /&gt;
    echo running_jobs: ${running_jobs[@]} &lt;br /&gt;
    echo pid-out: $(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null | xargs)&lt;br /&gt;
    echo -----&lt;br /&gt;
    running_jobs=($(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null))  # Remove finished jobs&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
wait  # Ensure all jobs complete&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may not be able to just use an index number &amp;quot;i&amp;quot; to start many calculations. In this case, for not too many files, the for loop could be used to read in config files. Here just the general idea for the for loop without &lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
This loops over all files in the directory config-1980-03-01_1/ and gives them as an input file to &amp;quot;mycalculation&amp;quot; via a hypothetical &amp;quot;-config&amp;quot; option. Adding a date to the config-dirs (and outputs) would enable you to track different runs in your lab journal more easily.&lt;br /&gt;
&lt;br /&gt;
=== Gnu Parallel ===&lt;br /&gt;
&lt;br /&gt;
Gnu Parallel is available on the HPC Cluster and comes with its own set of examples, you can access them like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ module load system/parallel&lt;br /&gt;
$ cp $PARALLEL_EXA_DIR/parallel.slurm .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
=== Array Jobs ===&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;$ sbatch -a 1-500%48 batch_script&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will submit 500 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 200, but will limit the number of simultaneously running tasks from this job array to 48 (number of cores on a Justus2 node).&lt;br /&gt;
&lt;br /&gt;
Thee same can be done inside the job script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Number of cores per individual array task&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --array=1-500%48&lt;br /&gt;
#SBATCH --mem=3G&lt;br /&gt;
#SBATCH --time=1:10:00&lt;br /&gt;
#SBATCH --job-name=array_job&lt;br /&gt;
#SBATCH --output=array_job-%A_%a.out&lt;br /&gt;
#SBATCH --error=array_job-%A_%a.err&lt;br /&gt;
 &lt;br /&gt;
# Print the task id.&lt;br /&gt;
echo &amp;quot;My SLURM_ARRAY_TASK_ID: &amp;quot; $SLURM_ARRAY_TASK_ID&lt;br /&gt;
 &lt;br /&gt;
export  TIMEFORMAT=%R ; &lt;br /&gt;
time bash mycalculation $SLURM_ARRAY_TASK_ID&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see: &lt;br /&gt;
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]]&lt;br /&gt;
* Schedmd documentations on Job Arrays: https://slurm.schedmd.com/job_array.html&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14888</id>
		<title>JUSTUS2/Jobscripts: Running Your Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14888"/>
		<updated>2025-06-04T07:56:46Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Monitoring Your Jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Justus2}}&lt;br /&gt;
&lt;br /&gt;
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. &lt;br /&gt;
&lt;br /&gt;
= JUSTUS 2 Slurm Howto =&lt;br /&gt;
&lt;br /&gt;
This page presents only a basic introduction. &lt;br /&gt;
&lt;br /&gt;
Please see the &#039;&#039;&#039;[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]&#039;&#039;&#039; for many more examples and commands for common tasks, or the original Slurm documentation.&lt;br /&gt;
&lt;br /&gt;
= Slurm Command Overview =&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs &lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job&lt;br /&gt;
|- &lt;br /&gt;
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job&lt;br /&gt;
|- &lt;br /&gt;
| seff  || Shows the &amp;quot;job efficiency&amp;quot; of a job after it has finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Submitting Jobs on the bwForCluster JUSTUS 2 =&lt;br /&gt;
Batch jobs are submitted with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job script contains options for Slurm in lines beginning with #SBATCH, as well as the commands you want to execute on the compute nodes. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=06:00:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can override options from the script on the command-line:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch --time=00:14:00 &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: --time=00:14:00 should start your job very quickly, see [[#Testing Your Jobs]].&lt;br /&gt;
&lt;br /&gt;
Note: &amp;lt;font color=&amp;quot;red&amp;quot;&amp;gt; Compute jobs must not read from or write to the global file systems (HOME and WORK) for temporary calculation files. &amp;lt;/font&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Use local storage /tmp in the ramdisk for small files or /scratch (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVME]]) for this purpose.&lt;br /&gt;
&lt;br /&gt;
To avoid using the central file systems for calculations, you often have to configure the program you are using to write temporary files elsewhere. &lt;br /&gt;
&lt;br /&gt;
If the program uses the current directory to look for files, copy the files to a temporary directory first - and copy/save the results of the calculation back at the end, otherwise your results will be deleted by the automated cleanup that runs after the job.&lt;br /&gt;
&lt;br /&gt;
The diskless nodes have a disk in RAM that can grow to at most half the size of the total RAM. Note that the files created plus the memory requirement of your job need to fit into the total memory. &lt;br /&gt;
&lt;br /&gt;
There are more diskless nodes than nodes with disks, so if your job can run on a diskless node, you should choose this option. &lt;br /&gt;
&lt;br /&gt;
Example job script with requesting 700GB disk space and copying files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
#SBATCH --gres=scratch:700 &lt;br /&gt;
mkdir -p $SCRATCH/mycalculation&lt;br /&gt;
# copy input file&lt;br /&gt;
cp $HOME/inputfiles/myinput.inp $SCRATCH/mycalculation&lt;br /&gt;
# switch directory&lt;br /&gt;
cd $SCRATCH/mycalculation&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
myprogram --input=$SCRATCH/mycalculation/myinput.inp&lt;br /&gt;
# calculation ends&lt;br /&gt;
# copy result&lt;br /&gt;
cp outfile.out results2.txt $HOME/resultdir/job12345&lt;br /&gt;
# clean up&lt;br /&gt;
rm myinput.inp outfile.out results2.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are many files or you don&#039;t know exactly what the output files are called, you can just tar the whole directory instead of using cp:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
tar -cvzf $HOME/resultdir/mycalculation-${SLURM_JOB_ID}.tgz $SCRATCH/mycalculation&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
$SLURM_JOB_ID contains the Slurm job ID during the run on the node and so makes sure the filename is unique. &lt;br /&gt;
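&lt;br /&gt;
To unpack such an archive later, e.g. on a login node (the job ID 12345 is illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
tar -xvzf $HOME/resultdir/mycalculation-12345.tgz&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;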
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:12px; background:#cef2e0;  text-align:left&amp;quot; |&lt;br /&gt;
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. &amp;lt;br&amp;gt; To find them, e.g. for lammps: &amp;lt;code&amp;gt; module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la&amp;lt;/code&amp;gt;.&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Resource Requests ==&lt;br /&gt;
&lt;br /&gt;
Important resource request options for the Slurm command sbatch are:&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !!  Slurm (sbatch)&lt;br /&gt;
|-&lt;br /&gt;
| #SBATCH|| Script directive&lt;br /&gt;
|-&lt;br /&gt;
| --time=&amp;lt;hh:mm:ss&amp;gt; (-t &amp;lt;hh:mm:ss&amp;gt;)|| Wall time limit&lt;br /&gt;
|-&lt;br /&gt;
| --job-name=&amp;lt;name&amp;gt;  (-J &amp;lt;name&amp;gt;)|| Job name&lt;br /&gt;
|-&lt;br /&gt;
| --nodes=&amp;lt;count&amp;gt; (-N &amp;lt;count&amp;gt;)|| Node count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks=&amp;lt;count&amp;gt; (-n &amp;lt;count&amp;gt;)|| Core count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks-per-node=&amp;lt;count&amp;gt;|| Process count per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem=&amp;lt;limit&amp;gt;|| Memory limit per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem-per-cpu=&amp;lt;limit&amp;gt;|| Memory limit per process&lt;br /&gt;
|-&lt;br /&gt;
| --gres=gpu:&amp;lt;count&amp;gt;|| GPU count (gres = &amp;quot;generic resource&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| --gres=scratch:&amp;lt;count&amp;gt; || Disk space of &amp;lt;count&amp;gt; GB per requested task&lt;br /&gt;
|-&lt;br /&gt;
| --exclusive|| Node exclusive job&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Nodes and Cores&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm provides a number of options to request nodes and cores.&lt;br /&gt;
Typically, using &amp;lt;code&amp;gt;--nodes=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--ntasks-per-node=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; should work for all your jobs. For single-core jobs, it is sufficient to use the option &amp;lt;code&amp;gt;--ntasks=1&amp;lt;/code&amp;gt;. Specifying only &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; may lead to Slurm trying to distribute tasks over more than one node even if you requested a small number of cores.&lt;br /&gt;
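&lt;br /&gt;
For example, a sketch of the two typical cases (values illustrative; 48 cores per node):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# a job spanning two full nodes (96 cores in total):&lt;br /&gt;
#SBATCH --nodes=2&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
&lt;br /&gt;
# or a single-core job:&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;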
&lt;br /&gt;
&#039;&#039;&#039;Memory&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Memory can be requested with either the option &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per node) or &amp;lt;code&amp;gt;--mem-per-cpu=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per process). When looking up the maximum available memory for a certain node type, subtract about 5 GB for the operating system. Specify the memory limit as a value-unit pair, for example 500mb or 8gb.&lt;br /&gt;
&lt;br /&gt;
In most cases it is preferable to use the &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; option.&lt;br /&gt;
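&lt;br /&gt;
For example, both of the following request 2 GB per core for a full 48-core node (values illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --mem=96gb           # total memory per node ...&lt;br /&gt;
&lt;br /&gt;
# ... or, equivalently, per process:&lt;br /&gt;
#SBATCH --mem-per-cpu=2gb&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;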
&lt;br /&gt;
&#039;&#039;&#039;GPUs&#039;&#039;&#039; and &#039;&#039;&#039;Scratch&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
These are requested as &amp;quot;generic resources&amp;quot; with &amp;lt;code&amp;gt;--gres=gpu:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--gres=scratch:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
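&lt;br /&gt;
For example (counts illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# request two GPUs:&lt;br /&gt;
#SBATCH --gres=gpu:2&lt;br /&gt;
&lt;br /&gt;
# or request 100 GB of local scratch per requested task:&lt;br /&gt;
#SBATCH --gres=scratch:100&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;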
&lt;br /&gt;
=== Default Values ===&lt;br /&gt;
Some values will be set by default if you do not specify them for your job. &lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !! Equivalent To !! Meaning&lt;br /&gt;
|-&lt;br /&gt;
|Runtime: || --time=02:00:00 || 2 hours&lt;br /&gt;
|-&lt;br /&gt;
|Nodes:  ||--nodes=1  ||one node&lt;br /&gt;
|-&lt;br /&gt;
|Tasks: || --ntasks-per-node=1  ||one task per node&lt;br /&gt;
|-&lt;br /&gt;
|Cores: || --cpus-per-task=1 ||one core per task&lt;br /&gt;
|-&lt;br /&gt;
|Memory: || --mem-per-cpu=2gb  ||2 GB per core&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==  &amp;quot;Exclusive User&amp;quot; Node Access Policy ==&lt;br /&gt;
&lt;br /&gt;
Nodes are exclusively allocated to one single user. However, multiple jobs (up to 48) from the same user can share a node.&lt;br /&gt;
&lt;br /&gt;
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each. &lt;br /&gt;
&lt;br /&gt;
The same applies to memory requests (see below).&lt;br /&gt;
&lt;br /&gt;
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently.&lt;br /&gt;
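&lt;br /&gt;
A sketch of a job sized so that two of them can share one 48-core node (values illustrative; see [[#Memory Limits]] for the usable memory per node type):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=24   # 24 divides 48: two such jobs fill a node&lt;br /&gt;
#SBATCH --mem=90gb             # likewise at most about half the usable memory&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;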
&lt;br /&gt;
== Memory Limits ==&lt;br /&gt;
&lt;br /&gt;
The wait time of a job also depends largely on the amount of requested resources and the available number of nodes providing this amount of resources. This must be taken into account in particular when requesting a certain amount of memory.&lt;br /&gt;
&lt;br /&gt;
For example a node with 192 GB RAM can only run jobs with up to 187 GB memory requested. The remaining amount is reserved for the operating system, system services and local file systems.&lt;br /&gt;
This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb or --ntasks-per-node=48 and --mem-per-cpu=4gb), the job cannot run on one of the &amp;quot;small&amp;quot; nodes but only on one of the &amp;quot;medium&amp;quot;, &amp;quot;large&amp;quot; or &amp;quot;fat&amp;quot; nodes. Unnecessarily limiting your jobs to a subset of nodes will increase your wait time and the wait time of others who actually need that amount of memory.&lt;br /&gt;
&lt;br /&gt;
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:&lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Node type !! Physical RAM on node !! Available RAM on node !! Number of suitable nodes &lt;br /&gt;
|-&lt;br /&gt;
|small|| 192 GB || 187 GB || 692 &lt;br /&gt;
|-&lt;br /&gt;
|medium|| 384 GB || 376 GB || 220&lt;br /&gt;
|-&lt;br /&gt;
|large|| 768 GB || 754 GB || 28&lt;br /&gt;
|-&lt;br /&gt;
|fat|| 1536 GB || 1510 GB || 8&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.&lt;br /&gt;
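&lt;br /&gt;
Putting the table into practice, a memory request that still fits every node type versus one that excludes the small nodes (sketch):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --mem=187gb    # can run on all node types&lt;br /&gt;
# #SBATCH --mem=192gb  # would exclude the 692 small nodes&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;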
&lt;br /&gt;
= Testing Your Jobs = &lt;br /&gt;
&lt;br /&gt;
Justus2 has three compute nodes reserved for jobs with a walltime under 15 minutes. You can test whether your jobs start properly just by specifying a short walltime, e.g. --time=00:14:00, and your job should start very quickly. &lt;br /&gt;
&lt;br /&gt;
= Monitoring Your Jobs =&lt;br /&gt;
&lt;br /&gt;
Test things first with few jobs before you roll out hundreds of jobs!&lt;br /&gt;
&lt;br /&gt;
Please check at minimum: &lt;br /&gt;
* Are my jobs using the number of cores I requested?&lt;br /&gt;
* Is my job using close to the amount of memory I requested? (see the seff example below)&lt;br /&gt;
&lt;br /&gt;
If you are running more than just 1-5 jobs: &lt;br /&gt;
&lt;br /&gt;
* Do my jobs run for at least 10 minutes?&lt;br /&gt;
* Do my jobs scale reasonably well? &amp;amp;rarr; [[Scaling]]&lt;br /&gt;
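&lt;br /&gt;
After a job has finished, &amp;lt;code&amp;gt;seff&amp;lt;/code&amp;gt; (see the command table above) summarizes how much of the requested cores and memory were actually used. The job ID is illustrative:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ seff 6260301&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;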
&lt;br /&gt;
== squeue ==&lt;br /&gt;
&lt;br /&gt;
After you have submitted a job, you can see it waiting in the queue using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
(also read the man page with &amp;lt;code&amp;gt;man squeue&amp;lt;/code&amp;gt; for more information on how to use the command)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
             6260301  standard r_60_b_2 ul_yxz1 PD       0:00      1 (AssocGrpMemRunMinutes)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output shows: &lt;br /&gt;
* JOBID: a unique number assigned to your job&lt;br /&gt;
* PARTITION: the cluster can be divided into different types of nodes&lt;br /&gt;
* NAME: the name you gave your job with the --job-name= option&lt;br /&gt;
* USER: your username&lt;br /&gt;
* ST: the state the job is in. R = running, PD = pending, CD = completed. See the man page for a full list of states. &lt;br /&gt;
* TIME: how long the job has been running&lt;br /&gt;
* NODES: how many nodes were requested&lt;br /&gt;
* NODELIST(REASON): either shows the node(s) the job is running on, or a reason why it hasn&#039;t started&lt;br /&gt;
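&lt;br /&gt;
Two commonly useful variants (standard squeue options, see the man page):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ squeue -u $USER    # show only your own jobs&lt;br /&gt;
$ squeue --start     # expected start times of pending jobs&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;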
&lt;br /&gt;
==scontrol==&lt;br /&gt;
&lt;br /&gt;
You can then show more info on one specific running job using the &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; command, e.g for the job with ID 6260301 listed above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show job 6260301&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for job with JobID 6260301&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show jobs&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for all your jobs&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol write batch_script 6260301 -&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays the job script of a running job. The &amp;quot;-&amp;quot; is a special filename which means &amp;quot;write to the terminal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
This does not work for already completed jobs. If enabled in Slurm, one can see those job scripts with &amp;lt;code&amp;gt;sacct -B -j 6260301&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Monitoring a Started Job ==&lt;br /&gt;
&lt;br /&gt;
After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; ssh n0603 &lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To get a live overview of the current resource usage on the node, use the command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;htop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
On the GPU nodes, the usage of the GPU(s) can be visualized using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;nvtop&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Further, we provide the tool jobreport, which generates plots of the resource usage over time for a given job.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
generates an HTML file with these plots in the current directory. For convenience, the report may alternatively be sent as an email using&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;jobreport -E max.mustermann@uni-ulm.de 6260301 &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Partitions =&lt;br /&gt;
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. amount of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users &#039;&#039;&#039;should not&#039;&#039;&#039; specify any partition &amp;quot;-p, --partition=&amp;lt;partition_name&amp;gt;&amp;quot; on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.&lt;br /&gt;
&lt;br /&gt;
= Job Priorities =&lt;br /&gt;
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors ]:&lt;br /&gt;
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.&lt;br /&gt;
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Notes:&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.  &lt;br /&gt;
&lt;br /&gt;
Fairshare does &#039;&#039;&#039;not&#039;&#039;&#039; introduce a fixed allotment, in that a user&#039;s ability to run new jobs is cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources in the long term, which would be unfair to groups that have not used their fair share for quite some time.&lt;br /&gt;
&lt;br /&gt;
Slurm features &#039;&#039;&#039;backfilling&#039;&#039;&#039;, meaning that the scheduler will start lower priority jobs if doing so does not delay the expected start time of &#039;&#039;&#039;any&#039;&#039;&#039; higher priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=161 video]&#039;&#039;&#039; gives an illustrative description of how backfilling works.&lt;br /&gt;
&lt;br /&gt;
In summary, an approximate model of Slurm&#039;s behavior for scheduling jobs is this:&lt;br /&gt;
&lt;br /&gt;
* Step 1: Can the job in position one (highest priority) start now?&lt;br /&gt;
* Step 2: If it can, remove it from the queue, start it, and continue with step 1.&lt;br /&gt;
* Step 3: If it cannot, look at the next job.&lt;br /&gt;
* Step 4: Can it start now, without delaying the start time of any job before it in the queue?&lt;br /&gt;
* Step 5: If it can, remove it from the queue, start it, recalculate which nodes are free, look at the next job, and continue with step 4.&lt;br /&gt;
* Step 6: If it cannot, look at the next job and continue with step 4.&lt;br /&gt;
&lt;br /&gt;
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.&lt;br /&gt;
&lt;br /&gt;
= Usage Limits/Throttling Policies =&lt;br /&gt;
&lt;br /&gt;
While the fairshare factor ensures a fair long-term balance of resource utilization between users and groups, there are additional usage limits that constrain the total resources allocated at any given time. This is to prevent individual users from monopolizing large fractions of the whole cluster system in the short term.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum walltime&#039;&#039;&#039; for a job is &#039;&#039;&#039;14 days&#039;&#039;&#039; (336 hours)&lt;br /&gt;
  --time=336:00:00 or --time=14-0&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum amount of cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920&#039;&#039;&#039; per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit for allocated memory also applies. If this limit is reached, new jobs will be queued (REASON: AssocGrpCpuLimit) but only allowed to run after resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
* The maximum amount of &#039;&#039;&#039;remaining allocated core-minutes&#039;&#039;&#039; per user is &#039;&#039;&#039;3300000&#039;&#039;&#039; (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 240 + 720 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time will decrease and eventually allow more jobs to start in a staggered way. This limit also &#039;&#039;&#039;correlates the maximum walltime and the number of cores that can be allocated&#039;&#039;&#039; for this amount of time. Thus, shorter walltimes for the jobs allow more resources to be allocated at a given time (but capped by the maximum amount of cores limit above); see the worked example after this list. Watch this &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=306 video]&#039;&#039;&#039; for an illustrative description. An equivalent limit applies for the remaining time of memory allocations, in which case jobs may be held back from starting with REASON AssocGrpMemRunMinutes.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum amount of GPUs&#039;&#039;&#039; allocated by running jobs is &#039;&#039;&#039;8&#039;&#039;&#039; per user. If this limit is reached, new jobs will be queued (REASON: AssocGrpGRES) but only allowed to run after GPU resources have been relinquished. &lt;br /&gt;
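&lt;br /&gt;
As a rough worked example of how the core limit and the core-minutes limit interact (numbers taken from the limits above): at the full cap of 1920 allocated cores, the budget of 3300000 remaining core-minutes allows at most about 1718 minutes (roughly 28.6 hours) of walltime, so allocating many cores at once implies short jobs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo $(( 3300000 / 1920 ))   # minutes of walltime at 1920 allocated cores&lt;br /&gt;
1718&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;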
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Usage limits are subject to change.&lt;br /&gt;
&lt;br /&gt;
= Efficiency / Use Cases =&lt;br /&gt;
&lt;br /&gt;
When we speak of poor job efficiency, we usually mean that hardware resources are wasted.&lt;br /&gt;
That means a similar overall result could have been achieved using fewer hardware resources, leaving those for other jobs and reducing the wait time for you and everyone else.&lt;br /&gt;
&lt;br /&gt;
The more resources you use, the more important efficiency becomes. If you run only 3-5 jobs that take under a day, go ahead and choose roughly sane defaults. If you submit hundreds or thousands of jobs - jobs that will accumulate years of CPU compute time (by using many CPU cores) - then think very carefully about your jobs and take some time to do trial runs until you are sure your calculations run well. &lt;br /&gt;
&lt;br /&gt;
Also consider these non-technical things:&lt;br /&gt;
* does the calculation give me all the results I need? &lt;br /&gt;
&amp;amp;rarr; rerunning calculations is extremely wasteful&lt;br /&gt;
* am I using the most efficient algorithm? &lt;br /&gt;
&amp;amp;rarr; using better algorithms can reduce the CPU time needed by an order of magnitude or two. And this can sometimes be something as simple as arranging loops in a more clever way or avoiding slow storage. &lt;br /&gt;
&lt;br /&gt;
Some simple causes for poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
* using $HOME or work directories for scratch space (expressly forbidden for $HOME, discouraged for work directories, except for multinode jobs that specifically need this for communication)&lt;br /&gt;
* poor choice of resources compared to the size of the nodes, leaving part of the node blocked but doing nothing:&lt;br /&gt;
** the number of cores of a node is not a multiple of --ntasks-per-node (see section [[#&amp;quot;Exclusive User&amp;quot; Node Access Policy]])&lt;br /&gt;
** too much (un-needed) memory or disk space requested&lt;br /&gt;
* more cores requested than are actually used by the job&lt;br /&gt;
* more cores used for a single MPI/OpenMP parallel computation than useful&lt;br /&gt;
* many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
* one-core jobs with very different run times (because of the single-user policy)&lt;br /&gt;
* not using the full node capacity&lt;br /&gt;
* using more cores than what your computational problem can be split into &amp;amp;rarr; see [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== User-exclusive Nodes on Justus2 ==&lt;br /&gt;
&lt;br /&gt;
For several reasons, Justus2 nodes are assigned to one user exclusively. That means that you are responsible for using the full compute node efficiently, as no jobs from other users can fill gaps!&lt;br /&gt;
&lt;br /&gt;
Several key points to accomplishing that:&lt;br /&gt;
&lt;br /&gt;
* Use divisors of the core number:&lt;br /&gt;
&lt;br /&gt;
The Justus2 nodes have 48 cores: two sockets with 24 cores each. Use divisors of 48 (e.g. 8) to be able to use all cores of the node. Be aware that when you choose 16, one job will be executed half on one CPU and half on the other, which might be suboptimal.&lt;br /&gt;
* Be aware of memory resources: &lt;br /&gt;
&lt;br /&gt;
When you request more memory per core than the &amp;quot;small&amp;quot; nodes on Justus2 have per core, your jobs will not be able to use all cores on the small nodes - or will have to wait for the rarer slots on the nodes with more memory. Try to estimate your memory requirements well, and if you need more than 3.8 GB per core, consider mixing in jobs with lower memory requirements to fill the nodes.&lt;br /&gt;
&lt;br /&gt;
== Many One or Few-Core Jobs ==&lt;br /&gt;
&lt;br /&gt;
Jobs that use only a few CPU cores can lead to very inefficient node usage:&lt;br /&gt;
&lt;br /&gt;
# You submit 1000 jobs, each running for ~30 s. Jobs need up to 30 s just to start and finish - a huge waste if the job itself only takes 30 seconds. Additionally, starting and finishing so many jobs in a short time puts strain on the scheduler Slurm, may cause severe problems for everyone, and clutters the Slurm job database. &lt;br /&gt;
# Many few-core jobs with very different run times: the jobs will start on many nodes, but at some point all the quicker jobs have finished their calculation and only a few remain. Because of the single-user policy on JUSTUS2, jobs of other users cannot fill in the gaps, and the rest of each node is idle. &lt;br /&gt;
&lt;br /&gt;
To address the problem, you can reduce the number of jobs and/or the number of nodes used.&lt;br /&gt;
&lt;br /&gt;
To limit the number of jobs, start many calculations within one job (addresses problems 1 and 2):&lt;br /&gt;
&lt;br /&gt;
* use a bash loop in your job script&lt;br /&gt;
* use the program GNU parallel to start the processes for you&lt;br /&gt;
&lt;br /&gt;
To limit only the number of nodes used:&lt;br /&gt;
&lt;br /&gt;
* use array jobs&lt;br /&gt;
&lt;br /&gt;
=== Bash Loop ===&lt;br /&gt;
&lt;br /&gt;
One advantage of this method is that you can run more processes than cores if your jobs are really short and do not use too much RAM, and in this way keep all cores busy even while many calculations are still starting up.&lt;br /&gt;
&lt;br /&gt;
It is of course even better if you can combine such short calculations in a way that, for 1000 calculations, the kernel does not need to start 1000 processes which in turn need to initialize everything. &lt;br /&gt;
&lt;br /&gt;
This example uses pgrep to count how many calculation processes are running: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
 &lt;br /&gt;
for i in {1..200}&lt;br /&gt;
do&lt;br /&gt;
  echo &amp;quot;starting up $i&amp;quot;&lt;br /&gt;
  bash my_calculation &amp;quot;$i&amp;quot; &amp;amp;&lt;br /&gt;
  # wait while more than 48 calculation processes are running&lt;br /&gt;
  while [ &amp;quot;$(pgrep -c -f my_calculation)&amp;quot; -gt 48 ]; do echo sleeping; sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The same, but tracking the PIDs (process IDs) of the started processes. This is more robust, but more difficult to read:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
running_jobs=()&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}; do&lt;br /&gt;
  echo &amp;quot;Starting job $i&amp;quot;&lt;br /&gt;
  sleep &amp;quot;$i&amp;quot; &amp;amp;  # stand-in for your real calculation&lt;br /&gt;
  running_jobs+=($!)  # Track PID&lt;br /&gt;
&lt;br /&gt;
  while [ &amp;quot;${#running_jobs[@]}&amp;quot; -ge 8 ]; do&lt;br /&gt;
    sleep 2 # adjust duration depending on your runtime&lt;br /&gt;
    echo running_jobs: ${running_jobs[@]} &lt;br /&gt;
    echo pid-out: $(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null | xargs)&lt;br /&gt;
    echo -----&lt;br /&gt;
    running_jobs=($(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null))  # Remove finished jobs&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
wait  # Ensure all jobs complete&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may not be able to just use an index number &amp;quot;i&amp;quot; to start many calculations. In this case, if there are not too many files, the for loop can be used to read in config files. Here is just the general idea for the for loop, without the surrounding job script:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
This loops over all files in the directory config-1980-03-01_1/ and passes each of them as an input file to &amp;quot;mycalculation&amp;quot; via a hypothetical &amp;quot;-config&amp;quot; option. Adding a date to the config directories (and outputs) would enable you to track different runs in your lab journal more easily.&lt;br /&gt;
&lt;br /&gt;
=== Gnu Parallel ===&lt;br /&gt;
&lt;br /&gt;
GNU Parallel is available on the HPC cluster and comes with its own set of examples; you can access them like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ module load system/parallel&lt;br /&gt;
$ cp $PARALLEL_EXA_DIR/parallel.slurm .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
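&lt;br /&gt;
As a minimal sketch, the manual bash loop from above can be replaced by a single GNU parallel call (my_calculation is the same hypothetical script; the -j option caps the number of concurrent runs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ parallel -j 48 bash my_calculation {} ::: {1..200}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;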
=== Array Jobs ===&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;$ sbatch -a 1-500%48 batch_script&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will submit 500 tasks to be executed, each indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 500, but will limit the number of simultaneously running tasks from this job array to 48 (the number of cores on a Justus2 node).&lt;br /&gt;
&lt;br /&gt;
The same can be done inside the job script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Number of cores per individual array task&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --array=1-500%48&lt;br /&gt;
#SBATCH --mem=3G&lt;br /&gt;
#SBATCH --time=1:10:00&lt;br /&gt;
#SBATCH --job-name=array_job&lt;br /&gt;
#SBATCH --output=array_job-%A_%a.out&lt;br /&gt;
#SBATCH --error=array_job-%A_%a.err&lt;br /&gt;
 &lt;br /&gt;
# Print the task id.&lt;br /&gt;
echo &amp;quot;My SLURM_ARRAY_TASK_ID: &amp;quot; $SLURM_ARRAY_TASK_ID&lt;br /&gt;
 &lt;br /&gt;
export TIMEFORMAT=%R&lt;br /&gt;
time bash mycalculation $SLURM_ARRAY_TASK_ID&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see: &lt;br /&gt;
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]]&lt;br /&gt;
* Schedmd documentations on Job Arrays: https://slurm.schedmd.com/job_array.html&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development&amp;diff=14870</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development&amp;diff=14870"/>
		<updated>2025-05-20T08:37:49Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Scripting Languages */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Compiling Software ==&lt;br /&gt;
&lt;br /&gt;
Overview of [[Development/General compiler usage|general compiler usage]]&lt;br /&gt;
&lt;br /&gt;
== Parallel Programming ==&lt;br /&gt;
Overview on [[Development/Parallel_Programming | parallel programming with OpenMP and MPI]].&lt;br /&gt;
&lt;br /&gt;
== Environment Modules ==&lt;br /&gt;
Compiler, libraries and development tools are provided as environment modules.&lt;br /&gt;
&lt;br /&gt;
Required reading to use: [[Environment Modules]]&lt;br /&gt;
&lt;br /&gt;
== Available Development Software ==&lt;br /&gt;
Visit [https://www.bwhpc.de/software.php https://www.bwhpc.de/software.php], select your cluster, and&lt;br /&gt;
* For compiler select &amp;lt;code&amp;gt;Category → compiler&amp;lt;/code&amp;gt;&lt;br /&gt;
* For MPI select &amp;lt;code&amp;gt;Category → mpi&amp;lt;/code&amp;gt;&lt;br /&gt;
* For libraries select &amp;lt;code&amp;gt;Category → lib&amp;lt;/code&amp;gt;&lt;br /&gt;
* For numerical libraries select &amp;lt;code&amp;gt;Category → numlib&amp;lt;/code&amp;gt;&lt;br /&gt;
* For further development tools select &amp;lt;code&amp;gt;Category → devel&amp;lt;/code&amp;gt; &lt;br /&gt;
&lt;br /&gt;
On a cluster use: &amp;lt;code&amp;gt;module avail &amp;lt;Category&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Documentation ==&lt;br /&gt;
Available documentation for environment modules: &lt;br /&gt;
* &amp;lt;code&amp;gt;module help&amp;lt;/code&amp;gt;&lt;br /&gt;
* examples in &amp;lt;code&amp;gt;$SOFTNAME_EXA_DIR&amp;lt;/code&amp;gt;&lt;br /&gt;
* additional documentation in this wiki&lt;br /&gt;
&lt;br /&gt;
== Documentation in the Wiki ==&lt;br /&gt;
Environment modules and tools for software development and parallel programming with additional documentation here in the wiki:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Integrated Development Environments ===&lt;br /&gt;
* [[Development/VS_Code|Visual Studio Code]]&lt;br /&gt;
&lt;br /&gt;
=== Compiler and Debugger ===&lt;br /&gt;
* [[Development/GCC|GCC]]&lt;br /&gt;
* [[Development/GDB|GDB]]&lt;br /&gt;
* [[Development/Intel_Compiler|Intel Compiler]]&lt;br /&gt;
&lt;br /&gt;
=== Development Tools ===&lt;br /&gt;
* [[Development/Score-P|Score-P]]:&amp;lt;br /&amp;gt;Tracing of OpenMP-, MPI- and GPU-parallel applications for Vampir and other performance analysis tools.&lt;br /&gt;
* [[Development/Vampir_and_VampirServer|Vampir and VampirServer]]:&amp;lt;br /&amp;gt;Highly scalable Performance Analysis of OpenMP-, MPI- and GPU-parallel applications.&lt;br /&gt;
* [[Development/Pahole|Pahole]]:&amp;lt;br /&amp;gt;Analyse data structures for cache-line alignment and (un)necessary holes that increase data structure size&lt;br /&gt;
* [[Development/Valgrind|Valgrind]]:&amp;lt;br /&amp;gt;Very valuable framework with multiple tools, e.g. to detect memory access errors&lt;br /&gt;
* Forge:&amp;lt;br /&amp;gt;Tools for debugging (arm DDT) and performance analysis (arm MAP)&lt;br /&gt;
&lt;br /&gt;
=== Libraries and Numerical Libraries ===&lt;br /&gt;
* [[Development/GSL|GSL]]&lt;br /&gt;
* [[Development/FFTW|FFTW]]&lt;br /&gt;
* [[Development/MKL|MKL]]&lt;br /&gt;
=== Scripting Languages ===&lt;br /&gt;
* [[Development/Julia|Julia]]&lt;br /&gt;
* [[Development/Python|Python]]&lt;br /&gt;
&lt;br /&gt;
=== Development Environments ===&lt;br /&gt;
* [[Development/Conda|Conda]]&lt;br /&gt;
* [[Development/Containers|Containers]]&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=14831</id>
		<title>Development/Julia</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=14831"/>
		<updated>2025-05-12T10:26:24Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Julia is a high-level, high-performance, dynamic programming language, being designed with scientific computing in mind. Parallel programming features, such as multi-threading are included in the core language, while there also exist packages leveraging the power of MPI and CUDA.&lt;br /&gt;
&lt;br /&gt;
There are no packages preinstalled besides the Julia language core, please use the Julia package manager to install any required Julia package.&lt;br /&gt;
&lt;br /&gt;
The Julia module on Justus loads suitable versions of CUDA and OpenMPI and the corresponding Julia packages CUDA.jl and MPI.jl will be automatically configured to use these libraries after being installed by the user. Any changes, either by loading modules with different MPI and/ or CUDA versions as well as using the ones that come as Julia artifacts are likely to lead to errors.&lt;br /&gt;
&lt;br /&gt;
== Availability ==&lt;br /&gt;
&lt;br /&gt;
On UniCluster3.0 and JUSTUS 2, Julia is available as a module. Check &amp;lt;code&amp;gt;module avail math/julia&amp;lt;/code&amp;gt; for the provided versions. In case there is no suitable version, you can install Julia to your home directory using the [https://julialang.org/install/ JuliaUP] installer.&lt;br /&gt;
&lt;br /&gt;
== Environments and Package Installation ==&lt;br /&gt;
&lt;br /&gt;
It is highly recommended to use a separate Julia environment for every project. If Julia is started with the option &amp;lt;code&amp;gt;--project=.&amp;lt;/code&amp;gt;, the current folder will be used as the environment, and the &amp;lt;code&amp;gt;Project.toml&amp;lt;/code&amp;gt; file containing the information on the installed packages will be created if not yet present. &lt;br /&gt;
&lt;br /&gt;
In an interactive Julia session, the [https://pkgdocs.julialang.org/v1/getting-started/#Basic-Usage package manager] is activated by entering &amp;lt;code&amp;gt;]&amp;lt;/code&amp;gt;. The most important commands are:&lt;br /&gt;
* &amp;lt;code&amp;gt;add PACKAGENAME&amp;lt;/code&amp;gt;: install the package PACKAGENAME in the current environment &lt;br /&gt;
* &amp;lt;code&amp;gt;instantiate&amp;lt;/code&amp;gt;: install all packages with dependencies as stated in Project.toml and Manifest.toml, e.g. after copying the existing code to the cluster&lt;br /&gt;
* &amp;lt;code&amp;gt;activate PATH_TO_ENV&amp;lt;/code&amp;gt;: use the environment located at the path &amp;lt;code&amp;gt;PATH_TO_ENV&amp;lt;/code&amp;gt; and initialize it, if necessary.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Interactive Example ==&lt;br /&gt;
&lt;br /&gt;
Load the Julia module and start an interactive REPL session with 8 threads, using the environment in the current directory:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load math/julia&lt;br /&gt;
$ julia -t 8 --project=.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Enter &#039;]&#039; to enter the package manager and install the package [https://github.com/JuliaPlots/UnicodePlots.jl?tab=readme-ov-file &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt;].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
add UnicodePlots&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Leave the package manager with the backspace key.&lt;br /&gt;
&lt;br /&gt;
Create a vector with 64 elements set to 0 and fill it, using all 8 threads, with the corresponding thread id number.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
vec = zeros(64)&lt;br /&gt;
Threads.@threads for i in eachindex(vec)&lt;br /&gt;
    vec[i]= Threads.threadid()&lt;br /&gt;
end&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Load the &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt; package and draw a scatter plot of the contents of &amp;lt;code&amp;gt;vec&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using UnicodePlots&lt;br /&gt;
scatterplot(vec)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Further documentation ==&lt;br /&gt;
&lt;br /&gt;
* [https://modernjuliaworkflows.org Modern Julia Workflows]: A collection of best practices &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/carstenbauer/JuliaHLRS24 Julia Workshop at HLRS]: The material of this workshop is in large part also valid for the Justus cluster (on Justus you only need the module math/julia).&lt;br /&gt;
&lt;br /&gt;
== Tips &amp;amp; Tricks ==&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Julia/Parallel_Programming|Parallel Programming]]&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=14829</id>
		<title>Development/Julia</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Development/Julia&amp;diff=14829"/>
		<updated>2025-05-12T10:25:34Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: Created page with &amp;quot;Julia is a high-level, high-performance, dynamic programming language, being designed with scientific computing in mind. Parallel programming features, such as multi-threading are included in the core language, while there also exist packages leveraging the power of MPI and CUDA.  There are no packages preinstalled besides the Julia language core, please use the Julia package manager to install any required Julia package.  The Julia module on Justus loads suitable versio...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Julia is a high-level, high-performance, dynamic programming language designed with scientific computing in mind. Parallel programming features such as multi-threading are included in the core language, and there are also packages leveraging the power of MPI and CUDA.&lt;br /&gt;
&lt;br /&gt;
There are no packages preinstalled besides the Julia language core; please use the Julia package manager to install any required Julia package.&lt;br /&gt;
&lt;br /&gt;
The Julia module on Justus loads suitable versions of CUDA and OpenMPI, and the corresponding Julia packages CUDA.jl and MPI.jl are automatically configured to use these libraries once they have been installed by the user. Any changes, such as loading modules with different MPI and/or CUDA versions or using the libraries that ship as Julia artifacts, are likely to lead to errors.&lt;br /&gt;
&lt;br /&gt;
== Availability ==&lt;br /&gt;
&lt;br /&gt;
On bwUniCluster 3.0 and JUSTUS 2, Julia is available as a module. Check `module avail math/julia` for the provided versions. If there is no suitable version, you can install Julia to your home directory using the [https://julialang.org/install/ JuliaUP] installer.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Environments and Package Installation ==&lt;br /&gt;
&lt;br /&gt;
It is highly recommended to use a separate Julia environment for every project. If Julia is started with the option &amp;lt;code&amp;gt;--project=.&amp;lt;/code&amp;gt;, the current folder is used as the environment, and the &amp;lt;code&amp;gt;Project.toml&amp;lt;/code&amp;gt; file containing the information on the installed packages is created, if not yet present. &lt;br /&gt;
&lt;br /&gt;
In an interactive Julia session, the [https://pkgdocs.julialang.org/v1/getting-started/#Basic-Usage package manager] is activated by entering &amp;lt;code&amp;gt;]&amp;lt;/code&amp;gt;. The most important commands are:&lt;br /&gt;
* &amp;lt;code&amp;gt;add PACKAGENAME&amp;lt;/code&amp;gt;: install package PACKAGENAME in the current environment &lt;br /&gt;
* &amp;lt;code&amp;gt;instantiate&amp;lt;/code&amp;gt;: install all packages with dependencies as stated in Project.toml and Manifest.toml, e.g. after copying the existing code to the cluster&lt;br /&gt;
* &amp;lt;code&amp;gt;activate PATH_TO_ENV&amp;lt;/code&amp;gt;: use the environment located at the path &amp;lt;code&amp;gt;PATH_TO_ENV&amp;lt;/code&amp;gt; and initialize it, if necessary.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Interactive Example ==&lt;br /&gt;
&lt;br /&gt;
Load Julia module and start interactive REPL session with 8 threads, using the environment in the current directory:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load math/julia&lt;br /&gt;
$ julia -t 8 --project=.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Type &#039;]&#039; to open the package manager and install the package [https://github.com/JuliaPlots/UnicodePlots.jl?tab=readme-ov-file &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt;].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
add UnicodePlots&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Leave the package manager with the backspace key.&lt;br /&gt;
&lt;br /&gt;
Create a vector with 64 elements set to 0 and fill it, using all 8 threads, with the corresponding thread id number.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
vec = zeros(64)&lt;br /&gt;
Threads.@threads for i in eachindex(vec)&lt;br /&gt;
    vec[i]= Threads.threadid()&lt;br /&gt;
end&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Load the &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt; package and draw a scatter plot of the contents of &amp;lt;code&amp;gt;vec&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using UnicodePlots&lt;br /&gt;
scatterplot(vec)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Further documentation ==&lt;br /&gt;
&lt;br /&gt;
* [https://modernjuliaworkflows.org Modern Julia Workflows]: A collection of best practices &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/carstenbauer/JuliaHLRS24 Julia Workshop at HLRS]: The material of this workshop is in large parts also valid for the Justus cluster (on Justus you only need the module math/julia).&lt;br /&gt;
&lt;br /&gt;
== Tips &amp;amp; Tricks ==&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Julia/Parallel_Programming|Parallel Programming]]&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Julia&amp;diff=14828</id>
		<title>JUSTUS2/Software/Julia</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Julia&amp;diff=14828"/>
		<updated>2025-05-12T09:54:30Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Softwarepage|math/julia}}&lt;br /&gt;
&lt;br /&gt;
{| width=600px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Description !! Content&lt;br /&gt;
|-&lt;br /&gt;
| module load&lt;br /&gt;
| math/julia&lt;br /&gt;
|-&lt;br /&gt;
| Availability&lt;br /&gt;
| [[bwUniCluster]] &amp;amp;#124; [[JUSTUS2]]&lt;br /&gt;
|-&lt;br /&gt;
| License&lt;br /&gt;
| MIT License&lt;br /&gt;
|-&lt;br /&gt;
|Citing&lt;br /&gt;
| [https://github.com/JuliaLang/julia/blob/master/CITATION.bib]&lt;br /&gt;
|-&lt;br /&gt;
| Links&lt;br /&gt;
| [https://julialang.org/ Project homepage] &amp;amp;#124; [https://docs.julialang.org/en/v1/ Documentation]&lt;br /&gt;
|-&lt;br /&gt;
| Graphical Interface&lt;br /&gt;
| No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Julia is a high-level, high-performance, dynamic programming language designed with scientific computing in mind. Parallel programming features such as multi-threading are included in the core language, and there are also packages leveraging the power of MPI and CUDA.&lt;br /&gt;
&lt;br /&gt;
There are no packages preinstalled besides the Julia language core; please use the Julia package manager to install any required Julia package.&lt;br /&gt;
&lt;br /&gt;
The Julia module on Justus loads suitable versions of CUDA and OpenMPI, and the corresponding Julia packages CUDA.jl and MPI.jl are automatically configured to use these libraries once they have been installed by the user. Any changes, such as loading modules with different MPI and/or CUDA versions or using the libraries that ship as Julia artifacts, are likely to lead to errors.&lt;br /&gt;
&lt;br /&gt;
== Environments and Package Installation ==&lt;br /&gt;
&lt;br /&gt;
It is highly recommended to use a separate Julia environment for every project. If Julia is started with the option &amp;lt;code&amp;gt;--project=.&amp;lt;/code&amp;gt;, the current folder is used as the environment, and the &amp;lt;code&amp;gt;Project.toml&amp;lt;/code&amp;gt; file containing the information on the installed packages is created, if not yet present. &lt;br /&gt;
&lt;br /&gt;
In an interactive Julia session, the [https://pkgdocs.julialang.org/v1/getting-started/#Basic-Usage package manager] is activated by entering &amp;lt;code&amp;gt;]&amp;lt;/code&amp;gt;. The most important commands are:&lt;br /&gt;
* &amp;lt;code&amp;gt;add PACKAGENAME&amp;lt;/code&amp;gt;: install package PACKAGENAME in the current environment &lt;br /&gt;
* &amp;lt;code&amp;gt;instantiate&amp;lt;/code&amp;gt;: install all packages with dependencies as stated in Project.toml and Manifest.toml, e.g. after copying the existing code to the cluster&lt;br /&gt;
* &amp;lt;code&amp;gt;activate PATH_TO_ENV&amp;lt;/code&amp;gt;: use the environment located at the path &amp;lt;code&amp;gt;PATH_TO_ENV&amp;lt;/code&amp;gt; and initialize it, if necessary.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Interactive Example ==&lt;br /&gt;
&lt;br /&gt;
Load Julia module and start interactive REPL session with 8 threads, using the environment in the current directory:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load math/julia&lt;br /&gt;
$ julia -t 8 --project=.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Type &#039;]&#039; to open the package manager and install the package [https://github.com/JuliaPlots/UnicodePlots.jl?tab=readme-ov-file &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt;].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
add UnicodePlots&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Leave the package manager with the backspace key.&lt;br /&gt;
&lt;br /&gt;
Create a vector with 64 elements set to 0 and fill it, using all 8 threads, with the corresponding thread id number.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
vec = zeros(64)&lt;br /&gt;
Threads.@threads for i in eachindex(vec)&lt;br /&gt;
    vec[i]= Threads.threadid()&lt;br /&gt;
end&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Load the &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt; package and draw a scatter plot of the contents of &amp;lt;code&amp;gt;vec&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using UnicodePlots&lt;br /&gt;
scatterplot(vec)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Further documentation ==&lt;br /&gt;
&lt;br /&gt;
* [https://modernjuliaworkflows.org Modern Julia Workflows]: A collection of best practices &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/carstenbauer/JuliaHLRS24 Julia Workshop at HLRS]: The material of this workshop is in large parts also valid for the Justus cluster (on Justus you only need the module math/julia).&lt;br /&gt;
&lt;br /&gt;
== Tips &amp;amp; Tricks ==&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Julia/Parallel_Programming|Parallel Programming]]&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Jupyter&amp;diff=14807</id>
		<title>BwUniCluster3.0/Jupyter</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster3.0/Jupyter&amp;diff=14807"/>
		<updated>2025-05-07T07:44:51Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Julia language */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Jupyter can be used as an alternative to accessing HPC resources via SSH. For this purpose only a web browser is required. Within the web interface, source code in different programming languages can be edited and executed. Furthermore, different user interfaces and terminals are available.&lt;br /&gt;
&lt;br /&gt;
= Short description of Jupyter =&lt;br /&gt;
&lt;br /&gt;
Jupyter is a web application whose central component is the &#039;&#039;&#039;Jupyter Notebook&#039;&#039;&#039;: a document which can contain formatted text, executable code sections and (interactive) visualizations (image, sound, video, 3D views).&lt;br /&gt;
&lt;br /&gt;
The Jupyter notebooks are executed in an interactive session on the compute nodes of the respective cluster. Access is via any modern web browser. Data is prepared and visualized on the server, so the raw data does not have to be transmitted over the network; only the resulting text, image, sound and video data is transmitted. The starting point of a Jupyter session is the HOME directory of the user on the respective cluster. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;JupyterLab&#039;&#039;&#039; is a modern user interface, within which one or more Jupyter notebooks can be opened, edited and executed. The individual notebooks can be arranged as tabs or tiled. JupyterLab is the standard user interface. Besides JupyterLab the classic notebook user interface is available, in which only one Jupyter notebook per browser tab can be opened at a time.&lt;br /&gt;
&lt;br /&gt;
A &#039;&#039;&#039;Jupyter Kernel&#039;&#039;&#039; describes a separate process, in which one Jupyter Notebook is executed at a time. Different kernels are available for different programming languages or language versions.&lt;br /&gt;
&lt;br /&gt;
Before a Jupyter session is started, the access authorization must be checked first. This is done via &#039;&#039;&#039;JupyterHub&#039;&#039;&#039;, where the resources are selected, for example the number of CPU cores, GPUs or the required main memory.&lt;br /&gt;
&lt;br /&gt;
A detailed documentation of the Jupyter project can be found at [https://jupyter.readthedocs.io https://jupyter.readthedocs.io].&lt;br /&gt;
&lt;br /&gt;
= Access requirements =&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
Access to Jupyter is &#039;&#039;&#039;limited to IP addresses from the BelWü network&#039;&#039;&#039;.&lt;br /&gt;
All home institutions of our current users are connected to BelWü, so if you are on your campus network (e.g. in your office or on the Campus WiFi) you should be able to connect to bwUniCluster 3.0 without restrictions.&lt;br /&gt;
If you are outside one of the BelWü networks (e.g. at home), a VPN connection to the home institution or a connection to an SSH jump host at the home institution must be established first.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
To use Jupyter on the HPC resources of SCC, the access requirements for [https://wiki.bwhpc.de/e/Registration/bwUniCluster bwUniCluster 3.0] apply. A [https://wiki.bwhpc.de/e/Registration/bwUniCluster registration] is required. Please note: you should have completed registration and tested your login once using [https://wiki.bwhpc.de/e/Registration/SSH Secure Shell (ssh)].&lt;br /&gt;
&lt;br /&gt;
= Login process =&lt;br /&gt;
&lt;br /&gt;
Login takes place at &lt;br /&gt;
* bwUniCluster 3.0: [https://uc3-jupyter.scc.kit.edu uc3-jupyter.scc.kit.edu]&lt;br /&gt;
* SDIL: [https://sdil-jupyter.scc.kit.edu sdil-jupyter.scc.kit.edu]&lt;br /&gt;
* HoreKa: [https://hk-jupyter.scc.kit.edu hk-jupyter.scc.kit.edu]&lt;br /&gt;
* HAICORE: [https://haicore-jupyter.scc.kit.edu haicore-jupyter.scc.kit.edu]&lt;br /&gt;
&lt;br /&gt;
For login, your username, your password and a 2-factor authentication are required.&lt;br /&gt;
&lt;br /&gt;
You will first find yourself on a landing page that also gives more information about the currently installed software versions.&lt;br /&gt;
By pressing the login button you will be redirected to the JupyterHub page. Click on Enter JupyterHub to start the login process. Select the organization (e.g. KIT) that has granted you access to the HPC system and press Continue. In the Login section that appears, enter your username and password (not the service password). &lt;br /&gt;
After pressing the Login button you will be redirected to the second factor query page. Enter the one-time password (e.g. from KIT Token or Google Authenticator App) and press Validate. Now you are done with the login process and can start selecting your computing resources.&lt;br /&gt;
&lt;br /&gt;
[[File:Jupyter_Anmeldung.gif|700px]]&lt;br /&gt;
&lt;br /&gt;
= Selection of the compute resources =&lt;br /&gt;
&lt;br /&gt;
The Jupyter notebooks are executed in an interactive session on the compute nodes of the HPC clusters. Just like accessing an interactive session with SSH, resource allocation is done by the Workload Manager Slurm. The selection of resources for Jupyter is realized via drop-down menus. Only jobs with a maximum of one node are possible.&lt;br /&gt;
&lt;br /&gt;
Available resources for selection are&lt;br /&gt;
&lt;br /&gt;
* Number of CPU cores&lt;br /&gt;
* Number of GPUs&lt;br /&gt;
* Runtime&lt;br /&gt;
* Partition/Queue&lt;br /&gt;
* Amount of main memory&lt;br /&gt;
&lt;br /&gt;
If Auto-Reservation is selected, the automatic Jupyter reservation of the cluster is enabled.&lt;br /&gt;
&lt;br /&gt;
In normal mode, the grayed-out fields contain reasonable presets, depending on the number of required CPU cores or GPUs respectively. The presets can be bypassed in advanced mode, where further options are available. &lt;br /&gt;
&lt;br /&gt;
Advanced Mode can be activated by clicking on the checkbox of the same name. The following additional options then become available:&lt;br /&gt;
&lt;br /&gt;
* Specification of a reservation&lt;br /&gt;
* LSDF mount option&lt;br /&gt;
* BEEOND mount option&lt;br /&gt;
&lt;br /&gt;
After the selection is made, the interactive job is started with the spawn button. As when requesting interactive compute resources with the `salloc` command, waiting times may occur. These usually grow with the amount of requested resources.&lt;br /&gt;
Even if the chosen resources are available immediately, the spawning process may take up to one minute.&lt;br /&gt;
&lt;br /&gt;
[[File:Ressources_neu.gif|500px]]&lt;br /&gt;
&lt;br /&gt;
Please note that in advanced mode, resource combinations can be selected that are impossible to meet. In this case, an error message will appear when the job is spawned.&lt;br /&gt;
&lt;br /&gt;
[[File:Jupyter_Falsche_ressourcen.gif|500px]]&lt;br /&gt;
&lt;br /&gt;
The spawning timeout is currently set to 10 minutes. With a normal workload of the HPC facility, this time is usually sufficient to get interactive resources.&lt;br /&gt;
&lt;br /&gt;
== Prioritized access to computing resources on bwUniCluster 3.0 ==&lt;br /&gt;
The use of Jupyter requires the immediate availability of computing resources since the JupyterLab server is started within an interactive Slurm session. To improve the availability of CPUs/GPUs for interactive supercomputing with Jupyter, &#039;&#039;&#039;automatic reservation&#039;&#039;&#039; for CPU (cpu_il) and GPU (gpu_a100_il) resources has been set up on &#039;&#039;&#039;bwUniCluster 3.0&#039;&#039;&#039;. It is active &#039;&#039;&#039;between 8am and 8pm&#039;&#039;&#039; every weekday. The reservation is automatically active if&lt;br /&gt;
&lt;br /&gt;
* no other reservation is set manually&lt;br /&gt;
* Auto-Reservation is enabled&lt;br /&gt;
&lt;br /&gt;
To give you a better overview of the currently available resources, a status indicator has been implemented. It appears when selecting the number of required CPUs/GPUs and shows whether a Jupyter job of the selected size can currently be started or not. Green means the selected CPU/GPU resources are available instantly. Yellow means only a single additional job of the selected size can be started. Red means there are no resources left that could satisfy the selected amount of CPUs/GPUs.&lt;br /&gt;
&lt;br /&gt;
If there are no more resources available within the reservation, you can try selecting a different amount of CPUs/GPUs or activate Advanced Mode and select a different partition. Availability can be estimated using &amp;lt;code&amp;gt;sinfo_t_idle&amp;lt;/code&amp;gt;, which is available when logging in via SSH.&lt;br /&gt;
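&lt;br /&gt;
A minimal sketch of such a check in an SSH session (the output depends on the current cluster load):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# show how many nodes are currently idle in each partition&lt;br /&gt;
$ sinfo_t_idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;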
&lt;br /&gt;
= JupyterLab =&lt;br /&gt;
&lt;br /&gt;
JupyterLab is the standard user interface. In the following only its essential functions are briefly introduced. A detailed documentation is available at &lt;br /&gt;
[https://jupyterlab.readthedocs.io https://jupyterlab.readthedocs.io].&lt;br /&gt;
&lt;br /&gt;
== Menu bar ==&lt;br /&gt;
&lt;br /&gt;
The menu bar at the upper edge of JupyterLab has higher-level menus that display the actions available in JupyterLab along with their shortcut keys. The default menus are:&lt;br /&gt;
&lt;br /&gt;
* File: Actions related to files and directories&lt;br /&gt;
* Edit: Actions related to editing documents and other activities&lt;br /&gt;
* View: actions that change the appearance of JupyterLab&lt;br /&gt;
* Run: Actions to execute code in various activities like notebooks and code consoles&lt;br /&gt;
* Kernel: Actions to manage kernels that are separate processes for executing code&lt;br /&gt;
* Tabs: a list of open documents and activities in the Dock Panel&lt;br /&gt;
* Settings: general settings and an editor for advanced settings&lt;br /&gt;
* Help: a list of help links to JupyterLab and the kernel&lt;br /&gt;
&lt;br /&gt;
== Left sidebar ==&lt;br /&gt;
&lt;br /&gt;
In the left sidebar there are foldable tabs. The most relevant ones are:&lt;br /&gt;
&lt;br /&gt;
* File browser: Switch to directories and open files with left mouse button, context menu with right mouse button&lt;br /&gt;
* Running kernels: Overview of running kernels&lt;br /&gt;
* Command overview&lt;br /&gt;
* Tab Overview&lt;br /&gt;
* Lmod software selection: Search and load/unload Lmod software modules&lt;br /&gt;
&lt;br /&gt;
== Main working area ==&lt;br /&gt;
The main working area in JupyterLab allows to arrange, resize and divide documents (notebooks, text files, etc.) and other activities (terminals, code consoles, etc.) in tabs. By holding down the left mouse button, the tabs can be grabbed and repositioned.&lt;br /&gt;
&lt;br /&gt;
In a new JupyterLab session the Launcher tab is opened first. It contains buttons for starting new notebooks, code consoles and other functions. When a notebook is open, a new Launcher tab can be started by pressing the plus symbol in the file browser tab of the left sidebar, by calling &#039;&#039;File &amp;gt; New Launcher&#039;&#039; in the upper menu bar or by the key combination &#039;&#039;Ctrl+Shift+L&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Classic Notebook ==&lt;br /&gt;
&lt;br /&gt;
The classic Jupyter Notebook user interface offers only one open Jupyter Notebook or terminal per browser tab. From the JupyterLab user interface the classic display can be reached in the menu bar under &#039;&#039;Help &amp;gt; Launch Classic Notebook&#039;&#039;. Clicking on the JupyterHub logo in the upper left corner will take you back to the JupyterLab interface.&lt;br /&gt;
&lt;br /&gt;
= Log out =&lt;br /&gt;
&lt;br /&gt;
You can log out from a running Jupyter session by calling &#039;&#039;File &amp;gt; Log Out&#039;&#039; in the upper menu bar. &lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;&amp;quot;&lt;br /&gt;
| style=&amp;quot;width:100%; border:1px solid #BBBBBB; background:#fff5fa; vertical-align:top; color:#000;&amp;quot; |&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;&amp;quot; |&lt;br /&gt;
|-&lt;br /&gt;
|{{Red}}| Attention&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
Please note that your interactive session will continue in the background!  &lt;br /&gt;
&amp;lt;!--For example, this affects your computing time quota on the ForHLR.--&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
As long as the interactive session is running, you can re-enter it at any time. Depending on the duration of your absence, it may be necessary to re-enter your one-time password and possibly KIT password.&lt;br /&gt;
&lt;br /&gt;
If you want to end the interactive session before it has reached its runtime, you can do so via the Hub Control Panel. Under &#039;&#039;File &amp;gt; Hub Control Panel&#039;&#039; in the upper menu bar, it is opened in a new browser tab. By pressing the &#039;&#039;Stop My Server&#039;&#039; button the session will be terminated. You can now log out using the &#039;&#039;Logout&#039;&#039; button in the upper right corner or start a new session directly using the &#039;&#039;Start My Server&#039;&#039; button, for example with a changed resource selection.&lt;br /&gt;
&lt;br /&gt;
[[File:logout_small.gif|750px]]&lt;br /&gt;
&lt;br /&gt;
= Selection of software =&lt;br /&gt;
&lt;br /&gt;
For the selection of the required Lmod software modules, the corresponding tab &#039;&#039;Softwares&#039;&#039; is available in the left sidebar. The list of available modules can be narrowed down by typing in the search field. The desired module is loaded by pressing the &#039;&#039;Load&#039;&#039; button. In the list of loaded modules you can remove them with the &#039;&#039;Unload&#039;&#039; button.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%; margin:4px 0 0 0; background:none; border-spacing: 0px;&amp;quot;&lt;br /&gt;
| style=&amp;quot;width:100%; border:1px solid #BBBBBB; background:#f5fffa; vertical-align:top; color:#000;&amp;quot; |&lt;br /&gt;
{| style=&amp;quot;width:100%; vertical-align:top; border:0px solid #BBBBBB; padding:4px;&amp;quot; |&lt;br /&gt;
|-&lt;br /&gt;
|{{Green}}| Note&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
On already opened Jupyter Notebooks, newly loaded software modules become active only after restarting the kernel (&#039;&#039;Kernel &amp;gt; Restart Kernel&#039;&#039; in the upper menu bar). Terminals must be closed and reopened.&lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:software_small.gif|750px]]&lt;br /&gt;
&lt;br /&gt;
== Software Stacks for Jupyter ==&lt;br /&gt;
Currently the following special Jupyter software stacks are available via Lmod:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;jupyter/minimal&amp;lt;/code&amp;gt;&lt;br /&gt;
*: Minimal installation of JupyterLab&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;jupyter/base&amp;lt;/code&amp;gt; &lt;br /&gt;
*: Basic installation of JupyterLab.&lt;br /&gt;
*: For a complete list of pre-installed packages, please refer to [https://uc2-jupyter.scc.kit.edu/software-modules/#pre-installed-software-packages this site].&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;jupyter/tensorflow&amp;lt;/code&amp;gt; (default at login, will be deprecated with the advent of bwUniCluster 3.0)&lt;br /&gt;
*: Preinstalled software packages for machine learning applications. Includes among others TensorFlow, Keras, Torch, Pandas, Matplotlib, SKLearn.&lt;br /&gt;
*: For a complete list of pre-installed packages and their respective version, please refer to [https://uc2-jupyter.scc.kit.edu/software-modules/#pre-installed-software-packages this site].&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;jupyter/ai&amp;lt;/code&amp;gt; (will be the new default at login for bwUniCluster 3.0; contains the latest software for AI workflows)&lt;br /&gt;
*: Preinstalled software packages for machine learning applications. Includes among others TensorFlow, Keras, Torch, Torchvision, Lightning, Pandas, Matplotlib, SKLearn.&lt;br /&gt;
*: For a complete list of pre-installed packages and their respective version, please refer to [https://uc2-jupyter.scc.kit.edu/software-modules/#pre-installed-software-packages this site].&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;jupyter/extensions&amp;lt;/code&amp;gt;&lt;br /&gt;
*: Same packages as tensorflow + extensions&lt;br /&gt;
&lt;br /&gt;
These software stacks can be used both when accessing the cluster via JupyterHub and, via module load, for conventional access via SSH.&lt;br /&gt;
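&lt;br /&gt;
For SSH access, the stacks are loaded like any other Lmod module. A minimal sketch:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# load the basic JupyterLab stack in an SSH session&lt;br /&gt;
$ module load jupyter/base&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;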
&lt;br /&gt;
A continuously updated list with the installed packages can be found on the corresponding subpage of the respective cluster:&lt;br /&gt;
&lt;br /&gt;
* bwUniCluster 3.0: [https://uc3-jupyter.scc.kit.edu/software-modules uc3-jupyter.scc.kit.edu/software-modules]&lt;br /&gt;
* HoreKa: [https://hk-jupyter.scc.kit.edu/software-modules hk-jupyter.scc.kit.edu/software-modules]&lt;br /&gt;
&lt;br /&gt;
= Installation of further software =&lt;br /&gt;
The software provided by the Lmod modules jupyter/minimal, jupyter/base and jupyter/tensorflow can easily be supplemented with additional Python packages. There are two procedures for this.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;User-Installation (not recommended)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;pip install --user &amp;lt;packageName&amp;gt; &amp;lt;/code&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
The additional packages are installed under $HOME/.local/lib/python3.11/site-packages/ which is part of PYTHONPATH.&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Virtual environments (recommended)&amp;lt;br&amp;gt;&lt;br /&gt;
The user can create and use virtual environments (cf. Virtual environments). Packages provided by the jupyter Lmod modules remain visible and usable.&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Virtual environments ==&lt;br /&gt;
&lt;br /&gt;
Python virtual environments allow you to use different versions of a package and to keep your local site-packages (accessible under &amp;lt;code&amp;gt;$PYTHONPATH&amp;lt;/code&amp;gt;) free from conflicts.&lt;br /&gt;
&lt;br /&gt;
=== Creation of virtual environment ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
python -m venv &amp;lt;myEnv&amp;gt;&lt;br /&gt;
source &amp;lt;myEnv&amp;gt;/bin/activate  &lt;br /&gt;
pip install &amp;lt;packageName&amp;gt;  &lt;br /&gt;
deactivate&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The additional packages are installed under &amp;lt;code&amp;gt;&amp;lt;myEnv&amp;gt;/lib/python3.11/site-packages/&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Usage of virtual environment ===&lt;br /&gt;
&lt;br /&gt;
In order to use the virtual environment, it has to be activated via &amp;lt;code&amp;gt;source &amp;lt;myEnv&amp;gt;/bin/activate&amp;lt;/code&amp;gt;. &amp;lt;code&amp;gt;PYTHONPATH&amp;lt;/code&amp;gt; is set accordingly. Deactivation of the venv is done via &amp;lt;code&amp;gt;deactivate&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Usage of virtual environment in JupyterLab ===&lt;br /&gt;
&lt;br /&gt;
To be able to use the virtual environments within JupyterLab, a corresponding kernel has to be installed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
source &amp;lt;myEnv&amp;gt;/bin/activate&lt;br /&gt;
python -m ipykernel install \&lt;br /&gt;
    --user \&lt;br /&gt;
    --name myEnv \&lt;br /&gt;
    --display-name &amp;quot;Python (myEnv)&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After installing the kernel (and possibly refreshing the browser window), a button named &amp;quot;myEnv&amp;quot; is available in JupyterLab. The kernel can also be selected from the drop-down menu.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention&#039;&#039;&#039;&lt;br /&gt;
The (Lmod) base module you used in the creation of the virtual environment must be loaded to use the venv. To be on the safe side, you can also use the system Python (&amp;lt;code&amp;gt;/usr/bin/python3.11&amp;lt;/code&amp;gt;) at creation time, which is available even without any &amp;lt;code&amp;gt;jupyter/{base,tensorflow}&amp;lt;/code&amp;gt; module loaded.&lt;br /&gt;
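&lt;br /&gt;
A minimal sketch of the safe variant (the environment name is only an example):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# create the venv with the system Python, independent of any loaded module&lt;br /&gt;
/usr/bin/python3.11 -m venv myEnv&lt;br /&gt;
source myEnv/bin/activate&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;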
&lt;br /&gt;
== Examples on Data processing, Machine Learning &amp;amp; Visualization ==&lt;br /&gt;
&lt;br /&gt;
The [https://github.com/hpcraink/workshop-parallel-jupyter/ workshop repository] provides usage notes and best practices for Python in general and for the packages NumPy, Pandas, SciKit and Dask, with running examples based on open data. It also explains how Jupyter interacts with pre-installed environments and with environments you provide yourself.&lt;br /&gt;
&lt;br /&gt;
== R language ==&lt;br /&gt;
&lt;br /&gt;
In order to use the R language in JupyterLab, the Lmod module &amp;lt;code&amp;gt;math/R&amp;lt;/code&amp;gt; has to be loaded (blue button in JupyterLab or &amp;lt;code&amp;gt;module add math/R&amp;lt;/code&amp;gt; in a terminal) and a corresponding kernel has to be installed.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
R&lt;br /&gt;
install.packages(&#039;IRkernel&#039;)&lt;br /&gt;
IRkernel::installspec()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After installing the kernel, a button named &amp;quot;R&amp;quot; is available in JupyterLab. The kernel can also be selected from the drop-down menu.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention:&#039;&#039;&#039;&lt;br /&gt;
Don&#039;t forget to load the &amp;lt;code&amp;gt;math/R&amp;lt;/code&amp;gt; module (blue button) before using the kernel.&lt;br /&gt;
&lt;br /&gt;
== Julia language ==&lt;br /&gt;
&lt;br /&gt;
In order to use the Julia language in JupyterLab, the Lmod module &amp;lt;code&amp;gt;math/julia/1.10.8&amp;lt;/code&amp;gt; has to be loaded (blue button in JupyterLab or &amp;lt;code&amp;gt;module load math/julia/1.10.8&amp;lt;/code&amp;gt; in a terminal). When the module is loaded from the JupyterLab UI, the corresponding kernel will be installed. If you use the terminal, this has to be done manually:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
julia&lt;br /&gt;
]&lt;br /&gt;
add IJulia&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After installing the kernel, a button named &amp;quot;Julia 1.10.8&amp;quot; is available in JupyterLab. The kernel can also be selected from the drop-down menu.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention:&#039;&#039;&#039;&lt;br /&gt;
Don&#039;t forget to load the &amp;lt;code&amp;gt;math/julia/1.10.8&amp;lt;/code&amp;gt; module (blue button) before using the kernel.&lt;br /&gt;
&lt;br /&gt;
= Jupyter Container Mode =&lt;br /&gt;
&lt;br /&gt;
The container integration on the JupyterHub is done via pyxis. In order to use it, the checkmark for container mode has to be set. Available options are:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt; --container-image:&amp;lt;/code&amp;gt; The container image to use. Corresponds to the pyxis option --container-image&lt;br /&gt;
* &amp;lt;code&amp;gt; --container-name:&amp;lt;/code&amp;gt; The name of the image to use. Corresponds to the pyxis option --container-name. Already downloaded containers in ~/.local/share/enroot can be started by simply specifying their name.&lt;br /&gt;
* &amp;lt;code&amp;gt; --container-mount-home:&amp;lt;/code&amp;gt; Corresponds to the pyxis option --container-mount-home. Mounts the home-directory&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention:&#039;&#039;&#039;&lt;br /&gt;
Make sure Python 3.11 and pip are installed in the container, or the notebook will not spawn. This can be checked by running the command python3.11 -m pip list inside your container.&lt;br /&gt;
&lt;br /&gt;
It is advised to create the container beforehand, e.g. via enroot, and to install all necessary software in it. For more information see [https://wiki.bwhpc.de/e/BwUniCluster2.0/Containers#SLURM_Integration here].&lt;br /&gt;
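&lt;br /&gt;
A minimal sketch for preparing a container with enroot (the image name is only an example):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# download an image and unpack it into a named enroot container&lt;br /&gt;
$ enroot import docker://ubuntu:22.04&lt;br /&gt;
$ enroot create --name mycontainer ubuntu+22.04.sqsh&lt;br /&gt;
# start it with write access to install the required software, e.g. Python 3.11 and pip&lt;br /&gt;
$ enroot start --root --rw mycontainer&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;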
&lt;br /&gt;
----&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=User:M_Carmesin/Planning_your_Jobs&amp;diff=14433</id>
		<title>User:M Carmesin/Planning your Jobs</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=User:M_Carmesin/Planning_your_Jobs&amp;diff=14433"/>
		<updated>2025-03-24T14:29:43Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* CPUs&lt;br /&gt;
&lt;br /&gt;
* Memory&lt;br /&gt;
&lt;br /&gt;
* File System&lt;br /&gt;
&lt;br /&gt;
* GPU&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=User:M_Carmesin/Planning_your_Jobs&amp;diff=14432</id>
		<title>User:M Carmesin/Planning your Jobs</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=User:M_Carmesin/Planning_your_Jobs&amp;diff=14432"/>
		<updated>2025-03-24T14:25:11Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: Created page with &amp;quot;==CPUs==  ==Memory==  ==File System==  ==GPU==&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==CPUs==&lt;br /&gt;
&lt;br /&gt;
==Memory==&lt;br /&gt;
&lt;br /&gt;
==File System==&lt;br /&gt;
&lt;br /&gt;
==GPU==&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14431</id>
		<title>JUSTUS2/Jobscripts: Running Your Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14431"/>
		<updated>2025-03-24T14:21:57Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Submitting Jobs on the bwForCluster JUSTUS 2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Justus2}}&lt;br /&gt;
&lt;br /&gt;
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. &lt;br /&gt;
&lt;br /&gt;
= JUSTUS 2 Slurm Howto =&lt;br /&gt;
&lt;br /&gt;
This page only presents a very basic introduction. &lt;br /&gt;
&lt;br /&gt;
Please see  the &#039;&#039;&#039;[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]&#039;&#039;&#039; for many more examples and commands for common tasks.&lt;br /&gt;
&lt;br /&gt;
= Slurm Command Overview =&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs &lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job&lt;br /&gt;
|- &lt;br /&gt;
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job&lt;br /&gt;
|- &lt;br /&gt;
| seff  || Shows the &amp;quot;job efficiency&amp;quot; of a job after it has finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Submitting Jobs on the bwForCluster JUSTUS 2 =&lt;br /&gt;
Batch jobs are submitted with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job script contains options for Slurm in lines beginning with #SBATCH, as well as the commands you want to execute on the compute nodes. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can override options from the script on the command-line:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch --time=03:00:00 &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: &amp;lt;font color=&amp;quot;red&amp;quot;&amp;gt; Compute jobs must not use the global file systems (HOME and WORK) for temporary/swap files of a calculation. &amp;lt;/font&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Use local storage for this purpose: /tmp in the ramdisk for small files, or /scratch (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVMe]]).&lt;br /&gt;
&lt;br /&gt;
To keep the calculation off the central file systems, you must often configure the program you are using to write temporary files elsewhere. &lt;br /&gt;
&lt;br /&gt;
If the program looks for files in the current directory, you must copy the input files to a temporary directory - and copy/save the results at the end, otherwise your results will be deleted by the automated cleanup after the job.&lt;br /&gt;
&lt;br /&gt;
The diskless nodes have a disk in RAM that can grow to at most half the size of the total RAM. Note that the files created plus the memory requirement of your job need to fit into the total memory. &lt;br /&gt;
&lt;br /&gt;
There are more diskless nodes than nodes with disks, so if your job can run on a diskless node, you should choose this option. &lt;br /&gt;
&lt;br /&gt;
Example job script with requesting 700GB disk space and copying files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
#SBATCH --gres=scratch:700 &lt;br /&gt;
# copy input file&lt;br /&gt;
cp $HOME/inputfiles/myinput.inp $SCRATCH&lt;br /&gt;
# switch directory&lt;br /&gt;
cd $SCRATCH&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
myprogram --input=$SCRATCH/myinput.inp&lt;br /&gt;
# calculation ends&lt;br /&gt;
# copy result&lt;br /&gt;
cp outfile.out results2.txt $HOME/resultdir/job12345&lt;br /&gt;
# clean up&lt;br /&gt;
rm myinput outfile.out results2.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:12px; background:#cef2e0;  text-align:left&amp;quot; |&lt;br /&gt;
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. &amp;lt;br&amp;gt; To find it e.g. for lammps: &amp;lt;code&amp;gt; module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la&amp;lt;/code&amp;gt;.&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Resource Requests ==&lt;br /&gt;
&lt;br /&gt;
Important resource request options for the Slurm command sbatch are:&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !!  Slurm (sbatch)&lt;br /&gt;
|-&lt;br /&gt;
| #SBATCH|| Script directive&lt;br /&gt;
|-&lt;br /&gt;
| --time=&amp;lt;hh:mm:ss&amp;gt; (-t &amp;lt;hh:mm:ss&amp;gt;)|| Wall time limit&lt;br /&gt;
|-&lt;br /&gt;
| --job-name=&amp;lt;name&amp;gt;  (-J &amp;lt;name&amp;gt;)|| Job name&lt;br /&gt;
|-&lt;br /&gt;
| --nodes=&amp;lt;count&amp;gt; (-N &amp;lt;count&amp;gt;)|| Node count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks=&amp;lt;count&amp;gt; (-n &amp;lt;count&amp;gt;)|| Core count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks-per-node=&amp;lt;count&amp;gt;|| Process count per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem=&amp;lt;limit&amp;gt;|| Memory limit per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem-per-cpu=&amp;lt;limit&amp;gt;|| Memory limit per process&lt;br /&gt;
|-&lt;br /&gt;
| --gres=gpu:&amp;lt;count&amp;gt;|| GPU count (gres = &amp;quot;generic resource&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| --gres=scratch:&amp;lt;count&amp;gt; || Disk space of &amp;lt;count&amp;gt; GB per requested task&lt;br /&gt;
|-&lt;br /&gt;
| --exclusive|| Node exclusive job&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Nodes and Cores&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm provides a number of options to request nodes and cores.&lt;br /&gt;
Typically, using &amp;lt;code&amp;gt;--nodes=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--ntasks-per-node=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; should work for all your jobs. For single core jobs, it is sufficient to use the option &amp;lt;code&amp;gt;--ntasks=1&amp;lt;/code&amp;gt;. Specifying only &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; may lead to Slurm trying to distribute tasks over more than one node even if you requested a small number of cores.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Memory&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Memory can be requested with either the option &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per node) or &amp;lt;code&amp;gt;--mem-per-cpu=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per process). When looking up the maximum available memory for a certain node type, subtract about 5 GB for the operating system. Specify the memory limit as a value-unit pair, for example 500mb or 8gb.&lt;br /&gt;
&lt;br /&gt;
In most cases it is preferable to use the &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPUs&#039;&#039;&#039; and &#039;&#039;&#039;Scratch&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
These are requested as &amp;quot;generic resources&amp;quot; with &amp;lt;code&amp;gt;--gres=gpu:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--gres=scratch:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
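&lt;br /&gt;
A minimal sketch of a job header combining the options above (the values are examples only and must match the node types of the cluster):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# one node, all 48 cores, memory per node, one GPU, two hours walltime&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --mem=90gb&lt;br /&gt;
#SBATCH --gres=gpu:1&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;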
&lt;br /&gt;
== Default Values ==&lt;br /&gt;
&lt;br /&gt;
Default values for jobs are:&lt;br /&gt;
&lt;br /&gt;
* Runtime: --time=02:00:00 (2 hours)&lt;br /&gt;
* Nodes: --nodes=1 (one node)&lt;br /&gt;
* Tasks: --ntasks-per-node=1 (one task per node)&lt;br /&gt;
* Cores: --cpus-per-task=1 (one core per task)&lt;br /&gt;
* Memory: --mem-per-cpu=2gb (2 GB per core)&lt;br /&gt;
&lt;br /&gt;
==  &amp;quot;Exclusive User&amp;quot; Node Access Policy ==&lt;br /&gt;
&lt;br /&gt;
Nodes are exclusively allocated to one single user. However, multiple jobs (up to 48) from the same user can share a node.&lt;br /&gt;
&lt;br /&gt;
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each. &lt;br /&gt;
&lt;br /&gt;
The same applies to memory requests (see below).&lt;br /&gt;
&lt;br /&gt;
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
== Memory Limits ==&lt;br /&gt;
&lt;br /&gt;
The wait time of a job also depends largely on the amount of requested resources and the available number of nodes providing this amount of resources. This must be taken into account in particular when requesting a certain amount of memory.&lt;br /&gt;
&lt;br /&gt;
For example, a node with 192 GB RAM can only run jobs with up to 187 GB of memory requested. The remaining amount is reserved for the operating system, system services and local file systems.&lt;br /&gt;
This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb or --ntasks-per-node=48 and --mem-per-cpu=4gb), the job cannot run on one of the 456 &amp;quot;small&amp;quot; nodes but only on one of the &amp;quot;medium&amp;quot;, &amp;quot;large&amp;quot; or &amp;quot;fat&amp;quot; nodes. Unnecessarily limiting your jobs to a sub-set of nodes will increase your wait time and the wait time of others who actually need that amount of memory.&lt;br /&gt;
&lt;br /&gt;
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:&lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Node type !! Physical RAM on node !! Available RAM on node !! Number of suitable nodes &lt;br /&gt;
|-&lt;br /&gt;
| small || 192 GB || 187 GB || 692 &lt;br /&gt;
|-&lt;br /&gt;
| medium || 384 GB || 376 GB || 220&lt;br /&gt;
|-&lt;br /&gt;
| large || 768 GB || 754 GB || 28&lt;br /&gt;
|-&lt;br /&gt;
| fat || 1536 GB || 1510 GB || 8&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.&lt;br /&gt;
&lt;br /&gt;
= Testing Your Jobs = &lt;br /&gt;
&lt;br /&gt;
JUSTUS 2 has three compute nodes reserved for jobs with a walltime under 15 minutes. You can test whether your jobs start properly by specifying a short walltime, e.g. --time=00:14:00; such jobs should start very quickly. &lt;br /&gt;
&lt;br /&gt;
= Monitoring Your Jobs =&lt;br /&gt;
== squeue ==&lt;br /&gt;
&lt;br /&gt;
After you submitted the job, you can see it waiting using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
(also read the man page with &amp;lt;code&amp;gt;man squeue&amp;lt;/code&amp;gt; for more information on how to use the command)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
             6260301  standard r_60_b_2 ul_yxz1 PD       0:00      1 (AssocGrpMemRunMinutes)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output shows: &lt;br /&gt;
* JOBID: a unique number your job gets&lt;br /&gt;
* PARTITION: the partition the job runs in; the cluster can be divided into different types of nodes&lt;br /&gt;
* NAME: the name you gave your job with the --job-name= option&lt;br /&gt;
* USER: your username&lt;br /&gt;
* ST: the state the job is in. R = running, PD = pending, CD = completed. See the man page for a full list of states. &lt;br /&gt;
* TIME: how long the job has been running&lt;br /&gt;
* NODES: how many nodes were requested&lt;br /&gt;
* NODELIST(REASON): shows either the node(s) the job is running on, or a reason why it has not started&lt;br /&gt;
&lt;br /&gt;
==scontrol==&lt;br /&gt;
&lt;br /&gt;
You can then show more info on one specific running job using the &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; command, e.g. for the job with ID 6260301 listed above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show job 6260301&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for the job with JobID 6260301&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show jobs&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for all your jobs&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol write batch_script 6260301 -&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays the job script of a running job. The &amp;quot;-&amp;quot; is a special filename which means &amp;quot;write to the terminal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
== Monitoring a Started Job ==&lt;br /&gt;
&lt;br /&gt;
After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; ssh n0603 &lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
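&lt;br /&gt;
Once on the node, standard tools can be used to watch your processes, e.g. (a minimal sketch):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; top -u $USER&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;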
&lt;br /&gt;
= Partitions =&lt;br /&gt;
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. amount of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure the most efficient usage of the available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users &#039;&#039;&#039;should not&#039;&#039;&#039; specify any partition &amp;quot;-p, --partition=&amp;lt;partition_name&amp;gt;&amp;quot; on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.&lt;br /&gt;
&lt;br /&gt;
= Job Priorities =&lt;br /&gt;
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors ]:&lt;br /&gt;
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.&lt;br /&gt;
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Notes:&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.  &lt;br /&gt;
&lt;br /&gt;
Fairshare does &#039;&#039;&#039;not&#039;&#039;&#039; introduce a fixed allotment, in which a user&#039;s ability to run new jobs is cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources in the long term, which would be unfair to groups that have not used their fair share for quite some time.&lt;br /&gt;
&lt;br /&gt;
Slurm features &#039;&#039;&#039;backfilling&#039;&#039;&#039;, meaning that the scheduler will start lower priority jobs if doing so does not delay the expected start time of &#039;&#039;&#039;any&#039;&#039;&#039; higher priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=161 video]&#039;&#039;&#039; gives an illustrative description of how backfilling works.&lt;br /&gt;
&lt;br /&gt;
In summary, an approximate model of Slurm&#039;s behavior for scheduling jobs is this:&lt;br /&gt;
&lt;br /&gt;
* Step 1: Can the job in position one (highest priority) start now?&lt;br /&gt;
* Step 2: If it can, remove it from the queue, start it and continue with step 1.&lt;br /&gt;
* Step 3: If it can not, look at next job.&lt;br /&gt;
* Step 4: Can it start now, without delaying the start time of any job before it in the queue?&lt;br /&gt;
* Step 5: If it can, remove it from the queue, start it, recalculate what nodes are free, look at next job and continue with step 4.&lt;br /&gt;
* Step 6: If it can not, look at next job, and continue with step 4.&lt;br /&gt;
&lt;br /&gt;
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.&lt;br /&gt;
&lt;br /&gt;
= Usage Limits/Throttling Policies =&lt;br /&gt;
&lt;br /&gt;
While the fairshare factor ensures fair long term balance of resource utilization between users and groups, there are additional usage limits that constrain the total cumulative resources at a given time. This is to prevent individual users from short term monopolizing large fractions of the whole cluster system.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum walltime&#039;&#039;&#039; for a job is &#039;&#039;&#039;14 days&#039;&#039;&#039; (336 hours)&lt;br /&gt;
  --time=336:00:00 or --time=14-0&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum amount of cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920&#039;&#039;&#039; per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit for allocated memory also applies. If this limit is reached, new jobs will be queued (with REASON: AssocGrpCpuLimit) but only allowed to run after resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
* The maximum amount of &#039;&#039;&#039;remaining allocated core-minutes&#039;&#039;&#039; per user is &#039;&#039;&#039;3300000&#039;&#039;&#039; (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time will decrease and eventually allow more jobs to start in a staggered way. This limit also &#039;&#039;&#039;correlates the maximum walltime and amount of cores that can be allocated&#039;&#039;&#039; for this amount of time. Thus, shorter walltimes for the jobs allow more resources to be allocated at a given time (but capped by the maximum amount of cores limit above). Watch this &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=306 video]&#039;&#039;&#039; for an illustrative description. An equivalent limit applies for remaining time of memory allocation in which case jobs may be held back from starting with REASON AssocGrpMemRunMinutes.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum amount of GPUs&#039;&#039;&#039; allocated by running jobs is &#039;&#039;&#039;8&#039;&#039;&#039; per user. If this limit it reached new jobs will be queued (with REASON: AssocGrpGRES) but only allowed to run after GPU resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Usage limits are subject to change.&lt;br /&gt;
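&lt;br /&gt;
The remaining core-minutes of your own running jobs can be estimated from squeue output. The following is a minimal sketch, assuming the %C (allocated CPUs) and %L (time left) output fields are sufficient; jobs without a time limit would need extra handling:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Rough estimate of the remaining allocated core-minutes of your running jobs.&lt;br /&gt;
# %C = allocated CPUs, %L = remaining walltime as [dd-]hh:mm:ss&lt;br /&gt;
squeue -h -u &amp;quot;$USER&amp;quot; -t RUNNING -o &amp;quot;%C %L&amp;quot; | awk &#039;&lt;br /&gt;
{&lt;br /&gt;
  n = split($2, t, /[-:]/)      # split [dd-]hh:mm:ss into components&lt;br /&gt;
  d = (n == 4) ? t[1] : 0      # days, if present&lt;br /&gt;
  h = (n &amp;gt;= 3) ? t[n-2] : 0    # hours, if present&lt;br /&gt;
  m = t[n-1]                   # the last two fields are mm:ss&lt;br /&gt;
  total += $1 * (d*24*60 + h*60 + m)&lt;br /&gt;
}&lt;br /&gt;
END { printf &amp;quot;approx. remaining core-minutes: %d\n&amp;quot;, total }&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;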
&lt;br /&gt;
= Considerations on Efficiency / Special Use Cases =&lt;br /&gt;
&lt;br /&gt;
When we speak of poor job efficiency, we usually mean that hardware resources are wasted.&lt;br /&gt;
That means a similar overall result could have been achieved using fewer hardware resources, leaving those free for other jobs and reducing the wait time for you and everyone else.&lt;br /&gt;
&lt;br /&gt;
Some simple causes of poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
* a poor choice of resources compared to the size of the nodes leaves part of the node blocked but doing nothing:&lt;br /&gt;
** --ntasks-per-node does not evenly divide the number of cores of a node (see section [[#&amp;quot;Exclusive User&amp;quot; Node Access Policy]])&lt;br /&gt;
** too much (un-needed) memory or disk space is requested&lt;br /&gt;
* more cores requested than are actually used by the job&lt;br /&gt;
* more cores used for a single MPI/OpenMP parallel computation than is useful&lt;br /&gt;
* many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
* one-core jobs with very different run times (because of the single-user policy)&lt;br /&gt;
&lt;br /&gt;
== Many One or Few-Core Jobs ==&lt;br /&gt;
&lt;br /&gt;
Jobs that use only a few CPU cores can lead to very inefficient node usage:&lt;br /&gt;
&lt;br /&gt;
# You submit 1000 jobs, each running for ~30 s. A job needs up to 30 s just to start and finish - a huge waste if the calculation itself only takes 30 seconds. Additionally, starting and finishing so many jobs in a short time puts strain on the scheduler Slurm, may cause severe problems for everyone, and clutters the Slurm job database. &lt;br /&gt;
# Many few-core jobs with very different run times: the jobs will start on many nodes, but at some point all quicker jobs have finished their calculation and only a few remain. Because of the single-user policy on JUSTUS 2, jobs of other users cannot fill in the gaps and the rest of the node sits idle. &lt;br /&gt;
&lt;br /&gt;
To address the problem, you can reduce the number of jobs and/or the number of nodes used.&lt;br /&gt;
&lt;br /&gt;
To limit the number of jobs, start many calculations within one job (addresses problems 1 and 2):&lt;br /&gt;
&lt;br /&gt;
* use a bash loop in your job script&lt;br /&gt;
* use the program GNU parallel to start the processes for you&lt;br /&gt;
&lt;br /&gt;
To only limit the number of nodes used:&lt;br /&gt;
&lt;br /&gt;
* use array jobs&lt;br /&gt;
&lt;br /&gt;
=== Bash Loop ===&lt;br /&gt;
&lt;br /&gt;
One advantage of this method is that you can run more processes than cores if your jobs are really short and do not use too much RAM; this keeps all cores busy even while many calculations are still starting up.&lt;br /&gt;
&lt;br /&gt;
It is of course even better if you can combine such short calculations in a way that, for 1000 calculations, the kernel does not need to start 1000 processes which in turn each need to initialize everything. &lt;br /&gt;
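&lt;br /&gt;
A minimal sketch of this idea: if the calculation can be expressed as a shell function, the loop calls the function in-process instead of starting a new bash process per input (my_calculation here is a stand-in for your own code):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# A shell function avoids one process start-up per calculation.&lt;br /&gt;
my_calculation() {&lt;br /&gt;
  # ... do the actual work for input &amp;quot;$1&amp;quot; here ...&lt;br /&gt;
  echo &amp;quot;result for $1&amp;quot;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
for i in {1..1000}; do&lt;br /&gt;
  my_calculation &amp;quot;$i&amp;quot; &amp;gt;&amp;gt; results.txt&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;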
&lt;br /&gt;
This example uses pgrep to count how many jobs are running: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
 &lt;br /&gt;
for i in {1..200}&lt;br /&gt;
do&lt;br /&gt;
  echo &amp;quot;starting up $i&amp;quot;&lt;br /&gt;
  bash my_calculation &amp;quot;$i&amp;quot; &amp;amp;&lt;br /&gt;
  # throttle: wait while 48 or more instances are still running&lt;br /&gt;
  while [ &amp;quot;$(pgrep -c -f my_calculation)&amp;quot; -ge 48 ]; do echo sleeping; sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The same, but tracking the PIDs (process IDs) of the started processes. This is more robust, but more difficult to read:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
running_jobs=()&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}; do&lt;br /&gt;
  echo &amp;quot;Starting job $i&amp;quot;&lt;br /&gt;
  sleep &amp;quot;$i&amp;quot; &amp;amp;  # stand-in for your real calculation&lt;br /&gt;
  running_jobs+=($!)  # Track PID&lt;br /&gt;
&lt;br /&gt;
  while [ &amp;quot;${#running_jobs[@]}&amp;quot; -ge 8 ]; do&lt;br /&gt;
    sleep 2 # adjust duration depending on your runtime&lt;br /&gt;
    echo running_jobs: ${running_jobs[@]} &lt;br /&gt;
    echo pid-out: $(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null | xargs)&lt;br /&gt;
    echo -----&lt;br /&gt;
    running_jobs=($(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null))  # Remove finished jobs&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
wait  # Ensure all jobs complete&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may not be able to just use an index number &amp;quot;i&amp;quot; to start many calculations. In this case, for a manageable number of files, the for loop can be used to read in config files. Here is just the general idea of the for loop, without the surrounding job script:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
This loops over all files in the directory config-1980-03-01_1/ and passes each one as an input file to &amp;quot;mycalculation&amp;quot; via a hypothetical &amp;quot;-config&amp;quot; option. Adding a date to the config directories (and outputs) makes it easier to track different runs in your lab journal.&lt;br /&gt;
&lt;br /&gt;
=== GNU Parallel ===&lt;br /&gt;
&lt;br /&gt;
GNU Parallel is available on the HPC cluster and comes with its own set of examples. You can access them like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ module load system/parallel&lt;br /&gt;
$ cp $PARALLEL_EXA_DIR/parallel.slurm .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
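&lt;br /&gt;
As a minimal illustration of the idea (the parallel.slurm example that ships with the module is more complete), GNU parallel can keep a fixed number of processes running; my_calculation is the stand-in script from the loop example above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# run &amp;quot;bash my_calculation N&amp;quot; for N = 1..200, at most 48 at a time&lt;br /&gt;
module load system/parallel&lt;br /&gt;
parallel -j 48 bash my_calculation {} ::: {1..200}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;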
=== Array Jobs ===&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;$ sbatch -a 1-500%48 batch_script&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will submit 500 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 500, but will limit the number of simultaneously running tasks from this job array to 48 (the number of cores on a JUSTUS 2 node).&lt;br /&gt;
&lt;br /&gt;
The same can be done inside the job script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Number of cores per individual array task&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --array=1-500%48&lt;br /&gt;
#SBATCH --mem=3G&lt;br /&gt;
#SBATCH --time=1:10:00&lt;br /&gt;
#SBATCH --job-name=array_job&lt;br /&gt;
#SBATCH --output=array_job-%A_%a.out&lt;br /&gt;
#SBATCH --error=array_job-%A_%a.err&lt;br /&gt;
 &lt;br /&gt;
# Print the task id.&lt;br /&gt;
echo &amp;quot;My SLURM_ARRAY_TASK_ID: &amp;quot; $SLURM_ARRAY_TASK_ID&lt;br /&gt;
 &lt;br /&gt;
export TIMEFORMAT=%R  # let the &amp;quot;time&amp;quot; builtin print only the elapsed real time&lt;br /&gt;
time bash mycalculation $SLURM_ARRAY_TASK_ID&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
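&lt;br /&gt;
If your inputs are files rather than numbers, the array task ID can be mapped to a file name. A minimal sketch, reusing the hypothetical config directory and program from the loop example above and assuming a stable file ordering:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# pick the N-th config file for array task N&lt;br /&gt;
config=$(ls config-1980-03-01_1/ | sed -n &amp;quot;${SLURM_ARRAY_TASK_ID}p&amp;quot;)&lt;br /&gt;
mycalculation -config &amp;quot;config-1980-03-01_1/$config&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;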
&lt;br /&gt;
Also see: &lt;br /&gt;
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]]&lt;br /&gt;
* Schedmd documentations on Job Arrays: https://slurm.schedmd.com/job_array.html&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14335</id>
		<title>JUSTUS2/Jobscripts: Running Your Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14335"/>
		<updated>2025-03-12T12:09:37Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Many One-Core Jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Justus2}}&lt;br /&gt;
&lt;br /&gt;
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. &lt;br /&gt;
&lt;br /&gt;
= JUSTUS 2 Slurm Howto =&lt;br /&gt;
&lt;br /&gt;
This page presents only a very basic introduction. &lt;br /&gt;
&lt;br /&gt;
Please see  the &#039;&#039;&#039;[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]&#039;&#039;&#039; for many more examples and commands for common tasks.&lt;br /&gt;
&lt;br /&gt;
= Slurm Command Overview =&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs &lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job&lt;br /&gt;
|- &lt;br /&gt;
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job&lt;br /&gt;
|- &lt;br /&gt;
| seff  || Shows the &amp;quot;job efficiency&amp;quot; of a job after it has finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Submitting Jobs on the bwForCluster JUSTUS 2 =&lt;br /&gt;
Batch jobs are submitted with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job script contains options for Slurm in lines beginning with #SBATCH as well as your commands which you want to execute on the compute nodes. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can override options from the script on the command-line:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch --time=03:00:00 &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: &amp;lt;font color=&amp;quot;red&amp;quot;&amp;gt; Compute jobs must not read/write calculation scratch files on the global file systems. &amp;lt;/font&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Use local storage for this purpose: /tmp in the ramdisk for small files, or /scratch (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVMe]]).&lt;br /&gt;
&lt;br /&gt;
To keep calculations off the central file system, you often must configure the program you are using to write temporary files elsewhere. &lt;br /&gt;
&lt;br /&gt;
If the program looks for files in the current directory, you must copy your files to a temporary directory - and copy/save the results of the calculation at the end, otherwise your results will be deleted by the automated cleanup that happens after the job.&lt;br /&gt;
&lt;br /&gt;
The diskless nodes have a disk in RAM, which can take up at most half of the total RAM. Note that the files created plus the memory requirement of your job need to fit into the total memory. &lt;br /&gt;
&lt;br /&gt;
There are more diskless nodes than nodes with disks, so if your job can run on a diskless node, you should choose this option. &lt;br /&gt;
&lt;br /&gt;
Example job script with requesting 700GB disk space and copying files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
#SBATCH --gres=scratch:700 &lt;br /&gt;
# copy input file&lt;br /&gt;
cp $HOME/inputfiles/myinput.inp $SCRATCH&lt;br /&gt;
# switch directory&lt;br /&gt;
cd $SCRATCH&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
myprogram --input=$SCRATCH/myinput.inp&lt;br /&gt;
# calculation ends&lt;br /&gt;
# copy result&lt;br /&gt;
cp outfile.out results2.txt $HOME/resultdir/job12345&lt;br /&gt;
# clean up&lt;br /&gt;
rm myinput.inp outfile.out results2.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:12px; background:#cef2e0;  text-align:left&amp;quot; |&lt;br /&gt;
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. &amp;lt;br&amp;gt; To find it e.g. for lammps: &amp;lt;code&amp;gt; module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la&amp;lt;/code&amp;gt;.&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Resource Requests ==&lt;br /&gt;
&lt;br /&gt;
Important resource request options for the Slurm command sbatch are:&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !!  Slurm (sbatch)&lt;br /&gt;
|-&lt;br /&gt;
| #SBATCH|| Script directive&lt;br /&gt;
|-&lt;br /&gt;
| --time=&amp;lt;hh:mm:ss&amp;gt; (-t &amp;lt;hh:mm:ss&amp;gt;)|| Wall time limit&lt;br /&gt;
|-&lt;br /&gt;
| --job-name=&amp;lt;name&amp;gt;  (-J &amp;lt;name&amp;gt;)|| Job name&lt;br /&gt;
|-&lt;br /&gt;
| --nodes=&amp;lt;count&amp;gt; (-N &amp;lt;count&amp;gt;)|| Node count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks=&amp;lt;count&amp;gt; (-n &amp;lt;count&amp;gt;)|| Core count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks-per-node=&amp;lt;count&amp;gt;|| Process count per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem=&amp;lt;limit&amp;gt;|| Memory limit per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem-per-cpu=&amp;lt;limit&amp;gt;|| Memory limit per process&lt;br /&gt;
|-&lt;br /&gt;
| --gres=gpu:&amp;lt;count&amp;gt;|| GPU count (gres = &amp;quot;generic resource&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| --gres=scratch:&amp;lt;count&amp;gt; || Disk space of &amp;lt;count&amp;gt; GB per requested task&lt;br /&gt;
|-&lt;br /&gt;
| --exclusive|| Node exclusive job&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Nodes and Cores&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm provides a number of options to request nodes and cores.&lt;br /&gt;
Typically, using &amp;lt;code&amp;gt;--nodes=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--ntasks-per-node=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; should work for all your jobs. For single core jobs, it is sufficient to use the option &amp;lt;code&amp;gt;--ntasks=1&amp;lt;/code&amp;gt;. Specifying only &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; may lead to Slurm distributing the tasks over more than one node even if you requested a small number of cores.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Memory&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Memory can be requested with either the option &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per node) or &amp;lt;code&amp;gt;--mem-per-cpu=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per process). When looking up the maximum available memory for a certain node type, subtract about 5 GB for the operating system. Specify the memory limit as a value-unit pair, for example 500mb or 8gb.&lt;br /&gt;
&lt;br /&gt;
In most cases it is preferable to use the &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; option.&lt;br /&gt;
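&lt;br /&gt;
For example, the following two request styles allocate the same total memory on one node; the counts are arbitrary and only illustrate the arithmetic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# variant 1: memory per node&lt;br /&gt;
#SBATCH --ntasks-per-node=4&lt;br /&gt;
#SBATCH --mem=8gb           # 8 GB for the whole node&lt;br /&gt;
&lt;br /&gt;
# variant 2 (use one variant or the other, not both):&lt;br /&gt;
#SBATCH --ntasks-per-node=4&lt;br /&gt;
#SBATCH --mem-per-cpu=2gb   # 4 x 2 GB = 8 GB for the whole node&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;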
&lt;br /&gt;
&#039;&#039;&#039;GPUs&#039;&#039;&#039; and &#039;&#039;&#039;Scratch&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
These are requested as &amp;quot;generic resources&amp;quot; with &amp;lt;code&amp;gt;--gres=gpu:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--gres=scratch:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
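&lt;br /&gt;
If a job needs both, Slurm accepts a comma-separated list in a single --gres option. A sketch with arbitrary counts:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --gres=gpu:2,scratch:100   # 2 GPUs plus 100 GB of local scratch&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;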
&lt;br /&gt;
== Default Values ==&lt;br /&gt;
&lt;br /&gt;
Default values for jobs are:&lt;br /&gt;
&lt;br /&gt;
* Runtime: --time=02:00:00 (2 hours)&lt;br /&gt;
* Nodes: --nodes=1 (one node)&lt;br /&gt;
* Tasks: --ntasks-per-node=1 (one task per node)&lt;br /&gt;
* Cores: --cpus-per-task=1 (one core per task)&lt;br /&gt;
* Memory: --mem-per-cpu=2gb (2 GB per core)&lt;br /&gt;
&lt;br /&gt;
==  &amp;quot;Exclusive User&amp;quot; Node Access Policy ==&lt;br /&gt;
&lt;br /&gt;
Nodes are exclusively allocated to one single user. However, multiple jobs (up to 48) from the same user can share a node.&lt;br /&gt;
&lt;br /&gt;
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each. &lt;br /&gt;
&lt;br /&gt;
The same applies to memory requests (see below).&lt;br /&gt;
&lt;br /&gt;
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
== Memory Limits ==&lt;br /&gt;
&lt;br /&gt;
The wait time of a job also depends largely on the amount of requested resources and the available number of nodes providing this amount of resources. This must be taken into account in particular when requesting a certain amount of memory.&lt;br /&gt;
&lt;br /&gt;
For example a node with 192 GB RAM can only run jobs with up to 187 GB memory requested. The remaining amount is reserved for the operating system, system services and local file systems.&lt;br /&gt;
This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb, or --ntasks-per-node=48 and --mem-per-cpu=4gb), the job cannot run on one of the 456 &amp;quot;small&amp;quot; nodes but only on one of the &amp;quot;medium&amp;quot;, &amp;quot;large&amp;quot; or &amp;quot;fat&amp;quot; nodes. Unnecessarily limiting your jobs to a subset of nodes will increase your wait time and the wait time of others who actually need that amount of memory.&lt;br /&gt;
&lt;br /&gt;
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:&lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Node type !! Physical RAM on node !! Available RAM on node !! Number of suitable nodes &lt;br /&gt;
|-&lt;br /&gt;
| small || 192 GB || 187 GB || 692 &lt;br /&gt;
|-&lt;br /&gt;
| medium || 384 GB || 376 GB || 220&lt;br /&gt;
|-&lt;br /&gt;
| large || 768 GB || 754 GB || 28&lt;br /&gt;
|-&lt;br /&gt;
| fat || 1536 GB || 1510 GB || 8&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.&lt;br /&gt;
&lt;br /&gt;
= Testing Your Jobs = &lt;br /&gt;
&lt;br /&gt;
JUSTUS 2 has three compute nodes reserved for jobs with a walltime under 15 minutes. To test whether your jobs start properly, just specify a short walltime, e.g. --time=00:14:00, and your job should start very quickly. &lt;br /&gt;
&lt;br /&gt;
= Monitoring Your Jobs =&lt;br /&gt;
== squeue ==&lt;br /&gt;
&lt;br /&gt;
After you submitted the job, you can see it waiting using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
(also read the man page with &amp;lt;code&amp;gt;man squeue&amp;lt;/code&amp;gt; for more information on how to use the command)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
             6260301  standard r_60_b_2 ul_yxz1 PD       0:00      1 (AssocGrpMemRunMinutes)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output shows: &lt;br /&gt;
* JOBID: a unique number your job gets&lt;br /&gt;
* PARTITION: the partition the job runs in; the cluster can be divided into different types of nodes&lt;br /&gt;
* NAME: the name you gave your job with the --job-name= option&lt;br /&gt;
* USER: your username&lt;br /&gt;
* ST: the state the job is in. R = running, PD = pending, CD = completed. See the man page for the full list of states. &lt;br /&gt;
* TIME: how long the job has been running&lt;br /&gt;
* NODES: how many nodes were requested&lt;br /&gt;
* NODELIST(REASON): either shows the node(s) the job is running on, or a reason why it hasn&#039;t started&lt;br /&gt;
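&lt;br /&gt;
To see Slurm&#039;s current estimate of when pending jobs will start, squeue can also report expected start times:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue --start -u $USER&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;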
&lt;br /&gt;
==scontrol==&lt;br /&gt;
&lt;br /&gt;
You can then show more info on one specific running job using the &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; command, e.g. for the job with ID 6260301 listed above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show job 6260301&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for job with JobID 6260301&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show jobs&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for all your jobs&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol write batch_script 6260301 -&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
displays the job script of a running job. The &amp;quot;-&amp;quot; is a special filename which means &amp;quot;write to the terminal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
== Monitoring a Started Job ==&lt;br /&gt;
&lt;br /&gt;
After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; ssh n0603 &lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
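&lt;br /&gt;
Once on the node, standard Linux tools show whether the job actually uses the cores you requested, for example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; top -b -n 1 -u $USER | head -n 20&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;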
&lt;br /&gt;
= Partitions =&lt;br /&gt;
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. number of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure the most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users &#039;&#039;&#039;should not&#039;&#039;&#039; specify any partition &amp;quot;-p, --partition=&amp;lt;partition_name&amp;gt;&amp;quot; on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.&lt;br /&gt;
&lt;br /&gt;
= Job Priorities =&lt;br /&gt;
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors ]:&lt;br /&gt;
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.&lt;br /&gt;
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Notes:&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.  &lt;br /&gt;
&lt;br /&gt;
Fairshare does &#039;&#039;&#039;not&#039;&#039;&#039; introduce a fixed allotment, in the sense that a user&#039;s ability to run new jobs would be cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources over the long term, which would be unfair to groups that have not used their fair share for quite some time.&lt;br /&gt;
&lt;br /&gt;
Slurm features &#039;&#039;&#039;backfilling&#039;&#039;&#039;, meaning that the scheduler will start lower priority jobs if doing so does not delay the expected start time of &#039;&#039;&#039;any&#039;&#039;&#039; higher priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=161 video]&#039;&#039;&#039; gives an illustrative description of how backfilling works.&lt;br /&gt;
&lt;br /&gt;
In summary, an approximate model of Slurm&#039;s behavior for scheduling jobs is this:&lt;br /&gt;
&lt;br /&gt;
* Step 1: Can the job in position one (highest priority) start now?&lt;br /&gt;
* Step 2: If it can, remove it from the queue, start it and continue with step 1.&lt;br /&gt;
* Step 3: If it cannot, look at the next job.&lt;br /&gt;
* Step 4: Can it start now, without delaying the start time of any job before it in the queue?&lt;br /&gt;
* Step 5: If it can, remove it from the queue, start it, recalculate which nodes are free, look at the next job and continue with step 4.&lt;br /&gt;
* Step 6: If it cannot, look at the next job, and continue with step 4.&lt;br /&gt;
&lt;br /&gt;
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.&lt;br /&gt;
&lt;br /&gt;
= Usage Limits/Throttling Policies =&lt;br /&gt;
&lt;br /&gt;
While the fairshare factor ensures a fair long-term balance of resource utilization between users and groups, there are additional usage limits that constrain the total resources a user can occupy at any given time. This is to prevent individual users from monopolizing large fractions of the whole cluster system in the short term.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum walltime&#039;&#039;&#039; for a job is &#039;&#039;&#039;14 days&#039;&#039;&#039; (336 hours)&lt;br /&gt;
  --time=336:00:00 or --time=14-0&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum number of cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920&#039;&#039;&#039; per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit also applies to allocated memory. If this limit is reached, new jobs will be queued (with REASON: AssocGrpCpuLimit) but only allowed to run after resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
* The maximum number of &#039;&#039;&#039;remaining allocated core-minutes&#039;&#039;&#039; per user is &#039;&#039;&#039;3300000&#039;&#039;&#039; (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this amounts to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time decreases and eventually allows more jobs to start in a staggered way. This limit also &#039;&#039;&#039;couples the maximum walltime and the number of cores that can be allocated&#039;&#039;&#039; for this amount of time. Thus, shorter walltimes allow more resources to be allocated at a given time (capped by the maximum number of cores above). Watch this &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=306 video]&#039;&#039;&#039; for an illustrative description. An equivalent limit applies to the remaining minutes of allocated memory, in which case jobs may be held back from starting with REASON: AssocGrpMemRunMinutes.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum number of GPUs&#039;&#039;&#039; allocated by running jobs is &#039;&#039;&#039;8&#039;&#039;&#039; per user. If this limit is reached, new jobs will be queued (with REASON: AssocGrpGRES) but only allowed to run after GPU resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Usage limits are subject to change.&lt;br /&gt;
&lt;br /&gt;
= Considerations on Efficiency / Special Use Cases =&lt;br /&gt;
&lt;br /&gt;
When we speak of poor job efficiency, we usually mean that hardware resources are wasted.&lt;br /&gt;
That means a similar overall result could have been achieved using fewer hardware resources, leaving those free for other jobs and reducing the wait time for you and everyone else.&lt;br /&gt;
&lt;br /&gt;
Some simple causes of poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
* a poor choice of resources compared to the size of the nodes leaves part of the node blocked but doing nothing:&lt;br /&gt;
** --ntasks-per-node does not evenly divide the number of cores of a node (see section [[#&amp;quot;Exclusive User&amp;quot; Node Access Policy]])&lt;br /&gt;
** too much (un-needed) memory or disk space is requested&lt;br /&gt;
* more cores requested than are actually used by the job&lt;br /&gt;
* more cores used for a single MPI/OpenMP parallel computation than is useful&lt;br /&gt;
* many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
* one-core jobs with very different run times (because of the single-user policy)&lt;br /&gt;
&lt;br /&gt;
== Many One or Few-Core Jobs ==&lt;br /&gt;
&lt;br /&gt;
Jobs that use only a few CPU cores can lead to very inefficient node usage:&lt;br /&gt;
&lt;br /&gt;
# You submit 1000 jobs, each running for ~30 s. A job needs up to 30 s just to start and finish - a huge waste if the calculation itself only takes 30 seconds. Additionally, starting and finishing so many jobs in a short time puts strain on the scheduler Slurm, may cause severe problems for everyone, and clutters the Slurm job database. &lt;br /&gt;
# Many few-core jobs with very different run times: the jobs will start on many nodes, but at some point all quicker jobs have finished their calculation and only a few remain. Because of the single-user policy on JUSTUS 2, jobs of other users cannot fill in the gaps and the rest of the node sits idle. &lt;br /&gt;
&lt;br /&gt;
To address the problem, you can reduce the number of jobs and/or the number of nodes used.&lt;br /&gt;
&lt;br /&gt;
To limit the number of jobs, start many calculations within one job (addresses problems 1 and 2):&lt;br /&gt;
&lt;br /&gt;
* use a bash loop in your job script&lt;br /&gt;
* use the program GNU parallel to start the processes for you&lt;br /&gt;
&lt;br /&gt;
To only limit the number of nodes used:&lt;br /&gt;
&lt;br /&gt;
* use array jobs&lt;br /&gt;
&lt;br /&gt;
=== Bash Loop ===&lt;br /&gt;
&lt;br /&gt;
One advantage of this method is that you can run more processes than cores if your jobs are really short and do not use too much RAM; this keeps all cores busy even while many calculations are still starting up.&lt;br /&gt;
&lt;br /&gt;
It is of course even better if you can combine such short calculations in a way that, for 1000 calculations, the kernel does not need to start 1000 processes which in turn each need to initialize everything. &lt;br /&gt;
&lt;br /&gt;
This example uses pgrep to count how many jobs are running: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
 &lt;br /&gt;
for i in {1..200}&lt;br /&gt;
do&lt;br /&gt;
  echo &amp;quot;starting up $i&amp;quot;&lt;br /&gt;
  bash my_calculation &amp;quot;$i&amp;quot; &amp;amp;&lt;br /&gt;
  # throttle: wait while 48 or more instances are still running&lt;br /&gt;
  while [ &amp;quot;$(pgrep -c -f my_calculation)&amp;quot; -ge 48 ]; do echo sleeping; sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The same, but tracking the PIDs (process IDs) of the started processes. This is more robust, but more difficult to read:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
running_jobs=()&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}; do&lt;br /&gt;
  echo &amp;quot;Starting job $i&amp;quot;&lt;br /&gt;
  sleep &amp;quot;$i&amp;quot; &amp;amp;  # stand-in for your real calculation&lt;br /&gt;
  running_jobs+=($!)  # Track PID&lt;br /&gt;
&lt;br /&gt;
  while [ &amp;quot;${#running_jobs[@]}&amp;quot; -ge 8 ]; do&lt;br /&gt;
    sleep 2 # adjust duration depending on your runtime&lt;br /&gt;
    echo running_jobs: ${running_jobs[@]} &lt;br /&gt;
    echo pid-out: $(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null | xargs)&lt;br /&gt;
    echo -----&lt;br /&gt;
    running_jobs=($(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null))  # Remove finished jobs&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
wait  # Ensure all jobs complete&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may not be able to just use an index number &amp;quot;i&amp;quot; to start many calculations. In this case, for a manageable number of files, the for loop can be used to read in config files. Here is just the general idea of the for loop, without the surrounding job script:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
This loops over all files in the directory config-1980-03-01_1/ and passes each one as an input file to &amp;quot;mycalculation&amp;quot; via a hypothetical &amp;quot;-config&amp;quot; option. Adding a date to the config directories (and outputs) makes it easier to track different runs in your lab journal.&lt;br /&gt;
&lt;br /&gt;
=== GNU Parallel ===&lt;br /&gt;
&lt;br /&gt;
GNU Parallel is available on the HPC cluster and comes with its own set of examples. You can access them like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ module load system/parallel&lt;br /&gt;
$ cp $PARALLEL_EXA_DIR/parallel.slurm .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
=== Array Jobs ===&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;$ sbatch -a 1-500%48 batch_script&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will submit 500 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 500, but will limit the number of simultaneously running tasks from this job array to 48 (the number of cores on a JUSTUS 2 node).&lt;br /&gt;
&lt;br /&gt;
The same can be done inside the job script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Number of cores per individual array task&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --array=1-500%48&lt;br /&gt;
#SBATCH --mem=3G&lt;br /&gt;
#SBATCH --time=1:10:00&lt;br /&gt;
#SBATCH --job-name=array_job&lt;br /&gt;
#SBATCH --output=array_job-%A_%a.out&lt;br /&gt;
#SBATCH --error=array_job-%A_%a.err&lt;br /&gt;
 &lt;br /&gt;
# Print the task id.&lt;br /&gt;
echo &amp;quot;My SLURM_ARRAY_TASK_ID: &amp;quot; $SLURM_ARRAY_TASK_ID&lt;br /&gt;
 &lt;br /&gt;
export TIMEFORMAT=%R  # let the &amp;quot;time&amp;quot; builtin print only the elapsed real time&lt;br /&gt;
time bash mycalculation $SLURM_ARRAY_TASK_ID&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see: &lt;br /&gt;
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]]&lt;br /&gt;
* Schedmd documentations on Job Arrays: https://slurm.schedmd.com/job_array.html&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14334</id>
		<title>JUSTUS2/Jobscripts: Running Your Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14334"/>
		<updated>2025-03-12T12:01:13Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Considerations on Efficiency / Special Use Cases */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Justus2}}&lt;br /&gt;
&lt;br /&gt;
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. &lt;br /&gt;
&lt;br /&gt;
= JUSTUS 2 Slurm Howto =&lt;br /&gt;
&lt;br /&gt;
This page presents only a very basic introduction. &lt;br /&gt;
&lt;br /&gt;
Please see  the &#039;&#039;&#039;[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]&#039;&#039;&#039; for many more examples and commands for common tasks.&lt;br /&gt;
&lt;br /&gt;
= Slurm Command Overview =&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs &lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job&lt;br /&gt;
|- &lt;br /&gt;
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job&lt;br /&gt;
|- &lt;br /&gt;
| seff  || Shows the &amp;quot;job efficiency&amp;quot; of a job after it has finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Submitting Jobs on the bwForCluster JUSTUS 2 =&lt;br /&gt;
Batch jobs are submitted with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job script contains options for Slurm in lines beginning with #SBATCH as well as your commands which you want to execute on the compute nodes. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can override options from the script on the command-line:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=bash&amp;gt;$ sbatch --time=03:00:00 &amp;lt;job-script&amp;gt; &amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: &amp;lt;font color=&amp;quot;red&amp;quot;&amp;gt; Compute jobs must not read/write calculation scratch files on the global file systems. &amp;lt;/font&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Use local storage for this purpose: /tmp in the ramdisk for small files, or /scratch (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVMe]]).&lt;br /&gt;
&lt;br /&gt;
To keep calculations off the central file system, you often must configure the program you are using to write temporary files elsewhere. &lt;br /&gt;
&lt;br /&gt;
If the program looks for files in the current directory, you must copy your files to a temporary directory - and copy/save the results of the calculation at the end, otherwise your results will be deleted by the automated cleanup that happens after the job.&lt;br /&gt;
&lt;br /&gt;
The diskless nodes have a disk in RAM, which can take up at most half of the total RAM. Note that the files created plus the memory requirement of your job need to fit into the total memory. &lt;br /&gt;
&lt;br /&gt;
There are more diskless nodes than nodes with disks, so if your job can run on a diskless node, you should choose this option. &lt;br /&gt;
&lt;br /&gt;
Example job script with requesting 700GB disk space and copying files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
#SBATCH --gres=scratch:700 &lt;br /&gt;
# copy input file&lt;br /&gt;
cp $HOME/inputfiles/myinput.inp $SCRATCH&lt;br /&gt;
# switch directory&lt;br /&gt;
cd $SCRATCH&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
myprogram --input=$SCRATCH/myinput.inp&lt;br /&gt;
# calculation ends&lt;br /&gt;
# copy result&lt;br /&gt;
cp outfile.out results2.txt $HOME/resultdir/job12345&lt;br /&gt;
# clean up&lt;br /&gt;
rm myinput.inp outfile.out results2.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:12px; background:#cef2e0;  text-align:left&amp;quot; |&lt;br /&gt;
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. &amp;lt;br&amp;gt; To find it e.g. for lammps: &amp;lt;code&amp;gt; module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la&amp;lt;/code&amp;gt;.&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Resource Requests ==&lt;br /&gt;
&lt;br /&gt;
Important resource request options for the Slurm command sbatch are:&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !!  Slurm (sbatch)&lt;br /&gt;
|-&lt;br /&gt;
| #SBATCH|| Script directive&lt;br /&gt;
|-&lt;br /&gt;
| --time=&amp;lt;hh:mm:ss&amp;gt; (-t &amp;lt;hh:mm:ss&amp;gt;)|| Wall time limit&lt;br /&gt;
|-&lt;br /&gt;
| --job-name=&amp;lt;name&amp;gt;  (-J &amp;lt;name&amp;gt;)|| Job name&lt;br /&gt;
|-&lt;br /&gt;
| --nodes=&amp;lt;count&amp;gt; (-N &amp;lt;count&amp;gt;)|| Node count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks=&amp;lt;count&amp;gt; (-n &amp;lt;count&amp;gt;)|| Core count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks-per-node=&amp;lt;count&amp;gt;|| Process count per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem=&amp;lt;limit&amp;gt;|| Memory limit per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem-per-cpu=&amp;lt;limit&amp;gt;|| Memory limit per process&lt;br /&gt;
|-&lt;br /&gt;
| --gres=gpu:&amp;lt;count&amp;gt;|| GPU count (gres = &amp;quot;generic resource&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| --gres=scratch:&amp;lt;count&amp;gt; || Disk space of &amp;lt;count&amp;gt; GB per requested task&lt;br /&gt;
|-&lt;br /&gt;
| --exclusive|| Node exclusive job&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Nodes and Cores&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm provides a number of options to request nodes and cores.&lt;br /&gt;
Typically, using &amp;lt;code&amp;gt;--nodes=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--ntasks-per-node=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; should work for all your jobs. For single core jobs, it is sufficient to use the option &amp;lt;code&amp;gt;--ntasks=1&amp;lt;/code&amp;gt;. Specifying only &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; may lead to Slurm distributing the tasks over more than one node even if you requested a small number of cores.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Memory&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Memory can be requested with either the option &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per node) or &amp;lt;code&amp;gt;--mem-per-cpu=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per process). When looking up the maximum available memory for a certain node type, subtract about 5 GB for the operating system. Specify the memory limit as a value-unit pair, for example 500mb or 8gb.&lt;br /&gt;
&lt;br /&gt;
In most cases it is preferable to use the &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPUs&#039;&#039;&#039; and &#039;&#039;&#039;Scratch&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
These are requested as &amp;quot;generic resources&amp;quot; with &amp;lt;code&amp;gt;--gres=gpu:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--gres=scratch:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default Values ==&lt;br /&gt;
&lt;br /&gt;
Default values for jobs are:&lt;br /&gt;
&lt;br /&gt;
* Runtime: --time=02:00:00 (2 hours)&lt;br /&gt;
* Nodes: --nodes=1 (one node)&lt;br /&gt;
* Tasks: --ntasks-per-node=1 (one task per node)&lt;br /&gt;
* Cores: --cpus-per-task=1 (one core per task)&lt;br /&gt;
* Memory: --mem-per-cpu=2gb (2 GB per core)&lt;br /&gt;
&lt;br /&gt;
==  &amp;quot;Exclusive User&amp;quot; Node Access Policy ==&lt;br /&gt;
&lt;br /&gt;
Nodes are exclusively allocated to one single user. However, multiple jobs (up to 48) from the same user can share a node.&lt;br /&gt;
&lt;br /&gt;
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each. &lt;br /&gt;
&lt;br /&gt;
The same applies to memory requests (see below).&lt;br /&gt;
&lt;br /&gt;
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
== Memory Limits ==&lt;br /&gt;
&lt;br /&gt;
The wait time of a job also depends largely on the amount of requested resources and the available number of nodes providing this amount of resources. This must be taken into account in particular when requesting a certain amount of memory.&lt;br /&gt;
&lt;br /&gt;
For example a node with 192 GB RAM can only run jobs with up to 187 GB memory requested. The remaining amount is reserved for the operating system, system services and local file systems.&lt;br /&gt;
This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb, or --ntasks-per-node=48 and --mem-per-cpu=4gb), the job cannot run on one of the 456 &amp;quot;small&amp;quot; nodes but only on one of the &amp;quot;medium&amp;quot;, &amp;quot;large&amp;quot; or &amp;quot;fat&amp;quot; nodes. Unnecessarily limiting your jobs to a subset of nodes will increase your wait time and the wait time of others who actually need that amount of memory.&lt;br /&gt;
&lt;br /&gt;
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:&lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Node type !! Physical RAM on node !! Available RAM on node !! Number of suitable nodes &lt;br /&gt;
|-&lt;br /&gt;
| small || 192 GB || 187 GB || 692 &lt;br /&gt;
|-&lt;br /&gt;
| medium || 384 GB || 376 GB || 220&lt;br /&gt;
|-&lt;br /&gt;
| large || 768 GB || 754 GB || 28&lt;br /&gt;
|-&lt;br /&gt;
| fat || 1536 GB || 1510 GB || 8&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.&lt;br /&gt;
&lt;br /&gt;
= Testing Your Jobs = &lt;br /&gt;
&lt;br /&gt;
JUSTUS 2 has three compute nodes reserved for jobs with a walltime under 15 minutes. To test whether your jobs start properly, just specify a short walltime, e.g. --time=00:14:00, and your job should start very quickly. &lt;br /&gt;
&lt;br /&gt;
= Monitoring Your Jobs =&lt;br /&gt;
== squeue ==&lt;br /&gt;
&lt;br /&gt;
After you submitted the job, you can see it waiting using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
(also read the man page with &amp;lt;code&amp;gt;man squeue&amp;lt;/code&amp;gt; for more information on how to use the command)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
             6260301  standard r_60_b_2 ul_yxz1 PD       0:00      1 (AssocGrpMemRunMinutes)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output shows: &lt;br /&gt;
* JOBID: a unique number your job gets&lt;br /&gt;
* PARTITION: the partition the job runs in; the cluster can be divided into different types of nodes&lt;br /&gt;
* NAME: the name you gave your job with the --job-name= option&lt;br /&gt;
* USER: your username&lt;br /&gt;
* ST: the state the job is in. R = running, PD = pending, CD = completed. See the man page for the full list of states. &lt;br /&gt;
* TIME: how long the job has been running&lt;br /&gt;
* NODES: how many nodes were requested&lt;br /&gt;
* NODELIST(REASON): either shows the node(s) the job is running on, or a reason why it hasn&#039;t started&lt;br /&gt;
&lt;br /&gt;
==scontrol==&lt;br /&gt;
&lt;br /&gt;
You can then show more info on one specific running job using the &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; command, e.g. for the job with ID 6260301 listed above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show job 6260301&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for job with JobID 6260301&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show jobs&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for all your jobs&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol write batch_script 6260301 -&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
displays the job script of a running job. The &amp;quot;-&amp;quot; is a special filename which means &amp;quot;write to the terminal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
== Monitoring a Started Job ==&lt;br /&gt;
&lt;br /&gt;
After a job has started, you can ssh from a login node to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; ssh n0603 &lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Partitions =&lt;br /&gt;
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. number of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure the most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users &#039;&#039;&#039;should not&#039;&#039;&#039; specify any partition &amp;quot;-p, --partition=&amp;lt;partition_name&amp;gt;&amp;quot; on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.&lt;br /&gt;
&lt;br /&gt;
= Job Priorities =&lt;br /&gt;
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors ]:&lt;br /&gt;
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.&lt;br /&gt;
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Notes:&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.  &lt;br /&gt;
&lt;br /&gt;
Fairshare does &#039;&#039;&#039;not&#039;&#039;&#039; introduce a fixed allotment, in the sense that a user&#039;s ability to run new jobs would be cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources over the long term, which would be unfair to groups that have not used their fair share for quite some time.&lt;br /&gt;
&lt;br /&gt;
Slurm features &#039;&#039;&#039;backfilling&#039;&#039;&#039;, meaning that the scheduler will start lower priority jobs if doing so does not delay the expected start time of &#039;&#039;&#039;any&#039;&#039;&#039; higher priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=161 video]&#039;&#039;&#039; gives an illustrative description of how backfilling works.&lt;br /&gt;
&lt;br /&gt;
In summary, an approximate model of Slurm&#039;s behavior for scheduling jobs is this:&lt;br /&gt;
&lt;br /&gt;
* Step 1: Can the job in position one (highest priority) start now?&lt;br /&gt;
* Step 2: If it can, remove it from the queue, start it and continue with step 1.&lt;br /&gt;
* Step 3: If it cannot, look at the next job.&lt;br /&gt;
* Step 4: Can it start now, without delaying the start time of any job before it in the queue?&lt;br /&gt;
* Step 5: If it can, remove it from the queue, start it, recalculate which nodes are free, look at the next job and continue with step 4.&lt;br /&gt;
* Step 6: If it cannot, look at the next job, and continue with step 4.&lt;br /&gt;
&lt;br /&gt;
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.&lt;br /&gt;
&lt;br /&gt;
= Usage Limits/Throttling Policies =&lt;br /&gt;
&lt;br /&gt;
While the fairshare factor ensures a fair long-term balance of resource utilization between users and groups, there are additional usage limits that constrain the total resources a user can occupy at any given time. This is to prevent individual users from monopolizing large fractions of the whole cluster system in the short term.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum walltime&#039;&#039;&#039; for a job is &#039;&#039;&#039;14 days&#039;&#039;&#039; (336 hours)&lt;br /&gt;
  --time=336:00:00 or --time=14-0&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum number of cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920&#039;&#039;&#039; per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit also applies to allocated memory. If this limit is reached, new jobs will be queued (with REASON: AssocGrpCpuLimit) but only allowed to run after resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
* The maximum number of &#039;&#039;&#039;remaining allocated core-minutes&#039;&#039;&#039; per user is &#039;&#039;&#039;3300000&#039;&#039;&#039; (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this amounts to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time decreases and eventually allows more jobs to start in a staggered way. This limit also &#039;&#039;&#039;couples the maximum walltime and the number of cores that can be allocated&#039;&#039;&#039; for this amount of time. Thus, shorter walltimes allow more resources to be allocated at a given time (capped by the maximum number of cores above). Watch this &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=306 video]&#039;&#039;&#039; for an illustrative description. An equivalent limit applies to the remaining minutes of allocated memory, in which case jobs may be held back from starting with REASON: AssocGrpMemRunMinutes.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum number of GPUs&#039;&#039;&#039; allocated by running jobs is &#039;&#039;&#039;8&#039;&#039;&#039; per user. If this limit is reached, new jobs will be queued (with REASON: AssocGrpGRES) but only allowed to run after GPU resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Usage limits are subject to change.&lt;br /&gt;
&lt;br /&gt;
= Considerations on Efficiency / Special Use Cases =&lt;br /&gt;
&lt;br /&gt;
When we speak of poor job efficiency, we usually mean that hardware resources are wasted.&lt;br /&gt;
That means a similar overall result could have been achieved using fewer hardware resources, leaving those free for other jobs and reducing the wait time for you and everyone else.&lt;br /&gt;
&lt;br /&gt;
Some simple causes for poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
*    a poor choice of resources relative to the size of the nodes leaves part of a node blocked but idle:&lt;br /&gt;
** --ntasks-per-node is not a divisor of the number of cores of a node (see section [[#&amp;quot;Exclusive User&amp;quot; Node Access Policy]])&lt;br /&gt;
** too much (un-needed) memory or disk space requested&lt;br /&gt;
*    more cores requested than are actually used by the job&lt;br /&gt;
*    more cores used for a single MPI/OpenMP parallel computation than is useful&lt;br /&gt;
*    many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
*    one-core jobs with very different run times (because of the single-user policy)&lt;br /&gt;
&lt;br /&gt;
== Many One-Core Jobs ==&lt;br /&gt;
&lt;br /&gt;
One-core jobs can make node usage highly inefficient:&lt;br /&gt;
&lt;br /&gt;
# You submit 1000 jobs, each running for ~30 s. Jobs need up to 30 s to start and finish - a huge overhead if the calculation itself only takes 30 seconds. Additionally, starting and finishing so many jobs in a short time strains the Slurm scheduler, may cause severe problems for everyone and clutters the Slurm job database.&lt;br /&gt;
# Many one-core jobs with very different run times. The jobs start on many nodes, but at some point all quicker jobs have finished their calculation and only a few remain. Because of the single-user policy on JUSTUS2, jobs of other users cannot fill in the gaps and the rest of each node sits idle.&lt;br /&gt;
&lt;br /&gt;
To address the problem, you can reduce the number of jobs and/or the number of nodes used.&lt;br /&gt;
&lt;br /&gt;
To limit the number of jobs, start many calculations within one job (addresses problems 1 and 2):&lt;br /&gt;
&lt;br /&gt;
* use a bash loop in your job script&lt;br /&gt;
* use the program GNU parallel to start the processes for you&lt;br /&gt;
&lt;br /&gt;
To limit only the number of nodes used:&lt;br /&gt;
&lt;br /&gt;
* use array jobs&lt;br /&gt;
&lt;br /&gt;
=== Bash Loop ===&lt;br /&gt;
&lt;br /&gt;
One advantage of this method is that you can run more processes than cores if your jobs are really short and do not use too much RAM; this keeps all cores busy even while many calculations are still starting up.&lt;br /&gt;
&lt;br /&gt;
It is of course even better if you can combine such short calculations so that, for 1000 calculations, the kernel does not need to start 1000 processes, each of which has to initialize everything.&lt;br /&gt;
&lt;br /&gt;
This example uses pgrep to count how many calculation processes are running:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
 &lt;br /&gt;
for i in {1..200}&lt;br /&gt;
do&lt;br /&gt;
  echo starting up $i&lt;br /&gt;
  bash my_calculation $i &amp;amp;&lt;br /&gt;
  # Throttle: wait while more processes than requested cores are running&lt;br /&gt;
  while [ $(pgrep -c -f my_calculation) -gt 48 ]; do echo sleeping; sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The same, but tracking the PIDs (process IDs) of the started processes. This is more robust, but more difficult to read:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
running_jobs=()&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}; do&lt;br /&gt;
  echo &amp;quot;Starting job $i&amp;quot;&lt;br /&gt;
  sleep &amp;quot;$i&amp;quot; &amp;amp;  # placeholder for your real calculation&lt;br /&gt;
  running_jobs+=($!)  # Track PID&lt;br /&gt;
&lt;br /&gt;
  while [ &amp;quot;${#running_jobs[@]}&amp;quot; -ge 8 ]; do&lt;br /&gt;
    sleep 2 # adjust duration depending on your runtime&lt;br /&gt;
    echo running_jobs: ${running_jobs[@]} &lt;br /&gt;
    echo pid-out: $(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null | xargs)&lt;br /&gt;
    echo -----&lt;br /&gt;
    running_jobs=($(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null))  # Remove finished jobs&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
wait  # Ensure all jobs complete&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may not be able to use just an index number &amp;quot;i&amp;quot; to start many calculations. In that case, if the number of files is manageable, the for loop can read in config files instead. Here is just the general idea for the loop, without the throttling shown above:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
This loops over all files in the directory config-1980-03-01_1/ and passes each of them as an input file to &amp;quot;mycalculation&amp;quot; via a hypothetical &amp;quot;-config&amp;quot; option. Adding a date to the config-dirs (and outputs) would enable you to track different runs in your lab journal more easily. A sketch that combines this loop with background processes follows below.&lt;br /&gt;
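&lt;br /&gt;
To run such config-driven calculations concurrently, the loop can be combined with the throttling pattern from above. A minimal sketch (mycalculation and its -config option remain hypothetical):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot; &amp;amp;  # run in the background&lt;br /&gt;
  # jobs -rp lists the PIDs of this shell&#039;s running background jobs&lt;br /&gt;
  while [ &amp;quot;$(jobs -rp | wc -l)&amp;quot; -ge 48 ]; do sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait  # let the last batch finish&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;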
&lt;br /&gt;
=== GNU Parallel ===&lt;br /&gt;
&lt;br /&gt;
GNU Parallel is available on the HPC cluster and comes with its own set of examples; you can access them like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ module load system/parallel&lt;br /&gt;
$ cp $PARALLEL_EXA_DIR/parallel.slurm .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
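&lt;br /&gt;
For a flavor of how it is used, here is an illustrative one-liner (not taken from the bundled examples; my_calculation is the hypothetical script from above) that keeps 48 calculations running at any time:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ parallel -j 48 bash my_calculation {} ::: {1..200}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This replaces the manual throttling loops shown in the bash examples above.&lt;br /&gt;
&lt;br /&gt;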
=== Array Jobs ===&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;$ sbatch -a 1-500%48 batch_script&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will submit 500 tasks, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 500, but will limit the number of simultaneously running tasks from this job array to 48 (the number of cores on a JUSTUS 2 node).&lt;br /&gt;
&lt;br /&gt;
The same can be done inside the job script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Number of cores per individual array task&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --array=1-500%48&lt;br /&gt;
#SBATCH --mem=3G&lt;br /&gt;
#SBATCH --time=1:10:00&lt;br /&gt;
#SBATCH --job-name=array_job&lt;br /&gt;
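# In the file names below, %A expands to the master job ID and %a to the array task index&lt;br /&gt;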
#SBATCH --output=array_job-%A_%a.out&lt;br /&gt;
#SBATCH --error=array_job-%A_%a.err&lt;br /&gt;
 &lt;br /&gt;
# Print the task id.&lt;br /&gt;
echo &amp;quot;My SLURM_ARRAY_TASK_ID: &amp;quot; $SLURM_ARRAY_TASK_ID&lt;br /&gt;
 &lt;br /&gt;
export TIMEFORMAT=%R  # let the time builtin print only the elapsed (real) time&lt;br /&gt;
time bash mycalculation $SLURM_ARRAY_TASK_ID&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also see: &lt;br /&gt;
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]]&lt;br /&gt;
* SchedMD documentation on Job Arrays: https://slurm.schedmd.com/job_array.html&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Julia/VSCode&amp;diff=14318</id>
		<title>JUSTUS2/Software/Julia/VSCode</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Julia/VSCode&amp;diff=14318"/>
		<updated>2025-03-07T10:31:10Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Connect to Nodes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Setting Up VS Code for Interactive Julia Sessions =&lt;br /&gt;
&lt;br /&gt;
== Wrapper Script for Module Loading==&lt;br /&gt;
The Julia Extension of VS Code needs to run Julia, which requires loading the module. &lt;br /&gt;
&lt;br /&gt;
Save the following wrapper script e.g. as $HOME/bin/julia_wrapper.sh and make it executable:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
## Making module command available&lt;br /&gt;
## ------------------------------------------------------------&lt;br /&gt;
export MODULEPATH=/opt/bwhpc/ul/modulefiles/Core:/opt/bwhpc/common/modulefiles/Core:/etc/modulefiles:/usr/share/modulefiles:/usr/share/modulefiles/Linux:/usr/share/modulefiles/Core:/usr/share/lmod/lmod/modulefiles/Core&lt;br /&gt;
&lt;br /&gt;
source /usr/share/lmod/lmod/init/profile&lt;br /&gt;
&lt;br /&gt;
## ------------------------------------------------------------&lt;br /&gt;
&lt;br /&gt;
# Load julia&lt;br /&gt;
module load math/julia/1.11.3&lt;br /&gt;
&lt;br /&gt;
# Pass on all arguments to julia&lt;br /&gt;
exec julia &amp;quot;${@}&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
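&lt;br /&gt;
To make the script executable and give it a quick test on a login node (the --version call merely checks that Julia starts):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chmod +x $HOME/bin/julia_wrapper.sh&lt;br /&gt;
$HOME/bin/julia_wrapper.sh --version&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;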
&lt;br /&gt;
Next, we need to configure VS Code to use the script as the Julia executable. You find the corresponding setting at File|Preferences|Settings → tab “Remote” → Extensions|Julia → setting “Julia: Executable path”&lt;br /&gt;
&lt;br /&gt;
== Connect to Nodes ==&lt;br /&gt;
 &lt;br /&gt;
If you do not need many CPU resources or much memory, you can develop your code on the login nodes. However, if you need more resources, you must use an interactive job, see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How to submit an interactive job?]].&lt;br /&gt;
&lt;br /&gt;
You can only connect to the compute nodes via a login node. Therefore you need to adjust your SSH config to use a login node as a proxy jump to the compute node where your interactive job is running. To do so, add the following lines to $HOME/.ssh/config on your PC:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Host n????&lt;br /&gt;
        User    YOUR_JUSTUS_USERNAME&lt;br /&gt;
        ProxyJump justus2.uni-ulm.de&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please replace YOUR_JUSTUS_USERNAME with your username on the cluster.&lt;br /&gt;
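&lt;br /&gt;
For example, if your interactive job runs on node n0603 (a hypothetical node name matching the n???? pattern), you can then connect directly from your PC:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ssh n0603&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;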
&lt;br /&gt;
To avoid entering your password repeatedly, you should set up an SSH key for login.&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14305</id>
		<title>JUSTUS2/Jobscripts: Running Your Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14305"/>
		<updated>2025-03-07T09:47:08Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Memory Limits */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Justus2}}&lt;br /&gt;
&lt;br /&gt;
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. &lt;br /&gt;
&lt;br /&gt;
= JUSTUS 2 Slurm Howto =&lt;br /&gt;
&lt;br /&gt;
This page presents only a very basic introduction. &lt;br /&gt;
&lt;br /&gt;
Please see  the &#039;&#039;&#039;[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]&#039;&#039;&#039; for many more examples and commands for common tasks.&lt;br /&gt;
&lt;br /&gt;
= Slurm Command Overview =&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/salloc.html salloc] || Requests resources for an interactive job&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs &lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job&lt;br /&gt;
|- &lt;br /&gt;
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job&lt;br /&gt;
|- &lt;br /&gt;
| seff  || Shows the &amp;quot;job efficiency&amp;quot; of a job after it has finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Submitting Jobs on the bwForCluster JUSTUS 2 =&lt;br /&gt;
Batch jobs are submitted with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=bash&amp;gt;$ sbatch &amp;lt;job-script&amp;gt; &amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job script contains options for Slurm in lines beginning with #SBATCH as well as your commands which you want to execute on the compute nodes. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can override options from the script on the command-line:&lt;br /&gt;
&amp;lt;source lang=bash&amp;gt;$ sbatch --time=03:00:00 &amp;lt;job-script&amp;gt; &amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: &amp;lt;font color=&amp;quot;red&amp;quot;&amp;gt; Compute jobs must not use the global file systems as scratch or swap space for their calculations. &amp;lt;/font&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Use local storage instead: /tmp (in the ramdisk) for small files, or /scratch (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVME]]) for larger data.&lt;br /&gt;
&lt;br /&gt;
To keep calculations off the central file systems, you often need to configure the program you are using to write temporary files elsewhere. &lt;br /&gt;
&lt;br /&gt;
If the program looks for files in the current directory, you must copy your input files to a temporary directory and copy the results back at the end; otherwise your results are deleted by the automated cleanup that runs after the job.&lt;br /&gt;
&lt;br /&gt;
The diskless nodes have a disk in RAM that can grow to at most half of the total RAM. Note that the files created plus the memory requirement of your job need to fit into the total memory. &lt;br /&gt;
&lt;br /&gt;
There are more diskless nodes than nodes with disks, so if your job can run on a diskless node, you should choose this option. &lt;br /&gt;
&lt;br /&gt;
Example job script requesting 700 GB of local disk space and copying files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
#SBATCH --gres=scratch:700 &lt;br /&gt;
# copy input file&lt;br /&gt;
cp $HOME/inputfiles/myinput.inp $SCRATCH&lt;br /&gt;
# switch directory&lt;br /&gt;
cd $SCRATCH&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
myprogram --input=$SCRATCH/myinput.inp&lt;br /&gt;
# calculation ends&lt;br /&gt;
# copy results back (the target directory must exist)&lt;br /&gt;
mkdir -p $HOME/resultdir/job12345&lt;br /&gt;
cp outfile.out results2.txt $HOME/resultdir/job12345&lt;br /&gt;
# clean up&lt;br /&gt;
rm myinput.inp outfile.out results2.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:12px; background:#cef2e0;  text-align:left&amp;quot; |&lt;br /&gt;
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. &amp;lt;br&amp;gt; To find them, e.g. for lammps: &amp;lt;code&amp;gt; module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la&amp;lt;/code&amp;gt;.&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Resource Requests ==&lt;br /&gt;
&lt;br /&gt;
Important resource request options for the Slurm command sbatch are:&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !!  Slurm (sbatch)&lt;br /&gt;
|-&lt;br /&gt;
| #SBATCH|| Script directive&lt;br /&gt;
|-&lt;br /&gt;
| --time=&amp;lt;hh:mm:ss&amp;gt; (-t &amp;lt;hh:mm:ss&amp;gt;)|| Wall time limit&lt;br /&gt;
|-&lt;br /&gt;
| --job-name=&amp;lt;name&amp;gt;  (-J &amp;lt;name&amp;gt;)|| Job name&lt;br /&gt;
|-&lt;br /&gt;
| --nodes=&amp;lt;count&amp;gt; (-N &amp;lt;count&amp;gt;)|| Node count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks=&amp;lt;count&amp;gt; (-n &amp;lt;count&amp;gt;)|| Core count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks-per-node=&amp;lt;count&amp;gt;|| Process count per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem=&amp;lt;limit&amp;gt;|| Memory limit per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem-per-cpu=&amp;lt;limit&amp;gt;|| Memory limit per process&lt;br /&gt;
|-&lt;br /&gt;
| --gres=gpu:&amp;lt;count&amp;gt;|| GPU count (gres = &amp;quot;generic resource&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| --gres=scratch:&amp;lt;count&amp;gt; || Disk space of &amp;lt;count&amp;gt; GB per requested task&lt;br /&gt;
|-&lt;br /&gt;
| --exclusive|| Node exclusive job&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Nodes and Cores&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm provides a number of options to request nodes and cores.&lt;br /&gt;
Typically, using &amp;lt;code&amp;gt;--nodes=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--ntasks-per-node=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; should work for all your jobs. For single-core jobs it is sufficient to use the option &amp;lt;code&amp;gt;--ntasks=1&amp;lt;/code&amp;gt;. Specifying only &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; with a larger count may lead to Slurm distributing the tasks over more than one node, even if you requested only a small number of cores.&lt;br /&gt;
&lt;br /&gt;
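As an illustration, here is a minimal sketch of a multi-node request; the program name is a placeholder:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=2               # two complete nodes&lt;br /&gt;
#SBATCH --ntasks-per-node=48    # one task per core&lt;br /&gt;
#SBATCH --time=01:00:00&lt;br /&gt;
#SBATCH --mem=180gb             # per node, fits on the &amp;quot;small&amp;quot; nodes&lt;br /&gt;
srun my_parallel_program        # srun starts one process per task&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;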
&#039;&#039;&#039;Memory&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Memory can be requested with either the option &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per node) or &amp;lt;code&amp;gt;--mem-per-cpu=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per process). When looking up the maximum available memory for a certain node type, subtract about 5 GB for the operating system. Specify the memory limit as a value-unit pair, for example 500mb or 8gb.&lt;br /&gt;
&lt;br /&gt;
In most cases it is preferable to use the &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPUs&#039;&#039;&#039; and &#039;&#039;&#039;Scratch&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
These are requested as &amp;quot;generic resources&amp;quot; with &amp;lt;code&amp;gt;--gres=gpu:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--gres=scratch:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
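For example, a single-task job requesting one GPU together with 100 GB of local scratch might look like this sketch (the program name and its option are placeholders):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=04:00:00&lt;br /&gt;
#SBATCH --mem=16gb&lt;br /&gt;
#SBATCH --gres=gpu:1,scratch:100   # one GPU plus 100 GB local scratch&lt;br /&gt;
my_gpu_program --workdir=$SCRATCH&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;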
== Default Values ==&lt;br /&gt;
&lt;br /&gt;
Default values for jobs are:&lt;br /&gt;
&lt;br /&gt;
* Runtime: --time=02:00:00 (2 hours)&lt;br /&gt;
* Nodes: --nodes=1 (one node)&lt;br /&gt;
* Tasks: --ntasks-per-node=1 (one task per node)&lt;br /&gt;
* Cores: --cpus-per-task=1 (one core per task)&lt;br /&gt;
* Memory: --mem-per-cpu=2gb (2 GB per core)&lt;br /&gt;
&lt;br /&gt;
==  &amp;quot;Exclusive User&amp;quot; Node Access Policy ==&lt;br /&gt;
&lt;br /&gt;
Nodes are exclusively allocated to a single user. However, multiple jobs (up to 48) from the same user can share a node.&lt;br /&gt;
&lt;br /&gt;
For efficient resource use, choose a core count for your jobs that evenly divides 48. For example, two 24-core jobs fit on one node, while two 32-core jobs require two nodes but leave 16 cores unused on each. &lt;br /&gt;
&lt;br /&gt;
The same applies to memory requests (see below).&lt;br /&gt;
&lt;br /&gt;
Think of scheduling as a game of Tetris with cores, memory, and other resources. Choosing well-fitting allocations helps the scheduler pack jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
== Memory Limits ==&lt;br /&gt;
&lt;br /&gt;
The wait time of a job also depends largely on the amount of requested resources and on the number of nodes that can provide them. This must be taken into account in particular when requesting a certain amount of memory.&lt;br /&gt;
&lt;br /&gt;
For example, a node with 192 GB RAM can only run jobs that request up to 187 GB of memory. The remaining amount is reserved for the operating system, system services and local file systems.&lt;br /&gt;
This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb, or --ntasks-per-node=48 and --mem-per-cpu=4gb), the job cannot run on one of the 456 &amp;quot;small&amp;quot; nodes but only on one of the &amp;quot;medium&amp;quot;, &amp;quot;large&amp;quot; or &amp;quot;fat&amp;quot; nodes. Unnecessarily limiting your jobs to a subset of nodes increases your wait time as well as the wait time of others who actually need that amount of memory.&lt;br /&gt;
&lt;br /&gt;
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:&lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Node type !! Physical RAM on node !! Available RAM on node !! Number of suitable nodes&lt;br /&gt;
|-&lt;br /&gt;
| small || 192 GB || 187 GB || 692&lt;br /&gt;
|-&lt;br /&gt;
| medium || 384 GB || 376 GB || 220&lt;br /&gt;
|-&lt;br /&gt;
| large || 768 GB || 754 GB || 28&lt;br /&gt;
|-&lt;br /&gt;
| fat || 1536 GB || 1510 GB || 8&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.&lt;br /&gt;
&lt;br /&gt;
= Testing Your Jobs = &lt;br /&gt;
&lt;br /&gt;
JUSTUS 2 has three compute nodes reserved for jobs with a walltime under 15 minutes. To test whether your jobs start properly, specify a short walltime, e.g. --time=00:14:00, and your job should start very quickly.&lt;br /&gt;
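&lt;br /&gt;
For example (the job script name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=bash&amp;gt;$ sbatch --time=00:14:00 test-job.sh &amp;lt;/source&amp;gt;&lt;br /&gt;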
&lt;br /&gt;
= Monitoring Your Jobs =&lt;br /&gt;
== squeue ==&lt;br /&gt;
&lt;br /&gt;
After you have submitted a job, you can watch it in the queue using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
(also read the man page with &amp;lt;code&amp;gt;man squeue&amp;lt;/code&amp;gt; for more information on how to use the command)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
             6260301  standard r_60_b_2 ul_yxz1 PD       0:00      1 (AssocGrpMemRunMinutes)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output shows:&lt;br /&gt;
* JOBID: a unique number assigned to your job&lt;br /&gt;
* PARTITION: the partition; the cluster can be divided into different types of nodes&lt;br /&gt;
* NAME: the name you gave your job with the --job-name= option&lt;br /&gt;
* USER: your username&lt;br /&gt;
* ST: the state the job is in: R = running, PD = pending, CD = completed. See the man page for a full list of states.&lt;br /&gt;
* TIME: how long the job has been running&lt;br /&gt;
* NODES: how many nodes were requested&lt;br /&gt;
* NODELIST(REASON): either the node(s) the job is running on, or a reason why it has not started yet&lt;br /&gt;
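&lt;br /&gt;
To list only your own jobs, you can restrict the output to your user name:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=bash&amp;gt;$ squeue -u $USER&amp;lt;/source&amp;gt;&lt;br /&gt;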
&lt;br /&gt;
==scontrol==&lt;br /&gt;
&lt;br /&gt;
You can show more information on one specific job using the &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; command, e.g. for the job with ID 6260301 listed above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show job 6260301&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for the job with JobID 6260301&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show jobs&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for all your jobs&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol write batch_script 6260301 -&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays the job script of a running job. The &amp;quot;-&amp;quot; is a special filename which means &amp;quot;write to the terminal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
== Monitoring a Started Job ==&lt;br /&gt;
&lt;br /&gt;
After a job has started, you can ssh to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; ssh n0603 &lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
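&lt;br /&gt;
On the node you can then inspect your running processes, for example with &amp;lt;code&amp;gt;top -u $USER&amp;lt;/code&amp;gt;.&lt;br /&gt;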
&lt;br /&gt;
= Partitions =&lt;br /&gt;
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. number of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure the most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users &#039;&#039;&#039;should not&#039;&#039;&#039; specify any partition &amp;quot;-p, --partition=&amp;lt;partition_name&amp;gt;&amp;quot; on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.&lt;br /&gt;
&lt;br /&gt;
= Job Priorities =&lt;br /&gt;
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors]:&lt;br /&gt;
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.&lt;br /&gt;
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Notes:&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.  &lt;br /&gt;
&lt;br /&gt;
Fairshare does &#039;&#039;&#039;not&#039;&#039;&#039; introduce a fixed allotment where a user&#039;s ability to run new jobs is cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources in the long term, which would be unfair to groups that have not used their fair share for quite some time.&lt;br /&gt;
&lt;br /&gt;
Slurm features &#039;&#039;&#039;backfilling&#039;&#039;&#039;, meaning that the scheduler will start lower-priority jobs if doing so does not delay the expected start time of &#039;&#039;&#039;any&#039;&#039;&#039; higher-priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=161 video]&#039;&#039;&#039; gives an illustrative description of how backfilling works.&lt;br /&gt;
&lt;br /&gt;
In summary, an approximate model of Slurm&#039;s behavior for scheduling jobs is this:&lt;br /&gt;
&lt;br /&gt;
* Step 1: Can the job in position one (highest priority) start now?&lt;br /&gt;
* Step 2: If it can, remove it from the queue, start it and continue with step 1.&lt;br /&gt;
* Step 3: If it cannot, look at the next job.&lt;br /&gt;
* Step 4: Can it start now, without delaying the start time of any job before it in the queue?&lt;br /&gt;
* Step 5: If it can, remove it from the queue, start it, recalculate which nodes are free, look at the next job and continue with step 4.&lt;br /&gt;
* Step 6: If it cannot, look at the next job and continue with step 4.&lt;br /&gt;
&lt;br /&gt;
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.&lt;br /&gt;
&lt;br /&gt;
= Usage Limits/Throttling Policies =&lt;br /&gt;
&lt;br /&gt;
While the fairshare factor ensures fair long term balance of resource utilization between users and groups, there are additional usage limits that constrain the total cumulative resources at a given time. This is to prevent individual users from short term monopolizing large fractions of the whole cluster system.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum walltime&#039;&#039;&#039; for a job is &#039;&#039;&#039;14 days&#039;&#039;&#039; (336 hours)&lt;br /&gt;
  --time=336:00:00 or --time=14-0&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum number of cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920&#039;&#039;&#039; per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit for allocated memory also applies. If this limit is reached, new jobs will be queued (with REASON: AssocGrpCpuLimit) but only allowed to run after resources have been relinquished.&lt;br /&gt;
&lt;br /&gt;
* The maximum number of &#039;&#039;&#039;remaining allocated core-minutes&#039;&#039;&#039; per user is &#039;&#039;&#039;3300000&#039;&#039;&#039; (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time decreases and eventually allows more jobs to start in a staggered way. This limit also &#039;&#039;&#039;correlates the maximum walltime and the number of cores that can be allocated&#039;&#039;&#039; for this amount of time. Thus, shorter walltimes allow more resources to be allocated at a given time (capped by the maximum number of cores above). Watch this &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=306 video]&#039;&#039;&#039; for an illustrative description. An equivalent limit applies to the remaining time of memory allocations, in which case jobs may be held back from starting with REASON AssocGrpMemRunMinutes.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum number of GPUs&#039;&#039;&#039; allocated by running jobs is &#039;&#039;&#039;8&#039;&#039;&#039; per user. If this limit is reached, new jobs will be queued (with REASON: AssocGrpGRES) but only allowed to run after GPU resources have been relinquished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Usage limits are subject to change.&lt;br /&gt;
&lt;br /&gt;
= Considerations on Efficiency / Special Use Cases =&lt;br /&gt;
&lt;br /&gt;
When we speak of poor job efficiency, we usually mean that hardware resources are wasted.&lt;br /&gt;
That means a similar overall result could have been achieved using fewer hardware resources, leaving those for other jobs and reducing the wait time for you and everyone else.&lt;br /&gt;
&lt;br /&gt;
Some simple causes for poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
*    poor choice of resources compared to the size of the nodes leaves part of the node blocked but idle:&lt;br /&gt;
** the value of --ntasks-per-node does not evenly divide the number of cores of a node (see the section on the &amp;quot;exclusive user&amp;quot; policy above)&lt;br /&gt;
** too much (un-needed) memory or disk space requested&lt;br /&gt;
*    more cores requested than are actually used by the job&lt;br /&gt;
*    more cores used for a single MPI/OpenMP parallel computation than is useful&lt;br /&gt;
*    many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
*    one-core jobs with very different run times (because of the single-user policy)&lt;br /&gt;
&lt;br /&gt;
== Many One-Core Jobs ==&lt;br /&gt;
&lt;br /&gt;
One-core jobs can cause grossly inefficient node usage:&lt;br /&gt;
&lt;br /&gt;
# You submit 1000 jobs, each of which runs for ~30s. A job needs up to 30s to start and finish - a huge waste if the calculation itself only takes 30 seconds. Additionally, starting and finishing so many jobs in a short time puts strain on the Slurm scheduler, may cause severe problems for everyone, and clutters the Slurm job database.&lt;br /&gt;
# You submit many one-core jobs with very different run times. The jobs start on many nodes, but at some point all the quicker jobs have finished and only a few remain. Because of the single-user policy on JUSTUS 2, jobs of other users cannot fill in the gaps, and the rest of each node sits idle.&lt;br /&gt;
&lt;br /&gt;
To address the problem, you can reduce the number of jobs and/or the number of nodes used.&lt;br /&gt;
&lt;br /&gt;
To limit the number of jobs, start many calculations within one job (addresses problems 1 and 2):&lt;br /&gt;
&lt;br /&gt;
* use a bash loop in your job script&lt;br /&gt;
* use the program GNU parallel to start the processes for you&lt;br /&gt;
&lt;br /&gt;
To limit only the number of nodes used:&lt;br /&gt;
&lt;br /&gt;
* use array jobs&lt;br /&gt;
&lt;br /&gt;
=== Bash Loop ===&lt;br /&gt;
&lt;br /&gt;
One advantage of this method is that you can run more processes than cores if your jobs are really short and do not use too much RAM, and in this way keep all cores busy even while many calculations are still starting up.&lt;br /&gt;
&lt;br /&gt;
It is of course even better if you can combine such short calculations, so that for 1000 calculations the kernel does not need to start 1000 processes, each of which has to initialize everything anew.&lt;br /&gt;
&lt;br /&gt;
This example uses pgrep to count how many jobs are running: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}&lt;br /&gt;
do&lt;br /&gt;
  echo &amp;quot;starting up $i&amp;quot;&lt;br /&gt;
  bash my_calculation &amp;quot;$i&amp;quot; &amp;amp;&lt;br /&gt;
  # throttle: wait while 48 or more calculations are still running&lt;br /&gt;
  while [ &amp;quot;$(pgrep -c -f my_calculation)&amp;quot; -ge 48 ]; do echo sleeping; sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
wait   # wait for the remaining background calculations to finish&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
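A variant of the same idea tracks the PIDs of the started processes instead of counting with pgrep:&lt;br /&gt;
&lt;br /&gt;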
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
running_jobs=()&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}; do&lt;br /&gt;
  echo &amp;quot;Starting job $i&amp;quot;&lt;br /&gt;
  sleep &amp;quot;$i&amp;quot; &amp;amp;  &lt;br /&gt;
  running_jobs+=($!)  # Track PID&lt;br /&gt;
&lt;br /&gt;
  while [ &amp;quot;${#running_jobs[@]}&amp;quot; -ge 8 ]; do&lt;br /&gt;
    sleep 2 # adjust duration depending on your runtime&lt;br /&gt;
    echo &amp;quot;running_jobs: ${running_jobs[*]}&amp;quot;&lt;br /&gt;
    echo &amp;quot;pid-out: $(ps -o pid= -p &amp;quot;${running_jobs[*]}&amp;quot; 2&amp;gt;/dev/null | xargs)&amp;quot;&lt;br /&gt;
    echo -----&lt;br /&gt;
    # keep only the PIDs that are still alive (ps accepts a blank-separated list as one argument)&lt;br /&gt;
    running_jobs=($(ps -o pid= -p &amp;quot;${running_jobs[*]}&amp;quot; 2&amp;gt;/dev/null))&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
wait  # Ensure all jobs complete&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may not be able to use just an index number &amp;quot;i&amp;quot; to start many calculations. In that case, if the number of files is manageable, the for loop can iterate over config files instead. Here is just the general idea of the loop, without the surrounding job script:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
This loops over all files in the directory config-1980-03-01_1/ and passes each of them as an input file to &amp;quot;mycalculation&amp;quot; via a hypothetical &amp;quot;-config&amp;quot; option. Adding a date to the config directories (and outputs) makes it easier to track different runs in your lab journal.&lt;br /&gt;
&lt;br /&gt;
=== Gnu Parallel ===&lt;br /&gt;
&lt;br /&gt;
GNU Parallel is available on the HPC cluster and comes with its own set of examples; you can access them like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ module load system/parallel&lt;br /&gt;
$ cp $PARALLEL_EXA_DIR/parallel.slurm .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
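&lt;br /&gt;
As a rough sketch (assuming, as above, a script my_calculation that takes an index argument), GNU Parallel can keep all 48 cores busy with a single line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=01:00:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
&lt;br /&gt;
module load system/parallel&lt;br /&gt;
# run my_calculation with arguments 1..200, at most 48 at a time&lt;br /&gt;
parallel -j 48 bash my_calculation {} ::: {1..200}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;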
=== Array Jobs ===&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;$ sbatch -a 1-500%48 batch_script&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will submit 500 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 500, but will limit the number of simultaneously running tasks from this job array to 48 (the number of cores on a JUSTUS 2 node).&lt;br /&gt;
&lt;br /&gt;
The same can be done inside the job script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Number of cores per individual array task&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --array=1-500%48&lt;br /&gt;
#SBATCH --mem=3G&lt;br /&gt;
#SBATCH --time=1:10:00&lt;br /&gt;
#SBATCH --job-name=array_job&lt;br /&gt;
#SBATCH --output=array_job-%A_%a.out&lt;br /&gt;
#SBATCH --error=array_job-%A_%a.err&lt;br /&gt;
 &lt;br /&gt;
# Print the task id.&lt;br /&gt;
echo &amp;quot;My SLURM_ARRAY_TASK_ID: &amp;quot; $SLURM_ARRAY_TASK_ID&lt;br /&gt;
 &lt;br /&gt;
export TIMEFORMAT=%R   # let the &#039;time&#039; keyword print only the elapsed seconds&lt;br /&gt;
time bash mycalculation $SLURM_ARRAY_TASK_ID&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
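&lt;br /&gt;
If your inputs are files rather than numbers, the array index can be mapped to a file name; a small sketch (directory and program names are placeholders):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# pick the N-th config file for array task N&lt;br /&gt;
config=$(ls configs/ | sed -n &amp;quot;${SLURM_ARRAY_TASK_ID}p&amp;quot;)&lt;br /&gt;
mycalculation -config &amp;quot;configs/$config&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;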
&lt;br /&gt;
Also see: &lt;br /&gt;
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]]&lt;br /&gt;
* SchedMD documentation on Job Arrays: https://slurm.schedmd.com/job_array.html&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14295</id>
		<title>JUSTUS2/Jobscripts: Running Your Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Jobscripts:_Running_Your_Calculations&amp;diff=14295"/>
		<updated>2025-03-07T08:29:06Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Justus2}}&lt;br /&gt;
&lt;br /&gt;
The JUSTUS 2 cluster uses Slurm ([https://slurm.schedmd.com/ https://slurm.schedmd.com/]) for scheduling compute jobs. &lt;br /&gt;
&lt;br /&gt;
= JUSTUS 2 Slurm Howto =&lt;br /&gt;
&lt;br /&gt;
This page only presents a very basic introduction.&lt;br /&gt;
&lt;br /&gt;
Please see  the &#039;&#039;&#039;[[bwForCluster JUSTUS 2 Slurm HOWTO|JUSTUS 2 Slurm HOWTO]]&#039;&#039;&#039; for many more examples and commands for common tasks.&lt;br /&gt;
&lt;br /&gt;
= Slurm Command Overview =&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sbatch.html sbatch] || Submits a job and queues it in an input queue&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/salloc.html salloc] || Request resources for an interactive job&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/squeue.html squeue] || Displays information about active, eligible, blocked, and/or recently completed jobs &lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/scontrol.html scontrol] || Displays detailed job state information&lt;br /&gt;
|-&lt;br /&gt;
| [https://slurm.schedmd.com/sstat.html sstat] || Displays status information about a running job&lt;br /&gt;
|- &lt;br /&gt;
| [https://slurm.schedmd.com/scancel.html scancel] || Cancels a job&lt;br /&gt;
|- &lt;br /&gt;
| seff  || Shows the &amp;quot;job efficiency&amp;quot; of a job after it has finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Submitting Jobs on the bwForCluster JUSTUS 2 =&lt;br /&gt;
Batch jobs are submitted with the command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=bash&amp;gt;$ sbatch &amp;lt;job-script&amp;gt; &amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job script contains options for Slurm in lines beginning with #SBATCH as well as your commands which you want to execute on the compute nodes. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can override options from the script on the command-line:&lt;br /&gt;
&amp;lt;source lang=bash&amp;gt;$ sbatch --time=03:00:00 &amp;lt;job-script&amp;gt; &amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: &amp;lt;font color=&amp;quot;red&amp;quot;&amp;gt; Compute jobs must not read/write temporary calculation (scratch) files on the global file systems. &amp;lt;/font&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Use the local storage /tmp (a ramdisk) for small files, or /scratch (see [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_request_local_scratch_.28SSD.2FNVMe.29_at_job_submission.3F|How to request NVMe]]) for this purpose.&lt;br /&gt;
&lt;br /&gt;
To keep calculations off the central file systems, you often have to configure the program you are using to write its temporary files elsewhere.&lt;br /&gt;
&lt;br /&gt;
If the program looks for files in the current directory, copy your input files to a temporary directory first and copy the results back at the end of the job; otherwise your results will be deleted by the automated cleanup that runs after the job.&lt;br /&gt;
&lt;br /&gt;
The diskless nodes provide a ramdisk that can grow to at most half of the total RAM. Note that the files you create plus the memory requirement of your job must fit into the total memory.&lt;br /&gt;
&lt;br /&gt;
There are more diskless nodes than nodes with disks, so if your job can run on a diskless node, you should choose this option. &lt;br /&gt;
&lt;br /&gt;
Example job script requesting 700 GB of local scratch space and copying files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --time=00:14:00&lt;br /&gt;
#SBATCH --mem=1gb&lt;br /&gt;
#SBATCH --gres=scratch:700 &lt;br /&gt;
# copy input file&lt;br /&gt;
cp $HOME/inputfiles/myinput.inp $SCRATCH&lt;br /&gt;
# switch directory&lt;br /&gt;
cd $SCRATCH&lt;br /&gt;
echo &#039;Here starts the calculation&#039;&lt;br /&gt;
myprogram --input=$SCRATCH/myinput.inp&lt;br /&gt;
# calculation ends&lt;br /&gt;
# copy result&lt;br /&gt;
cp outfile.out results2.txt $HOME/resultdir/job12345&lt;br /&gt;
# clean up&lt;br /&gt;
rm myinput.inp outfile.out results2.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:12px; background:#cef2e0;  text-align:left&amp;quot; |&lt;br /&gt;
Software examples: Most [[Environment Modules|installed software]] comes with example job scripts. &amp;lt;br&amp;gt; To find them, e.g. for lammps: &amp;lt;code&amp;gt; module load chem/lammps; cd $LAMMPS_EXA_DIR; ls -la&amp;lt;/code&amp;gt;.&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Resource Requests ==&lt;br /&gt;
&lt;br /&gt;
Important resource request options for the Slurm command sbatch are:&lt;br /&gt;
&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Option !!  Slurm (sbatch)&lt;br /&gt;
|-&lt;br /&gt;
| #SBATCH|| Script directive&lt;br /&gt;
|-&lt;br /&gt;
| --time=&amp;lt;hh:mm:ss&amp;gt; (-t &amp;lt;hh:mm:ss&amp;gt;)|| Wall time limit&lt;br /&gt;
|-&lt;br /&gt;
| --job-name=&amp;lt;name&amp;gt;  (-J &amp;lt;name&amp;gt;)|| Job name&lt;br /&gt;
|-&lt;br /&gt;
| --nodes=&amp;lt;count&amp;gt; (-N &amp;lt;count&amp;gt;)|| Node count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks=&amp;lt;count&amp;gt; (-n &amp;lt;count&amp;gt;)|| Core count&lt;br /&gt;
|-&lt;br /&gt;
| --ntasks-per-node=&amp;lt;count&amp;gt;|| Process count per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem=&amp;lt;limit&amp;gt;|| Memory limit per node&lt;br /&gt;
|-&lt;br /&gt;
| --mem-per-cpu=&amp;lt;limit&amp;gt;|| Memory limit per process&lt;br /&gt;
|-&lt;br /&gt;
| --gres=gpu:&amp;lt;count&amp;gt;|| GPU count (gres = &amp;quot;generic resource&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| --gres=scratch:&amp;lt;count&amp;gt; || Disk space of &amp;lt;count&amp;gt; GB per requested task&lt;br /&gt;
|-&lt;br /&gt;
| --exclusive|| Node exclusive job&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Nodes and Cores&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Slurm provides a number of options to request nodes and cores.&lt;br /&gt;
Typically, using &amp;lt;code&amp;gt;--nodes=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--ntasks-per-node=&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; should work for all your jobs. For single-core jobs it is sufficient to use the option &amp;lt;code&amp;gt;--ntasks=1&amp;lt;/code&amp;gt;. Specifying only &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; may lead to Slurm distributing the tasks over more than one node, even if you requested only a small number of cores.&lt;br /&gt;
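&lt;br /&gt;
For example, the request for a job that fully occupies two nodes (48 cores each on JUSTUS 2) would look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=bash&amp;gt;&lt;br /&gt;
#SBATCH --nodes=2&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;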
&lt;br /&gt;
&#039;&#039;&#039;Memory&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Memory can be requested with either the option &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per node) or &amp;lt;code&amp;gt;--mem-per-cpu=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; (memory per process). When looking up the maximum available memory for a certain node type, subtract about 5 GB for the operating system. Specify the memory limit as a value-unit pair, for example 500mb or 8gb.&lt;br /&gt;
&lt;br /&gt;
In most cases it is preferable to use the &amp;lt;code&amp;gt;--mem=&amp;lt;limit&amp;gt;&amp;lt;/code&amp;gt; option.&lt;br /&gt;
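&lt;br /&gt;
A sketch of the two alternatives (the values are placeholders; use one or the other, not both):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=bash&amp;gt;&lt;br /&gt;
# memory per node:&lt;br /&gt;
#SBATCH --mem=8gb&lt;br /&gt;
# or, memory per process (do not combine with --mem):&lt;br /&gt;
##SBATCH --mem-per-cpu=2gb&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;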
&lt;br /&gt;
&#039;&#039;&#039;GPUs&#039;&#039;&#039; and &#039;&#039;&#039;Scratch&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
These are requested as &amp;quot;generic resources&amp;quot; with &amp;lt;code&amp;gt;--gres=gpu:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--gres=scratch:&amp;lt;count&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
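&lt;br /&gt;
For example, to request one GPU for a job:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=bash&amp;gt;&lt;br /&gt;
#SBATCH --gres=gpu:1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;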
&lt;br /&gt;
== Memory Limits ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;wait time of a job also depends largely on the amount of requested resources&#039;&#039;&#039; and the available number of nodes providing this amount of resources. This must be taken into account &#039;&#039;&#039;in particular when requesting a certain amount of memory&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
For example, there is a total of 692 compute nodes in JUSTUS, of which 456 nodes have 192 GB RAM. However, &#039;&#039;&#039;not the entire amount of physical RAM is available exclusively for user jobs&#039;&#039;&#039;, because the operating system, system services and local file systems also require a certain amount of RAM.&lt;br /&gt;
This means that if a job requests 192 GB RAM per node (i.e. --mem=192gb, or --ntasks-per-node=48 and --mem-per-cpu=4gb), Slurm will rule out the 456 nodes with 192 GB of physical RAM as unsuitable for this job and consider only 220 of the 692 nodes as eligible for running it.&lt;br /&gt;
&lt;br /&gt;
The following table provides an overview of how much memory can be allocated by user jobs on the various node types and how many nodes can serve this memory requirement:&lt;br /&gt;
&lt;br /&gt;
{| width=500px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Physical RAM on node !! Available RAM on node !! Number of suitable nodes &lt;br /&gt;
|-&lt;br /&gt;
| 192 GB || 187 GB || 692 &lt;br /&gt;
|-&lt;br /&gt;
| 384 GB || 376 GB || 220&lt;br /&gt;
|-&lt;br /&gt;
| 768 GB || 754 GB || 28&lt;br /&gt;
|-&lt;br /&gt;
| 1536 GB || 1510 GB || 8&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Also note that allocated memory is factored into resource usage accounting for fair share. This means over-requesting memory may have a negative impact on the priority of subsequent jobs.&lt;br /&gt;
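&lt;br /&gt;
To right-size future memory requests, you can check what a finished job actually used, for example with sacct (a sketch; 6260301 is the job ID from the example below):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=bash&amp;gt;&lt;br /&gt;
$ sacct -j 6260301 --format=JobID,ReqMem,MaxRSS,Elapsed&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;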
&lt;br /&gt;
= Testing Your Jobs = &lt;br /&gt;
&lt;br /&gt;
JUSTUS 2 has three compute nodes reserved for jobs with a walltime under 15 minutes. You can test whether your jobs start properly by specifying a short walltime, e.g. --time=00:14:00; such test jobs should start very quickly. &lt;br /&gt;
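&lt;br /&gt;
For example (&amp;lt;job-script&amp;gt; is a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=bash&amp;gt;$ sbatch --time=00:14:00 &amp;lt;job-script&amp;gt; &amp;lt;/source&amp;gt;&lt;br /&gt;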
&lt;br /&gt;
= Monitoring Your Jobs =&lt;br /&gt;
== squeue ==&lt;br /&gt;
&lt;br /&gt;
After you have submitted a job, you can see it waiting using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
(also read the man page with &amp;lt;code&amp;gt;man squeue&amp;lt;/code&amp;gt; for more information on how to use the command)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
             6260301  standard r_60_b_2 ul_yxz1 PD       0:00      1 (AssocGrpMemRunMinutes)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output shows: &lt;br /&gt;
* JOBID: a unique number assigned to your job&lt;br /&gt;
* PARTITION: the partition the job was routed to (the cluster is divided into different types of nodes)&lt;br /&gt;
* NAME: the name you gave your job with the --job-name= option&lt;br /&gt;
* USER: your username&lt;br /&gt;
* ST: the state the job is in: R = running, PD = pending, CD = completed. See the man page for a full list of states. &lt;br /&gt;
* TIME: how long the job has been running&lt;br /&gt;
* NODES: how many nodes were requested&lt;br /&gt;
* NODELIST(REASON): either the node(s) the job is running on, or the reason why it has not started yet&lt;br /&gt;
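&lt;br /&gt;
If the queue is long, you can restrict the output to your own jobs and request the long output format:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;shell&#039;&amp;gt;&lt;br /&gt;
&amp;gt; squeue -u $USER -l&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;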
&lt;br /&gt;
== scontrol ==&lt;br /&gt;
&lt;br /&gt;
You can show more information on one specific job using the &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; command, e.g. for the job with ID 6260301 listed above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show job 6260301&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for job with JobID 6260301&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol show jobs&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays detailed information for all your jobs&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
scontrol write batch_script 6260301 -&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
displays the job script of a running job. The &amp;quot;-&amp;quot; is a special filename which means &amp;quot;write to the terminal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
== Monitoring a Started Job ==&lt;br /&gt;
&lt;br /&gt;
After a job has started, you can ssh to the node(s) the job is running on, using the node name from NODELIST, e.g. if your job runs on n0603:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; ssh n0603 &lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
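&lt;br /&gt;
Once logged in, you can inspect your processes there, e.g. with top:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; top -u $USER&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;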
&lt;br /&gt;
= Partitions =&lt;br /&gt;
Job allocations at JUSTUS 2 are routed automatically to the most suitable compute node(s) that can provide the requested resources for the job (e.g. amount of cores, memory, local scratch space). This is to prevent fragmentation of the cluster system and to ensure most efficient usage of available compute resources. Thus, there is no point in requesting a partition in batch job scripts, i.e. users &#039;&#039;&#039;should not&#039;&#039;&#039; specify any partition &amp;quot;-p, --partition=&amp;lt;partition_name&amp;gt;&amp;quot; on job submission. This is of particular importance if you adapt job scripts from other cluster systems (e.g. bwUniCluster 2.0) to JUSTUS 2.&lt;br /&gt;
&lt;br /&gt;
= Job Priorities =&lt;br /&gt;
Job priorities at JUSTUS 2 depend on [https://slurm.schedmd.com/priority_multifactor.html multiple factors ]:&lt;br /&gt;
* Age: The amount of time a job has been waiting in the queue, eligible to be scheduled.&lt;br /&gt;
* Fairshare: The difference between the portion of the computing resource allocated to an association and the amount of resources that has been consumed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Notes:&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Jobs that are pending because the user reached one of the resource usage limits (see below) are not eligible to be scheduled and, thus, do not accrue priority by their age.  &lt;br /&gt;
&lt;br /&gt;
Fairshare does &#039;&#039;&#039;not&#039;&#039;&#039; introduce a fixed allotment, in the sense that a user&#039;s ability to run new jobs would be cut off as soon as a fixed target utilization is reached. Instead, the fairshare factor ensures that jobs from users who were under-served in the past are given higher priority than jobs from users who were over-served in the past. This keeps individual groups from monopolizing the resources in the long term at the expense of groups that have not used their fair share for quite some time.&lt;br /&gt;
&lt;br /&gt;
Slurm features &#039;&#039;&#039;backfilling&#039;&#039;&#039;, meaning that the scheduler will start lower priority jobs if doing so does not delay the expected start time of &#039;&#039;&#039;any&#039;&#039;&#039; higher priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are valuable for backfill scheduling to work well. This &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=161 video]&#039;&#039;&#039; gives an illustrative description to how backfilling works.&lt;br /&gt;
&lt;br /&gt;
In summary, an approximate model of Slurm&#039;s behavior for scheduling jobs is this:&lt;br /&gt;
&lt;br /&gt;
* Step 1: Can the job in position one (highest priority) start now?&lt;br /&gt;
* Step 2: If it can, remove it from the queue, start it and continue with step 1.&lt;br /&gt;
* Step 3: If it can not, look at next job.&lt;br /&gt;
* Step 4: Can it start now, without delaying the start time of any job before it in the queue?&lt;br /&gt;
* Step 5: If it can, remove it from the queue, start it, recalculate what nodes are free, look at next job and continue with step 4.&lt;br /&gt;
* Step 6: If it can not, look at next job, and continue with step 4.&lt;br /&gt;
&lt;br /&gt;
As soon as a new job is submitted and as soon as a job finishes, Slurm restarts its main scheduling cycle with step 1.&lt;br /&gt;
&lt;br /&gt;
= Usage Limits/Throttling Policies =&lt;br /&gt;
&lt;br /&gt;
While the fairshare factor ensures fair long term balance of resource utilization between users and groups, there are additional usage limits that constrain the total cumulative resources at a given time. This is to prevent individual users from short term monopolizing large fractions of the whole cluster system.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum walltime&#039;&#039;&#039; for a job is &#039;&#039;&#039;14 days&#039;&#039;&#039; (336 hours)&lt;br /&gt;
  --time=336:00:00 or --time=14-0&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum amount of cores&#039;&#039;&#039; used at any given time by running jobs is &#039;&#039;&#039;1920&#039;&#039;&#039; per user (aggregated over all running jobs). This translates to 40 nodes. An equivalent limit for allocated memory also applies. If this limit is reached, new jobs will be queued (with REASON: AssocGrpCpuLimit) but are only allowed to run after resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
* The maximum amount of &#039;&#039;&#039;remaining allocated core-minutes&#039;&#039;&#039; per user is &#039;&#039;&#039;3300000&#039;&#039;&#039; (aggregated over all running jobs). For example, if a user has a 4-core job running that will complete in 1 hour and a 2-core job that will complete in 6 hours, this translates to 4 * 1 * 60 + 2 * 6 * 60 = 16 * 60 = 960 remaining core-minutes. Once a user reaches the limit, no more jobs are allowed to start (REASON: AssocGrpCPURunMinutesLimit). As the jobs continue to run, the remaining core time will decrease and eventually allow more jobs to start in a staggered way. This limit also &#039;&#039;&#039;correlates the maximum walltime and amount of cores that can be allocated&#039;&#039;&#039; for this amount of time. Thus, shorter walltimes for the jobs allow more resources to be allocated at a given time (but capped by the maximum amount of cores limit above). Watch this &#039;&#039;&#039;[https://youtu.be/OKhWwem1XZg?t=306 video]&#039;&#039;&#039; for an illustrative description. An equivalent limit applies for remaining time of memory allocation in which case jobs may be held back from starting with REASON AssocGrpMemRunMinutes.&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;&#039;maximum amount of GPUs&#039;&#039;&#039; allocated by running jobs is &#039;&#039;&#039;8&#039;&#039;&#039; per user. If this limit is reached, new jobs will be queued (with REASON: AssocGrpGRES) but are only allowed to run after GPU resources have been relinquished. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Usage limits are subject to change.&lt;br /&gt;
&lt;br /&gt;
= Considerations on Efficiency / Special Use Cases =&lt;br /&gt;
&lt;br /&gt;
When we speak of poor job efficiency, we usually mean that hardware resources are wasted.&lt;br /&gt;
That is, a similar overall result could have been achieved using fewer hardware resources, leaving those for other jobs and reducing the wait time for you and everyone else.&lt;br /&gt;
&lt;br /&gt;
Some simple causes for poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
*    poor choice of resources compared to the size of the nodes leaves part of a node blocked, but doing nothing:&lt;br /&gt;
** --ntasks-per-node is not an integer divisor of the number of cores of a node (see the section on the node access policy below)&lt;br /&gt;
** too much (un-needed) memory or disk space requested&lt;br /&gt;
*    more cores requested than are actually used by the job&lt;br /&gt;
*    more cores used for a single mpi/openmp parallel computation than useful&lt;br /&gt;
*    many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
*    one-core jobs with very different run-times (because of single-user policy)&lt;br /&gt;
&lt;br /&gt;
== Many One-Core Jobs ==&lt;br /&gt;
&lt;br /&gt;
One-core jobs can cause major inefficiencies in node usage:&lt;br /&gt;
&lt;br /&gt;
# You submit 1000 jobs, each of which runs for ~30 s. A job needs up to 30 s to start and finish - a huge waste if the calculation itself only takes 30 seconds. Additionally, starting and finishing so many jobs in a short time puts strain on the Slurm scheduler, may cause severe problems for everyone, and clutters the Slurm job database. &lt;br /&gt;
# You submit many one-core jobs with very different run times. The jobs start on many nodes, but at some point all quicker jobs have finished their calculation and only a few remain. Because of the single-user policy on JUSTUS 2, jobs of other users cannot fill the gaps, and the rest of each node sits idle. &lt;br /&gt;
&lt;br /&gt;
To address the problem, you can reduce the number of jobs and/or the number of nodes used.&lt;br /&gt;
&lt;br /&gt;
To limit the number of jobs, start many calculations within one job (addresses problems 1 and 2):&lt;br /&gt;
&lt;br /&gt;
* use a bash loop in your job script&lt;br /&gt;
* use the program GNU parallel to start the processes for you&lt;br /&gt;
&lt;br /&gt;
To limit only the number of nodes used:&lt;br /&gt;
&lt;br /&gt;
* use array jobs&lt;br /&gt;
&lt;br /&gt;
=== Bash Loop ===&lt;br /&gt;
&lt;br /&gt;
One advantage of this method is that you can run more threads than cores if your calculations are really short and do not use too much RAM, and in this way keep all cores busy even while many calculations are still starting up.&lt;br /&gt;
&lt;br /&gt;
It is of course even better if you can combine such short calculations so that the kernel does not need to start 1000 processes for 1000 calculations, each of which has to initialize everything. &lt;br /&gt;
&lt;br /&gt;
This example uses pgrep to count how many of the calculation processes are running: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks-per-node=48&lt;br /&gt;
#SBATCH --time=00:10:00&lt;br /&gt;
#SBATCH --mem=100gb&lt;br /&gt;
 &lt;br /&gt;
for i in {1..200}&lt;br /&gt;
do&lt;br /&gt;
  echo starting up $i&lt;br /&gt;
  bash my_calculation $i &amp;amp;&lt;br /&gt;
  # throttle: wait while more than 48 instances are running&lt;br /&gt;
  while [ $(pgrep -c -f my_calculation) -gt 48 ] ; do echo sleeping; sleep 5; done&lt;br /&gt;
done&lt;br /&gt;
# wait for the remaining background processes to finish&lt;br /&gt;
wait&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
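An alternative sketch tracks the PIDs of the background processes and prunes the finished ones; here &amp;lt;code&amp;gt;sleep &amp;quot;$i&amp;quot;&amp;lt;/code&amp;gt; stands in for a real calculation, and at most 8 processes run at once:&lt;br /&gt;
&lt;br /&gt;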
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
running_jobs=()&lt;br /&gt;
&lt;br /&gt;
for i in {1..200}; do&lt;br /&gt;
  echo &amp;quot;Starting job $i&amp;quot;&lt;br /&gt;
  sleep &amp;quot;$i&amp;quot; &amp;amp;  # placeholder for a real calculation&lt;br /&gt;
  running_jobs+=($!)  # track the PID of the background process&lt;br /&gt;
&lt;br /&gt;
  while [ &amp;quot;${#running_jobs[@]}&amp;quot; -ge 8 ]; do&lt;br /&gt;
    sleep 2 # adjust duration depending on your runtime&lt;br /&gt;
    echo running_jobs: ${running_jobs[@]} &lt;br /&gt;
    echo pid-out: $(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null | xargs)&lt;br /&gt;
    echo -----&lt;br /&gt;
    running_jobs=($(ps -o pid= -p &amp;quot;${running_jobs[@]}&amp;quot; 2&amp;gt;/dev/null))  # Remove finished jobs&lt;br /&gt;
  done&lt;br /&gt;
done&lt;br /&gt;
&lt;br /&gt;
wait  # Ensure all jobs complete&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may not be able to use just an index number &amp;quot;i&amp;quot; to start many calculations. In this case, for a moderate number of files, the for loop can be used to read in config files. Here is just the general idea for the loop, without the surrounding job script:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
for config in config-1980-03-01_1/*; do&lt;br /&gt;
  mycalculation -config &amp;quot;$config&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
This loops over all files in the directory config-1980-03-01_1/ and passes each of them as an input file to &amp;quot;mycalculation&amp;quot; via a hypothetical &amp;quot;-config&amp;quot; option. Adding a date to the config directories (and outputs) makes it easier to track different runs in your lab journal. &lt;br /&gt;
&lt;br /&gt;
=== Gnu Parallel ===&lt;br /&gt;
&lt;br /&gt;
GNU Parallel is available on the HPC cluster and comes with its own set of examples; you can access them like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ module load system/parallel&lt;br /&gt;
$ cp $PARALLEL_EXA_DIR/parallel.slurm .&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
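&lt;br /&gt;
A minimal usage sketch (my_calculation is the placeholder script from the examples above): run 200 calculations with at most 48 running at a time, one per core:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ parallel -j 48 bash my_calculation {} ::: {1..200}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;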
=== Array Jobs ===&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;$ sbatch -a 1-500%48 batch_script&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will submit 500 tasks to be executed, each one indexed by SLURM_ARRAY_TASK_ID ranging from 1 to 500, but will limit the number of simultaneously running tasks from this job array to 48 (the number of cores on a JUSTUS 2 node).&lt;br /&gt;
&lt;br /&gt;
The same can be done inside the job script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Number of cores per individual array task&lt;br /&gt;
#SBATCH --ntasks-per-node=1&lt;br /&gt;
#SBATCH --array=1-500%48&lt;br /&gt;
#SBATCH --mem=3G&lt;br /&gt;
#SBATCH --time=1:10:00&lt;br /&gt;
#SBATCH --job-name=array_job&lt;br /&gt;
#SBATCH --output=array_job-%A_%a.out&lt;br /&gt;
#SBATCH --error=array_job-%A_%a.err&lt;br /&gt;
 &lt;br /&gt;
# Print the task id.&lt;br /&gt;
echo &amp;quot;My SLURM_ARRAY_TASK_ID: &amp;quot; $SLURM_ARRAY_TASK_ID&lt;br /&gt;
 &lt;br /&gt;
export  TIMEFORMAT=%R ; &lt;br /&gt;
time bash mycalculation $SLURM_ARRAY_TASK_ID&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
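&lt;br /&gt;
If your calculations are driven by config files rather than an index, the array task ID can be mapped to a file name inside the script; a sketch (directory and program names as in the loop example above):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# pick the n-th config file in the directory, n = array task ID&lt;br /&gt;
config=$(ls config-1980-03-01_1/ | sed -n &amp;quot;${SLURM_ARRAY_TASK_ID}p&amp;quot;)&lt;br /&gt;
mycalculation -config &amp;quot;config-1980-03-01_1/$config&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;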
&lt;br /&gt;
Also see: &lt;br /&gt;
* Slurm-Howto entry: [[BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_submit_an_array_job?]]&lt;br /&gt;
* Schedmd documentations on Job Arrays: https://slurm.schedmd.com/job_array.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Default Values ==&lt;br /&gt;
&lt;br /&gt;
Default values for jobs are:&lt;br /&gt;
&lt;br /&gt;
* Runtime: --time=02:00:00 (2 hours)&lt;br /&gt;
* Nodes: --nodes=1 (one node)&lt;br /&gt;
* Tasks: --ntasks-per-node=1 (one task per node)&lt;br /&gt;
* Cores: --cpus-per-task=1 (one core per task)&lt;br /&gt;
* Memory: --mem-per-cpu=2gb (2 GB per core)&lt;br /&gt;
&lt;br /&gt;
== Node Access Policy ==&lt;br /&gt;
&lt;br /&gt;
The node access policy for jobs is &amp;quot;&#039;&#039;&#039;exclusive user&#039;&#039;&#039;&amp;quot;: nodes are allocated exclusively to a single user at a time, but &#039;&#039;&#039;multiple jobs (up to 48) of the same user can run on a single node&#039;&#039;&#039; at any time.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; This implies that for &#039;&#039;&#039;sub-node jobs&#039;&#039;&#039;, it is advisable for efficient resource utilization and maximum job throughput to &#039;&#039;&#039;adjust the number of cores to be an integer divisor of 48&#039;&#039;&#039; (total number of cores on each node). For example, two 24-core jobs can run simultaneously on one and the same node, while two 32-core jobs will always have to allocate two separate nodes, but leave 16 cores unused on each of them. Users must therefore always &#039;&#039;&#039;think carefully about how many cores to request&#039;&#039;&#039; and whether their applications really benefit from allocating more cores for their jobs. Similar considerations apply - at the same time - to the &#039;&#039;&#039;requested amount of memory per job&#039;&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Think of it as the scheduler playing a game of multi-dimensional Tetris, where the dimensions are number of cores, amount of memory and other resources. &#039;&#039;&#039;Users can support this by making resource allocations that allow the scheduler to pack their jobs as densely as possible on the nodes&#039;&#039;&#039;.&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Julia&amp;diff=13551</id>
		<title>JUSTUS2/Software/Julia</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=JUSTUS2/Software/Julia&amp;diff=13551"/>
		<updated>2025-01-07T08:56:08Z</updated>

		<summary type="html">&lt;p&gt;M Carmesin: /* Further documentation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Softwarepage|math/julia}}&lt;br /&gt;
&lt;br /&gt;
{| width=600px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Description !! Content&lt;br /&gt;
|-&lt;br /&gt;
| module load&lt;br /&gt;
| math/julia&lt;br /&gt;
|-&lt;br /&gt;
| Availability&lt;br /&gt;
| [[bwUniCluster]] &amp;amp;#124; [[JUSTUS2]]&lt;br /&gt;
|-&lt;br /&gt;
| License&lt;br /&gt;
| MIT License&lt;br /&gt;
|-&lt;br /&gt;
|Citing&lt;br /&gt;
| [https://github.com/JuliaLang/julia/blob/master/CITATION.bib]&lt;br /&gt;
|-&lt;br /&gt;
| Links&lt;br /&gt;
| [https://julialang.org/ Project homepage] &amp;amp;#124; [https://docs.julialang.org/en/v1/ Documentation]&lt;br /&gt;
|-&lt;br /&gt;
| Graphical Interface&lt;br /&gt;
| No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Julia is a high-level, high-performance, dynamic programming language designed with scientific computing in mind. Parallel programming features, such as multi-threading, are included in the core language, while there also exist packages leveraging the power of MPI and CUDA.&lt;br /&gt;
&lt;br /&gt;
There are no packages preinstalled besides the Julia language core; please use the Julia package manager to install any required Julia package.&lt;br /&gt;
&lt;br /&gt;
The Julia module on Justus loads suitable versions of CUDA and OpenMPI, and the corresponding Julia packages CUDA.jl and MPI.jl will automatically be configured to use these libraries once they are installed by the user. Any changes, such as loading modules with different MPI and/or CUDA versions or using the versions that come as Julia artifacts, are likely to lead to errors.   &lt;br /&gt;
&lt;br /&gt;
== Environments and Package Installation ==&lt;br /&gt;
&lt;br /&gt;
It is highly recommended to use a separate Julia environment for every project. If Julia is started with the option &amp;lt;code&amp;gt;--project=.&amp;lt;/code&amp;gt;, the current folder will be used as the environment, and the &amp;lt;code&amp;gt;Project.toml&amp;lt;/code&amp;gt; file containing the information on the installed packages will be created if not yet present. &lt;br /&gt;
&lt;br /&gt;
In an interactive Julia session, the [https://pkgdocs.julialang.org/v1/getting-started/#Basic-Usage package manager] is activated by entering &amp;lt;code&amp;gt;]&amp;lt;/code&amp;gt;. The most important commands are&lt;br /&gt;
* &amp;lt;code&amp;gt;add PACKAGENAME&amp;lt;/code&amp;gt;: install the package PACKAGENAME in the current environment &lt;br /&gt;
* &amp;lt;code&amp;gt;instantiate&amp;lt;/code&amp;gt;: install all packages with dependencies as stated in Project.toml and Manifest.toml, e.g. after copying the existing code to the cluster&lt;br /&gt;
* &amp;lt;code&amp;gt;activate PATH_TO_ENV&amp;lt;/code&amp;gt;: use the environment located at the path &amp;lt;code&amp;gt;PATH_TO_ENV&amp;lt;/code&amp;gt; and initialize it, if necessary.&lt;br /&gt;
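&lt;br /&gt;
The package operations can also be run non-interactively, which is useful in batch jobs; a minimal sketch, assuming the environment lives in the current directory:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ julia --project=. -e &#039;using Pkg; Pkg.instantiate()&#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;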
&lt;br /&gt;
&lt;br /&gt;
== Interactive Example ==&lt;br /&gt;
&lt;br /&gt;
Load the Julia module and start an interactive REPL session with 8 threads, using the environment in the current directory:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load math/julia&lt;br /&gt;
$ julia -t 8 --project=.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Enter &#039;]&#039; to enter the package manager and install the package [https://github.com/JuliaPlots/UnicodePlots.jl?tab=readme-ov-file &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt;].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
add UnicodePlots&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Leave the package manager with the backspace key.&lt;br /&gt;
&lt;br /&gt;
Create a vector with 64 elements set to 0 and fill it, using all 8 threads, with the corresponding thread id number.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
vec = zeros(64)&lt;br /&gt;
Threads.@threads for i in eachindex(vec)&lt;br /&gt;
    vec[i]= Threads.threadid()&lt;br /&gt;
end&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Load the &amp;lt;code&amp;gt;UnicodePlots&amp;lt;/code&amp;gt; package and draw a scatter plot of the contents of &amp;lt;code&amp;gt;vec&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
using UnicodePlots&lt;br /&gt;
scatterplot(vec)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Further documentation ==&lt;br /&gt;
&lt;br /&gt;
* [https://modernjuliaworkflows.org Modern Julia Workflows]: A collection of best practices &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/carstenbauer/JuliaHLRS24 Julia Workshop at HLRS]: The material of this workshop is in large parts also valid for the Justus cluster (on Justus you only need the module math/julia).&lt;br /&gt;
&lt;br /&gt;
== Tips &amp;amp; Tricks ==&lt;br /&gt;
&lt;br /&gt;
* [[JUSTUS2/Software/Julia/Parallel_Programming|Parallel Programming]]&lt;/div&gt;</summary>
		<author><name>M Carmesin</name></author>
	</entry>
</feed>