Running Calculations

From bwHPC Wiki
← This page is used in the [[HPC Glossary]] to explain the terms "Batch Scheduler" and "Batch System"
== Life Cycle of a Calculation (Job) ==
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]
On your desktop computer, a calculation starts immediately when you launch it, runs until it is finished, and the machine then sits mostly idle until you start the next one. A [[compute cluster]] has several hundred, maybe a thousand computers ([[compute node]]s), most of which are busy most of the time, and many people want to run a great number of calculations. Running your job therefore involves some extra steps:


# Prepare a [[script]] (a set of commands to run, usually a shell script) with all the commands that are necessary to run your calculation from start to finish. In addition, this ''[[batch script]]'' has a header section in which you specify details like required [[compute core]]s (the processing units within a computer), [[estimated runtime]], [[memory requirements]], disk space needed, etc.
# ''[[Submit]]'' the script into a [[queue]].
# Queueing: your ''[[job]]'' (calculation) waits in line with other compute jobs until the resources you requested in the header become available.
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit.
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.
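The life cycle above can be sketched as a minimal Slurm batch script. The resource values and the program name are illustrative placeholders; the <code>#SBATCH</code> header lines are read by Slurm but are ordinary comments to the shell, so this sketch also runs on a plain machine:

```shell
#!/bin/bash
# --- Header: resource requests for the batch system ---------------
#SBATCH --job-name=my_calc      # name shown in the queue
#SBATCH --ntasks=1              # requested compute cores
#SBATCH --time=02:00:00         # estimated runtime (hh:mm:ss)
#SBATCH --mem=4gb               # memory requirement

# --- Commands that run the calculation from start to finish -------
echo "calculation started"
result=$((6 * 7))               # stand-in for the real program,
                                # e.g. ./my_program input.dat
echo "result: $result" > output.txt

# --- Save results (path commented out to keep the sketch local) ---
# cp output.txt "$HOME/results/"
echo "calculation finished"
```

Submitted with <code>sbatch</code>, Slurm would queue this script and run it on a compute node once one core, 4 GB of memory, and a two-hour slot become available.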


The software that distributes jobs onto the compute nodes is called a '''[[batch system]]''' or '''batch scheduler'''. The batch system currently used on bwHPC clusters is [[Slurm]].


Learn more about the functioning of job distribution in


→ '''[[batch system]]'''
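For Slurm, the submit and wait steps correspond to a handful of commands. The job ID below is illustrative, and the sketch is guarded so it can also run on a machine where Slurm is not installed:

```shell
#!/bin/bash
if command -v sbatch >/dev/null 2>&1; then
    sbatch job.sh        # submit the batch script; Slurm prints the job ID
    squeue -u "$USER"    # list your jobs that are queued or running
    scancel 12345        # cancel a job by its ID (12345 is illustrative)
    status="slurm"
else
    # e.g. on a desktop machine without a batch system
    echo "Slurm is not installed here"
    status="no-slurm"
fi
```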


== Example Jobs ==


For most software that a bwHPC project installed on the cluster, we have prepared an example job script that runs a sample calculation with that exact software.


How to access these examples is described in the "Software job examples" section of the page


→ '''[[Environment Modules]]'''
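Loading such an example typically goes through the module system on a login node. The module name below is a hypothetical placeholder, and the sketch is guarded so it also runs where no module system is present:

```shell
#!/bin/bash
if command -v module >/dev/null 2>&1; then
    module avail                 # list the software modules installed
    module load chem/gaussian    # hypothetical module name
    module list                  # show what is currently loaded
    status="modules"
else
    echo "no module system on this machine"
    status="none"
fi
```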


== Link to Batch System per Cluster ==


Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:


→ '''[[BwUniCluster3.0/Running_Jobs|Slurm bwUniCluster 3.0]]'''


→ '''[[JUSTUS2/Jobscripts: Running Your Calculations | Slurm JUSTUS 2]]'''


→ '''[[Helix/Slurm | Slurm Helix]]'''
→ '''[[NEMO2/Slurm | Slurm NEMO2]]'''

→ '''[[BinAC2/Slurm | Slurm BinAC2]]'''
== How to Use Computing Resources Efficiently ==
When you are running your calculations, you will have to decide how many compute cores your calculation should use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead.


How to find a reasonable number of compute cores to use for your calculation is described under
→ '''[[Scaling]]'''
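As a toy illustration of this trade-off, the sketch below models a 3600-second serial calculation split over N cores, where (as an assumption) every piece adds 60 seconds of coordination overhead; beyond a certain core count the total runtime grows again:

```shell
#!/bin/bash
serial=3600    # runtime of the undivided calculation, in seconds
overhead=60    # assumed extra cost per piece (communication, bookkeeping)
for cores in 1 2 4 8 16 32 64; do
    # each core works serial/cores seconds, plus overhead per piece
    runtime=$(( serial / cores + overhead * cores ))
    echo "cores=$cores total=${runtime}s"
done
# in this model the minimum is at 8 cores (930 s); 64 cores is
# slower than running on a single core
```

Real scaling behaviour depends on the program and the problem size; the [[Scaling]] page describes how to measure it for your own calculation.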


Running calculations on an HPC node consumes a lot of energy. To make the most of the available resources and keep cluster and energy use as efficient as possible, please also see our advice for


→ '''[[Energy Efficient Cluster Usage]]'''

Latest revision as of 11:30, 15 July 2025