<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.bwhpc.de/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=J+Steuer</id>
	<title>bwHPC Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.bwhpc.de/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=J+Steuer"/>
	<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/e/Special:Contributions/J_Steuer"/>
	<updated>2026-04-12T02:25:27Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.17</generator>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Main_Page&amp;diff=12815</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Main_Page&amp;diff=12815"/>
		<updated>2024-06-18T10:28:39Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;span style=&amp;quot;font-size:140%;&amp;quot;&amp;gt;&#039;&#039;&#039;Welcome to the bwHPC Wiki.&#039;&#039;&#039;&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
bwHPC represents services and resources in the State of &#039;&#039;&#039;B&#039;&#039;&#039;aden-&#039;&#039;&#039;W&#039;&#039;&#039;ürttemberg, Germany, for High Performance Computing (&#039;&#039;&#039;HPC&#039;&#039;&#039;), Data Intensive Computing (&#039;&#039;&#039;DIC&#039;&#039;&#039;) and Large Scale Scientific Data Management (&#039;&#039;&#039;LS2DM&#039;&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
The main bwHPC web page is at [https://www.bwhpc.de/ https://www.bwhpc.de/].&lt;br /&gt;
&lt;br /&gt;
Many topics depend on the cluster system you use. &lt;br /&gt;
First choose the cluster you use,  then select the correct topic.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot; background:#eeeefe; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Courses / eLearning&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [https://training.bwhpc.de/ eLearning and Online Courses]&lt;br /&gt;
* [https://hpc-wiki.info/hpc/Introduction_to_Linux_in_HPC Introduction to Linux in HPC (external resource)]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Need Access to a Cluster?&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Registration]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | HPC System Specific Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
bwHPC Clusters are dedicated to [https://www.bwhpc.de/bwhpc-domains.php specific research domains].  &lt;br /&gt;
Documentation differs between compute clusters; please see the cluster-specific overview pages:&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[BwUniCluster2.0|bwUniCluster 2.0]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | General Purpose, Teaching&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[:JUSTUS2| bwForCluster JUSTUS 2]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Theoretical Chemistry, Condensed Matter Physics, and Quantum Sciences&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot; | [[Helix|bwForCluster Helix]]&lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  |   Structural and Systems Biology, Medical Science, Soft Matter, Computational Humanities, and Mathematics and Computer Science&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[NEMO|bwForCluster NEMO]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Neurosciences, Particle Physics, Materials Science, and Microsystems Engineering&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[BinAC|bwForCluster BinAC]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Bioinformatics, Geosciences and Astrophysics. &lt;br /&gt;
|}&lt;br /&gt;
|-&lt;br /&gt;
|bwHPC Clusters: [https://www.bwhpc.de/cluster.php operational status] &lt;br /&gt;
Further Compute Clusters in Baden-Württemberg (separate access policies):&lt;br /&gt;
* bwHPC tier 1: [https://kb.hlrs.de/platforms/index.php/HPE_Hawk Hawk] ([https://www.hlrs.de/solutions-services/academic-users/ getting access])&lt;br /&gt;
* bwHPC tier 2: [https://www.nhr.kit.edu/userdocs/horeka HoreKa] ([https://www.nhr.kit.edu/userdocs/horeka/projects/ getting access])&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Documentation valid for all Clusters&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Environment Modules| Software Modules]] and software documentation explained&lt;br /&gt;
* [https://www.bwhpc.de/software.html List of Software] on all clusters&lt;br /&gt;
* [[Development| Software Development and Parallel Programming]]&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
* [[HPC Glossary]]&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;height:100%; background:#ffeaef; width:100%&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#f5dfdf; font-size:120%; font-weight:bold;  text-align:left&amp;quot;   | Scientific Data Storage&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
For user guides of the scientific data storage services:&lt;br /&gt;
* [[SDS@hd]]&lt;br /&gt;
* [https://www.rda.kit.edu/english bwDataArchive]&lt;br /&gt;
* [https://zas.bwsfs.uni-tuebingen.de/info/uebersicht bwSFS]&lt;br /&gt;
Associated, but local scientific storage services are:&lt;br /&gt;
* [https://wiki.scc.kit.edu/lsdf/index.php/Category:LSDF_Online_Storage LSDF Online Storage] (only for KIT and KIT partners)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;height:100%; background:#ffeaef; width:100%&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#f5dfdf; font-size:120%; font-weight:bold;  text-align:left&amp;quot;   | Data Management&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Data Transfer|Data Transfer]]&lt;br /&gt;
* [https://www.forschungsdaten.org/index.php/FDM-Kontakte#Deutschland Research Data Management (RDM)] contact persons&lt;br /&gt;
* [https://www.forschungsdaten.info Portal for Research Data Management] (Forschungsdaten.info)&lt;br /&gt;
|}&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Support is provided by the [https://www.bwhpc.de/teams.php bwHPC Competence Centers]:&lt;br /&gt;
* [https://bw-support.scc.kit.edu/ Submit a Ticket]&lt;br /&gt;
* Extended Support via [https://zas.bwhpc.de/en/zas_info_tigerteamsupport.php &amp;quot;Tiger Teams&amp;quot;]&lt;br /&gt;
|}&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Acknowledgement&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* Please [[Acknowledgement|acknowledge]] our resources in your publications.&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Main_Page&amp;diff=12814</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Main_Page&amp;diff=12814"/>
		<updated>2024-06-18T10:28:15Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;span style=&amp;quot;font-size:140%;&amp;quot;&amp;gt;&#039;&#039;&#039;Welcome to the bwHPC Wiki.&#039;&#039;&#039;&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
bwHPC represents services and resources in the State of &#039;&#039;&#039;B&#039;&#039;&#039;aden-&#039;&#039;&#039;W&#039;&#039;&#039;ürttemberg, Germany, for High Performance Computing (&#039;&#039;&#039;HPC&#039;&#039;&#039;), Data Intensive Computing (&#039;&#039;&#039;DIC&#039;&#039;&#039;) and Large Scale Scientific Data Management (&#039;&#039;&#039;LS2DM&#039;&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
The main bwHPC web page is at [https://www.bwhpc.de/ https://www.bwhpc.de/].&lt;br /&gt;
&lt;br /&gt;
Many topics depend on the cluster system you use. &lt;br /&gt;
First choose the cluster you use,  then select the correct topic.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot; background:#eeeefe; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Courses / eLearning&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [https://training.bwhpc.de/ eLearning and Online Courses]&lt;br /&gt;
* [https://hpc-wiki.info/hpc/Introduction_to_Linux_in_HPC Introduction to Linux in HPC (external resource)]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Need Access to a Cluster?&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Registration]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | HPC System Specific Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
bwHPC Clusters are dedicated to [https://www.bwhpc.de/bwhpc-domains.php specific research domains].  &lt;br /&gt;
Documentation differs between compute clusters; please see the cluster-specific overview pages:&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[BwUniCluster2.0|bwUniCluster 2.0]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | General Purpose, Teaching&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[:JUSTUS2| bwForCluster JUSTUS 2]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Theoretical Chemistry, Condensed Matter Physics, and Quantum Sciences&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot; | [[Helix|bwForCluster Helix]]&lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  |   Structural and Systems Biology, Medical Science, Soft Matter, Computational Humanities, and Mathematics and Computer Science&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[NEMO|bwForCluster NEMO]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Neurosciences, Particle Physics, Materials Science, and Microsystems Engineering&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[BinAC|bwForCluster BinAC]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Bioinformatics, Geosciences and Astrophysics. &lt;br /&gt;
|}&lt;br /&gt;
|-&lt;br /&gt;
|bwHPC Clusters: [https://www.bwhpc.de/cluster.php operational status] &lt;br /&gt;
Further Compute Clusters in Baden-Württemberg (separate access policies):&lt;br /&gt;
* bwHPC tier 1: [https://kb.hlrs.de/platforms/index.php/HPE_Hawk Hawk] ([https://www.hlrs.de/solutions-services/academic-users/ getting access])&lt;br /&gt;
* bwHPC tier 2: [https://www.nhr.kit.edu/userdocs/horeka HoreKa] ([https://www.nhr.kit.edu/userdocs/horeka/projects/ getting access])&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Documentation valid for all Clusters&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Environment Modules| Software Modules]] and software documentation explained&lt;br /&gt;
* [https://www.bwhpc.de/software.html List of Software] on all clusters&lt;br /&gt;
* [[Development| Software Development and Parallel Programming]]&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
* [[HPC Glossary]]&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;height:100%; background:#ffeaef; width:100%&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#f5dfdf; font-size:120%; font-weight:bold;  text-align:left&amp;quot;   | Scientific Data Storage&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
For user guides of the scientific data storage services:&lt;br /&gt;
* [[SDS@hd]]&lt;br /&gt;
* [https://www.rda.kit.edu/english bwDataArchive]&lt;br /&gt;
* [https://zas.bwsfs.uni-tuebingen.de/info/uebersicht bwSFS]&lt;br /&gt;
Associated, but local scientific storage services are:&lt;br /&gt;
* [https://wiki.scc.kit.edu/lsdf/index.php/Category:LSDF_Online_Storage LSDF Online Storage] (only for KIT and KIT partners)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;height:100%; background:#ffeaef; width:100%&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#f5dfdf; font-size:120%; font-weight:bold;  text-align:left&amp;quot;   | Data Management&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Data Transfer|Data Transfer]]&lt;br /&gt;
* [https://www.forschungsdaten.org/index.php/FDM-Kontakte#Deutschland Research Data Management (RDM)] contact persons&lt;br /&gt;
* [https://www.forschungsdaten.info Portal for Research Data Management] (Forschungsdaten.info)&lt;br /&gt;
|}&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Support is provided by the [https://www.bwhpc.de/teams.php bwHPC Competence Centers]:&lt;br /&gt;
* [https://bw-support.scc.kit.edu/ Submit a Ticket]&lt;br /&gt;
* Extended Support via [https://zas.bwhpc.de/en/zas_info_tigerteamsupport.php &amp;quot;Tiger Teams&amp;quot;]&lt;br /&gt;
|}&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Acknowledgement&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* Please [[Acknowledgement|acknowledge]] our resources in your publications.]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Main_Page&amp;diff=12813</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Main_Page&amp;diff=12813"/>
		<updated>2024-06-18T10:27:43Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;span style=&amp;quot;font-size:140%;&amp;quot;&amp;gt;&#039;&#039;&#039;Welcome to the bwHPC Wiki.&#039;&#039;&#039;&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
bwHPC represents services and resources in the State of &#039;&#039;&#039;B&#039;&#039;&#039;aden-&#039;&#039;&#039;W&#039;&#039;&#039;ürttemberg, Germany, for High Performance Computing (&#039;&#039;&#039;HPC&#039;&#039;&#039;), Data Intensive Computing (&#039;&#039;&#039;DIC&#039;&#039;&#039;) and Large Scale Scientific Data Management (&#039;&#039;&#039;LS2DM&#039;&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
The main bwHPC web page is at [https://www.bwhpc.de/ https://www.bwhpc.de/].&lt;br /&gt;
&lt;br /&gt;
Many topics depend on the cluster system you use. &lt;br /&gt;
First choose the cluster you use,  then select the correct topic.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot; background:#eeeefe; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Courses / eLearning&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [https://training.bwhpc.de/ eLearning and Online Courses]&lt;br /&gt;
* [https://hpc-wiki.info/hpc/Introduction_to_Linux_in_HPC Introduction to Linux in HPC (external resource)]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Need Access to a Cluster?&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Registration]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | HPC System Specific Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
bwHPC Clusters are dedicated to [https://www.bwhpc.de/bwhpc-domains.php specific research domains].  &lt;br /&gt;
Documentation differs between compute clusters; please see the cluster-specific overview pages:&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[BwUniCluster2.0|bwUniCluster 2.0]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | General Purpose, Teaching&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[:JUSTUS2| bwForCluster JUSTUS 2]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Theoretical Chemistry, Condensed Matter Physics, and Quantum Sciences&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot; | [[Helix|bwForCluster Helix]]&lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  |   Structural and Systems Biology, Medical Science, Soft Matter, Computational Humanities, and Mathematics and Computer Science&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[NEMO|bwForCluster NEMO]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Neurosciences, Particle Physics, Materials Science, and Microsystems Engineering&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[BinAC|bwForCluster BinAC]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Bioinformatics, Geosciences and Astrophysics. &lt;br /&gt;
|}&lt;br /&gt;
|-&lt;br /&gt;
|bwHPC Clusters: [https://www.bwhpc.de/cluster.php operational status] &lt;br /&gt;
Further Compute Clusters in Baden-Württemberg (separate access policies):&lt;br /&gt;
* bwHPC tier 1: [https://kb.hlrs.de/platforms/index.php/HPE_Hawk Hawk] ([https://www.hlrs.de/solutions-services/academic-users/ getting access])&lt;br /&gt;
* bwHPC tier 2: [https://www.nhr.kit.edu/userdocs/horeka HoreKa] ([https://www.nhr.kit.edu/userdocs/horeka/projects/ getting access])&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Documentation valid for all Clusters&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Environment Modules| Software Modules]] and software documentation explained&lt;br /&gt;
* [https://www.bwhpc.de/software.html List of Software] on all clusters&lt;br /&gt;
* [[Development| Software Development and Parallel Programming]]&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
* [[HPC Glossary]]&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;height:100%; background:#ffeaef; width:100%&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#f5dfdf; font-size:120%; font-weight:bold;  text-align:left&amp;quot;   | Scientific Data Storage&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
For user guides of the scientific data storage services:&lt;br /&gt;
* [[SDS@hd]]&lt;br /&gt;
* [https://www.rda.kit.edu/english bwDataArchive]&lt;br /&gt;
* [https://zas.bwsfs.uni-tuebingen.de/info/uebersicht bwSFS]&lt;br /&gt;
Associated, but local scientific storage services are:&lt;br /&gt;
* [https://wiki.scc.kit.edu/lsdf/index.php/Category:LSDF_Online_Storage LSDF Online Storage] (only for KIT and KIT partners)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;height:100%; background:#ffeaef; width:100%&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#f5dfdf; font-size:120%; font-weight:bold;  text-align:left&amp;quot;   | Data Management&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Data Transfer|Data Transfer]]&lt;br /&gt;
* [https://www.forschungsdaten.org/index.php/FDM-Kontakte#Deutschland Research Data Management (RDM)] contact persons&lt;br /&gt;
* [https://www.forschungsdaten.info Portal for Research Data Management] (Forschungsdaten.info)&lt;br /&gt;
|}&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Support is provided by the [https://www.bwhpc.de/teams.php bwHPC Competence Centers]:&lt;br /&gt;
* [https://bw-support.scc.kit.edu/ Submit a Ticket]&lt;br /&gt;
* extended Support via [https://zas.bwhpc.de/en/zas_info_tigerteamsupport.php &amp;quot;Tiger Teams&amp;quot;]&lt;br /&gt;
|}&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Acknowledgement&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Please [[Acknowledgement|acknowledge]] our resources in your publications.]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12812</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12812"/>
		<updated>2024-06-18T10:26:51Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Remember to acknowledge our resources in your publications!&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website: https://www.bwhpc.de/user_publications.html&lt;br /&gt;
&lt;br /&gt;
== HPC Clusters ==&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[JUSTUS2/Acknowledgement| bwForCluster JUSTUS2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
== Data Facilities ==&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[SDS@hd/Acknowledgement| SDS@hd Acknowledgement]]&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12811</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12811"/>
		<updated>2024-06-18T10:26:40Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* Data Facilities */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Remember to acknowledge our resources in your publications!&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website: https://www.bwhpc.de/user_publications.html&lt;br /&gt;
&lt;br /&gt;
== HPC Cluster ==&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[JUSTUS2/Acknowledgement| bwForCluster JUSTUS2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
== Data Facilities ==&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[SDS@hd/Acknowledgement| SDS@hd Acknowledgement]]&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12810</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12810"/>
		<updated>2024-06-18T10:26:30Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Remember to acknowledge our resources in your publications!&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website: https://www.bwhpc.de/user_publications.html&lt;br /&gt;
&lt;br /&gt;
== HPC Cluster ==&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[JUSTUS2/Acknowledgement| bwForCluster JUSTUS2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
== Data Facilities ==&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[SDS@hd/Acknowledgement| SDS@hdAcknowledgement]]&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12809</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12809"/>
		<updated>2024-06-18T10:26:12Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Remember to acknowledge our resources in your publications!&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website: https://www.bwhpc.de/user_publications.html&lt;br /&gt;
&lt;br /&gt;
== HPC Cluster ==&lt;br /&gt;
&lt;br /&gt;
Cluster-specific information can be found here:&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[JUSTUS2/Acknowledgement| bwForCluster JUSTUS2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
== Data Facilities ==&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[SDS@hd/Acknowledgement| SDS@hdAcknowledgement]]&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Main_Page&amp;diff=12808</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Main_Page&amp;diff=12808"/>
		<updated>2024-06-18T10:24:32Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;span style=&amp;quot;font-size:140%;&amp;quot;&amp;gt;&#039;&#039;&#039;Welcome to the bwHPC Wiki.&#039;&#039;&#039;&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
bwHPC represents services and resources in the State of &#039;&#039;&#039;B&#039;&#039;&#039;aden-&#039;&#039;&#039;W&#039;&#039;&#039;ürttemberg, Germany, for High Performance Computing (&#039;&#039;&#039;HPC&#039;&#039;&#039;), Data Intensive Computing (&#039;&#039;&#039;DIC&#039;&#039;&#039;) and Large Scale Scientific Data Management (&#039;&#039;&#039;LS2DM&#039;&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
The main bwHPC web page is at [https://www.bwhpc.de/ https://www.bwhpc.de/].&lt;br /&gt;
&lt;br /&gt;
Many topics depend on the cluster system you use. &lt;br /&gt;
First choose the cluster you use,  then select the correct topic.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot; background:#eeeefe; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Courses / eLearning&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [https://training.bwhpc.de/ eLearning and Online Courses]&lt;br /&gt;
* [https://hpc-wiki.info/hpc/Introduction_to_Linux_in_HPC Introduction to Linux in HPC (external resource)]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Need Access to a Cluster?&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Registration]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | HPC System Specific Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
bwHPC Clusters are dedicated to [https://www.bwhpc.de/bwhpc-domains.php specific research domains].  &lt;br /&gt;
Documentation differs between compute clusters; please see the cluster-specific overview pages:&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[BwUniCluster2.0|bwUniCluster 2.0]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | General Purpose, Teaching&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[:JUSTUS2| bwForCluster JUSTUS 2]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Theoretical Chemistry, Condensed Matter Physics, and Quantum Sciences&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot; | [[Helix|bwForCluster Helix]]&lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  |   Structural and Systems Biology, Medical Science, Soft Matter, Computational Humanities, and Mathematics and Computer Science&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[NEMO|bwForCluster NEMO]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Neurosciences, Particle Physics, Materials Science, and Microsystems Engineering&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[BinAC|bwForCluster BinAC]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Bioinformatics, Geosciences and Astrophysics. &lt;br /&gt;
|}&lt;br /&gt;
|-&lt;br /&gt;
|bwHPC Clusters: [https://www.bwhpc.de/cluster.php operational status] &lt;br /&gt;
Further Compute Clusters in Baden-Württemberg (separate access policies):&lt;br /&gt;
* bwHPC tier 1: [https://kb.hlrs.de/platforms/index.php/HPE_Hawk Hawk] ([https://www.hlrs.de/solutions-services/academic-users/ getting access])&lt;br /&gt;
* bwHPC tier 2: [https://www.nhr.kit.edu/userdocs/horeka HoreKa] ([https://www.nhr.kit.edu/userdocs/horeka/projects/ getting access])&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Documentation valid for all Clusters&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Environment Modules| Software Modules]] and software documentation explained&lt;br /&gt;
* [https://www.bwhpc.de/software.html List of Software] on all clusters&lt;br /&gt;
* [[Development| Software Development and Parallel Programming]]&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
* [[HPC Glossary]]&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;height:100%; background:#ffeaef; width:100%&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#f5dfdf; font-size:120%; font-weight:bold;  text-align:left&amp;quot;   | Scientific Data Storage&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
For user guides of the scientific data storage services:&lt;br /&gt;
* [[SDS@hd]]&lt;br /&gt;
* [https://www.rda.kit.edu/english bwDataArchive]&lt;br /&gt;
* [https://zas.bwsfs.uni-tuebingen.de/info/uebersicht bwSFS]&lt;br /&gt;
Associated, but local scientific storage services are:&lt;br /&gt;
* [https://wiki.scc.kit.edu/lsdf/index.php/Category:LSDF_Online_Storage LSDF Online Storage] (only for KIT and KIT partners)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;height:100%; background:#ffeaef; width:100%&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#f5dfdf; font-size:120%; font-weight:bold;  text-align:left&amp;quot;   | Data Management&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Data Transfer|Data Transfer]]&lt;br /&gt;
* [https://www.forschungsdaten.org/index.php/FDM-Kontakte#Deutschland Research Data Management (RDM)] contact persons&lt;br /&gt;
* [https://www.forschungsdaten.info Portal for Research Data Management] (Forschungsdaten.info)&lt;br /&gt;
|}&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Support is provided by the [https://www.bwhpc.de/teams.php bwHPC Competence Centers]:&lt;br /&gt;
* [https://bw-support.scc.kit.edu/ Submit a Ticket]&lt;br /&gt;
* extended Support via [https://zas.bwhpc.de/en/zas_info_tigerteamsupport.php &amp;quot;Tiger Teams&amp;quot;]&lt;br /&gt;
|}&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#d1dadf; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Acknowledgement&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* Please [[Acknowledgement|acknowledge]] our resources in your publications.]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Main_Page&amp;diff=12807</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Main_Page&amp;diff=12807"/>
		<updated>2024-06-18T10:24:06Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;span style=&amp;quot;font-size:140%;&amp;quot;&amp;gt;&#039;&#039;&#039;Welcome to the bwHPC Wiki.&#039;&#039;&#039;&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
bwHPC represents services and resources in the State of &#039;&#039;&#039;B&#039;&#039;&#039;aden-&#039;&#039;&#039;W&#039;&#039;&#039;ürttemberg, Germany, for High Performance Computing (&#039;&#039;&#039;HPC&#039;&#039;&#039;), Data Intensive Computing (&#039;&#039;&#039;DIC&#039;&#039;&#039;) and Large Scale Scientific Data Management (&#039;&#039;&#039;LS2DM&#039;&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
The main bwHPC web page is at [https://www.bwhpc.de/ https://www.bwhpc.de/].&lt;br /&gt;
&lt;br /&gt;
Many topics depend on the cluster system you use. &lt;br /&gt;
First choose the cluster you use,  then select the correct topic.&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot; background:#eeeefe; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Courses / eLearning&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* [https://training.bwhpc.de/ eLearning and Online Courses]&lt;br /&gt;
* [https://hpc-wiki.info/hpc/Introduction_to_Linux_in_HPC Introduction to Linux in HPC (external resource)]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Need Access to a Cluster?&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Registration]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;  background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | HPC System Specific Documentation&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
bwHPC Clusters are dedicated to [https://www.bwhpc.de/bwhpc-domains.php specific research domains].  &lt;br /&gt;
Documentation differs between compute clusters; please see the cluster-specific overview pages:&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[BwUniCluster2.0|bwUniCluster 2.0]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | General Purpose, Teaching&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[:JUSTUS2| bwForCluster JUSTUS 2]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Theoretical Chemistry, Condensed Matter Physics, and Quantum Sciences&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot; | [[Helix|bwForCluster Helix]]&lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  |   Structural and Systems Biology, Medical Science, Soft Matter, Computational Humanities, and Mathematics and Computer Science&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[NEMO|bwForCluster NEMO]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Neurosciences, Particle Physics, Materials Science, and Microsystems Engineering&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:5px; width:30%&amp;quot;  | [[BinAC|bwForCluster BinAC]] &lt;br /&gt;
| style=&amp;quot;padding-left:20px;&amp;quot;  | Bioinformatics, Geosciences and Astrophysics. &lt;br /&gt;
|}&lt;br /&gt;
|-&lt;br /&gt;
|bwHPC Clusters: [https://www.bwhpc.de/cluster.php operational status] &lt;br /&gt;
Further Compute Clusters in Baden-Württemberg (separate access policies):&lt;br /&gt;
* bwHPC tier 1: [https://kb.hlrs.de/platforms/index.php/HPE_Hawk Hawk] ([https://www.hlrs.de/solutions-services/academic-users/ getting access])&lt;br /&gt;
* bwHPC tier 2: [https://www.nhr.kit.edu/userdocs/horeka HoreKa] ([https://www.nhr.kit.edu/userdocs/horeka/projects/ getting access])&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#cef2e0; font-size:120%; font-weight:bold; text-align:left&amp;quot; | Documentation valid for all Clusters&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Environment Modules| Software Modules]] and software documentation explained&lt;br /&gt;
* [https://www.bwhpc.de/software.html List of Software] on all clusters&lt;br /&gt;
* [[Development| Software Development and Parallel Programming]]&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
* [[HPC Glossary]]&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;height:100%; background:#ffeaef; width:100%&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#f5dfdf; font-size:120%; font-weight:bold;  text-align:left&amp;quot;   | Scientific Data Storage&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
For user guides of the scientific data storage services:&lt;br /&gt;
* [[SDS@hd]]&lt;br /&gt;
* [https://www.rda.kit.edu/english bwDataArchive]&lt;br /&gt;
* [https://zas.bwsfs.uni-tuebingen.de/info/uebersicht bwSFS]&lt;br /&gt;
Associated, but local scientific storage services are:&lt;br /&gt;
* [https://wiki.scc.kit.edu/lsdf/index.php/Category:LSDF_Online_Storage LSDF Online Storage] (only for KIT and KIT partners)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;height:100%; background:#ffeaef; width:100%&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:8px; background:#f5dfdf; font-size:120%; font-weight:bold;  text-align:left&amp;quot;   | Data Management&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
* [[Data Transfer|Data Transfer]]&lt;br /&gt;
* [https://www.forschungsdaten.org/index.php/FDM-Kontakte#Deutschland Research Data Management (RDM)] contact persons&lt;br /&gt;
* [https://www.forschungsdaten.info Portal for Research Data Management] (Forschungsdaten.info)&lt;br /&gt;
|}&lt;br /&gt;
{| style=&amp;quot;  background:#eeeefe; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#dedefe; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Support&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
Support is provided by the [https://www.bwhpc.de/teams.php bwHPC Competence Centers]:&lt;br /&gt;
* [https://bw-support.scc.kit.edu/ Submit a Ticket]&lt;br /&gt;
* extended Support via [https://zas.bwhpc.de/en/zas_info_tigerteamsupport.php &amp;quot;Tiger Teams&amp;quot;]&lt;br /&gt;
|}&lt;br /&gt;
{| style=&amp;quot;  background:#e6e9eb; width:100%;&amp;quot; &lt;br /&gt;
| style=&amp;quot;padding:8px; background:#e6e9eb; font-size:120%; font-weight:bold;  text-align:left&amp;quot; | Acknowledgement&lt;br /&gt;
|-&lt;br /&gt;
| &lt;br /&gt;
* Please [[Acknowledgement|acknowledge]] our resources in your publications.]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12806</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12806"/>
		<updated>2024-06-18T10:19:06Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Remember to acknowledge our resources in your publications!&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
== HPC Cluster ==&lt;br /&gt;
&lt;br /&gt;
Cluster-specific information can be found here:&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[JUSTUS2/Acknowledgement| bwForCluster JUSTUS2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website:&lt;br /&gt;
https://www.bwhpc.de/user_publications.html&lt;br /&gt;
&lt;br /&gt;
== Data Facilities ==&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[SDS@hd/Acknowledgement| SDS@hdAcknowledgement]]&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12805</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12805"/>
		<updated>2024-06-18T10:18:27Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Remember to acknowledge our resources in your publications!&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
* HPC Cluster &lt;br /&gt;
&lt;br /&gt;
Cluster-specific information can be found here:&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[JUSTUS2/Acknowledgement| bwForCluster JUSTUS2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website:&lt;br /&gt;
https://www.bwhpc.de/user_publications.html&lt;br /&gt;
&lt;br /&gt;
# Data Facilities&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[SDS@hd/Acknowledgement| SDS@hdAcknowledgement]]&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12804</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12804"/>
		<updated>2024-06-18T10:18:06Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Remember to acknowledge our resources in your publications!&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
# HPC Cluster &lt;br /&gt;
&lt;br /&gt;
Cluster-specific information can be found here:&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[JUSTUS2/Acknowledgement| bwForCluster JUSTUS2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website:&lt;br /&gt;
https://www.bwhpc.de/user_publications.html&lt;br /&gt;
&lt;br /&gt;
# Data Facilities&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[DS@hd/Acknowledgement| SDS@hdAcknowledgement]]&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Slurm&amp;diff=12466</id>
		<title>BwUniCluster2.0/Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwUniCluster2.0/Slurm&amp;diff=12466"/>
		<updated>2023-11-20T13:35:10Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* Slurm Commands (excerpt) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div id=&amp;quot;top&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
=  Slurm HPC Workload Manager = &lt;br /&gt;
== Specification == &lt;br /&gt;
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Any calculation on the compute nodes of [[bwUniCluster 2.0|bwUniCluster 2.0]] requires the user to define it as a single command or a sequence of commands, together with the required run time, number of CPU cores and main memory, and to submit all of this, i.e. the &#039;&#039;&#039;batch job&#039;&#039;&#039;, to a resource and workload management software. On bwUniCluster 2.0 this software is Slurm, so every job submission is performed with Slurm commands. Slurm queues and runs user jobs based on fair-sharing policies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Slurm Commands (excerpt) ==&lt;br /&gt;
Below are some of the most commonly used Slurm commands for non-administrators working on bwUniCluster 2.0; a short usage sketch follows the table.&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Slurm commands !! Brief explanation&lt;br /&gt;
|-&lt;br /&gt;
| [[#Job Submission : sbatch|sbatch]] || Submits a job and queues it in an input queue [[https://slurm.schedmd.com/sbatch.html sbatch]] &lt;br /&gt;
|-&lt;br /&gt;
| [[#Detailed job information : scontrol show job|scontrol show job]] || Displays detailed job state information [[https://slurm.schedmd.com/scontrol.html scontrol]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#List of your submitted jobs : squeue|squeue]] || Displays information about active, eligible, blocked, and/or recently completed jobs [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Start time of job or resources : squeue|squeue --start]] || Returns start time of submitted job or requested resources [[https://slurm.schedmd.com/squeue.html squeue]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Shows free resources : sinfo_t_idle|sinfo_t_idle]] || Shows what resources are available for immediate use [[https://slurm.schedmd.com/sinfo.html sinfo]]&lt;br /&gt;
|-&lt;br /&gt;
| [[#Canceling own jobs : scancel|scancel]] || Cancels a job (obsoleted!) [[https://slurm.schedmd.com/scancel.html scancel]]&lt;br /&gt;
|}&lt;br /&gt;
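&lt;br /&gt;
A minimal usage sketch of the commands above (the job ID 12345 and the script name &#039;&#039;job.sh&#039;&#039; are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch job.sh               # submit the batch script; prints the job ID&lt;br /&gt;
$ squeue                      # list your own pending and running jobs&lt;br /&gt;
$ squeue --start              # show estimated start times of pending jobs&lt;br /&gt;
$ scontrol show job 12345     # detailed state information for job 12345&lt;br /&gt;
$ sinfo_t_idle                # show resources that are free for immediate use&lt;br /&gt;
$ scancel 12345               # cancel job 12345&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;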
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* [https://slurm.schedmd.com/tutorials.html  Slurm Tutorials]&lt;br /&gt;
* [https://slurm.schedmd.com/pdfs/summary.pdf  Slurm command/option summary (2 pages)]&lt;br /&gt;
* [https://slurm.schedmd.com/man_index.html  Slurm Commands]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Job Submission : sbatch ==&lt;br /&gt;
Batch jobs are submitted with the command &#039;&#039;&#039;sbatch&#039;&#039;&#039;. The main purpose of the &#039;&#039;&#039;sbatch&#039;&#039;&#039; command is to specify the resources that are needed to run the job. &#039;&#039;&#039;sbatch&#039;&#039;&#039; will then queue the batch job. However, when the batch job starts depends on the availability of the requested resources and on the fair-share value.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== sbatch Command Parameters ===&lt;br /&gt;
The syntax and use of &#039;&#039;&#039;sbatch&#039;&#039;&#039; can be displayed via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ man sbatch&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;sbatch&#039;&#039;&#039; options can be used on the command line or in your job script; a combined example header is sketched after the table.&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | sbatch Options&lt;br /&gt;
|-&lt;br /&gt;
! Command line&lt;br /&gt;
! Script&lt;br /&gt;
! Purpose&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -t &#039;&#039;time&#039;&#039;  or  --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| #SBATCH --time=&#039;&#039;time&#039;&#039;&lt;br /&gt;
| Wall clock time limit.&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -N &#039;&#039;count&#039;&#039;  or  --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --nodes=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of nodes to be used.&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -n &#039;&#039;count&#039;&#039;  or  --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of tasks to be launched.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --ntasks-per-node=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Maximum count (&amp;lt;= 28 and &amp;lt;= 40 resp.) of tasks per node.&amp;lt;br&amp;gt;(Replaces the option ppn of MOAB.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -c &#039;&#039;count&#039;&#039; or --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| #SBATCH --cpus-per-task=&#039;&#039;count&#039;&#039;&lt;br /&gt;
| Number of CPUs required per (MPI-)task.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Memory in MegaByte per node.&amp;lt;br&amp;gt;(Default value is 128000 and 96000 MB resp., i.e. you should omit &amp;lt;br&amp;gt; the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039;&lt;br /&gt;
| #SBATCH --mem-per-cpu=&#039;&#039;value_in_MB&#039;&#039; &lt;br /&gt;
| Minimum Memory required per allocated CPU.&amp;lt;br&amp;gt;(Replaces the option pmem of MOAB. You should omit &amp;lt;br&amp;gt; the setting of this option.)&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-type=&#039;&#039;type&#039;&#039;&lt;br /&gt;
| Notify user by email when certain event types occur.&amp;lt;br&amp;gt;Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
| #SBATCH --mail-user=&#039;&#039;mail-address&#039;&#039;&lt;br /&gt;
|  The specified mail-address receives email notification of state&amp;lt;br&amp;gt;changes as defined by --mail-type.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --output=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job output is stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --error=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| File in which job error messages are stored. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -J &#039;&#039;name&#039;&#039; or --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| #SBATCH --job-name=&#039;&#039;name&#039;&#039;&lt;br /&gt;
| Job name.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| #SBATCH --export=[ALL,] &#039;&#039;env-variables&#039;&#039;&lt;br /&gt;
| Identifies which environment variables from the submission &amp;lt;br&amp;gt; environment are propagated to the launched application. Default &amp;lt;br&amp;gt; is ALL. If adding an environment variable to the submission&amp;lt;br&amp;gt; environment is intended, the argument ALL must be added.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -A &#039;&#039;group-name&#039;&#039; or --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| #SBATCH --account=&#039;&#039;group-name&#039;&#039;&lt;br /&gt;
| Charge resources used by this job to the specified group. You may &amp;lt;br&amp;gt; need this option if your account is assigned to more &amp;lt;br&amp;gt; than one group. The project group a job is accounted on is shown &amp;lt;br&amp;gt; behind &amp;quot;Account=&amp;quot; in the output of &amp;quot;scontrol show job&amp;quot;. &lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -p &#039;&#039;queue-name&#039;&#039; or --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| #SBATCH --partition=&#039;&#039;queue-name&#039;&#039;&lt;br /&gt;
| Request a specific queue for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| #SBATCH --reservation=&#039;&#039;reservation-name&#039;&#039;&lt;br /&gt;
| Use a specific reservation for the resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C &#039;&#039;LSDF&#039;&#039; or --constraint=&#039;&#039;LSDF&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=LSDF&lt;br /&gt;
| Job constraint LSDF Filesystems.&lt;br /&gt;
|-&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -C &#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039; or --constraint=&#039;&#039;BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&#039;&#039;&lt;br /&gt;
| #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS)&lt;br /&gt;
| Job constraint BeeOND file system.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== sbatch --partition  &#039;&#039;queues&#039;&#039; ====&lt;br /&gt;
Queue classes define the maximum resources per queue of the compute system, such as walltime, number of nodes and processes per node. Details can be found here:&lt;br /&gt;
* [[BwUniCluster_2.0_Batch_Queues#sbatch_-p_queue|bwUniCluster 2.0 queue settings]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== sbatch Examples ===&lt;br /&gt;
==== Serial Programs ====&lt;br /&gt;
To submit a serial job that runs the script &#039;&#039;&#039;job.sh&#039;&#039;&#039; and that requires 5000 MB of main memory and 10 minutes of wall clock time&lt;br /&gt;
&lt;br /&gt;
a) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p dev_single -n 1 -t 10:00 --mem=5000  job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or&lt;br /&gt;
b) add after the initial line of your script &#039;&#039;&#039;job.sh&#039;&#039;&#039; the lines (here with a high memory request):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=10&lt;br /&gt;
#SBATCH --mem=180gb&lt;br /&gt;
#SBATCH --job-name=simple&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
and execute the modified script with the command line option &#039;&#039;--partition=fat&#039;&#039; (with &#039;&#039;--partition=(dev_)single&#039;&#039; at most &#039;&#039;--mem=96gb&#039;&#039; is possible):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch --partition=fat job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that sbatch command line options overrule script options.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Multithreaded Programs ====&lt;br /&gt;
Multithreaded programs operate faster than serial programs on CPUs with multiple cores.&amp;lt;br&amp;gt;&lt;br /&gt;
Moreover, multiple threads of one process share resources such as memory.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For multithreaded programs based on &#039;&#039;&#039;Open&#039;&#039;&#039; &#039;&#039;&#039;M&#039;&#039;&#039;ulti-&#039;&#039;&#039;P&#039;&#039;&#039;rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Because hyperthreading is switched on on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n, if you want to use n threads.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To submit a batch job called &#039;&#039;OpenMP_Test&#039;&#039; that runs a 40-fold threaded program &#039;&#039;omp_exe&#039;&#039; which requires 6000 MByte of total physical memory and total wall clock time of 40 minutes:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p single --export=ALL,OMP_NUM_THREADS=40 -J OpenMP_Test -N 1 -c 80 -t 40 --mem=6000 ./omp_exe&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or&lt;br /&gt;
b) generate the script &#039;&#039;&#039;job_omp.sh&#039;&#039;&#039; containing the following lines:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --cpus-per-task=80&lt;br /&gt;
#SBATCH --time=40:00&lt;br /&gt;
#SBATCH --mem=6000mb   &lt;br /&gt;
#SBATCH --export=ALL,EXECUTABLE=./omp_exe&lt;br /&gt;
#SBATCH -J OpenMP_Test&lt;br /&gt;
&lt;br /&gt;
#Usually you should set&lt;br /&gt;
export KMP_AFFINITY=compact,1,0&lt;br /&gt;
#export KMP_AFFINITY=verbose,compact,1,0 prints messages concerning the supported affinity&lt;br /&gt;
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE&lt;br /&gt;
&lt;br /&gt;
export OMP_NUM_THREADS=$((${SLURM_JOB_CPUS_PER_NODE}/2))&lt;br /&gt;
echo &amp;quot;Executable ${EXECUTABLE} running on ${SLURM_JOB_CPUS_PER_NODE} cores with ${OMP_NUM_THREADS} threads&amp;quot;&lt;br /&gt;
startexe=${EXECUTABLE}&lt;br /&gt;
echo $startexe&lt;br /&gt;
exec $startexe&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
When using the Intel compiler, the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If necessary, replace &amp;lt;placeholder&amp;gt; with the required modulefile to enable the OpenMP environment. Then execute the script &#039;&#039;&#039;job_omp.sh&#039;&#039;&#039;, adding the queue class &#039;&#039;single&#039;&#039; as sbatch option:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p single job_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that sbatch command line options overrule script options, e.g.,&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch --partition=single --mem=200 job_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
overwrites the script setting of 6000 MByte with 200 MByte.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== MPI Parallel Programs ====&lt;br /&gt;
MPI parallel programs run faster than serial programs on multi CPU and multi core systems. N-fold spawned processes of the MPI program, i.e., &#039;&#039;&#039;MPI tasks&#039;&#039;&#039;,  run simultaneously and communicate via the Message Passing Interface (MPI) paradigm. MPI tasks do not share memory but can be spawned over different nodes.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Multiple MPI tasks must be launched via &#039;&#039;&#039;mpirun&#039;&#039;&#039;, e.g. 4 MPI tasks of &#039;&#039;my_par_program&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun -n 4 my_par_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This command runs 4 MPI tasks of &#039;&#039;my_par_program&#039;&#039; on the node you are logged in to.&lt;br /&gt;
To run this command with a loaded Intel MPI, the environment variable I_MPI_HYDRA_BOOTSTRAP must be unset first ($ unset I_MPI_HYDRA_BOOTSTRAP).&lt;br /&gt;
&lt;br /&gt;
When running MPI parallel programs in a batch job, the interactive environment - particularly the loaded modules - will also be set in the batch job. If you want a defined module environment in your batch job, you have to purge all modules before loading the desired modules. &lt;br /&gt;
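For example, a minimal sketch of enforcing such a defined module environment at the top of a job script (the version string is a placeholder):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Remove all modules inherited from the interactive (submission) environment&lt;br /&gt;
module purge&lt;br /&gt;
# Load only the modules the job actually needs&lt;br /&gt;
module load mpi/openmpi/&amp;lt;placeholder_for_version&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;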
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
===== OpenMPI =====&lt;br /&gt;
&lt;br /&gt;
If you want to run jobs on batch nodes, generate a wrapper script &#039;&#039;job_ompi.sh&#039;&#039; for &#039;&#039;&#039;OpenMPI&#039;&#039;&#039; containing the following lines:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Use when a defined module environment related to OpenMPI is wished&lt;br /&gt;
module load mpi/openmpi/&amp;lt;placeholder_for_version&amp;gt;&lt;br /&gt;
mpirun --bind-to core --map-by core -report-bindings my_par_program&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Attention:&#039;&#039;&#039; Do &#039;&#039;&#039;NOT&#039;&#039;&#039; add mpirun options &#039;&#039;-n &amp;lt;number_of_processes&amp;gt;&#039;&#039; or any other option defining processes or nodes, since Slurm instructs mpirun about the number of processes and node hostnames. &#039;&#039;&#039;ALWAYS&#039;&#039;&#039; use the MPI options &#039;&#039;&#039;&#039;&#039;--bind-to core&#039;&#039;&#039;&#039;&#039; and &#039;&#039;&#039;&#039;&#039;--map-by core|socket|node&#039;&#039;&#039;&#039;&#039;. Please type &#039;&#039;mpirun --help&#039;&#039; for an explanation of the different arguments of the &#039;&#039;--map-by&#039;&#039; option.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Considering 4 OpenMPI tasks on a single node, each requiring 2000 MByte, and running for 1 hour, execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p single -N 1 -n 4 --mem-per-cpu=2000 --time=01:00:00 ./job_ompi.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===== Intel MPI =====&lt;br /&gt;
&lt;br /&gt;
Generate a wrapper script for &#039;&#039;&#039;Intel MPI&#039;&#039;&#039;, &#039;&#039;job_impi.sh&#039;&#039; containing the following lines:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Use when a defined module environment related to Intel MPI is wished&lt;br /&gt;
module load mpi/impi/&amp;lt;placeholder_for_version&amp;gt;   &lt;br /&gt;
mpiexec.hydra -bootstrap slurm my_par_program&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;font color=red&amp;gt;&#039;&#039;&#039;Attention:&#039;&#039;&#039;&amp;lt;/font&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Do &#039;&#039;&#039;NOT&#039;&#039;&#039; add mpirun options &#039;&#039;-n &amp;lt;number_of_processes&amp;gt;&#039;&#039; or any other option defining processes or nodes, since Slurm instructs mpirun about number of processes and node hostnames.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
To launch and run 200 Intel MPI tasks on 5 nodes (40 tasks per node), with 80 GByte of memory per node and a wall clock time of 5 hours, execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch --partition=multiple -N 5 --ntasks-per-node=40 --mem=80gb -t 300 ./job_impi.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to use 128 or more nodes, you must also set the environment variable as follows:           &amp;lt;BR&amp;gt;&lt;br /&gt;
export I_MPI_HYDRA_BRANCH_COUNT=-1&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.&lt;br /&gt;
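For example, a sketch of &#039;&#039;job_impi.sh&#039;&#039; with these variables set before the mpiexec.hydra call (only add the lines that apply to your job):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
module load mpi/impi/&amp;lt;placeholder_for_version&amp;gt;&lt;br /&gt;
# Only required for jobs using 128 or more nodes&lt;br /&gt;
export I_MPI_HYDRA_BRANCH_COUNT=-1&lt;br /&gt;
# Only required if the options perhost, ppn or rr are used&lt;br /&gt;
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off&lt;br /&gt;
mpiexec.hydra -bootstrap slurm my_par_program&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;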
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Multithreaded + MPI parallel Programs ====&lt;br /&gt;
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes. &#039;&#039;&#039;Because hyperthreading is switched on on bwUniCluster 2.0, the option --cpus-per-task (-c) must be set to 2*n, if you want to use n threads.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
===== OpenMPI with Multithreading =====&lt;br /&gt;
Multiple MPI tasks using &#039;&#039;&#039;OpenMPI&#039;&#039;&#039; must be launched by the MPI parallel program &#039;&#039;&#039;mpirun&#039;&#039;&#039;. For multithreaded programs based on &#039;&#039;&#039;Open&#039;&#039;&#039; &#039;&#039;&#039;M&#039;&#039;&#039;ulti-&#039;&#039;&#039;P&#039;&#039;&#039;rocessing (OpenMP) the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;For OpenMPI&#039;&#039;&#039; a job script to submit a batch job called &#039;&#039;job_ompi_omp.sh&#039;&#039; that runs an MPI program with 4 tasks and a 28-fold threaded program &#039;&#039;ompi_omp_program&#039;&#039;, requiring 3000 MByte of physical memory per thread (with 28 threads per MPI task this is 28*3000 MByte = 84000 MByte per MPI task) and a total wall clock time of 3 hours, looks like:&lt;br /&gt;
&amp;lt;!--b)--&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=4&lt;br /&gt;
#SBATCH --cpus-per-task=56&lt;br /&gt;
#SBATCH --time=03:00:00&lt;br /&gt;
#SBATCH --mem=83gb    # 84000 MB = 84000/1024 GB = 82.1 GB&lt;br /&gt;
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/3.1,EXECUTABLE=./ompi_omp_program&lt;br /&gt;
#SBATCH --output=&amp;quot;parprog_hybrid_%j.out&amp;quot;  &lt;br /&gt;
&lt;br /&gt;
# Use when a defined module environment related to OpenMPI is wished&lt;br /&gt;
module load ${MPI_MODULE}&lt;br /&gt;
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))&lt;br /&gt;
export MPIRUN_OPTIONS=&amp;quot;--bind-to core --map-by socket:PE=${OMP_NUM_THREADS} -report-bindings&amp;quot;&lt;br /&gt;
export NUM_CORES=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))&lt;br /&gt;
echo &amp;quot;${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads&amp;quot;&lt;br /&gt;
startexe=&amp;quot;mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}&amp;quot;&lt;br /&gt;
echo $startexe&lt;br /&gt;
exec $startexe&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Execute the script &#039;&#039;&#039;job_ompi_omp.sh&#039;&#039;&#039; by command sbatch:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p multiple ./job_ompi_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* With the mpirun option &#039;&#039;--bind-to core&#039;&#039; MPI tasks and OpenMP threads are bound to physical cores.&lt;br /&gt;
* With the option &#039;&#039;--map-by node:PE=&amp;lt;value&amp;gt;&#039;&#039; neighboring MPI tasks will be attached to different nodes and each MPI task is bound to the first core of a node. &amp;lt;value&amp;gt; must be set to ${OMP_NUM_THREADS}.&lt;br /&gt;
* The option &#039;&#039;-report-bindings&#039;&#039; shows the bindings between MPI tasks and physical cores.&lt;br /&gt;
* The mpirun-options &#039;&#039;&#039;--bind-to core&#039;&#039;&#039;, &#039;&#039;&#039;--map-by socket|...|node:PE=&amp;lt;value&amp;gt;&#039;&#039;&#039; should always be used when running a multithreaded MPI program.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===== Intel MPI with Multithreading =====&lt;br /&gt;
Multithreaded + MPI parallel programs operate faster than serial programs on multi CPUs with multiple cores. All threads of one process share resources such as memory. On the contrary MPI tasks do not share memory but can be spawned over different nodes.  &lt;br /&gt;
&lt;br /&gt;
Multiple Intel MPI tasks must be launched by the MPI parallel program &#039;&#039;&#039;mpiexec.hydra&#039;&#039;&#039;. For multithreaded programs based on &#039;&#039;&#039;Open&#039;&#039;&#039; &#039;&#039;&#039;M&#039;&#039;&#039;ulti-&#039;&#039;&#039;P&#039;&#039;&#039;rocessing (OpenMP) number of threads are defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;For Intel MPI&#039;&#039;&#039; a job script to submit a batch job called &#039;&#039;job_impi_omp.sh&#039;&#039; that runs an Intel MPI program with 10 tasks and a 40-fold threaded program &#039;&#039;impi_omp_program&#039;&#039;, requiring 96000 MByte of total physical memory per task and a total wall clock time of 1 hour, looks like: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--b)--&amp;gt; &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=10&lt;br /&gt;
#SBATCH --cpus-per-task=80&lt;br /&gt;
#SBATCH --time=60&lt;br /&gt;
#SBATCH --mem=96000&lt;br /&gt;
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program&lt;br /&gt;
#SBATCH --output=&amp;quot;parprog_impi_omp_%j.out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
#If using more than one MPI task per node please set&lt;br /&gt;
export KMP_AFFINITY=compact,1,0&lt;br /&gt;
#export KMP_AFFINITY=verbose,scatter  prints messages concerning the supported affinity &lt;br /&gt;
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE&lt;br /&gt;
&lt;br /&gt;
# Use when a defined module environment related to Intel MPI is wished &lt;br /&gt;
module load ${MPI_MODULE}&lt;br /&gt;
export OMP_NUM_THREADS=$((${SLURM_CPUS_PER_TASK}/2))&lt;br /&gt;
export MPIRUN_OPTIONS=&amp;quot;-binding domain=omp:compact -print-rank-map -envall&amp;quot;&lt;br /&gt;
export NUM_PROCS=$((${SLURM_NTASKS}*${OMP_NUM_THREADS}))&lt;br /&gt;
echo &amp;quot;${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads&amp;quot;&lt;br /&gt;
startexe=&amp;quot;mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}&amp;quot;&lt;br /&gt;
echo $startexe&lt;br /&gt;
exec $startexe&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Using Intel compiler the environment variable KMP_AFFINITY switches on binding of threads to specific cores. If you only run one MPI task per node please set KMP_AFFINITY=compact,1,0.&lt;br /&gt;
&amp;lt;BR&amp;gt;&lt;br /&gt;
If you want to use 128 or more nodes, you must also set the environment variable as follows:           &amp;lt;BR&amp;gt;&lt;br /&gt;
export I_MPI_HYDRA_BRANCH_COUNT=-1&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to use the options perhost, ppn or rr, you must additionally set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Execute the script &#039;&#039;&#039;job_impi_omp.sh&#039;&#039;&#039; by command sbatch:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p multiple ./job_impi_omp.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The mpirun option &#039;&#039;-print-rank-map&#039;&#039; shows the bindings between MPI tasks and nodes (not very beneficial). The option &#039;&#039;-binding&#039;&#039; binds MPI tasks (processes) to a particular processor; &#039;&#039;domain=omp&#039;&#039; means that the domain size is determined by the number of threads. If you choose 2 MPI tasks per node, you should use &#039;&#039;-binding &amp;quot;cell=unit;map=bunch&amp;quot;&#039;&#039;; this binding maps one MPI process to each socket. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Chain jobs ====&lt;br /&gt;
The CPU time requirements of many applications exceed the limits of the job classes. In those situations it is recommended to solve the problem by a job chain. A job chain is a sequence of jobs where each job automatically starts its successor. &lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
############################################&lt;br /&gt;
## simple Slurm submitter script to setup ##&lt;br /&gt;
## a chain of jobs using Slurm            ##&lt;br /&gt;
############################################&lt;br /&gt;
## ver.  : 2018-11-27, KIT, SCC&lt;br /&gt;
&lt;br /&gt;
## Define maximum number of jobs via positional parameter 1, default is 5&lt;br /&gt;
max_nojob=${1:-5}&lt;br /&gt;
&lt;br /&gt;
## Define your jobscript (e.g. &amp;quot;~/chain_job.sh&amp;quot;)&lt;br /&gt;
chain_link_job=${PWD}/chain_job.sh&lt;br /&gt;
&lt;br /&gt;
## Define type of dependency via positional parameter 2, default is &#039;afterok&#039;&lt;br /&gt;
dep_type=&amp;quot;${2:-afterok}&amp;quot;&lt;br /&gt;
## -&amp;gt; List of all dependencies:&lt;br /&gt;
## https://slurm.schedmd.com/sbatch.html&lt;br /&gt;
&lt;br /&gt;
myloop_counter=1&lt;br /&gt;
## Submit loop&lt;br /&gt;
while [ ${myloop_counter} -le ${max_nojob} ] ; do&lt;br /&gt;
   ##&lt;br /&gt;
   ## Differ slurm_opt depending on chain link number&lt;br /&gt;
   if [ ${myloop_counter} -eq 1 ] ; then&lt;br /&gt;
      slurm_opt=&amp;quot;&amp;quot;&lt;br /&gt;
   else&lt;br /&gt;
      slurm_opt=&amp;quot;-d ${dep_type}:${jobID}&amp;quot;&lt;br /&gt;
   fi&lt;br /&gt;
   ##&lt;br /&gt;
   ## Print current iteration number and sbatch command&lt;br /&gt;
   echo &amp;quot;Chain job iteration = ${myloop_counter}&amp;quot;&lt;br /&gt;
   echo &amp;quot;   sbatch --export=myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job}&amp;quot;&lt;br /&gt;
   ## Store the job ID for the next iteration by parsing the output of the sbatch command&lt;br /&gt;
   jobID=$(sbatch -p &amp;lt;queue&amp;gt; --export=ALL,myloop_counter=${myloop_counter} ${slurm_opt} ${chain_link_job} 2&amp;gt;&amp;amp;1 | sed &#039;s/[S,a-z]* //g&#039;)&lt;br /&gt;
   ##   &lt;br /&gt;
   ## Check if ERROR occured&lt;br /&gt;
   if [[ &amp;quot;${jobID}&amp;quot; =~ &amp;quot;ERROR&amp;quot; ]] ; then&lt;br /&gt;
      echo &amp;quot;   -&amp;gt; submission failed!&amp;quot; ; exit 1&lt;br /&gt;
   else&lt;br /&gt;
      echo &amp;quot;   -&amp;gt; job number = ${jobID}&amp;quot;&lt;br /&gt;
   fi&lt;br /&gt;
   ##&lt;br /&gt;
   ## Increase counter&lt;br /&gt;
   let myloop_counter+=1&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
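The chain link script &#039;&#039;chain_job.sh&#039;&#039; referenced above is application specific and not shown here; a minimal, hypothetical sketch (resources and the restart logic must be adapted to your program) could look like:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=10&lt;br /&gt;
&lt;br /&gt;
# myloop_counter is exported by the submitter script above&lt;br /&gt;
echo &amp;quot;This is chain link number ${myloop_counter:-1}&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Run the restartable part of the calculation, e.g. continuing&lt;br /&gt;
# from the checkpoint written by the previous chain link.&lt;br /&gt;
./my_restartable_program&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;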
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== GPU jobs ====&lt;br /&gt;
&lt;br /&gt;
The nodes in the gpu_4 and gpu_8 queues have 4 or 8 NVIDIA Tesla V100 GPUs. Just submitting a job to these queues is not enough to also allocate one or more GPUs; you have to request them with the &amp;quot;--gres=gpu&amp;quot; parameter. You have to specify how many GPUs your job needs, e.g. &amp;quot;--gres=gpu:2&amp;quot; will request two GPUs.&lt;br /&gt;
&lt;br /&gt;
The GPU nodes are shared between multiple jobs if the jobs don&#039;t request all the GPUs in a node and there are enough resources to run more than one job. The individual GPUs are always bound to a single job and will not be shared between different jobs.&lt;br /&gt;
&lt;br /&gt;
a) add after the initial line of your script job.sh the line including the&lt;br /&gt;
information about the GPU usage:&amp;lt;br&amp;gt;   #SBATCH --gres=gpu:2&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=40&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --mem=4000&lt;br /&gt;
#SBATCH --gres=gpu:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or b) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 40 -t 02:00:00 --mem 4000 --gres=gpu:2 job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
If you start an interactive session on one of the GPU nodes, you can use the &amp;quot;nvidia-smi&amp;quot; command to list the GPUs allocated to your job:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ nvidia-smi&lt;br /&gt;
Sun Mar 29 15:20:05 2020       &lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |&lt;br /&gt;
|-------------------------------+----------------------+----------------------+&lt;br /&gt;
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |&lt;br /&gt;
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |&lt;br /&gt;
|===============================+======================+======================|&lt;br /&gt;
|   0  Tesla V100-SXM2...  Off  | 00000000:3A:00.0 Off |                    0 |&lt;br /&gt;
| N/A   29C    P0    39W / 300W |      9MiB / 32510MiB |      0%      Default |&lt;br /&gt;
+-------------------------------+----------------------+----------------------+&lt;br /&gt;
|   1  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |&lt;br /&gt;
| N/A   30C    P0    41W / 300W |      8MiB / 32510MiB |      0%      Default |&lt;br /&gt;
+-------------------------------+----------------------+----------------------+&lt;br /&gt;
                                                                               &lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
| Processes:                                                       GPU Memory |&lt;br /&gt;
|  GPU       PID   Type   Process name                             Usage      |&lt;br /&gt;
|=============================================================================|&lt;br /&gt;
|    0     14228      G   /usr/bin/X                                     8MiB |&lt;br /&gt;
|    1     14228      G   /usr/bin/X                                     8MiB |&lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
In case of using Open MPI, the underlying communication infrastructure (UCX and Open MPI&#039;s BTL) is CUDA-aware.&lt;br /&gt;
However, there may be warnings, e.g. when running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ module load compiler/gnu/10.3 mpi/openmpi devel/cuda&lt;br /&gt;
$ mpirun -np 2 ./mpi_cuda_app&lt;br /&gt;
--------------------------------------&lt;br /&gt;
WARNING: There are more than one active ports on host &#039;uc2n520&#039;, but the&lt;br /&gt;
default subnet GID prefix was detected on more than one of these&lt;br /&gt;
ports.  If these ports are connected to different physical IB&lt;br /&gt;
networks, this configuration will fail in Open MPI.  This version of&lt;br /&gt;
Open MPI requires that every physically separate IB subnet that is&lt;br /&gt;
used between connected MPI processes must have different subnet ID&lt;br /&gt;
values.&lt;br /&gt;
&lt;br /&gt;
Please see this FAQ entry for more details:&lt;br /&gt;
&lt;br /&gt;
  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid&lt;br /&gt;
&lt;br /&gt;
NOTE: You can turn off this warning by setting the MCA parameter&lt;br /&gt;
      btl_openib_warn_default_gid_prefix to 0.&lt;br /&gt;
--------------------------------------------------------------------------&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please run Open MPI&#039;s mpirun using the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun --mca pml ucx --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./mpi_cuda_app&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or by disabling the (older) communication layer BTL (Byte Transfer Layer) altogether:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mpirun --mca pml ucx --mca btl ^openib -np 2 ./mpi_cuda_app&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(Please note that CUDA as of v11.4 only supports GCC up to version 10.)&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== LSDF Online Storage ====&lt;br /&gt;
On bwUniCluster 2.0 you can, for special use cases, access the LSDF Online Storage on the HPC cluster nodes. Please request this service separately ([https://www.lsdf.kit.edu/os/storagerequest/ LSDF Storage Request]).&lt;br /&gt;
To mount the LSDF Online Storage on the compute nodes during the job runtime,&lt;br /&gt;
the constraint flag &amp;quot;LSDF&amp;quot; has to be set.  &lt;br /&gt;
&lt;br /&gt;
a) add after the initial line of your script job.sh the line including the&lt;br /&gt;
information about the LSDF Online Storage usage:&amp;lt;br&amp;gt;   #SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --time=120&lt;br /&gt;
#SBATCH --mem=200&lt;br /&gt;
#SBATCH --constraint=LSDF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or b) execute:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sbatch -p &amp;lt;queue&amp;gt; -n 1 -t 2:00:00 --mem 200 -C LSDF job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For the usage of the LSDF Online Storage&lt;br /&gt;
the following environment variables are available: $LSDF, $LSDFPROJECTS, $LSDFHOME.&lt;br /&gt;
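For example, a sketch of copying results into the LSDF Online Storage at the end of a job (&amp;lt;project&amp;gt; is a placeholder for your LSDF project directory):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Copy the results from the job directory to your LSDF project&lt;br /&gt;
rsync -av ./results/ ${LSDFPROJECTS}/&amp;lt;project&amp;gt;/results/&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;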
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====BeeOND (BeeGFS On-Demand)====&lt;br /&gt;
&lt;br /&gt;
BeeOND instances are integrated into the prolog and epilog script of the cluster batch system Slurm. It can be used on the exclusive compute nodes during the job runtime with the constraint flag &amp;quot;BEEOND&amp;quot;, &amp;quot;BEEOND_4MDS&amp;quot; or &amp;quot;BEEOND_MAXMDS&amp;quot; ([[BwUniCluster_2.0_Slurm_common_Features#sbatch_Command_Parameters|Slurm Command Parameters]])&lt;br /&gt;
* BEEOND: one metadata server is started on the first node&lt;br /&gt;
* BEEOND_4MDS: 4 metadata servers are started within your job. If you have fewer than 4 nodes, fewer metadata servers are started.&lt;br /&gt;
* BEEOND_MAXMDS: on every node of your job a metadata server for the on-demand file system is started&lt;br /&gt;
&lt;br /&gt;
As a starting point we recommend using the &amp;quot;BEEOND&amp;quot; option. If you are unsure whether this is sufficient for you, feel free to contact the support team.&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH ...&lt;br /&gt;
#SBATCH --constraint=BEEOND   # or BEEOND_4MDS or BEEOND_MAXMDS&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After your job has started, you can find the private on-demand file system in the directory &#039;&#039;&#039;/mnt/odfs/${SLURM_JOB_ID}&#039;&#039;&#039;. The mountpoint comes with five pre-configured directories:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# For small files (stripe count = 1)&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_1&lt;br /&gt;
# Stripe count = 4&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_default &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_4&lt;br /&gt;
# Stripe count = 8, 16 or 32; use these directories for medium sized and large files or when using MPI-IO&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_8&lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_16 &lt;br /&gt;
# or &lt;br /&gt;
/mnt/odfs/${SLURM_JOB_ID}/stripe_32&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you request fewer nodes than the stripe count, the stripe count will be reduced to the number of nodes. For example, if you only request 8 nodes, the directory stripe_16 only has a stripe count of 8.&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;font color=red&amp;gt;&#039;&#039;&#039;Attention:&#039;&#039;&#039;&amp;lt;/font&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
:Be careful when creating large files: always use the directory with the maximum stripe count for large files.&lt;br /&gt;
:For example, if your largest file is 1.1 TByte, you have to use a stripe count of at least 2, &lt;br /&gt;
:otherwise the capacity provided by a single node (750 GByte, see below) is exceeded.  &lt;br /&gt;
&lt;br /&gt;
The capacity of the private file system depends on the number of nodes. For each node you get 750 GByte.&lt;br /&gt;
If you request 100 nodes for your job, the private file system has a capacity of approx. 100 * 750 GByte = 75 TByte.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Possible optimization:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
A possible optimization is to give the private file system its own metadata server. With the constraint BEEOND the metadata server is started on the first node. Depending on your application, the metadata server could consume a considerable amount of CPU power. In this case, adding an extra node to your job could improve the performance of the on-demand file system and the total runtime of your application. In order to use this option, start your application with the MPI option:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mpirun -nolocal myapplication&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
With the -nolocal option the node where mpirun is initiated is not used for your application. This node is fully available for the metadata server of your requested on-demand file system.&lt;br /&gt;
&lt;br /&gt;
Example job script:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Very simple example on how to use a private on-demand file system.&lt;br /&gt;
#SBATCH -N 10&lt;br /&gt;
#SBATCH --constraint=BEEOND&lt;br /&gt;
&lt;br /&gt;
# Create a workspace. &lt;br /&gt;
ws_allocate myresults-${SLURM_JOB_ID} 90&lt;br /&gt;
RESULTDIR=$(ws_find myresults-${SLURM_JOB_ID})&lt;br /&gt;
&lt;br /&gt;
# Set ENV variable to on-demand file system.&lt;br /&gt;
ODFSDIR=/mnt/odfs/${SLURM_JOB_ID}/stripe_16/&lt;br /&gt;
&lt;br /&gt;
# Start application and write results to on-demand file system.&lt;br /&gt;
mpirun -nolocal myapplication -o $ODFSDIR/results&lt;br /&gt;
&lt;br /&gt;
# Copy back data after your job application end.&lt;br /&gt;
rsync -av $ODFSDIR/results $RESULTDIR&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Start time of job or resources : squeue --start ==&lt;br /&gt;
The command can be used by any user to display the estimated start time of a job, based on a number of analysis types: historical usage, the earliest available reservable resources, and the priority based backlog. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
By default, this command can be run by &#039;&#039;&#039;any user&#039;&#039;&#039;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
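=== Example ===&lt;br /&gt;
Display the estimated start times of your own pending jobs (the exact output columns may differ slightly between Slurm versions):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue --start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;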
&lt;br /&gt;
== List of your submitted jobs : squeue ==&lt;br /&gt;
Displays information about YOUR active, pending and/or recently completed jobs. The command squeue is explained in detail on the webpage https://slurm.schedmd.com/squeue.html or via manpage (man squeue).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
By default, this command can be run by any user.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Flags ===&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Flag !! Description&lt;br /&gt;
|-&lt;br /&gt;
| -l, --long&lt;br /&gt;
| Report more of the available information for the selected jobs or job steps, subject to any constraints specified.&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&#039;&#039;squeue&#039;&#039; example on bwUniCluster 2.0 &amp;lt;small&amp;gt;(Only your own jobs are displayed!)&amp;lt;/small&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ squeue &lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
          18088744    single CPV.sbat   ab1234 PD       0:00      1 (Priority)&lt;br /&gt;
          18098414  multiple CPV.sbat   ab1234 PD       0:00      2 (Priority) &lt;br /&gt;
          18090089  multiple CPV.sbat   ab1234  R       2:27      2 uc2n[127-128]&lt;br /&gt;
$ squeue -l&lt;br /&gt;
            JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON) &lt;br /&gt;
         18088654    single CPV.sbat   ab1234 COMPLETI       4:29   2:00:00      1 uc2n374&lt;br /&gt;
         18088785    single CPV.sbat   ab1234  PENDING       0:00   2:00:00      1 (Priority)&lt;br /&gt;
         18098414  multiple CPV.sbat   ab1234  PENDING       0:00   2:00:00      2 (Priority)&lt;br /&gt;
         18088683    single CPV.sbat   ab1234  RUNNING       0:14   2:00:00      1 uc2n413  &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* The output of &#039;&#039;squeue&#039;&#039; shows how many jobs of yours are running or pending and how many nodes are in use by your jobs.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Shows free resources : sinfo_t_idle ==&lt;br /&gt;
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
By default, this command can be used by any user or administrator. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Example ===&lt;br /&gt;
* The following command displays which resources are available for immediate use in each partition.&lt;br /&gt;
&amp;lt;pre&amp;gt;$ sinfo_t_idle&lt;br /&gt;
Partition dev_multiple  :      8 nodes idle&lt;br /&gt;
Partition multiple      :    332 nodes idle&lt;br /&gt;
Partition dev_single    :      4 nodes idle&lt;br /&gt;
Partition single        :     76 nodes idle&lt;br /&gt;
Partition long          :     80 nodes idle&lt;br /&gt;
Partition fat           :      5 nodes idle&lt;br /&gt;
Partition dev_special   :    342 nodes idle&lt;br /&gt;
Partition special       :    342 nodes idle&lt;br /&gt;
Partition dev_multiple_e:      7 nodes idle&lt;br /&gt;
Partition multiple_e    :    335 nodes idle&lt;br /&gt;
Partition gpu_4         :     12 nodes idle&lt;br /&gt;
Partition gpu_8         :      6 nodes idle&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* For the above example, jobs in all partitions can be run immediately.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Detailed job information : scontrol show job ==&lt;br /&gt;
scontrol show job displays detailed job state information and diagnostic output for all or a specified job of yours. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the webpage https://slurm.schedmd.com/scontrol.html or via manpage (man scontrol). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of all your jobs in normal mode: scontrol show job&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Display the state of a job with &amp;lt;jobid&amp;gt; in normal mode: scontrol show job &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Access ===&lt;br /&gt;
* End users can use scontrol show job to view the status of their &#039;&#039;&#039;own jobs&#039;&#039;&#039; only. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Arguments ===&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Option !! Default !! Description !! Example&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
|- style=&amp;quot;width:12%;&amp;quot; &lt;br /&gt;
| -d&lt;br /&gt;
| (n/a)&lt;br /&gt;
| Detailed mode&lt;br /&gt;
| Example: Display the state with jobid 18089884 in detailed mode. &amp;lt;br&amp;gt; &amp;lt;pre&amp;gt;scontrol -d show job 18089884&amp;lt;/pre&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Scontrol show job Example ===&lt;br /&gt;
Here is an example from bwUniCluster 2.0.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
squeue    # show my own jobs (here the userid is replaced!)&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
          18089884  multiple CPV.sbat   bq0742  R      33:44      2 uc2n[165-166]&lt;br /&gt;
&lt;br /&gt;
$&lt;br /&gt;
$ # now, see what&#039;s up with my pending job with jobid 18089884&lt;br /&gt;
$ &lt;br /&gt;
$ scontrol show job 18089884&lt;br /&gt;
&lt;br /&gt;
JobId=18089884 JobName=CPV.sbatch&lt;br /&gt;
   UserId=bq0742(8946) GroupId=scc(12345) MCS_label=N/A&lt;br /&gt;
   Priority=3 Nice=0 Account=kit QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:35:06 TimeLimit=02:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2020-03-16T14:14:54 EligibleTime=2020-03-16T14:14:54&lt;br /&gt;
   AccrueTime=2020-03-16T14:14:54&lt;br /&gt;
   StartTime=2020-03-16T15:12:51 EndTime=2020-03-16T17:12:51 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-16T15:12:51&lt;br /&gt;
   Partition=multiple AllocNode:Sid=uc2n995:5064&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=uc2n[165-166]&lt;br /&gt;
   BatchHost=uc2n165&lt;br /&gt;
   NumNodes=2 NumCPUs=160 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:1&lt;br /&gt;
   TRES=cpu=160,mem=96320M,node=2,billing=160&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*&lt;br /&gt;
   MinCPUsNode=40 MinMemoryCPU=1204M MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/CPV.sbatch&lt;br /&gt;
   WorkDir=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin&lt;br /&gt;
   StdErr=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out&lt;br /&gt;
   StdIn=/dev/null&lt;br /&gt;
   StdOut=/pfs/data5/home/kit/scc/bq0742/git/CPV/bin/slurm-18089884.out&lt;br /&gt;
   Power=&lt;br /&gt;
   MailUser=(null) MailType=NONE&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.&lt;br /&gt;
* In which state is the job?&lt;br /&gt;
&amp;lt;pre&amp;gt;$ scontrol show job 18089884 | grep -i State&lt;br /&gt;
   JobState=COMPLETED Reason=None Dependency=(null)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Cancel Slurm Jobs ==&lt;br /&gt;
The scancel command is used to cancel jobs. The command scancel is explained in detail on the webpage https://slurm.schedmd.com/scancel.html or via manpage (man scancel).   &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Canceling own jobs : scancel ===&lt;br /&gt;
scancel is used to signal or cancel jobs, job arrays or job steps. The command is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel [-i] &amp;lt;job-id&amp;gt;&lt;br /&gt;
$ scancel -t &amp;lt;job_state_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Flag !! Default !! Description !! Example&lt;br /&gt;
|- style=&amp;quot;vertical-align:top;&amp;quot;&lt;br /&gt;
| -i, --interactive&lt;br /&gt;
| (n/a)&lt;br /&gt;
| Interactive mode.&lt;br /&gt;
| Cancel the job 987654 interactively. &amp;lt;br&amp;gt; &amp;lt;pre&amp;gt; scancel -i 987654 &amp;lt;/pre&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| -t, --state&lt;br /&gt;
| (n/a)&lt;br /&gt;
| Restrict the scancel operation to jobs in a certain state. &amp;lt;br&amp;gt; &amp;quot;job_state_name&amp;quot; may have a value of either &amp;quot;PENDING&amp;quot;, &amp;quot;RUNNING&amp;quot; or &amp;quot;SUSPENDED&amp;quot;.&lt;br /&gt;
| Cancel all jobs in state &amp;quot;PENDING&amp;quot;. &amp;lt;br&amp;gt; &amp;lt;pre&amp;gt; scancel -t &amp;quot;PENDING&amp;quot; &amp;lt;/pre&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
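For example, to cancel all of your own jobs at once you can additionally use the -u/--user flag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scancel -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;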
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Resource Managers =&lt;br /&gt;
=== Batch Job (Slurm) Variables ===&lt;br /&gt;
The following environment variables of Slurm are added to your environment once your job has started&lt;br /&gt;
&amp;lt;small&amp;gt;(only an excerpt of the most important ones)&amp;lt;/small&amp;gt;.&lt;br /&gt;
{| width=750px class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Environment !! Brief explanation&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_JOB_CPUS_PER_NODE &lt;br /&gt;
| Number of processes per node dedicated to the job&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_JOB_NODELIST &lt;br /&gt;
| List of nodes dedicated to the job&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_JOB_NUM_NODES &lt;br /&gt;
| Number of nodes dedicated to the job&lt;br /&gt;
|- &lt;br /&gt;
| SLURM_MEM_PER_NODE &lt;br /&gt;
| Memory per node dedicated to the job &lt;br /&gt;
|- &lt;br /&gt;
| SLURM_NPROCS&lt;br /&gt;
| Total number of processes dedicated to the job &lt;br /&gt;
|-&lt;br /&gt;
| SLURM_CLUSTER_NAME&lt;br /&gt;
| Name of the cluster executing the job&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_CPUS_PER_TASK &lt;br /&gt;
| Number of CPUs requested per task&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_ACCOUNT&lt;br /&gt;
| Account name &lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_ID&lt;br /&gt;
| Job ID&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_NAME&lt;br /&gt;
| Job Name&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_PARTITION&lt;br /&gt;
| Partition/queue running the job&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_UID&lt;br /&gt;
| User ID of the job&#039;s owner&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_SUBMIT_DIR&lt;br /&gt;
| Job submit folder.  The directory from which sbatch was invoked. &lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_USER&lt;br /&gt;
| User name of the job&#039;s owner&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_RESTART_COUNT&lt;br /&gt;
| Number of times job has restarted&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_PROCID&lt;br /&gt;
| Task ID (MPI rank)&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_NTASKS&lt;br /&gt;
| The total number of tasks available for the job&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_STEP_ID&lt;br /&gt;
| Job step ID&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_STEP_NUM_TASKS&lt;br /&gt;
| Task count (number of MPI ranks)&lt;br /&gt;
|-&lt;br /&gt;
| SLURM_JOB_CONSTRAINT&lt;br /&gt;
| Job constraints&lt;br /&gt;
|}&lt;br /&gt;
See also:&lt;br /&gt;
* [https://slurm.schedmd.com/sbatch.html#lbAI Slurm input and output environment variables]&lt;br /&gt;
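A small sketch showing how some of these variables can be used inside a job script (the resource requests are only placeholders):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=4&lt;br /&gt;
#SBATCH --time=10&lt;br /&gt;
&lt;br /&gt;
# These variables are only defined inside the running batch job&lt;br /&gt;
echo &amp;quot;Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) runs on ${SLURM_JOB_NUM_NODES} node(s): ${SLURM_JOB_NODELIST}&amp;quot;&lt;br /&gt;
echo &amp;quot;Partition: ${SLURM_JOB_PARTITION}, tasks: ${SLURM_NTASKS}, submitted from: ${SLURM_SUBMIT_DIR}&amp;quot;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;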
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Job Exit Codes ===&lt;br /&gt;
A job&#039;s exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of &amp;quot;NonZeroExitCode&amp;quot;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
==== Displaying Exit Codes and Signals ====&lt;br /&gt;
SLURM displays a job&#039;s exit code in the output of the &#039;&#039;&#039;scontrol show job&#039;&#039;&#039; and the sview utility.&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
When a signal was responsible for a job or step&#039;s termination, the signal number will be displayed after the exit code, delineated by a colon(:).&lt;br /&gt;
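For example, the exit code (and, if present, the terminating signal) of a job can be filtered from the scontrol output:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scontrol show job 18089884 | grep -i exitcode&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;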
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
==== Submitting Termination Signal ====&lt;br /&gt;
Here is an example of how to &#039;save&#039; a Slurm termination signal in a typical job script.&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
[...]&lt;br /&gt;
mpirun  -np &amp;lt;#cores&amp;gt;  &amp;lt;EXE_BIN_DIR&amp;gt;/&amp;lt;executable&amp;gt; ... (options)  2&amp;gt;&amp;amp;1&lt;br /&gt;
# Capture the exit code directly after the mpirun call&lt;br /&gt;
exit_code=$?&lt;br /&gt;
[ &amp;quot;$exit_code&amp;quot; -eq 0 ] &amp;amp;&amp;amp; echo &amp;quot;all clean...&amp;quot; || \&lt;br /&gt;
   echo &amp;quot;Executable &amp;lt;EXE_BIN_DIR&amp;gt;/&amp;lt;executable&amp;gt; finished with exit code ${exit_code}&amp;quot;&lt;br /&gt;
[...]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
* Do not use &#039;&#039;time&#039;&#039; mpirun! The exit code will then be the one returned by the first program (time).&lt;br /&gt;
* You do not need an &#039;&#039;&#039;exit $exit_code&#039;&#039;&#039; in the scripts.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
[[Category:bwUniCluster 2.0|bwUniCluster 2.0]]&lt;br /&gt;
[[#top|Back to top]]&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12363</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12363"/>
		<updated>2023-09-12T13:35:48Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, you start your calculations and they start immediately, run until they are finished, then your desktop does mostly nothing, until you start another calculation. A [[compute cluster]] has several hundred, maybe a thousand computers ([[compute node]]s), all of them are busy most of the time and many people want to run a great number of calculations. So running your job has to include some extra steps:&lt;br /&gt;
&lt;br /&gt;
# prepare a [[script]] (usually a shell script), with all the commands that are necessary to run your calculation from start to finish. In addition to the commands necessary to run the calculation, this &#039;&#039;[[batch script]]&#039;&#039; has a header section, in which you specify details like required [[compute core]]s, [[estimated runtime]], [[memory requirements]], disk space needed, etc.&lt;br /&gt;
# &#039;&#039;[[Submit]]&#039;&#039; the script into a [[queue]], where it is registered as a &#039;&#039;[[job]]&#039;&#039; (calculation). &lt;br /&gt;
# The job is queued and waits in line with other compute jobs until the resources you requested in the header become available. &lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
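&lt;br /&gt;
A minimal sketch of such a batch script for a Slurm system (the resource requests and the program call are placeholders and must be adapted to your calculation and to the cluster you use):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --ntasks=1          # number of compute cores&lt;br /&gt;
#SBATCH --time=60           # estimated runtime in minutes&lt;br /&gt;
#SBATCH --mem=2000          # memory requirement in MB&lt;br /&gt;
&lt;br /&gt;
# commands that run the calculation from start to finish&lt;br /&gt;
./my_calculation&lt;br /&gt;
# save the results back to your home directory&lt;br /&gt;
cp results.dat ${HOME}/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;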
&lt;br /&gt;
There are two types of [[batch system]]s currently used on bwHPC clusters, called &amp;quot;[[Moab]]&amp;quot; (legacy installs) and &amp;quot;[[Slurm]]&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== How to Use Computing Resources Efficiently ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When you are running your calculations, you will have to decide on how many compute cores your calculation will run simultaneously. &lt;br /&gt;
For this, your computational problem will have to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
Guidance on finding a reasonable number of compute cores for your calculation can be found under &#039;&#039;&#039;[[Scaling]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Information regarding the supported parallel programming paradigms and specific hints on their usage are summarized at &#039;&#039;&#039;[[Parallel_Programming]]&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Running calculations on an HPC node consumes a lot of energy. To make the most of the available resources and keep cluster and energy use as efficient as possible, please also see our advice for &#039;&#039;&#039;[[Energy Efficient Cluster Usage]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC&lt;br /&gt;
: short for &#039;&#039;&#039;H&#039;&#039;&#039;igh &#039;&#039;&#039;P&#039;&#039;&#039;erformance &#039;&#039;&#039;C&#039;&#039;&#039;omputing &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:Physical socket where the CPU capsules are placed.&lt;br /&gt;
&lt;br /&gt;
;Core&lt;br /&gt;
:The physical unit that can independently execute tasks on a CPU. Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
;Thread&lt;br /&gt;
:Logical unit that can be executed independently. &lt;br /&gt;
&lt;br /&gt;
;Hyperthreading&lt;br /&gt;
: Modern computers can be configured so that one real compute-[[core]] appears like two &amp;quot;logical&amp;quot; cores on the system. These two &amp;quot;hyperthreads&amp;quot; can sometimes do computations in parallel, if the calculations use two different sub-units of the compute-core - but most of the time, two calculations on two hyperthreads run on the same physical hardware and both run half as fast as if one thread had a full core. Some programs (e.g. gromacs) can profit from running with twice as many threads on hyperthreads and finish 10-20% faster if run in that way. &lt;br /&gt;
&lt;br /&gt;
;Multithreading&lt;br /&gt;
: Multithreading means that one computer program runs calculations on more than one compute-core using several logical &amp;quot;threads&amp;quot; of serial compute instructions to do so (eg. to work through different and independent data arrays in parallel). Specific types of  multithreaded parallelization are [[OpenMP]] or [[MPI]].&lt;br /&gt;
&lt;br /&gt;
;CPU&lt;br /&gt;
:Central Processing Unit. It performs the actual computation in a compute node. A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
;GPU&lt;br /&gt;
:Graphics Processing Unit. GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be explicitly designed to use GPUs. CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
;RAM &lt;br /&gt;
:Random Access Memory. It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
&lt;br /&gt;
; Moab&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
&lt;br /&gt;
; Slurm&lt;br /&gt;
&lt;br /&gt;
; Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;br /&gt;
&lt;br /&gt;
; Runtime&lt;br /&gt;
&lt;br /&gt;
; Scaling&lt;br /&gt;
&lt;br /&gt;
; Scheduler&lt;br /&gt;
&lt;br /&gt;
; Submit&lt;br /&gt;
&lt;br /&gt;
; Parallelization&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12362</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12362"/>
		<updated>2023-09-12T13:34:05Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* How to Use Computing Ressources Efficiently */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, you start your calculations and they start immediately, run until they are finished, then your desktop does mostly nothing, until you start another calculation. A [[compute cluster]] has several hundred, maybe a thousand computers ([[compute node]]s), all of them are busy most of the time and many people want to run a great number of calculations. So running your job has to include some extra steps:&lt;br /&gt;
&lt;br /&gt;
# prepare a [[script]] (usually a shell script), with all the commands that are necessary to run your calculation from start to finish. In addition to the commands necessary to run the calculation, this &#039;&#039;[[batch script]]&#039;&#039; has a header section, in which you specify details like required [[compute core]]s, [[estimated runtime]], [[memory requirements]], disk space needed, etc.&lt;br /&gt;
# &#039;&#039;[[Submit]]&#039;&#039; the script into a [[queue]], where it is registered as a &#039;&#039;[[job]]&#039;&#039; (calculation). &lt;br /&gt;
# The job is queued and waits in line with other compute jobs until the resources you requested in the header become available. &lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of [[batch system]]s currently used on bwHPC clusters, called &amp;quot;[[Moab]]&amp;quot; (legacy installs) and &amp;quot;[[Slurm]]&amp;quot;.&lt;br /&gt;
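&lt;br /&gt;
As an illustration only, a minimal batch script for one of the Slurm systems could look like the following sketch (the resource values, module and program names are placeholders; the exact options depend on the cluster, see the cluster-specific pages below):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# --- header section: resource requests read by the batch system ---&lt;br /&gt;
#SBATCH --ntasks=1              # number of tasks (processes)&lt;br /&gt;
#SBATCH --cpus-per-task=4       # compute cores per task&lt;br /&gt;
#SBATCH --time=02:00:00         # estimated runtime (walltime limit)&lt;br /&gt;
#SBATCH --mem=8G                # memory requirement&lt;br /&gt;
&lt;br /&gt;
# --- commands to run the calculation from start to finish ---&lt;br /&gt;
module load my_application      # placeholder module name&lt;br /&gt;
my_application input.dat &amp;gt; output.log&lt;br /&gt;
&lt;br /&gt;
# --- save results back to your home directory ---&lt;br /&gt;
cp output.log $HOME/results/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Such a script would typically be submitted with &amp;lt;code&amp;gt;sbatch jobscript.sh&amp;lt;/code&amp;gt; on the Slurm systems (the legacy Moab systems use &amp;lt;code&amp;gt;msub&amp;lt;/code&amp;gt; instead); see the cluster-specific documentation linked in the next section for the exact commands.&lt;br /&gt;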
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== How to Use Computing Resources Efficiently ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. &lt;br /&gt;
For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described under &#039;&#039;&#039;[[Scaling]]&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Information regarding the supported parallel programming paradigms and specific hints on their usage is summarized at &#039;&#039;&#039;[[Parallel_Programming]]&#039;&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
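As a sketch of what this means in practice (assuming a Slurm system; the program name is a placeholder), a multithreaded OpenMP calculation could request its cores in the batch script header and pass the requested number of cores on to the program:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --cpus-per-task=8       # one process with 8 threads&lt;br /&gt;
&lt;br /&gt;
# let the OpenMP program use exactly the requested cores&lt;br /&gt;
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK&lt;br /&gt;
./my_openmp_program             # placeholder executable&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
An MPI calculation would instead request several tasks (possibly on several nodes) and start the program with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;; details are given on the &#039;&#039;&#039;[[Parallel_Programming]]&#039;&#039;&#039; page.&lt;br /&gt;
&lt;br /&gt;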
Running calculations on an HPC node consumes a lot of energy. To make the most of the available resources and keep cluster and energy use as efficient as possible, please also see our advice for &#039;&#039;&#039;[[Energy Efficient Cluster Usage]]&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC&lt;br /&gt;
: short for &#039;&#039;&#039;H&#039;&#039;&#039;igh &#039;&#039;&#039;P&#039;&#039;&#039;erformance &#039;&#039;&#039;C&#039;&#039;&#039;omputing &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:The physical mount on the mainboard into which a CPU (processor package) is placed.&lt;br /&gt;
&lt;br /&gt;
;Core&lt;br /&gt;
:The physical unit that can independently execute tasks on a CPU. Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
;Thread&lt;br /&gt;
:Logical unit that can be executed independently. &lt;br /&gt;
&lt;br /&gt;
;Hyperthreading&lt;br /&gt;
: Modern computers can be configured so that one real compute [[core]] appears as two &amp;quot;logical&amp;quot; cores to the system. These two &amp;quot;hyperthreads&amp;quot; can sometimes perform computations in parallel if the calculations use two different sub-units of the core; most of the time, however, two calculations on two hyperthreads share the same physical hardware and each runs about half as fast as it would on a full core. Some programs (e.g. GROMACS) can profit from running with twice as many threads on hyperthreads and finish 10-20% faster that way. &lt;br /&gt;
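: As an illustration (assuming a Slurm system where hyperthreading is enabled; whether this helps is program-dependent, as noted above), a job could ask to use both hardware threads of each core with header options such as:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#SBATCH --cpus-per-task=16    # e.g. 8 physical cores with 2 hardware threads each&lt;br /&gt;
#SBATCH --hint=multithread    # allow the job to use both hyperthreads per core&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;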
&lt;br /&gt;
;Multithreading&lt;br /&gt;
: Multithreading means that one computer program runs calculations on more than one compute core, using several logical &amp;quot;threads&amp;quot; of serial compute instructions to do so (e.g. to work through different and independent data arrays in parallel). Typical frameworks for such parallelization are [[OpenMP]] (threads within one node) and [[MPI]] (separate processes, also across nodes).&lt;br /&gt;
&lt;br /&gt;
;CPU&lt;br /&gt;
:Central Processing Unit. It performs the actual computation in a compute node. A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
;GPU&lt;br /&gt;
:Graphics Processing Unit. GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be explicitly designed to use GPUs. CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
;RAM &lt;br /&gt;
:Random Access Memory. It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
&lt;br /&gt;
; Moab&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
&lt;br /&gt;
; Slurm&lt;br /&gt;
&lt;br /&gt;
; Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;br /&gt;
&lt;br /&gt;
; Runtime&lt;br /&gt;
&lt;br /&gt;
; Scaling&lt;br /&gt;
&lt;br /&gt;
; Scheduler&lt;br /&gt;
&lt;br /&gt;
; Submit&lt;br /&gt;
&lt;br /&gt;
; Parallelization&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12361</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12361"/>
		<updated>2023-09-12T13:32:33Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* Link to Batch System per Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, a calculation starts immediately when you launch it, runs until it is finished, and afterwards the machine sits mostly idle until you start the next one. A [[compute cluster]] has several hundred, maybe a thousand computers ([[compute node]]s); most of them are busy most of the time, and many people want to run a great number of calculations. So running your job involves some extra steps:&lt;br /&gt;
&lt;br /&gt;
# Prepare a [[script]] (usually a shell script) with all the commands that are necessary to run your calculation from start to finish. In addition to these commands, this &#039;&#039;[[batch script]]&#039;&#039; has a header section, in which you specify details like required [[compute core]]s, [[estimated runtime]], [[memory requirements]], disk space needed, etc.&lt;br /&gt;
# &#039;&#039;[[Submit]]&#039;&#039; the script into a [[queue]].&lt;br /&gt;
# Queueing: your &#039;&#039;[[job]]&#039;&#039; (calculation) waits in line with other compute jobs until the resources you requested in the header become available. &lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of [[batch system]]s currently used on bwHPC clusters, called &amp;quot;[[Moab]]&amp;quot; (legacy installs) and &amp;quot;[[Slurm]]&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== How to Use Computing Resources Efficiently ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described under &#039;&#039;&#039;[[Scaling]]&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Information regarding the supported parallel programming paradigms and specific hints on their usage is summarized at &#039;&#039;&#039;[[Parallel_Programming]]&#039;&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running calculations on an HPC node consumes a lot of energy. To make the most of the available resources and keep cluster and energy use as efficient as possible, please also see our advice for &#039;&#039;&#039;[[Energy Efficient Cluster Usage]]&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC&lt;br /&gt;
: short for &#039;&#039;&#039;H&#039;&#039;&#039;igh &#039;&#039;&#039;P&#039;&#039;&#039;erformance &#039;&#039;&#039;C&#039;&#039;&#039;omputing &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:The physical mount on the mainboard into which a CPU (processor package) is placed.&lt;br /&gt;
&lt;br /&gt;
;Core&lt;br /&gt;
:The physical unit that can independently execute tasks on a CPU. Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
;Thread&lt;br /&gt;
:Logical unit that can be executed independently. &lt;br /&gt;
&lt;br /&gt;
;Hyperthreading&lt;br /&gt;
: Modern computers can be configured so that one real compute [[core]] appears as two &amp;quot;logical&amp;quot; cores to the system. These two &amp;quot;hyperthreads&amp;quot; can sometimes perform computations in parallel if the calculations use two different sub-units of the core; most of the time, however, two calculations on two hyperthreads share the same physical hardware and each runs about half as fast as it would on a full core. Some programs (e.g. GROMACS) can profit from running with twice as many threads on hyperthreads and finish 10-20% faster that way. &lt;br /&gt;
&lt;br /&gt;
;Multithreading&lt;br /&gt;
: Multithreading means that one computer program runs calculations on more than one compute core, using several logical &amp;quot;threads&amp;quot; of serial compute instructions to do so (e.g. to work through different and independent data arrays in parallel). Typical frameworks for such parallelization are [[OpenMP]] (threads within one node) and [[MPI]] (separate processes, also across nodes).&lt;br /&gt;
&lt;br /&gt;
;CPU&lt;br /&gt;
:Central Processing Unit. It performs the actual computation in a compute node. A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
;GPU&lt;br /&gt;
:Graphics Processing Unit. GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be explicitly designed to use GPUs. CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
;RAM &lt;br /&gt;
:Random Access Memory. It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
&lt;br /&gt;
; Moab&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
&lt;br /&gt;
; Slurm&lt;br /&gt;
&lt;br /&gt;
; Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;br /&gt;
&lt;br /&gt;
; Runtime&lt;br /&gt;
&lt;br /&gt;
; Scaling&lt;br /&gt;
&lt;br /&gt;
; Scheduler&lt;br /&gt;
&lt;br /&gt;
; Submit&lt;br /&gt;
&lt;br /&gt;
; Parallelization&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12356</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12356"/>
		<updated>2023-09-12T13:09:09Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, a calculation starts immediately when you launch it, runs until it is finished, and afterwards the machine sits mostly idle until you start the next one. A compute cluster has several hundred, maybe a thousand computers (compute nodes); most of them are busy most of the time, and many people want to run a great number of calculations. So running your job involves some extra steps:&lt;br /&gt;
&lt;br /&gt;
# Prepare a script (usually a shell script) with all the commands that are necessary to run your calculation from start to finish. In addition to these commands, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script into a queue.&lt;br /&gt;
# Queueing: your &#039;&#039;job&#039;&#039; (calculation) waits in line with other compute jobs until the resources you requested in the header become available. &lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC&lt;br /&gt;
: short for &#039;&#039;&#039;H&#039;&#039;&#039;igh &#039;&#039;&#039;P&#039;&#039;&#039;erformance &#039;&#039;&#039;C&#039;&#039;&#039;omputing &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:The physical mount on the mainboard into which a CPU (processor package) is placed.&lt;br /&gt;
&lt;br /&gt;
;Core&lt;br /&gt;
:The physical unit that can independently execute tasks on a CPU. Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
;Thread&lt;br /&gt;
:Logical unit that can be executed independently. &lt;br /&gt;
&lt;br /&gt;
;Hyperthreading&lt;br /&gt;
:A hardware feature by which a single physical CPU core presents itself to the system as two &amp;quot;logical&amp;quot; cores and can execute two execution paths (threads) in parallel, which can improve throughput for some workloads.&lt;br /&gt;
&lt;br /&gt;
;Multithreading&lt;br /&gt;
:In contrast to hardware hyperthreads, multithreading refers to software threads, i.e. &amp;quot;parallel execution paths&amp;quot; within the same program (e.g. to work through different and independent data arrays in parallel). See OpenMP.&lt;br /&gt;
&lt;br /&gt;
;CPU&lt;br /&gt;
:Central Processing Unit. It performs the actual computation in a compute node. A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
;GPU&lt;br /&gt;
:Graphics Processing Unit. GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be explicitly designed to use GPUs. CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
;RAM &lt;br /&gt;
:Random Access Memory. It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
&lt;br /&gt;
; Moab&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
&lt;br /&gt;
; Slurm&lt;br /&gt;
&lt;br /&gt;
; Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;br /&gt;
&lt;br /&gt;
; Runtime&lt;br /&gt;
&lt;br /&gt;
; Scaling&lt;br /&gt;
&lt;br /&gt;
; Scheduler&lt;br /&gt;
&lt;br /&gt;
; Submit&lt;br /&gt;
&lt;br /&gt;
; Parallelization&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12355</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12355"/>
		<updated>2023-09-12T13:05:01Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* Scaling */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, a calculation starts immediately when you launch it, runs until it is finished, and afterwards the machine sits mostly idle until you start the next one. A compute cluster has several hundred, maybe a thousand computers (compute nodes); most of them are busy most of the time, and many people want to run a great number of calculations. So running your job involves some extra steps:&lt;br /&gt;
&lt;br /&gt;
# Prepare a script (usually a shell script) with all the commands that are necessary to run your calculation from start to finish. In addition to these commands, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script into a queue.&lt;br /&gt;
# Queueing: your &#039;&#039;job&#039;&#039; (calculation) waits in line with other compute jobs until the resources you requested in the header become available. &lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== How to Use Computing Resources Efficiently ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described under &#039;&#039;&#039;[[Scaling]]&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Information regarding the supported parallel programming paradigms and specific hints on their usage is summarized at &#039;&#039;&#039;[[Parallel_Programming]]&#039;&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running calculations on an HPC node consumes a lot of energy. To make the most of the available resources and keep cluster and energy use as efficient as possible, please also see our advice for &#039;&#039;&#039;[[Energy Efficient Cluster Usage]]&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC&lt;br /&gt;
: short for &#039;&#039;&#039;H&#039;&#039;&#039;igh &#039;&#039;&#039;P&#039;&#039;&#039;erformance &#039;&#039;&#039;C&#039;&#039;&#039;omputing &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:The physical mount on the mainboard into which a CPU (processor package) is placed.&lt;br /&gt;
&lt;br /&gt;
;Core&lt;br /&gt;
:The physical unit that can independently execute tasks on a CPU. Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
;Thread&lt;br /&gt;
:Logical unit that can be executed independently. &lt;br /&gt;
&lt;br /&gt;
;Hyperthreading&lt;br /&gt;
:A hardware feature by which a single physical CPU core presents itself to the system as two &amp;quot;logical&amp;quot; cores and can execute two execution paths (threads) in parallel, which can improve throughput for some workloads.&lt;br /&gt;
&lt;br /&gt;
;Multithreading&lt;br /&gt;
:In contrast to hardware hyperthreads, multithreading refers to software threads, i.e. &amp;quot;parallel execution paths&amp;quot; within the same program (e.g. to work through different and independent data arrays in parallel). See OpenMP.&lt;br /&gt;
&lt;br /&gt;
;CPU&lt;br /&gt;
:Central Processing Unit. It performs the actual computation in a compute node. A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
;GPU&lt;br /&gt;
:Graphics Processing Unit. GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be explicitly designed to use GPUs. CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
;RAM &lt;br /&gt;
:Random Access Memory. It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
&lt;br /&gt;
; Moab&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
&lt;br /&gt;
; Slurm&lt;br /&gt;
&lt;br /&gt;
; Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;br /&gt;
&lt;br /&gt;
; Runtime&lt;br /&gt;
&lt;br /&gt;
; Scaling&lt;br /&gt;
&lt;br /&gt;
; Scheduler&lt;br /&gt;
&lt;br /&gt;
; Submit&lt;br /&gt;
&lt;br /&gt;
; Parallelization&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12354</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12354"/>
		<updated>2023-09-12T13:03:44Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, a calculation starts immediately when you launch it, runs until it is finished, and afterwards the machine sits mostly idle until you start the next one. A compute cluster has several hundred, maybe a thousand computers (compute nodes); most of them are busy most of the time, and many people want to run a great number of calculations. So running your job involves some extra steps:&lt;br /&gt;
&lt;br /&gt;
# Prepare a script (usually a shell script) with all the commands that are necessary to run your calculation from start to finish. In addition to these commands, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script into a queue.&lt;br /&gt;
# Queueing: your &#039;&#039;job&#039;&#039; (calculation) waits in line with other compute jobs until the resources you requested in the header become available. &lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC&lt;br /&gt;
: short for &#039;&#039;&#039;H&#039;&#039;&#039;igh &#039;&#039;&#039;P&#039;&#039;&#039;erformance &#039;&#039;&#039;C&#039;&#039;&#039;omputing &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:The physical mount on the mainboard into which a CPU (processor package) is placed.&lt;br /&gt;
&lt;br /&gt;
;Core&lt;br /&gt;
:The physical unit that can independently execute tasks on a CPU. Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
;Thread&lt;br /&gt;
:Logical unit that can be executed independently. &lt;br /&gt;
&lt;br /&gt;
;Hyperthreading&lt;br /&gt;
:A hardware feature by which a single physical CPU core presents itself to the system as two &amp;quot;logical&amp;quot; cores and can execute two execution paths (threads) in parallel, which can improve throughput for some workloads.&lt;br /&gt;
&lt;br /&gt;
;Multithreading&lt;br /&gt;
:In contrast to hardware hyperthreads, multithreading refers to software threads, i.e. &amp;quot;parallel execution paths&amp;quot; within the same program (e.g. to work through different and independent data arrays in parallel). See OpenMP.&lt;br /&gt;
&lt;br /&gt;
;CPU&lt;br /&gt;
:Central Processing Unit. It performs the actual computation in a compute node. A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
;GPU&lt;br /&gt;
:Graphics Processing Unit. GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be explicitly designed to use GPUs. CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
;RAM &lt;br /&gt;
:Random Access Memory. It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
&lt;br /&gt;
; Moab&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
&lt;br /&gt;
; Slurm&lt;br /&gt;
&lt;br /&gt;
; Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;br /&gt;
&lt;br /&gt;
; Runtime&lt;br /&gt;
&lt;br /&gt;
; Scaling&lt;br /&gt;
&lt;br /&gt;
; Scheduler&lt;br /&gt;
&lt;br /&gt;
; Submit&lt;br /&gt;
&lt;br /&gt;
; Parallelization&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12353</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12353"/>
		<updated>2023-09-12T13:03:33Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, a calculation starts immediately when you launch it, runs until it is finished, and afterwards the machine sits mostly idle until you start the next one. A compute cluster has several hundred, maybe a thousand computers (compute nodes); most of them are busy most of the time, and many people want to run a great number of calculations. So running your job involves some extra steps:&lt;br /&gt;
&lt;br /&gt;
# Prepare a script (usually a shell script) with all the commands that are necessary to run your calculation from start to finish. In addition to these commands, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script into a queue.&lt;br /&gt;
# Queueing: your &#039;&#039;job&#039;&#039; (calculation) waits in line with other compute jobs until the resources you requested in the header become available. &lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC&lt;br /&gt;
: short for &#039;&#039;&#039;H&#039;&#039;&#039;igh &#039;&#039;&#039;P&#039;&#039;&#039;erformance &#039;&#039;&#039;C&#039;&#039;&#039;omputing &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:The physical mount on the mainboard into which a CPU (processor package) is placed.&lt;br /&gt;
&lt;br /&gt;
;Core&lt;br /&gt;
:The physical unit that can independently execute tasks on a CPU. Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
;Thread&lt;br /&gt;
:Logical unit that can be executed independently. &lt;br /&gt;
&lt;br /&gt;
;Hyperthreading&lt;br /&gt;
:A hardware feature by which a single physical CPU core presents itself to the system as two &amp;quot;logical&amp;quot; cores and can execute two execution paths (threads) in parallel, which can improve throughput for some workloads.&lt;br /&gt;
&lt;br /&gt;
;Multithreading&lt;br /&gt;
:In contrast to hardware hyperthreads, multithreading refers to software threads, i.e. &amp;quot;parallel execution paths&amp;quot; within the same program (e.g. to work through different and independent data arrays in parallel). See OpenMP.&lt;br /&gt;
&lt;br /&gt;
;CPU&lt;br /&gt;
:Central Processing Unit. It performs the actual computation in a compute node. A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
;GPU&lt;br /&gt;
:Graphics Processing Unit. GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be explicitly designed to use GPUs. CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
;RAM &lt;br /&gt;
:Random Access Memory. It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
&lt;br /&gt;
; Moab&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
&lt;br /&gt;
; Slurm&lt;br /&gt;
&lt;br /&gt;
; Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;br /&gt;
&lt;br /&gt;
; Runtime&lt;br /&gt;
&lt;br /&gt;
; Scaling&lt;br /&gt;
&lt;br /&gt;
; Scheduler&lt;br /&gt;
&lt;br /&gt;
; Submit&lt;br /&gt;
&lt;br /&gt;
; Parallelization&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12349</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12349"/>
		<updated>2023-09-12T12:13:51Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, a calculation starts immediately when you launch it, runs until it is finished, and afterwards the machine sits mostly idle until you start the next one. A compute cluster has several hundred, maybe a thousand computers (compute nodes); most of them are busy most of the time, and many people want to run a great number of calculations. So running your job involves some extra steps:&lt;br /&gt;
&lt;br /&gt;
# Prepare a script (usually a shell script) with all the commands that are necessary to run your calculation from start to finish. In addition to these commands, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script into a queue.&lt;br /&gt;
# Queueing: your &#039;&#039;job&#039;&#039; (calculation) waits in line with other compute jobs until the resources you requested in the header become available. &lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:The physical mount on the mainboard into which a CPU (processor package) is placed.&lt;br /&gt;
&lt;br /&gt;
;Core&lt;br /&gt;
:The physical unit that can independently execute tasks on a CPU. Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
;Thread&lt;br /&gt;
:Logical unit that can be executed independently. &lt;br /&gt;
&lt;br /&gt;
;Hyperthreading&lt;br /&gt;
:A hardware feature by which a single physical CPU core presents itself to the system as two &amp;quot;logical&amp;quot; cores and can execute two execution paths (threads) in parallel, which can improve throughput for some workloads.&lt;br /&gt;
&lt;br /&gt;
;Multithreading&lt;br /&gt;
:In contrast to hardware hyperthreads, multithreading refers to software threads, i.e. &amp;quot;parallel execution paths&amp;quot; within the same program (e.g. to work through different and independent data arrays in parallel). See OpenMP.&lt;br /&gt;
&lt;br /&gt;
;CPU&lt;br /&gt;
:Central Processing Unit. It performs the actual computation in a compute node. A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
;GPU&lt;br /&gt;
:Graphics Processing Unit. GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be explicitly designed to use GPUs. CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
;RAM &lt;br /&gt;
:Random Access Memory. It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
: incl. Scheduler, Moab, Slurm&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
:Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;br /&gt;
&lt;br /&gt;
; Runtime&lt;br /&gt;
&lt;br /&gt;
; Scaling&lt;br /&gt;
&lt;br /&gt;
; Parallelization&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12348</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12348"/>
		<updated>2023-09-12T12:13:36Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, a calculation starts immediately when you launch it, runs until it is finished, and afterwards the machine sits mostly idle until you start the next one. A compute cluster has several hundred, maybe a thousand computers (compute nodes); most of them are busy most of the time, and many people want to run a great number of calculations. So running your job involves some extra steps:&lt;br /&gt;
&lt;br /&gt;
# Prepare a script (usually a shell script) with all the commands that are necessary to run your calculation from start to finish. In addition to these commands, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script into a queue.&lt;br /&gt;
# Queueing: your &#039;&#039;job&#039;&#039; (calculation) waits in line with other compute jobs until the resources you requested in the header become available. &lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:The physical mount on the mainboard into which a CPU (processor package) is placed.&lt;br /&gt;
&lt;br /&gt;
;Core&lt;br /&gt;
:The physical unit that can independently execute tasks on a CPU. Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
;Thread&lt;br /&gt;
:Logical unit that can be executed independently. &lt;br /&gt;
&lt;br /&gt;
;Hyperthreading&lt;br /&gt;
:A hardware feature by which a single physical CPU core presents itself to the system as two &amp;quot;logical&amp;quot; cores and can execute two execution paths (threads) in parallel, which can improve throughput for some workloads.&lt;br /&gt;
&lt;br /&gt;
;Multithreading&lt;br /&gt;
:In contrast to hardware hyperthreads, multithreading refers to software threads, i.e. &amp;quot;parallel execution paths&amp;quot; within the same program (e.g. to work through different and independent data arrays in parallel). See OpenMP.&lt;br /&gt;
&lt;br /&gt;
;CPU&lt;br /&gt;
:Central Processing Unit. It performs the actual computation in a compute node. A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
;GPU&lt;br /&gt;
:Graphics Processing Unit. GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be explicitly designed to use GPUs. CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
;RAM &lt;br /&gt;
:Random Access Memory. It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
: incl. Scheduler, Moab, Slurm&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
:Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;br /&gt;
&lt;br /&gt;
; Runtime&lt;br /&gt;
&lt;br /&gt;
; Scaling&lt;br /&gt;
&lt;br /&gt;
; Parallelization&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12347</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12347"/>
		<updated>2023-09-12T12:10:17Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, a calculation starts immediately when you launch it, runs until it is finished, and afterwards the machine sits mostly idle until you start the next one. A compute cluster has several hundred, maybe a thousand computers (compute nodes); most of them are busy most of the time, and many people want to run a great number of calculations. So running your job involves some extra steps:&lt;br /&gt;
&lt;br /&gt;
# Prepare a script (usually a shell script) with all the commands that are necessary to run your calculation from start to finish. In addition to these commands, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script into a queue.&lt;br /&gt;
# Queueing: your &#039;&#039;job&#039;&#039; (calculation) waits in line with other compute jobs until the resources you requested in the header become available. &lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you will have to decide how many compute cores your calculation will use simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead. &lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:The physical mount on the mainboard into which a CPU (processor package) is placed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Core&#039;&#039;&#039;: The physical unit that can independently execute tasks on a CPU. &lt;br /&gt;
Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Thread&#039;&#039;&#039;: A logical unit that can be executed independently. &lt;br /&gt;
If the same processor core is configured to execute two execution paths in parallel, this is called &#039;&#039;&#039;Hyperthreading&#039;&#039;&#039; (a hardware setting). &lt;br /&gt;
Alternatively, there can be software threads, i.e. &amp;quot;parallel execution paths&amp;quot; within the same program (e.g. to work through different and independent data arrays in parallel). &lt;br /&gt;
This is referred to as &#039;&#039;&#039;Multithreading&#039;&#039;&#039;, e.g. via OpenMP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CPU&#039;&#039;&#039;: Central Processing Unit. &lt;br /&gt;
It performs the actual computation in a compute node. &lt;br /&gt;
A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPU&#039;&#039;&#039;: Graphics Processing Unit. &lt;br /&gt;
GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in the fields of Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be designed specifically to use GPUs. &lt;br /&gt;
CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;RAM&#039;&#039;&#039;: Random Access Memory. &lt;br /&gt;
It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
: incl. Scheduler, Moab, Slurm&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
:Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;br /&gt;
&lt;br /&gt;
; Runtime&lt;br /&gt;
&lt;br /&gt;
; Scaling&lt;br /&gt;
&lt;br /&gt;
; Parallelization&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12346</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12346"/>
		<updated>2023-09-12T12:08:48Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, you start your calculations and they start immediately, run until they are finished, and then your desktop does mostly nothing until you start another calculation. A compute cluster has several hundred, maybe a thousand computers (compute nodes), all of which are busy most of the time, and many people want to run a great number of calculations. So running your job involves some extra steps:&lt;br /&gt;
&lt;br /&gt;
# Prepare a script (usually a shell script) with all the commands that are necessary to run your calculation from start to finish. In addition to the commands necessary to run the calculation, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc. (see the example below).&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script to a queue; from this point on, your calculation is called a &#039;&#039;job&#039;&#039;.&lt;br /&gt;
# The job is queued and waits in line with other compute jobs until the resources you requested in the header become available.&lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
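&lt;br /&gt;
A minimal sketch of such a batch script for a Slurm system is shown below; the resource values, the module name and the program name are placeholders and differ between clusters and applications. The &#039;&#039;#SBATCH&#039;&#039; lines form the header section described above (here: one task with four cores, two hours of runtime and 8 GB of memory).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# header section: resource requests read by the batch system&lt;br /&gt;
#SBATCH --job-name=my_calc&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --cpus-per-task=4&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --mem=8G&lt;br /&gt;
&lt;br /&gt;
# commands that run the calculation&lt;br /&gt;
module load my_software         # load the required software environment (placeholder name)&lt;br /&gt;
./my_calculation input.dat      # run the actual calculation&lt;br /&gt;
cp results.dat $HOME/results/   # save results back to the home directory&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Such a script is handed to the batch system with &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;; the exact options are described on the cluster-specific pages linked below.&lt;br /&gt;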
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you have to decide on how many compute cores your calculation will run simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead.&lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:Physical socket where the CPU capsules are placed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Core&#039;&#039;&#039;: The physical unit that can independently execute tasks on a CPU. &lt;br /&gt;
Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Thread&#039;&#039;&#039;: A logical unit that can be executed independently. &lt;br /&gt;
If the same processor core is configured to execute one or more execution paths in parallel it is called &#039;&#039;&#039;Hyperthreading&#039;&#039;&#039; (a hardware setting). &lt;br /&gt;
Alternatively, there can be software threads that can be understood in the context of &amp;quot;parallel execution paths&amp;quot; within the same program (eg. to work through different and independent data arrays in parallel). &lt;br /&gt;
This is referred to as &#039;&#039;&#039;Multithreading&#039;&#039;&#039;, eg. via OpenMP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CPU&#039;&#039;&#039;: Central Processing Unit. &lt;br /&gt;
It performs the actual computation in a compute node. &lt;br /&gt;
A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPU&#039;&#039;&#039;: Graphics Processing Unit. &lt;br /&gt;
GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in the fields of Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be designed specifically to use GPUs. &lt;br /&gt;
CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;RAM&#039;&#039;&#039;: Random Access Memory. &lt;br /&gt;
It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
: incl. Scheduler, Moab, Slurm&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
:Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;br /&gt;
&lt;br /&gt;
; runtime&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12345</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12345"/>
		<updated>2023-09-12T12:07:51Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, you start your calculations and they start immediately, run until they are finished, then your desktop does mostly nothing, until you start another calculation. A compute cluster has several hundred, maybe a thousand computers (compute nodes), all of them are busy most of the time and many people want to run a great number of calculations. So running your job has to include some extra steps:&lt;br /&gt;
&lt;br /&gt;
# prepare a script (usually a shell script), with all the commands that are necessary to run your calculation from start to finish. In addition to the commands necessary to run the calculation, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script to a queue; from this point on, your calculation is called a &#039;&#039;job&#039;&#039;.&lt;br /&gt;
# The job is queued and waits in line with other compute jobs until the resources you requested in the header become available.&lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you have to decide on how many compute cores your calculation will run simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead.&lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:Physical socket where the CPU capsules are placed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Core&#039;&#039;&#039;: The physical unit that can independently execute tasks on a CPU. &lt;br /&gt;
Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Thread&#039;&#039;&#039;: A logical unit that can be executed independently. &lt;br /&gt;
If the same processor core is configured to execute one or more execution paths in parallel it is called &#039;&#039;&#039;Hyperthreading&#039;&#039;&#039; (a hardware setting). &lt;br /&gt;
Alternatively, there can be software threads that can be understood in the context of &amp;quot;parallel execution paths&amp;quot; within the same program (eg. to work through different and independent data arrays in parallel). &lt;br /&gt;
This is referred to as &#039;&#039;&#039;Multithreading&#039;&#039;&#039;, eg. via OpenMP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CPU&#039;&#039;&#039;: Central Processing Unit. &lt;br /&gt;
It performs the actual computation in a compute node. &lt;br /&gt;
A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPU&#039;&#039;&#039;: Graphics Processing Unit. &lt;br /&gt;
GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in the fields of Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be designed specifically to use GPUs. &lt;br /&gt;
CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;RAM&#039;&#039;&#039;: Random Access Memory. &lt;br /&gt;
It is used as the working memory for the cores.&lt;br /&gt;
&lt;br /&gt;
;Batch System&lt;br /&gt;
: incl. Scheduler, Moab, Slurm&lt;br /&gt;
&lt;br /&gt;
; Script &lt;br /&gt;
:Shell Script / Bash&lt;br /&gt;
&lt;br /&gt;
; Job&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12344</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12344"/>
		<updated>2023-09-12T11:42:09Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, you start your calculations and they start immediately, run until they are finished, then your desktop does mostly nothing, until you start another calculation. A compute cluster has several hundred, maybe a thousand computers (compute nodes), all of them are busy most of the time and many people want to run a great number of calculations. So running your job has to include some extra steps:&lt;br /&gt;
&lt;br /&gt;
# prepare a script (usually a shell script), with all the commands that are necessary to run your calculation from start to finish. In addition to the commands necessary to run the calculation, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script to a queue; from this point on, your calculation is called a &#039;&#039;job&#039;&#039;.&lt;br /&gt;
# The job is queued and waits in line with other compute jobs until the resources you requested in the header become available.&lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you have to decide on how many compute cores your calculation will run simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead.&lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
;Node&lt;br /&gt;
:An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
;Socket&lt;br /&gt;
:Physical socket where the CPU capsules are placed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Core&#039;&#039;&#039;: The physical unit that can independently execute tasks on a CPU. &lt;br /&gt;
Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Thread&#039;&#039;&#039;: A logical unit that can be executed independently. &lt;br /&gt;
If the same processor core is configured to execute one or more execution paths in parallel it is called &#039;&#039;&#039;Hyperthreading&#039;&#039;&#039; (a hardware setting). &lt;br /&gt;
Alternatively, there can be software threads that can be understood in the context of &amp;quot;parallel execution paths&amp;quot; within the same program (eg. to work through different and independent data arrays in parallel). &lt;br /&gt;
This is referred to as &#039;&#039;&#039;Multithreading&#039;&#039;&#039;, eg. via OpenMP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CPU&#039;&#039;&#039;: Central Processing Unit. &lt;br /&gt;
It performs the actual computation in a compute node. &lt;br /&gt;
A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPU&#039;&#039;&#039;: Graphics Processing Unit. &lt;br /&gt;
GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in the fields of Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be designed specifically to use GPUs. &lt;br /&gt;
CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;RAM&#039;&#039;&#039;: Random Access Memory. &lt;br /&gt;
It is used as the working memory for the cores.&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12343</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12343"/>
		<updated>2023-09-12T11:41:34Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, you start your calculations and they start immediately, run until they are finished, then your desktop does mostly nothing, until you start another calculation. A compute cluster has several hundred, maybe a thousand computers (compute nodes), all of them are busy most of the time and many people want to run a great number of calculations. So running your job has to include some extra steps:&lt;br /&gt;
&lt;br /&gt;
# prepare a script (usually a shell script), with all the commands that are necessary to run your calculation from start to finish. In addition to the commands necessary to run the calculation, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script to a queue; from this point on, your calculation is called a &#039;&#039;job&#039;&#039;.&lt;br /&gt;
# The job is queued and waits in line with other compute jobs until the resources you requested in the header become available.&lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you have to decide on how many compute cores your calculation will run simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead.&lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Node&#039;&#039;&#039;: An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Socket&#039;&#039;&#039;: Physical socket where the CPU capsules are placed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Core&#039;&#039;&#039;: The physical unit that can independently execute tasks on a CPU. &lt;br /&gt;
Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Thread&#039;&#039;&#039;: A logical unit that can be executed independently. &lt;br /&gt;
If the same processor core is configured to execute one or more execution paths in parallel it is called &#039;&#039;&#039;Hyperthreading&#039;&#039;&#039; (a hardware setting). &lt;br /&gt;
Alternatively, there can be software threads that can be understood in the context of &amp;quot;parallel execution paths&amp;quot; within the same program (eg. to work through different and independent data arrays in parallel). &lt;br /&gt;
This is referred to as &#039;&#039;&#039;Multithreading&#039;&#039;&#039;, eg. via OpenMP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CPU&#039;&#039;&#039;: Central Processing Unit. &lt;br /&gt;
It performs the actual computation in a compute node. &lt;br /&gt;
A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPU&#039;&#039;&#039;: Graphics Processing Unit. &lt;br /&gt;
GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in the fields of Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be designed specifically to use GPUs. &lt;br /&gt;
CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;RAM&#039;&#039;&#039;: Random Access Memory. &lt;br /&gt;
It is used as the working memory for the cores.&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12342</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12342"/>
		<updated>2023-09-12T11:41:24Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, you start your calculations and they start immediately, run until they are finished, then your desktop does mostly nothing, until you start another calculation. A compute cluster has several hundred, maybe a thousand computers (compute nodes), all of them are busy most of the time and many people want to run a great number of calculations. So running your job has to include some extra steps:&lt;br /&gt;
&lt;br /&gt;
# prepare a script (usually a shell script), with all the commands that are necessary to run your calculation from start to finish. In addition to the commands necessary to run the calculation, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script to a queue; from this point on, your calculation is called a &#039;&#039;job&#039;&#039;.&lt;br /&gt;
# The job is queued and waits in line with other compute jobs until the resources you requested in the header become available.&lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you have to decide on how many compute cores your calculation will run simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead.&lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Node&#039;&#039;&#039;: An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Socket&#039;&#039;&#039;: Physical socket where the CPU capsules are placed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Core&#039;&#039;&#039;: The physical unit that can independently execute tasks on a CPU. &lt;br /&gt;
Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Thread&#039;&#039;&#039;: A logical unit that can be executed independently. &lt;br /&gt;
If the same processor core is configured to execute one or more execution paths in parallel it is called &#039;&#039;&#039;Hyperthreading&#039;&#039;&#039; (a hardware setting). &lt;br /&gt;
Alternatively, there can be software threads that can be understood in the context of &amp;quot;parallel execution paths&amp;quot; within the same program (eg. to work through different and independent data arrays in parallel). &lt;br /&gt;
This is referred to as &#039;&#039;&#039;Multithreading&#039;&#039;&#039;, eg. via OpenMP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CPU&#039;&#039;&#039;: Central Processing Unit. &lt;br /&gt;
It performs the actual computation in a compute node. &lt;br /&gt;
A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPU&#039;&#039;&#039;: Graphics Processing Unit. &lt;br /&gt;
GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in the fields of Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be designed specifically to use GPUs. &lt;br /&gt;
CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;RAM&#039;&#039;&#039;: Random Access Memory. &lt;br /&gt;
It is used as the working memory for the cores.&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12341</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12341"/>
		<updated>2023-09-12T11:41:12Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, you start your calculations and they start immediately, run until they are finished, then your desktop does mostly nothing, until you start another calculation. A compute cluster has several hundred, maybe a thousand computers (compute nodes), all of them are busy most of the time and many people want to run a great number of calculations. So running your job has to include some extra steps:&lt;br /&gt;
&lt;br /&gt;
# prepare a script (usually a shell script), with all the commands that are necessary to run your calculation from start to finish. In addition to the commands necessary to run the calculation, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script to a queue; from this point on, your calculation is called a &#039;&#039;job&#039;&#039;.&lt;br /&gt;
# The job is queued and waits in line with other compute jobs until the resources you requested in the header become available.&lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you have to decide on how many compute cores your calculation will run simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead.&lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Node&#039;&#039;&#039;: An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Socket&#039;&#039;&#039;: Physical socket where the CPU capsules are placed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Core&#039;&#039;&#039;: The physical unit that can independently execute tasks on a CPU. &lt;br /&gt;
Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Thread&#039;&#039;&#039;: A logical unit that can be executed independently. &lt;br /&gt;
If the same processor core is configured to execute one or more execution paths in parallel it is called &#039;&#039;&#039;Hyperthreading&#039;&#039;&#039; (a hardware setting). &lt;br /&gt;
Alternatively, there can be software threads that can be understood in the context of &amp;quot;parallel execution paths&amp;quot; within the same program (eg. to work through different and independent data arrays in parallel). &lt;br /&gt;
This is referred to as &#039;&#039;&#039;Multithreading&#039;&#039;&#039;, eg. via OpenMP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CPU&#039;&#039;&#039;: Central Processing Unit. &lt;br /&gt;
It performs the actual computation in a compute node. &lt;br /&gt;
A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPU&#039;&#039;&#039;: Graphics Processing Unit. &lt;br /&gt;
GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in the fields of Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be designed specifically to use GPUs. &lt;br /&gt;
CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;RAM&#039;&#039;&#039;: Random Access Memory. &lt;br /&gt;
It is used as the working memory for the cores.&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12340</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12340"/>
		<updated>2023-09-12T11:39:49Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* HPC Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, you start your calculations and they start immediately, run until they are finished, then your desktop does mostly nothing, until you start another calculation. A compute cluster has several hundred, maybe a thousand computers (compute nodes), all of them are busy most of the time and many people want to run a great number of calculations. So running your job has to include some extra steps:&lt;br /&gt;
&lt;br /&gt;
# prepare a script (usually a shell script), with all the commands that are necessary to run your calculation from start to finish. In addition to the commands necessary to run the calculation, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script to a queue; from this point on, your calculation is called a &#039;&#039;job&#039;&#039;.&lt;br /&gt;
# The job is queued and waits in line with other compute jobs until the resources you requested in the header become available.&lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you have to decide on how many compute cores your calculation will run simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead.&lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;HPC Cluster&lt;br /&gt;
:Collection of compute nodes with (usually) high bandwidth and low latency communication. &lt;br /&gt;
:They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Node&#039;&#039;&#039;: An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Socket&#039;&#039;&#039;: Physical socket where the CPU capsules are placed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Core&#039;&#039;&#039;: The physical unit that can independently execute tasks on a CPU. &lt;br /&gt;
Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Thread&#039;&#039;&#039;: A logical unit that can be executed independently. &lt;br /&gt;
If the same processor core is configured to execute one or more execution paths in parallel it is called &#039;&#039;&#039;Hyperthreading&#039;&#039;&#039; (a hardware setting). &lt;br /&gt;
Alternatively, there can be software threads that can be understood in the context of &amp;quot;parallel execution paths&amp;quot; within the same program (eg. to work through different and independent data arrays in parallel). &lt;br /&gt;
This is referred to as &#039;&#039;&#039;Multithreading&#039;&#039;&#039;, eg. via OpenMP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CPU&#039;&#039;&#039;: Central Processing Unit. &lt;br /&gt;
It performs the actual computation in a compute node. &lt;br /&gt;
A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPU&#039;&#039;&#039;: Graphics Processing Unit. &lt;br /&gt;
GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in the fields of Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be designed specifically to use GPUs. &lt;br /&gt;
CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;RAM&#039;&#039;&#039;: Random Access Memory. &lt;br /&gt;
It is used as the working memory for the cores.&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12339</id>
		<title>Running Calculations</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Running_Calculations&amp;diff=12339"/>
		<updated>2023-09-12T11:37:39Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Description ==&lt;br /&gt;
[[File:running_calculations_on_cluster.svg|thumb|upright=0.4]]&lt;br /&gt;
On your desktop computer, you start your calculations and they start immediately, run until they are finished, then your desktop does mostly nothing, until you start another calculation. A compute cluster has several hundred, maybe a thousand computers (compute nodes), all of them are busy most of the time and many people want to run a great number of calculations. So running your job has to include some extra steps:&lt;br /&gt;
&lt;br /&gt;
# prepare a script (usually a shell script), with all the commands that are necessary to run your calculation from start to finish. In addition to the commands necessary to run the calculation, this &#039;&#039;batch script&#039;&#039; has a header section, in which you specify details like required compute cores, estimated runtime, memory requirements, disk space needed, etc.&lt;br /&gt;
# &#039;&#039;Submit&#039;&#039; the script to a queue; from this point on, your calculation is called a &#039;&#039;job&#039;&#039;.&lt;br /&gt;
# The job is queued and waits in line with other compute jobs until the resources you requested in the header become available.&lt;br /&gt;
# Execution: Once your job reaches the front of the queue, your script is executed on a compute node. Your calculation runs on that node until it is finished or reaches the specified time limit. &lt;br /&gt;
# Save results: At the end of your script, include commands to save the calculation results back to your home directory.&lt;br /&gt;
&lt;br /&gt;
There are two types of batch systems currently used on bwHPC clusters, called &amp;quot;Moab&amp;quot; (legacy installs) and &amp;quot;Slurm&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
== Link to Batch System per Cluster ==&lt;br /&gt;
&lt;br /&gt;
Because of differences in configuration (partly due to different available hardware), each cluster has its own batch system documentation:&lt;br /&gt;
&lt;br /&gt;
* Slurm systems&lt;br /&gt;
**[[bwUniCluster_2.0_Slurm_common_Features|Slurm bwUniCluster 2.0]]&lt;br /&gt;
** [[JUSTUS2/Slurm | Slurm JUSTUS 2]]&lt;br /&gt;
** [[Helix/Slurm   | Slurm Helix]]&lt;br /&gt;
* Moab systems (legacy systems with deprecated queuing system)&lt;br /&gt;
** [[NEMO/Moab|Moab NEMO specific information]]&lt;br /&gt;
** [[BinAC/Moab|Moab BinAC specific information]]&lt;br /&gt;
&lt;br /&gt;
== Scaling ==&lt;br /&gt;
&lt;br /&gt;
When you run your calculations, you have to decide on how many compute cores your calculation will run simultaneously. For this, your computational problem has to be divided into pieces, which always causes some overhead.&lt;br /&gt;
&lt;br /&gt;
How to find a reasonable number of compute cores for your calculation is described on the page&lt;br /&gt;
* [[Scaling]]&lt;br /&gt;
&lt;br /&gt;
== Energy Efficiency ==&lt;br /&gt;
&lt;br /&gt;
Please also see our advice for&lt;br /&gt;
* [[Energy Efficient Cluster Usage]]&lt;br /&gt;
&lt;br /&gt;
== HPC Glossary ==&lt;br /&gt;
&lt;br /&gt;
A short definition of the typical elements of an HPC cluster. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;HPC Cluster&#039;&#039;&#039;: Collection of compute nodes with (usually) high bandwidth and low latency communication. &lt;br /&gt;
They can be accessed via login nodes. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Node&#039;&#039;&#039;: An individual computer with one or more sockets, part of an HPC cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Socket&#039;&#039;&#039;: Physical socket where the CPU capsules are placed.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Core&#039;&#039;&#039;: The physical unit that can independently execute tasks on a CPU. &lt;br /&gt;
Modern CPUs generally have multiple cores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Thread&#039;&#039;&#039;: A logical unit that can be executed independently. &lt;br /&gt;
If the same processor core is configured to execute one or more execution paths in parallel it is called &#039;&#039;&#039;Hyperthreading&#039;&#039;&#039; (a hardware setting). &lt;br /&gt;
Alternatively, there can be software threads that can be understood in the context of &amp;quot;parallel execution paths&amp;quot; within the same program (eg. to work through different and independent data arrays in parallel). &lt;br /&gt;
This is referred to as &#039;&#039;&#039;Multithreading&#039;&#039;&#039;, eg. via OpenMP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CPU&#039;&#039;&#039;: Central Processing Unit. &lt;br /&gt;
It performs the actual computation in a compute node. &lt;br /&gt;
A modern CPU is composed of numerous cores and layers of cache.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPU&#039;&#039;&#039;: Graphics Processing Unit. &lt;br /&gt;
GPUs in HPC clusters are used as high-performance accelerators and are particularly useful to process workloads in the fields of Machine Learning (ML) and Artificial Intelligence (AI) more efficiently. The software has to be designed specifically to use GPUs. &lt;br /&gt;
CUDA and OpenACC are the most popular platforms in scientific computing with GPUs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;RAM&#039;&#039;&#039;: Random Access Memory. &lt;br /&gt;
It is used as the working memory for the cores.&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Energy_Efficient_Cluster_Usage&amp;diff=12316</id>
		<title>Energy Efficient Cluster Usage</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Energy_Efficient_Cluster_Usage&amp;diff=12316"/>
		<updated>2023-08-21T11:58:23Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* How many and which kind of hardware resources do I require for it */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Energy consumption of data centers has been increasing continuously throughout the last decade. In 2020, the energy consumption of all data centers in Germany amounted to around  [https://www.bundestag.de/resource/blob/863850/423c11968fcb5c9995e9ef9090edf9e6/WD-8-070-21-pdf-data.pdf 3 percent] of the total electricity produced. Accompanying this large energy consumption are large-scale emissions of CO2 to the atmosphere and thus significant contributions to climate change.&lt;br /&gt;
To illustrate this, an average compute job running on a single node for one day may easily consume 10 kWh or even more. That translates roughly to brewing 700 cups of coffee.&lt;br /&gt;
Assuming that a typical bwHPC cluster has a few hundred compute nodes, this amounts to the energy consumption of a village for each cluster. &lt;br /&gt;
&lt;br /&gt;
Although a large amount of this energy consumption is an intrinsic requirement of running large HPC clusters (even when its processors are idle, a cluster uses a lot of energy), efficient use of the available resources is important. Using as many resources as possible does not make you a power user. Using them wisely does.&lt;br /&gt;
In the following, a basic introduction to some of the most important aspects of energy-efficient HPC usage from a user perspective is given. &lt;br /&gt;
&lt;br /&gt;
We can generally distinguish three tasks when optimizing for running HPC jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  What do I want to do and why do I need an HPC Cluster for it?&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  How many and which kind of hardware resources do I require for it?&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  How do I optimize my code to use these resources most efficiently?&lt;br /&gt;
&lt;br /&gt;
= What do I want to do and why do I need an HPC Cluster for it? =&lt;br /&gt;
&lt;br /&gt;
The bwHPC clusters are used to almost full capacity, and running a job on an HPC node consumes a lot of energy, as shown above. &lt;br /&gt;
Therefore, users are requested to run only necessary jobs.&lt;br /&gt;
&lt;br /&gt;
Please consider testing new setups and their output for validity prior to submitting jobs that require lots of resources. This also includes projects where a lot of (smaller) similar jobs are submitted. &lt;br /&gt;
&lt;br /&gt;
Make sure to double-check your jobs prior to submission; having to discard the output data of an HPC project due to faulty input files wastes a lot of computational resources.&lt;br /&gt;
&lt;br /&gt;
Finally, identifying the specific resource requirements for a given job is important for allocating the optimal resources to your compute job, and for deciding whether an HPC cluster is needed at all.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= How many and which kind of hardware resources do I require for it =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Resource allocation is a crucial part of working on an HPC cluster.&lt;br /&gt;
It depends both on the job and on the specific cluster hardware and architecture available.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;A small number of jobs and few resources&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Submit to the scheduler. No extended testing and resource scaling analysis are needed. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Medium-sized projects&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Run only necessary jobs: Please consider testing new setups and their output for validity prior to submitting a huge amount of similar jobs&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Start small: Run your problem on a small set of resources first.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Use the proper tools for development: If you develop your own code, please use the proper tools for debugging and parallel performance analysis. See: [[Development#Documentation_in_the_Wiki|Development]].&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  A look at the job feedback can help you determine if you are using the cluster efficiently&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Large projects&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Same approach as for medium-sized projects. &lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Run a scaling analysis for your project with regard to how many resources work best. See: [[Scaling]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Many short jobs&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Handling via the scheduler is inefficient. &lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Simple parallelization by hand is advisable. See: A basic introduction to [[Parallel Programming]].&lt;br /&gt;
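&lt;br /&gt;
A minimal sketch of such hand-made parallelization inside a single batch job is shown below; the program name and the input file pattern are placeholders, and the inputs are assumed to be independent of each other.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# run several short, independent tasks in parallel within one job&lt;br /&gt;
for input in input_*.dat; do&lt;br /&gt;
    ./short_task $input &amp;amp;   # start each task in the background&lt;br /&gt;
done&lt;br /&gt;
wait                           # continue only after all background tasks have finished&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The number of tasks started this way should not exceed the number of cores requested for the job; if there are more input files than cores, a tool such as &amp;lt;code&amp;gt;xargs -P&amp;lt;/code&amp;gt; or GNU parallel can limit how many tasks run at the same time.&lt;br /&gt;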
&lt;br /&gt;
= How do I optimize my code to use these resources most efficiently? =&lt;br /&gt;
&lt;br /&gt;
The above recommendations will help you use the cluster resources more efficiently.&lt;br /&gt;
Regarding software development, power efficiency obviously correlates heavily with &#039;&#039;&#039;computing performance&#039;&#039;&#039;, but also with memory usage, i.e. both the amount of memory used and how efficiently it is accessed.&lt;br /&gt;
&lt;br /&gt;
Here, we have gathered a few results based on other research:&lt;br /&gt;
&amp;amp;rarr;  Use an efficient programming language such as Rust, C, or C++ -- in general, any compiled language. Avoid interpreted languages like Perl or Python for the compute-intensive parts. Since Machine Learning is a hot topic, this deserves a few words: any ML code in Python using TensorFlow or other libraries will make heavy use of NumPy and other math packages, which in turn use C-based implementations. Please make sure you use the provided Python modules, which are optimized to use Intel MKL and other mathematical libraries.&lt;br /&gt;
&lt;br /&gt;
Further reading:&lt;br /&gt;
Rui Pereira, et al: &amp;quot;&#039;&#039;Energy efficiency across programming languages: how do energy, time, and memory relate?&#039;&#039;&amp;quot;, SLE 2017: Proc. of the 10th ACM SIGPLAN Int. Conf. on SW Language Eng., Oct. 2017, pp. 256–267, [https://doi.org/10.1145/3136014.3136031 doi:10.1145/3136014.3136031]&lt;br /&gt;
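&lt;br /&gt;
A quick way to check which math libraries a Python installation actually uses (the module name is only an example and differs between clusters) is to load the provided Python module and let NumPy report its build configuration:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
module load devel/python                          # example module name, see the cluster documentation&lt;br /&gt;
python -c &#039;import numpy; numpy.show_config()&#039;     # lists the BLAS/LAPACK backends, e.g. MKL&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;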
&lt;br /&gt;
&amp;amp;rarr;  Analyse memory access patterns&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  For small tight loops checking for locks, use the &amp;lt;code&amp;gt;pause&amp;lt;/code&amp;gt; instruction.&lt;br /&gt;
&lt;br /&gt;
= Summary: General Recommendations =&lt;br /&gt;
&lt;br /&gt;
* Choose the most &#039;&#039;&#039;efficient algorithms&#039;&#039;&#039; for the given problem&lt;br /&gt;
* Run only &#039;&#039;&#039;necessary&#039;&#039;&#039; jobs: Please consider testing new setups and their output for validity prior to submitting a huge amount of similar jobs&lt;br /&gt;
* Start &#039;&#039;&#039;small&#039;&#039;&#039;: Run your problem on a small number of parallel entities (be it processes or threads) first.&lt;br /&gt;
* &#039;&#039;&#039;Estimate&#039;&#039;&#039; the runtime of the parallel job as &#039;&#039;&#039;exactly&#039;&#039;&#039; as possible to increase the efficiency of the scheduling of the whole system&lt;br /&gt;
* Use the proper tools for development: If you develop your own code, please use the proper tools for debugging and parallel performance analysis. More information is available on the bwHPC Wiki.&lt;br /&gt;
* A look at the &#039;&#039;&#039;job feedback&#039;&#039;&#039; can help you determine if you are using the cluster efficiently&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Energy_Efficient_Cluster_Usage&amp;diff=12315</id>
		<title>Energy Efficient Cluster Usage</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Energy_Efficient_Cluster_Usage&amp;diff=12315"/>
		<updated>2023-08-21T11:57:58Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* How many and which kind of hardware resources do I require for it */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Energy consumption of data centers has been increasing continuously throughout the last decade. In 2020, the energy consumption of all data centers in Germany amounted to around  [https://www.bundestag.de/resource/blob/863850/423c11968fcb5c9995e9ef9090edf9e6/WD-8-070-21-pdf-data.pdf 3 percent] of the total electricity produced. Accompanying this large energy consumption are large-scale emissions of CO2 to the atmosphere and thus significant contributions to climate change.&lt;br /&gt;
To illustrate this, an average compute job running on a single node for one day may easily consume 10 kWh or even more. That translates roughly to brewing 700 cups of coffee.&lt;br /&gt;
Assuming that a typical bwHPC cluster has a few hundred compute nodes, this amounts to the energy consumption of a village for each cluster. &lt;br /&gt;
&lt;br /&gt;
Although a large amount of this energy consumption is an intrinsic requirement of running large HPC clusters (even when its processors are idle, a cluster uses a lot of energy), efficient use of the available resources is important. Using as many resources as possible does not make you a power user. Using them wisely does.&lt;br /&gt;
In the following, a basic introduction to some of the most important aspects of energy-efficient HPC usage from a user perspective is given. &lt;br /&gt;
&lt;br /&gt;
We can generally distinguish three tasks when optimizing for running HPC jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  What do I want to do and why do I need an HPC Cluster for it?&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  How many and which kind of hardware resources do I require for it?&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  How do I optimize my code to use these resources most efficiently?&lt;br /&gt;
&lt;br /&gt;
= What do I want to do and why do I need an HPC Cluster for it? =&lt;br /&gt;
&lt;br /&gt;
The bwHPC clusters are used to almost full capacity, and running a job on an HPC node consumes a lot of energy, as shown above. &lt;br /&gt;
Therefore, users are requested to run only necessary jobs.&lt;br /&gt;
&lt;br /&gt;
Please consider testing new setups and their output for validity prior to submitting jobs that require lots of resources. This also includes projects where a lot of (smaller) similar jobs are submitted. &lt;br /&gt;
&lt;br /&gt;
Make sure to double-check your jobs prior to submission; having to discard the output data of an HPC project due to faulty input files wastes a lot of computational resources.&lt;br /&gt;
&lt;br /&gt;
Finally, identifying the specific resource requirements for a given job is important for allocating the optimal resources to your compute job, and for deciding whether an HPC cluster is needed at all. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= How many and which kind of hardware resources do I require for it =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Resource allocation is a crucial part of working on an HPC cluster, &lt;br /&gt;
as it depends on both the job and the specific cluster hardware and architecture available. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A small number of jobs and few resources&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Submit to the scheduler. No extended testing and resource scaling analysis are needed. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Medium-sized projects&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Run only necessary jobs: Please consider testing new setups and their output for validity prior to submitting a huge amount of similar jobs&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Start small: Run your problem on a small set of resources first.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Use the proper tools for development: If you develop your own code, please use the proper tools for debugging and parallel performance analysis. See: [[Development#Documentation_in_the_Wiki|Development]].&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  A look at the job feedback can help you determine if you are using the cluster efficiently&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Large projects&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Same approach as for medium-sized projects. &lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Run a scaling analysis for your project with regard to how many resources work best. See: [[Scaling]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many short jobs&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Handling via the scheduler is inefficient. &lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Simple parallelization by hand is advisable. See: A basic introduction to [[Parallel Programming]].&lt;br /&gt;
&lt;br /&gt;
= How do I optimize my code to use these resources most efficiently? =&lt;br /&gt;
&lt;br /&gt;
The above recommendations will help you use the cluster resources more efficiently.&lt;br /&gt;
Regarding software development, power efficiency obviously correlates heavily with &#039;&#039;&#039;computing performance&#039;&#039;&#039;, but also with memory usage, i.e. both the amount of memory used and how efficiently it is accessed.&lt;br /&gt;
&lt;br /&gt;
Here, we have gathered a few results based on other research:&lt;br /&gt;
&amp;amp;rarr;  Use an efficient programming language such as Rust, C, or C++ -- indeed, any compiled language. Do not use an interpreted language like Perl or Python for the compute-intensive parts. Since Machine Learning is a hot topic, this deserves a few words: Any ML Python code using Tensorflow or other libraries will make heavy use of NumPy and other math packages, which in turn use C-based implementations. Please make sure you use the provided Python modules, which are optimized to use Intel MKL and other mathematical libraries.&lt;br /&gt;
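&lt;br /&gt;
As a quick check (a minimal sketch; the exact module names differ between clusters), you can verify, after loading one of the provided Python modules, that NumPy is linked against an optimized BLAS/LAPACK such as Intel MKL:&lt;br /&gt;
 import numpy as np&lt;br /&gt;
 # Prints the BLAS/LAPACK libraries this NumPy build was linked against.&lt;br /&gt;
 # On the provided modules this should list an optimized library such as&lt;br /&gt;
 # MKL or OpenBLAS; a plain reference BLAS is considerably slower.&lt;br /&gt;
 np.show_config()&lt;br /&gt;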
&lt;br /&gt;
Further reading:&lt;br /&gt;
Rui Pereira, et al: &amp;quot;&#039;&#039;Energy efficiency across programming languages: how do energy, time, and memory relate?&#039;&#039;&amp;quot;, SLE 2017: Proc. of the 10th ACM SIGPLAN Int. Conf. on SW Language Eng., Oct. 2017, pp. 256–267, [https://doi.org/10.1145/3136014.3136031 doi:10.1145/3136014.3136031]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Analyse memory access patterns&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  For small tight loops checking for locks, use the &amp;lt;code&amp;gt;pause&amp;lt;/code&amp;gt; instruction.&lt;br /&gt;
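&lt;br /&gt;
A minimal sketch of such a spin-wait loop (assuming an x86 CPU and a GCC-compatible compiler; the helper name &amp;lt;code&amp;gt;spin_wait&amp;lt;/code&amp;gt; is only illustrative), using the &amp;lt;code&amp;gt;_mm_pause()&amp;lt;/code&amp;gt; intrinsic:&lt;br /&gt;
 #include &amp;lt;immintrin.h&amp;gt;    /* _mm_pause() */&lt;br /&gt;
 #include &amp;lt;stdatomic.h&amp;gt;    /* C11 atomics */&lt;br /&gt;
 /* Spin until the lock flag is released; the pause hint lowers power   */&lt;br /&gt;
 /* draw and pipeline pressure of the waiting core vs. a plain loop.    */&lt;br /&gt;
 static void spin_wait(atomic_int *locked)&lt;br /&gt;
 {&lt;br /&gt;
     while (atomic_load_explicit(locked, memory_order_acquire))&lt;br /&gt;
         _mm_pause();&lt;br /&gt;
 }&lt;br /&gt;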
&lt;br /&gt;
= Summary: General Recommendations =&lt;br /&gt;
&lt;br /&gt;
* Choose the most &#039;&#039;&#039;efficient algorithms&#039;&#039;&#039; for the given problem&lt;br /&gt;
* Run only &#039;&#039;&#039;necessary&#039;&#039;&#039; jobs: Please consider testing new setups and their output for validity prior to submitting a huge amount of similar jobs&lt;br /&gt;
* Start &#039;&#039;&#039;small&#039;&#039;&#039;: Run your problem on a small number of parallel entities (be it processes or threads) first.&lt;br /&gt;
* &#039;&#039;&#039;Estimate&#039;&#039;&#039; the runtime of the parallel job as &#039;&#039;&#039;accurately&#039;&#039;&#039; as possible to increase the efficiency of the scheduling of the whole system&lt;br /&gt;
* Use the proper tools for development: If you develop your own code, please use the proper tools for debugging and parallel performance analysis. More information is available on the bwHPC Wiki.&lt;br /&gt;
* A look at the &#039;&#039;&#039;job feedback&#039;&#039;&#039; can help you determine if you are using the cluster efficiently&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Energy_Efficient_Cluster_Usage&amp;diff=12314</id>
		<title>Energy Efficient Cluster Usage</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Energy_Efficient_Cluster_Usage&amp;diff=12314"/>
		<updated>2023-08-21T11:57:20Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* How do I optimize my code to use these resources most efficiently? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Energy consumption of data centers has been increasing continuously throughout the last decade. In 2020, the energy consumption of all data centers in Germany amounted to around  [https://www.bundestag.de/resource/blob/863850/423c11968fcb5c9995e9ef9090edf9e6/WD-8-070-21-pdf-data.pdf 3 percent] of the total electricity produced. Accompanying this large energy consumption are large-scale emissions of CO2 to the atmosphere and thus significant contributions to climate change.&lt;br /&gt;
To illustrate this, an average compute job running on a single node for one day may easily consume 10 kWh or even more. That translates roughly to brewing 700 cups of coffee.&lt;br /&gt;
Assuming that a typical bwHPC cluster has a few hundred compute nodes, this amounts to the energy consumption of a village for each cluster. &lt;br /&gt;
&lt;br /&gt;
Although a large amount of this energy consumption is an intrinsic requirement of running large HPC clusters (even when its processors are idle, a cluster uses a lot of energy), efficient use of the available resources is important. Using as many resources as possible does not make you a power user. Using them wisely does.&lt;br /&gt;
In the following, a basic introduction to some of the most important aspects of energy-efficient HPC usage from a user perspective is given. &lt;br /&gt;
&lt;br /&gt;
We can generally distinguish three tasks when optimizing for running HPC jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  What do I want to do and why do I need an HPC Cluster for it?&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  How many and which kind of hardware resources do I require for it?&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  How do I optimize my code to use these resources most efficiently?&lt;br /&gt;
&lt;br /&gt;
= What do I want to do and why do I need an HPC Cluster for it? =&lt;br /&gt;
&lt;br /&gt;
The bwHPC clusters are used to almost full capacity, and running a job on an HPC node consumes a lot of energy, as shown above. &lt;br /&gt;
Therefore, users are requested to run only necessary jobs.&lt;br /&gt;
&lt;br /&gt;
Please consider testing new setups and their output for validity prior to submitting jobs that require lots of resources. This also includes projects where a lot of (smaller) similar jobs are submitted. &lt;br /&gt;
&lt;br /&gt;
Make sure to double-check your jobs prior to submission; having to discard the output data of an HPC project due to faulty input files wastes a lot of computational resources.&lt;br /&gt;
&lt;br /&gt;
Finally, identifying the specific resource requirements for a given job is important for allocating the optimal resources to your compute job, and for deciding whether an HPC cluster is needed at all. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= How many and which kind of hardware resources do I require for it =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Resource allocation is a crucial part of working on an HPC cluster, &lt;br /&gt;
as it depends on both the job and the specific cluster hardware and architecture available. &lt;br /&gt;
&lt;br /&gt;
A small number of jobs and few resources&lt;br /&gt;
* Submit to the scheduler. No extended testing and resource scaling analysis are needed. &lt;br /&gt;
&lt;br /&gt;
Medium-sized projects&lt;br /&gt;
* Run only necessary jobs: Please consider testing new setups and their output for validity prior to submitting a huge amount of similar jobs&lt;br /&gt;
* Start small: Run your problem on a small set of resources first.&lt;br /&gt;
* Use the proper tools for development: If you develop your own code, please use the proper tools for debugging and parallel performance analysis. See: [[Development#Documentation_in_the_Wiki|Development]].&lt;br /&gt;
* A look at the job feedback can help you determine if you are using the cluster efficiently&lt;br /&gt;
&lt;br /&gt;
Large projects&lt;br /&gt;
* Same approach as for medium-sized projects. &lt;br /&gt;
* Run a scaling analysis for your project with regard to how many resources work best. See: [[Scaling]].&lt;br /&gt;
&lt;br /&gt;
Many short jobs&lt;br /&gt;
* Handling via the scheduler is inefficient. &lt;br /&gt;
* Simple parallelization by hand is advisable. See: A basic introduction to [[Parallel Programming]].&lt;br /&gt;
&lt;br /&gt;
= How do I optimize my code to use these resources most efficiently? =&lt;br /&gt;
&lt;br /&gt;
The above recommendations will help you use the cluster resources more efficiently.&lt;br /&gt;
Regarding software development, power efficiency obviously correlates heavily with &#039;&#039;&#039;computing performance&#039;&#039;&#039;, but also with memory usage, i.e. both the amount of memory used and how efficiently it is accessed.&lt;br /&gt;
&lt;br /&gt;
Here, we have gathered a few results based on other research:&lt;br /&gt;
&amp;amp;rarr;  Use an efficient programming language such as Rust, C, or C++ -- indeed, any compiled language. Do not use an interpreted language like Perl or Python for the compute-intensive parts. Since Machine Learning is a hot topic, this deserves a few words: Any ML Python code using Tensorflow or other libraries will make heavy use of NumPy and other math packages, which in turn use C-based implementations. Please make sure you use the provided Python modules, which are optimized to use Intel MKL and other mathematical libraries.&lt;br /&gt;
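&lt;br /&gt;
As a quick check (a minimal sketch; the exact module names differ between clusters), you can verify, after loading one of the provided Python modules, that NumPy is linked against an optimized BLAS/LAPACK such as Intel MKL:&lt;br /&gt;
 import numpy as np&lt;br /&gt;
 # Prints the BLAS/LAPACK libraries this NumPy build was linked against.&lt;br /&gt;
 # On the provided modules this should list an optimized library such as&lt;br /&gt;
 # MKL or OpenBLAS; a plain reference BLAS is considerably slower.&lt;br /&gt;
 np.show_config()&lt;br /&gt;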
&lt;br /&gt;
Further reading:&lt;br /&gt;
Rui Pereira, et al: &amp;quot;&#039;&#039;Energy efficiency across programming languages: how do energy, time, and memory relate?&#039;&#039;&amp;quot;, SLE 2017: Proc. of the 10th ACM SIGPLAN Int. Conf. on SW Language Eng., Oct. 2017, pp. 256–267, [https://doi.org/10.1145/3136014.3136031 doi:10.1145/3136014.3136031]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  Analyse memory access patterns&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  For small tight loops checking for locks, use the &amp;lt;code&amp;gt;pause&amp;lt;/code&amp;gt; instruction.&lt;br /&gt;
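&lt;br /&gt;
A minimal sketch of such a spin-wait loop (assuming an x86 CPU and a GCC-compatible compiler; the helper name &amp;lt;code&amp;gt;spin_wait&amp;lt;/code&amp;gt; is only illustrative), using the &amp;lt;code&amp;gt;_mm_pause()&amp;lt;/code&amp;gt; intrinsic:&lt;br /&gt;
 #include &amp;lt;immintrin.h&amp;gt;    /* _mm_pause() */&lt;br /&gt;
 #include &amp;lt;stdatomic.h&amp;gt;    /* C11 atomics */&lt;br /&gt;
 /* Spin until the lock flag is released; the pause hint lowers power   */&lt;br /&gt;
 /* draw and pipeline pressure of the waiting core vs. a plain loop.    */&lt;br /&gt;
 static void spin_wait(atomic_int *locked)&lt;br /&gt;
 {&lt;br /&gt;
     while (atomic_load_explicit(locked, memory_order_acquire))&lt;br /&gt;
         _mm_pause();&lt;br /&gt;
 }&lt;br /&gt;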
&lt;br /&gt;
= Summary: General Recommendations =&lt;br /&gt;
&lt;br /&gt;
* Choose the most &#039;&#039;&#039;efficient algorithms&#039;&#039;&#039; for the given problem&lt;br /&gt;
* Run only &#039;&#039;&#039;necessary&#039;&#039;&#039; jobs: Please consider testing new setups and their output for validity prior to submitting a huge amount of similar jobs&lt;br /&gt;
* Start &#039;&#039;&#039;small&#039;&#039;&#039;: Run your problem on a small number of parallel entities (be it processes or threads) first.&lt;br /&gt;
* &#039;&#039;&#039;Estimate&#039;&#039;&#039; the runtime of the parallel job as &#039;&#039;&#039;accurately&#039;&#039;&#039; as possible to increase the efficiency of the scheduling of the whole system&lt;br /&gt;
* Use the proper tools for development: If you develop your own code, please use the proper tools for debugging and parallel performance analysis. More information is available on the bwHPC Wiki.&lt;br /&gt;
* A look at the &#039;&#039;&#039;job feedback&#039;&#039;&#039; can help you determine if you are using the cluster efficiently&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Energy_Efficient_Cluster_Usage&amp;diff=12313</id>
		<title>Energy Efficient Cluster Usage</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Energy_Efficient_Cluster_Usage&amp;diff=12313"/>
		<updated>2023-08-21T11:57:05Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Energy consumption of data centers has been increasing continuously throughout the last decade. In 2020, the energy consumption of all data centers in Germany amounted to around  [https://www.bundestag.de/resource/blob/863850/423c11968fcb5c9995e9ef9090edf9e6/WD-8-070-21-pdf-data.pdf 3 percent] of the total electricity produced. Accompanying this large energy consumption are large-scale emissions of CO2 to the atmosphere and thus significant contributions to climate change.&lt;br /&gt;
To illustrate this, an average compute job running on a single node for one day may easily consume 10 kWh or even more. That translates roughly to brewing 700 cups of coffee.&lt;br /&gt;
Assuming that a typical bwHPC cluster has a few hundred compute nodes, this amounts to the energy consumption of a village for each cluster. &lt;br /&gt;
&lt;br /&gt;
Although a large amount of this energy consumption is an intrinsic requirement of running large HPC clusters (even when its processors are idle, a cluster uses a lot of energy), efficient use of the available resources is important. Using as many resources as possible does not make you a power user. Using them wisely does.&lt;br /&gt;
In the following, a basic introduction to some of the most important aspects of energy-efficient HPC usage from a user perspective is given. &lt;br /&gt;
&lt;br /&gt;
We can generally distinguish three tasks when optimizing for running HPC jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  What do I want to do and why do I need an HPC Cluster for it?&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  How many and which kind of hardware resources do I require for it?&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;  How do I optimize my code to use these resources most efficiently?&lt;br /&gt;
&lt;br /&gt;
= What do I want to do and why do I need an HPC Cluster for it? =&lt;br /&gt;
&lt;br /&gt;
The bwHPC clusters are used to almost full capacity, and running a job on an HPC node consumes a lot of energy, as shown above. &lt;br /&gt;
Therefore, users are requested to run only necessary jobs.&lt;br /&gt;
&lt;br /&gt;
Please consider testing new setups and their output for validity prior to submitting jobs that require lots of resources. This also includes projects where a lot of (smaller) similar jobs are submitted. &lt;br /&gt;
&lt;br /&gt;
Make sure to double-check your jobs prior to submission; having to discard the output data of an HPC project due to faulty input files wastes a lot of computational resources.&lt;br /&gt;
&lt;br /&gt;
Finally, identifying the specific resource requirements for a given job is important for allocating the optimal resources to your compute job, and for deciding whether an HPC cluster is needed at all. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= How many and which kind of hardware resources do I require for it =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Resource allocation is a crucial part of working on an HPC cluster, &lt;br /&gt;
as it depends on both the job and the specific cluster hardware and architecture available. &lt;br /&gt;
&lt;br /&gt;
A small number of jobs and few resources&lt;br /&gt;
* Submit to the scheduler. No extended testing and resource scaling analysis are needed. &lt;br /&gt;
&lt;br /&gt;
Medium-sized projects&lt;br /&gt;
* Run only necessary jobs: Please consider testing new setups and their output for validity prior to submitting a huge amount of similar jobs&lt;br /&gt;
* Start small: Run your problem on a small set of resources first.&lt;br /&gt;
* Use the proper tools for development: If you develop your own code, please use the proper tools for debugging and parallel performance analysis. See: [[Development#Documentation_in_the_Wiki|Development]].&lt;br /&gt;
* A look at the job feedback can help you determine if you are using the cluster efficiently&lt;br /&gt;
&lt;br /&gt;
Large projects&lt;br /&gt;
* Same approach as for medium-sized projects. &lt;br /&gt;
* Run a scaling analysis for your project with regard to how many resources work best. See: [[Scaling]].&lt;br /&gt;
&lt;br /&gt;
Many short jobs&lt;br /&gt;
* Handling via the scheduler is inefficient. &lt;br /&gt;
* Simple parallelization by hand is advisable. See: A basic introduction to [[Parallel Programming]].&lt;br /&gt;
&lt;br /&gt;
= How do I optimize my code to use these resources most efficiently? =&lt;br /&gt;
&lt;br /&gt;
The above recommendations will help you use the cluster resources more efficiently.&lt;br /&gt;
Regarding software development, power efficiency obviously correlates heavily with &#039;&#039;&#039;computing performance&#039;&#039;&#039;, but also with memory usage, i.e. both the amount of memory used and how efficiently it is accessed.&lt;br /&gt;
&lt;br /&gt;
Here, we have gathered a few results based on other research:&lt;br /&gt;
* Use an efficient programming language such as Rust, C, or C++ -- indeed, any compiled language. Do not use an interpreted language like Perl or Python for the compute-intensive parts. Since Machine Learning is a hot topic, this deserves a few words: Any ML Python code using Tensorflow or other libraries will make heavy use of NumPy and other math packages, which in turn use C-based implementations. Please make sure you use the provided Python modules, which are optimized to use Intel MKL and other mathematical libraries.&lt;br /&gt;
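&lt;br /&gt;
As a quick check (a minimal sketch; the exact module names differ between clusters), you can verify, after loading one of the provided Python modules, that NumPy is linked against an optimized BLAS/LAPACK such as Intel MKL:&lt;br /&gt;
 import numpy as np&lt;br /&gt;
 # Prints the BLAS/LAPACK libraries this NumPy build was linked against.&lt;br /&gt;
 # On the provided modules this should list an optimized library such as&lt;br /&gt;
 # MKL or OpenBLAS; a plain reference BLAS is considerably slower.&lt;br /&gt;
 np.show_config()&lt;br /&gt;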
&lt;br /&gt;
Further reading:&lt;br /&gt;
Rui Pereira, et al: &amp;quot;&#039;&#039;Energy efficiency across programming languages: how do energy, time, and memory relate?&#039;&#039;&amp;quot;, SLE 2017: Proc. of the 10th ACM SIGPLAN Int. Conf. on SW Language Eng., Oct. 2017, pp. 256–267, [https://doi.org/10.1145/3136014.3136031 doi:10.1145/3136014.3136031]&lt;br /&gt;
&lt;br /&gt;
* Analyse memory access patterns&lt;br /&gt;
&lt;br /&gt;
* For small tight loops checking for locks, use the &amp;lt;code&amp;gt;pause&amp;lt;/code&amp;gt; instruction.&lt;br /&gt;
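&lt;br /&gt;
A minimal sketch of such a spin-wait loop (assuming an x86 CPU and a GCC-compatible compiler; the helper name &amp;lt;code&amp;gt;spin_wait&amp;lt;/code&amp;gt; is only illustrative), using the &amp;lt;code&amp;gt;_mm_pause()&amp;lt;/code&amp;gt; intrinsic:&lt;br /&gt;
 #include &amp;lt;immintrin.h&amp;gt;    /* _mm_pause() */&lt;br /&gt;
 #include &amp;lt;stdatomic.h&amp;gt;    /* C11 atomics */&lt;br /&gt;
 /* Spin until the lock flag is released; the pause hint lowers power   */&lt;br /&gt;
 /* draw and pipeline pressure of the waiting core vs. a plain loop.    */&lt;br /&gt;
 static void spin_wait(atomic_int *locked)&lt;br /&gt;
 {&lt;br /&gt;
     while (atomic_load_explicit(locked, memory_order_acquire))&lt;br /&gt;
         _mm_pause();&lt;br /&gt;
 }&lt;br /&gt;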
&lt;br /&gt;
= Summary: General Recommendations =&lt;br /&gt;
&lt;br /&gt;
* Choose the most &#039;&#039;&#039;efficient algorithms&#039;&#039;&#039; for the given problem&lt;br /&gt;
* Run only &#039;&#039;&#039;necessary&#039;&#039;&#039; jobs: Please consider testing new setups and their output for validity prior to submitting a huge amount of similar jobs&lt;br /&gt;
* Start &#039;&#039;&#039;small&#039;&#039;&#039;: Run your problem on a small number of parallel entities (be it processes or threads) first.&lt;br /&gt;
* &#039;&#039;&#039;Estimate&#039;&#039;&#039; the runtime of the parallel job as &#039;&#039;&#039;accurately&#039;&#039;&#039; as possible to increase the efficiency of the scheduling of the whole system&lt;br /&gt;
* Use the proper tools for development: If you develop your own code, please use the proper tools for debugging and parallel performance analysis. More information is available on the bwHPC Wiki.&lt;br /&gt;
* A look at the &#039;&#039;&#039;job feedback&#039;&#039;&#039; can help you determine if you are using the cluster efficiently&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12279</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12279"/>
		<updated>2023-08-21T08:39:51Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Remember to mention the cluster in your publications. Cluster-specific information can be found here:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[JUSTUS2/Acknowledgement| bwForCluster JUSTUS2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website:&lt;br /&gt;
 https://www.bwhpc.de/user_publications.html&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12278</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12278"/>
		<updated>2023-08-21T08:38:59Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* Acknowledge the cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Remember to mention the cluster in your publications. Cluster-specific information can be found here:&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[JUSTUS2/Acknowledgement| bwForCluster JUSTUS2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website:&lt;br /&gt;
 https://www.bwhpc.de/user_publications.html&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12273</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12273"/>
		<updated>2023-08-21T08:34:45Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* Acknowledge the cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Acknowledge the cluster ==&lt;br /&gt;
&lt;br /&gt;
Remember to mention the cluster in your publications. Cluster-specific information can be found here:&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[JUSTUS2/Acknowledgement| bwForCluster JUSTUS2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&amp;amp;rarr;[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website:&lt;br /&gt;
 https://www.bwhpc.de/user_publications.html&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12272</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12272"/>
		<updated>2023-08-21T08:33:32Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* Acknowledge the cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Acknowledge the cluster ==&lt;br /&gt;
&lt;br /&gt;
Remember to mention the cluster in your publications. &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
[[JUSTUS2/Acknowledgement| bwForCluster JUSTUS2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website:&lt;br /&gt;
 https://www.bwhpc.de/user_publications.html&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12271</id>
		<title>Acknowledgement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Acknowledgement&amp;diff=12271"/>
		<updated>2023-08-21T08:32:56Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: Created page with &amp;quot;== Acknowledge the cluster ==  Remember to mention the cluster in your publications.    bwUniCluster Acknowledgement  BinAC/Acknowledgeme...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Acknowledge the cluster ==&lt;br /&gt;
&lt;br /&gt;
Remember to mention the cluster in your publications. &lt;br /&gt;
&lt;br /&gt;
[[BwUniCluster2.0/Acknowledgement| bwUniCluster Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
[[BinAC/Acknowledgement| bwForCluster BinAC Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
[[Helix/Acknowledgement| bwForCluster Helix Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
[[Justus2/Acknowledgement| bwForCluster Justus2 Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
[[NEMO/Acknowledgement| bwForCluster NEMO Acknowledgement]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Such recognition is important for acquiring funding for the next generation of hardware, support services, data storage, and infrastructure.&lt;br /&gt;
&lt;br /&gt;
The publications will be referenced on the bwHPC website:&lt;br /&gt;
 https://www.bwhpc.de/user_publications.html&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=File:Overview_batch_job_workflow.png&amp;diff=12264</id>
		<title>File:Overview batch job workflow.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=File:Overview_batch_job_workflow.png&amp;diff=12264"/>
		<updated>2023-08-21T08:23:23Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: J Steuer uploaded a new version of File:Overview batch job workflow.png&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Summary ==&lt;br /&gt;
Illustration of how batch jobs on an HPC Cluster are processed.&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=File:Overview_batch_job_workflow.png&amp;diff=12200</id>
		<title>File:Overview batch job workflow.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=File:Overview_batch_job_workflow.png&amp;diff=12200"/>
		<updated>2023-08-17T08:07:47Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: Illustration of how batch jobs on an HPC Cluster are processed.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Summary ==&lt;br /&gt;
Illustration of how batch jobs on an HPC Cluster are processed.&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=File:Basic_slurm_script.png&amp;diff=12173</id>
		<title>File:Basic slurm script.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=File:Basic_slurm_script.png&amp;diff=12173"/>
		<updated>2023-08-16T15:03:37Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: Basic example of the structure of a SLURM batch script.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Summary ==&lt;br /&gt;
Basic example of the structure of a SLURM batch script.&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Registration/bwUniCluster/Entitlement&amp;diff=12103</id>
		<title>Registration/bwUniCluster/Entitlement</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Registration/bwUniCluster/Entitlement&amp;diff=12103"/>
		<updated>2023-07-25T08:48:09Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{|style=&amp;quot;background:#ffffff; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffffff; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#ffffff; text-align:left&amp;quot;|&lt;br /&gt;
The bwUniCluster entitlement (see [https://www.bwidm.de/attribute.php#Berechtigung eduPersonEntitlement]) issued by a university assures the operator of the bwUniCluster that its university members&#039; compute activities comply with the German Foreign Trade Act (Außenwirtschaftsgesetz - AWG) and German Foreign Trade Regulations (Außenwirtschaftsverordnung - AWV).&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Step A: bwUniCluster Entitlement =&lt;br /&gt;
&lt;br /&gt;
To register for the bwUniCluster 2.0 you need the  &#039;&#039;&#039;bwUniCluster Entitlement&#039;&#039;&#039; issued by your university.&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
The entitlement is called &#039;&#039;&#039;bwUniCluster&#039;&#039;&#039; (and not bwUniCluster 2.0) and each university assigns the entitlement &#039;&#039;&#039;only&#039;&#039;&#039; for its own members.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
If you are not sure if you already have an entitlement, please check it first with the [[Registration/bwUniCluster/Entitlement#Check_your_Entitlements|&#039;&#039;&#039;Check your Entitlements&#039;&#039;&#039;]] guide below.&lt;br /&gt;
If you need the entitlement, please follow the link for your institution or contact your local service desk if no information is provided:&lt;br /&gt;
* [https://www.hs-esslingen.de/informatik-und-informationstechnik/forschung-labore/projekte/forschungsprojekte/high-performance-computing/ Hochschule Esslingen]&lt;br /&gt;
* [[BwCluster_User_Access_Uni_Freiburg|Universität Freiburg]]&lt;br /&gt;
* [https://bwunicluster.urz.uni-heidelberg.de/ Universität Heidelberg]&lt;br /&gt;
* [https://kim.uni-hohenheim.de/bwhpc-account Universität Hohenheim]&lt;br /&gt;
* [http://www.scc.kit.edu/downloads/ISM/Accessform_bwUniCluster_DE_EN.pdf Karlsruhe Institute of Technology (KIT)]&lt;br /&gt;
* [https://www.kim.uni-konstanz.de/en/services/research-and-teaching/high-performance-computing/access-to-bwunicluster Universität Konstanz]&lt;br /&gt;
* [[BWUniCluster_User_Access_Members_Uni_Mannheim|Universität Mannheim]]&lt;br /&gt;
* [https://www.hlrs.de/apply-for-computing-time/bw-uni-cluster Universität Stuttgart]&lt;br /&gt;
* [https://uni-tuebingen.de/de/155157 Universität Tübingen]&lt;br /&gt;
* [[BWUniCluster_User_Access_Members_Uni_Ulm|Universität Ulm]]&lt;br /&gt;
* [[Registration/HAW|HAW BW e.V.]] and Duale Hochschule Baden-Württemberg: Please contact your local service desk / compute center&lt;br /&gt;
&lt;br /&gt;
== Check your Entitlements ==&lt;br /&gt;
&lt;br /&gt;
To make sure you do not already have the entitlement, please log in to &#039;&#039;&#039;https://login.bwidm.de/user/index.xhtml&#039;&#039;&#039;.&lt;br /&gt;
To see the list of your entitlements, first select the &#039;&#039;&#039;Shibboleth&#039;&#039;&#039; tab at the top.&lt;br /&gt;
If the list below &amp;lt;code&amp;gt;&amp;lt;nowiki&amp;gt;urn:oid:1.3.6.1.4.1.5923.1.1.1.7&amp;lt;/nowiki&amp;gt;&amp;lt;/code&amp;gt; contains&lt;br /&gt;
&amp;lt;pre&amp;gt;http://bwidm.de/entitlement/bwUniCluster&amp;lt;/pre&amp;gt;&lt;br /&gt;
you already have the entitlement and can skip step A.&lt;br /&gt;
{|style=&amp;quot;background:#deffee; width:100%;&amp;quot;&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
[[Image:Attention.svg|center|25px]]&lt;br /&gt;
|style=&amp;quot;padding:5px; background:#cef2e0; text-align:left&amp;quot;|&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;nowiki&amp;gt;http://bwidm.de/entitlement/bwUniCluster&amp;lt;/nowiki&amp;gt;&amp;lt;/code&amp;gt; is an attribute and not a link!&lt;br /&gt;
See [https://www.bwidm.de/dienste.php bwUniCluster und bwForCluster] for more information about needed attributes for this service.&lt;br /&gt;
|}&lt;br /&gt;
[[File:BwIDM-idp.png|center|600px|thumb|Verify Entitlement.]]&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p style=&amp;quot;text-align:right;&amp;quot;&amp;gt;[[Registration/bwUniCluster/Service | Go to step B]]&amp;lt;/p&amp;gt;&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Scaling&amp;diff=12096</id>
		<title>Scaling</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Scaling&amp;diff=12096"/>
		<updated>2023-07-04T11:26:38Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction = &lt;br /&gt;
&lt;br /&gt;
Before you submit large production runs on a bwHPC cluster you should define an optimal number of resources required for your compute job. Poor job efficiency means that hardware resources are wasted and a similar overall result could have been achieved using fewer hardware resources, leaving those for other jobs and reducing the queue wait time for all users.&lt;br /&gt;
&lt;br /&gt;
The main advantage of today&#039;s compute clusters is that they are able to perform calculations in parallel. Whether and how your code can be parallelized is of fundamental importance for achieving good job efficiency and performance on an HPC cluster. A scaling analysis is done by identifying the number of resources (such as the number of cores, nodes, or GPUs) that enables the best performance for a given compute job.&lt;br /&gt;
&lt;br /&gt;
[[Energy Efficient Cluster Usage]] offers additional information on how to make the most out of the available HPC resources.&lt;br /&gt;
&lt;br /&gt;
= Considering Resources vs. Queue Time = &lt;br /&gt;
&lt;br /&gt;
When a job is submitted to the scheduler of an HPC cluster, the job first waits in the queue before being executed on the compute nodes. &lt;br /&gt;
The amount of time spent in the queue is called the queue time. &lt;br /&gt;
The amount of time it takes for the job to run on the compute nodes is called the execution time.&lt;br /&gt;
&lt;br /&gt;
The figure below shows that the queue time increases with increasing resources (e.g., CPU cores) while the execution time decreases with increasing resources. &lt;br /&gt;
One should try to find the optimal set of resources that minimizes the &amp;quot;time to solution&amp;quot; which is the sum of the queue and execution times. &lt;br /&gt;
A simple rule is to choose the smallest set of resources that gives a reasonable speed-up over the baseline case.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig_resources_vs_queue_time.jpg|800px|center]]&lt;br /&gt;
&lt;br /&gt;
= Scaling Efficiency =&lt;br /&gt;
&lt;br /&gt;
When you run a parallel program, the problem has to be cut into several independent pieces. For some problems, this is easier than for others - but in every case, this produces an overhead of time used to divide the problem, distribute parts of it to tasks, and stitch the results together.&lt;br /&gt;
For a theoretical amount of &amp;quot;infinite calculations&amp;quot;, calculating each problem on one single core would be the most efficient way to use the hardware.&lt;br /&gt;
In extreme cases, when the problem is very hard to divide, using more compute cores can even make the job finish later.&lt;br /&gt;
&lt;br /&gt;
For real calculations, it is often impractical to wait for calculations to finish if they are done on a single core. &lt;br /&gt;
Typical calculation times for a job should stay under 2 days, or up to 2 weeks for jobs that cannot use more cores efficiently. &lt;br /&gt;
Any longer, and risks such as node failures, cluster downtimes due to maintenance, and only discovering (possibly wrong) results after a long wait can become too much of a problem.&lt;br /&gt;
&lt;br /&gt;
A common way to assess the efficiency of a parallel program is through its speedup. &lt;br /&gt;
Here, the speedup is defined as the ratio of the time a serial program needs to run to the time for the parallel program that accomplishes the same work. &lt;br /&gt;
&lt;br /&gt;
 Speedup= Time(serial program) / Time(parallel program)&lt;br /&gt;
&lt;br /&gt;
A simple example would be a calculation that takes 1000 hours on 1 core.&lt;br /&gt;
Without any overhead from parallelization, the same calculation run on 100 cores would need 1000/100 = 10 hours, the ideal speedup.&lt;br /&gt;
More realistically, such a calculation for parallelized code would need around 30 hours.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig_speedup.png|400px|center]]&lt;br /&gt;
&lt;br /&gt;
However, there is a theoretical upper limit on how much faster you can solve the original problem by using additional cores ([[Wikipedia:Amdahl%27s_law|Amdahl&#039;s Law]]). &lt;br /&gt;
While a considerable part of a compute job might parallelize nicely, there is always some portion of time spent on I/O, such as saving or reading from disc, network limitations, communication overhead, or performing calculations that cannot be parallelized, thus reducing the speedup that is possible by simply adding more computational resources.&lt;br /&gt;
&lt;br /&gt;
From the speedup, a useful definition of efficiency can be derived:&lt;br /&gt;
&lt;br /&gt;
 Efficiency = Speedup / Number of cores = Time(serial program) / (Time(parallel program) * Number of cores)&lt;br /&gt;
&lt;br /&gt;
The efficiency allows for an estimation of how well your code is using additional cores, and how much of the resources are lost by doing parallelization overhead calculations.&lt;br /&gt;
Coming back to the previous example, we can now use the time of a serial calculation (1000 hours), the time our parallelized code took to finish (30 hours), the number of cores we used (100 cores), and calculate the efficiency.&lt;br /&gt;
&lt;br /&gt;
 Efficiency = 1000 / (30 * 100) = 0.3&lt;br /&gt;
&lt;br /&gt;
This shows that for this example, only 30% of the resources are used to solve the problem, while 70% of our resources are spent on parallelization overhead.&lt;br /&gt;
A semi-arbitrary cut-off for determining if a job is well-scaled is if 50% or less of the computation is wasted on parallelization overhead.&lt;br /&gt;
Therefore, we can determine that for this example too many resources are used.&lt;br /&gt;
&lt;br /&gt;
In many cases, the time needed for calculating a given code in serial, on a single core, is not accessible, as this would take a very long time and is usually the reason why an HPC cluster is needed in the first place.&lt;br /&gt;
To circumvent this, the relative speedup when doubling the number of cores is calculated.&lt;br /&gt;
&lt;br /&gt;
 Relative Speedup (N Cores -&amp;gt; 2N Cores) = Time(N Cores) / Time(2N Cores)&lt;br /&gt;
&lt;br /&gt;
The relative speedup obtained by doubling the number of cores can be used as a rough guideline for a scaling analysis. &lt;br /&gt;
If doubling the number of cores results in a relative speedup of above 1.8, the scaling is considered good.&lt;br /&gt;
Above 1.7 is considered acceptable, while a relative speedup of less than 1.7 should usually be avoided.&lt;br /&gt;
We can illustrate this by using our simple parallelization example from above. &lt;br /&gt;
If we assume that our code would have finished in 45 hours when using 50 cores, we can calculate the relative speedup:&lt;br /&gt;
&lt;br /&gt;
 Relative Speedup (50 Cores -&amp;gt; 100 Cores) = Time(50 Cores) / Time(100 Cores) = 45 h / 30 h = 1.5&lt;br /&gt;
&lt;br /&gt;
A relative speedup of 1.5 is considered undesirable, so we should run our example code using 50 rather than 100 cores on the HPC cluster.&lt;br /&gt;
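&lt;br /&gt;
The quantities above can be reproduced with a few lines of Python (a sketch using the illustrative timings from this section):&lt;br /&gt;
 # Illustrative timings from the example above, in hours.&lt;br /&gt;
 t_serial = 1000.0   # 1 core&lt;br /&gt;
 t_50     = 45.0     # 50 cores&lt;br /&gt;
 t_100    = 30.0     # 100 cores&lt;br /&gt;
 speedup     = t_serial / t_100     # about 33&lt;br /&gt;
 efficiency  = speedup / 100        # about 0.33, i.e. roughly 30% useful work&lt;br /&gt;
 rel_speedup = t_50 / t_100         # 1.5, below the 1.7 guideline&lt;br /&gt;
 print(speedup, efficiency, rel_speedup)&lt;br /&gt;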
&lt;br /&gt;
In the following, a scaling analysis from a real example using the program VASP is shown.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig_speedup_and_efficiency_1.png|700px|center]]&lt;br /&gt;
&lt;br /&gt;
[[File:Fig_speedup_and_efficiency_2.png|700px|center]]&lt;br /&gt;
&lt;br /&gt;
= Basic Recipe to Determine Core Numbers =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
From the previous chapter, the following rules of thumb for determining a suitable number of cores for HPC compute jobs can be summarized:&lt;br /&gt;
&lt;br /&gt;
(1) Optimizing resource usage is most relevant when you submit many jobs or resource-heavy jobs. &lt;br /&gt;
For smaller projects, simply try to use a reasonable core number and you are done.&lt;br /&gt;
&lt;br /&gt;
(2) If you plan to submit many jobs, verify that the core number is acceptable.&lt;br /&gt;
If the jobs use N cores (i.e. N is 96 for a two-node job), then run the same job with N/2 cores (in this example 48 cores).&lt;br /&gt;
&lt;br /&gt;
(3) To calculate the speedup, you then divide the (longer) run time of the N/2-core-job by the (shorter) run time of the N-core-job. Typically the speedup is a number between 1.0 (no speedup at all) and 2.0 (perfect speedup - all additional cores speed up the job).&lt;br /&gt;
&lt;br /&gt;
(4a) IF the speedup is better than a factor of 1.7, THEN using N cores is perfectly fine.&lt;br /&gt;
&lt;br /&gt;
(4b) IF the speedup is worse than a factor of 1.7, THEN using N cores wastes too many resources and N/2 cores should be used.&lt;br /&gt;
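&lt;br /&gt;
This recipe can be written down as a small helper (a sketch; the function name is only illustrative and the 1.7 threshold is the rule of thumb from above):&lt;br /&gt;
 # Decide whether running with N cores (time_n) instead of N/2 cores&lt;br /&gt;
 # (time_half_n) still uses the additional cores efficiently.&lt;br /&gt;
 def n_cores_acceptable(time_half_n, time_n):&lt;br /&gt;
     speedup = time_half_n / time_n   # between 1.0 and 2.0&lt;br /&gt;
     return speedup &amp;gt; 1.7&lt;br /&gt;
 # Example from this page: 45 h with N/2 = 50 cores vs. 30 h with N = 100 cores&lt;br /&gt;
 print(n_cores_acceptable(45.0, 30.0))   # False, so use N/2 cores instead&lt;br /&gt;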
&lt;br /&gt;
= Better Resource Usage by Increasing the System Size =&lt;br /&gt;
&lt;br /&gt;
Amdahl’s law, as illustrated above, gives the upper limit of speedup for a problem of fixed size.&lt;br /&gt;
By simply increasing the number of cores to speed up a calculation your compute job can quickly become very inefficient, and wasteful. &lt;br /&gt;
While this appears to be a bottleneck for parallel computing, a different strategy was pointed out ([[Wikipedia:Gustafson&#039;s_law|Gustafson&#039;s law]]). &lt;br /&gt;
&lt;br /&gt;
If a problem only requires a small number of resources, it is not beneficial to use a large number of resources to carry out the computation. &lt;br /&gt;
A more reasonable choice is to use small amounts of resources for small problems and larger quantities of resources for big problems.&lt;br /&gt;
Thus, researchers can take advantage of available cores by scaling up parallel programs to explore their questions in higher resolution or at a larger scale. &lt;br /&gt;
With increases in computational power, researchers can be increasingly ambitious about the scale and complexity of their programs.&lt;br /&gt;
&lt;br /&gt;
= Reasons For Poor Job Efficiency = &lt;br /&gt;
&lt;br /&gt;
Some simple causes for poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
* Poor choice of resources compared to the size of the nodes leaves part of the node blocked, but doing nothing:&lt;br /&gt;
** The value of --ntasks-per-node does not evenly divide the number of cores on a node (e.g. 48)&lt;br /&gt;
** Too much (un-needed) memory or disk space requested&lt;br /&gt;
* More cores requested than are actually used by the job&lt;br /&gt;
* More cores used for a single MPI/OpenMP parallel computation than is useful&lt;br /&gt;
* Many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
* One-core jobs with very different run-times (because of single-user policy)&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Energy_Efficient_Cluster_Usage&amp;diff=12095</id>
		<title>Energy Efficient Cluster Usage</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Energy_Efficient_Cluster_Usage&amp;diff=12095"/>
		<updated>2023-07-04T11:15:46Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Energy consumption of data centers has been increasing continuously throughout the last decade. In 2020, the energy consumption of all data centers in Germany amounted to around  [https://www.bundestag.de/resource/blob/863850/423c11968fcb5c9995e9ef9090edf9e6/WD-8-070-21-pdf-data.pdf 3 percent] of the total electricity produced. Accompanying this large energy consumption are large-scale emissions of CO2 to the atmosphere and thus significant contributions to climate change.&lt;br /&gt;
To illustrate this, an average compute job running on a single node for one day may easily consume 10 kWh or even more. That translates roughly to brewing 700 cups of coffee.&lt;br /&gt;
Assuming that a typical bwHPC cluster has a few hundred compute nodes, this amounts to the energy consumption of a village for each cluster. &lt;br /&gt;
&lt;br /&gt;
Although a large amount of this energy consumption is an intrinsic requirement of running large HPC clusters (even when its processors are idle, a cluster uses a lot of energy), efficient use of the available resources is important. Using as many resources as possible does not make you a power user. Using them wisely does.&lt;br /&gt;
In the following, a basic introduction to some of the most important aspects of energy-efficient HPC usage from a user perspective is given. &lt;br /&gt;
&lt;br /&gt;
We can generally distinguish three tasks when optimizing for running HPC jobs efficiently.&lt;br /&gt;
&lt;br /&gt;
* What do I want to do and why do I need an HPC Cluster for it?&lt;br /&gt;
* How many and which kind of hardware resources do I require for it?&lt;br /&gt;
* How do I optimize my code to use these resources most efficiently?&lt;br /&gt;
&lt;br /&gt;
= What do I want to do and why do I need an HPC Cluster for it? =&lt;br /&gt;
&lt;br /&gt;
The bwHPC clusters are used to almost full capacity, and running a job on an HPC node consumes a lot of energy, as shown above. &lt;br /&gt;
Therefore, users are requested to run only necessary jobs.&lt;br /&gt;
&lt;br /&gt;
Please consider testing new setups and their output for validity prior to submitting jobs that require lots of resources. This also includes projects where a lot of (smaller) similar jobs are submitted. &lt;br /&gt;
&lt;br /&gt;
Make sure to double-check your jobs prior to submission; having to discard the output data of an HPC project due to faulty input files wastes a lot of computational resources.&lt;br /&gt;
&lt;br /&gt;
Finally, identifying the specific resource requirements for a given job is important for allocating the optimal resources to your compute job, and for deciding whether an HPC cluster is needed at all. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= How many and which kind of hardware resources do I require for it =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Resource allocation is a crucial part of working on an HPC cluster, &lt;br /&gt;
as it depends on both the job and the specific cluster hardware and architecture available. &lt;br /&gt;
&lt;br /&gt;
A small number of jobs and few resources&lt;br /&gt;
* Submit to the scheduler. No extended testing and resource scaling analysis are needed. &lt;br /&gt;
&lt;br /&gt;
Medium-sized projects&lt;br /&gt;
* Run only necessary jobs: Please consider testing new setups and their output for validity prior to submitting a huge amount of similar jobs&lt;br /&gt;
* Start small: Run your problem on a small set of resources first.&lt;br /&gt;
* Use the proper tools for development: If you develop your own code, please use the proper tools for debugging and parallel performance analysis. See: [[Development#Documentation_in_the_Wiki|Development]].&lt;br /&gt;
* A look at the job feedback can help you determine if you are using the cluster efficiently&lt;br /&gt;
&lt;br /&gt;
Large projects&lt;br /&gt;
* Same approach as for medium-sized projects. &lt;br /&gt;
* Run a scaling analysis for your project with regard to how many resources work best. See: [[Scaling]].&lt;br /&gt;
&lt;br /&gt;
Many short jobs&lt;br /&gt;
* Handling via the scheduler is inefficient. &lt;br /&gt;
* Simple parallelization by hand is advisable. See: A basic introduction to [[Parallel Programming]].&lt;br /&gt;
&lt;br /&gt;
= How do I optimize my code to use these resources most efficiently? =&lt;br /&gt;
&lt;br /&gt;
The above recommendations will help you use the cluster resources more efficiently.&lt;br /&gt;
Regarding software development, power efficiency obviously correlates heavily with &#039;&#039;&#039;computing performance&#039;&#039;&#039;, but also with memory usage, i.e. both the amount of memory used and how efficiently it is accessed.&lt;br /&gt;
&lt;br /&gt;
Here, we have gathered a few results based on other research:&lt;br /&gt;
* Use an efficient programming language such as Rust, C, or C++ -- indeed, any compiled language. Do not use an interpreted language like Perl or Python for the compute-intensive parts. Since Machine Learning is a hot topic, this deserves a few words: Any ML Python code using Tensorflow or other libraries will make heavy use of NumPy and other math packages, which in turn use C-based implementations. Please make sure you use the provided Python modules, which are optimized to use Intel MKL and other mathematical libraries.&lt;br /&gt;
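&lt;br /&gt;
As a quick check (a minimal sketch; the exact module names differ between clusters), you can verify, after loading one of the provided Python modules, that NumPy is linked against an optimized BLAS/LAPACK such as Intel MKL:&lt;br /&gt;
 import numpy as np&lt;br /&gt;
 # Prints the BLAS/LAPACK libraries this NumPy build was linked against.&lt;br /&gt;
 # On the provided modules this should list an optimized library such as&lt;br /&gt;
 # MKL or OpenBLAS; a plain reference BLAS is considerably slower.&lt;br /&gt;
 np.show_config()&lt;br /&gt;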
&lt;br /&gt;
Further reading:&lt;br /&gt;
Rui Pereira, et al: &amp;quot;&#039;&#039;Energy efficiency across programming languages: how do energy, time, and memory relate?&#039;&#039;&amp;quot;, SLE 2017: Proc. of the 10th ACM SIGPLAN Int. Conf. on SW Language Eng., Oct. 2017, pp. 256–267, [https://doi.org/10.1145/3136014.3136031 doi:10.1145/3136014.3136031]&lt;br /&gt;
&lt;br /&gt;
* Analyse memory access patterns&lt;br /&gt;
&lt;br /&gt;
* For small tight loops checking for locks, use the &amp;lt;code&amp;gt;pause&amp;lt;/code&amp;gt; instruction.&lt;br /&gt;
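&lt;br /&gt;
A minimal sketch of such a spin-wait loop (assuming an x86 CPU and a GCC-compatible compiler; the helper name &amp;lt;code&amp;gt;spin_wait&amp;lt;/code&amp;gt; is only illustrative), using the &amp;lt;code&amp;gt;_mm_pause()&amp;lt;/code&amp;gt; intrinsic:&lt;br /&gt;
 #include &amp;lt;immintrin.h&amp;gt;    /* _mm_pause() */&lt;br /&gt;
 #include &amp;lt;stdatomic.h&amp;gt;    /* C11 atomics */&lt;br /&gt;
 /* Spin until the lock flag is released; the pause hint lowers power   */&lt;br /&gt;
 /* draw and pipeline pressure of the waiting core vs. a plain loop.    */&lt;br /&gt;
 static void spin_wait(atomic_int *locked)&lt;br /&gt;
 {&lt;br /&gt;
     while (atomic_load_explicit(locked, memory_order_acquire))&lt;br /&gt;
         _mm_pause();&lt;br /&gt;
 }&lt;br /&gt;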
&lt;br /&gt;
= Summary: General Recommendations =&lt;br /&gt;
&lt;br /&gt;
* Choose the most &#039;&#039;&#039;efficient algorithms&#039;&#039;&#039; for the given problem&lt;br /&gt;
* Run only &#039;&#039;&#039;necessary&#039;&#039;&#039; jobs: Please consider testing new setups and their output for validity prior to submitting a huge amount of similar jobs&lt;br /&gt;
* Start &#039;&#039;&#039;small&#039;&#039;&#039;: Run your problem on a small number of parallel entities (be it processes or threads) first.&lt;br /&gt;
* &#039;&#039;&#039;Estimate&#039;&#039;&#039; the runtime of the parallel job as &#039;&#039;&#039;accurately&#039;&#039;&#039; as possible to increase the efficiency of the scheduling of the whole system&lt;br /&gt;
* Use the proper tools for development: If you develop your own code, please use the proper tools for debugging and parallel performance analysis. More information is available on the bwHPC Wiki.&lt;br /&gt;
* A look at the &#039;&#039;&#039;job feedback&#039;&#039;&#039; can help you determine if you are using the cluster efficiently&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwForCluster_User_Access_Members_Uni_Konstanz&amp;diff=12089</id>
		<title>BwForCluster User Access Members Uni Konstanz</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwForCluster_User_Access_Members_Uni_Konstanz&amp;diff=12089"/>
		<updated>2023-07-04T09:05:22Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;A valid account for the University of Konstanz is required to access the bwForCluster. You therefore need at least an employee or student ID (Matrikelnummer).&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==  ==&lt;br /&gt;
&lt;br /&gt;
Your [[Registration/bwForCluster/RV | registration request]] for a new &#039;&#039;rechenvorhaben&#039;&#039; will be delivered to your local support team at the University of Konstanz. They will automatically check your request and set the bwForCluster-Entitlement, or contact you directly if further information is necessary.&lt;br /&gt;
&lt;br /&gt;
In case of joining an existing &#039;&#039;rechenvorhaben&#039;&#039;, please contact the [https://www.kim.uni-konstanz.de/services/forschen-und-lehren/high-performance-computing/ local support] to obtain the bwForCluster-Entitlement.&lt;br /&gt;
&lt;br /&gt;
==   ==&lt;br /&gt;
&lt;br /&gt;
For more information about registration, visit the related web pages and follow the instructions documented there.&lt;br /&gt;
&lt;br /&gt;
German version: [https://www.kim.uni-konstanz.de/services/forschen-und-lehren/high-performance-computing/zugang-bwforcluster/ Zugang bwForCluster]&lt;br /&gt;
&lt;br /&gt;
English version: [https://www.kim.uni-konstanz.de/en/services/research-and-teaching/high-performance-computing/access-bwforcluster/ Access bwForCluster (in progress)]&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwForCluster_User_Access_Members_Uni_Konstanz&amp;diff=12088</id>
		<title>BwForCluster User Access Members Uni Konstanz</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwForCluster_User_Access_Members_Uni_Konstanz&amp;diff=12088"/>
		<updated>2023-07-04T09:04:38Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;A valid account for the University of Konstanz is required to access the bwForCluster. You therefore need at least an employee or student ID (Matrikelnummer).&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==  ==&lt;br /&gt;
&lt;br /&gt;
Your [[Registration/bwForCluster/RV | registration request]] for a new &#039;&#039;rechenvorhaben&#039;&#039; will be delivered to your local support team at the University of Konstanz. They will automatically check your request and set the bwForCluster-Entitlement, or contact you directly if further information is necessary.&lt;br /&gt;
&lt;br /&gt;
In case of joining an existing &#039;&#039;rechenvorhaben&#039;&#039;, please contact the [http://www.rz.uni-konstanz.de/en/support/ local support] to obtain the bwForCluster-Entitlement.&lt;br /&gt;
&lt;br /&gt;
==   ==&lt;br /&gt;
&lt;br /&gt;
For more information about registration, visit the related web pages and follow the instructions documented there.&lt;br /&gt;
&lt;br /&gt;
German version: [https://www.kim.uni-konstanz.de/services/forschen-und-lehren/high-performance-computing/zugang-bwforcluster/ Zugang bwForCluster]&lt;br /&gt;
&lt;br /&gt;
English version: [https://www.kim.uni-konstanz.de/en/services/research-and-teaching/high-performance-computing/access-bwforcluster/ Access bwForCluster (in progress)]&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=BwForCluster_User_Access_Members_Uni_Konstanz&amp;diff=12086</id>
		<title>BwForCluster User Access Members Uni Konstanz</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=BwForCluster_User_Access_Members_Uni_Konstanz&amp;diff=12086"/>
		<updated>2023-07-04T09:00:42Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;A valid account at the University of Konstanz is required to access the bwForCluster. You need at least an employee or student ID (Matrikelnummer).&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Your [[#BwForCluster_User_Access | registration request]] for a new &#039;&#039;Rechenvorhaben&#039;&#039; (compute project) will be delivered to your local support team at the University of Konstanz. They will review your request and set the bwForCluster-Entitlement, or contact you directly if further information is needed.&lt;br /&gt;
&lt;br /&gt;
If you are joining an existing &#039;&#039;Rechenvorhaben&#039;&#039;, please contact [http://www.rz.uni-konstanz.de/en/support/ local support] to obtain the bwForCluster-Entitlement.&lt;br /&gt;
&lt;br /&gt;
For more information about registration, visit the related web pages below and follow the instructions documented there.&lt;br /&gt;
&lt;br /&gt;
German version: [https://www.kim.uni-konstanz.de/services/forschen-und-lehren/high-performance-computing/zugang-bwforcluster/ Zugang bwForCluster]&lt;br /&gt;
&lt;br /&gt;
English version: [https://www.kim.uni-konstanz.de/en/services/research-and-teaching/high-performance-computing/access-bwforcluster/ Access bwForCluster (in progress)]&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
	<entry>
		<id>https://wiki.bwhpc.de/wiki/index.php?title=Scaling&amp;diff=12083</id>
		<title>Scaling</title>
		<link rel="alternate" type="text/html" href="https://wiki.bwhpc.de/wiki/index.php?title=Scaling&amp;diff=12083"/>
		<updated>2023-07-04T08:39:25Z</updated>

		<summary type="html">&lt;p&gt;J Steuer: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction = &lt;br /&gt;
&lt;br /&gt;
Before you submit large production runs on a bwHPC cluster, you should determine the optimal amount of resources for your compute job. Poor job efficiency means that hardware resources are wasted: a similar overall result could have been achieved with fewer hardware resources, leaving those free for other jobs and reducing the queue wait time for all users.&lt;br /&gt;
&lt;br /&gt;
The main advantage of today&#039;s compute clusters is that they are able to perform calculations in parallel. Whether and how your code can be parallelized is of fundamental importance for achieving good job efficiency and performance on an HPC cluster. A scaling analysis identifies the amount of resources (such as the number of cores, nodes, or GPUs) that enables the best performance for a given compute job.&lt;br /&gt;
&lt;br /&gt;
See also [[Energy Efficient Cluster Usage]].&lt;br /&gt;
&lt;br /&gt;
= Considering Resources vs. Queue Time = &lt;br /&gt;
&lt;br /&gt;
When a job is submitted to the scheduler of an HPC cluster, the job first waits in the queue before being executed on the compute nodes. &lt;br /&gt;
The amount of time spent in the queue is called the queue time. &lt;br /&gt;
The amount of time it takes for the job to run on the compute nodes is called the execution time.&lt;br /&gt;
&lt;br /&gt;
The figure below shows that the queue time increases with increasing resources (e.g., CPU cores) while the execution time decreases with increasing resources. &lt;br /&gt;
One should try to find the optimal set of resources that minimizes the &amp;quot;time to solution&amp;quot;, which is the sum of the queue and execution times. &lt;br /&gt;
A simple rule is to choose the smallest set of resources that gives a reasonable speed-up over the baseline case.&lt;br /&gt;
&lt;br /&gt;
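To make this trade-off concrete, here is a minimal sketch in Python (the queue and execution times are made-up placeholder values, not measurements from a real cluster) that picks the core count with the smallest time to solution:&lt;br /&gt;
&lt;br /&gt;
 # Hypothetical (queue time, execution time) pairs in hours for several core counts.&lt;br /&gt;
 timings = {&lt;br /&gt;
     16: (1.0, 40.0),&lt;br /&gt;
     32: (2.0, 22.0),&lt;br /&gt;
     64: (5.0, 13.0),&lt;br /&gt;
     128: (12.0, 8.0),&lt;br /&gt;
 }&lt;br /&gt;
 # Time to solution = queue time + execution time; pick the core count that minimizes it.&lt;br /&gt;
 best_cores = min(timings, key=lambda c: sum(timings[c]))&lt;br /&gt;
 print(best_cores, sum(timings[best_cores]))  # 64 cores, 18.0 hours in total&lt;br /&gt;
&lt;br /&gt;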
[[File:Fig_resources_vs_queue_time.jpg|800px|center]]&lt;br /&gt;
&lt;br /&gt;
= Scaling Efficiency =&lt;br /&gt;
&lt;br /&gt;
When you run a parallel program, the problem has to be cut into several independent pieces. For some problems this is easier than for others, but in every case it produces an overhead: time is spent dividing the problem, distributing its parts to tasks, and stitching the results back together.&lt;br /&gt;
In purely theoretical terms, if run time did not matter, calculating each problem on a single core would be the most efficient way to use the hardware.&lt;br /&gt;
In extreme cases, when the problem is very hard to divide, using more compute cores can even make the job finish later.&lt;br /&gt;
&lt;br /&gt;
For real calculations, it is often impractical to wait for calculations to finish if they are done on a single core. &lt;br /&gt;
Typical calculation times for a job should stay under 2 days, or up to 2 weeks for jobs that cannot use more cores efficiently. &lt;br /&gt;
Any longer, and risks such as node failures, cluster downtimes due to maintenance, and obtaining (possibly wrong) results only after a very long wait become too much of a problem.&lt;br /&gt;
&lt;br /&gt;
A common way to assess the efficiency of a parallel program is through its speedup. &lt;br /&gt;
Here, the speedup is defined as the ratio of the run time of a serial program to the run time of a parallel program that accomplishes the same work. &lt;br /&gt;
&lt;br /&gt;
 Speedup = Time(serial program) / Time(parallel program)&lt;br /&gt;
&lt;br /&gt;
A simple example would be a calculation that takes 1000 hours on 1 core.&lt;br /&gt;
Without any overhead from parallelization, the same calculation run on 100 cores would need 1000/100 = 10 hours, corresponding to the ideal speedup of 100.&lt;br /&gt;
More realistically, a parallelized code would need around 30 hours for this calculation.&lt;br /&gt;
&lt;br /&gt;
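As a rough sketch, the speedup from the example above can be computed with a few lines of Python (the run times are the illustrative values from the text, not real measurements):&lt;br /&gt;
&lt;br /&gt;
 # Speedup = Time(serial program) / Time(parallel program)&lt;br /&gt;
 def speedup(t_serial, t_parallel):&lt;br /&gt;
     return t_serial / t_parallel&lt;br /&gt;
 # Example from the text: 1000 h on 1 core, run again on 100 cores.&lt;br /&gt;
 print(speedup(1000.0, 10.0))  # ideal case without overhead: 100.0&lt;br /&gt;
 print(speedup(1000.0, 30.0))  # realistic case: about 33.3&lt;br /&gt;
&lt;br /&gt;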
[[File:Fig_speedup.png|400px|center]]&lt;br /&gt;
&lt;br /&gt;
However, there is a theoretical upper limit on how much faster you can solve the original problem by using additional cores ([[Wikipedia:Amdahl%27s_law|Amdahl&#039;s Law]]). &lt;br /&gt;
While a considerable part of a compute job might parallelize nicely, some portion of the time is always spent on I/O (such as saving to or reading from disk), network limitations, communication overhead, or calculations that cannot be parallelized. This limits the speedup that can be achieved by simply adding more computational resources.&lt;br /&gt;
&lt;br /&gt;
From the speedup, a useful definition of efficiency can be derived:&lt;br /&gt;
&lt;br /&gt;
 Efficiency = Speedup / Number of cores = Time(serial program) / (Time(parallel program) * Number of cores)&lt;br /&gt;
&lt;br /&gt;
The efficiency estimates how well your code uses additional cores and how much of the resources is lost to parallelization overhead.&lt;br /&gt;
Returning to the previous example, we can use the serial run time (1000 hours), the run time of the parallelized code (30 hours), and the number of cores used (100) to calculate the efficiency.&lt;br /&gt;
&lt;br /&gt;
 Efficiency = 1000 / (30 * 100) ≈ 0.33&lt;br /&gt;
&lt;br /&gt;
This shows that for this example only about a third of the resources are used to solve the problem, while roughly two thirds are spent on parallelization overhead.&lt;br /&gt;
A semi-arbitrary cut-off for a well-scaled job is that no more than 50% of the computation is wasted on parallelization overhead.&lt;br /&gt;
Therefore, too many resources are used in this example.&lt;br /&gt;
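&lt;br /&gt;
The efficiency of the example can be reproduced with a small Python sketch (again using the illustrative numbers from the text):&lt;br /&gt;
&lt;br /&gt;
 # Efficiency = Speedup / Number of cores&lt;br /&gt;
 def efficiency(t_serial, t_parallel, n_cores):&lt;br /&gt;
     return (t_serial / t_parallel) / n_cores&lt;br /&gt;
 # Example from the text: 1000 h serial, 30 h on 100 cores.&lt;br /&gt;
 print(efficiency(1000.0, 30.0, 100))  # about 0.33, i.e. roughly two thirds of the resources go to overhead&lt;br /&gt;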
&lt;br /&gt;
In many cases, the serial run time of a given code on a single core is not available, because obtaining it would take a very long time, which is usually the reason an HPC cluster is needed in the first place.&lt;br /&gt;
To circumvent this, the relative speedup when doubling the number of cores is calculated.&lt;br /&gt;
&lt;br /&gt;
 Relative Speedup (N Cores -&amp;gt; 2N Cores) = Time(N Cores) / Time(2N Cores)&lt;br /&gt;
&lt;br /&gt;
The relative speedup obtained by doubling the number of cores can be used as a rough guideline for a scaling analysis. &lt;br /&gt;
If doubling the number of cores results in a relative speedup of above 1.8, the scaling is considered good.&lt;br /&gt;
A value above 1.7 is still considered acceptable, while a relative speedup below 1.7 should usually be avoided.&lt;br /&gt;
We can illustrate this by using our simple parallelization example from above. &lt;br /&gt;
If we assume that our code would have finished in 45 hours when using 50 cores, we can calculate the relative speedup:&lt;br /&gt;
&lt;br /&gt;
 Relative Speedup (50 Cores -&amp;gt; 100 Cores) = Time(50 Cores) / Time(100 Cores) = 45 h / 30 h = 1.5&lt;br /&gt;
&lt;br /&gt;
A relative speedup of 1.5 is considered undesirable, so we should run our example code using 50 rather than 100 cores on the HPC cluster.&lt;br /&gt;
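&lt;br /&gt;
A minimal Python sketch of this check (the 1.8 and 1.7 thresholds are the rough guidelines from above; the run times are the illustrative values from the example):&lt;br /&gt;
&lt;br /&gt;
 # Relative Speedup (N cores -&amp;gt; 2N cores) = Time(N cores) / Time(2N cores)&lt;br /&gt;
 def relative_speedup(t_n_cores, t_2n_cores):&lt;br /&gt;
     return t_n_cores / t_2n_cores&lt;br /&gt;
 rs = relative_speedup(45.0, 30.0)  # 45 h on 50 cores, 30 h on 100 cores&lt;br /&gt;
 print(rs)           # 1.5&lt;br /&gt;
 print(rs &amp;gt; 1.8)    # good scaling? False&lt;br /&gt;
 print(rs &amp;gt; 1.7)    # acceptable scaling? False&lt;br /&gt;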
&lt;br /&gt;
In the following, a scaling analysis of a real example using the program VASP is shown.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig_speedup_and_efficiency_1.png|700px|center]]&lt;br /&gt;
&lt;br /&gt;
[[File:Fig_speedup_and_efficiency_2.png|700px|center]]&lt;br /&gt;
&lt;br /&gt;
= Basic Recipe to Determine Core Numbers =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
From the previous chapter, the following rules of thumb for determining a suitable number of cores for HPC compute jobs can be summarized (a short code sketch of the resulting decision rule follows the list):&lt;br /&gt;
&lt;br /&gt;
(1) Optimizing resource usage is most relevant when you submit many jobs or resource-heavy jobs. &lt;br /&gt;
For smaller projects, simply try to use a reasonable core number and you are done.&lt;br /&gt;
&lt;br /&gt;
(2) If you plan to submit many jobs, verify that the core number is acceptable.&lt;br /&gt;
If the jobs use N cores (e.g. N = 96 for a two-node job), then run the same job with N/2 cores (in this example 48 cores).&lt;br /&gt;
&lt;br /&gt;
(3) To calculate the speedup, you then divide the (longer) run time of the N/2-core job by the (shorter) run time of the N-core job. Typically the speedup is a number between 1.0 (no speedup at all) and 2.0 (perfect speedup: all additional cores speed up the job).&lt;br /&gt;
&lt;br /&gt;
(4a) IF the speedup is better than a factor of 1.7, THEN using N cores is perfectly fine.&lt;br /&gt;
&lt;br /&gt;
(4b) IF the speedup is worse than a factor of 1.7, THEN using N cores wastes too many resources and N/2 cores should be used.&lt;br /&gt;
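&lt;br /&gt;
The recipe can be written down as a minimal Python sketch (the threshold of 1.7 and the example core counts follow the steps above; the run times are placeholders you would measure yourself):&lt;br /&gt;
&lt;br /&gt;
 # Decision rule from steps (2)-(4b): keep N cores only if halving them&lt;br /&gt;
 # slows the job down by more than a factor of 1.7.&lt;br /&gt;
 def recommended_cores(n, t_half_n, t_n):&lt;br /&gt;
     speedup = t_half_n / t_n  # step (3): longer run time / shorter run time&lt;br /&gt;
     if speedup &amp;gt; 1.7:&lt;br /&gt;
         return n      # step (4a): using N cores is fine&lt;br /&gt;
     return n // 2     # step (4b): fall back to N/2 cores&lt;br /&gt;
 # Placeholder measurements for a two-node job (N = 96) and the same job on 48 cores.&lt;br /&gt;
 print(recommended_cores(96, 45.0, 30.0))  # speedup 1.5, so 48 cores are recommended&lt;br /&gt;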
&lt;br /&gt;
= Better Resource Usage by Increasing the System Size =&lt;br /&gt;
&lt;br /&gt;
Amdahl’s law, as illustrated above, gives the upper limit of speedup for a problem of fixed size.&lt;br /&gt;
By simply increasing the number of cores to speed up a calculation, your compute job can quickly become inefficient and wasteful. &lt;br /&gt;
While this appears to be a fundamental bottleneck for parallel computing, a different strategy was pointed out ([[Wikipedia:Gustafson&#039;s_law|Gustafson&#039;s law]]). &lt;br /&gt;
&lt;br /&gt;
If a problem only requires a small number of resources, it is not beneficial to use a large number of resources to carry out the computation. &lt;br /&gt;
A more reasonable choice is to use small amounts of resources for small problems and larger quantities of resources for big problems.&lt;br /&gt;
Thus, researchers can take advantage of available cores by scaling up parallel programs to explore their questions in higher resolution or at a larger scale. &lt;br /&gt;
With increases in computational power, researchers can be increasingly ambitious about the scale and complexity of their programs.&lt;br /&gt;
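&lt;br /&gt;
Gustafson&#039;s law expresses this as a scaled speedup, S(N) = N - s * (N - 1), where s is the serial fraction of the (scaled) workload. A small Python sketch, with an assumed serial fraction of 5% (an illustrative value, not a measurement):&lt;br /&gt;
&lt;br /&gt;
 # Gustafson&#039;s law: scaled speedup S(N) = N - s * (N - 1),&lt;br /&gt;
 # where s is the serial fraction of the scaled workload.&lt;br /&gt;
 def scaled_speedup(n_cores, serial_fraction):&lt;br /&gt;
     return n_cores - serial_fraction * (n_cores - 1)&lt;br /&gt;
 print(scaled_speedup(100, 0.05))  # 95.05 for an assumed serial fraction of 5%&lt;br /&gt;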
&lt;br /&gt;
= Reasons For Poor Job Efficiency = &lt;br /&gt;
&lt;br /&gt;
Some simple causes for poor overall job efficiency are:&lt;br /&gt;
&lt;br /&gt;
* A poor choice of resources relative to the size of the nodes leaves part of a node blocked but idle:&lt;br /&gt;
** --ntasks-per-node does not evenly divide the number of cores on a node (e.g. 48)&lt;br /&gt;
** Too much (unneeded) memory or disk space is requested&lt;br /&gt;
* More cores requested than are actually used by the job&lt;br /&gt;
* More cores used for a single MPI/OpenMP parallel computation than is useful&lt;br /&gt;
* Many small jobs with a short runtime (seconds in extreme cases)&lt;br /&gt;
* One-core jobs with very different run times (because of the single-user node policy)&lt;/div&gt;</summary>
		<author><name>J Steuer</name></author>
	</entry>
</feed>