Difference between revisions of "Scaling"

From bwHPC Wiki
Jump to: navigation, search
(Scalability)
(Scalability)
Line 28: Line 28:
   
 
= Scalability =
 
= Scalability =
When you run a parallel program, the problem has to be cut into several independent pieces. For some problems, this is easier than for others - but in every case, this produces an overhead of time used to divide the problem, distribute parts of it to tasks, and stitch the results together.
 
   
  +
When you run a parallel program, the problem has to be cut into several independent pieces. For some problems, this is easier than for others - but in every case, this produces an overhead of time used to divide the problem, distribute parts of it to tasks, and stitch the results together.
 
For a theoretical amount of "infinite calculations", calculating each problem on one single core would be the most efficient way to use the hardware
 
For a theoretical amount of "infinite calculations", calculating each problem on one single core would be the most efficient way to use the hardware
 
In extreme cases, when the problem is very hard to divide, using more compute cores, can even make the job finish later.
 
In extreme cases, when the problem is very hard to divide, using more compute cores, can even make the job finish later.
Line 36: Line 36:
 
Typical calculation times for a job should stay under 2 days, or up to 2 weeks for jobs that cannot use more cores efficiently.
 
Typical calculation times for a job should stay under 2 days, or up to 2 weeks for jobs that cannot use more cores efficiently.
 
Any longer and the risks such as node failures, cluster downtimes due to maintenance, and getting (possibly wrong) results after too much wait time can become too much of a problem.
 
Any longer and the risks such as node failures, cluster downtimes due to maintenance, and getting (possibly wrong) results after too much wait time can become too much of a problem.
How do you decide on how many cores to run your ?
 
 
For this, we want to compare the time needed to solve the problem on 1 core with the time needed on N cores.
 
Example:
 
 
Calculating on 1 core would take 1000 hours. Without any overhead from parallelization, the same calculation run on 96 cores would need
 
1000/96= ~10 hours.
 
More realistically, such a calculation would need ~30 hours.
 
   
 
A common way to assess the efficiency of a parallel program is through its speedup.
 
A common way to assess the efficiency of a parallel program is through its speedup.
Here, speedup is defined as the ratio of the time that a serial program needs to run to the time for the parallel program that accomplishes the same work.
+
Here, speedup is defined as the ratio of the time a serial program needs to run to the time for the parallel program that accomplishes the same work.
   
speedup= time(serial program) / time(parallel program)
+
Speedup= Time(serial program) / Time(parallel program)
   
 
A simple example would be a calculation that takes 1000 hours on 1 core.
 
A simple example would be a calculation that takes 1000 hours on 1 core.
 
Without any overhead from parallelization, the same calculation run on 1000 cores would need 1000/100= 10 hours, the ideal speedup.
 
Without any overhead from parallelization, the same calculation run on 1000 cores would need 1000/100= 10 hours, the ideal speedup.
  +
More realistically, such a calculation for parallelized code would need around 30 hours.
   
 
[[File:Fig_speedup.png|400px|center]]
 
[[File:Fig_speedup.png|400px|center]]
Line 58: Line 51:
 
While a considerable part of a compute job might parallelize nicely, there is always some portion of time spent on I/O, such as saving or reading from disc, network limitations, communication overhead, or performing calculations that cannot be parallelized, thus reducing the speedup that I possible by simply adding more computational resources.
 
While a considerable part of a compute job might parallelize nicely, there is always some portion of time spent on I/O, such as saving or reading from disc, network limitations, communication overhead, or performing calculations that cannot be parallelized, thus reducing the speedup that I possible by simply adding more computational resources.
   
  +
From the speedup, a useful definition of efficiency can be derived:
   
  +
Efficiency = Speedup / Number of cores = Time(serial program) / (Time(parallel program) * Number of cores)
We can define a speedup S:
 
   
  +
The efficiency allows for an estimation of how well your code is using additional cores, and how much of the resources are lost by doing parallelization overhead calculations.
S(N) = \frac{T(1)}/{T(N)}</math> <math>\sqrt{2} (perfect scaling: S(N)=N)
 
  +
Coming back to the previous example, we can now use the time of a serial calculation (1000 hours), the time our parallelized code took to finish (30 hours), the number of cores we used (100 cores), and calculate the efficiency.
T(x): time to compute your problem
 
N: number of cores the calculation is done on
 
   
  +
Efficiency = 1000 / (30 * 100)=0.3
and an efficiency E:
 
  +
  +
This shows that for this example, only 30% of the resources are used to solve the problem, while 70% of our resources are spent on parallelization overhead.
  +
A rough
  +
  +
In many cases, the time needed for calculating a given code in serial, on a single core, is not accessible, as this would take a very long time and is usually the reason why an HPC cluster is needed in the first place.
  +
To circumvent this, the relative speedup when doubling the number of cores is calculated.
  +
  +
Relative Speedup (16->32) = Time(16) / Time(32)
   
E(N) = S(N) / N (ideally, this would = 1)
 
or combined
 
E(N) = T(1)/(T(N)*N)
 
   
 
E is the percentage (0.00-1.00) of cores that actually contribute to the speedup. The other cores are busy doing overhead calculations needed to run the problem in parallel.
 
E is the percentage (0.00-1.00) of cores that actually contribute to the speedup. The other cores are busy doing overhead calculations needed to run the problem in parallel.
  +
  +
   
 
In this example, we have E(N) = 1000/(30*96)=0.35
 
In this example, we have E(N) = 1000/(30*96)=0.35

Revision as of 17:22, 8 May 2023

1 Introduction

Before you submit large production runs on a bwHPC cluster you should define an optimal number of resources required for your compute job. Poor job efficiency means that hardware resources are wasted and a similar overall result could have been achieved using fewer hardware resources, leaving those for other jobs and reducing the queue wait time for all users.

The main advantage of today‘s compute clusters is that they are able to perform calculations in parallel. If and how your code is able to be parallelized is of fundamental importance for achieving good job efficiency and performance on an HPC cluster. A scaling analysis is done by identifying the number of resources (such as the number of cores, nodes, or GPUs) that enable the best performance for a given compute job.

2 Reasons for poor job efficiency

Some simple causes for poor overall job efficiency are:

  • Poor choice of resources compared to the size of the nodes leaves part of the node blocked, but doing nothing:
    • Multiple of --ntasks-per-node is not the number of cores on a node (e.g. 48)
    • Too much (un-needed) memory or disk space requested
  • More cores requested than are actually used by the job
  • More cores used for a single mpi/openmp parallel computation than useful
  • Many small jobs with a short runtime (seconds in extreme cases)
  • One-core jobs with very different run-times (because of single-user policy)


3 Time to Finish: Considering Resources vs. Queue Time

When a job is submitted to the Slurm scheduler, the job first waits in the queue before being executed on the compute nodes. The amount of time spent in the queue is called the queue time. The amount of time it takes for the job to run on the compute nodes is called the execution time.

The figure below shows that the queue time increases with increasing resources (e.g., CPU cores) while the execution time decreases with increasing resources. One should try to find the optimal set of resources that minimizes the "time to solution" which is the sum of the queue and execution times. A simple rule is to choose the smallest set of resources that gives a reasonable speed-up over the baseline case.

Fig resources vs queue time.jpg

4 Scalability

When you run a parallel program, the problem has to be cut into several independent pieces. For some problems, this is easier than for others - but in every case, this produces an overhead of time used to divide the problem, distribute parts of it to tasks, and stitch the results together. For a theoretical amount of "infinite calculations", calculating each problem on one single core would be the most efficient way to use the hardware In extreme cases, when the problem is very hard to divide, using more compute cores, can even make the job finish later.

For real calculations, it is often impractical to wait for calculations to finish if they are done on a single core. Typical calculation times for a job should stay under 2 days, or up to 2 weeks for jobs that cannot use more cores efficiently. Any longer and the risks such as node failures, cluster downtimes due to maintenance, and getting (possibly wrong) results after too much wait time can become too much of a problem.

A common way to assess the efficiency of a parallel program is through its speedup. Here, speedup is defined as the ratio of the time a serial program needs to run to the time for the parallel program that accomplishes the same work.

Speedup= Time(serial program) / Time(parallel program)

A simple example would be a calculation that takes 1000 hours on 1 core. Without any overhead from parallelization, the same calculation run on 1000 cores would need 1000/100= 10 hours, the ideal speedup. More realistically, such a calculation for parallelized code would need around 30 hours.

Fig speedup.png

However, there is a theoretical upper limit on how much faster you can solve the original problem by using additional cores (Amdahl's Law ((Amdahl's Law)). While a considerable part of a compute job might parallelize nicely, there is always some portion of time spent on I/O, such as saving or reading from disc, network limitations, communication overhead, or performing calculations that cannot be parallelized, thus reducing the speedup that I possible by simply adding more computational resources.

From the speedup, a useful definition of efficiency can be derived:

Efficiency = Speedup / Number of cores = Time(serial program) / (Time(parallel program) * Number of cores)

The efficiency allows for an estimation of how well your code is using additional cores, and how much of the resources are lost by doing parallelization overhead calculations. Coming back to the previous example, we can now use the time of a serial calculation (1000 hours), the time our parallelized code took to finish (30 hours), the number of cores we used (100 cores), and calculate the efficiency.

Efficiency = 1000 / (30 * 100)=0.3

This shows that for this example, only 30% of the resources are used to solve the problem, while 70% of our resources are spent on parallelization overhead. A rough

In many cases, the time needed for calculating a given code in serial, on a single core, is not accessible, as this would take a very long time and is usually the reason why an HPC cluster is needed in the first place. To circumvent this, the relative speedup when doubling the number of cores is calculated.

Relative Speedup (16->32) = Time(16) / Time(32)


E is the percentage (0.00-1.00) of cores that actually contribute to the speedup. The other cores are busy doing overhead calculations needed to run the problem in parallel.


In this example, we have E(N) = 1000/(30*96)=0.35 So 35% of cores are used to solve your problem, and 65% of the cores are used to calculate how to parallelize the problem. As a semi-arbitrary cut-off, we can define that jobs are scaled well if they waste 50% of the computation or less on parallelization overhead. As we can see in the example, calculating T(1) would take 42 days and is hence not practical. How do we determine how many cores to use, if running the calculation on one core takes "forever"? As we cannot compare with T(1), all we can do is compare the relative speedup R(N1,2*N1), that happens on doubling the number of cores used: for example: R(4,8)= T(4)/T(8).

As a rough guideline, an R of 1.8 for doubling the number of cores is good, an R of 1.7 is still acceptable. Avoid R below 1.7 if you have many jobs to run. Overall, this leads us to the following rules of thumb or recipe for determining the number of cores to use:

(1) Optimizing resource usage is most relevant when you submit many jobs. If you submit only one or two jobs, simply try to use a reasonable core number and you are done.

(2) If you plan to submit many jobs, verify that the core number is acceptable. If the jobs use N cores (i.e. N is 96 for a two-node job), then run the same job with N/2 cores (in this example 48 cores).

(3) To calculate the speedup, you then divide the (longer) run time of the N/2-core-job by the (shorter) run time of the N-core-job. Typically the speedup is a number between 1.0 (no speedup at all) and 2.0 (perfect speedup - all additional cores speed up the job).

(4a) IF the speedup is better than a factor of 1.7, THEN using N cores is perfectly fine.

(4b) IF the speedup is worse than a factor of 1.7, THEN using N cores wastes too many resources and N/2 cores should be used. You can see data from a real example using the program VASP in the following

Fig speedup and efficiency 1.png

Fig speedup and efficiency 2.png