= Parallel Programming =
 
On this page, you will find information about the supported parallel programming paradigms and specific hints on their usage.
 
Please refer to [[BwUniCluster_Environment_Modules]] for how to set up your environment on bwUniCluster to load a specific installation.
   
 
== OpenMP ==

=== General Information ===
 
OpenMP is a mature [http://openmp.org/wp/openmp-specifications/ specification] that allows easy, portable, and, most importantly, incremental node-level parallelisation of code.

Being a thread-based approach, OpenMP is aimed at finer-grained parallelism than [[BwHPC_Best_Practices_Repository#MPI|MPI]].

Although there have been extensions to OpenMP for inter-node parallelisation, it is a node-level approach aimed at making the best use of a node's cores -- the section [[BwHPC_Best_Practices_Repository#Hybrid Parallelisation|Hybrid Parallelisation]] explains how to parallelise using MPI plus a thread-based parallelisation paradigm like OpenMP.

With regard to ease of use, OpenMP is ahead of any other common approach: the source code is annotated using <tt>#pragma omp</tt> or <tt>!$omp</tt> statements in C/C++ and Fortran, respectively.

Whenever the compiler encounters a semantic block of code encapsulated in a parallel region, this block of code is transparently compiled into a function, which is invoked by a so-called team of threads upon entering this semantic block. This fork-join model of execution removes much of the programmer's pain involved with threads.

Being a loop-centric approach, OpenMP is aimed at codes with long or time-consuming loops.

A single <tt>#pragma omp parallel for</tt> directive will tell the compiler to automatically parallelize the ensuing for-loop.

The following example is a bit more advanced in that even reductions of variables over multiple threads are easily programmable:
<source lang="c">
for (int i = 0; i < VECTOR_LEN; i++)
    norm2 += (v[i]*v[i]);
</source>
in parallel:
<source lang="c">
#pragma omp parallel for reduction(+:norm2)
for (int i = 0; i < VECTOR_LEN; i++)
    norm2 += (v[i]*v[i]);
</source>
With <tt>VECTOR_LEN</tt> being large enough, this piece of code compiled with OpenMP will run in parallel, exhibiting a very nice speedup.

Compiled without OpenMP, the code remains as it is. Developers may therefore incrementally parallelize their application based on the profile derived from performance analysis, starting with the most time-consuming loops.

Using OpenMP's concise API, one may query the number of running threads, the number of processors, and the wall-clock time (to calculate the runtime), and even set parameters such as the number of threads executing a parallel region.

The OpenMP-4.0 specification added support for the <tt>simd</tt> directive to better utilize SIMD vectorization, as well as directives to offload computation to accelerators using the <tt>target</tt> directive. These are integrated into the Intel Compiler and are actively being worked on for the GNU compiler; some restrictions may apply.
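
The following is a rough sketch of what these two directives look like in C (the <tt>saxpy</tt> functions are made up for illustration; whether the <tt>target</tt> region actually runs on an accelerator depends on compiler and hardware support):
<source lang="c">
/* SIMD-vectorize the loop on the host */
void saxpy_simd (int n, float a, const float *x, float *y)
{
#pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Offload the same loop to an accelerator, if one is available */
void saxpy_target (int n, float a, const float *x, float *y)
{
#pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
#pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
</source>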

=== Usage ===
OpenMP is supported by various compilers; here the usage for the two main compilers, [[BwHPC_BPG_Compiler#GCC|GCC]] and [[BwHPC_BPG_Compiler#Intel Suite|Intel Suite]], is introduced.
For both compilers, you first need to turn on OpenMP support by specifying a parameter on the compiler's command line -- and include the header file <tt>omp.h</tt> in case you make function calls to OpenMP's API.
One may set or even change the number of executing threads, e.g. through the environment variable <tt>OMP_NUM_THREADS</tt> before the run or via <tt>omp_set_num_threads()</tt> at runtime.
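
A small sketch of both ways of controlling the thread count (illustrative only, not part of the original text):
<source lang="c">
#include <stdio.h>
#include <omp.h>

int main (void)
{
    /* Setting the environment variable before the run has the same effect
       for the whole program:  OMP_NUM_THREADS=4 ./a.out                   */
    omp_set_num_threads (4);      /* request a team of four threads */

#pragma omp parallel
    printf ("hello from thread %d of %d\n",
            omp_get_thread_num (), omp_get_num_threads ());
    return 0;
}
</source>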

==== OpenMP with GNU Compiler Collection ====
Starting with version 4.2, the gcc compiler supports OpenMP-2.5.
Since then, the analysis capabilities of the GNU compiler have steadily improved.
The installed compilers support OpenMP-3.1.
<!-- Starting with gcc-4.9 OpenMP-4.0 is supported, however the <tt>target</tt> directive will only offload to the host processor. -->

To use OpenMP with the gcc compiler, pass <tt>-fopenmp</tt> as a parameter.

==== OpenMP with Intel Compiler ====
The Intel Compiler's support for programmers using OpenMP is much more advanced than that of gcc.
To use OpenMP with the Intel compiler, pass <tt>-openmp</tt> as a parameter.

One may get very insightful information about OpenMP when compiling with:
# <tt>-openmp-report2</tt> to get information on which loops were parallelized and, if not, the reason why.
# <tt>-diag-enable sc-parallel3</tt> to get errors and warnings about your source's weaknesses with regard to parallelization (see the example below).

=== Specific OpenMP Best practices ===
The following silly example to calculate the squared Euclidean norm shows some techniques:
<source lang="c">
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

#define VECTOR_LENGTH 5

int main (int argc, char * argv[])
{
    int len = VECTOR_LENGTH;
    int i;
    double * v;
    double norm2 = 0.0;
    double t1, tdiff;

    if (argc > 1)
        len = atoi (argv[1]);
    v = malloc (len * sizeof(double));

    t1 = omp_get_wtime();
#pragma omp parallel for
    for (i=0; i < len; i++) {
        v[i] = i;
    }

#pragma omp parallel for reduction(+:norm2)
    for(i=0; i < len; i++) {
        norm2 += (v[i]*v[i]);
    }
    tdiff = omp_get_wtime() - t1;

    printf ("norm2: %f Time:%f\n", norm2, tdiff);
    return 0;
}
</source>
 
<!-- Specific OpenMP hints: default(none), reproducibility, thread-safety -->
# Group independent parallel sections together: in the above example, you may combine the two parallel loops into one larger parallel block (see the sketch after this list). This enters a parallel region once (in the fork-join model) instead of twice. Especially in inner loops, this will decrease overhead.
# Compile with <tt>-diag-enable sc-parallel3</tt> to get further warnings on thread-safety, performance, etc. E.g. the following code with a loop-carried dependency will compile fine (i.e. without a warning):
<source lang="c">
#pragma omp parallel for reduction(+:norm2)
    for(i=1; i < len-1; i++) {
        v[i] = v[i-1]+v[i+1];
    }
</source>
However, the Intel compiler with <tt>-diag-enable sc-parallel3</tt> will produce the following warning:
<tt>warning #12246: variable "v" has loop carried data dependency that may lead to incorrect program execution in parallel mode; see (file:omp_norm2.c line:32)</tt>
# Always specify <tt>default(none)</tt> on larger parallel regions in order to explicitly set the visibility of variables to either <tt>shared</tt> or <tt>private</tt>.
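
To illustrate the first and the last hint with the example above, the two loops could be merged into a single parallel region with explicit data-sharing clauses (a sketch, not the original code):
<source lang="c">
/* One parallel region instead of two: the team of threads is created only
   once; default(none) forces an explicit visibility for every variable used
   inside the region. The implicit barrier after the first omp for ensures
   that v is fully initialized before the reduction loop starts. */
#pragma omp parallel default(none) shared(v, len) private(i) reduction(+:norm2)
    {
#pragma omp for
        for (i=0; i < len; i++) {
            v[i] = i;
        }

#pragma omp for
        for (i=0; i < len; i++) {
            norm2 += (v[i]*v[i]);
        }
    }
</source>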
   
 
== MPI ==
In this section, you will find valuable information regarding the supported installations of the Message-Passing Interface (MPI) libraries and their usage.
Due to the Fortran interface ABI, all MPI libraries are normally bound to a specific compiler vendor and even to the specific compiler version. Therefore, as listed in [[BwHPC_BPG_Compiler]], two compilers are supported on bwUniCluster: GCC and Intel Suite. As both compilers are continuously improving, the communication libraries will be updated in lock-step.

With a set of different implementations comes the problem of choice. These pages should inform the user about the communication libraries and which considerations should be made with regard to performance, maintainability and debugging -- in general, tool support -- of the various implementations.
 
=== General Performance Considerations ===
Specific performance considerations with regard to MPI (independent of the implementation):
# No communication is best: only communicate between processes if at all necessary. Consider that file access is "communication" as well.
# If communication is done, try to involve as many processes as possible: MPI optimizes the communication pattern for so-called "collective communication" to take advantage of the underlying network (with regard to network topology, message sizes, queueing capabilities of the network interconnect, etc.). Therefore, always try to think in collective communication whenever a communication pattern involves a group of processes. Function calls like <tt>MPI_Cart_create</tt> will come in handy for applications with cartesian domains, but general communicators derived from <tt>MPI_COMM_WORLD</tt> may also benefit from knowing the underlying network topology.
# File accesses to load / store data must be done collectively: writing to storage, or even reading the initialization data -- all of which involves getting data from/to all MPI processes -- must be done collectively. MPI's parallel I/O implementation offers a rich API to read and distribute the data access in order to take advantage of parallel filesystems like Lustre. A many-fold performance improvement may be seen by writing data in large chunks in a collective fashion -- which at the same time is nice to other users and applications.
# For point-to-point communication (P2P), try to hide the communication behind computation by using non-blocking / immediate P2P calls (<tt>MPI_Isend</tt> and <tt>MPI_Irecv</tt> followed by <tt>MPI_Wait</tt> or <tt>MPI_Test</tt>); see the sketch after this list. This may allow the MPI implementation to offload communication to the network interconnect and resume executing your application while data is being transferred.
# Every call to MPI may trigger an access to physical hardware -- limit it: when calling communication-related functions such as <tt>MPI_Test</tt> to check whether a specific communication has finished, the queue of the network adapter may need to be queried. This memory access, or even physical hardware access, to query the state will cost cycles. Therefore, the programmer may want to use aggregate functions such as <tt>MPI_Testall</tt> or <tt>MPI_Waitall</tt>.
# Make use of derived datatypes: instead of manually copying data into temporary, even newly allocated memory, describe the data layout to MPI -- and let the implementation, or even the network HCA's hardware, do the data fetching.
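
The following sketch of a ring exchange (illustrative only, not taken from the bwHPC documentation) combines the non-blocking calls from hint 4 with the aggregate completion call from hint 5:
<source lang="c">
#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
    int rank, size, left, right;
    double send_buf, recv_buf = 0.0;
    MPI_Request requests[2];

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    left  = (rank - 1 + size) % size;      /* neighbours in a ring */
    right = (rank + 1) % size;
    send_buf = (double)rank;

    /* Post the receive and the send immediately ... */
    MPI_Irecv (&recv_buf, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &requests[0]);
    MPI_Isend (&send_buf, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &requests[1]);

    /* ... do useful computation here while the data is in flight ... */

    /* One aggregate completion call instead of polling each request */
    MPI_Waitall (2, requests, MPI_STATUSES_IGNORE);

    printf ("rank %d received %f from rank %d\n", rank, recv_buf, left);
    MPI_Finalize ();
    return 0;
}
</source>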
   
 
 
 
=== Open MPI ===
The [http://www.open-mpi.org Open MPI] library is an open, flexible and nevertheless performant implementation of MPI-2 and MPI-3. Licensed under BSD, it is being actively developed by an open community of industry and research institutions.
The flexibility comes in handy: using the concept of an [http://www.open-mpi.org/faq/?category=tuning#mca-def MCA] (aka a plugin), Open MPI on the one hand supports many different network interconnects (InfiniBand, TCP, Cray, etc.); on the other hand, an installation may be tailored to suit the site, e.g. the network (InfiniBand with specific settings), the main startup mechanism, etc.
Furthermore, the [http://www.open-mpi.org/faq/ FAQ] offers hints on [http://www.open-mpi.org/faq/?category=tuning performance tuning].
   
 
==== Usage ====
Like other MPI implementations, after loading the module, Open MPI provides the compiler wrappers <tt>mpicc</tt>, <tt>mpicxx</tt> and the various Fortran representatives such as <tt>mpif77</tt> for the C, C++ and Fortran compilers, respectively. Albeit their usage is not required, these wrappers are handy as one does not have to pass the command-line options for header and library directories, aka <tt>-I</tt> and <tt>-L</tt>, nor the actually required MPI libraries themselves.

==== Further information ====
Open MPI also features a few specific functionalities that will help users and developers alike:
# Open MPI's tool <tt>ompi_info</tt> allows seeing all of Open MPI's installed MCA components and their specific options. Without any option, the user gets a list of the flags the Open MPI installation was compiled with (version of compilers, specific configure flags, e.g. debugging or profiling options). Furthermore, using <code>ompi_info --param all all</code> one may see all of the MCA's options, e.g. that the default PML MCA uses an initial free-list of 4 blocks (increased by 64 upon first encountering this limit): <code>ompi_info --param ob1 all</code> -- this may be increased for applications that are certain to benefit from a larger value upon startup.
# Open MPI allows adapting MCA parameters on the command line, e.g. the above-mentioned parameter: <code>mpirun -np 16 --mca pml_ob1_free_list_num 128 ./mpi_stub</code>.
# Open MPI internally uses the tool [http://www.open-mpi.org/projects/hwloc/ hwloc] for node-local processor information, as well as process and memory affinity. It is also a good tool to get information on the node's processor topology and cache hierarchy. This may be used to optimize and balance memory usage or for choosing a better ratio of MPI processes per node vs. OpenMP threads per core.
   
 
=== Intel® MPI ===
under construction.
 
<!-- Usage -->
<!-- Further information -->

== Hybrid Parallelization ==
under construction.
