Difference between revisions of "Development/Parallel Programming"

From bwHPC Wiki
m (Usage)
m
Line 1: Line 1:
  +
{| width=600px class="wikitable"
{| style="border-style: solid; border-width: 1px"
 
  +
|-
! Navigation: [[BwHPC_Best_Practices_Repository|bwHPC BPR]]
 
  +
! Description !! Content
  +
|-
  +
| module load
  +
| mpi/impi | mpi/openmpi
  +
|-
  +
| Availability
  +
| [[bwUniCluster]] | [[BwForCluster_Chemistry]] | bwGRiD_tu
  +
|-
  +
| Links
  +
| [https://software.intel.com/en-us/intel-mpi-library Intel® MPI Library] | [http://www.open-mpi.org/ Open MPI]
  +
|-
  +
| License
  +
| [https://software.intel.com/en-us/articles/intel-mpi-library-licensing-faq Intel MPI Library Licensing FAQ] <small>install-doc/EULA.txt</small> &#124; [http://www.open-mpi.org/community/license.php Open MPI License]
 
|}
 
|}
  +
<br>
 
 
= Introduction =
 
= Introduction =
 
This page will provide information regarding the supported parallel programming paradigms and specific hints on their usage.
 
This page will provide information regarding the supported parallel programming paradigms and specific hints on their usage.
 
Please refer to the [[BwUniCluster_Environment_Modules|Modules Documentation]] how to setup your environment on bwUniCluster to load a specific software installation.
 
Please refer to the [[BwUniCluster_Environment_Modules|Modules Documentation]] how to setup your environment on bwUniCluster to load a specific software installation.
  +
<br>
 
  +
<br>
  +
= Versions and Availability =
  +
=== impi (Intel) ===
  +
A list of the versions currently available on the bwHPC-C5 clusters can be obtained from the
  +
<br>
  +
<big>
  +
<br>
  +
[https://cis-hpc.uni-konstanz.de/prod.cis/ Cluster Information System CIS]
  +
<br></big>
  +
{{#widget:Iframe
  +
|url=https://cis-hpc.uni-konstanz.de/prod.cis/bwUniCluster/mpi/impi
  +
|width=99%
  +
|height=1000
  +
}}
  +
=== openmpi ===
  +
A list of the versions currently available on the bwHPC-C5 clusters can be obtained from the
  +
<br>
  +
<big>
  +
<br>
  +
[https://cis-hpc.uni-konstanz.de/prod.cis/ Cluster Information System CIS]
  +
<br></big>
  +
{{#widget:Iframe
  +
|url=https://cis-hpc.uni-konstanz.de/prod.cis/bwUniCluster/mpi/openmpi
  +
|width=99%
  +
|height=1500
  +
}}
  +
<br>
 
= OpenMP =
 
= OpenMP =
 
== General Information ==
 
== General Information ==
Line 12: Line 52:
 
Being a thread-based approach, OpenMP is aimed at more fine-grained parallelism than [[BwHPC_Best_Practices_Repository#MPI|MPI]].
 
Being a thread-based approach, OpenMP is aimed at more fine-grained parallelism than [[BwHPC_Best_Practices_Repository#MPI|MPI]].
 
Although there have been extensions to extend OpenMP for inter-node parallelisation, it is a node-level approach aimed to make best usage of a node's cores<!-- -- the section [[#Hybrid Parallelisation|Hybrid Parallelisation]] will explain how to parallelise utilizing MPI plus a thread-based parallelization paradigm like OpenMP-->.
 
Although there have been extensions to extend OpenMP for inter-node parallelisation, it is a node-level approach aimed to make best usage of a node's cores<!-- -- the section [[#Hybrid Parallelisation|Hybrid Parallelisation]] will explain how to parallelise utilizing MPI plus a thread-based parallelization paradigm like OpenMP-->.
  +
<br>
 
 
With regard to ease-of-use, OpenMP is ahead of any other common approach: the source-code is annotated using <tt>#pragma omp</tt> or <tt>!$omp</tt> statements, in C/C++ and Fortran respectively.
 
With regard to ease-of-use, OpenMP is ahead of any other common approach: the source-code is annotated using <tt>#pragma omp</tt> or <tt>!$omp</tt> statements, in C/C++ and Fortran respectively.
 
Whenever the compiler encompasses a semantic block of code encapsulated in a parallel region, this block of code is transparently compiled into a function, which is passed to a so-called team-of-threads upon entering this semantic block. This fork-join model of execution eases a lot of the programmer's pain involved with Threads.
 
Whenever the compiler encompasses a semantic block of code encapsulated in a parallel region, this block of code is transparently compiled into a function, which is passed to a so-called team-of-threads upon entering this semantic block. This fork-join model of execution eases a lot of the programmer's pain involved with Threads.
 
Being a loop-centric approach, OpenMP is aimed at codes with long/time-consuming loops.
 
Being a loop-centric approach, OpenMP is aimed at codes with long/time-consuming loops.
A single combined directive <tt>pragma omp parallel for</tt> will tell the compiler to automatically parallelize the ensuing for-loop.
+
A single combined directive <tt>pragma omp parallel for</tt> will tell the compiler to automatically parallel the ensuing for-loop.
  +
<br>
 
The following example is a bit more advanced in that even reductions of variables over multiple threads are easily parallelizable:
+
The following example is a bit more advanced in that even reductions of variables over multiple threads are easily to parallel:
 
<source lang="c">
 
<source lang="c">
 
for (int i=0, sum = 0.0; i < VECTOR_LEN; i++)
 
for (int i=0, sum = 0.0; i < VECTOR_LEN; i++)
Line 32: Line 72:
 
Compiled without, the code remains as is. Developpers may therefore incrementally parallelize their application based on the profile derived from performance analysis tools, starting with the most time-consuming loops.
 
Compiled without, the code remains as is. Developpers may therefore incrementally parallelize their application based on the profile derived from performance analysis tools, starting with the most time-consuming loops.
 
Using OpenMP's concise API, one may query the number of running threads, the number of processors, a time to calculate runtime, and even set parameters such as the number of threads to execute a parallel region.
 
Using OpenMP's concise API, one may query the number of running threads, the number of processors, a time to calculate runtime, and even set parameters such as the number of threads to execute a parallel region.
  +
<br>
 
 
The OpenMP-4.0 specification added support for the SIMD-directive to better utilize SIMD-vectorization, as well as integrating directives to offload computation to accelerators using the <tt>target</tt> directive: these are integrated into the Intel Compiler and are actively being worked on for the GNU compiler, some restrictions may apply.
 
The OpenMP-4.0 specification added support for the SIMD-directive to better utilize SIMD-vectorization, as well as integrating directives to offload computation to accelerators using the <tt>target</tt> directive: these are integrated into the Intel Compiler and are actively being worked on for the GNU compiler, some restrictions may apply.
 
 
== OpenMP Best Practice Guide ==
 
== OpenMP Best Practice Guide ==
 
The following silly example to calculate the squared Euklidian Norm shows some techniques:
 
The following silly example to calculate the squared Euklidian Norm shows some techniques:
Line 87: Line 126:
 
<tt>warning #12246: variable "v" has loop carried data dependency that may lead to incorrect program execution in parallel mode; see (file:omp_norm2.c line:32)</tt>
 
<tt>warning #12246: variable "v" has loop carried data dependency that may lead to incorrect program execution in parallel mode; see (file:omp_norm2.c line:32)</tt>
 
* Always specify <tt>default(none)</tt> on larger parallel regions in order to specifically set the visibility of variables to either <tt>shared</tt> or <tt>private</tt>.
 
* Always specify <tt>default(none)</tt> on larger parallel regions in order to specifically set the visibility of variables to either <tt>shared</tt> or <tt>private</tt>.
* Try to restructure code to allow for <tt>nowait</tt>: OpenMP defines synchronization points (implied barriers) at the end of worksharing constructs such as the <tt>pragma omp for<tt> directive. If the ensuing section of code does not depend on data being generated inside the parallel section, adding the <tt>nowait</tt> clause to the worksharing directive allows the compiler to eliminate this synchronization point. This reduces overhead, allows for better overlap and better utilization of the processor's ressources. This might imply however to restructure the code (move portions of independent code in between dependent works-sharing constructs).
+
* Try to restructure code to allow for <tt>nowait</tt>: OpenMP defines synchronization points (implied barriers) at the end of worksharing constructs such as the <tt>#pragma omp for</tt> directive. If the ensuing section of code does not depend on data being generated inside the parallel section, adding the <tt>nowait</tt> clause to the worksharing directive allows the compiler to eliminate this synchronization point. This reduces overhead, allows for better overlap and better utilization of the processor's resources. This may, however, require restructuring the code (moving portions of independent code in between dependent worksharing constructs).
 
 
 
 
== Usage ==
 
== Usage ==
 
OpenMP is supported by various compilers, here the usage for two main compilers [[BwHPC_BPG_Compiler#GCC|GCC]] and [[BwHPC_BPG_Compiler#Intel Suite|Intel Suite]] are introduced.
 
OpenMP is supported by various compilers, here the usage for two main compilers [[BwHPC_BPG_Compiler#GCC|GCC]] and [[BwHPC_BPG_Compiler#Intel Suite|Intel Suite]] are introduced.
Line 96: Line 132:
 
In case You make function calls to OpenMP's API, You also need to include the header-file <tt>omp.h</tt>.
 
In case You make function calls to OpenMP's API, You also need to include the header-file <tt>omp.h</tt>.
 
OpenMP's API allows to query or set the number of threads, query the number of processors, get a wall-clock time to measure execution times, etc.
 
OpenMP's API allows to query or set the number of threads, query the number of processors, get a wall-clock time to measure execution times, etc.
 
 
=== OpenMP with GNU Compiler Collection ===
 
=== OpenMP with GNU Compiler Collection ===
 
Starting with version 4.2 the gcc compiler supports OpenMP-2.5.
 
Starting with version 4.2 the gcc compiler supports OpenMP-2.5.
Line 102: Line 137:
 
The installed compilers support OpenMP-3.1.
 
The installed compilers support OpenMP-3.1.
 
<!-- Starting with gcc-4.9 OpenMP-4.0 is supported, however the <tt>target</tt> directive will only offload to the host processor. -->
 
<!-- Starting with gcc-4.9 OpenMP-4.0 is supported, however the <tt>target</tt> directive will only offload to the host processor. -->
  +
<br>
 
 
To use OpenMP with the gcc-compiler, pass <tt>-fopenmp</tt> as parameter.
 
To use OpenMP with the gcc-compiler, pass <tt>-fopenmp</tt> as parameter.
 
 
=== OpenMP with Intel Compiler ===
 
=== OpenMP with Intel Compiler ===
 
The Intel Compiler's support for OpenMP is more advanced than gcc's -- especially in term of programmer support.
 
The Intel Compiler's support for OpenMP is more advanced than gcc's -- especially in term of programmer support.
 
To use OpenMP with the Intel compiler, pass <tt>-openmp</tt> as command-line parameter.
 
To use OpenMP with the Intel compiler, pass <tt>-openmp</tt> as command-line parameter.
  +
<br>
 
 
The Intel compiler can give very insightful information about OpenMP:

* Compile with <tt>-openmp-report2</tt> to get information on which loops were parallelized and, for those that were not, the reason why.

* Compile with <tt>-diag-enable sc-parallel3</tt> to get errors and warnings about your source's weaknesses with regard to parallelization (see the example in the OpenMP Best Practice Guide).
  +
<br>
 
  +
<br>
 
  +
<!-- ---------------------------------------------------------------------------------------------- -->
 
= MPI =
 
= MPI =
In this section, You will find information regarding the supported installations of the Message-Passing Interface libraries and their usage.
+
In this section, You will find information regarding the supported installations of the Message-Passing Interface libraries and their usage.<br>
 
Due to the Fortran interface ABI, all MPI-libraries are normally bound to a specific compiler-vendor and even the specific compiler version.
 
Due to the Fortran interface ABI, all MPI-libraries are normally bound to a specific compiler-vendor and even the specific compiler version.
 
Therefore, as listed in [[BwHPC_BPG_Compiler]] two compilers are supported on bwUniCluster: [[BwHPC_BPG_Compiler#GCC|GCC]] and [[BwHPC_BPG_Compiler#Intel Suite|Intel Suite]].
 
Therefore, as listed in [[BwHPC_BPG_Compiler]] two compilers are supported on bwUniCluster: [[BwHPC_BPG_Compiler#GCC|GCC]] and [[BwHPC_BPG_Compiler#Intel Suite|Intel Suite]].
 
As both compilers are continuously improving, the communication libraries will be adapted in lock-step.
  +
<br>
 
 
With a set of different implementations comes the problem of choice. These pages aim to inform users of the communication libraries about the considerations regarding performance, maintainability and debugging -- in general, tool support -- of the various implementations.
 
 
== MPI Introduction ==
 
== MPI Introduction ==
 
The Message-Passing Interface is a standard provided by the [http://www.mpi-forum.org MPI-Forum] which regularly convenes for the [http://meetings.mpi-forum.org MPI-Forum Meetings] to update this standard. The current version is MPI-3.0 available as [http://mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf PDF].
 
The Message-Passing Interface is a standard provided by the [http://www.mpi-forum.org MPI-Forum] which regularly convenes for the [http://meetings.mpi-forum.org MPI-Forum Meetings] to update this standard. The current version is MPI-3.0 available as [http://mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf PDF].
 
This document defines the API of over 300 functions for the C- and the Fortran-language -- however, You will certainly not need all of them to begin with.
 
This document defines the API of over 300 functions for the C- and the Fortran-language -- however, You will certainly not need all of them to begin with.
  +
<br>
 
 
Every MPI-conforming program needs to call <tt>MPI_Init()</tt> and <tt>MPI_Finalize()</tt> upon start and shutdown -- or <tt>MPI_Abort()</tt> in case of an abnormal termination.
 
Every MPI-conforming program needs to call <tt>MPI_Init()</tt> and <tt>MPI_Finalize()</tt> upon start and shutdown -- or <tt>MPI_Abort()</tt> in case of an abnormal termination.
 
After initialization the program may call any other MPI function, specifically communication functions.

However, to do so, it needs to find out how many processes the program has been started with, using <tt>MPI_Comm_size()</tt>, and what number (here called a rank) this particular process has, using <tt>MPI_Comm_rank()</tt>.

Communication is always relative to a so-called communicator -- the default one after initialization being called <tt>MPI_COMM_WORLD</tt>.
  +
<br>
 
 
There are basically three ways of communicating:

* two-sided communication using point-to-point (often abbreviated P2P) functions, such as <tt>MPI_Send()</tt> and <tt>MPI_Recv()</tt>, which always involves two participating processes,

* collective communication functions (often abbreviated as colls), which involve multiple processes; examples are <tt>MPI_Bcast()</tt> and <tt>MPI_Reduce()</tt>,

* one-sided communication, where communication between two processes is initiated by one process only. With proper RMA hardware support and careful programming, this may allow higher performance or scalability.
  +
<br>
 
 
All parts of the program that reference MPI functionality need to be compiled with the '''same''' compiler settings and include files and linked to the same MPI library. This is stressed here because, without taking precautions, a different MPI's header may be included, resulting in obscure errors: consider that Intel MPI is derived from MPICH, where MPI datatypes are C <tt>int</tt>s, while Open MPI uses pointers to structures (the former being 4, the latter being 8 bytes on bwUniCluster).
 
To ease the programmer's life, MPI implementations offer compiler-wrappers, e.g. <tt>mpicc</tt> for C and <tt>mpif90</tt> for Fortran90 for compilation and linking, taking care to include all required libraries.
 
To ease the programmer's life, MPI implementations offer compiler-wrappers, e.g. <tt>mpicc</tt> for C and <tt>mpif90</tt> for Fortran90 for compilation and linking, taking care to include all required libraries.
  +
<br>
 
 
All programs must be started using the <tt>mpirun</tt> or <tt>mpiexec</tt> command. Depending on the actual implementation, it uses different arguments, however the following works with any MPI:
 
All programs must be started using the <tt>mpirun</tt> or <tt>mpiexec</tt> command. Depending on the actual implementation, it uses different arguments, however the following works with any MPI:
 
* <tt>mpirun -np 128 ./app</tt> starts 128 processes (with ranks 0 to 127)
 
* <tt>mpirun -np 128 ./app</tt> starts 128 processes (with ranks 0 to 127)
Line 144: Line 178:
 
* <tt>mpiexec -n 64 ./app1 : -n 64 ./app2</tt> starts 128 processes, 64 of which execute <tt>app1</tt>, the other 64 execute <tt>app2</tt>. All processes however participate in the same <tt>MPI_COMM_WORLD</tt> and therefore must accordingly take care about their respective ranks.
 
* <tt>mpiexec -n 64 ./app1 : -n 64 ./app2</tt> starts 128 processes, 64 of which execute <tt>app1</tt>, the other 64 execute <tt>app2</tt>. All processes however participate in the same <tt>MPI_COMM_WORLD</tt> and therefore must accordingly take care about their respective ranks.
 
Please note that process placement (e.g. a round-robin scheme), and specifically process binding to sockets, is MPI-implementation dependent.
 
 
== MPI Best Practice Guide ==
 
== MPI Best Practice Guide ==
 
Specific performance considerations with regard to MPI (independent of the implementation):
 
Specific performance considerations with regard to MPI (independent of the implementation):
Line 156: Line 189:
 
* Bind your processes to sockets: operating systems are good at making best use of the resources -- which sometimes involves moving tasks from one core to another, or even (though less likely, since the OS's heuristics try to avoid it) to another socket, with the obvious effects: caches are cold, and every access to memory allocated on the previous socket "has to travel the bus". This happens in particular if you have multiple OpenMP parallel regions separated by code that does I/O: while the threads are sleeping, the process doing I/O may wander to a different socket. Bind your processes to at least the socket. All major MPIs support this binding (see below).
 
* Do not use the C++ interface: first of all, it was deprecated in MPI-2.2 and removed from the MPI-3.0 standard, since it added little benefit for C++ programmers over the C interface. Moreover, since MPI implementations are written in C, the interface adds another level of indirection and therefore a bit of overhead in terms of instructions and cache misses.
 
 
== Open MPI ==
 
== Open MPI ==
 
The [http://www.open-mpi.org Open MPI] library is an open, flexible and nevertheless performant implementation of MPI-2 and MPI-3. Licensed under BSD, it is being actively developed by an open community of industry and research institutions.
 
The [http://www.open-mpi.org Open MPI] library is an open, flexible and nevertheless performant implementation of MPI-2 and MPI-3. Licensed under BSD, it is being actively developed by an open community of industry and research institutions.
 
The flexibility comes in handy: using the concept of an [http://www.open-mpi.org/faq/?category=tuning#mca-def MCA] (aka a plugin), Open MPI supports many different network interconnects (InfiniBand, TCP, Cray, etc.); in addition, an installation may be tailored to suit a site, e.g. with regard to the network (InfiniBand with specific settings), the main startup mechanism, etc.
 
Furthermore, the [http://www.open-mpi.org/faq/ FAQ] offers hints on [http://www.open-mpi.org/faq/?category=tuning performance tuning].
 
Furthermore, the [http://www.open-mpi.org/faq/ FAQ] offers hints on [http://www.open-mpi.org/faq/?category=tuning performance tuning].
 
 
=== Usage ===
 
=== Usage ===
 
Like other MPI implementations, after loading the [[BwUniCluster_Environment_Modules|module]], Open MPI provides the compiler wrappers <tt>mpicc</tt>, <tt>mpicxx</tt> and <tt>mpifort</tt> (or, for versions lower than 1.7, <tt>mpif77</tt> and <tt>mpif90</tt>) for the C, C++ and Fortran compilers respectively. Although their usage is not required, these wrappers are handy, as one does not have to specify the command-line options for header and library directories, i.e. <tt>-I</tt> and <tt>-L</tt>, or the required MPI libraries themselves.
 
 
=== Further information ===
 
=== Further information ===
 
 
Open MPI also features a few specific functionalities that will help users and developers alike:
 
* Open MPI's tool <tt>ompi_info</tt> allows seeing all of Open MPI's installed MCA components and their specific options.
 
* Open MPI's tool <tt>ompi_info</tt> allows seeing all of Open MPI's installed MCA components and their specific options.
Line 175: Line 204:
 
* Open MPI internally uses the tool [http://www.open-mpi.org/projects/hwloc/ hwloc] for node-local processor information, as well as for process and memory affinity. It is also a good tool to get information on the node's processor topology and cache information. This may be used to optimize and balance memory usage or to choose a better ratio of MPI processes per node vs. OpenMP threads per core.
   
<!--
+
<!--
 
== Intel MPI ==
 
== Intel MPI ==
  +
=== General information ===
 
  +
=== Usage ===
General information
 
  +
=== Further information ===
Usage
 
  +
-->
Further information
 
  +
<!--
 
= Hybrid Parallelization =
+
= Hybrid Parallelization =
 
 
-->
 
-->

Revision as of 11:42, 21 December 2015

Description Content
module load mpi/impi | mpi/openmpi
Availability bwUniCluster | BwForCluster_Chemistry | bwGRiD_tu
Links Intel® MPI Library | Open MPI
License Intel MPI Library Licensing FAQ install-doc/EULA.txt | Open MPI License


1 Introduction

This page provides information regarding the supported parallel programming paradigms and specific hints on their usage. Please refer to the Modules Documentation for how to set up your environment on bwUniCluster to load a specific software installation.

2 Versions and Availability

2.1 impi (Intel)

A list of the versions currently available on the bwHPC-C5 clusters can be obtained from the

Cluster Information System CIS
{{#widget:Iframe |url=https://cis-hpc.uni-konstanz.de/prod.cis/bwUniCluster/mpi/impi |width=99% |height=1000 }}

2.2 openmpi

A list of the versions currently available on the bwHPC-C5 clusters can be obtained from the

Cluster Information System CIS
{{#widget:Iframe |url=https://cis-hpc.uni-konstanz.de/prod.cis/bwUniCluster/mpi/openmpi |width=99% |height=1500 }}

3 OpenMP

3.1 General Information

OpenMP is a mature specification [1] that allows easy, portable and, most importantly, incremental node-level parallelisation of code. Being a thread-based approach, OpenMP is aimed at more fine-grained parallelism than MPI. Although there have been extensions of OpenMP for inter-node parallelisation, it is a node-level approach aimed at making the best use of a node's cores.
With regard to ease of use, OpenMP is ahead of any other common approach: the source code is annotated using #pragma omp or !$omp statements, in C/C++ and Fortran respectively. Whenever the compiler encounters a semantic block of code encapsulated in a parallel region, this block of code is transparently compiled into a function, which is passed to a so-called team of threads upon entering this semantic block. This fork-join model of execution relieves the programmer of much of the pain involved in programming with threads directly. Being a loop-centric approach, OpenMP is aimed at codes with long, time-consuming loops. A single combined directive, #pragma omp parallel for, tells the compiler to automatically parallelize the ensuing for-loop.
The following example is a bit more advanced in that it shows that even reductions of variables over multiple threads are easy to parallelize:

   for (int i = 0; i < VECTOR_LENGTH; i++)
     norm2 += (v[i]*v[i]);

is parallelized by just adding a single line as in:

#  pragma omp parallel for reduction(+:norm2)
   for (int i = 0; i < VECTOR_LENGTH; i++)
     norm2 += (v[i]*v[i]);

With VECTOR_LENGTH being large enough, this piece of code compiled with OpenMP will run in parallel and can exhibit a good speedup. Compiled without OpenMP, the code remains as is. Developers may therefore incrementally parallelize their application based on the profile derived from performance analysis tools, starting with the most time-consuming loops. Using OpenMP's concise API, one may query the number of running threads and the number of processors, read a wall-clock timer to measure runtime, and even set parameters such as the number of threads used to execute a parallel region.
The OpenMP-4.0 specification added the simd directive to better utilize SIMD vectorization, as well as the target directive to offload computation to accelerators. These are integrated into the Intel Compiler and are actively being worked on for the GNU compiler; some restrictions may apply.

3.2 OpenMP Best Practice Guide

The following simple example, which calculates the squared Euclidean norm, shows some techniques:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

#define VECTOR_LENGTH 5

int main (int argc, char * argv[])
{
    int len = VECTOR_LENGTH;
    int i;
    double * v;
    double norm2 = 0.0;
    double t1, tdiff;

    if (argc > 1)
        len = atoi (argv[1]);
    v = malloc (len * sizeof(double));

    t1 = omp_get_wtime();
    // Initialization already with (the same number of) threads
#pragma omp parallel for
    for (i=0; i < len; i++) {
        v[i] = i;
    }

    // Now aggregate the sum-of-squares by specifying a reduction
#pragma omp parallel for reduction(+:norm2)
    for(i=0; i < len; i++) {
        norm2 += (v[i]*v[i]);
    }
    tdiff = omp_get_wtime() - t1;

    printf ("norm2: %f Time:%f\n", norm2, tdiff);
    return 0;
}
  • Group independent parallel sections together: in the above example, You may combine those two sections into one larger parallel block. This will just once enter the parallel region (in the fork-join model) instead of twice. Especially in inner loops, this will considerably decrease overhead.
  • Compile with the Intel compiler's option -diag-enable sc-parallel3 to get further warnings on thread-safety, performance, etc. The following code with a loop-carried dependency will, for example, compile fine (i.e. without a warning):
#pragma omp parallel for reduction(+:norm2)
    for(i=1; i < len-1; i++) {
        v[i] = v[i-1]+v[i+1];
    }

However, the Intel compiler with -diag-enable sc-parallel3 will produce the following warning: warning #12246: variable "v" has loop carried data dependency that may lead to incorrect program execution in parallel mode; see (file:omp_norm2.c line:32)

  • Always specify default(none) on larger parallel regions in order to explicitly set the visibility of each variable to either shared or private.
  • Try to restructure code to allow for nowait: OpenMP defines synchronization points (implied barriers) at the end of worksharing constructs such as the omp for directive. If the ensuing section of code does not depend on data being generated inside the worksharing construct, adding the nowait clause to the worksharing directive allows the compiler to eliminate this synchronization point. This reduces overhead and allows for better overlap and better utilization of the processor's resources. It may, however, require restructuring the code (moving portions of independent code in between dependent worksharing constructs).
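
The practices listed above can be sketched together in one function. This is only an illustration under stated assumptions: the function fill_and_dot and the second array w are hypothetical and do not appear in the original example. It uses a single parallel region, default(none) with explicit sharing, and nowait on the one loop whose result is not needed immediately.

```c
/* Hypothetical helper: fills v[i] = i and w[i] = 2*i, then returns the
   sum of v[i]*w[i]. All three loops share one parallel region, so the
   threads are forked and joined only once. */
double fill_and_dot(double *v, double *w, int len)
{
    double sum = 0.0;

    /* default(none): every variable used inside must be listed explicitly.
       Loop counters declared inside the region are private automatically. */
#pragma omp parallel default(none) shared(v, w, len, sum)
    {
        /* The next loop writes a different array, so the implied
           barrier of this loop can be dropped with nowait. */
#pragma omp for nowait
        for (int i = 0; i < len; i++)
            v[i] = i;

        /* Barrier kept here: the reduction below reads both arrays, and
           a thread reaches this barrier only after finishing its share
           of both initialization loops. */
#pragma omp for
        for (int i = 0; i < len; i++)
            w[i] = 2.0 * i;

#pragma omp for reduction(+:sum)
        for (int i = 0; i < len; i++)
            sum += v[i] * w[i];
    }
    return sum;
}
```

Forgetting a variable in the shared or private lists turns into a compile-time error with default(none), which is usually cheaper to fix than a data race found at run time.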

3.3 Usage

OpenMP is supported by various compilers; here the usage of the two main compilers, GCC and the Intel suite, is introduced. For both compilers, you first need to turn on OpenMP support by specifying a parameter on the compiler's command line. In case you make function calls to OpenMP's API, you also need to include the header file omp.h. OpenMP's API allows you to query or set the number of threads, query the number of processors, get a wall-clock time to measure execution times, etc.
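
Whether OpenMP support was actually turned on can be checked from within the program, because conforming compilers define the macro _OPENMP only when the OpenMP option is given. The following sketch uses hypothetical helper names (openmp_spec_date, available_threads, report_openmp) and guards the API calls so that the same source also builds without OpenMP.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>    /* only needed for calls to OpenMP's API */
#endif

/* Hypothetical helper: release date (yyyymm) of the OpenMP specification
   supported by the compiler, or 0 if compiled without OpenMP support. */
long openmp_spec_date(void)
{
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}

/* Hypothetical helper: upper bound on the number of threads a parallel
   region would use; 1 if compiled without OpenMP support. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Prints both values, e.g. at program start-up. */
void report_openmp(void)
{
    printf("OpenMP specification date: %ld, threads: %d\n",
           openmp_spec_date(), available_threads());
}
```

A value of 0 from openmp_spec_date() is a quick hint that the OpenMP option was forgotten on the compiler's command line.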

3.3.1 OpenMP with GNU Compiler Collection

Starting with version 4.2 the gcc compiler supports OpenMP-2.5. Since then the analysis capabilities of the GNU compiler have steadily improved. The installed compilers support OpenMP-3.1.
To use OpenMP with the gcc compiler, pass -fopenmp as a command-line parameter.

3.3.2 OpenMP with Intel Compiler

The Intel Compiler's support for OpenMP is more advanced than gcc's, especially in terms of programmer support. To use OpenMP with the Intel compiler, pass -openmp as a command-line parameter.
One may get very insightful information about OpenMP by:

  • compiling with -openmp-report2 to get information on which loops were parallelized and, for those that were not, the reason why;
  • compiling with -diag-enable sc-parallel3 to get errors and warnings about your source's weaknesses with regard to parallelization (see the example in the Best Practice Guide above).



4 MPI

In this section, you will find information regarding the supported installations of the Message-Passing Interface libraries and their usage.
Due to the Fortran interface ABI, all MPI libraries are normally bound to a specific compiler vendor and even to a specific compiler version. Therefore, as listed in BwHPC_BPG_Compiler, two compilers are supported on bwUniCluster: GCC and the Intel suite. As both compilers are continuously improving, the communication libraries will be adapted in lock-step.
With a set of different implementations comes the problem of choice. These pages inform users of the communication libraries about the considerations to make with regard to performance, maintainability and debugging (in general, tool support) for the various implementations.

4.1 MPI Introduction

The Message-Passing Interface is a standard provided by the MPI Forum, which regularly convenes for the MPI Forum meetings to update this standard. The current version is MPI-3.0, available as PDF. This document defines the API of over 300 functions for the C and Fortran languages; however, you will certainly not need all of them to begin with.
Every MPI-conforming program needs to call MPI_Init() and MPI_Finalize() upon start and shutdown, or MPI_Abort() in case of an abnormal termination. After initialization the program may call any other MPI function, specifically communication functions. To do so, however, it needs to find out how many processes the program has been started with, using MPI_Comm_size(), and what number (here called a rank) this particular process has, using MPI_Comm_rank(). Communication is always relative to a so-called communicator; the default one after initialization is called MPI_COMM_WORLD.
There are basically three ways of communication:

  • two-sided communication using point-to-point (often abbreviated P2P) functions, such as MPI_Send() and MPI_Recv(), which always involves two participating processes,
  • collective communication functions (often abbreviated as colls), which involve multiple processes; examples are MPI_Bcast() and MPI_Reduce(),
  • one-sided communication, where communication between two processes is initiated by one process only. With proper RMA hardware support and careful programming, this may allow higher performance or scalability.


All parts of the program that reference MPI functionality need to be compiled with the same compiler settings and include files and linked to the same MPI library. This is stressed here because, without taking precautions, a different MPI's header may be included, resulting in obscure errors: consider that Intel MPI is derived from MPICH, with MPI datatypes being C ints, while Open MPI uses pointers to structures (the former being 4, the latter being 8 bytes on bwUniCluster). To ease the programmer's life, MPI implementations offer compiler wrappers, e.g. mpicc for C and mpif90 for Fortran 90, for compilation and linking, taking care to include all required libraries.
All programs must be started using the mpirun or mpiexec command. The accepted arguments depend on the actual implementation; however, the following works with any MPI:

  • mpirun -np 128 ./app starts 128 processes (with ranks 0 to 127)
  • mpiexec -n 128 -hostfile mynodes.txt ./app starts 128 processes on only the nodes listed line-by-line in the provided text-file mynodes.txt.
  • mpiexec -n 64 ./app1 : -n 64 ./app2 starts 128 processes, 64 of which execute app1, the other 64 execute app2. All processes however participate in the same MPI_COMM_WORLD and therefore must accordingly take care about their respective ranks.

Please note that process placement (e.g. a round-robin scheme), and specifically process binding to sockets, is MPI-implementation dependent.

4.2 MPI Best Practice Guide

Specific performance considerations with regard to MPI (independent of the implementation):

  • No communication at all is best: Only communicate between processes if at all necessary. Consider that file-access is "communication" as well.
  • If communication is done with multiple processes, try to involve as many processes as possible in just one call: MPI optimizes the communication pattern for so-called "collective communication" to take advantage of the underlying network (with regard to network topology, message sizes, queueing capabilities of the network interconnect, etc.). Therefore try to always think in collective communication if a communication pattern involves a group of processes.
  • Try to group processes together: Function calls like MPI_Cart_create() come in handy for applications with Cartesian domains, but general communicators derived from MPI_COMM_WORLD using MPI_Comm_split() may also benefit from MPI knowing the underlying network topology. Use MPI-3's MPI_Comm_split_type() with MPI_COMM_TYPE_SHARED to obtain a sub-communicator of processes that have access to the same shared memory region (on bwUniCluster, the same node).
  • File-accesses to load / store data must be done collectively: Writing to storage, or even reading the initialization data -- all of which involves getting data from/to all MPI processes -- must be done collectively. MPI's Parallel IO offers a rich API to read and distribute the data access -- in order to take advantage of parallel filesystems like Lustre. A many-fold performance improvement may be seen by writing data in large chunks in collective fashion -- and at the same time being nice to other users and applications.
  • Try to hide the communication by computation: Try to hide (some of) the cost of point-to-point communication by using non-blocking / immediate P2P calls (MPI_Isend and MPI_Irecv et al, followed by MPI_Wait or MPI_Test et al). This may allow the MPI implementation to initiate or even offload communication to the network interconnect and resume executing your application while data is being transferred. MPI-3 adds non-blocking collectives, e.g. MPI_Ibcast() or MPI_Iallreduce(). For extra credit, explain the use cases of MPI_Ibarrier().
  • Every call to MPI may trigger an access to physical hardware -- limit it: When calling communication-related functions such as MPI_Test to check whether a specific communication has finished, the queue of the network adapter may need to be queried. This memory access or even physical hardware access to query the state will cost cycles. Therefore, the programmer should combine multiple requests with functions such as MPI_Waitall() or MPI_Waitany() or their Test*-counterparts.
  • Make use of derived datatypes: instead of manually copying data into temporary, even newly allocated memory, describe the data layout to MPI and let the implementation, or even the network HCA's hardware, do the data fetching.
  • Bind your processes to sockets: Operating systems are good at making best use of the resources, which sometimes involves moving tasks from one core to another, or even (though less likely, since the OS's heuristics try to avoid it) to another socket, with the obvious effects: caches are cold, and every access to memory allocated on the previous socket "has to travel the bus". This happens in particular if you have multiple OpenMP parallel regions separated by code that does IO: while the threads are sleeping, the process doing IO may wander to a different socket. Bind your processes to at least the socket. All major MPIs support this binding (see below).
  • Do not use the C++ interface: First of all, it has been marked as deprecated in the MPI-3.0 standard, since it added little benefit to C++ programmers over the C interface. Moreover, since MPI implementations are written in C, the interface adds another level of indirection and therefore a bit of overhead in terms of instructions and cache misses.

4.3 Open MPI

The Open MPI library is an open, flexible and nevertheless performant implementation of MPI-2 and MPI-3. Licensed under BSD, it is being actively developed by an open community of industry and research institutions. The flexibility comes in handy: using the concept of MCA (Modular Component Architecture) components (aka plugins), Open MPI supports many different network interconnects (InfiniBand, TCP, Cray, etc.). In addition, an installation may be tailored to suit a site, e.g. the network (InfiniBand with specific settings), the main startup mechanism, etc. Furthermore, the FAQ offers hints on performance tuning.

4.3.1 Usage

Like other MPI implementations, after loading the module, Open MPI provides the compiler wrappers mpicc, mpicxx and mpifort (or, for versions lower than 1.7, mpif77 and mpif90) for the C, C++ and Fortran compilers respectively. Although their usage is not required, these wrappers spare you from specifying the command-line options for header and library directories (-I and -L) as well as the required MPI libraries themselves.

4.3.2 Further information

Open MPI also features a few specific functionalities that will help users and developers alike:

  • Open MPI's tool ompi_info allows seeing all of Open MPI's installed MCA components and their specific options.

Without any option, the user gets a list of the settings the Open MPI installation was compiled with (version of compilers, specific configure flags, e.g. debugging or profiling options). Furthermore, using ompi_info --param all all one may see all of the MCA's options. For example, ompi_info --param pml ob1 shows that the default PML-MCA uses an initial free-list of 4 blocks (increased by 64 upon first encountering this limit), which may be increased upon startup for applications that are certain to benefit from a larger value.

  • Open MPI allows adapting MCA parameters on the command line, e.g. the above-mentioned parameter: mpirun -np 16 --mca pml_ob1_free_list_num 128 ./mpi_stub.
  • Open MPI internally uses the tool hwloc for node-local processor information, as well as process and memory affinity. hwloc is also a good way to obtain information on the node's processor topology and caches. This may be used to optimize and balance memory usage or to choose a better ratio of MPI processes per node vs. OpenMP threads per core.