= BwUniCluster 2.0 Maintenance/2020-10/Software Issues =

After the last regular maintenance interval (from 06.10.2020 to 13.10.2020), the following issues with Intel MPI exist:

* Intel MPI 2018 is incompatible with Red Hat 8.2. Any invocation, even a simple "Hello World" MPI program, will result in a crash. The ''mpi/impi/2018'' module has therefore been removed.
* There is a bug in Intel MPI 2019.x that leads to crashes when multiple MPI applications linked against Intel MPI 2019.x run on the same node (e.g. in the "single" partition). The first application runs normally, but all others crash. This can be fixed by setting the environment variable '''I_MPI_HYDRA_TOPOLIB="ipl"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable.
* There is a bug in Intel MPI 2019.x that leads to incorrect CPU binding/affinity in conjunction with the Slurm batch system used on the clusters: all MPI ranks are bound to the same CPU core instead of being spread across all available CPU cores. This can be fixed by setting the environment variable '''I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--cpu-bind=none"'''. The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules provided on the cluster already set this variable. A job script sketch that sets both variables explicitly is shown after this list.
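
The ''mpi/impi/2019'' and ''mpi/impi/2020'' modules already export both variables, so no action is needed when they are used. If Intel MPI 2019.x comes from another source (e.g. a self-installed copy), the variables can be set directly in the job script. The following is a minimal sketch under that assumption; the resource requests and the application name ./my_mpi_app are placeholders, not recommendations from this page.

<pre>
#!/bin/bash
#SBATCH --partition=single        # illustrative partition and resource requests
#SBATCH --ntasks=16
#SBATCH --time=00:30:00

# Workarounds for the Intel MPI 2019.x bugs described above.
# The mpi/impi/2019 and mpi/impi/2020 modules already set these variables,
# so exporting them here is only needed for other Intel MPI installations.
export I_MPI_HYDRA_TOPOLIB="ipl"
export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--cpu-bind=none"

module load mpi/impi/2019

# ./my_mpi_app is a placeholder for the actual MPI application.
mpirun ./my_mpi_app
</pre>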

A number of third-party software modules installed on the cluster system come with their own copies of various Intel MPI library versions. These software modules fall into the following categories:

== Corrected software modules ==

The following software modules have been corrected by the HPC software maintainers. They should currently work as expected.

* ''StarCCM+'': The included Intel MPI 2018 library was replaced with a more recent version.
* ''LS-DYNA'': The included Intel MPI library was replaced with a more recent version.
* ''CST'': The license does not allow multi-node jobs, so the problematic code paths cannot be used.

== Software modules with known fixes ==

The following software modules require additional user interaction to work:

* ''ANSYS Mechanical'' and ''Fluent'': The software has to be switched to Open MPI using the '''-mpi=openmpi''' command line argument (see the sketch after this list).
* ''ANSYS CFX'': The software has to be switched to Open MPI using the -start-method 'Open MPI Distributed Parallel' command line argument.
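
For illustration, the two arguments could be used roughly as follows. This is only a sketch: the solver selection, input files (my_journal.jou, my_case.def) and process count are placeholder assumptions; only -mpi=openmpi and -start-method 'Open MPI Distributed Parallel' come from the fixes above.

<pre>
# Sketch only: solver selection (3ddp), input files and process counts are
# placeholders; only the MPI-selection arguments are taken from this page.

# ANSYS Fluent: switch to Open MPI with -mpi=openmpi
fluent 3ddp -g -t${SLURM_NTASKS} -mpi=openmpi -i my_journal.jou

# ANSYS CFX: switch to Open MPI with -start-method
cfx5solve -def my_case.def -start-method 'Open MPI Distributed Parallel'
</pre>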

== Software modules without known fixes ==

For the following software modules there is currently no known fix:

* ''Abaqus'': Comes with Intel MPI 2017, for which there is currently no known fix. We are working on a solution. The ''cae/abaqus/2019'' software module will not be removed because it can still be used for pre-/post-processing and single-node parallelisation using e.g. OpenMP (see the single-node sketch below).
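
As an illustration of the remaining single-node use case, a thread-parallel Abaqus run might look roughly like the sketch below. The job name, input file, core count and the use of mp_mode=threads for shared-memory parallelisation are assumptions about a typical invocation, not instructions from this page.

<pre>
#!/bin/bash
#SBATCH --partition=single        # single-node job; partition and resources illustrative
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=02:00:00

module load cae/abaqus/2019

# Shared-memory (thread) parallelisation on one node only; mp_mode=threads
# avoids the Intel MPI code path. Job and input names are placeholders.
abaqus job=my_job input=my_model.inp cpus=${SLURM_CPUS_PER_TASK} mp_mode=threads interactive
</pre>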