BwUniCluster 2.0 Maintenance/2020-10/Software Issues

From bwHPC Wiki

Revision as of 12:11, 27 October 2020

After the last regular maintenance interval (from 06.10.2020 to 13.10.2020) the following issues with Intel MPI exist:

  • Intel MPI 2018 is incompatible with Red Hat 8.2. Any invocation, even a simple "Hello World" MPI program, will result in a crash. The mpi/impi/2018 module has therefore been removed.
  • There is a bug in Intel MPI 2019.x which leads to crashes when multiple MPI applications which are linked against Intel MPI 2019.x are run on the same node (e.g. in the "single" partition). The first application will run normally, but all others will crash. This can be fixed by setting the environment variable I_MPI_HYDRA_TOPOLIB="ipl". The mpi/impi/2019 and mpi/impi/2020 modules provided on the cluster already set this variable.
  • There is a bug in Intel MPI 2019.x which leads to incorrect CPU binding/affinity in conjunction with the Slurm batch system used on the clusters. All MPI ranks will run on the same CPU core instead of being bound to all available CPU cores. This can be fixed by setting the environment variable I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--cpu-bind=none". The mpi/impi/2019 and mpi/impi/2020 modules provided on the cluster already set this variable.
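If you use a self-installed copy of Intel MPI 2019.x instead of the provided modules, both workarounds can be applied manually in a job script. A minimal sketch (the application name is a placeholder):

```shell
#!/bin/bash
# Sketch of the manual workarounds for the two Intel MPI 2019.x bugs above.
# Not needed with the mpi/impi/2019 or mpi/impi/2020 modules, which
# already set these variables.

# Bug: crashes when several Intel MPI applications run on the same node
export I_MPI_HYDRA_TOPOLIB="ipl"

# Bug: all MPI ranks pinned to a single CPU core under Slurm
export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--cpu-bind=none"

# Placeholder application launch:
# mpirun ./my_mpi_program
```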

A number of third-party software modules installed on the cluster system come with their own copies of various Intel MPI library versions. These software modules fall into the following categories:

1 Corrected software modules

The following software modules have been corrected by the HPC software maintainers. They should currently work as expected.

  • StarCCM+: The included Intel MPI 2018 library was replaced with a more recent version.
  • LS-DYNA: The included Intel MPI library was replaced with a more recent version.
  • CST: The license does not allow multi-node jobs, so the problematic code paths cannot be used.

2 Software modules with known fixes

The following software modules require additional user interaction to work:

  • ANSYS Mechanical and Fluent: The software has to be switched to Open MPI using the -mpi=openmpi command line argument.
  • ANSYS CFX: The software has to be switched to Open MPI using the -start-method 'Open MPI Distributed Parallel' command line argument.
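In a job script, the two switches might be applied along the following lines. This is a sketch only: the solver mode, input files, and core counts are placeholder assumptions, and only the -mpi=openmpi and -start-method flags come from the note above; consult the ANSYS documentation for the full launch syntax.

```shell
# ANSYS Fluent: select the bundled Open MPI instead of Intel MPI
# (3ddp, the journal file, and the core count are placeholders)
fluent 3ddp -g -t"${SLURM_NTASKS}" -mpi=openmpi -i journal.jou

# ANSYS CFX: select the Open MPI start method
# (case.def is a placeholder definition file)
cfx5solve -def case.def -start-method 'Open MPI Distributed Parallel'
```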

3 Software modules without known fixes

For the following software modules there is currently no known fix:

  • cae/abaqus/2019 (comes with Intel MPI 2017). We are working on a solution.

Non-working software modules will not be removed, because they can still be used for pre-/post-processing and for single-node parallelisation, e.g. with OpenMP.