
Message Passing Interface (MPI)

Message Passing Interface (MPI) is the de facto standard for communication among processes in High Performance Computing, and almost all highly parallel applications use it. Put simply, imagine several processes running your program in parallel so that it completes faster. At regular intervals, each process must update the data it is working on with results computed by the other processes. If the processes are on the same node, they have access to a shared memory space. However, if the program spans multiple nodes, the parallel processes no longer have access to each other's memory, so data must be communicated between processes explicitly.


MPI is a standard that defines the semantics and features of "message passing". There are several implementations of the standard. Those installed on Oscar are

  • MVAPICH2
  • OpenMPI
  • MPICH

We recommend using MVAPICH2 as it is integrated with the SLURM scheduler and optimized for the InfiniBand network. Hence, when running programs built with MVAPICH2, you use SLURM's srun command to launch your application. If the MPI implementation is not configured with SLURM support, as in the case of OpenMPI on Oscar, you use its mpirun command instead.

Note that the default mpirun on Oscar at /usr/local/bin/mpirun is merely a wrapper to SLURM's srun.
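
You can confirm which mpirun is first on your PATH; with no MPI module loaded, it should be the wrapper described above:

$ which mpirun
/usr/local/bin/mpirun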


MPI modules on Oscar

Here are all the modules under "mpi" category on Oscar:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ category: mpi ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
intel-mpi/2017.0              mvapich2/2.0rc1-pgi
mpich/1.5                     mvapich2/2.0rc2
mpich/1.5-slurm-pmi           mvapich2/2.0rc2-no-slurm-pmi
mpich/3.0.4                   mvapich2/2.1
mpich/3.0.4-no-slurm          mvapich2/2.2a
mpich/3.1.1                   mvapich2/2.2a-intel
mvapich2/2.0rc1               mvapich2/2.2a-pgi
mvapich2/2.0rc1-cave          openmpi/1.8
mvapich2/2.0rc1-intel         openmpi/1.8.3

The MPI implementations include compilers like mpicc and mpicxx, which are "wrapper" compilers that add the relevant flags to a "base" (or "underlying") compiler so that it can compile code that uses MPI functions. All MPI implementations let you choose which base compiler to use (such as Intel, GCC or PGI). When compiling your MPI application, the most convenient way to change the underlying compiler is to set certain environment variables.

For MVAPICH2, we have implemented this for you through the environment modules, which set the appropriate environment variables (such as MPICH_CC). The naming format of the mvapich2 modules is:

mvapich2/<version>-<compiler>

If the module name does not mention a base compiler, the base compiler is GCC. The default module is mvapich2/2.0rc1.
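
For example, to use MVAPICH2 built against the Intel compilers, load one of the Intel-flavored modules from the list above:

$ module load mvapich2/2.2a-intel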

For OpenMPI, however, there are no separate modules for different base compilers. The default is GCC. You can change this by setting environment variables such as OMPI_MPICC, e.g. "export OMPI_MPICC=icc". For more information, see https://www.open-mpi.org/faq/?category=mpi-apps. With any of the wrappers, you can use the "-show" flag to see which command is actually invoked:

$ module load mvapich2
module: loading 'mvapich2/2.0rc1'
$ mpicc -show
gcc -I/gpfs/runtime/opt/mvapich2/2.0rc1/include -L/gpfs/runtime/opt/mvapich2/2.0rc1/lib -lmpich -lpmi -lopa -lmpl
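
Similarly, for OpenMPI you could override the base compiler and verify the change with -show (a sketch; it assumes the Intel compiler icc is available on your PATH):

$ module load openmpi
$ export OMPI_MPICC=icc
$ mpicc -show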

As mentioned before, MVAPICH2 is built with SLURM support and is optimized for InfiniBand, hence we recommend using it. Although OpenMPI is not built with SLURM support, its mpirun command is still "SLURM-aware", i.e. it knows how many processes to launch based on the resources allocated to your job.

We recommend that you do not keep more than one of these MPI modules loaded at a time, to avoid confusion and some frustrating errors at run time.
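
For example, to switch from OpenMPI to MVAPICH2, check what is currently loaded and unload the other MPI module first:

$ module list
$ module unload openmpi
$ module load mvapich2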


Compiling MPI programs

After loading the relevant module, simply use the MPI wrapper in place of the compiler you would normally use, for example:

$ module load mvapich2
$ mpicc -g -O2 -o mpiprogram mpiprogram.c

Some software that uses Makefiles for compilation may require you to set environment variables such as CC, CXX and FC to point to the corresponding MPI wrappers (mpicc, mpicxx and mpifort).
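
For example, for a typical Makefile-based build you might pass the wrappers on the make command line (a sketch; whether the build honors CC, CXX and FC depends on the package's own Makefiles):

$ module load mvapich2
$ make CC=mpicc CXX=mpicxx FC=mpifort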


Running MPI programs - Interactive

As with other programs, MPI jobs can be run interactively or as batch jobs. To run interactively, first create a resource allocation using salloc. The options for salloc are essentially the same as those for sbatch. Visit this page for more detailed information: https://slurm.schedmd.com/salloc.html

$ salloc -N <# nodes> -n <# MPI tasks> -p <partition> -t <minutes>

For example, to ask for 4 cores to run 4 tasks (MPI processes):

$ salloc -n 4 

Note: It is not possible to run MPI programs on compute nodes by using the interact command.

Next - if using MVAPICH2, you can launch MPI processes using srun.

$ srun ./myapp

If using OpenMPI or a version of MVAPICH2 configured without SLURM support, use mpirun.

$ mpirun ./myapp

With both mpirun and srun, there is no need to specify the number of processes or provide a node list, as these are detected automatically from the allocation. Alternatively, you can use a subset of the allocation by specifying the number of nodes and tasks explicitly: https://slurm.schedmd.com/srun.html

$ srun -N <# nodes> -n <# MPI tasks> ./myapp ...

When you are finished running MPI commands, you can release the allocation by exiting the shell:

$ exit

Also, if you only need to run a single MPI program, you can skip the salloc command and specify the resources in a single srun command:

$ srun -N <# nodes> -n <# MPI tasks> -p <partition> -t <minutes> ./my-mpi-program

This will create the allocation, run the MPI program, and release the allocation.


Running MPI programs - Batch Jobs

Here is a sample batch script to run an MPI program:

#!/bin/bash

# Request an hour of runtime:
#SBATCH --time=1:00:00

# Use 2 nodes with 8 tasks each, for 16 MPI tasks:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Specify a job name:
#SBATCH -J MyMPIJob

# Specify an output file
#SBATCH -o MyMPIJob-%j.out
#SBATCH -e MyMPIJob-%j.err

# Load required modules
module load mvapich2

srun ./MyMPIProgram

Use mpirun instead if you are using OpenMPI or any other implementation not configured with SLURM support. Load the corresponding MPI module in the batch script before running the program, since the executable is usually linked against MPI libraries that are required at run time. If the program is not in the current working directory, provide the full path to it instead of ./MyMPIProgram; alternatively, add the directory containing it to the $PATH environment variable and use srun MyMPIProgram. If the software is installed as a module on CCV, loading that module will add the corresponding programs to $PATH.
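
For example, if the program were built with OpenMPI instead, the launch portion of the batch script above would look like this (a sketch using the openmpi module from the list earlier on this page):

# Load required modules
module load openmpi

mpirun ./MyMPIProgram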


Hybrid MPI+OpenMP

If your code uses OpenMP directives for multi-threading, you can attach several cores to each MPI task using the --cpus-per-task (or -c) option of sbatch or salloc. The environment variable OMP_NUM_THREADS controls the number of threads that will be used.

#!/bin/bash

# Use 2 nodes with 2 tasks each (4 MPI tasks)
# And allocate 4 CPUs to each task for multi-threading
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4

# Load required modules
module load mvapich2

export OMP_NUM_THREADS=4
srun ./MyMPIProgram

The above batch script will launch 4 MPI tasks (2 on each node) and allocate 4 CPUs to each task, for a total of 16 cores for the job. Setting OMP_NUM_THREADS controls how many threads each task uses, although this can also be set inside the program.
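
Instead of hard-coding the thread count, you can derive it from the allocation. SLURM sets the environment variable SLURM_CPUS_PER_TASK in the job environment when --cpus-per-task is specified, so the last two lines of the script above could be written as:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./MyMPIProgram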


Other considerations

  1. Use as many MPI processes on each node as possible instead of spreading them across many nodes. This reduces communication overhead, since latency within a node is lower (see the example after this list).
  2. The performance of MPI programs depends heavily on load balancing, i.e. how uniformly the computation is distributed across the processes. If your program allows it, distribute an equal amount of work to each MPI process.
  3. Try to use an equal number of cores on each node.
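
For example, 40 MPI tasks are usually better requested as 2 fully packed nodes than spread across 5 half-empty nodes. The sketch below assumes nodes with 20 cores each; check the actual core count of the nodes in your partition:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=20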

Performance Scaling

The maximum theoretical speedup that a parallel program can achieve is governed by the proportion of the program that is sequential (Amdahl's law). Moreover, as the number of MPI processes increases, so does the communication overhead, i.e. the time spent sending and receiving messages among the processes. Beyond a certain number of processes, this increase starts to dominate the reduction in computation time, and the overall program slows down instead of speeding up as more processes are added.
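
Amdahl's law makes this bound explicit. If p is the fraction of the program that can be parallelized and N is the number of processes, the ideal speedup (ignoring communication overhead) is

Speedup(N) = 1 / ((1 - p) + p / N)

Even with an unlimited number of processes, the speedup can never exceed 1 / (1 - p); for example, a program that is 90% parallelizable can never run more than 10 times faster than the sequential version.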

Hence, MPI programs (or any parallel programs) stop running faster once the number of processes is increased beyond a certain point.

If you intend to carry out many runs of a program, the right approach is to find the number of processes that gives the lowest (or an acceptably low) run time. Start with a small number of processes, such as 2 or 4, and first verify the correctness of the results by comparing them with a sequential run. Then increase the number of processes gradually to find the optimum, beyond which the run time flattens out or starts to increase.
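
One way to carry out such a scaling study is a simple timing sweep over process counts. The sketch below assumes your executable is called myapp and that the default partition and a 30-minute limit are acceptable; adjust -p and -t as needed:

#!/bin/bash
# Time the same program at increasing MPI process counts
for n in 2 4 8 16 32; do
    echo "== $n MPI processes =="
    time srun -n $n -t 30 ./myapp
done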