The Message Passing Interface (MPI) is the de facto standard for communication among processes in High Performance Computing, and almost all highly parallel applications use it. Simply put, imagine several processes running your program in parallel so that it completes faster: at regular intervals, each process needs to update the data it is working on with results computed by the others. Processes on the same node have access to a shared memory space, but when a program spans multiple nodes, the parallel processes no longer have access to each other's memory. Hence, data must be explicitly communicated among the processes.
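As a minimal sketch of what this looks like in code, here is a small C program using standard MPI calls, in which process 0 sends a value to process 1 (illustrative only; compile it with an MPI wrapper compiler such as mpicc):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id (0, 1, ...) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        int data = 42;
        /* send one int to process 1 (tag 0) */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int data;
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", data);
    }

    MPI_Finalize();
    return 0;
}
```

Each process runs the same program; the rank returned by MPI_Comm_rank is what lets different processes take different branches.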
MPI is a standard that dictates the semantics and features of "message passing". There are several implementations of the standard; those installed on Oscar are MVAPICH2 and OpenMPI.
We recommend using MVAPICH2, as it is integrated with the SLURM scheduler and optimized for the InfiniBand network. Hence, while running programs with MVAPICH2, you would use SLURM's srun command to launch your application. If the MPI implementation is not configured with SLURM support, as is the case for OpenMPI on Oscar, you would use its own mpirun launcher instead. Note that the default mpirun on Oscar at /usr/local/bin/mpirun is merely a wrapper around SLURM's srun.
MPI modules on Oscar
Here are all the modules under "mpi" category on Oscar:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ category: mpi ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The MPI implementations include "wrapper" compilers like mpicc and mpicxx, which add the relevant flags to a "base" (or "underlying") compiler so that it can compile code containing MPI functions. All MPI implementations let you choose which base compiler to use (e.g. Intel, GCC or PGI). While compiling your MPI application, the most convenient way to change the underlying compiler is to set certain environment variables.
For MVAPICH2, we have implemented this for you through the environment modules, which set the relevant environment variables (like MPICH_CC). The names of the mvapich2 modules indicate the base compiler; if no base compiler is mentioned in the name, it is GCC.
For OpenMPI, however, there are no separate modules for different base compilers. The default is gcc. You can change this by setting environment variables such as "export OMPI_MPICC=icc". For more information, see https://www.open-mpi.org/faq/?category=mpi-apps. You can use the "-show" flag to see what command is actually invoked:
$ module load mvapich2
module: loading 'mvapich2/2.0rc1'
$ mpicc -show
gcc -I/gpfs/runtime/opt/mvapich2/2.0rc1/include -L/gpfs/runtime/opt/mvapich2/2.0rc1/lib -lmpich -lpmi -lopa -lmpl
As mentioned before, MVAPICH2 is built with SLURM support and is optimized for InfiniBand, hence we recommend it. Although OpenMPI is not built with SLURM support, its mpirun command is still "SLURM-aware", i.e. it knows how many processes to launch based on the resources allocated to your job.
We recommend that you do not have more than one of these MPI modules loaded at a time to avoid confusion and some frustrating errors while running.
Compiling MPI programs
After loading the relevant module, simply use the MPI wrapper in place of the compiler you would normally use, for example:
$ module load mvapich2
$ mpicc -g -O2 -o mpiprogram mpiprogram.c
Some software that uses Makefiles for compiling might require you to set environment variables like CC, CXX or FC to point to the corresponding MPI wrappers (mpicc, mpicxx, mpif90).
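For example, with a typical Makefile-based build you might override the compiler variables on the command line (the exact variable names depend on the particular Makefile):

```shell
$ module load mvapich2
$ make CC=mpicc CXX=mpicxx FC=mpif90
```
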
Running MPI programs - Interactive
As with other programs, MPI jobs can be run interactively or as batch jobs. To run interactively, first create a resource allocation using salloc. The options for salloc are very similar to those for sbatch; visit https://slurm.schedmd.com/salloc.html for more detailed information.
$ salloc -N <# nodes> -n <# MPI tasks> -p <partition> -t <minutes>
For example, to ask for 4 cores to run 4 tasks (MPI processes):
$ salloc -n 4
Note: It is not possible to run MPI programs on compute nodes without first creating a resource allocation.
Next, if using MVAPICH2, you can launch the MPI processes using srun:
$ srun ./myapp
If using OpenMPI or a version of MVAPICH2 configured without SLURM support, use
$ mpirun ./myapp
With srun, there is no need to specify the number of processes or provide a node list, as these are detected automatically from the allocation. Alternatively, you can use a subset of the allocation by specifying the number of nodes and tasks explicitly (see https://slurm.schedmd.com/srun.html):
$ srun -N <# nodes> -n <# MPI tasks> ./myapp ...
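Putting this together, a complete interactive session might look like the following (node and task counts, time, and the program name myapp are illustrative):

```shell
$ salloc -N 2 -n 8 -t 30        # allocate 8 tasks on 2 nodes for 30 minutes
$ module load mvapich2          # load the MPI implementation
$ srun ./myapp                  # run on all 8 tasks in the allocation
$ srun -N 1 -n 4 ./myapp        # or run on just a subset of the allocation
$ exit                          # release the allocation
```
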
When you are finished running MPI commands, you can release the allocation by exiting the shell:
$ exit
Also, if you only need to run a single MPI program, you can skip the salloc command and specify the resources in a single srun command:
$ srun -N <# nodes> -n <# MPI tasks> -p <partition> -t <minutes> ./my-mpi-program
This will create the allocation, run the MPI program, and release the allocation.
Running MPI programs - Batch Jobs
Here is a sample batch script to run an MPI program:
#!/bin/bash

# Request an hour of runtime:
#SBATCH -t 1:00:00

# Use 2 nodes with 8 tasks each, for 16 MPI tasks:
#SBATCH -N 2
#SBATCH --ntasks-per-node=8

# Specify a job name:
#SBATCH -J MyMPIJob

# Specify an output file
#SBATCH -o MyMPIJob-%j.out
#SBATCH -e MyMPIJob-%j.err

# Load required modules
module load mvapich2

srun ./MyMPIProgram

Use mpirun instead of srun if using OpenMPI or any other implementation not configured with SLURM support.
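To submit the batch script, save it to a file (the name here is just an example) and pass it to sbatch:

```shell
$ sbatch my_mpi_job.sh
```
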
Load the corresponding MPI module before running the program, since the program is typically linked against MPI libraries that are required at run time. If the program is not in the current working directory, provide the full path to the file instead of ./MyMPIProgram. Alternatively, add the directory containing the file to the $PATH environment variable and use srun MyMPIProgram. If the software is installed as a module on CCV, loading the module will add the corresponding programs to $PATH.
If your code has OpenMP directives for multi-threading, you can attach several cores to a single MPI task using the -c option of salloc or sbatch. The environment variable OMP_NUM_THREADS governs the number of threads that will be used.
#!/bin/bash

# Use 2 nodes with 2 tasks each (4 MPI tasks)
#SBATCH -N 2
#SBATCH --ntasks-per-node=2

# And allocate 4 CPUs to each task for multi-threading
#SBATCH -c 4

# Load required modules
module load mvapich2

export OMP_NUM_THREADS=4
srun ./MyMPIProgram

The above batch script will launch 4 MPI tasks, 2 on each node, and allocate 4 CPUs for each task (16 cores in total for the job). Setting OMP_NUM_THREADS governs the number of threads to be used, although this can also be set within the program.
- Use as many MPI processes per node as possible instead of spreading them over many different nodes. This reduces communication overhead, since latency within a node is lower.
- The performance of MPI programs depends heavily on load balancing, i.e. how uniformly the computation is distributed among the processes. If your program allows it, distribute an equal amount of work to each MPI process.
- Try to use an equal number of cores on each node.
The maximum theoretical speedup that a parallel program can achieve is governed by the proportion of the program that is sequential (Amdahl's law). Moreover, as the number of MPI processes increases, the communication overhead increases, i.e. more time is spent sending and receiving messages among the processes. Beyond a certain number of processes, this increase starts dominating the decrease in computational run time, and the overall program slows down, rather than speeds up, as processes are added.
Hence, MPI programs (or any parallel programs) do not keep running faster as the number of processes is increased beyond a certain point.
If you intend to carry out many runs of a program, the correct approach is to find the optimum number of processes, i.e. the number that yields the lowest (or a reasonably low) run time. Start with a small number of processes, like 2 or 4, and first verify the correctness of the results by comparing them with sequential runs. Then increase the number of processes gradually to find the optimum, beyond which the run time flattens out or starts increasing.
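Such a scaling study can be scripted directly inside an allocation; a sketch (myapp and the task counts are placeholders):

```shell
# inside an allocation created with, e.g., salloc -N 2 -n 32
for n in 2 4 8 16 32; do
    echo "== $n tasks =="
    time srun -n $n ./myapp    # same program, increasing process counts
done
```

Comparing the reported times across the loop shows where adding processes stops paying off.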