MPI

OpenMPI

Compile your code using mpicc. Then you can launch it with mpiexec in batch mode using a batch script like the following:

#!/bin/bash
#SBATCH --job-name=picalc # Job name
#SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=marco.miralto@szn.it # Where to send mail   
#SBATCH --ntasks=48
#SBATCH --nodes=2 #Number of nodes
#SBATCH --ntasks-per-node=24
#SBATCH --output=mpi_test_%j.out # Standard output and error log

mpiexec ./a.out

date
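
For reference, a typical compile-and-submit workflow could look like the sketch below; picalc.c and picalc.sh are placeholder names for your source file and for the batch script above:

mpicc -o a.out picalc.c   # compile with the OpenMPI wrapper compiler
sbatch picalc.sh          # submit the batch script to SLURM
squeue -u $USER           # check the state of your job in the queue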

MPICH2

We have MPICH2 support cluster-wide. MPICH2 is compiled in two different ways (see the example after this list):

  • in /opt/mpich2/ you will find the MPICH2 subtree compiled with --pm=hydra. This version uses the Hydra process manager, which is currently the default PM, and it supports SLURM natively. Please refer to this page (section "MPICH with MPIEXEC"), and have a look at this other page for an overview of the Hydra process manager.
  • in /opt/mpich2-slurm/ you will find the MPICH2 subtree compiled with --with-pmi=slurm --with-pm=none. Linking against this version allows you to use SLURM's srun to schedule your tasks. Please refer to this page.
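
The two subtrees above can also be used directly, without the module system. As a sketch, and assuming the wrapper compilers live in a bin/ directory inside each subtree (a layout this page does not guarantee), that would look like:

/opt/mpich2/bin/mpicc -o a.out picalc.c         # link against the Hydra build (bin/ path assumed)
/opt/mpich2-slurm/bin/mpicc -o a.out picalc.c   # link against the SLURM PMI build (bin/ path assumed)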

MPICH2 3.2 hydra PM

MPICH2 ships with the Hydra Process Manager. This process manager interacts natively with SLURM. You can load this environment before compiling your code to link against the correct libraries:

module load mpi/mpich2
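
After loading the module, compiling is a single wrapper call; picalc.c is a placeholder source file name:

module load mpi/mpich2
which mpicc               # optional: verify the wrapper picked up from the module
mpicc -o a.out picalc.c   # compile and link against the Hydra-enabled MPICH2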

A drawback of the native Hydra PM (as of version 3.2) is that jobs must be run interactively: you allocate resources, execute your program and then release the resources, all manually.
To launch your binary, first create an allocation with SLURM and then execute. From a bash shell, call salloc with the desired resources (e.g. number of nodes, number of tasks, number of tasks per node):
salloc -N 2

The above command asks for 2 nodes, each running one task. Once the allocation has been created, start your job:
mpiexec -localhost frontend ./a.out

The "-localhost frontend" is mandatory to have MPI processes talking to each other. The "a.out" is your program binary. Once your job has completed, you should release resources: a call to exit from the command line should work. Please, refer carefully to the salloc documentation here

A better syntax is the following one-liner, which shows some additional options too:

salloc --ntasks=48 --ntasks-per-node=24 --nodes=2 mpiexec -localhost frontend ./a.out

This last syntax will release resources once your task has completed.

A full transcript of a real submission from the bash shell:

mmiralto@kraken:~/pi-calc-mpi$ salloc -N 5 
salloc: Granted job allocation 415
mmiralto@kraken:~/pi-calc-mpi$ mpiexec -localhost frontend ./a.out
This is my sum: 0.6283185316069759 from rank: 0 name: node4
This is my sum: 0.6283185311622378 from rank: 1 name: node5
This is my sum: 0.6283185302735453 from rank: 3 name: node7
This is my sum: 0.6283185298290889 from rank: 4 name: node9
This is my sum: 0.6283185307178860 from rank: 2 name: node6
Pi is approximately 3.1415926535897341, Error is 0.0000000000000591
Time of calculating PI is: 2.461559
mmiralto@kraken:~/pi-calc-mpi$ exit
salloc: Relinquishing job allocation 415
salloc: Job allocation 415 has been revoked.

MPICH2 with slurm PM

The Hydra PM documented above can be disabled and SLURM itself used as the process manager. This allows you to run your jobs with srun, but it can introduce some overhead in inter-process communication and thus reduce performance.
To compile your code against these libraries, issue the following command in your shell:

module load mpi/mpich2-slurm
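
A compile step might then look like the following sketch; picalc.c is again a placeholder, and the ldd check is only a sanity check that a PMI library was linked in (the exact library name depends on your installation):

module load mpi/mpich2-slurm
mpicc -o a.out picalc.c     # compile and link against the SLURM PMI build
ldd ./a.out | grep -i pmi   # optional: check the PMI linkage (library name may differ)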

Once you have compiled your code with mpicc, you can write a SLURM batch script to execute it:
#!/bin/bash
#SBATCH --job-name=picalc # Job name
#SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=marco.miralto@szn.it # Where to send mail    
#SBATCH --ntasks=48
#SBATCH --nodes=2 #Number of nodes
#SBATCH --ntasks-per-node=24
#SBATCH --output=mpi_test_%j.out # Standard output and error log

# output some generic information
pwd; hostname; date

echo "Running example mpich2 binary. Using $SLURM_JOB_NUM_NODES nodes with $SLURM_NTASKS  tasks, each with $SLURM_CPUS_PER_TASK cores." 
# eventually purge your environment and load the one(s) you need
#module purge; module load mpi/mpich2-slurm
module load mpi/mpich2-slurm

# now launch your mpi app
srun ./a.out
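
To submit the script (saved, for example, as picalc_srun.sh, a placeholder name) and follow its output once the job starts:

sbatch picalc_srun.sh          # submit to the queue
squeue -u $USER                # watch the job state
tail -f mpi_test_<jobid>.out   # follow the output file defined by --output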