Part II. USING MPI
 
3. How to design a parallel computation using MPI?
 
 

* A note about mid-term exam 1:
 

Exam question:
 

Write a C program with MPI for matrix-matrix multiplication A*B. Print the resulting matrix on processor 0 (the source computer), and print its first row on processor 1, its second row on processor 2, its third row on processor 3, etc. Here A and B are N by N matrices.
 

Common Problems:
 

 

How to design a parallel computation using MPI?
 
 

   1. Load and initialize MPI:

 
Load MPI:
 
#include "mpi.h"

Initialize MPI:
 
MPI_Init(&argc,&argv);

Determine the size of the cluster (the number of processes):
 
MPI_Comm_size(MPI_COMM_WORLD,&numtasks);

Assign an ID (rank) to each process:

MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
 
 
   2. Distribute the job to multiple processors:
 
Use of if statement:
 
   if (taskid == MASTER)
   {
      averow = NRA/numworkers;     /* base number of rows per worker */
      extra = NRA%numworkers;      /* leftover rows */
      offset = 0;
      mtype = FROM_MASTER;
      for (dest=1; dest<=numworkers; dest++)
      {
         rows = (dest <= extra) ? averow+1 : averow;
         printf("   sending %d rows to task %d\n",rows,dest);
         MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype,
                   MPI_COMM_WORLD);
         MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype,
                           MPI_COMM_WORLD);
         offset = offset + rows;
      }
   }
 
 if (taskid > MASTER)
   {
      mtype = FROM_MASTER;
      MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD,
                          &status);
      MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD,
                          &status);
      MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype,
                          MPI_COMM_WORLD, &status);
      MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype,
                          MPI_COMM_WORLD, &status);

      for (k=0; k<NCB; k++)
         for (i=0; i<rows; i++)
         {
            c[i][k] = 0.0;
            for (j=0; j<NCA; j++)
               c[i][k] = c[i][k] + a[i][j] * b[j][k];
         }
      mtype = FROM_WORKER;
      MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype,
                          MPI_COMM_WORLD);
   }
   MPI_Finalize();
}

 

   3. Return the result to the source computer (processor 0) and post-process the result:
 
 

Use of if statement:
 
   if (taskid == MASTER)
   {
      mtype = FROM_WORKER;
      for (i=1; i<=numworkers; i++)
      {
         source = i;
         MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
         MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
         MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype,
                            MPI_COMM_WORLD, &status);
      }
   }
 
 
 if (taskid > MASTER)
   {
 
      mtype = FROM_WORKER;
      MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype,
                          MPI_COMM_WORLD);
   }
   MPI_Finalize();
}
 

   4. Output the results:
 

    To print the same results on every processor:
 

      printf("Here is the result matrix\n");
      for (i=0; i<NRA; i++)
      {
         printf("\n");
         for (j=0; j<NCB; j++)
            printf("%6.2f   ", c[i][j]);
      }
      printf ("\n");
 

    To print different results on different processors:
 

           Use  of if statement:
 

   if (taskid == MASTER)
   {
 
printf("Here is the result matrix\n");
      for (i=0; i<NRA; i++)
      {
         printf("\n");
         for (j=0; j<NCB; j++)
            printf("%6.2f   ", c[i][j]);
      }
      printf ("\n");
    }
 

   if (taskid > MASTER)
   {

      /* task taskid prints the taskid-th row of c */
   }
 
 
 
 
 

Sample program (as an example answer to the exam question).
 
 
 
 
Preview of Collective Communications in MPI. 

 Three classes of collective operations:
   
Synchronization:
MPI_Barrier(MPI_Comm comm)
 
This function blocks until all processes (in comm) have reached this routine (i.e., have called it).
 
 

Data movement:

[Figure: schematic representation of collective data movement in MPI]


 
 
 
 

Collective computations:
 

[Figure: schematic representation of collective computations in MPI]


 
 

MPI Collective Routines
 
 

MPI_Allgather

MPI_Allgatherv

MPI_Allreduce

MPI_Alltoall

MPI_Alltoallv

MPI_Bcast

MPI_Gather

MPI_Gatherv

MPI_Reduce

MPI_Reduce_scatter

MPI_Scan

MPI_Scatter

MPI_Scatterv

The All- versions (e.g., MPI_Allgather, MPI_Allreduce) deliver results to all participating processes.

The -v versions (e.g., MPI_Gatherv, MPI_Scatterv) allow the chunks to have different sizes.

Allreduce, Reduce, Reduce_scatter, and Scan take both built-in and user-defined combination functions.
 
 

Homework 4.