Chained Matrix Multiplication
Suppose we want to multiply two matrices that do not have the same number of rows and columns. We can multiply two matrices A1 and A2 only if the number of columns of A1 equals the number of rows of A2. Example: we want to multiply a 2 x 3 matrix by a 3 x 4 matrix. The result is a 2 x 4 matrix, with 4 entries in the top row and 4 in the bottom row. Each entry is the result of 3 multiplications, so the total number of multiplications is 2*3*4 = 24.
We are given a sequence (chain) A1, A2, ..., An of n matrices, and we wish to find the product. The way we parenthesize a chain of matrices can have a dramatic impact on the cost of evaluating the product. The problem is to determine the best way to parenthesize the matrices so as to minimize the number of multiplications.
Example:
A1: 5 x 3, A2: 3 x 4, A3: 4 x 6, A4: 6 x 5. The problem: what is the best order in which to multiply them?
(A1((A2A3)A4)) takes 237 multiplications
(A1(A2(A3A4))) takes 255 multiplications
((A1A2)(A3A4)) takes 280 multiplications
(((A1A2)A3)A4) takes 330 multiplications
((A1(A2A3))A4) takes 312 multiplications
In the case of four matrices there are only five ways to order the multiplications, but with n matrices the number of ways to parenthesize them grows exponentially (on the order of 4^n / n^(3/2)), so we do not want to look at all the possibilities. Dividing the problem into subproblems: we use the principle of optimality, which is said to apply if an optimal solution to an instance of a problem always contains optimal solutions to all of its subinstances. If A1((((A2A3)A4)A5)A6) is the optimal order, then we know that (A2A3)A4 is the optimal order for multiplying A2, A3, A4.
Suppose we have the matrix chain A1 .. An. We divide this into the subproblems A1..Ak and Ak+1..An. The difficulty is that we do not know what k should be, so we find k by looking at the optimal solutions of each of the subproblems. This means trying every value of k.
Data Structure
The 2-D array called N:
1. N[i][j] holds the minimum number of multiplications needed to multiply Ai through Aj.
2. N[i][i] is of course zero, since it is a chain of length one.
3. N[i][j] = min{ N[i][k] + N[k+1][j] + d[i-1]*d[k]*d[j] } where i <= k < j, and matrix Ai has dimensions d[i-1] x d[i].
Another Example
A1: 30 x 35, A2: 35 x 15, A3: 15 x 5, A4: 5 x 10, A5: 10 x 20, A6: 20 x 25

The completed table of N values (computed diagonal by diagonal; entries below the main diagonal are unused):

        j=1       2       3       4       5       6
i=1       0  15,750   7,875   9,375  11,875  15,125
i=2               0   2,625   4,375   7,125  10,500
i=3                       0     750   2,500   5,375
i=4                               0   1,000   3,500
i=5                                       0   5,000
i=6                                               0
Sequential Code
int minMult(int n, int d[], int P[][]) {
    int i, j, k, dia;
    int M[][] = new int[1..n][1..n];        // pseudocode: 1-based 2-D array
    for (i = 1; i <= n; i++)
        M[i][i] = 0;                        // initialize: chains of length one
    for (dia = 1; dia <= n-1; dia++)        // fill one diagonal at a time
        for (i = 1; i <= n-dia; i++) {
            j = i + dia;
            // minimum over all split points k with i <= k <= j-1
            M[i][j] = min over i <= k <= j-1 of ( M[i][k] + M[k+1][j] + d[i-1]*d[k]*d[j] );
            P[i][j] = the k that achieved the minimum;
        }
    return M[1][n];
}
The table of N values for this example (entries below the main diagonal are unused):

        j=1       2       3       4       5       6
i=1       0  15,750   7,875   9,375  11,875  15,125
i=2               0   2,625   4,375   7,125  10,500
i=3                       0     750   2,500   5,375
i=4                               0   1,000   3,500
i=5                                       0   5,000
i=6                                               0
To calculate diagonal 1 we need no data. To calculate diagonal 2 we need the diagonal 1 elements, and so on. So we implement each diagonal calculation in one step, or one processor: step (or processor) 2 needs data from step 1; step (or processor) 3 needs data from step 1 and step 2.
Pipeline Design
We know that the pipeline approach can provide increased speed under the following three types of computation:
1. If more than one instance of the complete problem is to be executed.
2. If a series of data items must be processed, each requiring multiple operations.
3. If information to start the next process can be passed forward before the process has completed all its internal operations.
Looking at the previous table, we can see that step 2 can start after step 1 has calculated its first two elements. In general, each step can start its calculation after the previous step has generated its first two elements.
[Figure: pipeline of processes P1 -> P2 -> P3, one per diagonal; (n-1) messages pass between P1 and P2, and (n-1) + (n-2) messages reach P3.]
In this implementation we divide the problem into n-1 steps (when we have n matrices). In each step we calculate the elements of that step's diagonal; that is, we calculate one diagonal per step. We have one server and n (n > 0) clients. All clients and the server know the step number. Clients request a job from the server, and the server sends a job to that client.
[Figure: centralized work pool — the server holds the work pool and exchanges job/result messages with Client 1, Client 2, ..., Client n.]
Dreams
Suppose that MPI were event driven. What would happen? We could implement our program very simply and efficiently. The volume of message passing would be very low, because we could use on-demand requests: we would request data from any processor only when we need it.
If we go back and look at the pipeline implementation of chained matrix multiplication, we can see that the number of messages passed between processes is very high, and some of them are not necessary. Now suppose that MPI is event driven and write the pipeline program. In this implementation each process calculates one diagonal (process P calculates diagonal P).
void main() {
    // process number P, which should calculate diagonal P
    for (i = 1; i <= n-P; i++) {
        j = P + i;
        N[i][j] = min over i <= k < j of ( GetNij(i,k) + GetNij(k+1,j) + d[i-1]*d[k]*d[j] );
    }
}
int GetNij(int i, int j) {
    if (this process does not have N[i][j]) {
        To = j - i;                  // the process that owns diagonal j-i
        send(request, i, j, To);
        recv(data);
        N[i][j] = data;
    }
    return N[i][j];
}
bool MPIEVENTS(MPIEventType Event) {
    bool Handled;
    switch (Event) {
        case recv:
            MPI_Recv(message, From);
            if (requestdata)
                MPI_Send(data, From);
            Handled = true;
            break;
        default:
            Handled = false;
            break;
    }
    return Handled;
}
Now we write code for real MPI. We want to implement the centralized work pool. In this implementation we have four functions:
1. void Server();
2. void Client();
3. Calculate(i, j, Value[]);
4. map(i, j, im, jm);
#include "stdafx.h"
#include "iostream.h"
#include "stdlib.h"
#include "stdio.h"
#include <mpi.h>

#define n 6          //4
#define request 0
#define value 1
#define CONTINUE 1
#define STOP 0
#define infinit 99999

void Server(MPI_Comm comm, int processors);
void Client(int my_rank, MPI_Comm comm);
int Calculate(int I, int J, int Nv[n*2]);
int map(int i, int j, int I, int J, int Nv[n*2]);
void fill_Nv(int x, int y, int *Nv);

int N[n+1][n+1] = {0};
int d[n+1] = {30,35,15,5,10,20,25};   //{5,3,4,6,5};  //{5,2,3,4,6,7,8};
int main(int argc, char* argv[]) {
    int my_rank;
    int processors;
    MPI_Comm io_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &processors);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_dup(MPI_COMM_WORLD, &io_comm);

    if (processors < 2) {
        if (my_rank == 0)
            cout << "the number of processes should be greater than 1" << endl;
        MPI_Finalize();
        exit(0);
    }

    MPI_Bcast(d, n+1, MPI_INT, 0, MPI_COMM_WORLD);

    if (my_rank == 0)                  // rank 0 is the server ...
        Server(io_comm, processors);
    else                               // ... all other ranks are clients
        Client(my_rank, io_comm);

    MPI_Finalize();
    return 0;
}
void Server(MPI_Comm comm, int processors) {
    int Step = 1;
    int x[3];
    int To = 0;
    int Nv[n*2];
    int Count = 0;
    MPI_Status st;

    while (Step < n) {                          // one pass per diagonal
        for (int i = 0; i < n-Step; i++) {      // one job per element of the diagonal
            MPI_Recv(x, 3, MPI_INT, MPI_ANY_SOURCE, request, comm, &st);
            To = x[0];                          // the requesting client's rank
            x[0] = Step;
            x[1] = i + 1;                       // job: compute N[i+1][Step+1+i]
            x[2] = Step + 1 + i;
            fill_Nv(x[1], x[2], Nv);            // pack the N entries the client needs
            MPI_Send(x, 3, MPI_INT, To, 0, comm);
            Count = (Step-1) * 2;
            MPI_Send(Nv, Count, MPI_INT, To, 1, comm);
            MPI_Recv(x, 3, MPI_INT, MPI_ANY_SOURCE, value, comm, &st);
            N[x[0]][x[1]] = x[2];               // store the computed N[i][j]
        }
        Step++;
    }
    for (int i = 1; i < processors; i++) {      // tell every client to stop
        MPI_Recv(x, 3, MPI_INT, MPI_ANY_SOURCE, request, comm, &st);
        To = x[0];
        x[0] = STOP;
        MPI_Send(x, 3, MPI_INT, To, 0, comm);
    }

    cout << " N is :" << endl << endl;
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= n; j++)
            cout << N[i][j] << " ";
        cout << endl;
    }
    cout << endl << endl << "minimum multiplication count is : " << N[1][n] << endl;
}
void Client(int my_rank, MPI_Comm comm) {
    int x[3];
    int I, J, Step, Count;
    int Val;
    int Nv[n*2] = {0};
    MPI_Status st;

    while (true) {
        x[0] = my_rank;
        MPI_Send(x, 3, MPI_INT, 0, request, comm);      // ask the server for a job
        MPI_Recv(x, 3, MPI_INT, 0, 0, comm, &st);
        if (x[0] == STOP)
            break;
        Step = x[0];
        I = x[1];
        J = x[2];
        Count = (Step-1) * 2;
        MPI_Recv(Nv, Count, MPI_INT, 0, 1, comm, &st);  // the packed N entries
        Val = Calculate(I, J, Nv);
        x[0] = I;  x[1] = J;  x[2] = Val;
        MPI_Send(x, 3, MPI_INT, 0, value, comm);        // return the result
    }
}
int Calculate(int I, int J, int Nv[n*2]) {
    int minval = infinit;
    int val;

    for (int k = I; k < J; k++) {                       // try every split point
        val = map(I, k, I, J, Nv) + map(k+1, J, I, J, Nv) + d[I-1]*d[k]*d[J];
        if (minval > val)
            minval = val;
    }
    return minval;
}
int map(int i, int j, int I, int J, int Nv[n*2]) {
    if (i == j)
        return 0;                                   // chain of length one
    if (j == J)                                     // column part N[i][J], packed after the row part
        return Nv[(J - I - 1) + (i - I - 1)];
    else                                            // row part N[I][j], packed first with j descending
        return Nv[(J - j) - 1];
}
void fill_Nv(int x, int y, int *Nv) {
    int i = x + 1;
    int j = y - 1;
    int k = 0;

    while (j > x) {            // row part: N[x][y-1], N[x][y-2], ..., N[x][x+1]
        Nv[k] = N[x][j];
        k++;
        j--;
    }
    while (i < y) {            // column part: N[x+1][y], N[x+2][y], ..., N[y-1][y]
        Nv[k] = N[i][y];
        k++;
        i++;
    }
}