8 Week Report
ABSTRACT:
This final report discusses the problems that were implemented and analysed
during the internship.
INTRODUCTION:
Recently, shared memory multiprocessors have been used to implement a wide range of
high performance applications. The use of multithreading to program such applications
is becoming popular, and POSIX threads or Pthreads are now a standard supported on
most platforms.
Pthreads may be implemented either at the kernel level, or as a user-level threads
library. Kernel-level implementations require a system call to execute most thread
operations, and the threads are scheduled by the kernel scheduler. This approach
provides a single, uniform thread model with access to system-wide resources, at the
cost of making thread operations fairly expensive. In contrast, most operations on user-
level threads, including creation, synchronization and context switching, can be
implemented in user space without kernel intervention, making them significantly
cheaper than corresponding operations on kernel threads. Thus parallel programs can
be written with a large number of lightweight, user-level Pthreads, leaving the
scheduling and load balancing to the threads implementation, and resulting in simple,
well structured, and architecture-independent code. A user-level implementation also
provides the flexibility to choose a scheduler that best suits the application,
independent of the kernel scheduler.
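Every multithreaded listing in this report follows the same create/join pattern. As a minimal illustration of that Pthreads usage (not taken from the internship code itself):
#include <pthread.h>
#include <stdio.h>
/* Each thread receives its own index through the void* argument. */
void *worker(void *arg){
int id=*((int *)arg);
printf("hello from thread %d\n",id);
return NULL;
}
int main(void){
pthread_t tid[4];
int idx[4];
for(int i=0;i<4;i++){
idx[i]=i;                         /* give each thread a private index */
pthread_create(&tid[i],NULL,worker,&idx[i]);
}
for(int i=0;i<4;i++)
pthread_join(tid[i],NULL);        /* wait for all threads to finish */
return 0;
}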
PROBLEMS:
• Producer-Consumer problem using semaphores: Two kinds of processes, producers
and consumers, share a common fixed-size buffer; producers generate items and
place them into the buffer, while consumers remove them. The solution for the
producer is to either go to sleep or discard data if the buffer is full. The
next time the consumer removes an item from the buffer, it notifies the
producer, who starts to fill the buffer again. In the same way, the consumer
can go to sleep if it finds the buffer to be empty. The next time the producer
puts data into the buffer, it wakes up the sleeping consumer. The solution can
be reached by means of inter-process communication, typically using semaphores.
An inadequate solution could result in a deadlock where both processes are
waiting to be awakened. The problem can also be generalized to have multiple
producers and consumers.
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <pthread.h>
#include <semaphore.h>
#include <time.h>
#include <unistd.h>
sem_t mut;                        /* binary semaphore guarding the buffer */
sem_t empty;                      /* counts empty buffer slots */
sem_t full;                       /* counts filled buffer slots */
int buffer[200];
pthread_t ptd[20],ctd[20];
int cap,cntp,cntc,up,uc,bi=0,bo=0; /* capacity; items per producer/consumer; sleep times; in/out indices */
/* NOTE: the producer() and consumer() routines were truncated in the original
   listing; they are reconstructed here so that the program compiles. */
void *producer(void *arg){
int i,item;unsigned long ID=(unsigned long)pthread_self();
for(i=1;i<=cntp;i++){
item=rand()%100;                  /* produce an item */
sleep(up);
sem_wait(&empty);                 /* wait for an empty slot */
sem_wait(&mut);                   /* lock the buffer */
buffer[bi]=item;
time_t dur=time(NULL);
printf("%dth item: %d produced by thread %lu at %s into buffer location %d\n",
i,item,ID,asctime(gmtime(&dur)),bi);
bi=(bi+1)%cap;
sem_post(&mut);
sem_post(&full);
}
return NULL;}
void *consumer(void *arg){
int i,item;unsigned long ID=(unsigned long)pthread_self();
for(i=1;i<=cntc;i++){
sleep(uc);
sem_wait(&full);                  /* wait for a filled slot */
sem_wait(&mut);
item=buffer[bo];                  /* consume an item */
time_t dur=time(NULL);
printf("%dth item: %d consumed by thread %lu at %s from buffer location %d\n",
i,item,ID,asctime(gmtime(&dur)),bo);
bo=(bo+1)%cap;
sem_post(&mut);
sem_post(&empty);
}
return NULL;}
int main(){
int i,np,nc;
printf("Enter capacity,np,nc,cntp,cntc,up,uc: ");
scanf("%d %d %d %d %d %d %d",&cap,&np,&nc,&cntp,&cntc,&up,&uc);
sem_init(&mut,0,1);sem_init(&empty,0,cap);sem_init(&full,0,0);
for(i=1;i<=np;i++)
pthread_create(&ptd[i],NULL,producer,NULL);
for(i=1;i<=nc;i++)
pthread_create(&ctd[i],NULL,consumer,NULL);
for(i=1;i<=np;i++)
pthread_join(ptd[i],NULL);
for(i=1;i<=nc;i++)
pthread_join(ctd[i],NULL);
exit(0);}
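The multithreaded listings in this report use the Pthreads and POSIX semaphore APIs and are therefore compiled with the pthread library linked in (e.g. gcc prog.c -lpthread, or g++ prog.cpp -lpthread for the C++ listings).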
• Solving a dense system of linear algebraic equations (AX = B) using a
∗semiring based algorithm: the unknown vector X is obtained by computing the
closure of an augmented 2N×(N+1) matrix through the updates
X[i][j] += X[i][t]·X[t][j]/(1 − X[t][t]), and the solution is read off the
last column of the closure (see reference [1]).
Sequential code:
#include <bits/stdc++.h>
#include <pthread.h>
#include <time.h>
#include <semaphore.h>
#define n 1000
#define ll int
using namespace std;
float X[2*n][n+1];
void soln(){                      /* n update steps over the augmented matrix */
for(ll t=0;t<n;t++){
float temp=(float)(1/(1-X[t][t]));
for(ll i=t+1;i<=n+t;i++)
for(ll j=t+1;j<=n;j++)
X[i][j]+=(X[i][t] * X[t][j] * temp);
}}
int main(){
clock_t tStart = clock();
ll a[n][n],b[n][1];
int k=0,j=0,i;
for(ll i=0;i<n*n;i++){
a[j][k]=rand()%40;
k++;
if(k==n){k=0;j++;}}
for(i=0;i<n;i++){
b[i][0]=rand()%50;}
for(i=0;i<2*n;i++){
for(j=0;j<=n;j++){
if(j<n && i<n){
if(i==j){
X[i][j]=1-a[i][j];}
else X[i][j]=(-a[i][j]);}
else if(j==n && i<n)X[i][j]=b[i][0];
else{
if((i-j)==n && i>=n)X[i][j]=1;
else X[i][j]=0;
}}}
soln();
cout<<"\nValues of x are as follows: \n";
for(i=0;i<n;i++)
cout<<"x"<<i+1<<"= "<<X[i+n][n]<<"\n";
printf("\nTime taken: %.6fs\n",(clock() - tStart)/CLOCKS_PER_SEC);
exit(0);}
Parallelized code (for a system of 100 algebraic equations in 100 unknowns, using 8 threads):
#include<bits/stdc++.h>
#include <pthread.h>
#include <semaphore.h>
using namespace std;
#define N 100
#define M 8
int a[N][N],b[N][1];              /* b is the right hand side vector */
double x[2*N][N+1];
pthread_t mat[N];
int k=0,n,y;
/* Each call performs one update step; the thread that observes y == n-1
   finishes all the remaining steps. */
void *matrix(void *arg)
{ if(y < (n-1)){
double temp=1/(1-x[k][k]);
for(int i=k+1;i<=N+k;i++)
{ for(int j=k+1;j<=N;j++)
{x[i][j]=(double)(x[i][j]+(x[i][k]*x[k][j]*temp));
} }
++k;}
else
{ for(int l=k;l<=(N-1);l++){
double temp=1/(1-x[l][l]);
for(int i=l+1;i<=N+l;i++)
{ for(int j=l+1;j<=N;j++)
{x[i][j]=(double)(x[i][j]+(x[i][l]*x[l][j]*temp)); }
}}}
return NULL;
}
int main()
{ clock_t tStart = clock();
int j=0,t=0;                      /* do not shadow the global thread count n */
cout<<"Enter number of threads :";
cin>>n;
for(int i=0;i<10000;i++)
{ if(t==99)
{ a[j][t]=rand()%40;
t=0;j++;}
else{
a[j][t]=rand()%40;
t++;}
}
for(int i=0;i<100;i++)
b[i][0]=rand()%40;
for(int i=0;i<N;i++)
{for(int j=0;j<N;j++)
{ if(i==j)
x[i][j]=(double)1-a[i][j];
else
x[i][j]=(double)-a[i][j];
} }
for(int i=0;i<N;i++)
x[i][N]=(double)b[i][0];
for(int i=N;i<=(2*N-1);i++)
{ for(int j=0;j<=N;j++)
{
if(i-N==j)
x[i][j]=(double)1;
else
x[i][j]=(double)0;
}}
void *exit_status;
for (y=0;y<n;y++){                /* threads read the shared counter y */
pthread_create(&mat[y],NULL,matrix,&y);
}
for(int i=0;i<n;i++)
pthread_join(mat[i],&exit_status);
for(int i=N;i<=(2*N-1);i++)
cout<<"x["<<i-N<<"]="<<x[i][N]<<endl;
printf("\nTime taken: %.6fs\n"(clock() - tStart)/CLOCKS_PER_SEC);
return 0;
}
For matrix A (in AX = B) of order 100, the running time of the sequential code
was 3.39 seconds, while that of the parallelized code (running on 8 threads)
was 0.028836 seconds. The programs were further tested on a matrix A of order
1000, for which the parallelized program executed successfully in 7.11 seconds
while the sequential program terminated abnormally (most likely a stack
overflow, since it allocates the large arrays a and b as automatics on the
stack).
• Solving dense systems of linear algebraic equations on a multi-processor
system using Successive Gauss Elimination (SGE): We have to determine the
unknown solution column vector X in the linear algebraic equation AX = B
(where A is a real nonsingular matrix of order N and B is a known right hand
side vector). The corresponding research paper [2] is listed in the
references section.
Being a variant of Gaussian Elimination (GE), the dependencies of all the
unknowns are reduced to half at every stage and finally to zero in log2 N
stages (the N linearly independent equations at Stage 1 are replaced by two
sets of N/2 linearly independent equations at Stage 2, and so on), which is
accomplished by using the concept of forward (left to right) and backward
(right to left) eliminations. Unlike the conventional GE algorithm, the SGE
algorithm does not have a separate back substitution phase, which requires
O(N) steps using O(N) processors or O(log2 N) steps using O(N^3) processors
for solving a system of linear algebraic equations. It replaces the back
substitution phase by a single division step (hence constant time) and
possesses numerical stability through partial pivoting.
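As a small self-contained illustration (not part of the internship code), the sketch below runs one SGE stage on an assumed N = 4 system: forward elimination of the first N/2 pivots leaves the last two equations depending only on x3 and x4, while backward elimination of the last N/2 pivots leaves the first two depending only on x1 and x2, i.e. two independent systems of order N/2.
#include <cstdio>
int main(){
const int N = 4;
/* an arbitrary diagonally dominant augmented system [A|B] */
double F[N][N+1] = {{2,1,1,1,10},{1,3,1,1,12},{1,1,4,1,15},{1,1,1,5,18}};
double B[N][N+1];
for(int i=0;i<N;i++)
for(int j=0;j<=N;j++) B[i][j]=F[i][j];
/* forward (left to right): eliminate columns 0..N/2-1 below their pivots */
for(int k=0;k<N/2;k++)
for(int i=k+1;i<N;i++){
double f=F[i][k]/F[k][k];
for(int j=k;j<=N;j++) F[i][j]-=f*F[k][j];
}
/* backward (right to left): eliminate columns N-1..N/2 above their pivots */
for(int k=N-1;k>=N/2;k--)
for(int i=k-1;i>=0;i--){
double f=B[i][k]/B[k][k];
for(int j=0;j<=N;j++) B[i][j]-=f*B[k][j];
}
printf("rows 2..3 after forward elimination (depend on x3,x4 only):\n");
for(int i=N/2;i<N;i++){
for(int j=0;j<=N;j++) printf("%8.4f ",F[i][j]);
printf("\n");
}
printf("rows 0..1 after backward elimination (depend on x1,x2 only):\n");
for(int i=0;i<N/2;i++){
for(int j=0;j<=N;j++) printf("%8.4f ",B[i][j]);
printf("\n");
}
return 0;
}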
The problem was implemented both recursively and iteratively, each in a
sequential as well as a parallelized manner.
Recursive solution:
• Sequential code
#include<bits/stdc++.h>
#include <pthread.h>
#include <time.h>
#include <semaphore.h>
#include<stdlib.h>
#include <sys/types.h>
#define ll long long int
#define pb push_back
using namespace std;
vector<float> ans;
/* NOTE: the opening of SGE() was truncated in the original listing; the
   forward/backward eliminations below are reconstructed from the iterative
   version given later in this report. a holds the augmented n x m system. */
void SGE(vector<vector<float> > a,int n,int m){
int i,j,k,q;
if(n>1){
vector<vector<float> > a1=a,a2=a,a3,a4;
/* forward (left to right) elimination of the first n/2 pivots in a1 */
for(k=0;k<n/2;k++){
float max=a1[k][k];q=k;
for(i=k;i<n;i++)
if(a1[i][k]>=max){max=a1[i][k];q=i;}
swap(a1[q],a1[k]);                /* partial pivoting */
for(q=k+1;q<n;q++)
a1[q][k]=(a1[q][k]/a1[k][k]);
for(j=k+1;j<n;j++)
for(i=k+1;i<n;i++)
a1[i][j]-=((a1[i][k])*(a1[k][j]));
for(i=k+1;i<n;i++)
a1[i][m-1]-=((a1[i][k])*(a1[k][m-1]));
}
/* backward (right to left) elimination of the last n/2 pivots in a2 */
vector<vector<float> > &b=a2;
for(k=n-1;k>=n/2;k--){
float max=b[k][k];q=k;
for(i=k;i>=0;i--)
if(b[i][k]>=max){max=b[i][k];q=i;}
swap(b[q],b[k]);
for(q=k-1;q>=0;q--)
b[q][k]=(b[q][k]/b[k][k]);
for(j=k-1;j>=0;j--)
for(i=k-1;i>=0;i--)
b[i][j]-=((b[i][k])*(b[k][j]));
for(i=k-1;i>=0;i--)
b[i][m-1]-=((b[i][k])*(b[k][m-1]));
}
for(int i=0;i<n/2;i++){vector<float> v;
for(int j=0;j<n/2;j++)
{v.pb(a2[i][j]);}v.pb(a2[i][m-1]);
a4.pb(v);}
for(int i=n/2;i<n;i++){vector<float> v;
for(int j=n/2;j<n;j++){
v.pb(a1[i][j]);}v.pb(a1[i][m-1]);
a3.pb(v);}
SGE(a3,n/2,n/2+1);
SGE(a4,n/2,n/2+1);
}
else ans.pb(a[0][1]/a[0][0]);
}
int main(){
clock_t tStart = clock();
int i,j,r=1000,c=1001;
vector<vector<float> >a;
for(i=0;i<r;i++){
vector<float> v;
for(j=0;j<c;j++){
v.pb(rand()%25);
}
a.pb(v);
}
SGE(a,r,c);
for(i=0;i<ans.size();i++)
{printf("x[%d] = %f\n",i+1,ans[i]);}
printf("\nTime taken: %.6fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
exit(0);}
• Parallelized code (using 2 threads)
#include<bits/stdc++.h>
#include <pthread.h>
#include <time.h>
#include <semaphore.h>
#include<stdlib.h>
#include <sys/types.h>
#define ll long long int
#define pb push_back
using namespace std;
sem_t mut;
pthread_t ptd[10];
int i,j,r=1000,c=1001;
vector<float> ans;
vector<vector<float> >a;
/* NOTE: the opening of this listing was truncated in the original; SGE() is
   reconstructed here exactly as in the sequential version, and the thread
   routine soln() started from main() appears after it. */
void SGE(vector<vector<float> > a,int n,int m){
int i,j,k,q;
if(n>1){
vector<vector<float> > a1=a,a2=a,a3,a4;
/* forward (left to right) elimination of the first n/2 pivots in a1 */
for(k=0;k<n/2;k++){
float max=a1[k][k];q=k;
for(i=k;i<n;i++)
if(a1[i][k]>=max){max=a1[i][k];q=i;}
swap(a1[q],a1[k]);                /* partial pivoting */
for(q=k+1;q<n;q++)
a1[q][k]=(a1[q][k]/a1[k][k]);
for(j=k+1;j<n;j++)
for(i=k+1;i<n;i++)
a1[i][j]-=((a1[i][k])*(a1[k][j]));
for(i=k+1;i<n;i++)
a1[i][m-1]-=((a1[i][k])*(a1[k][m-1]));
}
/* backward (right to left) elimination of the last n/2 pivots in a2 */
vector<vector<float> > &b=a2;
for(k=n-1;k>=n/2;k--){
float max=b[k][k];q=k;
for(i=k;i>=0;i--)
if(b[i][k]>=max){max=b[i][k];q=i;}
swap(b[q],b[k]);
for(q=k-1;q>=0;q--)
b[q][k]=(b[q][k]/b[k][k]);
for(j=k-1;j>=0;j--)
for(i=k-1;i>=0;i--)
b[i][j]-=((b[i][k])*(b[k][j]));
for(i=k-1;i>=0;i--)
b[i][m-1]-=((b[i][k])*(b[k][m-1]));
}
for(int i=0;i<n/2;i++){vector<float> v;
for(int j=0;j<n/2;j++)
{v.pb(a2[i][j]);}v.pb(a2[i][m-1]);
a4.pb(v);}
for(int i=n/2;i<n;i++){vector<float> v;
for(int j=n/2;j<n;j++){
v.pb(a1[i][j]);}v.pb(a1[i][m-1]);
a3.pb(v);}
SGE(a3,n/2,n/2+1);
SGE(a4,n/2,n/2+1);
}
else ans.pb(a[0][1]/a[0][0]);
}
/* Thread routine started from main(); it runs SGE on the full system. Note
   that, as listed, only one worker thread is created, which is consistent
   with the negligible speedup reported below. */
void *soln(void *arg){
SGE(a,r,c);
return NULL;
}
int main(){
clock_t tStart = clock();
for(i=0;i<r;i++){
vector<float> v;
for(j=0;j<c;j++){
v.pb(rand()%25);
}
a.pb(v);
}
sem_init(&mut,0,1);
pthread_create(&ptd[0],NULL,soln,NULL);
pthread_join(ptd[0],NULL);
for(i=0;i<ans.size();i++)
{printf("x[%d] = %f\n",i+1,ans[i]);}
printf("\nTime taken: %.6fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
exit(0);}
In the recursive solution, the performance gain was quite small (for a matrix
of order 1000, the sequential program executed in 40.186076 seconds and the
parallelized solution in 39.814362 seconds). The similarity in performance can
be seen in the following line plots, drawn from experimental runs of the two
programs.
Due to the inefficiency of the recursive solution, an iterative solution to
the above method was attempted.
Iterative solution:
• Sequential code
#include<bits/stdc++.h>
#include <time.h>
#include<stdlib.h>
#include <sys/types.h>
#define ll long long int
#define pb push_back
using namespace std;
int main(){
clock_t tStart = clock();
ll i,j,k,q,r=200,c=201;
vector<vector<float> >A,B,C;
for(i=0;i<r;i++){
vector<float> v;
for(j=0;j<c;j++){
v.pb(rand()%25);
}
A.pb(v);
}
ll l=0,n=r,m=n,t;
while(m>1){
t=0;
B.clear();C.clear();              /* rebuild the working copies of A each stage */
for(i=0;i<r;i++){
vector<float> v;
for(j=0;j<c;j++){
v.pb(A[i][j]);}B.pb(v);C.pb(v);}
for(t=0;t<(n/m);t++){
for(k=0+t*m;k<((m/2)+t*m);k++){
float max=B[k][k];
for(i=k;i<m+t*m;i++)
if(B[i][k]>=max){max=B[i][k];q=i;}
for(i=0+t*m;i<m+t*m;i++)
{float g=B[q][i];B[q][i]=B[k][i];B[k][i]=g;}
for(q=k+1;q<m+t*m;q++)
B[q][k]=(B[q][k]/B[k][k]);
for(j=k+1;j<m+t*m;j++)
for(i=k+1;i<m+t*m;i++)
B[i][j]-=((B[i][k])*(B[k][j]));
for(i=k+1;i<m+t*m;i++)
B[i][c-1]-=((B[i][k])*(B[k][c-1]));
}//for k
for(i=0+t*m;i<((m/2)+t*m);i++)
for(j=0+t*m;j<((m/2)+t*m);j++)
A[i][j]=B[i][j];
for(i=0+t*m;i<((m/2)+t*m);i++)
A[i][c-1]=B[i][c-1];
}
for(t=0;t<(n/m);t++){
for(k=c-2-t*m;k>=(c-2)-(m/2)-t*m+1;k--){
float max=C[k][k];
for(i=k;i>=t*m;i--)
if(C[i][k]>=max){max=C[i][k];q=i;}
for(i=(c-2)-t*m;i>=(c-2)-(t)*m-m+1;i--)
{float g=C[q][i];C[q][i]=C[k][i];C[k][i]=g;}
for(q=k-1;q>=(r-1)-t*m-m+1;q--)
C[q][k]=(C[q][k]/C[k][k]);
for(j=k-1;j>=(c-2)-(t)*m-m+1;j--)
for(i=k-1;i>=(r-1)-t*m-m+1;i--)
C[i][j]-=((C[i][k])*(C[k][j]));
for(i=k-1;i>=(r-1)-t*m-m+1;i--)
C[i][c-1]-=((C[i][k])*(C[k][c-1]));
}
for(i=c-2-t*m;i>=(c-2)-(m/2)-t*m+1;i--)
for(j=c-2-t*m;j>=(c-2)-(m/2)-t*m+1;j--)
A[i][j]=C[i][j];
for(i=c-2-t*m;i>=(c-2)-(m/2)-t*m+1;i--)
A[i][c-1]=C[i][c-1];
}
m/=2;
}
printf("Values of X:\n");
for(i=0;i<n;i++)
printf("%.6f ",(A[i][c-1]/A[i][i]));
printf("\n");
printf("\nTime taken: %.6fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
exit(0);}
• Parallelized code
#include <bits/stdc++.h>
#include <pthread.h>
#include <time.h>
#include <semaphore.h>
#include <stdlib.h>
#define ll int
#define pb push_back
#define r 200
#define c 201
using namespace std;
float A[r][c],B[r][c],C[r][c];
ll l=0,n=r,m=n,t,x,ti,i,j,k,q;
pthread_t ptd[5];
/* NOTE: the opening of soln() was truncated in the original listing and is
   reconstructed here so that the routine compiles; the condition separating
   the two branches was lost, so the first thread (ti==0) is assumed to take
   the first branch. Loop indices are shared globals, as in the original. */
void *soln(void *arg){
ti=*((int *)arg);
if(ti==0){
while(m>1){
t=0;
for(i=0;i<r;i++){
for(j=0;j<c;j++){
B[i][j]=A[i][j];C[i][j]=A[i][j];}}
for(t=0;t<(n/m);t++){
for(k=0+t*m;k<((m/2)+t*m);k++){
float max=B[k][k];
for(i=k;i<m+t*m;i++)
if(B[i][k]>=max){max=B[i][k];q=i;}
for(i=0+t*m;i<m+t*m;i++)
{float g=B[q][i];B[q][i]=B[k][i];B[k][i]=g;}
for(q=k+1;q<m+t*m;q++)
B[q][k]=(B[q][k]/B[k][k]);
for(j=k+1;j<m+t*m;j++)
for(i=k+1;i<m+t*m;i++)
B[i][j]-=((B[i][k])*(B[k][j]));
for(i=k+1;i<m+t*m;i++)
B[i][c-1]-=((B[i][k])*(B[k][c-1]));
}
for(i=0+t*m;i<((m/2)+t*m);i++)
for(j=0+t*m;j<((m/2)+t*m);j++)
A[i][j]=B[i][j];
for(i=0+t*m;i<((m/2)+t*m);i++)
A[i][c-1]=B[i][c-1];
}
for(t=0;t<(n/m);t++){
for(k=c-2-t*m;k>=(c-2)-(m/2)-t*m+1;k--){
float max=C[k][k];
for(i=k;i>=t*m;i--)
if(C[i][k]>=max){max=C[i][k];q=i;}
for(i=(c-2)-t*m;i>=(c-2)-(t)*m-m+1;i--)
{float g=C[q][i];C[q][i]=C[k][i];C[k][i]=g;}
for(q=k-1;q>=(r-1)-t*m-m+1;q--)
C[q][k]=(C[q][k]/C[k][k]);
for(j=k-1;j>=(c-2)-(t)*m-m+1;j--)
for(i=k-1;i>=(r-1)-t*m-m+1;i--)
C[i][j]-=((C[i][k])*(C[k][j]));
for(i=k-1;i>=(r-1)-t*m-m+1;i--)
C[i][c-1]-=((C[i][k])*(C[k][c-1]));
}
for(i=c-2-t*m;i>=(c-2)-(m/2)-t*m+1;i--)
for(j=c-2-t*m;j>=(c-2)-(m/2)-t*m+1;j--)
A[i][j]=C[i][j];
for(i=c-2-t*m;i>=(c-2)-(m/2)-t*m+1;i--)
A[i][c-1]=C[i][c-1];
}
m/=2;}
}
else{
while(m>1){
t=0;
for(i=0;i<r;i++){
for(j=0;j<c;j++){
B[i][j]=A[i][j];C[i][j]=A[i][j];}}
for(t=0;t<(n/m);t++){
for(k=0+t*m;k<((m/2)+t*m);k++){
float max=B[k][k];
for(i=k;i<m+t*m;i++)
if(B[i][k]>=max){max=B[i][k];q=i;}
for(i=0+t*m;i<m+t*m;i++)
{float g=B[q][i];B[q][i]=B[k][i];B[k][i]=g;}
for(q=k+1;q<m+t*m;q++)
B[q][k]=(B[q][k]/B[k][k]);
for(j=k+1;j<m+t*m;j++)
for(i=k+1;i<m+t*m;i++)
B[i][j]-=((B[i][k])*(B[k][j]));
for(i=k+1;i<m+t*m;i++)
B[i][c-1]-=((B[i][k])*(B[k][c-1]));
}
for(i=0+t*m;i<((m/2)+t*m);i++)
for(j=0+t*m;j<((m/2)+t*m);j++)
A[i][j]=B[i][j];
for(i=0+t*m;i<((m/2)+t*m);i++)
A[i][c-1]=B[i][c-1];
}
for(t=0;t<(n/m);t++){
for(k=c-2-t*m;k>=(c-2)-(m/2)-t*m+1;k--){
float max=C[k][k];
for(i=k;i>=t*m;i--)
if(C[i][k]>=max){max=C[i][k];q=i;}
for(i=(c-2)-t*m;i>=(c-2)-(t)*m-m+1;i--)
{float g=C[q][i];C[q][i]=C[k][i];C[k][i]=g;}
for(q=k-1;q>=(r-1)-t*m-m+1;q--)
C[q][k]=(C[q][k]/C[k][k]);
for(j=k-1;j>=(c-2)-(t)*m-m+1;j--)
for(i=k-1;i>=(r-1)-t*m-m+1;i--)
C[i][j]-=((C[i][k])*(C[k][j]));
for(i=k-1;i>=(r-1)-t*m-m+1;i--)
C[i][c-1]-=((C[i][k])*(C[k][c-1]));
}
for(i=c-2-t*m;i>=(c-2)-(m/2)-t*m+1;i--)
for(j=c-2-t*m;j>=(c-2)-(m/2)-t*m+1;j--)
A[i][j]=C[i][j];
for(i=c-2-t*m;i>=(c-2)-(m/2)-t*m+1;i--)
A[i][c-1]=C[i][c-1];
}
m/=2;}
}
return NULL;
}
int main()
{int q;
printf("Enter no. of threads:");
scanf("%d",&q);
clock_t tStart = clock();
for(i=0;i<r;i++){
for(j=0;j<c;j++){
A[i][j]=rand()%25;
}}
int idx[5];
for(x=0;x<q;x++){
idx[x]=x;                         /* give each thread its own index */
pthread_create(&ptd[x],NULL,soln,&idx[x]);}
for(i=0;i<q;i++)
pthread_join(ptd[i],NULL);
printf("Values of X:\n");
for(i=0;i<n;i++)
printf("%.6f ",(A[i][c-1]/A[i][i]));
printf("\n");
printf("\nTime taken: %.6fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
exit(0);}
A quite significant performance gain was observed in the running time of the
parallelized iterative solution of SGE in comparison to that of the recursive
solution. The results are tabulated below.
Order            25       100       200       500       800        900        1000
Recursive (sec)  0.00283  0.043988  0.296094  4.534074  19.710256  28.365654  39.814362
Iterative (sec)  0.000791 0.021037  0.156625  2.189652  11.239122  16.320862  26.467252
• Multiplication of two matrices using multi-threading:
Static allocation of threads: In this method, the task of each thread is
explicitly assigned before runtime (hence static): thread i computes rows
i, i+num_threads, i+2*num_threads, ... of the result matrix.
#include<stdio.h>
#include<bits/stdc++.h>
#include<pthread.h>
#include<unistd.h>
#include<stdlib.h>
#include<mutex>
#include <semaphore.h>
using namespace std;
#define MAX 500                   /* matrix dimension; the value used in the original was not shown */
const int num_threads=5;
int r1=MAX,r2=MAX,c1=MAX,c2=MAX,num=0;
vector<vector<long long int> > mat1,mat2,res_mat;
mutex m;
sem_t mu;
/* Static allocation: the thread with index row_num computes rows row_num,
   row_num+num_threads, row_num+2*num_threads, ... */
void *mult_thread(void *param)
{
int i,tot=0;
int row_num=*((int *)param);
int k=row_num;
while( k < r1){
for(int j=0;j<c2;j++)
{
tot=0;
for(i=0;i<c1;i++)
{
tot += mat1[k][i]*mat2[i][j];
}
res_mat[k][j]=tot;
}
k+=num_threads;
}
return NULL;
}
/* Fill mat (r x c) with deterministic test values. */
void accept(vector<vector<long long int> > &mat, int r,int c)
{
long long int val=0;
for(int i=0;i<r;i++){
vector<long long int>v;
for(int j=0;j<c;j++)
{
v.push_back(val++);
}
mat.push_back(v);
}
}
void display()
{ for(int i=0;i<r1;i++)
{
for(int j=0;j<c2;j++)
{
printf("%d ",res_mat[i][j]);
}
cout<<endl;
}}
int main()
{
clock_t tStart = clock();
int i,j;
accept(mat1,r1,c1);
accept(mat2,r2,c2);
for(int i=0;i<r1;i++)
{ vector<long long int>v;
for(int j=0;j<c2;j++)
v.push_back(0);
res_mat.push_back(v);
}
pthread_t p[num_threads];
void *exit_status;
sem_init(&mu,0,1);
if(c1!=r2)
cout<<"Invalid size of matrix for multiplication"<<endl;
else if(c1==r2)
{
int row[num_threads];
int i=0;
while(i<num_threads)
{
row[i]=i;
pthread_create(&p[i],NULL,mult_thread,&row[i]);
++i;
}
for(i=0;i<num_threads;i++)pthread_join(p[i],&exit_status);
}
printf("\nTime taken:%.6fs\n",(clock() - tStart)/CLOCKS_PER_SEC);
display();
return 0;
}
Dynamic allocation of threads: In this method, rows are not pre-assigned;
whichever thread finishes its current row claims the next unprocessed row
through a shared counter protected by a semaphore. The assignment is therefore
unknown before runtime and is decided during the runtime (hence dynamic).
/* NOTE: the head of this listing was truncated in the original; the includes
   and globals are the same as in the static version above, and mult_thread()
   is reconstructed around the surviving fragment. */
void *mult_thread(void *param)
{
int i,j,tot=0;
int k=*((int *)param);            /* each thread starts with its own row index */
while(k < r1){
for(j=0;j<c2;j++)
{
tot=0;
for(i=0;i<c1;i++)
{
tot += mat1[k][i]*mat2[i][j];
}
res_mat[k][j]=tot;
}
sem_wait(&mu);                    /* atomically claim the next row */
++num;
k=num;
sem_post(&mu);
}
return NULL;
}
/* Fill mat (r x c) with deterministic test values, as in the static version. */
void accept(vector<vector<long long int> > &mat, int r,int c)
{
long long int val=0;
for(int i=0;i<r;i++){
vector<long long int>v;
for(int j=0;j<c;j++)
{
v.push_back(val++);
}
mat.push_back(v);
}}
void display()
{for(int i=0;i<r1;i++)
{
for(int j=0;j<c2;j++)
{
printf("%d ",res_mat[i][j]);
}
cout<<endl;
}}
int main()
{
clock_t tStart = clock();
int i,j;
accept(mat1,r1,c1);
accept(mat2,r2,c2);
for(int i=0;i<r1;i++)
{
vector<long long int>v;
for(int j=0;j<c2;j++)
v.push_back(0);
res_mat.push_back(v);
}
pthread_t p[num_threads];
void *exit_status;
sem_init(&mu,0,1);
if(c1!=r2)
cout<<"Invalid size of matrix for multiplication"<<endl;
else if(c1==r2)
{
num=num_threads-1;                /* rows 0..num_threads-1 are taken at startup */
int row[num_threads];
int i=0;
while(i<num_threads)
{
row[i]=i;
pthread_create(&p[i],NULL,mult_thread,&row[i]);
++i;
}
for(i=0;i<num_threads;i++)pthread_join(p[i],&exit_status);
}
printf("\nTime taken: %.6fs\n",clock() - tStart)/CLOCKS_PER_SEC);
display();
return 0;
}
A graph was plotted using the observed experimental data of the above two
programs and was compared with the sequential one. It was observed that the
static and dynamic thread allocation programs took approximately the same
running time, which was much less than that of the sequential multiplication
of the two matrices.
• Efficient chain matrix multiplication through dynamic programming using
multi-threading: This problem involves determining the optimal order in which
to perform a sequence of matrix multiplications. For example, for matrices A,
B, C of sizes 10×30, 30×5 and 5×60, computing (AB)C costs 10·30·5 + 10·5·60 =
4500 scalar multiplications, whereas A(BC) costs 30·5·60 + 10·30·60 = 27000.
This general class of problem is important in compiler design for code
optimization and in databases for query optimization.
Sequential code:
#include <bits/stdc++.h>
#define ll long long int
#define pb push_back
#define mp make_pair
using namespace std;
int n;//no. of matrices
int s[500][500],m[500][500];
vector<int>p;
vector<vector<int> >A[500];
void create(){int d;
for(int i=1;i<=n;i++){
int x=p[i-1],y=p[i];vector<vector<int> > W;
for(int t=0;t<x;t++){vector<int> v;
for(int l=0;l<y;l++)
{d=rand()%4;v.pb(d);}
W.pb(v);
}
A[i]=(W);
}
}
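/* NOTE: match() and chmul() are called below but were not reproduced in the
   original listing. The following is a reconstruction sketch, not the
   report's verbatim code: match() is the standard matrix-chain-order DP
   (cost table m[][], split table s[][]) and chmul() multiplies A[i..j] in
   the order that s[][] records. */
void match(vector<int> &p){
for(int i=1;i<=n;i++)m[i][i]=0;
for(int len=2;len<=n;len++)
for(int i=1;i+len-1<=n;i++){
int j=i+len-1;m[i][j]=INT_MAX;
for(int k=i;k<j;k++){
int cost=m[i][k]+m[k+1][j]+p[i-1]*p[k]*p[j];
if(cost<m[i][j]){m[i][j]=cost;s[i][j]=k;}
}
}
}
/* plain cubic multiplication of two compatible matrices */
vector<vector<int> > mul(vector<vector<int> > &X,vector<vector<int> > &Y){
int r=X.size(),c=Y[0].size(),t=Y.size();
vector<vector<int> > Z(r,vector<int>(c,0));
for(int a=0;a<r;a++)
for(int b=0;b<c;b++)
for(int k=0;k<t;k++)Z[a][b]+=X[a][k]*Y[k][b];
return Z;
}
/* multiply A[i..j] in the optimal order recorded in s[][] */
vector<vector<int> > chmul(int i,int j){
if(i==j)return A[i];
vector<vector<int> > L=chmul(i,s[i][j]),R=chmul(s[i][j]+1,j);
return mul(L,R);
}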
int main(){
scanf("%d",&n);p.assign(n+1,0);
for(int i=0;i<=n;i++)p[i]=i+2;
clock_t tStart = clock();
create();match(p);
int x=p[0],y=p[n];vector<vector<int> > P;
P=chmul(1,n);
for(int i=0;i<x;i++){
for(int j=0;j<y;j++)
printf("%d ",P[i][j]);
printf("\n");
}
printf("\nTime taken: %.6fs\n",(clock() - tStart)/CLOCKS_PER_SEC);
return 0;
}
Parallelized code:
#include <bits/stdc++.h>
#include <pthread.h>
#include <semaphore.h>
#include <mutex>
#define ll long long int
#define pb push_back
#define mp make_pair
using namespace std;
int n,u,q=0,w=1;//no. of matrices
int s[500][500],m[500][500];
vector<int>p;
vector<vector<int> >A[500];
mutex mu;
sem_t z;
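/* NOTE: create(), match() and chmul() are called below but were not
   reproduced in the original listing. match() and chmul() are identical to
   the sequential versions above and are not repeated; the thread routine
   create() is reconstructed here as a sketch in which each thread claims the
   next matrix index w under the mutex. */
void *create(void *arg){
while(true){
mu.lock();
if(w>n){mu.unlock();break;}       /* all matrices have been generated */
int i=w++;                        /* claim matrix index i */
mu.unlock();
int x=p[i-1],y=p[i];
vector<vector<int> > W(x,vector<int>(y));
for(int t=0;t<x;t++)
for(int l=0;l<y;l++)
W[t][l]=rand()%4;                 /* rand() is not thread-safe; kept to match the sequential code */
A[i]=W;
}
return NULL;
}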
int main(){
cout<<"Enter number of threads :";
cin>>u;
scanf("%d",&n);p.assign(n+1,0);
for(int i=0;i<=n;i++)p[i]=i+2;
clock_t tStart = clock();
pthread_t mat[u];
void *exit_status;
sem_init(&z,0,1);
while(q<u){
pthread_create(&mat[q],NULL,create,NULL);
++q;}
for(int i=0;i<u;i++)
pthread_join(mat[i],&exit_status);
match(p);
int x=p[0],y=p[n];vector<vector<int> > P;
P=chmul(1,n);
for(int i=0;i<x;i++){
for(int j=0;j<y;j++)
printf("%d ",P[i][j]);
printf("\n");
}
printf("\nTime taken: %.6fs\n",(clock() - tStart)/CLOCKS_PER_SEC);
return 0;}
The parallelized version was executed with 4 threads (and also with 2
threads), resulting in a running time of 42.0866 seconds (48.7685 seconds
using 2 threads) in comparison to the sequential running time of 54.286186
seconds, for the chain multiplication of 350 matrices. For 400 matrices, the
4-threaded parallelized version executed in 48.1161 seconds (48.7685 seconds
for the 2-threaded version), while the sequential version executed in
66.175954 seconds.
A comparison of the running times is drawn in the following graph.
REFERENCES:
[1] K.N. Balasubramanya Murthy, Srinivas Aluru and S. Kamal Abdali, Solving
Linear Systems on Linear Processor Arrays using a ∗Semiring Based Algorithm.
[2] K.N. Balasubramanya Murthy and C. Siva Ram Murthy, A New Gaussian
Elimination Based Algorithm for Parallel Solution of Linear Equations,
Computers Math. Applic., Vol. 29, No. 7, pp. 39-54, 1995.
[3] D. Tang and G. Gupta, An Efficient Parallel Dynamic Programming Algorithm,
Computers Math. Applic., Vol. 30, No. 8, pp. 65-74, 1995, Pergamon.