ECSE 420 - Parallel Cholesky Algorithm - Report
S = Σ_{i=1}^{n} |y_i - f(x_i)|    (1)
B. Test 2
The second test is conducted to measure the accuracy of the decomposed matrix generated by each implementation. To acquire this measure, a set of matrix operations is applied both to the original matrix A and the decomposed matrix L to produce an experimental result x' that is compared to a theoretical result x. More precisely, we begin by generating a random matrix A and a vector x of matching length (x being the theoretical value). We then compute their product b using the following equation:
Ax = b    (2)
Subsequently, A's Cholesky decomposition LL^T is generated. Using these matrices and the result b previously found, the experimental value x' is obtained by solving:
LL^T x' = b    (3)
Since L and L^T are triangular matrices, finding x' is straightforward. Indeed, one can replace L^T x' by y and solve for y in the following system of linear equations:
Ly = b    (4)
Finally, x' can be computed in a similar way using:
L^T x' = y    (5)
The error between the theoretical and experimental vectors x and x' is computed by the least absolute error:
S = Σ_{i=1}^{n} |y_i - f(x_i)|    (6)
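A minimal sketch of this two-step substitution in C is shown below (the function name solve_cholesky and its signature are ours for illustration; the version actually used for testing appears as testErrorOfLinearSystemApplication() in the appendix, and L is assumed to be stored as a dense row-major lower-triangular matrix):

#include <stdlib.h>

/* Solve A x = b given the Cholesky factor L (A = L L^T):
   forward substitution for L y = b, then back substitution for L^T x = y. */
void solve_cholesky(double **L, double *b, double *x, int n) {
    double *y = (double *) malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) {              // forward substitution
        y[i] = b[i];
        for (int j = 0; j < i; j++)
            y[i] -= L[i][j] * y[j];
        y[i] /= L[i][i];
    }
    for (int i = n - 1; i >= 0; i--) {         // back substitution
        x[i] = y[i];
        for (int j = i + 1; j < n; j++)
            x[i] -= L[j][i] * x[j];            // L^T[i][j] == L[j][i]
        x[i] /= L[i][i];
    }
    free(y);
}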
C. Test 3
The third test aims at comparing the performance of
the three implementations. To acquire this information, we
measure the program execution time with respect to the matrix
size and the number of threads (the latter concerning OpenMP
and MPI exclusively). The serial, OpenMP and MPI programs
are tested with square matrices of sizes varying from 50 to
5000, 50 to 4000 and 50 to 2000/3000 (see discussion for
more details) respectively. The OpenMP and MPI programs
are executed with the number of threads ranging from 2 to
32 and 2 to 9 respectively.
The matrix size and the number of threads are passed as command-line arguments to the executables, while the execution time is obtained by comparing the system time before and after the Cholesky decomposition and printing this value to the command prompt.
For each program execution, a new random matrix is generated. In order to ensure consistency between different runs, each test case is run five times and the average runtime is logged as the measurement.
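A condensed sketch of this timing approach is shown below (it mirrors the clock_gettime() calls used in tests.c in the appendix; the wrapper function timeCholesky() is ours for illustration, and cholSerial() is the serial routine from the appendix):

#include <time.h>
#include "cholSerial.h"   // declares cholSerial(), as in the appendix

/* Return the wall-clock time, in seconds, of one Cholesky decomposition. */
double timeCholesky(double **A, int n) {
    struct timespec begin, end;
    clock_gettime(CLOCK_MONOTONIC, &begin);
    double **L = cholSerial(A, n);             // decomposition under test
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void) L;                                  // the factor itself is not needed here
    return ((double) end.tv_sec + 1.0e-9 * end.tv_nsec)
         - ((double) begin.tv_sec + 1.0e-9 * begin.tv_nsec);
}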
VI. RESULTS
The following section summarizes the statistics acquired after executing each of the three aforementioned tests.
TABLE II
MEASURE OF THE CORRECTNESS OF EACH ALGORITHM

                 Serial    OpenMP    MPI
Correctness      Passed    Passed    Passed
The least absolute deviation between the Cholesky decomposition and the original matrix is computed with a precision of 0.00000000001 for all implementations. For the serial algorithm, and for the OpenMP and MPI versions with matrix sizes from 50 to 3000 and 1 to 9 threads, the deviation is always exactly 0. Therefore, all implementations of the algorithm are correct, although it is not certain whether the parallelization of the task is optimal.
VII. DISCUSSION
In this section, we discuss and analyze our findings based on the data collected during tests 1, 2 and 3.
The results of the first test show that all three implementations that we designed are correct, as expected. Indeed, the error for each one of them is exactly 0. This result is required to conduct further testing and comparison between the different implementations, since it shows that Cholesky decomposition is properly programmed in each version.
TABLE III
BEST EXECUTION TIMES AS A FUNCTION OF THE MATRIX SIZE FOR EACH IMPLEMENTATION

Matrix size   Serial execution   OpenMP best           MPI best
              time (s)           execution time (s)    execution time (s)
50            0.00043            0.00113               0.000229
100           0.00464            0.00299               0.00291
200           0.01739            0.01047               0.008708
300           0.07026            0.03178               0.023265
500           0.28296            0.14314               0.10016
1000          3.86278            0.94137               0.833415
1500          11.31204           2.98861               2.856436
2000          30.60744           6.92203               6.927265
3000          113.79156          22.94476              22.438851
4000          284.43736          54.20875              -
5000          546.4589           -                     -

Surprisingly, the least absolute error measured in the second test is precisely 0 for all implementations, when we expected
some loss of precision in the MPI implementation. This may be explained by the fact that the test matrices were diagonally dominant, and hence their condition number was very low. This property ensures that the decomposition yields little error, even though we didn't expect it to yield no error at all.
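For reference, the test matrices are built to be symmetric and strongly diagonally dominant by the initialize() helper in the appendix: a random matrix is symmetrized and n times maxValue is added to its diagonal. A condensed sketch of that construction (the wrapper makeSPD() is ours; A, M, n and maxValue follow the appendix naming):

/* Condensed sketch of the construction performed by initialize() in matrix.c. */
void makeSPD(double **A, double **M, int n, int maxValue) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[i][j] = M[i][j] + M[j][i];        // symmetrize the random matrix M
    for (int i = 0; i < n; i++)
        A[i][i] += (double) maxValue * n;       // make the diagonal dominant
}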
Fig. 2. Runtime of serial Cholesky algorithm
Using the data acquired during the third testing phase, the execution time versus the matrix size is plotted for each implementation. For the serial version, we observe that the execution time increases cubically as a function of the matrix size, as displayed in Figure 2. This result corresponds to our expectations, since Cholesky decomposition consists of three nested for loops, resulting in a time complexity of O(n^3). Similarly, we note that the OpenMP and MPI execution times (for a fixed number of threads) increase with respect to the matrix size, but less dramatically, as displayed in Figures 3 and 4. This result illustrates that parallelizing the computations improves execution time.
Second, one can observe from Figure 3 that for most of the conducted tests, the execution time of the OpenMP algorithm for a fixed matrix size decreases with an increasing number of threads until 5 threads are used. From 5 threads onwards, the execution time generally increases as more threads are added. However, there is a notable exception: for matrix sizes above 1000, calling 9 threads often results in a longer execution time than that of 16 or 32 threads. This might be due to a particularity of Cholesky decomposition, or is more likely a result of the underlying computer architecture.
TABLE I
EXECUTION TIME AS A FUNCTION OF THE MATRIX SIZE AND NUMBER OF THREADS FOR THE OPENMP AND MPI IMPLEMENTATIONS

Fig. 3. Average OpenMP implementation execution time

Fig. 4. Average OpenMPI implementation execution time

The speedup plotted in Figure 5 illustrates the ratio of the computation time of the OpenMP implementation versus its serial counterpart. As can be observed, the speedup grows linearly as the matrix size increases. Indeed, the larger the matrix size, the more OpenMP outperforms the serial algorithm, since the former completes in O(n^2) and the latter in O(n^3).
Fig. 5. Speedups of parallel implementations

It is very interesting to discuss an intriguing problem encountered with the OpenMP implementation. Initially, a column Cholesky algorithm was programmed and parallelized. After executing dozens of tests, we noted that the performance improved as the number of threads was increased up to 2000 threads, despite the program being run on a 4-core machine! We suspect that spawning an overwhelming number of threads forces the program to terminate prematurely, without necessarily returning an error message. As a result, increasing the number of threads causes prompter termination and thus smaller execution times. To address this problem, a row Cholesky algorithm was programmed instead, which produced logical results, as previously discussed. We chose to completely change the Cholesky implementation from the column to the row version because OpenMP may not deliver the expected level of performance with some algorithms [Chapman].
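For reference, a condensed sketch of the row-oriented update that was ultimately parallelized is shown below (it follows the cholOMP() routine reproduced in the appendix; only the loop over the rows below the diagonal is distributed across threads):

#include <math.h>
#include <omp.h>

/* Condensed sketch of the row-oriented OpenMP update; L initially holds the
   lower-triangular part of A and is overwritten with the Cholesky factor. */
void cholOMPSketch(double **L, int n) {
    int i, j, k;
    for (j = 0; j < n; j++) {
        for (k = 0; k < j; k++)
            L[j][j] -= L[j][k] * L[j][k];        // update the diagonal element
        L[j][j] = sqrt(L[j][j]);
        #pragma omp parallel for shared(L) private(i, k)
        for (i = j + 1; i < n; i++) {            // independent row updates
            for (k = 0; k < j; k++)
                L[i][j] -= L[i][k] * L[j][k];
            L[i][j] /= L[j][j];
        }
    }
}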
Figure 4 shows the execution time of the MPI implementation for varying matrix sizes and numbers of threads. As can be seen, the execution time for a fixed matrix size decreases as the number of threads increases, until the number of threads equals 4 (corresponding to the number of cores on the tested machine). From that point on, the execution time generally increases with the number of threads, and one can observe that this is consistent across different matrix sizes. An interesting observation can also be made from Table I regarding the execution time with more than 4 threads. When the number of threads is set to an even number, the execution time is less than with the preceding and following odd numbers.
Once again, this might be a result of the underlying computer
architecture.
The ratio of the computation time of MPI over the serial implementation is plotted in Figure 5, and we can observe that as the matrix size increases, the speedup increases linearly. In fact, this confirms that MPI outperforms the serial implementation by a factor of the matrix size n. Interestingly, we can conclude from these results that MPI's behaviour is very similar to that of OpenMP.
As previously implied, the best OpenMP and MPI performance for most matrix sizes is achieved with 4 threads. This is due to the fact that the tests were conducted on a 4-core machine; maximum performance is achieved when the number of threads matches the number of cores [Jie Chen]. Indeed, too few threads do not fully exploit the computer's parallelization capabilities, while too many threads induce significant context-switching overhead between processes [Chapman].
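As a small illustration (not taken from the report's test programs, which receive the thread count as a command-line argument), the OpenMP runtime can be asked for the number of available cores and the thread count set to match:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int cores = omp_get_num_procs();   // number of processors available to the program
    omp_set_num_threads(cores);        // request one OpenMP thread per core
    printf("Using %d threads\n", cores);
    return 0;
}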
A. Serial/MPI/OpenMP comparison
As expected, the parallel implementations deliver much higher performance than their serial counterpart; obviously, parallelizing independent computations results in much smaller runtimes than executing them serially. Surprisingly, in their best cases, the OpenMP and MPI algorithms complete Cholesky decomposition in nearly identical runtimes, with MPI having an advantage of a few microseconds, as seen in Figure 6. We were expecting OpenMP to perform better, since we believed that message passing between processes would cause more overhead than using a shared memory space. However, we have not been able to explain this result on a theoretical basis.
Fig. 6. Average OpenMPI Speedup
One can observe that the OpenMP and MPI implementations were tested with a maximum matrix size of 4000 and 3000 respectively, compared to 5000 for the serial implementation. This outcome is due to the fact that running the program with a size higher than these values would completely freeze the computer because of memory limitations. Therefore, the serial implementation seems to have an advantage over the parallel algorithms when the matrix size is significantly large, because it doesn't require additional memory elements such as extra threads, processes or messages. Similarly, MPI was not tested with 16 and 32 threads, unlike OpenMP, because these values would freeze the computer. We believe that this behavior is explained by the fact that MPI requires more memory for generating not only processes, but also messages, while OpenMP only requires threads.
Finally, it is essential to mention that it is much simpler to parallelize a program with OpenMP than with MPI [Mallón]; the former only requires a few extra lines of code defining the parallelizable sections and the required synchronization mechanisms (locks, barriers, etc.), while the latter requires the code to be restructured in its entirety before setting up the message passing architecture.
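As a hypothetical illustration of this difference in effort (the loop below is not taken from the report's code), a single pragma is typically all OpenMP needs to parallelize an existing loop, whereas an MPI version of the same computation would also need explicit process setup, data distribution and message exchange:

#include <omp.h>

/* Hypothetical example: one pragma parallelizes an existing loop. */
void scale(double *v, double a, int n) {
    int i;
    #pragma omp parallel for shared(v) private(i)
    for (i = 0; i < n; i++)
        v[i] *= a;
}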
VIII. CONCLUSION
Throughout this project, we evaluated the performance impact of using different parallelization techniques for Cholesky decomposition. First, a serial implementation was developed and used as the reference baseline. Second, we used multicore programming tools, namely OpenMP and MPI, to apply different parallelization approaches. We inspected the results of the different implementations by varying the matrix size (from 50 to 5000) and the number of threads used (from 1 to 32). As expected, both parallel implementations deliver higher performance than the serial version. We also observed that MPI achieved execution times a few microseconds better than OpenMP.
In the future, we would like to experiment with a mixed mode of OpenMP and MPI in the hope of discovering an even more efficient parallelization scheme for Cholesky decomposition.
Moreover, we would like to conduct further tests with the current implementations by running the different programs on computers with various numbers of cores, such as 8, 16, 32 and 64. We would expect machines with more cores to provide better execution times.
This project improved our knowledge of the different parallelization tools that can be used to parallelize a program. In fact, we were able to apply the parallel computing theory learned in class.
IX. REFERENCES
D. Mallón, et al., "Performance Evaluation of MPI, UPC and OpenMP on Multicore Architectures," in Recent Advances in Parallel Virtual Machine and Message Passing Interface, vol. 5759, M. Ropo, et al., Eds. Springer Berlin Heidelberg, 2009, pp. 174-184.
J. Shen and A. L. Varbanescu, "A Detailed Performance Analysis of the OpenMP Rodinia Benchmark," Technical Report PDS-2011-011, 2011.
B. Chapman, et al., Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, 2008.
M. T. Heath, "Parallel Numerical Algorithms course, Chapter 7 - Cholesky Factorization," University of Illinois at Urbana-Champaign.
K. A. Gallivan, et al., Parallel Algorithms for Matrix Computations. Society for Industrial and Applied Mathematics, 1990.
L. Smith and M. Bull, "Development of mixed mode MPI/OpenMP applications," Sci. Program., vol. 9, pp. 83-98, 2001.
X. APPENDIX
A. Test Machine Specifications
1) Machine 1
Hardware:
CPU: Intel Core i3 M350 Quad-Core @ 2.27 GHz
Memory: 3.7 GB
Software:
Ubuntu 12.10 64-bit
Linux kernel 3.5.0-17-generic
GNOME 3.6.0
2) Machine 2
Hardware:
CPU: Intel Core i7 Q 720 Quad-Core @ 1.60 GHz
Memory: 4.00 GB
Software:
Ubuntu 13.04 64-bit
Linux kernel 3.5.0-17-generic
GNOME 3.6.0
XI. CODE
A. matrix.c - Helper Functions
#include "matrix.h"

// Print a square matrix.
void print(double **matrix, int matrixSize) {
    int i, j;
    for (i = 0; i < matrixSize; i++) {
        for (j = 0; j < matrixSize; j++) {
            printf("%.2f\t", matrix[i][j]);
        }
        printf("\n");
    }
    printf("\n");
}

// Multiply two square matrices of the same size.
double **matrixMultiply(double **matrix1, double **matrix2, int matrixSize) {
    // Allocate memory for a matrix of doubles.
    int i, j, k;
    double **matrixOut = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++) {
        matrixOut[i] = (double *) malloc(matrixSize * sizeof(double));
    }
    double result = 0;
    // Fill each cell of the output matrix.
    for (i = 0; i < matrixSize; i++) {
        for (j = 0; j < matrixSize; j++) {
            // Multiply each row of matrix 1 with each column of matrix 2.
            for (k = 0; k < matrixSize; k++) {
                result += matrix1[i][k] * matrix2[k][j];
            }
            matrixOut[i][j] = result;
            result = 0; // Reset.
        }
    }
    return matrixOut;
}

// Add two square matrices of the same size.
double **matrixAddition(double **matrix1, double **matrix2, int matrixSize) {
    // Allocate memory for a matrix of doubles.
    int i, j;
    double **matrixOut = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++) {
        matrixOut[i] = (double *) malloc(matrixSize * sizeof(double));
    }
    // Fill each cell of the output matrix.
    for (i = 0; i < matrixSize; i++) {
        for (j = 0; j < matrixSize; j++) {
            matrixOut[i][j] = matrix1[i][j] + matrix2[i][j];
        }
    }
    return matrixOut;
}

// Multiply a square matrix by a vector. Return NULL on failure.
double *vectorMultiply(double **matrix, double *vector, int matrixSize, int vectorSize) {
    double *result = (double *) malloc(matrixSize * sizeof(double));
    if (vectorSize != matrixSize) {
        return NULL;
    }
    int i, j;
    double sum = 0.0;
    // Multiplication.
    for (i = 0; i < matrixSize; i++) {
        for (j = 0; j < matrixSize; j++) {
            sum += matrix[i][j] * vector[j];
        }
        result[i] = sum;
        sum = 0; // Reset.
    }
    return result;
}

// Return the transpose of a square matrix.
double **transpose(double **matrix, int matrixSize) {
    // Allocate memory for a matrix of doubles.
    int i, j;
    double **matrixOut = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++) {
        matrixOut[i] = (double *) malloc(matrixSize * sizeof(double));
    }
    // Transpose the matrix.
    for (i = 0; i < matrixSize; i++) {
        for (j = 0; j < matrixSize; j++) {
            matrixOut[i][j] = matrix[j][i];
        }
    }
    return matrixOut;
}

// Create a real positive definite matrix.
double **initialize(int minValue, int maxValue, int matrixSize) {
    // Allocate memory for matrices of doubles.
    int i, j;
    double **matrix = (double **) malloc(matrixSize * sizeof(double *));
    double **identity = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++) {
        matrix[i] = (double *) malloc(matrixSize * sizeof(double));
        // calloc zero-initializes the off-diagonal entries of the identity matrix.
        identity[i] = (double *) calloc(matrixSize, sizeof(double));
    }
    // Fill the matrix with random numbers between minValue and maxValue.
    // Build a scaled identity matrix (maxValue * matrixSize on the diagonal).
    double random;
    for (i = 0; i < matrixSize; i++) {
        identity[i][i] = maxValue * matrixSize;
        for (j = 0; j < matrixSize; j++) {
            random = (maxValue - minValue) * ((double) rand() / (double) RAND_MAX) + minValue;
            if (random == 0.0) {
                random = 1.0; // Avoid division by 0.
            }
            matrix[i][j] = random;
        }
    }
    // Transform to positive definite.
    double **transposed = transpose(matrix, matrixSize);
    matrix = matrixAddition(matrix, transposed, matrixSize);
    matrix = matrixAddition(matrix, identity, matrixSize);
    return matrix;
}

// Compute the sum of absolute error between 2 vectors.
double vectorComputeSumofAbsError(double *vector1, double *vector2, int size)
{
    int i;
    double sumOfAbsError = 0;
    for (i = 0; i < size; i++)
    {
        sumOfAbsError += fabs(vector2[i] - vector1[i]);
    }
    return sumOfAbsError;
}

// Compute the sum of absolute error between 2 matrices.
void ComputeSumOfAbsError(double **matrix1, double **matrix2, int size)
{
    int i, j;
    double sumOfAbsError = 0;
    for (i = 0; i < size; i++)
    {
        for (j = 0; j < size; j++)
        {
            sumOfAbsError += fabs(matrix1[i][j] - matrix2[i][j]);
        }
    }
    printf("The sum of absolute error is %10.6f\n", sumOfAbsError);
}

// Print a vector, one entry per line.
void printVector(double *vector, int size) {
    int i;
    for (i = 0; i < size; i++) {
        printf("\t%10.6f", vector[i]);
        printf("\n");
    }
    printf("\n");
}

// Allocate an uninitialized square matrix of doubles.
double **initMatrix(int size) {
    double **matrix = (double **) malloc(size * sizeof(double *));
    int i;
    for (i = 0; i < size; i++)
        matrix[i] = (double *) malloc(size * sizeof(double));
    return matrix;
}

// Copy only the lower-triangular part of source into dest.
void transCopy(double **source, double **dest, int size) {
    int i, j;
    for (i = 0; i < size; i++) {
        for (j = 0; j <= i; j++) {
            dest[i][j] = source[i][j];
        }
    }
}

// Copy a full square matrix.
void copyMatrix(double **source, double **dest, int size) {
    int i, j;
    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            dest[i][j] = source[i][j];
        }
    }
}
B. cholSerial.c - Serial Cholesky
#include "cholSerial.h"

double **cholSerial(double **A, int n) {
    // Copy matrix A and take only the lower triangular part.
    double **L = initMatrix(n);
    transCopy(A, L, n);
    int i, j, k;
    for (j = 0; j < n; j++) {
        for (k = 0; k < j; k++) {
            // Inner sum.
            for (i = j; i < n; i++) {
                L[i][j] = L[i][j] - L[i][k] * L[j][k];
            }
        }
        L[j][j] = sqrt(L[j][j]);
        for (i = j + 1; i < n; i++) {
            L[i][j] = L[i][j] / L[j][j];
        }
    }
    return L;
}
C. cholOMP.c - OpenMP Cholesky
#include "matrix.h"
#include <omp.h>

double **cholOMP(double **L, int n) {
    // Warning: acts directly on the given matrix!
    int i, j, k;
    omp_lock_t writelock;
    omp_init_lock(&writelock);
    for (j = 0; j < n; j++) {
        for (i = 0; i < j; i++) {
            L[i][j] = 0;
        }
        #pragma omp parallel for shared(L) private(k)
        for (k = 0; k < j; k++) {
            omp_set_lock(&writelock);
            L[j][j] = L[j][j] - L[j][k] * L[j][k]; // Critical section.
            omp_unset_lock(&writelock);
        }
        #pragma omp single
        L[j][j] = sqrt(L[j][j]);
        #pragma omp parallel for shared(L) private(i, k)
        for (i = j + 1; i < n; i++) {
            for (k = 0; k < j; k++) {
                L[i][j] = L[i][j] - L[i][k] * L[j][k];
            }
            L[i][j] = L[i][j] / L[j][j];
        }
    }
    // Destroy the lock once all columns have been processed.
    omp_destroy_lock(&writelock);
    return L;
}
D. cholMPI.c - OpenMPI Cholesky
#include <mpi.h>
#include "matrix.h"

int testBasicOutput(double **A, double **L, int n);

void cholMPI(double **A, double **L, int n, int argc, char **argv) {
    // Warning: cholMPI() acts directly on the given matrix!
    int npes, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double start, end;
    MPI_Barrier(MPI_COMM_WORLD); /* Timing */
    if (rank == 0) {
        start = MPI_Wtime();
        /* // Test
        printf("A = \n");
        print(L, n); */
    }
    // For each column.
    int i, j, k;
    for (j = 0; j < n; j++) {
        /*
         * Step 0:
         * Replace the entries above the diagonal with zeroes.
         */
        if (rank == 0) {
            for (i = 0; i < j; i++) {
                L[i][j] = 0.0;
            }
        }
        /*
         * Step 1:
         * Update the diagonal element.
         */
        if (j % npes == rank) {
            for (k = 0; k < j; k++) {
                L[j][j] = L[j][j] - L[j][k] * L[j][k];
            }
            L[j][j] = sqrt(L[j][j]);
        }
        // Broadcast the row with the new values to the other processes.
        MPI_Bcast(L[j], n, MPI_DOUBLE, j % npes, MPI_COMM_WORLD);
        /*
         * Step 2:
         * Update the elements below the diagonal element.
         */
        // Divide the rest of the work.
        for (i = j + 1; i < n; i++) {
            if (i % npes == rank) {
                for (k = 0; k < j; k++) {
                    L[i][j] = L[i][j] - L[i][k] * L[j][k];
                }
                L[i][j] = L[i][j] / L[j][j];
            }
        }
    }
    MPI_Barrier(MPI_COMM_WORLD); /* Timing */
    if (rank == 0) {
        end = MPI_Wtime();
        printf("Testing OpenMpi implementation Output: \n");
        printf("Runtime = %lf\n", end - start);
        printf("Testing MPI implementation Output: ");
        testBasicOutput(A, L, n);
        // Test
        /* double **LLT = matrixMultiply(L, transpose(L, n), n);
        printf("LL^T = \n");
        print(LLT, n); */
    }
    MPI_Finalize();
}

int testBasicOutput(double **A, double **L, int n)
{
    double **LLT = matrixMultiply(L, transpose(L, n), n);
    int i, j;
    float precision = 0.0000001;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (!(fabs(LLT[i][j] - A[i][j]) < precision))
            {
                printf("FAILED\n");
                ComputeSumOfAbsError(A, LLT, n);
                return 0;
            }
        }
    }
    printf("PASSED\n");
    return 1;
}
E. tests.c - General Test Code
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>
#include <time.h>
#include <stdlib.h>
#include <omp.h>
#include "matrix.h"

typedef int bool;
enum { false, true };

// Routines under test (defined in cholSerial.c and cholOMP.c) and local helpers.
double **cholSerial(double **A, int n);
double **cholOMP(double **L, int n);
int testBasicOutputOfChol(double **A, double **L, int n);

struct timespec begin = {0, 0}, end = {0, 0};
time_t start, stop;

int main(int argc, char **argv)
{
    // Generate seed.
    srand(time(NULL));
    if (argc != 3)
    {
        printf("You did not feed me arguments, I will die now :( ...\n");
        printf("Usage: %s [matrix size] [number of threads]\n", argv[0]);
        return 1;
    }
    int matrixSize = atoi(argv[1]);
    int threadsNumber = atoi(argv[2]);
    printf("Test basic output for a matrix of size %d:\n", matrixSize);
    // Generate a random SPD matrix.
    double **A = initialize(0, 10, matrixSize);
    /* printf("Chol matrix\n");
    print(A, matrixSize); */
    double **L = initialize(0, 10, matrixSize);

    // Test Serial Program
    // Apply Serial Cholesky.
    printf("Testing Serial implementation Output: \n");
    clock_gettime(CLOCK_MONOTONIC, &begin);
    L = cholSerial(A, matrixSize);
    clock_gettime(CLOCK_MONOTONIC, &end); // Get the current time.
    testBasicOutputOfChol(A, L, matrixSize);
    // Test execution time.
    printf("The serial computation took %.5f seconds\n",
           ((double) end.tv_sec + 1.0e-9 * end.tv_nsec) -
           ((double) begin.tv_sec + 1.0e-9 * begin.tv_nsec));

    // Testing OpenMP Program
    printf("Testing OpenMP implementation Output: \n");
    omp_set_num_threads(threadsNumber);
    copyMatrix(A, L, matrixSize);
    clock_gettime(CLOCK_MONOTONIC, &begin);
    cholOMP(L, matrixSize);
    clock_gettime(CLOCK_MONOTONIC, &end); // Get the current time.
    testBasicOutputOfChol(A, L, matrixSize);
    // Test execution time.
    printf("The OpenMP computation took %.5f seconds\n",
           ((double) end.tv_sec + 1.0e-9 * end.tv_nsec) -
           ((double) begin.tv_sec + 1.0e-9 * begin.tv_nsec));
    printf("\n");
    return 0;
}

int testBasicOutputOfChol(double **A, double **L, int n)
{
    double **LLT = matrixMultiply(L, transpose(L, n), n);
    int i, j;
    float precision = 0.00000000001;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (!(fabs(LLT[i][j] - A[i][j]) < precision))
            {
                printf("FAILED\n"); // If it fails, show the error.
                ComputeSumOfAbsError(A, LLT, n);
                return 0;
            }
        }
    }
    printf("PASSED\n");
    return 1;
}

void testTimeforSerialChol(int n)
{
    printf("Test duration for serial version with matrix of size %d\n", n);
    // Generate a random SPD matrix.
    double **A = initialize(0, 10, n);
    clock_t start = clock();
    // Apply Cholesky.
    double **L = cholSerial(A, n);
    clock_t end = clock();
    float seconds = (float)(end - start) / CLOCKS_PER_SEC;
    printf("It took %f seconds\n", seconds);
}

void testErrorOfLinearSystemApplication(int matrixSize)
{
    printf("Test linear system application of Cholesky for matrix size %d:\n",
           matrixSize);
    double **A = initialize(0, 10, matrixSize);
    double *xTheo = (double *) malloc(matrixSize * sizeof(double));
    int index;
    for (index = 0; index < matrixSize; index++)
    {
        xTheo[index] = rand() / (double) RAND_MAX * 10;
    }
    double *b = vectorMultiply(A, xTheo, matrixSize, matrixSize);
    // Apply Cholesky.
    double **L = cholSerial(A, matrixSize);
    double *y = (double *) malloc(matrixSize * sizeof(double));
    // Forward-substitution part.
    int i, j;
    for (i = 0; i < matrixSize; i++) {
        y[i] = b[i];
        for (j = 0; j < i; j++) {
            y[i] = y[i] - L[i][j] * y[j];
        }
        y[i] = y[i] / L[i][i];
    }
    // Back-substitution part.
    double **LT = transpose(L, matrixSize);
    double *xExpr = (double *) malloc(matrixSize * sizeof(double));
    for (i = matrixSize - 1; i >= 0; i--) {
        xExpr[i] = y[i];
        for (j = i + 1; j < matrixSize; j++) {
            xExpr[i] = xExpr[i] - LT[i][j] * xExpr[j];
        }
        xExpr[i] = xExpr[i] / LT[i][i];
    }
    printf("x experimental is: \n");
    printVector(xExpr, matrixSize);
    printf("The sum of abs error is %10.6f\n",
           vectorComputeSumofAbsError(xTheo, xExpr, matrixSize));
}
F. testMPI.c - Test program for MPI implementation
#include "matrix.h"

// Defined in cholMPI.c.
void cholMPI(double **A, double **L, int n, int argc, char **argv);

int main(int argc, char **argv)
{
    // Generate seed.
    srand(time(NULL));
    if (argc != 2)
    {
        printf("You did not feed me arguments, I will die now :( ...\n");
        printf("Usage: %s [matrix size]\n", argv[0]);
        return 1;
    }
    int matrixSize = atoi(argv[1]);
    // Generate a random SPD matrix.
    double **A = initialize(0, 10, matrixSize);
    /* printf("Chol matrix\n");
    print(A, matrixSize); */
    double **L = initialize(0, 10, matrixSize);
    // Testing OpenMpi Program
    copyMatrix(A, L, matrixSize);
    cholMPI(A, L, matrixSize, argc, argv);
    // Warning: cholMPI() acts directly on the given matrix L.
    return 0;
}