Parallel Computing: A Comparative
Advisor: Prof. Giuseppe Rodriguez
Candidate: Edoardo Daniele Cannas
Contents

Introduction  ii
2 Parallel MATLAB  14
  2.1 Parallel MATLAB  14
  2.2 PCT and MDCS: MATLAB way to distributed computing and multiprocessing  16
  2.3 Infrastructure: MathWorks Job Manager and MPI  17
  2.4 Some parallel language constructs: pmode, SPMD and parfor  21
    2.4.1 Parallel Command Window: pmode  21
    2.4.2 SPMD: Single Program Multiple Data  21
    2.4.3 Parallel For Loop  25
  2.5 Serial or parallel? Implicit parallelism in MATLAB  27
5 Conclusion  56
Bibliography  58
Acknowledgements and thanks
Professor Rodriguez, for the help and the opportunity to study such an interesting, yet manifold, topic.
All the colleagues met during my university career, for sharing even a little time in such an amazing journey.
Everyone who is not a skilled English speaker, for not pointing out all the grammar mistakes inside this thesis.
Introduction
With the term parallel computing, we usually refer to the simultaneous use of multiple compute resources to solve a computational problem. Generally, it involves several aspects of computer science, from the hardware design of the machine executing the processing to the several levels at which the actual parallelization of the software code takes place (bit level, instruction level, data or task level, thread level, etc.).
While this particular technique has been exploited for many years, especially in the high-performance computing field, recently parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors, due to the physical constraints related to frequency ramping in the development of more powerful single-core processors [14].
Therefore, as parallel computers became cheaper and more accessible, their computing power has been used in a wide range of scientific fields, from graphical models and finite-state machine simulation to dense and sparse linear algebra.
The aim of this thesis is to analyze and show, when possible, the implications of the parallelization of linear algebra algorithms, in terms of performance increase and decrease, using the Parallel Computing Toolbox of MATLAB.
Our work focused mainly on two algorithms:
• the classical Jacobi iterative method for solving linear systems;
• the Global Lanczos decomposition method for computing the Estrada index of complex networks;
whose form has been modified to be used with MATLAB's parallel language constructs and methods. We compared our parallel versions of these algorithms to their serial counterparts, testing their performance and highlighting the factors behind their behaviour, in chapters 3 and 4.
In chapter 1 we present a brief overview of parallel computing and its related aspects, while in chapter 2 we focus our attention on the implementation details of MATLAB's Parallel Computing Toolbox.
Chapter 1

Parallel Computing: A General Overview
bi-directional flow of operands between processor and memory (see figure 1.2).
Flynn's classification divides computers based on the way their CPU's instruction and data streams are organized during the execution of a program.
Letting I and D be the minimum numbers of instruction and data flows at any time during the computation, computers can then be categorized as follows:
The MIMD category actually holds the formal definition of parallel computers; anyway, modern computers, depending on the instruction set of their processors, can act differently from program to program. In fact, programs which perform the same operation repeatedly over a large data set, like a signal processing application, can make our computers act like SIMD machines, while the regular use of a computer for Internet browsing while listening to music can be regarded as MIMD behaviour.
So, Flynn's taxonomy is rather broad and general, but it fits very well for classifying both computers and programs thanks to its understandability, and it is a widely used scheme.
In the next sections we will try to provide a more detailed classification based
on hardware implementations and programming paradigms.
Depending on the way the processors access data in the memory, tightly coupled systems can be further divided into:
All these systems may present some mechanism of cache coherence, meaning that if one processor updates a location in the shared memory, all the other processors are aware of the update.
For these last two reasons, distributed memory models are usually referred to as message passing models. They require the programmer to specify and realize the parallelism using a set of libraries and/or subroutines. Nowadays, there is a standard interface for message passing implementations, called MPI (Message Passing Interface), which is the de facto industry standard for distributed programming models.
The SPMD model is probably the most common programming model used in multi-node cluster environments.
• multi-core computers.
now necessary, especially for what concerns linear algebra and scientific computing.
Chapter 2

Parallel MATLAB
The following chapter aims to describe briefly the features and mechanisms through which MATLAB, a technical computing language and development environment, exploits the several parallel computing techniques discussed in the previous chapter. From the possibility of having a parallel computer for high-performance computation, like a distributed computer or a cluster, using different parallel programming models (i.e. SPMD and/or MPI), to the implementation of different levels of parallelism in the software itself, there are various ways of obtaining increased performance through parallel computation.
The information reported here is mainly drawn from [2] and [23].
processor machines became easier, the demand for a toolbox such as the PCT spread throughout the user community.
The PCT software and DCS are toolboxes made to extend MATLAB, much as a library or an extension of the language does. These two instruments, anyway, are not the only means available to users for improving the performance of their code through parallelization techniques.
In fact, MATLAB supports a mechanism of implicit and explicit multithreading of computations, and presents a wide range of mechanisms to transform the data and vectorize the computation, so that it can take advantage of libraries such as BLAS and LAPACK. We will discuss these details later.
However, we can now roughly classify MATLAB parallelism into four categories ([2], p. 285; see also [18]):
Figure 2.2: Worker processes are delimited by the red rectangle, while the client is delimited by the green rectangle. The computer is an Intel 5200U (2 physical cores, each double threaded, for a total of 4 logical cores).
Figure 2.3: Creation of a job of three equal tasks, submitted to the scheduler
that supports Binary Large Objects (BLOBs). The scheduler is then provided with an API with significant data facilities, allowing both the user and MATLAB to interact with both the database and the workers.
Figure 2.5: Infrastructure for schedulers, jobs and parallel interactive sessions
In fact, all the workers have their own separate workspaces and can work independently of each other without requiring any communication infrastructure.
Having said that, they can still communicate with each other through a message passing infrastructure, consisting of two principal layers:
Figure 2.7: Creation of a parpool: like jobs and tasks, parpools are instances of objects.
Figure 2.8: Building and configuration of the MPI ring: the mpiexec con-
fig.file (green square) is a configuration file for setting up and running the
pool, whose workers are delimited by the blue square
Inside the body of the spmd statement, every worker is denoted as a lab and is associated with a specific and unique index. This index allows the user to diversify the execution of the code across the labs using conditional statements (see figure 2.10), but also to use some message passing functions which are high-level abstractions of functions described in the MPI-2 standard.
Users may use these operations explicitly, or implicitly within the code of other constructs.
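As a minimal sketch of these ideas (assuming a parallel pool of four workers is open; gplus is one of the high-level message passing functions mentioned above), each lab can branch on its own index and then take part in a global reduction:

spmd
    % every lab runs this body; labindex and numlabs identify it
    if labindex == 1
        x = 10;            % only lab 1 executes this branch
    else
        x = labindex;      % the other labs execute this one
    end
    s = gplus(x);          % global sum: with 4 labs, every lab receives 10+2+3+4 = 19
end
s{1}                       % on the client, s is a Composite; this is the value held by lab 1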
For example, to reduce programming complexity by abstracting away the details of message passing, the PCT implements the PGAS (Partitioned Global Address Space) model for SPMD through distributed arrays. These arrays are partitioned, with portions localized to every worker space, and are implemented as MATLAB objects, so that they also store information about the type of distribution, the local indices, the global array size, the number of worker processes, etc.
Their distributions are usually one-dimensional or two-dimensional block cyclic; users may specify their own data distribution too, creating their own distribution object through the functions distributor and codistributor⁴. An important aspect of data distributions in MATLAB is that they are dynamic [23], as the user may change distribution at will by redistributing data with a specific function (redistribute).
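For instance, the following sketch (our own illustration, assuming a parallel pool is already open) builds a codistributed array with a column-wise distribution and then switches it to a row-wise one with redistribute:

spmd
    % 1000-by-1000 zero matrix, split by columns (dimension 2) among the workers
    A = codistributed.zeros(1000, 1000, codistributor1d(2));
    % change to a block distribution along the rows (dimension 1)
    A = redistribute(A, codistributor1d(1));
    localA = getLocalPart(A);   % portion of A stored by this worker
end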
Distributed arrays can be used with almost all MATLAB built-in functions, letting users write parallel programs that do not really differ from serial ones.
Some of these functions, like the ones for dense linear algebra, are finely tuned as they are based on the ScaLAPACK interface: ScaLAPACK routines are accessed by distributed array functions in MATLAB without any effort by the user, with the PCT encoding and decoding the data to and from the ScaLAPACK library⁵. Other algorithms, such as the ones used for sparse matrices, are implemented in the MATLAB language instead.
Anyway, their main characteristic is their ease of use for the programmer: accessing and using distributed arrays is syntactically identical to accessing and using regular MATLAB arrays, with all the aspects related to data location and communication between the workers hidden from the user, since they are handled inside the definition of the functions that operate on distributed arrays. In fact, every worker has access to every portion of the arrays, and the communication needed to transfer data from where it is stored to the worker that requires it does not have to be specified in the user's code.
The same idea lies behind the reduction operations, which allow the user to see the distributed array as a Composite object directly in the workspace of the client without any effort on their part.
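A small sketch of this client-side view (again assuming an open pool): an array created as codistributed inside an spmd block is seen by the client as a distributed array, and gather recollects it into the client workspace:

spmd
    D = codistributed(magic(8));   % codistributed among the workers
end
class(D)                           % from the client, D appears as a 'distributed' array
M = gather(D);                     % ordinary 8-by-8 matrix back in the client workspace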
⁴ Please note that distributed and codistributed arrays are the same arrays seen from different perspectives: a codistributed array which exists on the workers is accessible from the client as distributed, and vice versa [2].
⁵ ScaLAPACK is a library which includes a subset of LAPACK routines specifically designed for parallel computers, written in an SPMD style. For more information, see [9].
Thus, the SPMD construct gives the user a powerful tool to elaborate parallel algorithms in a simple but still efficient fashion.
If any of the variables of the loop fails to be classified into one of these categories, or the user employs any function that modifies the workers'
This useful tool, which operates at levels 3 and 4 of parallelism in programs (see section 1.4), is not the only one available to the user to accelerate their code. In fact, MATLAB uses today's state-of-the-art math libraries such as PBLAS, LAPACK, ScaLAPACK and MAGMA [17], and, for platforms with Intel or AMD CPUs, takes advantage of Intel's Math Kernel Library (MKL), which uses the SSE and AVX CPU instructions, SIMD (Single Instruction Multiple Data) instructions implementing level 1 of parallelism in programs (see figure 1.5), and includes both BLAS and LAPACK⁷.
So, even though the user may not be aware of it, some of their serial code may, in fact, be parallelized in some manner: from multithreaded functions to SIMD instruction sets and highly tuned math libraries, there are several levels at which the parallelization of the computation can take place.
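A rough way to observe this implicit parallelism is to restrict the number of computational threads and time the same BLAS-backed operation twice; the sketch below uses the built-in maxNumCompThreads and timeit functions, and the timings obviously depend on the machine:

A = rand(4000);  B = rand(4000);
maxNumCompThreads(1);              % force a single computational thread
t1 = timeit(@() A*B);              % matrix product handled by the BLAS
maxNumCompThreads('automatic');    % restore the default thread count
t2 = timeit(@() A*B);
fprintf('1 thread: %.2f s, default threads: %.2f s\n', t1, t2);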
This aspect turned out to be very important during the analysis of our algorithms' performance, so we think it deserves to be described in a specific section. Anyway, we were not able to determine where these mechanisms of implicit parallelization took place, since the implementation details are not public.
⁶ An official list of MATLAB's multithreaded functions is available at [29]; a more detailed but unofficial list is available at [7].
⁷ For information about the MKL, see [28]; for information about the SSE and AVX instruction sets, see [26].
Chapter 3

Parallelizing the Jacobi Method
This chapter and the following one focus on two linear algebra applications: the Jacobi iterative method for solving linear systems and the Global Lanczos decomposition used for the computation of the Estrada index in complex network analysis.
Both algorithms are theoretically parallelizable; thus, we formulated a serial and a parallel version of each, implemented using the SPMD and parfor constructs of the PCT. Then, we compared the parallel versions against their serial counterparts, trying to understand and illustrate the reasons behind their behaviour and highlighting the implications of parallelization within them.
Iterative methods of the first order are methods where the computation of the solution at step $(k+1)$ involves only the approximate solution at step $k$; thus, they are methods of the form $x^{(k+1)} = \varphi(x^{(k)})$.
where x is the solution of the system and || · || identifies any vector norm.
Definition 3.2. An iterative method is consistent if
$$x^{(k)} = x \implies x^{(k+1)} = x.$$
From these definitions we obtain the following theorem, whose proof is immediate.
Theorem 3.1. Consistency is a necessary, but not sufficient, condition for convergence.
A linear, stationary, iterative method of the first order is a method in
the form
$$x^{(k+1)} = Bx^{(k)} + f \qquad (3.1)$$
where the linearity comes from the relation that defines it, the stationarity
comes from the fact that the iteration matrix B and the vector f do not
change when varying the iteration index k, and the computation of the vector
x(k+1) depends on the previous term x(k) only.
Theorem 3.2. A linear, stationary iterative method of the first order is consistent if and only if
$$f = (I - B)A^{-1}b.$$
Proof. By direct inspection: imposing $x^{(k)} = x \implies x^{(k+1)} = x$ in equation (3.1) gives $x = Bx + f$, that is $f = (I - B)x = (I - B)A^{-1}b$, since $x = A^{-1}b$.
Defining the error at step $k$ as the vector
$$e^{(k)} = x^{(k)} - x,$$
where $x$ is the solution of the system, and applying equation (3.1) and Definition 3.2, we obtain the following result:
$$e^{(k)} = x^{(k)} - x = (Bx^{(k-1)} + f) - (Bx + f) = Be^{(k-1)} = B^{2}e^{(k-2)} = \dots = B^{k}e^{(0)}. \qquad (3.2)$$
The proof of this theorem can be found on page 119 of [21].
$$A = P - N,$$
with $B = P^{-1}N$ and $f = P^{-1}b$.
From Theorems 3.3 and 3.4, a sufficient condition for convergence is that $\|B\| < 1$; in this case $\|P^{-1}N\| < 1$, where $\|\cdot\|$ is any consistent matrix norm.
$$A = D - E - F,$$
where
$$D_{ij} = \begin{cases} a_{ij}, & i = j \\ 0, & i \neq j, \end{cases} \qquad E_{ij} = \begin{cases} -a_{ij}, & i > j \\ 0, & i \leq j, \end{cases} \qquad F_{ij} = \begin{cases} -a_{ij}, & i < j \\ 0, & i \geq j, \end{cases}$$
or rather
$$A = \begin{bmatrix} a_{11} & & \\ & \ddots & \\ & & a_{nn} \end{bmatrix} - \begin{bmatrix} 0 & & \\ & \ddots & \\ -a_{ij} & & 0 \end{bmatrix} - \begin{bmatrix} 0 & & -a_{ij} \\ & \ddots & \\ & & 0 \end{bmatrix},$$
$$P = D, \qquad N = E + F. \qquad (3.4)$$
From this relation, we can see that the $x^{(k+1)}$ components can be computed from those of the $x^{(k)}$ solution in any order, independently of each other.
In other words, the computation is parallelizable.
¹ Obviously, for the method to be operative, P must be easier to invert than A.
Figure 3.1 illustrates our serial MATLAB implementation of the Jacobi iterative method. We chose to implement a sort of vectorized version of it, where instead of forming the matrix P we have a vector p containing all the diagonal elements of A, and a vector f whose elements are those of the data vector b divided one-by-one by the diagonal elements of A. Thus, the computation of the component $x_i^{(k+1)}$ of the approximate solution at step $(k+1)$ is given by
$$x_i^{(k+1)} = \frac{1}{d_i}\Bigl[\,b_i - \sum_{\substack{j=1 \\ j\neq i}}^{n} a_{ij}\,x_j^{(k)}\Bigr], \qquad i = 1, \dots, n.$$
Our goal was to make the most of the routines of the BLAS and LAPACK libraries, trying to give MATLAB code in the most vectorized form possible. The matrix N, instead, is identical to the one in equation (3.4).
Jacobi_par is the parallel twin of the serial implementation of the Jacobi iterative method of figure 3.1. Like Jacobi_ser, it is a MATLAB function which returns the approximate solution and the step at which the computation stopped. For both of them, the iteration of the algorithm is implemented through a while construct. We used the Cauchy convergence criterion to stop the iterations, checking the gap between two subsequent approximate solutions with a generic vector norm and a fixed tolerance $\tau > 0$.
Generally, a stop criterion is given in the form
$$\|x^{(k)} - x^{(k-1)}\| \leq \tau,$$
or in the form
$$\frac{\|x^{(k)} - x^{(k-1)}\|}{\|x^{(k)}\|} \leq \tau,$$
which considers the relative gap between the two approximate solutions.
However, for numerical reasons, we chose to implement this criterion in the form (see [21], page 125)
$$\|x^{(k)} - x^{(k-1)}\| \leq \tau\,\|x^{(k)}\|.$$
To avoid an infinite loop, we also set a maximum number of iterations, so that our stop criterion is a boolean flag whose value is true if either of the two conditions holds (the Cauchy criterion or the maximum step number).
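The actual listing of Jacobi_ser is the one in figure 3.1; what follows is only a minimal sketch written along the lines of the description above (vector p of diagonal elements, f = b./p, matrix N as in equation (3.4), Cauchy criterion with tolerance tau and maximum number of iterations kmax):

function [x, k] = jacobi_ser_sketch(A, b, tau, kmax)
    % Vectorized Jacobi iteration: p holds the diagonal of A, f = b./p,
    % and N = E + F is the off-diagonal part of A with its sign changed.
    p = diag(A);
    f = b ./ p;
    N = -(A - diag(p));
    x = zeros(length(b), 1);
    k = 0;
    stop = false;
    while ~stop
        xold = x;
        x = (N * xold) ./ p + f;     % x^(k+1) = D^{-1}(N x^(k) + b)
        k = k + 1;
        stop = (norm(x - xold) <= tau * norm(x)) || (k >= kmax);
    end
end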
As we said at the beginning of this section, we used the SPMD construct to develop our parallel version of Jacobi. Figure 3.2 shows that all the computation and data distribution are internal to the SPMD body.
First of all, after having computed our N matrix and our p and f vectors, we put a conditional construct to verify whether a parpool was already set up, and to create one if it had not been done yet. We then realize our own data distribution scheme with the codistributor1d function.
This function allows the user to create their own data distribution scheme on the workers. In this case, the user can decide to distribute the data along one dimension, choosing between rows and columns of the matrix with the first argument of the function.
We opted for a distribution along rows, with equally distributed parts over all the workers, setting the second parameter to unsetPartition. With this choice, we left to MATLAB the task of choosing the best possible partition using the information about the global size of the matrix (the third argument of the function) and the partition dimension.
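In code, the distribution scheme described above reduces to a call of the following form (a sketch, where n is the order of the system and N the matrix computed before entering the spmd block):

spmd
    % rows (dimension 1), partition chosen by MATLAB, global size n-by-n
    codist = codistributor1d(1, codistributor1d.unsetPartition, [n n]);
    Nd = codistributed(N, codist);   % N distributed by rows over the workers
end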
Figure 3.3 shows a little example of the distribution of an identity matrix. As we can see, any operation executed inside the SPMD block involving distributed arrays is performed by each worker independently on the part of the array it stores.
This is the reason why we decided to use this construct to implement the Jacobi method: the computation of the components of the approximate solution at step $(k+1)$ can be performed in parallel by the workers, provided they have the whole solution vector of the prior step $k$ in their workspace (see lines 38–39 of the Jacobi_par function in figure 3.2).
To do so, we employed the gather function just before the computation of the solution at that step (line 37, figure 3.2).
The gather function is a reduction operation that allows the user to recollect all the pieces of a distributed array into a single workspace. Inside an SPMD statement, without any other argument, this function collects all the segments of data in all the workers' workspaces; thus, it perfectly fits our need for this particular algorithm, recollecting all the segments of the prior step solution vector.
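Again, the actual Jacobi_par listing is the one in figure 3.2; the following is only a hypothetical sketch of its structure, with N, p and f distributed by rows and gather used to recollect the whole previous iterate on every worker:

function [x, k] = jacobi_par_sketch(A, b, tau, kmax)
    n = length(b);
    p = diag(A);  f = b ./ p;  N = -(A - diag(p));
    if isempty(gcp('nocreate')), parpool; end          % open a pool if none exists
    spmd
        codist = codistributor1d(1, codistributor1d.unsetPartition, [n n]);
        Nd = codistributed(N, codist);                 % rows of N on each worker
        pd = codistributed(p, codistributor1d(1));
        fd = codistributed(f, codistributor1d(1));
        xd = codistributed(zeros(n, 1), codistributor1d(1));
        k = 0;  stop = false;
        while ~stop
            xold = gather(xd);                         % whole x^(k) on every lab
            xd   = (Nd * xold) ./ pd + fd;             % local rows of x^(k+1)
            xnew = gather(xd);
            k    = k + 1;
            stop = (norm(xnew - xold) <= tau * norm(xnew)) || (k >= kmax);
        end
    end
    x = gather(xd);     % outside spmd, xd is seen as a distributed array
    k = k{1};           % k is a Composite on the client
end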
Figure 3.5: Computation time for Jacobi_ser and Jacobi_par
As we can see from figure 3.5, our serial implementation turned out to be always faster than our parallel one. In fact, Jacobi_par produced slower computations and decreasing performance as the number of workers increased.
Looking for an explanation for this behaviour, we decided to profile our algorithm using the MATLAB Parallel Profiler tool.
The MATLAB Parallel Profiler is nothing different from the regular MATLAB Profiler: it just works on parallel blocks of code, enabling us to see how much time each worker spent evaluating each function, how much time was spent communicating or waiting for communication with the other workers, and so on.
⁴ See [22], page 109, for a demonstration of this statement.
Figure 3.10: Codistributed array operations: as we can see, they are all implemented as methods of the codistributed class.
Therefore, in this context the parallelization realized through the SPMD construct and the codistributed arrays did not perform in an efficient manner: in fact, to obtain some performance benefit from codistributed arrays, one should work with "data sizes that do not fit onto a single machine" [12].
Anyhow, we have to remember that our serial implementation may, in fact, not be serial: beyond the sure fact that the BLAS and LAPACK routines are more efficient than the MATLAB codistributed class operations, we cannot overlook the possibility that our computation may have been multithreaded by MATLAB's implicit multithreading mechanisms, or that our program was executed with parallelized instruction sets (such as those used by Intel's MKL), since our machine is equipped with an Intel processor. Obviously, without any information other than that reported in section 2.5 and the measured computation times, we cannot identify whether one, more, or none of these features played a role in our test.
Chapter 4

Computing the Trace of f(A)
$$H_l = Q_l^T A Q_l$$
The only differences between this method and the standard one are that:
The Global Lanczos method is very similar to the regular block Lanczos method. The only difference lies in the fact that the orthonormality of the basis V is required only between the blocks it is made of, while, in the standard block Lanczos iteration, every column of every block has to be orthonormal to every other.
¹ We suggest that the reader interested in this topic see [21], from page 141 to page 149, for demonstrations, implementations and implications of these algorithms.
In this case, the matrix $T_l$ becomes a symmetric tridiagonal matrix of the form
$$T_l = \begin{bmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \beta_3 & \alpha_3 & \ddots & \\ & & \ddots & \ddots & \beta_l \\ & & & \beta_l & \alpha_l \end{bmatrix};$$
the matrix $T_{l+1,l}$, instead, is the matrix $T_l$ with an extra row appended, whose elements are all zero except the last one, which is equal to $\beta_{l+1}$. Thus, it is equal to
$$T_{l+1,l} = \begin{bmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \beta_3 & \alpha_3 & \ddots & \\ & & \ddots & \ddots & \beta_l \\ & & & \beta_l & \alpha_l \\ & & & & \beta_{l+1} \end{bmatrix}. \qquad (4.4)$$
with $\mu(\lambda) := \sum_{i=1}^{k} \mu_i(\lambda)$; we can then define the following Gauss quadrature rules
$$G_l = \|W\|_F^2\, e_1^T f(T_l)\, e_1, \qquad R_{l+1,\zeta} = \|W\|_F^2\, e_1^T f(T_{l+1,\zeta})\, e_1,$$
where $T_l$ is the matrix resulting from the Global Lanczos decomposition of the matrix $A$ with initial block-vector $W$.
These rules are exact for polynomials of degree $2l-1$ and $2l$, respectively, so that, if the derivatives of $f$ behave nicely and $\zeta$ is suitably chosen, we have
$$G_l \leq I_f \leq R_{l+1,\zeta};$$
thus, we can determine upper and lower bounds for the functional $I_f$.
where
$$E_j = [\,e_{(j-1)k+1}, \dots, e_{\min\{jk,\,n\}}\,], \qquad j = 1, \dots, \tilde{n}, \qquad \tilde{n} := \Bigl[\frac{n+k-1}{k}\Bigr].$$
This result is rather remarkable, as it shows that there is a level of feasible parallelization.
Besides the computational speed increase given by the use of the Global Lanczos decomposition in the evaluation of the trace at each step of the summation, each term of it can, in fact, be computed independently.
through a for loop where each term of the summation is evaluated serially and then added to the other terms in the trc variable (line 70 of figure 4.1). The computation of each single term, which is an evaluation of the trace of an exponential function through the Global Lanczos decomposition, is realized through the gausstrexp function, which uses an algorithm similar to Algorithm 1 to obtain the decomposition.
Having already highlighted the feasible parallelizability of the computation of the Estrada index in equation (4.5), its actual MATLAB implementation presents itself as a perfect candidate for the use of the parfor construct.
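A hypothetical sketch of this parallelization (not the actual code of figure 4.1): here blocktrace is a naive, exact stand-in introduced only to make the example self-contained, whereas the thesis evaluates each term with the Global Lanczos based gausstrexp routine; A is the adjacency matrix and k the block size:

k    = 60;                              % block size (kmin in the text)
n    = size(A, 1);
nblk = ceil(n / k);                     % number of block-vectors E_j
% naive stand-in for gausstrexp: exact trace of the block, but O(n^3)
blocktrace = @(A, Ej) trace(Ej' * expm(full(A)) * Ej);
trc = 0;
parfor j = 1:nblk
    cols = (j-1)*k + 1 : min(j*k, n);
    Ej   = sparse(cols, 1:numel(cols), 1, n, numel(cols));   % columns of the identity
    trc  = trc + blocktrace(A, Ej);     % parfor reduction variable
end
estrada_index = trc;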
For all the computations involving sparse matrices, our parallel algorithm turned out to be consistently faster than its serial twin.
Figure 4.4: Computation time and speedup for the Yeast network adjacency
matrix (2114 nodes, sparse); kmin block-size is equal to 60
Figure 4.5: Computation time and speedup for the Internet network adja-
cency matrix (22963 nodes, sparse); kmin block-size is equal to 8
Figure 4.6: Computation time and speedup for the Facebook network adja-
cency matrix (63731 nodes, sparse); kmin block-size is equal to 60
into the computation to have some performance increase. In figure 4.7 for
example, where we have reported the test results for the Estrada index com-
putation of the matrix Power, the actual speedup of the execution time kicks
in only when we use more than 4 workers inside the parpool.
Figure 4.7: Computation time and speedup for the Power network adjacency
matrix (4941 nodes, dense); kmin block-size is equal to 40
However, the average results are rather significant, and the computations peaked at very high speedup values: more than 9, for example, for the Facebook adjacency matrix (see figure 4.6).
Since inside the parfor construct we cannot use any kind of profiler, as we did in chapter 3 with the SPMD syntax, we have less information about the use of computing resources by our algorithm.
What we noticed is that during the execution of the serial algorithm, MATLAB seems not to use all the available processors to carry out the computation; when executing the parallel algorithm, instead, it exploits all the resources to finish its job.
From figure 4.8, the impression is that only two processors are working: the computational load switches between the cores, leaving just two of them operating at a time. Figure 4.9 shows instead that during the parallel computation all the processors are intensively exploited: in fact, as figure 4.10³ illustrates, all the workers executed a roughly equal number of iterations, each of them working for more than 80 percent of the time. However, we did not investigate this aspect of our test any further, since, again, we do not have much information about MATLAB's internal mechanisms.
³ Figure 4.10 has been created with the parTicToc timing utility by Sarah Wait Zaranek, which can be found at [30].
Figure 4.10: Some information about the workers' execution load during the parallel computation of the Estrada index for the Internet network (22963 nodes, sparse). The machine is an Intel 5200U (2 cores with hyperthreading, for a total of 4 logical processors).
Another remarkable observation is that, while for the first tests the increase in speedup is almost linear in the number of workers, in general all the computations, after reaching a maximum, did not show any further improvement in their execution time.
We think that this feature may be related to the particular structures of the matrices, which may be especially suited for certain data distributions tied to a specific number of workers: thus, in this context, as for Jacobi_par, increasing the number of workers in the pool may not always lead to better performance.
It is peculiar, though, that all tests show a leveling of the performance of the algorithm for parpools of more than 12 workers: it would seem that MATLAB cannot really take advantage of the use of logical cores as it does for the physical ones.
Chapter 5
Conclusion
Our first goal when we started working on this bachelor thesis was to understand and explore the advantages, drawbacks and, in general, the overall implications of the parallel implementation of linear algebra algorithms.
We soon realized that our intent actually involved the knowledge and understanding of several scientific fields, especially computer science and engineering. Even when using a high-level toolbox such as MATLAB's PCT, awareness of its implementation details played a major role in evaluating the behaviour and performance of our software.
From the use of less finely tuned math operations with codistributed arrays, to an intensive use of the processors with the parfor construct, to the mechanism of implicit multithreading, in many cases our predictions based only on the theoretical formulation of our algorithms had a surprising hidden side, leading us to two main remarks.
On one hand, the use of multiple computing resources is becoming the next step in the evolution of both software programming and hardware architecture design, requiring engineers and software developers to master the concepts of parallel computing in order to really exploit the possibilities, in terms of performance, that today's computers offer at different software levels. This concerns not only the development of new software able to take advantage of parallel hardware characteristics, but also the reformulation of existing algorithms in a way that keeps them competitive in a parallel environment.
On the other hand, it is quite difficult to ask average computer users, even those with good programming skills and craft, who may not be expert computer scientists, to struggle with all the aspects related to the realization and, particularly, the optimization of parallel software. It is quite likely that software and hardware companies will put great effort into realizing high-level tools that take care of these duties without any burden on the user: the
Parallel Computing Toolbox and the Intel Math Kernel Library are perfect
examples of that.
Anyhow, as far as our applications and tests are concerned, we also realized that parallel programming requires an intimate knowledge of both the problem and the instruments through which the software is implemented. Many factors, from the instruction set employed, to the paradigm implemented by the programming language, to the architecture of the machine, can play a determinant role in the overall performance of a program, as parallelism can take place at many levels.
Besides the negative results (such as those of the parallelization of the Jacobi method) and the positive ones (such as those of the computation of the Estrada index) of our research, in a field like linear algebra a deeper understanding of the internal mechanisms of MATLAB would have brought more validity to our work.
However, our work leaves open new perspectives for further research on this topic, such as the use of explicit multithreading for parallel computations [1], the use of different parallel programming models with various memory structures and access patterns, and the use of other data distributions to see how they affect execution performance with both the SPMD and parfor constructs.
Bibliography

[7] Mike Croucher. Which MATLAB functions are multicore aware? http://www.walkingrandomly.com/?p=1894.
[9] J. J. Dongarra and D. W. Walker. The design of linear algebra libraries for high performance computers. August 1993.
[23] Gaurav Sharma and Jos Martin. MATLAB®: a language for parallel computing. International Journal of Parallel Programming, 37(1):3–36, 2009.
[28] Intel Support Team. Parallelism in the Intel® Math Kernel Library. https://software.intel.com/sites/default/files/m/d/4/1/d/8/4-2-ProgTools_-_Parallelism_in_the_Intel_C2_AE_Math_Kernel_Library.pdf.