Introduction to Parallel Programming and pMatlab v2.0
Abstract
The computational demands of software continue to outpace the capacities of processor and
memory technologies, especially in scientific and engineering programs. One option to improve
performance is parallel processing. However, despite decades of research and development,
writing parallel programs continues to be difficult. This is especially the case for scientists and
engineers who have limited backgrounds in computer science. MATLAB®, due to its ease of use
compared to other programming languages like C and Fortran, is one of the most popular
languages for implementing numerical computations, thus making it an excellent platform for
developing an accessible parallel computing framework.
The MIT Lincoln Laboratory has developed two libraries, pMatlab and MatlabMPI, that enable
parallel programming with MATLAB in a simple fashion, accessible to non-computer
scientists. This document will overview basic concepts in parallel programming and introduce
pMatlab.
1. Introduction
Even as processor speeds and memory capacities continue to rise, they never seem to be able to
satisfy the increasing demands of software; scientists and engineers routinely push the limits of
computing systems to tackle larger, more detailed problems. In fact, they often must scale back
the size and accuracy of their simulations so that they may complete in a reasonable amount of
time or fit into memory. The amount of processing power directly impacts the scale of the
problem being addressed.
One option for improving performance is parallel processing. There have been decades of
research and development in hardware and software technologies to allow programs to run on
multiple processors, improving the performance of computationally intensive applications. Yet
writing accurate, efficient, high performance parallel code is still highly non-trivial, requiring a
substantial amount of time and energy to master the discipline of high performance computing,
commonly referred to as HPC. However, the primary concern of most scientists and engineers is
to conduct research, not to write code.
The MIT Lincoln Laboratory has been developing technology to allow MATLAB users to
benefit from the advantages of parallel programming. Specifically, two MATLAB libraries –
pMatlab and MatlabMPI – have been created to allow MATLAB users to run multiple instances
of MATLAB to speed up their programs.*

* This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions,
interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the
United States Government. MATLAB® is a registered trademark of The MathWorks, Inc.

This document is targeted towards MATLAB users
who are unfamiliar with parallel programming. It provides a high-level introduction to basic
parallel programming concepts, the use of pMatlab to parallelize MATLAB programs and a
more detailed understanding of the strengths of global array semantics, the programming model
used by pMatlab to simplify parallel programming.
The rest of this paper is structured as follows. Section 2 explains the motivation for developing
technologies that simplify parallel programming, such as pMatlab. Section 3 provides a brief
overview of parallel computer hardware. Section 4 discusses the difficulties in achieving high
performance computation in parallel programs. Section 5 introduces the single-program
multiple-data model for constructing parallel programs. Section 6 describes the message-passing
programming model, the most popular parallel programming model in use today, used by
MatlabMPI. Section 7 describes global array semantics, a new type of parallel programming
model used by pMatlab, and Section 8 provides an introduction to the pMatlab library and how to
write parallel MATLAB programs using pMatlab. Section 9 discusses how to measure
performance of parallel programs and Section 10 concludes with a summary.
2. Motivation
As motivation for many of the concepts and issues to be presented in this document, we begin by
posing the following question: what makes parallel programming so difficult? While we are
familiar with serial programming, the transition to thinking in parallel can easily become
overwhelming. Parallel programming introduces new complications not present in serial
programming, as the following example illustrates.
Consider transposing a matrix. Let A be a 4×4 matrix. In a serial program, the value A(i,j) is
swapped with the value A(j,i). In a parallel program, matrix transpose is much more complicated.
Let A be distributed column-wise across four processors, such that column n resides on
processor n-1, e.g. A(1,1), A(2,1), A(3,1) and A(4,1) are on processor 0, A(1,2), A(2,2), A(3,2) and A(4,2) are on
processor 1, and so on. (Numbering processors starting at 0 is common practice.) To correctly
transpose the matrix, data must “move” between processors. For example, processor 0 sends
A(2,1), A(3,1) and A(4,1) to processors 1, 2 and 3 and receives A(1,2), A(1,3) and A(1,4) from processors 1, 2
and 3, respectively. Figure 1 depicts a parallel transpose.
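To make the bookkeeping concrete, below is a minimal sketch of how this data movement might be coded by hand with the MatlabMPI-style send and receive calls introduced in Section 6.1. The variable my_col, which holds this processor's column of A, and the exact call signatures are assumptions for illustration only.

```matlab
MPI_Init;                                  % start MatlabMPI
comm    = MPI_COMM_WORLD;                  % default communicator
my_rank = MPI_Comm_rank(comm);             % this processor's rank (0..3)
Np      = MPI_Comm_size(comm);             % number of processors (4)
tag     = 0;

% Send each off-diagonal element of our column to the processor that needs it.
% MatlabMPI sends are buffered through the file system, so they do not block.
for p = 0:Np-1
    if p ~= my_rank
        MPI_Send(p, tag, comm, my_col(p+1));   % A(p+1, my_rank+1) goes to processor p
    end
end

% Assemble our column of the transpose: new_col(k) = A(my_rank+1, k).
new_col = zeros(Np, 1);
new_col(my_rank+1) = my_col(my_rank+1);        % diagonal element stays local
for p = 0:Np-1
    if p ~= my_rank
        new_col(p+1) = MPI_Recv(p, tag, comm); % receive A(my_rank+1, p+1)
    end
end

MPI_Finalize;
```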
Clearly, the increase in speed and memory associated with parallel computing requires a change
in the programmer’s way of thinking; she must now keep track of where data reside and where
data need to be moved. Becoming a proficient parallel programmer can require years of
practice. The goal of pMatlab is to hide these details and provide an environment where
MATLAB programmers can benefit from parallel processing without focusing on them.
3. Parallel Computers
Any discussion on parallel programming is incomplete without a discussion of parallel
computers. An exhaustive discussion of parallel computer architecture is outside the scope of
this paper. Rather, this section will provide a very high-level overview of parallel hardware
technology.
There are numerous ways to build a parallel computer. On one end of the spectrum is the
multiprocessor, in which multiple processors are connected via a high performance
communication medium and are packaged into a single computer. Many modern
“supercomputers,” such as those made by Cray and IBM, are multiprocessors composed of
dozens to hundreds or even thousands of processors linked by high-performance interconnects.
Twice a year, the TOP500 project ranks the 500 most powerful computers in the world [1]. The
top ranked computer on the TOP500 list as of June 2005 is IBM’s Blue Gene/L at Lawrence
Livermore National Laboratory [2]. Blue Gene/L has 65,536 processors, with plans to increase
that number even further. Even many modern desktop machines are multiprocessors, supporting
two CPUs.
On the other end is the Beowulf cluster which uses commodity networks (e.g. Gigabit Ethernet)
to connect commodity computers (e.g. Dell) [3]. The advantage of Beowulf clusters is their
lower costs and greater availability of components. Performance of Beowulf clusters has begun
to rival that of traditional “supercomputers.” In November 2003, the TOP500 project ranked a
Beowulf cluster composed of 1100 dual-G5 Apple XServe’s at Virginia Tech the 3rd most
powerful computer in the world [4]. MATLAB supports many different operating systems and
processor architectures; thus, components for building a Beowulf cluster to run pMatlab and
MatlabMPI are relatively cheap and widely available. Figure 2 shows the differences between
multiprocessors and Beowulf clusters.
At first glance, it may appear that adding more processors will make a parallel program run ever
faster. Not necessarily. A common misconception by programmers new to
parallel programming is that adding more processors always improves performance of parallel
programs. There are several basic obstacles that restrict how much parallelization can improve
performance.
4.3. Communication
Almost all parallel programs require some communication. Clearly communication is required
to distribute sub-problems to processors. However, even after a problem is distributed,
communication is often required during processing, e.g. redistributing data or synchronizing
processors. Regardless, communication adds overhead to parallel processing. As the number of
processors grows, work per processor decreases but communication can – and often does –
increase. It is not unusual to see performance increase, peak, then decrease as the number of
processors grows, due to an increase in communication. This is referred to as slowdown or
speeddown and can be avoided by carefully choosing the number of processors such that each
processor has enough work relative to the amount of communication. Finding this balance is not
generally intuitive; it often requires an iterative process.
One way to mitigate the effects of communication on performance is to reduce the total number
of individual messages by sending a few large messages rather than sending many small
messages. All networking protocols incur a fixed amount of overhead when sending or receiving
a message. Rather than sending multiple messages, it is often worthwhile to wait and send a
single large message, thus incurring the network overhead once rather than multiple times as
depicted in Figure 4. Depending on the communication medium, overhead may be large or
small. In the case of pMatlab and MatlabMPI, overhead per message can be large; therefore, it is
to the user’s advantage not only to reduce the total amount of communication but also reduce the
total number of messages, when possible.
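As a hedged illustration, the two approaches look like the sketch below. It uses the MatlabMPI-style MPI_Send call described in Section 6.1, and assumes dest, tag, comm and a vector data are already defined.

```matlab
% Many small messages: the per-message overhead is paid length(data) times.
for ii = 1:length(data)
    MPI_Send(dest, tag+ii, comm, data(ii));  % unique tag per message
end

% One large message: the per-message overhead is paid once.
MPI_Send(dest, tag, comm, data);
```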
5. Single-Program Multiple-Data
There are multiple models for constructing parallel programs. At a very high level, there are two
basic models: multiple-program multiple-data (MPMD) and single-program multiple-data
(SPMD).
In the MPMD model, each processor executes a different program (“multiple-program”), with
the different programs working in tandem to process different data (“multiple-data”). Streaming
signal processing chains are excellent examples of applications that can be parallelized using the
MPMD model. In a typical streaming signal processing chain, data streams in and flows through
a sequence of operations. In a serial program, each operation must be processed in sequential
order on a single processor. In a parallel program, each operation can be a separate program
executing simultaneously on different processors. This allows the parallel program to process
multiple sets of data simultaneously, in different stages of processing.
Consider an example of a simulator that generates simulated data to test a signal processing
application, shown in Figure 5. The simulator is structured as a pipeline, in which data flow in
one direction through various stages of processing. Stages in the pipeline can be executed on
different processors or a single stage could even execute on multiple processors. If two
consecutive stages each run on multiple processors, it may be necessary to redistribute data from
one stage to the next; this is known as a cornerturn.
Figure 6 – Each arrow represents one Monte Carlo simulation. The top set of arrows represents serial
execution of the simulations. The bottom set of arrows represents parallel execution of the simulations.
Note that a pipeline does not necessarily speed up processing a single frame. However, since
each stage executes on different processors, the pipeline can process multiple frames
simultaneously, increasing throughput.
In the SPMD model, the same program runs on each processor (“single-program”) but each
program instance processes different data (“multiple-data”). Different instances of the program
may follow different paths through the program (e.g. if-else statements) due to differences in
input data, but the same basic program executes on all processors. pMatlab and MatlabMPI
follow the SPMD model. Note that SPMD programs can be written in an MPMD manner by
using if-else statements to execute different sections of code on different processors. In fact, the
example shown in Figure 5 is actually a pMatlab program written in an MPMD manner. The
scalability and flexibility of the SPMD model has made it the dominant parallel programming
model. SPMD codes range from tightly coupled (fine-grain parallelism) simulations of
evolution equations predicting chemical processes, weather and combustion, to “embarrassingly
parallel” (coarse-grain) programs searching for genes, data strings, and patterns in data
sequences or processing medical images or signals.
Monte Carlo simulations are well suited to SPMD models. Each instance of a Monte Carlo
simulation has a unique set of inputs. Since the simulations are independent, i.e. the input of one
is not dependent on the output of another, the set of simulations can be distributed among
processors and execute simultaneously, as depicted in Figure 6.
Figure 7 – In OpenMP, threads are spawned at parallel regions of code marked by the programmer. In MPI,
every process runs the entire application.
The two most common implementations of the SPMD model are the Message Passing Interface
(MPI) and OpenMP. In MPI, the program is started on all processors. The program is
responsible for distributing data among processors, then each processor processes its section of
the data, communicating among themselves as needed. MPI is an industry standard and is
implemented on a wide range of parallel computers, from multiprocessor to cluster architectures
[5]. Consequently, MPI is the most popular technology for writing parallel SPMD programs.
More details of MPI are discussed in the next section.
The OpenMP model provides a more dynamic approach to parallelism by using a fork-join
approach [6]. In OpenMP, programs start as a single process known as a master thread that
executes on a single processor. The programmer designates parallel regions in the program.
When the master thread reaches a parallel region, a fork operation is executed that creates a team
of threads, which execute the parallel region on multiple processors. At the end of the parallel
region, a join operation terminates the team of threads, leaving only the master thread to continue
on a single processor.
pMatlab and MatlabMPI use the MPI model; thus, the code runs on all processors for the entire
runtime of the program. Figure 7 graphically depicts the difference between pMatlab/MatlabMPI
and OpenMP programs. At first, this may seem unnecessarily redundant. Why bother executing
non-parallel code on every processor?
Consider initialization code (which is typically non-parallel) that initializes variables to the same
value on each processor. Why not execute the code on a single processor, then send the values to
the other processors? Doing so incurs communication, which can significantly reduce performance.
If each processor executes the initialization code, the need for communication is removed. Using
a single processor for non-parallel code doesn’t improve performance or reduce runtime of the
overall parallel program.
The fork-join model works with OpenMP because OpenMP runs on shared-memory systems, in
which every processor uses the same physical memory. Therefore, communication between
threads does not involve high latency network transactions but rather low-latency memory
transactions.
6. Message Passing
Within the SPMD model, there are multiple styles of programming. One of the most common
styles is message passing. In message passing, all processes are independent entities, i.e. each
process has its own private memory space. Processes communicate and share data by passing
messages.
Figure 8 – Example of message passing. Each color represents a different processor. The upper half
represents program flow on each processor. The lower half depicts each processor’s memory space.
MPI is a basic networking library tailored to parallel systems, enabling processes in a parallel
program to communicate with each other. Details of the underlying network protocols and
infrastructure are hidden from the programmer. This helps achieve MPI’s portability mandate
while enabling programmers to focus on writing parallel code rather than networking code.
To distinguish processes from each other, each process is given a unique identifier called a
rank. The rank is the primary mechanism MPI uses to distinguish processes from each other.
Ranks are used for various purposes. MPI programs use ranks to identify which data should be
processed by each process or which operations should be run on each process. For example:
if (rank == 0)
% Rank 0 executes the if block
else
% All other ranks execute the else block
end
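Ranks can likewise be used to decide which portion of the data each process works on. The following is a hedged sketch; it assumes the number of processes Np is known (e.g. via MatlabMPI's MPI_Comm_size) and that N is a multiple of Np.

```matlab
N = 1024;                       % total number of iterations
block = N / Np;                 % iterations per process
first = rank * block + 1;       % this process's first iteration
last  = (rank + 1) * block;     % this process's last iteration
for ii = first:last
    % ... process iteration ii ...
end
```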
6.1. MatlabMPI
MatlabMPI is a MATLAB implementation of the MPI standard developed at the Lincoln
Laboratory to emulate the “look and feel” of MPI [7]. MatlabMPI is written in pure MATLAB
code (i.e. no MEX files) and uses MATLAB’s file I/O functions to perform communication,
allowing it to run on any system that MATLAB supports. MatlabMPI has been used
successfully both inside and outside Lincoln to obtain speedups in MATLAB programs
previously not possible. Figure 9 depicts a message being sent and received in MatlabMPI.
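A minimal sketch of a MatlabMPI exchange is shown below. The function names and signatures (MPI_Init, MPI_COMM_WORLD, MPI_Comm_rank, MPI_Send, MPI_Recv, MPI_Finalize) are assumed to follow the MatlabMPI interface described in [7] and [10].

```matlab
MPI_Init;                          % initialize MatlabMPI
comm    = MPI_COMM_WORLD;          % default communicator
my_rank = MPI_Comm_rank(comm);     % rank of this process

tag = 1;                           % tag identifying this message
if my_rank == 0
    data = rand(4);
    MPI_Send(1, tag, comm, data);  % rank 0 sends the matrix to rank 1
elseif my_rank == 1
    data = MPI_Recv(0, tag, comm); % rank 1 receives it from rank 0
end

MPI_Finalize;                      % shut down MatlabMPI
```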
Despite MPI’s popularity, writing parallel programs with MPI and MatlabMPI can still be
difficult. Explicit communication requires careful coordination of sending and receiving
messages and data that must be divided among processors can require complex index
computations. The next section describes how message passing can be used to create a new
parallel programming model, called global array semantics, that hides the complexity of message
passing from the programmer.
7. Global Array Semantics
Global array semantics is a parallel programming model in which the programmer views an
array as a single global array rather than multiple, independent arrays located on different
processors, as in MPI. The ability to view related data distributed across processors as a single
array more closely matches the serial programming model, thus making the transition from serial
to parallel programming much smoother.
A global array library can be implemented using message passing libraries such as MPI or
MatlabMPI. The program calls functions from the global array library. The global array library
determines if and how data must be redistributed and calls functions from the message passing
library to perform the communication. Communication is hidden from the programmer; arrays
are automatically redistributed when necessary, without the knowledge of the programmer. Just
as message passing libraries like MPI allow programmers to focus on writing parallel code rather
than networking code, global array libraries like pMatlab allow programmers to focus on writing
scientific and engineering code rather than parallel code.
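For instance, a parallel FFT written against a global array library reads almost like serial MATLAB. Below is a hedged sketch using the pMatlab syntax covered in Section 8, with N and Np assumed to be defined.

```matlab
M1 = map([1 Np], {}, 0:Np-1);   % distribute columns over Np processors
D  = rand(N, M1);               % N-by-N distributed matrix (dmat)
E  = fft(D);                    % overloaded fft; any communication needed
                                % happens inside the library
```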
7.1. pMatlab
pMatlab brings global array semantics to MATLAB, using the message passing capabilities of
MatlabMPI [8]. The ultimate goal of pMatlab is to move beyond basic messaging (and its
inherent programming complexity) towards higher level parallel data structures and functions,
allowing any MATLAB user to parallelize their existing program by simply changing and adding
a few lines, rather than rewriting their program, as would likely be the case with MatlabMPI.
Clearly, pMatlab allows the scientist or engineer to access the power of parallel programming
while focusing on the science or engineering. Figure 11 depicts the hierarchy between pMatlab
applications, the pMatlab and MatlabMPI libraries, MATLAB and the parallel hardware.
tag = 0;
if (my_rank==0) | (my_rank==1) | (my_rank==2) | (my_rank==3)
A_local=fft(A_local);
for ii = 0:3
MPI_Send(ii+4, tag, comm, A_local(ii*M/4 + 1:(ii+1)*M/4,:));
end
end
Figure 11 – Parallel MATLAB consists of two layers. pMatlab provides parallel data structures and
library functions. MatlabMPI provides messaging capability.
8. Introduction to pMatlab
This section introduces the fundamentals of pMatlab: creating data structures; writing and
launching applications; and testing, debugging and scaling your application. A tutorial for
pMatlab can be found in [9]. For details on functions presented here and examples, refer to [10].
Currently, pMatlab overloads four MATLAB constructor functions: zeros, ones, rand, and
spalloc. Each of these functions accepts the same set of parameters as its corresponding
MATLAB function, with the addition of a map parameter. The map parameter accepts a map
object which describes how to distribute the dmat object across multiple processors.
Figure 12 contains examples of how to create dmats of varying dimensions. Section 8.2
describes map objects and how to create them in greater detail. Section 8.3 discusses how to
program with dmats. Refer to [10] for details on each constructor function.
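As a hedged sketch of the pattern (the map syntax is described in Section 8.2; N and Np are assumed to be defined):

```matlab
M1 = map([1 Np], {}, 0:Np-1);   % distribute columns over Np processors
A  = zeros(N, M1);              % N-by-N dmat of zeros mapped by M1
B  = ones(N, M1);               % N-by-N dmat of ones mapped by M1
C  = rand(N, M1);               % N-by-N dmat of random values mapped by M1
% spalloc similarly takes its usual MATLAB arguments plus a trailing map.
```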
8.2. Maps
For a dmat object to be useful, the programmer must tell pMatlab how and where the dmat must
be distributed. This is accomplished with a map object. To create a dmat, the programmer first
creates a map object, then calls one of several constructor functions, passing in the dmat’s
dimensions and the map object.
• 3D map, 2×3×2 grid, block-cyclic along rows and columns with block size 2, cyclic along
third dimension:
• 2D map, 2×2 grid, block along both dimensions, overlap in the column dimension of size
1 (1 column overlap). This example shows how to specify a map in a single line of code:
pMatlab overloads a number of basic MATLAB functions. Most of these overloaded functions
implement only a subset of the available functionality of each function. For example,
MATLAB’s zeros function allows matrix dimensions to be specified by arbitrarily long vectors
while the pMatlab zeros function does not.
Occasionally, pMatlab users may wish to directly manipulate a dmat object. Thus, pMatlab
provides additional functions that allow the user to access the contents of a dmat object without
using overloaded pMatlab functions.
This section will introduce various classes of functions available in pMatlab and touch on the
major differences between MATLAB and pMatlab functions.
N = 1000000;
% Fully supported
A(:,:) % Entire matrix reference
A(i,j) % Single element reference
% Partially supported
A(i:k, j:l) % Arbitrary submatrix reference
Figure 18 – Examples of fully supported and partially supported cases of subsref in pMatlab.
Subscripted assignment
pMatlab’s subsasgn operator supports nearly all capabilities of MATLAB’s built-in operator.
pMatlab’s subsasgn can assign any arbitrary subsection of a dmat. If a subsection spans
multiple processors, subsasgn will automatically write to the indices on the appropriate
processors.
subsasgn in pMatlab has one restriction. In MATLAB, the parentheses operator can be
omitted when assigning to the entire matrix. In pMatlab, the parentheses operator must be used
in this case. Performing an assignment without () simply performs a memory copy and the
destination dmat will be identical – including the mapping – to the source; data will not be
redistributed. See Figure 17 for an example.
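A hedged sketch of the distinction, using maps and dmats as in the earlier examples:

```matlab
M1 = map([1 Np], {}, 0:Np-1);   % column distribution
M2 = map([Np 1], {}, 0:Np-1);   % row distribution

A = rand(N, M1);                % dmat distributed by columns
B = zeros(N, M2);               % dmat distributed by rows

B(:,:) = A;                     % overloaded subsasgn: data are redistributed
                                % from A's map (M1) to B's map (M2)
C = A;                          % plain copy: C keeps A's map; no redistribution
```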
Subscripted reference
The capabilities of pMatlab’s subsref operator are much more limited than those of subsasgn.
subsref returns a “standalone” dmat only when referencing the entire dmat or a single element (see Figure 18).
Referring to an arbitrary subsection will not produce a “standalone” dmat, i.e. the resulting dmat
cannot be directly used as an input to any pMatlab function, with the exception of local. The
details are outside the scope of this document; this is a limitation of pMatlab and addressing
these issues requires further research. For more information about the local function, see
Section 8.3.4. Figure 18 contains examples of fully and partially supported cases of subsref.
N = 1000000;
M1 = map([1 Np], {}, 0:Np-1);
M2 = map([Np 1], {}, 0:Np-1);
% Absolute value
A = rand(N, M1); % NxN dmat mapped to M1
B1 = abs(A); % Result is mapped to M1
% FFT
D = rand(N, M1); % NxN dmat mapped to M1
E1 = fft(D); % Result is mapped to M1
The local and put_local functions give the user direct access to the local contents of dmat
objects. Each processor in a dmat’s map contains a section of the dmat. local returns a matrix
that contains the section of the dmat that resides on the local processor. Conversely,
put_local writes the contents of a matrix into that processor’s section of the dmat. The
local and put_local functions allow the user to perform operations not implemented in
pMatlab while still taking advantage of pMatlab’s ability to distribute data and computation.
When local returns the local portion of a dmat, the resulting matrix loses global index
information, i.e. which indices of the global dmat it corresponds to. See Figure 21 for an example.
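A hedged sketch of the local/put_local idiom follows; myfunc is a hypothetical ordinary MATLAB function that pMatlab does not overload, and the calling form of put_local is assumed.

```matlab
M1 = map([1 Np], {}, 0:Np-1);
A  = rand(N, M1);              % distributed N-by-N matrix

A_local = local(A);            % plain MATLAB matrix: this processor's portion
A_local = myfunc(A_local);     % hypothetical function; any ordinary MATLAB code
A = put_local(A, A_local);     % write the result back into the dmat (assumed form)
```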
Sometimes it is necessary for each processor to know which global indices it owns. pMatlab
provides functions that compute the global indices each processor owns for a given dmat; refer to [10] for details.
Line 4 defines the Ncpus variable, which specifies how many processors to run on.
Line 7 defines the mfile variable, which specifies which pMatlab program to run. mfile
should be set to the filename of the pMatlab program without the .m extension.
Lines 12 and 15-18 are examples of different ways to define the cpus variable, which specifies
where to launch the pMatlab program. The user should uncomment the line(s) he wishes to use.
• Line 12 directs pMatlab to launch all MATLAB processes on the user’s local machine;
this is useful for debugging purposes, but Ncpus should be set to a small number (i.e. less
than or equal to 4) to prevent overloading the local machine’s processor.
• Lines 15-18 direct pMatlab to launch the MATLAB processes on the machines in the
list. If Ncpus is smaller than the list, pMatlab will use the first Ncpus machines in the
list. If Ncpus is larger than the list, pMatlab will wrap around to the beginning of the list.
These are the basic ways to launch pMatlab; for more details, please refer to pRUN and
MPI_Run in [10].
Line 23 launches the pMatlab program specified by mfile on Ncpus processors, using the
machines specified in cpus.
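A hedged sketch of such a launch script, in the spirit of Figure 22, is shown below; line numbers will differ, and pRUN's argument order is assumed from [10].

```matlab
Ncpus = 4;                       % number of MATLAB processes to launch
mfile = 'sample_app';            % pMatlab program to run (no .m extension)

cpus = {};                       % launch everything on the local machine, or
% cpus = {'node1','node2','node3','node4'};  % launch on this list of machines

pRUN(mfile, Ncpus, cpus);        % start the parallel job
```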
Note that program-specific parameters cannot be passed into pMatlab applications via
pRUN. Program parameters should be implemented as variables initialized at the beginning of
the pMatlab application specified in mfile, such as line 1 in the pMatlab code shown in Figure
23, which will be explained in the next section.
Line 4 enables or disables the pMatlab library. If PARALLEL is set to 0, then the script will not
call any pMatlab functions or create any pMatlab data structures and will run serially on the local
processor. If PARALLEL is set to 1, then the pMatlab library will be initialized, dmats will be
created instead of regular MATLAB matrices, and any functions that accept dmat inputs will
call the overloaded pMatlab functions.
If PARALLEL is set to 1, then lines 8 through 12 are executed. Lines 10 and 11 create map objects
for the dmats X and Y. The maps distribute the dmats column-wise.
Lines 15 and 16 construct the dmats X and Y. X is an N×N dmat of random values and Y is an
N×N dmat of zeros.
Line 19 calls fft on X and assigns the output to Y. Since X is a dmat, rather than call the built-
in fft, MATLAB calls pMatlab’s fft that is overloaded for dmat.
Lines 28 through 30 finalize the execution of the pMatlab library if PARALLEL is set to 1.
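Putting these pieces together, below is a hedged sketch of an application in the spirit of Figure 23. Line numbers will differ, and the names pMatlab_Init, pMatlab_Finalize and the pMATLAB variable are assumed to behave as described above and in [10].

```matlab
N = 1024;                        % program parameter (cf. line 1 of Figure 23)
PARALLEL = 1;                    % set to 0 to run serially without pMatlab

mapX = 1;                        % with the maps left as 1, the constructors
mapY = 1;                        % below create ordinary MATLAB matrices
if (PARALLEL)
    pMatlab_Init;                          % assumed initialization routine
    Ncpus = pMATLAB.comm_size;             % number of processors from pMATLAB
    mapX  = map([1 Ncpus], {}, 0:Ncpus-1); % distribute X column-wise
    mapY  = map([1 Ncpus], {}, 0:Ncpus-1); % distribute Y column-wise
end

X = rand(N, N, mapX);            % N-by-N dmat (plain matrix if PARALLEL = 0)
Y = zeros(N, N, mapY);           % N-by-N dmat of zeros

Y(:,:) = fft(X);                 % overloaded fft when X is a dmat

if (PARALLEL)
    pMatlab_Finalize;            % assumed finalization routine
end
```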
In MATLAB, the user can issue commands interactively, by typing them directly at the MATLAB
command prompt, or by invoking scripts and functions that contain MATLAB code. The former
is MATLAB’s definition of “interactive.”
In pMatlab, the user can issue pMatlab commands only by invoking scripts or functions that
contain pMatlab code. pMatlab commands cannot be run directly from the command
prompt. Thus, pMatlab does not support MATLAB’s notion of “interactive” use.
While pMatlab is thus limited to batch-style processing, it introduces a level of interactivity to
parallel programming previously unseen. To launch pMatlab jobs, a user first starts MATLAB
on his personal machine, then runs the RUN.m script, such as the one presented in Figure 22.
When the pMatlab job starts it not only launches MATLAB processes on remote machines but it
also uses the existing MATLAB process, known as the leader process, on the user’s personal
machine as well, as shown in Figure 24. Consequently, unlike the traditional batch process
model of submitting a parallel job, waiting for a notification of completion, transferring files to a
desktop and then post-processing the results, pMatlab users are able to use their personal display
and utilize MATLAB’s graphics capabilities while the job runs or after it completes.
Like any well-engineered software, pMatlab code should be written in such a manner that the pMatlab library
can be easily disabled to run the application serially, i.e. on a single processor. Developing
applications in this manner assists the user in locating bugs. In parallel computing there are
generally two types of errors, those resulting from algorithmic choice and those resulting from
parallel implementation. In all cases it is recommended that the code be run in serial first to
resolve any algorithmic issues. After the code has been debugged in serial, any new bugs in the
parallel code can be assumed to arise from the parallel implementation. Being able to easily
enable and disable the pMatlab library can help the user determine if a bug is caused by the
algorithm or by parallelization. Figure 22 provides an example of this development approach,
using the PARALLEL variable.
The practice of avoiding hard-coded numbers of processors applies to all aspects
of the code. The more the code is parameterized, the less rewriting, rebuilding and retesting is
required. Thus, when initializing pMatlab, make sure to obtain the number of processors from
the pMATLAB variable and store it to a variable like Ncpus. Use this variable when creating
maps in order to create scalable, flexible code. See Figure 22 for an example.
if (Pid == 0)
% Process D_agg on Pid 0
end
pMatlab users should not need to write code that depends on the Pid and should indeed avoid it,
as Pid-dependent code can potentially break or cause unexpected behavior in pMatlab programs.
There are some exceptions to this rule. One is related to the agg function which returns the
entire contents of a dmat to the leader processor but returns only the local portion of matrix to
the remaining processors. There may be cases where the user wishes to operate on the
aggregated matrix without redistributing it. In this case, the code that operates on the aggregated
matrix might be within an if block that runs on only the leader processor. Note that Pid 0 is
chosen as the leader processor because there will always be a Pid 0, even when running on a
single processor.
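A hedged sketch of this idiom, where D is a dmat and agg and Pid behave as described above:

```matlab
D_agg = agg(D);            % full matrix on the leader, local portion elsewhere
if (Pid == 0)
    % post-process or display the aggregated matrix on the leader only
    imagesc(abs(D_agg));
end
```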
The other main exception to writing Pid dependent code is associated with I/O. If the user
wishes each process in the pMatlab program to save its results to a .mat file, then each
processor must save its results to a unique filename. The simplest way to accomplish this is to
include the rank in the filename as each rank is unique. Figure 25 provides examples of “safe”
Pid-dependent pMatlab code.
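A hedged sketch of rank-tagged output, with Pid as above and an arbitrary filename pattern:

```matlab
my_results = local(D);                       % this processor's portion of the dmat
fname = sprintf('results_%d.mat', Pid);      % unique filename per process
save(fname, 'my_results');                   % each process writes its own file
```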
8.7.2. Testing
When a serial MATLAB program fails, the user is notified directly via an error message (and
usually a beep) at the MATLAB command prompt. Failures in pMatlab programs can be much
more devious and subtle. A failure in just one MATLAB process can cause the entire program to
hang. For instance, suppose a pMatlab program contains a cornerturn, which requires all
processes to communicate with all other processes. If one MATLAB process fails prior to the
cornerturn, the remaining processes will hang while waiting for a message that will never arrive
from the failed process. All non-leader MATLAB processes redirect their output – and error
messages – to .out files in the MatMPI directory. If the error occurs on a non-leader process,
the user will receive no notification of the error. This is a common occurrence in many pMatlab
programs. Users should make a habit of routinely checking the contents of the
MatMPI/*.out files when they suspect their programs have hung.
Improving performance via parallelism is no excuse for inefficient code. Parallel programs can
also benefit from profiling to identify bottlenecks in both serial and parallel sections.
Since pMatlab is SPMD, every processor should execute every line of code. However, since the
debugger only runs on the leader process, breakpoints do not affect non-leader processes. Non-
leader processes will simply continue past breakpoints set by the user until they reach a
parallelized operation which requires communication with the leader. However, this
asynchronous behavior will not affect the behavior of the program or the debugger.
8.7.4. Scaling
Another tenet of good software engineering is that programs should not be run on full scale
inputs immediately. Rather, programs should initially be run on a small test problem to verify
functionality and to validate against known results. The programs should be scaled to larger and
more complicated problems until the program is fully validated and ready to be run at full scale.
The same is true for parallel programming. Parallel programs should not run at full scale on 32
processors as soon as the programmer has finished taking a first stab at writing the application.
Both the test input and number of processors should be gradually scaled up.
The following is the recommended procedure for scaling up a pMatlab application. This
procedure gradually adds complexity to running the application.
1. Run with 1 processor on the user’s local machine with the pMatlab library disabled.
This tests the basic serial functionality of the code.
2. Run with 1 processor on the local machine with pMatlab enabled. Tests that the
pMatlab library has not broken the basic functionality of the code.
3. Run with 2 processors on the local machine. Tests that the program works with
more than one processor, without network communication.
4. Run with 2 processors on multiple machines. Tests that the program works with
network communication.
5. Run with 4 processors on multiple machines.
6. Increase the number of processors, as desired.
Figure 26 shows the sequence of parameters that should be used to scale pMatlab applications.
See Figure 22 and Figure 23 for examples on how to set these parameters.
9.1. Speedup
The most common measure of performance for parallel programs is speedup. For input size n
and number of processors p, speedup is the ratio between the serial and parallel runtimes:
$\text{speedup}(n, p) = \frac{\text{time}_{\text{serial}}(n)}{\text{time}_{\text{parallel}}(n, p)}$
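For example, a program that takes 100 seconds to run serially and 25 seconds to run on 8 processors achieves a speedup of 100/25 = 4. Speedup is typically characterized as one of the following: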
• Linear – Occurs when the application runs p times faster on p processors than on a single
processor. In general, linear (also known as ideal) speedup is the theoretical limit.
• Superlinear – In rare cases, a parallel program can achieve speedup greater than linear.
This can occur if the subproblems fit into cache or main memory, significantly reducing
data access times.
• Sublinear – Occurs when adding more processors improves performance, but not as fast
as linear. This is a common speedup observed in many parallel programs.
• Saturation – Occurs when communication time grows faster than computation time
decreases. Often there is a “sweet spot” when performance peaks. This is also a
common speedup observed in parallel programs.
Figure 27 – Speedup versus number of processors for linear, superlinear, sublinear and saturation speedup curves.
Amdahl’s Law bounds the speedup of a parallel program: if f is the fraction of the program that must be executed serially, then

$\text{speedup} \leq \frac{1}{f + (1 - f)/p}$

From this equation, we see that as the number of processors increases, the term $(1-f)/p$
approaches 0, resulting in the following:
$\lim_{p \to \infty} \text{speedup} \leq \frac{1}{f}$
As we can see, the maximum possible speedup for a parallel program depends on how much of a
program can be parallelized. Consider three programs with three different values of f: 0.1, 0.01,
and 0.001. The maximum possible speedups for these programs are 10, 100, and 1000,
respectively. Figure 28 shows the speedups for these programs up to 1024 processors.
Amdahl’s Law clearly demonstrates the law of diminishing returns: as the number of processors
increases, the amount of speedup attained by adding more processors decreases.
Figure 28 – Theoretical maximum speedups for three parallel programs with different values of f.
10. Conclusion
In this document, we have introduced basic parallel programming concepts, parallel computer
hardware, the SPMD and message passing programming models, global array semantics, the
pMatlab library and how to write, run, debug and scale pMatlab programs, and how to measure
the performance of parallel programs.
And yet, this is just the beginning. Before you tackle writing your own pMatlab application, take
a moment to look at other available pMatlab documentation. For example, many programs fall into
a class of applications known as parameter sweep applications in which a set of code, e.g. a
Monte Carlo simulation, is run multiple times using different input parameters. [12] introduces a
pMatlab template that can be used to easily parallelize existing parameter sweep applications.
Additionally, the pMatlab distribution contains an examples directory which contains example
programs that can be run and modified.
11. References
[1] TOP500 webpage. http://www.top500.org.
[2] IBM Blue Gene webpage. http://www.research.ibm.com/bluegene/
[3] T. L. Sterling, J. Salmon, D. J. Becker, and D. F. Savarese. How to Build a Beowulf: A
Guide to the Implementation and Application of PC Clusters. MIT Press, Cambridge, MA,
1999.
[4] Virginia Tech’s System X webpage. http://www.tcf.vt.edu/systemX.html
[5] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra. MPI – The Complete
Reference: Volume 1, The MPI Core. MIT Press. Cambridge, MA. 1998.
[6] OpenMP webpage. http://www.openmp.org
[7] J. Kepner. “Parallel Programming with MatlabMPI.” 5th High Performance Embedded
Computing workshop (HPEC 2001), September 25-27, 2001, MIT Lincoln Laboratory,
Lexington, MA
[8] J. Kepner and N. Travinin. “Parallel Matlab: The Next Generation.” 7th High
Performance Embedded Computing workshop (HPEC 2003), September 23-25, 2003, MIT
Lincoln Laboratory, Lexington, MA.
[9] J. Kepner, A. Reuther, H. Kim. “Parallel Programming in Matlab Tutorial.” MIT Lincoln
Laboratory.
[10] H. Kim, N. Travinin. “pMatlab v0.7 Function Reference.” MIT Lincoln Laboratory.
[11] D. Patterson, J. Hennessy. Computer Architecture: A Quantitative Approach. Morgan
Kaufmann, 2003.
[12] H. Kim, A. Reuther. “Writing Parameter Sweep Applications using pMatlab.” MIT
Lincoln Laboratory.