Applied Parallel Computing
Yuefan Deng
Preface

This manuscript, Applied Parallel Computing, gathers the core materials from a graduate course (AMS530) I taught at Stony Brook for nearly 20 years, from a summer course I gave at the Hong Kong University of Science and Technology in 1995, and from multiple month-long and week-long parallel computing training sessions I organized at the following institutions: HKU, CUHK, HK Polytechnic, HKBC, the Institute of Applied Physics and Computational Mathematics in Beijing, Columbia University, Brookhaven National Laboratory, Northrop-Grumman Corporation, METU in Turkey, and KISTI in Korea.
YFD
Contents
Preface ...................................................................................................1
Chapter 1 Introduction ..................................................................1
1.1 Definition of Parallel Computing .................................................... 1
1.2 Evolution of Computers .......................................................................... 4
1.3 An Enabling Technology .......................................................................... 7
1.4 Cost Effectiveness ..................................................................................... 8
Chapter 2 Performance Metrics and Models ........................ 13
2.1 Parallel Activity Trace ............................................................................ 13
2.2 Speedup .................................................................................................... 14
2.3 Parallel Efficiency .................................................................................... 15
2.4 Load Imbalance ....................................................................................... 15
2.5 Granularity ............................................................................................... 17
2.6 Overhead ................................................................................................... 17
2.7 Scalability.................................................................................................. 18
2.8 Amdahl’s Law........................................................................................... 19
Chapter 3 Hardware Systems ................................................... 20
3.1 Node Architectures ................................................................................. 20
3.2 Network Interconnections ..................................................................... 22
3.3 Instruction and Data Streams .............................................................. 31
3.4 Processor-Memory Connectivity .......................................................... 32
3.5 IO Subsystems ......................................................................................... 32
3.6 System Convergence .............................................................................. 32
3.7 Design Considerations ........................................................................... 33
Chapter 4 Software Systems ..................................................... 35
4.1 Node Software ......................................................................................... 35
4.2 Programming Models ............................................................................. 37
4.3 Debuggers ................................................................................................. 43
4.4 Performance Analyzers ......................................................................... 43
Chapter 5 Design of Algorithms .............................................. 45
5.1 Algorithm Models ................................................................................... 46
5.2 Examples of Collective Operations ..................................................... 53
5.3 Mapping Tasks to Processors ............................................................... 57
Chapter 6 Linear Algebra ........................................................... 65
6.1 Problem Decomposition ........................................................................ 65
6.2 Matrix Operations ................................................................................... 68
6.3 Solution of Linear Systems ................................................................... 80
6.4 Eigenvalue Problems .............................................................................. 88
Chapter 7 Differential Equations ............................................. 89
7.1 Integration and Differentiation ...................................................... 89
7.2 Partial Differential Equations ............................................................... 92
Chapter 8 Fourier Transforms ............................................... 105
8.1 Fourier Transforms .......................................................................... 105
8.2 Discrete Fourier Transforms ........................................................ 106
8.3 Fast Fourier Transforms ................................................................ 107
8.4 Simple Parallelization ...................................................................... 111
8.5 The Transpose Method ................................................................... 112
8.6 Complexity analysis for FFT ............................................................... 113
Chapter 9 Optimization ............................................................ 133
9.1 General Issues ........................................................................................ 133
9.2 Linear Programming ............................................................................. 133
9.3 Convex Feasibility Problems ............................................................... 133
9.4 Monte Carlo Methods ........................................................................... 133
Chapter 10 Applications ............................................................ 137
10.1 Newton’s Equation and Molecular Dynamics .............................. 139
10.2 Schrödinger’s Equations and Quantum Mechanics .................... 149
10.3 Partition Function, DFT and Material Science.............................. 149
10.4 Maxwell’s Equations and Electrical Engineering ......................... 150
10.5 Diffusion Equation and Mechanical Engineering ........................ 151
10.6 Navier-Stokes Equation and CFD .................................................... 152
10.7 Other Applications ............................................................................ 152
Appendix A MPI ............................................................................ 155
A.1 An MPI Primer ........................................................................................ 155
A.2 Examples of Using MPI......................................................................... 181
A.3 MPI Tools ................................................................................................ 183
A.4 Complete List of MPI Functions ......................................................... 189
Appendix B OpenMP ................................................................... 192
B.1 Introduction to OpenMP ...................................................................... 192
B.2 Memory Model of OpenMP .................................................................. 193
B.3 OpenMP Directives ................................................................................ 193
B.4 Synchronization .................................................................................... 195
B.5 Runtime Library Routines ................................................................... 197
B.6 Examples of Using OpenMP ................................................................ 200
Appendix C Projects .................................................................... 201
Project 1 Matrix Inversion .......................................................................... 201
Project 2 Matrix Multiplication ................................................................. 201
Project 3 Mapping Wave Equation to Torus ........................................... 202
Project 4 Load Balance on 3D Mesh ......................................................... 202
Project 5 FFT on a Beowulf Computer ..................................................... 203
Project 6 Compute Coulomb’s Forces ..................................................... 203
Project 7 Timing Model for MD ................................................................. 204
Project 8 Lennard-Jones Potential Minimization ................................... 204
Project 9 Review of Supercomputers ....................................................... 205
Project 10 Top 500 and BlueGene Systems ........................................... 205
Project 11 Top 5 Supercomputers .......................................................... 206
Project 12 Cost of a 0.1 Pflops System Estimate ................................. 206
Project 13 Design of a Pflops System ..................................................... 207
Appendix D Program Examples ............................................... 208
D.1 Matrix-Vector Multiplication ............................................................... 208
D.2 Long Range N-body Force.................................................................... 211
D.3 Integration .............................................................................................. 217
D.4 2D Laplace Solver .................................................................................. 218
Index .................................................................................................. 219
Bibliography .................................................................................... 223
Chapter 1
Introduction
Serial computing systems have been with us for more than five decades
since John von Neumann introduced digital computing in the 1950s. A
serial computer refers to a system with one central processing unit (CPU)
and one memory unit, which may be so arranged as to achieve efficient
referencing of data in the memory. The overall speed of a serial computer
is determined by the execution clock rate of instructions and the
bandwidth between the memory and the instruction unit.
To speed up the execution, one would need to either increase the clock
rate or reduce the computer size to reduce the signal travel time.
1 http://www.nitrd.gov/pubs/bluebooks/1995/section.5.html
On the other hand, the raw performance and flexibility of these systems go in the opposite direction: distributed-memory multiple-instruction multiple-data (MIMD) systems and distributed-memory single-instruction multiple-data (SIMD) systems.
1.1 Definition of Parallel Computing
1 Wolpert, D.H. and Macready, W.G. (1997), "No Free Lunch Theorems for Optimization," IEEE Transactions on Evolutionary Computation 1, 67.
Table 1.1: Time scales for solving medium-sized and grand challenge problems.

Computer                      Medium-sized problems   Grand challenge problems   Example applications
1,000-node Beowulf cluster    O(1) minutes            O(1) weeks                 2D CFD, simple designs
High-end workstation          O(1) hours              O(10) years
PC with a 2 GHz Pentium       O(1) days               O(100) years
[Table residue: columns "Speeds", "Representative Computer", "Floating-point Operations Per Second"; table body not recovered.]

Figure 1.2: The scatter plot of supercomputers' LINPACK and power efficiencies in 2011.
[Figure: processor trend plotted against minimum feature size (nm); only axis residue (including an i386 data point) was recovered.]
1.3 An Enabling Technology
Rank  Vendor    Year  Computer                       Rmax (Tflops)  Cores    Site                                                  Country
1     Fujitsu   2011  K computer                     8,162          548,352  RIKEN Advanced Institute for Computational Science    Japan
2     NUDT      2010  Tianhe-1A                      2,566          186,368  National Supercomputing Center in Tianjin             China
4     Dawning   2010  Nebulae (Dawning Cluster)      1,271          120,640  National Supercomputing Centre in Shenzhen            China
6     Cray      2011  Cielo (Cray XE6)               1,110          142,272  DOE/NNSA/LANL/SNL                                     USA
8     Cray      2010  Hopper (Cray XE6)              1,054          153,408  DOE/SC/LBNL/NERSC                                     USA
9     Bull SA   2010  Tera-100 (Bull Bullx)          1,050          138,368  Commissariat a l'Energie Atomique                     France
10    IBM       2009  Roadrunner (IBM BladeCenter)   1,042          122,400  DOE/NNSA/LANL                                         USA
1.4 Cost Effectiveness
era when no single computing platform was able to achieve one Gflops, this table lists the total cost of multiple instances of a fast computing platform whose speeds sum to one Gflops. Otherwise, the least expensive computing platform able to achieve one Gflops is listed.
Figure 1.5: Microprocessor Transistor Counts 1971-2011 & Moore’s Law (Source:
Wgsimon on Wikipedia)
Most of the operating costs involve powering the hardware and cooling it off. The latest (June 2011) Green500 list1 shows that the most efficient Top 500 supercomputer runs at 2097.19 Mflops per watt, i.e., an energy requirement of roughly 0.5 W per Gflops. Operating such a system for one year therefore consumes about 4 kWh per Gflops, so the lowest annual power consumption for operating the most power-efficient 1 Pflops system is 4,000,000 kWh. With the 2011 energy cost on Long Island, New York, of $0.25 per kWh, the annual energy bill for operating such a 1 Pflops system is $1M. Quoting the same Green500 list, the least efficient Top 500 supercomputer runs at 21.36 Mflops per watt, i.e., nearly 100 times less efficient than the most efficient system. Thus, if we were to run such a system at 1 Pflops, the power cost would be $100M per year. In summary, the annual costs of operating the most and least power-efficient 1 Pflops supercomputers in 2011 are $1M and $100M respectively, and the median cost is $12M.
1 http://www.green500.org
Chapter 2
Performance Metrics and Models
the underlying serial computation takes. Above the bar, one may
write down the function being executed.
(3) A red wavy line indicates the processor is sending a message. The
two ends of the line indicate the starting and ending times of the
message sending and thus the length of the line shows the amount
of time for sending a message to one or more processors. Above
the line, one may write down the “ranks” or IDs of the receiver of
the message.
(4) A yellow wavy line indicates the processor is receiving a message.
The two ends of the line indicate the starting and ending times of
the message receiving and thus the length of the line shows the
amount of time for receiving a message. Above the line, one may
write down the “ranks” or IDs of the sender of the message.
(5) An empty interval signifies that the processing unit is idle.
2.2 Speedup
Let T(1, N) be the time required for the best serial algorithm to solve a problem of size N on one processor, and let T(P, N) be the time for a given parallel algorithm to solve the same problem on P processors. The speedup is then defined as
(2.1)  S(P, N) = \frac{T(1, N)}{T(P, N)}
For some memory-intensive applications, super speedup may occur for small N because of memory utilization: increasing the number of processors also increases the aggregate memory, which reduces the frequency of swapping and hence greatly increases the speedup. The effect of the added memory fades away when N becomes large. For this kind of application, it is better to measure the speedup relative to a baseline of P_0 processors rather than one. The speedup can then be defined as

(2.2)  S(P, N) = \frac{P_0\, T(P_0, N)}{T(P, N)}
2.3 Parallel Efficiency

The parallel efficiency is defined as

(2.3)  E(P, N) = \frac{T(1, N)}{P\, T(P, N)} = \frac{S(P, N)}{P}
2.4 Load Imbalance

Let T_i (i = 0, 1, ..., P-1) be the busy time of processor i. The average time is

(2.4)  T_{avg} = \frac{1}{P} \sum_{i=0}^{P-1} T_i

The term T_{max} = \max_i \{T_i\} is the maximum time spent by any processor, so the total processor time is P T_{max}. Thus, the parameter called the load imbalance ratio is given by

(2.5)  I(P, N) = \frac{P T_{max} - \sum_{i=0}^{P-1} T_i}{\sum_{i=0}^{P-1} T_i} = \frac{T_{max}}{T_{avg}} - 1
Remarks:
2.5 Granularity
The size of the sub-domains allocated to the processors is called the
granularity of the decomposition. Here is a list of remarks:
2.6 Overhead
In all parallel computing, it is the communication and load imbalance
overhead that affects the parallel efficiency. Communication costs are
usually buried in the processor active time. When a co-processor is added
for communication, the situation becomes trickier.
Let t_{max} = \max_i\{t_i\}, and let the time that the entire system of P processors spends on computation or communication be \sum_{i=1}^{P} t_i. Finally, let the total time that all processors are occupied (by computation, communication, or being idle) be P t_{max}. The ratio of these two is defined as the load balance ratio,

(2.6)  L(P, N) = \frac{\sum_{i=1}^{P} T_i}{P T_{max}} = \frac{T_{avg}}{T_{max}}
𝐿(𝑃, 𝑁) only measures the percentage of the utilization during the system
“up time,” which does not care what the system is doing. For example, if
we only keep one of the 𝑃 = 2 processors in a system busy, we
get 𝐿(2, 𝑁) = 50%, meaning we achieved 50% utilization. If 𝑃 = 100 and one
is used, then 𝐿(100, 𝑁) = 1%, which is badly imbalanced.
Also, we define the load imbalance ratio as 1 − 𝐿(𝑃, 𝑁). The overhead is
defined as
(2.7)  H(P, N) = \frac{P}{S(P, N)} - 1
2.7 Scalability
First, we define two terms: scalable algorithm and quasi-scalable
algorithm. A scalable algorithm is defined as those whose parallel
efficiency 𝐸(𝑃, 𝑁) remains bounded from below, i.e., 𝐸(𝑃, 𝑁) ≥ 𝐸& > 0, when
the number of processors 𝑃 → ∞ at fixed problem size.
More specifically, algorithms that can maintain the efficiency while the problem size N is kept constant are called strongly scalable, and those that can only maintain the efficiency when N increases along with P are called weakly scalable.
2.8 Amdahl’s Law
Assume a fraction f of the work is inherently serial and the remaining 1 - f is perfectly parallelizable. On one processor,

(2.8)  T(1, N) = \tau

Thus, on P processors,

(2.9)  T(P, N) = f\tau + \frac{(1 - f)\tau}{P}

Therefore,

(2.10)  S(P, N) = \frac{1}{f + (1 - f)/P}
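As a quick numerical illustration (an addition, not from the original text), the following minimal C program tabulates the Amdahl speedup of Equation (2.10) for a few serial fractions f and processor counts P; the chosen values are arbitrary.

    #include <stdio.h>

    /* Amdahl speedup S(P) = 1 / (f + (1 - f)/P), Eq. (2.10) */
    static double amdahl(double f, int P) {
        return 1.0 / (f + (1.0 - f) / (double)P);
    }

    int main(void) {
        const double fractions[] = {0.01, 0.05, 0.25};  /* example serial fractions */
        const int procs[] = {2, 16, 128, 1024};         /* example processor counts */
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 4; j++)
                printf("f = %.2f  P = %5d  S = %7.2f\n",
                       fractions[i], procs[j], amdahl(fractions[i], procs[j]));
        return 0;
    }

Even a 5% serial fraction caps the speedup at 20 no matter how many processors are used, which is the practical content of Amdahl's law.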
Chapter 3
Hardware Systems
For a serial computer with one CPU and one chunk of memory, ignoring
the details of possible memory hierarchy, plus some peripherals, only
two parameters are needed to describe the computer: its CPU speed and
its memory size.
3.1 Node Architectures
In recent years, the vast majority of the designs are centered on four of
the processor families: Power, AMD x86-64, Intel EM64T, and Intel
Itanium IA-64. These four together with Cray and NEC families of vector
processors are the only architectures that are still being actively utilized
in the high-end supercomputer systems. As shown in the following table
constructed with data from top500.org for June 2011 release of Top 500
supercomputers, 90% of the supercomputers use x86 processors.
Currently both companies, Intel and AMD, are revamping their product lines and transitioning their server offerings to quad-core processor designs. AMD introduced its Barcelona core on a 65 nm manufacturing process as a competitor to the Intel Core architecture, intended to offer instruction-per-clock performance comparable to Core 2; however, the launch was plagued by delays caused by difficulties in manufacturing sufficient numbers of higher-clocked units and by emerging operational issues requiring additional bug fixes, which so far have resulted in subpar performance. At the same time, Intel enhanced its product line with the Penryn core refresh on a 45 nm process, featuring ramped-up clock speeds, optimized execution subunits, and additional SSE4 instructions, while keeping power consumption within the previously defined TDP limits of 50 W for the energy-efficient, 80 W for the standard, and 130 W for the high-end parts. According to the roadmaps published by both companies, the parts available in 2008 will consist of up to four cores on the same processor die, with peak performance per core in the range of 8 to 15 Gflops on a power budget of 15 to 30 W, and 16 to 32 Gflops on a power budget of 50 to 68 W. Due to its superior manufacturing capabilities, Intel is expected to maintain its performance-per-watt advantage, with top Penryn parts clocked at 3.5 GHz or above, while AMD Barcelona parts in the same power envelope are not expected to exceed 2.5 GHz until the second half of 2008 at best. The features of the three processor families that power the top supercomputers in 2011 are given in the table below, with data collected from the respective companies' websites.
1 http://www.fujitsu.com/downloads/TC/090825HotChips21.pdf
3.2 Network Interconnections
Many interconnection networks have been designed, and it does not make sense to say which one is the best in general. The structure of a network is usually measured by the following parameters:
3.2.1 Topology
Here are some of the topologies that are currently in common use:
Figure 3.1: Mesh topologies: (a) 1-D mesh (array); (b) 2-D mesh; (c) 3-D mesh.
Figure 3.2: Torus topologies: (a) 1-D torus (ring); (b) 2-D torus; (c) 3-D torus.
[Figure: multistage switch network, with switches 5, 6, and 7 connecting nodes 1 through 8; diagram residue removed.]
Among those topologies, mesh, torus and fat tree are more frequently
adopted in latest systems. Their properties are summarized in Table 3.3.
[Table 3.3 residue: "high connectivity", "scale-up is costly", "low connectivity"; remainder not recovered.]
the end of 2007 showed that the ratio of clusters to mesh-based systems
decreased with two cluster systems and eight systems with mesh
networks.
MPU Networks

The MPU network is a combination of two k-dimensional rectangular meshes of equal size, which are offset by half a hop along each dimension so that each vertex of one mesh is surrounded by a cube of 2^k neighbors from the other mesh. Connecting the vertices in one mesh diagonally to their immediate neighbors in the other mesh and removing the original rectangular mesh connections produces the MPU network.
[Table residue: "Dimensionality: k, k, 1"; remainder not recovered.]
MSRT Networks
To understand the MSRT topology, let us first start from 1D MSRT bypass rings. A 1D MSRT bypass ring originates from a 1D SRT ring by eliminating every other bypass link. In Figure 3.6, the 1D MSRT(L = 2; l_1 = 2, l_2 = 4) is a truncated 1D SRT(L = 2; l_1 = 2, l_2 = 4). Here L = 2 is the maximum node level, which means that two types of bypass links exist, i.e., l_1 and l_2 links. Then, l_1 = 2 and l_2 = 4 indicate the short and long bypass links spanning 2^{l_1} = 4 and 2^{l_2} = 16 hops, respectively. Figure 3.6 shows the similarity and difference of SRT and MSRT. We extend 1D MSRT bypass rings to 3D MSRT networks. To maintain a suitable node degree, we add the two types of bypass links only in the x- and y-axes and then form a 2D expansion network in the xy-plane.
[Figure: a 64-node bypass ring with nodes numbered 0 through 62 (even); legend: level-0 node, level-1 node, level-2 node. Diagram residue removed.]
Figure 3.7: Networks for Top 500 supercomputers by system count in June 2011: Gigabit Ethernet 46%, Infiniband 41%, Proprietary 6%, Custom 5%, Other 2%.
Figure 3.8: Networks for Top 500 supercomputers by performance in June 2011: Infiniband 39%, Custom 23%, Gigabit Ethernet 20%, Proprietary 17%, Other 1%.
• Distributed-memory;
• Shared-memory;
• Shared-distributed-memory;
• Distributed-shared-memory.
3.5 IO Subsystems
The higher these parameters are, the better the parallel system. On the
other hand, it is the proper balance of these three parameters that
guarantees a cost-effective system. For example, a narrow inter-node width
will slow down communication and make many applications unscalable. A
low I/O rate will keep nodes waiting for data and slow overall performance.
A slow node will make the overall system slow.
[Figure: processors 1–4 connected to memory modules M through a bus/switch; panels (a) and (b). Diagram residue removed.]

3.7 Design Considerations
In this "4D" hardware parameter space, one can find an infinite number of points, each representing a particular parallel computer. Parallel computing is an application-driven technique; unreasonable combinations of computer architectures are eliminated through selection. Fortunately, a large number of such architectures can be eliminated, so that we are left with a dozen useful cases. To name a few: distributed-memory MIMD (Paragon on a 2D mesh topology, iPSC on a hypercube, CM5 on a tree network), distributed-memory SIMD (MasPar, CM-2, CM-200), and shared-memory MIMD (Crays, IBM mainframes).
Chapter 4
Software Systems
• Linux
• AIX
• SunOS
• Mac OS
4.1.2 Compilers
Most compilers add some communication mechanisms to conventional
languages like FORTRAN, C, or C++; a few expand other languages, such
as Concurrent Pascal.
4.1.3 Libraries
4.1.4 Profilers
4.2 Programming Models
send: One processor sends a message to the network. The sender does not need to specify the receiver, but it does need to assign a "name" to the message.

recv: One processor receives a message from the network. The receiver does not need the attributes of the sender, but it does need to know the name of the message for retrieval.
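As an illustration (an addition, not part of the original text), here is a minimal MPI sketch of this send/recv model in C; in MPI the role of the message "name" is played by the tag, and rank 0 sends one number to rank 1.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double x = 3.14;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* send one double to rank 1; the tag 99 acts as the message "name" */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive the message named by tag 99, from any sender */
            MPI_Recv(&x, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 99, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", x);
        }
        MPI_Finalize();
        return 0;
    }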
Synchronous Communication
In synchronous communication, the sender will not proceed to its next
task until the receiver retrieves the message from the network. This is
analogous to hand delivering a message to someone, which can be quite
time consuming!
Asynchronous Communication
During asynchronous communication, the sender will proceed to the next
task whether the receiver retrieves the message from the network or not.
This is similar to mailing a letter. Once the letter is placed in the post
office, the sender is free to resume his or her activities. There is no
protection for the message in the buffer.
1. The first option for the Receiver is to wait until the message has
arrived and then make use of it.
2. The second option is to check whether the message has arrived; if it has,
       do_sth_with_it
   else do_sth_else
Interrupt
The receiver interrupts the sender’s current activity in order to pull
messages from the sender. This is analogous to the infamous 1990s
telemarketer’s call at dinner time in US. The sender issues a short
message to interrupt the current execution stream of the receiver. The
receiver becomes ready to receive a longer message from the sender.
After an appropriate delay (for the interrupt to return the operation
pointer to the messaging process), the sender pushes through the
message to the right location of the receiver’s memory, without any
delay.
Communication Patterns
There are nine different communication patterns:
One
Partial
All
Standard
Assumes nothing about the matching receive.

Buffered
Sends the message to a "buffer" on the sender, the receiver, or both; the data is copied from this location when the receiver is ready (where ready means the processor facilitating the message is free).

Synchronous
Sends a request to the receiver; when the receiver replies that it is ready, the sender pushes the message.
Ready
Before sending, the sender knows that the matching receive has already been posted.
4.2.2 Shared-Memory

Shared-memory computers are another large category of parallel computers. As the name indicates, all or parts of the memory space are shared among all the processors; that is, the memory can be accessed directly and simultaneously by multiple processors. In this situation, communication among processors is done implicitly by accessing the same memory space from different processors. This scheme gives the developer an easier and more flexible environment in which to write parallel programs. However, concurrency becomes an important issue during development. As we will discuss later, multiple techniques are provided to solve this issue.

Although all the memory can be shared among processors without extra difficulty, in practical implementations the memory space on a single node has always been partitioned into shared and private parts, at least logically, in order to provide flexibility in program development. Shared data is accessible by all processors, but private data can only be accessed by the processor that owns it.
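As a brief illustration (an addition here; the book covers OpenMP in Appendix B), the following C sketch uses OpenMP to let several threads update a shared array and reduce a shared sum; the array size and contents are arbitrary placeholders.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        static double a[N];
        double sum = 0.0;

        /* each thread updates its own chunk of the shared array */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        /* the reduction clause resolves the concurrent updates to sum */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);
        return 0;
    }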
4.3 Debuggers
A parallel debugger is no different from a serial debugger. dbx is one
popular example of serial debugger. dbx is a utility for source-level
debugging and execution of programs written in C, Pascal, and Fortran.
Most parallel debuggers are built around dbx with additional functions for handling parallelism, for example, to report variable addresses and contents in different processors. Many variations add convenience through a graphical interface.
A typical debugger called IPD has been upgraded from the iPSC/860 to
the Paragon. IPD stands for the Interactive Parallel Debugger. It is a
complete symbolic, source-level debugger for parallel programs that run
under the Paragon OSF/1 operating system. Beyond the standard
operations that facilitate the debugging of serial programs, IPD offers
custom features that facilitate debugging parallel programs.
IPD lets you debug parallel programs written in C, Fortran, and assembly
language. IPD consists of a set of debugging commands, for which help is
available from within IPD. After invoking IPD, entering either help or ? at
the IPD prompt returns a summary of all IPD commands.
IPD resides on the iPSC SRM, so you must invoke IPD under UNIX from
the SRM. You may be logged in either directly or remotely. When you
invoke the debugger, IPD automatically executes commands in its
configuration file, if you create such a file. This file must be called .ipdrc
(note the period) and must reside in your home directory. It is an ASCII
file containing IPD commands (see the exec command for more
information).
ParaGraph
Chapter 5
Design of Algorithms
Quality algorithms, in general, must be (a) accurate, (b) efficient, (c) stable, (d) portable, and (e) maintainable.
1. Communication costs
2. Load imbalance costs
For scalability, there are two distinct types: strong scaling and weak
scaling. A strong scaling algorithm is such that it is scalable for solving a
fixed total problem size with a varying number of processors. Conversely,
a weak scaling algorithm is such that it is scalable for solving a fixed
problem size per processor with a varying number of processors.
Achieving all of these will likely ensure high parallel efficiency of a parallel algorithm. How do we do it?
(1) master-slave
(2) domain decomposition
(3) control decomposition
(4) data parallel
(5) single program multiple data (SPMD)
(6) virtual-shared-memory model
5.1 Algorithm Models
5.1.1 Master-Slave
A master-slave model is the simplest parallel programming paradigm
with exception of the embarrassingly parallel models. In the master-slave
model, a master processor controls the operations of the rest of slave
processors in the system. Obviously, this model can cover a large portion
of applications. Figure 5.1 illustrates the Master-Slave mode.
Figure 5.1: The Master-Slave model. Here, communication occurs only between the master and the slaves; there is no inter-slave communication.
1 Fox, Johnson, Lyzenga, Otto, Salmon, and Walker.
Figure 5.2: In this model, the computational domain is decomposed and each sub-
domain is assigned to a process. This is useful in SIMD and MIMD contexts, and is
popular in PDEs. Also, domain decomposition is good for problems with locality.
5.1.4 Virtual-Shared-Memory
Figure 5.4 illustrates the virtual-shared-memory model.
Figure 5.3: In this model, the computational domain is decomposed and each sub-
domain is assigned to a process. This is useful in SIMD and MIMD contexts, and is
popular in PDEs. Also, domain decomposition is good for problems with locality.
(1) The efforts between the researcher and the computer (see Figure
5.5)
(2) Code clarity and code efficiency
Figure 5.5: This figure illustrates the relationship between programmer time and
computer time, and how it is related to the different ways of handling
communication.
Embarrassingly parallel:
• Little communication
• Little load imbalance
Synchronous parallel:
Asynchronous:
Synchronized parallel
Asynchronized parallel
Does this mean parallel computing is useless for scientific problems? The answer is no. The reason is that in reality, when increasing the number of processors, one always studies larger problems, which elevates the parallel efficiency. Even 50% efficiency is not a serious concern, for such a run can still be faster, sometimes by orders of magnitude, than a serial run. In fact, serial computing is a dead end for scientific computing.
5.2 Examples of Collective Operations
5.2.1 Broadcast
Suppose processor P_0 possesses N floating-point numbers {X_{N-1}, ..., X_1, X_0} that need to be broadcast to the other P - 1 processors in the system. The best way to do this is to let P_0 send X_0 to P_1, which then sends X_0 on to P_2 (keeping a copy for itself) while P_0 sends the next number, X_1, to P_1. At the next step, while P_0 sends out X_2 to P_1, P_1 sends out X_1 to P_2, etc., in a pipeline fashion. This is called wormhole communication.

Suppose the time needed to send one number from a processor to its neighbor is T_comm. Also, if a processor starts to send a number to its neighbor at time T_1 and the neighbor starts to ship out this number at time T_2, we define a parameter called

(5.1)  T_{startup} = T_2 - T_1

Typically, T_{startup} >= t_{comm}. Therefore, the total time needed to broadcast all N numbers to all P processors is given by
At the end of the third step, all processors have the global sum.
[Figure: the scatter operation — P_1 initially holds A_1, A_2, A_3, A_4; after the scatter, processor P_i holds A_i.]
5.2.3 Allgather

We explain the global summation on a 3D hypercube. Processor P_i contains a number X_i for i = 0, 1, ..., 7. We do the summation in three steps.

Step 1: Processors P_0 and P_1 exchange contents and then add them up. So do the pairs of processors P_2 and P_3, P_4 and P_5, as well as P_6 and P_7. All these pairs perform the communication and addition in parallel. At the end of this parallel process, each processor has its partner's content added to its own content, e.g., P_0 now has X_0 + X_1.

Step 2: P_0 exchanges its new content with P_2 and performs the addition (other pairs follow this pattern). At the end of this stage, P_0 has X_0 + X_1 + X_2 + X_3.

Step 3: Finally, P_0 exchanges its new content with P_4 and performs the addition (again, other pairs follow the pattern). At the end of this stage, P_0 has X_0 + X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + X_7.
In fact, after log_2 8 = 3 stages, every processor has a copy of the global sum. In general, for a system containing P processors, the total time needed for the global summation is O(log_2 P).
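For reference (an addition, not from the original text), this recursive-doubling global sum is exactly what MPI's collective MPI_Allreduce performs; a minimal C sketch:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double x, sum;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        x = (double)rank;                 /* the local value X_i on processor P_i */
        /* every processor ends up with the global sum, in O(log P) stages */
        MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d: global sum = %f\n", rank, sum);
        MPI_Finalize();
        return 0;
    }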
5.3 Mapping Tasks to Processors

1. Linear Mapping
2. 2D Mapping
3. 3D Mapping
4. Random Mapping: flying processors all over the communicator
5. Overlap Mapping: convenient for communication
6. Any combination of the above
0 1 2
(0,0) (0,1) (0,2)
3 4 5
(1,0) (1,1) (1,2)
6 7 8
(2,0) (2,1) (2,2)
MPI_Dims_create(
    number_nodes_in_grid,
    number_cartesian_dims,
    array_specifying_meshes_in_each_dim
)

MPI_Dims_create(6, 2, dims) → dims = 3, 2
MPI_Dims_create(7, 3, dims) → no answer
The functions listed in Figure 5.10 are useful when embedding a lower-dimensional Cartesian grid into a bigger one, with each sub-grid forming a sub-communicator:

MPI_Cart_sub(
    IN  comm,
    IN  remain_dims,   /* logical array, TRUE/FALSE per dimension */
    OUT new_comm
)

For example, a sub-communicator new_comm of size 2 × 4 (8 processes) may be formed this way.
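A self-contained C sketch combining these calls (an illustration under assumed sizes, not the author's code): the available processes are arranged into a 2D Cartesian grid by MPI_Dims_create and MPI_Cart_create, and each row of the grid is then split into its own sub-communicator with MPI_Cart_sub.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        int dims[2] = {0, 0}, periods[2] = {0, 0}, remain[2];
        MPI_Comm grid, row;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Dims_create(size, 2, dims);            /* e.g., 8 processes -> 4 x 2 or 2 x 4 */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

        remain[0] = 0;  remain[1] = 1;             /* keep the second dimension: one row each */
        MPI_Cart_sub(grid, remain, &row);

        int row_rank;
        MPI_Comm_rank(row, &row_rank);
        printf("global rank %d has row rank %d in a %d x %d grid\n",
               rank, row_rank, dims[0], dims[1]);

        MPI_Finalize();
        return 0;
    }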
bandwidth and other properties are non-uniform: they depend on the relative or even absolute locations of the processors in the supercomputer. Thus, the way computation modules or subtasks are assigned to nodes may impact the communication costs and the total running time. To optimize this part of the running time, we need to utilize task mapping.
Latency matrix
1 Shahid H. Bokhari, "On the Mapping Problem," IEEE Transactions on Computers, vol. 30, no. 3, pp. 207–214, March 1981.
The latency matrix is measured from the communication times for sending a 0-byte message from each node to all other nodes; these times form a matrix.
Linear regression analysis of the latency with respect to the hop count shows that there are many outliers that may mislead the optimization. Most of the differences result from the torus in the Z-dimension of the BG/L system.
Figure 5.14: Linear regression of the latency with respect to the hop
Model by Bokhari
In 1981, Bokhari proposed a model that maps 𝑛 tasks to 𝑛 processors to
find the minimum communication cost independent of computation
costs which can be represented as
Model by Heiss

In 1996, Heiss and Dormanns formulated the mapping problem as finding a mapping T → P.
Model by Bhanot

In 2005, a model was developed by Bhanot et al. to minimize only the inter-task communication. They neglect the actual computing cost when placing tasks on processors linked by a mesh or torus, using the following model
Let {t_1, ..., t_n} be the set of subtasks of the problem and {p_1, ..., p_m} be the set of heterogeneous processors of the parallel computer to which the subtasks are assigned; in general, n >= m. Let x_{tp} be the Boolean decision variable that equals 1 if subtask t is assigned to processor p and 0 otherwise, and let

(5.7)  y_{tt'pp'} = 1 if subtasks t and t' are assigned to processors p and p' respectively, and 0 otherwise,

subject to

(5.9)  \sum_{p=1}^{m} x_{tp} = 1, \quad t = 1, \ldots, n

       \sum_{t=1}^{n} x_{tp} \ge 1, \quad p = 1, \ldots, m

       \sum_{t=1}^{n} x_{tp} \le \left\lceil \frac{A_p}{A_t} \times n \right\rceil, \quad p = 1, \ldots, m
Chapter 6
Linear Algebra
which means 10^{12} double-precision numbers, requiring roughly 8 terabytes of memory. Hence, in this situation, decomposition across data is also required.
(6.1)  \begin{pmatrix} M_{11} & M_{12} & \cdots & M_{1q} \\ M_{21} & M_{22} & \cdots & M_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ M_{p1} & M_{p2} & \cdots & M_{pq} \end{pmatrix}

where pq = P.
1. Row Partition;
2. Column Partition;
3. Block Partition;
4. Scatter Partition.
(6.3)  C = \begin{pmatrix} a_{11}@1 & a_{12}@2 \\ a_{21}@3 & a_{22}@4 \end{pmatrix} \begin{pmatrix} b_{11}@1 & b_{12}@2 \\ b_{21}@3 & b_{22}@4 \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}
The properties of these decompositions vary widely. The table below lists
their key features:
Method Properties
6.2 Matrix Operations

(6.9)  A b = c
Vector b: the elements of vector b are given to each processor i, for each integer i ∈ [1, p]:

(6.10)  \begin{pmatrix} b_1@i \\ b_2@i \\ b_3@i \\ \vdots \\ b_n@i \end{pmatrix}
(6.13)  S(n, p) = \frac{T(n, 1)}{T_{comp}(n, p) + T_{comm}(n, p)} = \frac{P}{1 + cp/n}
Remarks:
(6.14)  \begin{pmatrix} b_1@1 \\ b_2@2 \\ b_3@3 \\ \vdots \\ b_n@p \end{pmatrix}
@1: A_{11} × b_1
@2: A_{22} × b_2
 ⋮
@p: A_{pp} × b_p
Vector b:
(6.15)  \begin{pmatrix} b_2@1 \\ b_3@2 \\ b_4@3 \\ \vdots \\ b_1@p \end{pmatrix}
Then, multiply the next off-diagonal elements with the local vector
elements.
@1: A_{12} × b_2
@2: A_{23} × b_3
 ⋮
@p: A_{p1} × b_1
Step 3: Repeat step 2 until all elements of b have visited all processors.
The communication time to roll up the elements of vector b and form the
final vector is
(6.17)  T_{comm}(n, p) = d'p \times \frac{n}{p}
(6.18)  S(n, p) = \frac{T(n, 1)}{T_{comp}(n, p) + T_{comm}(n, p)} = \frac{P}{1 + c'p/n}
Remarks:
@1: A_{11} × b_1
@2: A_{12} × b_2
 ⋮
@p: A_{1p} × b_p
The results are then “lumped” to form the element 𝑐* of the vector 𝑐.
Step 2: Repeat step 1 for all rows to form the complete resulting vector c.
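The schemes above distribute A and b in different ways. As one concrete, hedged illustration (not the author's exact scheme), here is a C/MPI sketch of a block-row matrix-vector multiply: each processor stores n/p full rows of A and the whole vector b, computes its slice of c = Ab locally, and MPI_Allgather assembles the full result on every processor. The matrix size and fill values are placeholders.

    #include <mpi.h>
    #include <stdlib.h>

    #define N 512                       /* assume N is divisible by the number of processors */

    int main(int argc, char **argv) {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int rows = N / p;               /* my block of rows */
        double *A = malloc(rows * N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c_local = malloc(rows * sizeof(double));
        double *c = malloc(N * sizeof(double));

        /* fill the local rows of A and the full vector b with sample values */
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < N; j++)
                A[i * N + j] = 1.0 / (1.0 + rank * rows + i + j);
        for (int j = 0; j < N; j++)
            b[j] = 1.0;

        /* local multiplication: my rows of c */
        for (int i = 0; i < rows; i++) {
            c_local[i] = 0.0;
            for (int j = 0; j < N; j++)
                c_local[i] += A[i * N + j] * b[j];
        }

        /* gather all slices so every processor holds the complete c */
        MPI_Allgather(c_local, rows, MPI_DOUBLE, c, rows, MPI_DOUBLE, MPI_COMM_WORLD);

        free(A); free(b); free(c_local); free(c);
        MPI_Finalize();
        return 0;
    }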
(6.20)  T_{comm}(n, p) = d''p \times \frac{n}{p}
(6.21)  S(n, p) = \frac{T(n, 1)}{T_{comp}(n, p) + T_{comm}(n, p)} = \frac{P}{1 + c'p/n}
Remarks:
(6.22)  A B = C

where

(6.23)  A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{pmatrix}
A = \begin{pmatrix} A_{11}@1 & A_{12}@1 & \cdots & A_{1n}@1 \\ A_{21}@2 & A_{22}@2 & \cdots & A_{2n}@2 \\ A_{31}@3 & A_{32}@3 & \cdots & A_{3n}@3 \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1}@m & A_{m2}@m & \cdots & A_{mn}@m \end{pmatrix}
C = \begin{pmatrix} C_{11} & & & \\ & C_{22} & & \\ & & \ddots & \\ & & & C_{mm} \end{pmatrix}

where
A = \begin{pmatrix} A_{21}@1 & A_{22}@1 & \cdots & A_{2n}@1 \\ A_{31}@2 & A_{32}@2 & \cdots & A_{3n}@2 \\ A_{41}@3 & A_{42}@3 & \cdots & A_{4n}@3 \\ \vdots & \vdots & \ddots & \vdots \\ A_{11}@m & A_{12}@m & \cdots & A_{1n}@m \end{pmatrix}
C = \begin{pmatrix} C_{11} & & & C_{1m} \\ C_{21} & C_{22} & & \\ & \ddots & \ddots & \\ & & C_{m,m-1} & C_{mm} \end{pmatrix}

where
Step 3: Repeat step 2 until all rows of A have passed through all
processors.
(6.27)  T_{comm}(n, p) = (p - 1) \left(\frac{n^2}{p}\right) t_{comm} \approx n^2 t_{comm}
The computation cost for multiplying a matrix of size n⁄p with a matrix
of size 𝑛 is
(6.29)  S(n, p) = \frac{T(n, 1)}{T_{comp}(n, p) + T_{comm}(n, p)} = \frac{P}{1 + c'p/n}
Remarks:
• The overhead is proportional to T_{comm}/T_{comp}, i.e., the speedup increases when T_{comp}/T_{comm} increases.
• The previous two comments are universal for parallel computing.
• Memory is parallelized.
Step 4: Repeat step 3 until all elements of B have traveled to all processors, then gather all the elements of C.
(6.35)  S(n, p) = \frac{T(n, 1)}{T_a(n, p) + T_b(n, p) + T_c(n, p)}

(6.36)  h(n, p) = \left( \frac{1}{np} + \frac{P - 2q}{2p^2} \right) \frac{t_{comm}}{t_{comp}}
Remarks:
(6.39)  c_{11} = S_1 + S_4 - S_5 + S_7
        c_{12} = S_3 + S_5
        c_{21} = S_2 + S_4
        c_{22} = S_1 + S_3 - S_2 + S_6
6.3 Solution of Linear Systems
(6.40)  A x = b
From the pseudo-code, it is easy to conclude that the operation count for
the Gaussian elimination is
(6.41)  T_{comp}(n, 1) = \frac{2}{3} n^3 t_{comp}
where 𝑡comp is the time for unit computation.
u = a
for i = 1 to n - 1
    for j = i + 1 to n
        l = u(j,i) / u(i,i)
        for k = i to n
            u(j,k) = u(j,k) - l * u(i,k)
        end
    end
end
Back Substitution
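The back-substitution listing itself is not preserved here; the following is a minimal C sketch under the assumption that u holds the upper-triangular factor produced above and y holds the correspondingly updated right-hand side (both assumptions, since the author's original listing is lost).

    /* back substitution: solve u x = y for an upper-triangular u (0-based indexing) */
    void back_substitute(int n, const double u[], const double y[], double x[]) {
        for (int i = n - 1; i >= 0; i--) {
            double s = y[i];
            for (int k = i + 1; k < n; k++)
                s -= u[i * n + k] * x[k];   /* subtract the already-known unknowns */
            x[i] = s / u[i * n + i];
        }
    }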
Row Partition

Row partition is among the most straightforward methods, parallelizing the innermost loop. For simplicity when demonstrating this method, we assume the number of processors p equals n, the dimension of the system, but it is quite easy to adapt it to situations where p < n.
STEP 2: For row i, subtract from every element the corresponding element of the first row multiplied by a(i,1)/a(1,1). By doing this, the elements in the first column of the matrix become all zero.

STEP 3: Repeat steps 1 and 2 on the lower-right submatrix until the system becomes upper triangular.
In this method, as we can see, every step has one less node involved in the computation. This means the parallelization is largely imbalanced. The total computation time can be written as

(6.43)  T_{comp}(n, p) = \frac{n^3}{p} t_{comp}

where t_comp is the time for a unit operation. Comparing with the serial case (6.42), we see that roughly 1/3 of the computing power is wasted in waiting.
During the process, the right-hand side can be treated as the last column of the matrix. Thus, it can be distributed among the processors and no special consideration is needed.
STEP 3: Every processor plugs in this variable and updates its corresponding row.
has the largest first-column element, and to eliminate based on this row instead of the first one. This requires finding the largest element among processors, which adds more communication to the algorithm. An alternative way to solve this problem is to use the column partition.
Column Partition

Column partition gives an alternative way of doing a 1-D partition. In this method, less data is transferred in each communication cycle. This gives smaller communication overhead and makes pivoting easier. However, the load imbalance problem still exists in this method.
STEP 2: Broadcast the pivot and its row number to other processors.
This method has the same overall computation time but much less communication when pivoting is needed.
Block Partition
We illustrate the idea with a specific example: a 20 × 20 matrix solved by 16 processors. The idea can be easily generalized to more processors with a larger matrix.
Step 3: Move to the next row and column and repeat steps 1 and 2 until
the last matrix element is reached.
Remarks:
LU Factorization
In practical applications, one often needs to solve a linear system Ax = b multiple times with different b while A stays constant. In this case, a pre-decomposed matrix that offers fast calculation of the solution for different b's is desirable. LU decomposition is a good method to achieve this.
u = a
for i = 1 to n - 1
    for j = i + 1 to n
        l(j,i) = u(j,i) / u(i,i)
        for k = i to n
            u(j,k) = u(j,k) - l(j,i) * u(i,k)
        end
    end
end
General Description
𝑃 processors are used to solve sparse tri-, 5-, and 7-diagonal systems
(also known as banded systems) of the form 𝐴𝑥 = 𝑏. For example, we may
have
(6.44)  A = \begin{pmatrix} a_{11} & a_{12} & & \\ a_{21} & a_{22} & a_{23} & \\ & a_{32} & a_{33} & a_{34} \\ & & a_{43} & a_{44} \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}, \quad b = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}
Classical Techniques:
• Relaxation
• Conjugate-Gradient
• Minimal-Residual
where u^{(0)} are the initial values and T is the iteration matrix, tailored to the structure of A.
6.3.3 ADI
The Alternating Direction Implicit (ADI) method was first used in the 1950s by Peaceman and Rachford [?] for solving parabolic PDEs. Since then, it has been widely used in many applications.
Varga [?] has refined and extended the discussions of this method.
6.4.1 QR Factorization
6.4.2 QR Algorithm
Chapter 7
Differential Equations
7.1 Integration and Differentiation

To compute the integral I = \int_a^b f(x)\,dx, we slice the interval [a, b] into N mesh blocks, each of equal size \Delta x = \frac{b - a}{N}, as shown in Figure 7.1. Suppose x_i = a + (i - 1)\Delta x; the integral is approximated as
Figure 7.1: I = \int_a^b f(x)\,dx. In the Riemann sum, the total area A under the curve f(x) is the approximation of the integral I. With two processors j and j+1, we can compute A = A_j + A_{j+1} completely in parallel. Each processor only needs to know its starting and end points, in addition to the integrand, for partial integration. At the end, all participating processors use the global summation technique to add up the partial integrals computed by each processor. This method has 100% parallel efficiency.
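To make this concrete, here is a hedged C/MPI sketch (an illustration, not the author's code): each processor integrates its own share of mesh blocks by the midpoint rule and MPI_Reduce forms the global sum; f(x) = 4/(1 + x^2) on [0, 1] is an arbitrary test integrand.

    #include <mpi.h>
    #include <stdio.h>

    static double f(double x) { return 4.0 / (1.0 + x * x); }   /* integrates to pi on [0,1] */

    int main(int argc, char **argv) {
        int rank, p;
        const double a = 0.0, b = 1.0;
        const long N = 1000000;                 /* total number of mesh blocks */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double dx = (b - a) / (double)N;
        double local = 0.0;
        /* each processor sums the mesh blocks i = rank, rank + p, rank + 2p, ... */
        for (long i = rank; i < N; i += p)
            local += f(a + (i + 0.5) * dx) * dx;

        double total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("integral = %.12f\n", total);

        MPI_Finalize();
        return 0;
    }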
(7.3)  I = \int_D g(x)\,dx

can be approximated by the sample average

(7.4)  I_m = \frac{1}{m}\left[ g\!\left(x^{(1)}\right) + \cdots + g\!\left(x^{(m)}\right) \right]

where the x^{(i)}'s are random samples from D. According to the central limit theorem, the error term behaves as

(7.5)  \epsilon \propto \sigma\, m^{-1/2}

where \sigma^2 = \mathrm{var}\{g(x)\}. This means the convergence rate of the Monte Carlo method is O(m^{-1/2}), regardless of the dimension of D. This property gives the Monte Carlo method an advantage when integrating in high dimensions.
Simple Parallelization

The most straightforward parallelization of the Monte Carlo method is to let every processor generate its own random samples and local values g(x_p^{(i)}). Communication occurs once, at the end of the calculation, to aggregate all the local approximations and produce the final result. Thus, the parallel efficiency tends to be close to 100%.
\epsilon_1 \propto O\!\left(\frac{1}{\sqrt{m}}\right)

\epsilon_P \propto O\!\left(\frac{1}{\sqrt{mP}}\right)
In fact, we are not precisely sure how the overall convergence varies with
increasing number of processors.
7.2 Partial Differential Equations

1D Wave Equation

We first demonstrate the parallel solution of the 1D wave equation with the simplest method, i.e., an explicit finite difference scheme. Consider the following 1D wave equation:

(7.6)  u_{tt} = c^2 u_{xx}, \quad \forall t > 0, \quad \text{with proper BC and IC.}
When performing updates for the interior points, each processor can
behave like a serial processor. However, when a point on the processor
boundary needs updating, a point from a neighboring processor is
needed. This requires communication. Most often, this is done by
building buffer zones for the physical sub-domains. After updating its
physical sub-domain, a processor will request that its neighbors (2 in 1D,
8 in 2D, and 26 in 3D) send their border mesh point solution values (the
number of mesh point solution values sent depends on the numerical
scheme used) and in the case of irregular and adaptive grid, the point
coordinates. With this information, this processor then builds a so-called
virtual sub-domain that contains the original physical sub-domain
surrounded by the buffer zones communicated from other processors.
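A hedged C/MPI sketch of this buffer-zone (ghost-cell) exchange for the 1D case (an illustration only; the array size and the update step are placeholders): each processor holds M interior points plus two ghost points and swaps boundary values with its left and right neighbors before every time step.

    #include <mpi.h>

    #define M 100                               /* interior points per processor */

    /* exchange ghost cells u[0] and u[M+1] with the left/right neighbors */
    static void exchange_ghosts(double u[M + 2], int rank, int p, MPI_Comm comm) {
        int left  = (rank == 0)     ? MPI_PROC_NULL : rank - 1;
        int right = (rank == p - 1) ? MPI_PROC_NULL : rank + 1;

        /* send my first interior point left, receive my right ghost from the right */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  0,
                     &u[M + 1], 1, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
        /* send my last interior point right, receive my left ghost from the left */
        MPI_Sendrecv(&u[M], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);
    }

    int main(int argc, char **argv) {
        int rank, p;
        double u[M + 2] = {0.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        for (int step = 0; step < 10; step++) {
            exchange_ghosts(u, rank, p, MPI_COMM_WORLD);
            /* ... update the interior points u[1..M] with the finite difference scheme ... */
        }

        MPI_Finalize();
        return 0;
    }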
(7.10)  T(p, M) = \frac{M}{p} t_{comp} + 2 t_{comm}

(7.11)  S(p, M) = \frac{T(1, M)}{T(p, M)} = \frac{p}{1 + \dfrac{2p}{M}\dfrac{t_{comm}}{t_{comp}}}

(7.12)  h(p, M) = \frac{2p}{M}\,\frac{t_{comm}}{t_{comp}}
It is quite common for algorithms for solving PDEs to have such overhead
dependencies on granularity and communication-to-computation ratio.
2D Wave Equation
Consider the following wave equation
with proper BC and IC. It’s very easy to solve this equation on a
sequential computer, but on a parallel computer, it’s a bit more complex.
We apply a finite difference discretization (central differences on both sides):
(7.14)  \frac{u_{ij}^{k+1} + u_{ij}^{k-1} - 2u_{ij}^k}{\Delta t^2} = \frac{u_{i+1,j}^k + u_{i-1,j}^k - 2u_{ij}^k}{\Delta x^2} + \frac{u_{i,j+1}^k + u_{i,j-1}^k - 2u_{ij}^k}{\Delta y^2}
(7.23)  m = (j - 1)X + i

(7.24)  \mathcal{A} u = B
\mathcal{A} = \begin{pmatrix} A_1 & I & \cdots & 0 & 0 \\ I & A_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & A_{Y-1} & I \\ 0 & 0 & \cdots & I & A_Y \end{pmatrix}_{Y \times Y}

A_i = \begin{pmatrix} -4 & 1 & \cdots & 0 & 0 \\ 1 & -4 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & -4 & 1 \\ 0 & 0 & \cdots & 1 & -4 \end{pmatrix}_{X \times X}
(7.30)  \mathcal{A} u = b

\mathcal{A} = \begin{pmatrix} A^1 & I & \cdots & 0 \\ I & A^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A^Z \end{pmatrix}_{Z \times Z}

A^k = \begin{pmatrix} A_1^k & I & \cdots & 0 \\ I & A_2^k & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_Y^k \end{pmatrix}_{Y \times Y}

A_j^k = \begin{pmatrix} -6 & 1 & \cdots & 0 \\ 1 & -6 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & -6 \end{pmatrix}_{X \times X}
(7.31)  u_{xx} + u_{yy} + u_{zz} + \alpha^2 u = 0, \quad \text{with proper BC.}
(2) The force matrix defines the forces between all pairs of particles or molecules:

\mathcal{F} = \begin{pmatrix} f_{11} & \cdots & f_{1N} \\ \vdots & \ddots & \vdots \\ f_{N1} & \cdots & f_{NN} \end{pmatrix}

and it has the following important properties:
a) it is anti-symmetric, as a result of Newton's 3rd law;
b) the diagonal elements are all zero (no self-interaction);
c) the sum of one complete row is the force acting on one particle by all others;
d) the sum of one complete column is the force exerted by one particle on all others;
e) the matrix is dense, and the complexity of obtaining the force is O(N^2), if the particles are involved in long-ranged interactions such as Coulomb's force;
f) the matrix is sparse, and the complexity of obtaining the force is O(N), if the particles are involved in short-ranged interactions such as bonded forces only.
There are many ways to decompose the system. The two most popular
ways are: particle decomposition and spatial decomposition. A 3rd way,
not as common, considers force decomposition. Interaction ranges (short-
, medium-, long-) determine the decomposition methods:
Calculation of forces

A typical force function for protein simulation is, for example, the CHARMM force field:
(7.35)  V = \sum_{\text{bonds}} k_b (b - b_0)^2 + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2 + \sum_{\text{dihedrals}} k_\phi \left[1 + \cos(n\phi - \delta)\right] + \sum_{\text{impropers}} k_\omega (\omega - \omega_0)^2 + \sum_{\text{Urey-Bradley}} k_u (u - u_0)^2 + \sum_{\text{nonbonded}} \left\{ \epsilon \left[ \left( \frac{R_{min,ij}}{r_{ij}} \right)^{12} - \left( \frac{R_{min,ij}}{r_{ij}} \right)^{6} \right] + \frac{q_i q_j}{\epsilon r_{ij}} \right\}
Term-2 accounts for the bond angles, where k_θ is the angle force constant and θ − θ_0 is the deviation of the angle from its equilibrium value between three bonded atoms.
Term-3 is for the dihedrals (a.k.a. torsion angles) where 𝑘0 is the dihedral
force constant, 𝑛 is the multiplicity of the function, 𝜙 is the dihedral
angle and 𝛿 is the phase shift.
Term-4 accounts for the impropers, that is out of plane bending, where
𝑘3 is the force constant and 𝜔 − 𝜔& is the out of plane angle. The Urey-
Bradley component (cross-term accounting for angle bending using 1,3
non-bonded interactions) comprises Term-5 where 𝑘5 is the respective
force constant and 𝑢 is the distance between the 1,3 atoms in the
harmonic potential.
(7.36)  V_{LJ}(r) = 4\epsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right]
where \epsilon is the depth of the potential well, \sigma is the (finite) distance at which the inter-particle potential is zero, and r is the distance between the particles. These parameters can be fitted to reproduce experimental data or to mimic accurate quantum chemistry calculations. The r^{-12} term describes the Pauli repulsion at short range due to overlapping electron orbitals, and the r^{-6} term describes the attraction at long range, such as the van der Waals or dispersion force.
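As a small illustration (an addition, not from the original text), here is a C function for the Lennard-Jones pair force magnitude obtained by differentiating Equation (7.36), F(r) = -dV/dr = (24\epsilon/r)\,[2(\sigma/r)^{12} - (\sigma/r)^{6}]; the caller supplies the parameter values.

    #include <math.h>

    /* Lennard-Jones pair force magnitude F(r) = -dV/dr for Eq. (7.36);
       a positive value means repulsion along the line joining the two particles */
    double lj_force(double r, double epsilon, double sigma) {
        double sr6  = pow(sigma / r, 6);      /* (sigma/r)^6  */
        double sr12 = sr6 * sr6;              /* (sigma/r)^12 */
        return 24.0 * epsilon * (2.0 * sr12 - sr6) / r;
    }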
The last term, the long-ranged force in the form of Coulomb potential
between two charged particles,
1 Aziz, R.A., "A highly accurate interatomic potential for argon," J. Chem. Phys. (1993), p. 4518.
(7.37)  \vec{F}_{ij} = \frac{q_i q_j}{\left| \vec{X}_i - \vec{X}_j \right|^3} \left( \vec{X}_i - \vec{X}_j \right)

(7.38)  \vec{F}_i = \sum_{j \ne i} \frac{q_i q_j}{\left| \vec{X}_i - \vec{X}_j \right|^3} \left( \vec{X}_i - \vec{X}_j \right)
Estimating the forces costs about 90% of the total MD computing time. However, the solution of the equations of motion is of particular applied-mathematics interest, as it involves solving a system of ODEs.
Consider the ODE

y' = f(t, y).

The classical fourth-order Runge-Kutta scheme advances the solution by

(7.41)  y(t + h) = y(t) + \frac{h}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right)

        k_1 = f(t, y)
        k_2 = f\!\left(t + \tfrac{h}{2},\; y + \tfrac{h}{2} k_1\right)
        k_3 = f\!\left(t + \tfrac{h}{2},\; y + \tfrac{h}{2} k_2\right)
        k_4 = f(t + h,\; y + h k_3)
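A compact C sketch of this RK4 step for a scalar ODE (an illustration, not the author's code); rk4_step can be called repeatedly to integrate y' = f(t, y).

    /* one classical RK4 step of size h for y' = f(t, y), Eq. (7.41) */
    double rk4_step(double (*f)(double, double), double t, double y, double h) {
        double k1 = f(t, y);
        double k2 = f(t + 0.5 * h, y + 0.5 * h * k1);
        double k3 = f(t + 0.5 * h, y + 0.5 * h * k2);
        double k4 = f(t + h, y + h * k3);
        return y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
    }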
Chapter 8
Fourier Transforms
(8.2)  f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} g(\omega)\, e^{i\omega t}\, d\omega
8.2 Discrete Fourier Transforms
(8.4)  x_j = \mathcal{F}^{-1}(X) = \frac{1}{N} \sum_{k=0}^{N-1} X_k\, e^{2\pi i jk/N}, \quad \forall j = 0, 1, 2, \ldots, N-1
It is obvious that the complex numbers X_k represent the amplitude and phase of the different sinusoidal components of the input "signal" x_n. The DFT computes the X_k from the x_n, while the IDFT shows how to compute the x_n as a sum of sinusoidal components \frac{1}{N} X_k\, e^{2\pi i k n / N} with frequency k/N cycles per sample.
n_l = 0, 1, \ldots, N_l - 1, \quad \forall\, l \in [1, d]

8.3 Fast Fourier Transforms

Simply put, the complexity of the FFT can be estimated this way. One FT(M → M) can be converted to two shortened FTs of the form FT(M/2 → M/2) through
There are many FFT algorithms and the following is a short list:
1 James W. Cooley (born 1926) received an M.A. in 1951 and a Ph.D. in 1961 in applied mathematics, both from Columbia University. He was a programmer on John von Neumann's computer at the Institute for Advanced Study at Princeton (1953–56), and retired from IBM in 1991.
More specifically, the Radix-2 DIT algorithm rearranges the DFT of the function x_n into two parts: a sum over the even-numbered indices n = 2m and a sum over the odd-numbered indices n = 2m + 1:

(8.7)  X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{-\frac{2\pi i}{N}(2m)k} + \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{-\frac{2\pi i}{N}(2m+1)k}
First, we divide the transform into odd and even parts (assuming M is even):

X_k = E_k + \exp\!\left(\frac{2\pi i k}{M}\right) O_k

and

X_{k+M/2} = E_k - \exp\!\left(\frac{2\pi i k}{M}\right) O_k

Next, we recursively transform E_k and O_k in the same way. We iterate until only one point is left for the Fourier transform; in fact, a one-point transform needs no work at all.
Function Y = FFT(X, n)
    if (n == 1)
        Y = X
    else
        E = FFT({X[0], X[2], ..., X[n-2]}, n/2)
        O = FFT({X[1], X[3], ..., X[n-1]}, n/2)
        for j = 0 to n-1
            Y[j] = E[j mod (n/2)]
                   + exp(-2*PI*i*j/n) * O[j mod (n/2)]
        end for
    end if

Figure 8.2: Pseudo code for the serial FFT.
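For readers who want to run the radix-2 scheme of Figure 8.2 directly, here is a compact recursive C version using C99 complex arithmetic (an illustration added here, not the author's code); it assumes n is a power of two and allocates scratch space on each call for clarity rather than speed.

    #include <complex.h>
    #include <stdlib.h>

    #define PI 3.14159265358979323846

    /* recursive radix-2 DIT FFT of x[0..n-1], n a power of two; result overwrites x */
    void fft(double complex *x, int n) {
        if (n == 1) return;
        double complex *e = malloc((n / 2) * sizeof(double complex));
        double complex *o = malloc((n / 2) * sizeof(double complex));
        for (int j = 0; j < n / 2; j++) {        /* split into even and odd samples */
            e[j] = x[2 * j];
            o[j] = x[2 * j + 1];
        }
        fft(e, n / 2);
        fft(o, n / 2);
        for (int k = 0; k < n / 2; k++) {        /* combine with the twiddle factors */
            double complex w = cexp(-2.0 * PI * I * k / n) * o[k];
            x[k]         = e[k] + w;
            x[k + n / 2] = e[k] - w;
        }
        free(e);
        free(o);
    }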
General case:
8.4 Simple Parallelization

        P_0   P_1   P_2   P_3
         0     1     2     3
         4     5     6     7
         8     9    10    11
        12    13    14    15

Figure 8.3: The scatter partition for performing a 16-point FFT on 4 processors.
This algorithm, although simple and closely aligned with the original serial algorithm, is in fact quite efficient, especially when the number of data points far exceeds the number of processors. Communication happens log P times in the whole calculation, and in each communication every processor sends and receives N/P data.
8.6 Complexity analysis for FFT
First, divide the transform into odd and even parts (assuming M is even). We then have

(8.9)   X_k = E_k + \exp\!\left(\frac{2\pi i k}{M}\right) O_k, \quad k = 0, 1, \ldots, M'

(8.10)  X_{k+M/2} = E_k - \exp\!\left(\frac{2\pi i k}{M}\right) O_k, \quad k = 0, 1, \ldots, M'
(8.15)  T(M) = 2T\!\left(\frac{M}{2}\right) + M\left(T_{pm} + \frac{T_{mult}}{2}\right)

Iteratively, we have

(8.16)  T(M) = \left(T_{pm} + \frac{T_{mult}}{2}\right) M \log M
Let us define
For simple FFT: We can decompose the task by physical space, Fourier
space, or both. In any of these cases, we can always obtain perfect speed
up, as this problem is an embarrassingly parallel (EP) problem.
Case I (M > P):

(8.19)  T(P, M) = 2T\!\left(P, \frac{M}{2}\right) + \frac{T_+}{2}\,\frac{M}{P}

(8.20)          = \frac{M}{P}\, T(P, P) + \frac{T_+}{2}\,\frac{M}{P} \log\frac{M}{P}

Case II (M ≤ P):

(8.21)  T(P, M) = T\!\left(P, \frac{M}{2}\right) + T_- = T_- \log M

Substituting this equation into the above, we get

(8.22)  T(P, M) = \frac{M}{2P}\left( 2T_- \log P + T_+ \log\frac{M}{P} \right)

(8.23)          = \frac{M}{2P}\left[ T_+ \log M + (2T_- - T_+) \log P \right]

Therefore, the speedup is

(8.24)  S(P, M) = \frac{P}{1 + \dfrac{2T_- - T_+}{T_+}\,\dfrac{\log P}{\log M}}

and the overhead is

(8.25)  h(P, M) = \frac{2T_- - T_+}{T_+}\,\frac{\log P}{\log M}
Remarks
1. Fourier Transforms
g(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} f(t) \exp(-j\omega t)\, dt

f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} g(\omega) \exp(+j\omega t)\, d\omega
Related topics:
Fourier series
Fourier transforms
List of applications:
• Data mining
Comments:
• The DFT computes the X_k from the x_n, while the IDFT shows how to compute the x_n as a sum of sinusoidal components \frac{1}{N} X_k \exp\!\left(\frac{2\pi i k n}{N}\right) with frequency k/N cycles per sample.
More Remarks:
• The fact that the input to the DFT is a finite sequence of real or
complex numbers makes the DFT ideal for processing information
stored in computers.
x(n_1, n_2, \ldots, n_d), \quad n_l = 0, 1, \ldots, N_l - 1, \quad \forall\, l \in [1, 2, \ldots, d]
FFT is an efficient algorithm to compute DFT and its inverse. There are
many distinct FFT algorithms involving a wide range of mathematics,
from simple complex-number arithmetic to group theory and number
theory.
• An FFT can get the same result in only O(N log N) operations. The difference in speed can be substantial, especially for long data sets where N = 10^3 to 10^6; the computation time can be reduced by several orders of magnitude, with the reduction roughly proportional to N / log N.
• Since the inverse DFT is the same as the DFT, but with the opposite
sign in the exponent and a 1/N factor, any FFT algorithm can easily
be adapted for it.
• FFTW "Fastest Fourier Transform in the West" - 'C' library for the
discrete Fourier transform (DFT) in one or more dimensions.
A dose of history:
Radix-2 DIT first computes the DFTs of the even-indexed inputs x_{2m} (x_0, x_2, \ldots, x_{N-2}) and of the odd-indexed inputs x_{2m+1} (x_1, x_3, \ldots, x_{N-1}), and then combines those two results to produce the DFT of the whole sequence.
This idea can then be performed recursively to reduce the overall runtime
to 𝑂(𝑁 𝑙𝑜𝑔 𝑁). This simplified form assumes that N is a power of two;
since the number of sample points N can usually be chosen freely by the
application, this is often not an important restriction.
Pseudocode of FFT (combining the two half-size transforms in place):
if N = 1 then
  Y[0] ← X[0]
else
  Y[0..N/2-1] ← FFT({X[0], X[2], ..., X[N-2]}, N/2)
  Y[N/2..N-1] ← FFT({X[1], X[3], ..., X[N-1]}, N/2)
  for k = 0 to N/2-1
    t ← Y[k]
    Y[k]     ← t + exp(-2*PI*i*k/N) * Y[k+N/2]
    Y[k+N/2] ← t - exp(-2*PI*i*k/N) * Y[k+N/2]
  endfor
endif
General case:
Spectral analysis: When the DFT is used for spectral analysis, the {x_n} sequence represents a finite set of uniformly spaced time-samples of some signal.
Polynomial multiplication:
X_k = \sum_{j=1}^{M} X_j \exp\left(\frac{2\pi i\, jk}{M}\right) \quad \forall\, k = 1, 2, \ldots, N
\begin{pmatrix} X_{1,k=1} & \cdots & X_{M,k=1} \\ \vdots & \ddots & \vdots \\ X_{1,k=N} & \cdots & X_{M,k=N} \end{pmatrix}
Let us define a new symbol T_+ = 2T_{PM} + T_{mult}; the single-processor time then becomes T(M) = (T_+/2)\, M \log M.

DFT(M×M) on P processors:
Case II (M ≤ P):

T(P, M) = T\!\left(P, \frac{M}{2}\right) + T_- = T_- \log M

Substituting this into Case I gives

T(P, M) = \frac{M}{2P}\left[2T_- \log P + T_+ \log\frac{M}{P}\right] = \frac{M}{2P}\left[T_+ \log M + (2T_- - T_+) \log P\right]
The speedup is

S(P, M) = \frac{P}{1 + \dfrac{2T_- - T_+}{T_+}\,\dfrac{\log P}{\log M}}

The overhead is

h(P, M) = \frac{2T_- - T_+}{T_+}\,\frac{\log P}{\log M}
We can estimate T_- and T_+ in terms of the traditional T_comm and T_comp.

Remarks: h(P, M) is proportional to log P / log M, which depends on P and M much more weakly than P/M does, making it difficult to drive the overhead down (i.e., to parallelize efficiently).
Chapter 9
Optimization
9.4.1 Basics
• Simplest Monte Carlo Method—Metropolis method
• Simulated Annealing—Why and How
• Where we stand
A typical cost function (in 1D) may look like what is pictured in Figure 9.1.
Figure 9.1 This is what a typical cost function might look like. 𝐸* < 𝐸+ < 𝐸' .
(9.1)  P(Y) = \min\left\{1,\ \exp\left(-\frac{E(Y) - E(X)}{T}\right)\right\}
where 𝑋 and 𝑌 are old and new states, respectively, 𝐸(𝑋) and 𝐸(𝑌) are the
associated cost functions, and 𝑇 is an algorithmic parameter called
temperature.
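As an illustrative sketch (not the author's code), a single Metropolis step implementing Eq. (9.1) in C; the quadratic cost function, the proposal step size, and the cooling schedule are assumptions made only for this example.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Example cost function (an assumption for illustration): a 1D quadratic. */
static double energy(double x) { return x * x; }

/* Propose a new state by a small uniform random move. */
static double propose(double x)
{
    return x + 0.5 * (2.0 * rand() / RAND_MAX - 1.0);
}

/* One Metropolis step at temperature T, following Eq. (9.1). */
static double metropolis_step(double x, double T)
{
    double y  = propose(x);
    double dE = energy(y) - energy(x);
    double p  = (dE <= 0.0) ? 1.0 : exp(-dE / T);   /* acceptance probability */
    double u  = (double)rand() / RAND_MAX;          /* uniform in [0,1]       */
    return (u < p) ? y : x;                         /* accept or keep old X   */
}

int main(void)
{
    double x = 5.0, T = 1.0;
    for (int i = 0; i < 10000; i++) {
        x = metropolis_step(x, T);
        T *= 0.999;                                 /* simple annealing schedule */
    }
    printf("final state %g, energy %g\n", x, energy(x));
    return 0;
}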
Chapter 10
Applications
For the former, we typically obtain partial differential equations (PDEs), which are usually nonlinear, and ordinary differential equations (ODEs), which are in many cases stiff. These equations are largely solved by finite-difference, finite-element, or particle methods, or their hybrids. For the latter, we usually face problems with multiple local minima, for which Monte Carlo methods often prove advantageous.
10.1 Newton’s Equation and Molecular Dynamics
(10.1)  F = ma + v\,\frac{dm}{dt}
In molecular dynamics and N-body problems, or any models that involve
“particles,”
(10.2)  m_i\,\frac{d^2 x_i}{dt^2} = f_i(x_1, x_2, \ldots, x_N), \quad \forall\, i \le N
Main issues:
Remarks:
Depending on the level of resolution required, one could apply 2nd-, 4th-, or 6th-order ODE solvers.
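For instance, a 2nd-order leapfrog (kick-drift-kick) step can be sketched in C as follows (an illustration, not from the original text); the harmonic force routine is a hypothetical placeholder for the expensive force evaluation discussed below.

#include <stdio.h>

/* Hypothetical 1D force for illustration: a harmonic spring f = -k x, k = 1. */
static double force(double x) { return -x; }

/* One leapfrog (kick-drift-kick) step of size dt for a particle of mass m. */
static void leapfrog_step(double *x, double *v, double m, double dt)
{
    *v += 0.5 * dt * force(*x) / m;   /* half kick */
    *x += dt * (*v);                  /* drift     */
    *v += 0.5 * dt * force(*x) / m;   /* half kick */
}

int main(void)
{
    double x = 1.0, v = 0.0, m = 1.0, dt = 0.01;
    for (int i = 0; i < 1000; i++)
        leapfrog_step(&x, &v, m, dt);
    printf("x = %g, v = %g\n", x, v);
    return 0;
}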
1. Particle-particle (PP)
2. Particle-mesh (PM)
3. Multipole method
For long-range interactions the complexity is O(N^2), and there is no good method to improve this bound except to let every particle interact with every other particle. For short-range interactions, the complexity is C × N, where C is a constant dependent on the range of interaction and the method used to compute the forces.
The cost is so high that the only hope to solve these problems must lie in
parallel computing. So, designing an efficient and scalable parallel MD
algorithm is of great concern. The MD problems we are dealing with are
classical N-body problems, i.e., to solve the following generally nonlinear ordinary differential equations (ODEs):
(10.4)  m_i\,\frac{d^2 x_i}{dt^2} = \sum_{j} f_{ij}(x_i, x_j) + \sum_{j,k} g_{ijk}(x_i, x_j, x_k) + \cdots, \quad i = 1, 2, \ldots, N
where m_i is the mass of particle i, x_i is its position, f_ij(·,·) is a two-body force, and g_ijk(·,·,·) is a three-body force. The boundary conditions and initial conditions are properly given. To keep our study simple, we only consider two-body interactions, so g_ijk = 0.
(10.5)  X_{new} = X_{old} + \Delta\!\left(X_{old}, \frac{dX_{old}}{dt}, \ldots\right)

The entire problem is reduced to computing

(10.6)  \Delta\!\left(X_{old}, \frac{dX_{old}}{dt}, \ldots\right)
Figure 10.1: The force matrix and force vector. The force matrix element f_ij is the force on particle i exerted by particle j. Adding up the elements in row i, we get the total force acting on particle i by all other particles.
In fact, the core of the calculation is the computation of the force. Typically, solving the above equation, if the force terms are known, costs less than 5% of the total time; Runge-Kutta, Predictor-Corrector, or leapfrog methods are normally used for this purpose. More than 95% of the time is spent on computing the force terms, so we concentrate our parallel algorithm design on the force calculation.
There are two types of interactions: long range and short range forces. Long range interactions often occur in gases, while short range interactions are common in solids and liquids. The solutions of these two types of problems are quite different. For long range forces, the cost is typically O(N^2) and there are not many choices of schemes; for short range forces, the cost is generally O(N) and there exists a variety of methods.
Long-Ranged Interactions
N-body problems and plasma under Coulomb’s interactions.
(10.7)  F_i = \sum_{j \ne i} f_{ij}
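As an illustration of Eq. (10.7) (a sketch, not from the original text), the direct O(N^2) force summation for a Coulomb-like pair force can be written in C as follows; the flattened coordinate layout and the charge array are assumptions made for this example.

#include <math.h>

/* Direct O(N^2) force summation: f_ij = q_i q_j (r_i - r_j) / |r_i - r_j|^3.
   x[3*i..3*i+2] holds the position of particle i; f is overwritten with the
   total force on each particle. */
void compute_forces(int n, const double *x, const double *q, double *f)
{
    for (int i = 0; i < 3 * n; i++) f[i] = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = x[3*i]   - x[3*j];
            double dy = x[3*i+1] - x[3*j+1];
            double dz = x[3*i+2] - x[3*j+2];
            double r2  = dx*dx + dy*dy + dz*dz;
            double fac = q[i] * q[j] / (r2 * sqrt(r2));
            /* Newton's third law: accumulate on both i and j */
            f[3*i]   += fac * dx;   f[3*j]   -= fac * dx;
            f[3*i+1] += fac * dy;   f[3*j+1] -= fac * dy;
            f[3*i+2] += fac * dz;   f[3*j+2] -= fac * dz;
        }
    }
}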
Performance Analysis
First, we define some parameters. Let t_long be the time it takes to move one particle from one processor to its neighbor, and let t_pair be the time it takes to compute one pairwise interaction.
and

(10.10)  S(p, N) = \frac{p}{1 + \frac{1}{n}\,\frac{t_{long}}{t_{pair}}}

(10.11)  h(p, N) = \frac{1}{n}\,\frac{t_{long}}{t_{pair}}
Remarks:
• The factor 1/n is important for reducing the overhead.
• The ratio t_long/t_pair again controls the overhead.
• For long range interactions, h(p, N) is typically very small.
Short-Ranged Interactions
For short range interactions, the algorithm is very different. Basically, two major techniques (applicable to both serial and parallel computing) are widely used to avoid computing force terms that make a negligible contribution to the total force:
1. particle-in-cell (binning)
2. neighbor lists
Assume that particles separated by a distance less than r_c have a finite, non-zero interaction, while particles at distances larger than r_c do not interact.
Neighbor lists: The idea here is similar to the above, but more refined. Instead of creating a “list” for a cluster of particles, we create one for each particle: for particle i, we make a list of all particles in the system that can interact with i. Normally, particles do not move far within a time step, so to avoid recreating the lists every time step we record the particles in the sphere of radius r_c + ε instead of just r_c. This ε is an adjustable parameter (depending on the physics and computing resources).
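A minimal sketch of the neighbor-list construction in C (not from the original text), under the stated assumptions: particles interact within r_c, and the lists are padded by the skin ε so they can be reused until some particle has moved farther than ε/2.

#include <stdlib.h>

/* Build, for each particle, a list of neighbors within rc + eps.
   x, y, z are particle coordinates of length n. */
typedef struct {
    int  count;      /* number of neighbors of this particle   */
    int *index;      /* indices of the neighboring particles   */
} NeighborList;

void build_neighbor_lists(int n, const double *x, const double *y,
                          const double *z, double rc, double eps,
                          NeighborList *list)
{
    double r2 = (rc + eps) * (rc + eps);
    for (int i = 0; i < n; i++) {
        list[i].count = 0;
        list[i].index = malloc((size_t)n * sizeof(int));  /* worst case n-1 */
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
            if (dx*dx + dy*dy + dz*dz < r2)
                list[i].index[list[i].count++] = j;
        }
    }
}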
These two methods are very useful; I call them screening methods.
Step I: Screening.
Step II: Communicate border mesh particles to relevant processors.
Step III: Update particle positions in all processors.
Step IV: Scatter the new particles (same particles but new positions).
Remarks:
Algorithm:
Force Computing
1. Screening
2. Compute F_αβ and sum over all β to find the partial force on particles α exerted by particles β.
3. Processor P_αβ collects the partial forces from the row-α processors to compute the total force on particles α.
4. Update the N/p particles within group α.
5. Broadcast the new positions to all other processors.
6. Repeat steps 1-5 for the next time step.
Remarks:
4. Check the new positions of the particles which lie in the buffer zone at t_old.
5. Repeat steps 2-4.
Remarks:
10.1.3 MD Packages
10.2 Schrödinger’s Equations and Quantum Mechanics

(10.12)  -\frac{h^2}{8\pi^2 m}\nabla^2 \Psi(r, t) + V\Psi(r, t) = -\frac{h}{2\pi i}\,\frac{\partial \Psi(r, t)}{\partial t}
10.5 Diffusion Equation and Mechanical Engineering
(10.14)  \nabla \times E = -\frac{\partial B}{\partial t}, \qquad \nabla \cdot D = \rho, \qquad \nabla \times H = \frac{\partial D}{\partial t} + J, \qquad \nabla \cdot B = 0
(10.15)  -\Delta u + \lambda = f

(10.16)  \Delta u = \frac{\partial u}{\partial t}
Topics include structural analysis, combustion simulation, and vehicle
simulation.
(10.17)  \frac{\partial u}{\partial t} + (u \cdot \nabla)u = -\frac{1}{\rho}\nabla p + \gamma \nabla^2 u + \frac{1}{\rho} F
10.7.1 Astronomy
Data volumes generated by Very Large Array or Very Long Baseline Array
radio telescopes currently overwhelm the available computational
resources. Greater computational power will significantly enhance their
usefulness in exploring important problems in radio astronomy.
10.7.4 Geosciences
Topics include oil and seismic exploration, enhanced oil and gas recovery.
Enhanced Oil and Gas Recovery: This challenge has two parts. First, one
needs to locate the estimated billions of barrels of oil reserves on the
earth and then to devise economic ways of extracting as much of it as
possible. Thus, improved seismic analysis techniques, in addition to an improved understanding of fluid flow through geological structures, are required.
10.7.5 Meteorology
Topics include prediction of weather, climate, typhoons, and global change. The aim here is to understand the coupled atmosphere-ocean-biosphere system in enough detail to be able to make long-range predictions.
10.7.6 Oceanography
Ocean Sciences: The objective is to develop a global ocean predictive
model incorporating temperature, chemical composition, circulation, and
coupling to the atmosphere and other oceanographic features. This ocean
model will couple to models of the atmosphere in the effort on global weather prediction, and it will have specific implications for physical oceanography as well.
Appendix A
MPI
Message Passing: This is the method by which data from one processor’s
memory is copied to the memory of another processor. In
distributed-memory systems, data is generally sent as packets of
information over a network from one processor to another. A
message may consist of one or more packets, and usually includes
routing and/or other control information.
Send/Receive: Send operations require the sending process to specify the data’s location, size, type, and the destination. Receive operations should match a corresponding send operation.
Application Buffer: The address space that holds the data which is to be
sent or received. For example, suppose your program uses a variable
called inmsg. The application buffer for inmsg is the program
memory location where the value of inmsg resides.
System Buffer: System space for storing messages. Depending upon the type of send/receive operation, data in the application buffer may need to be copied to/from system buffer space. This allows communication to be asynchronous.
Communicators and groups will be covered in more detail later. For now, simply use MPI_COMM_WORLD whenever a communicator is required; it is the predefined communicator which includes all of your MPI processes.
Rank: Within a communicator, every process has its own unique, integer
identifier assigned by the system when the process initializes. A rank
is sometimes also called a “process ID.” Ranks are contiguous and
begin at zero. In addition, it is used by the programmer to specify
the source and destination of messages, and is often used
conditionally by the application to control program execution. For
example,
rank = 0: do this;  rank = 1: do that.
Message Attributes:
1. The envelope
   a) Rank of destination
   b) Message tag
      - An ID for a particular message, to be matched by both sender and receiver.
      - It is like sending multiple gifts to your friend; you need to identify them.
      - Tags range from 0 up to MPI_TAG_UB (at least 32767).
      - Similar in functionality to “comm” for grouping messages.
      - “comm” is safer than “tag”, but “tag” is more convenient.
   c) Communicator
2. The data
   a) Initial address of send buffer
   b) Number of entries to send
   c) Datatype of each entry, e.g., MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION,
      MPI_COMPLEX, MPI_LOGICAL, MPI_BYTE (Fortran); MPI_INT, MPI_CHAR,
      MPI_FLOAT, MPI_DOUBLE (C)
Header File
C:       #include "mpi.h"
Fortran: include 'mpif.h'
Program Structure
The general structures are the same in both language families. These
four components must be included in the program:
MPI_Init
MPI_Init(*argc,*argv)
MPI_INIT(ierr)
MPI_Comm_size
MPI_Comm_size(comm,*size)
MPI_COMM_SIZE(comm,size,ierr)
MPI_Comm_rank
MPI_Comm_rank determines the rank of the calling process within the communicator. Initially, each process will be assigned a unique integer rank between 0 and P−1 within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well.
MPI_Comm_rank(comm,*rank)
MPI_COMM_RANK(comm,rank,ierr)
MPI_Abort
MPI_Abort(comm,errorcode)
MPI_ABORT(comm,errorcode,ierr)
MPI_Get_processor_name
This gets the name of the processor on which the command is executed.
It also returns the length of the name. The buffer for name must be at
least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into name is implementation dependent; it may not be the same as the output of the hostname or host shell commands.
MPI_Get_processor_name(*name,*resultlength)
MPI_GET_PROCESSOR_NAME(name,resultlength,ierr)
MPI_Initialized
MPI_Initialized(*flag)
MPI_INITIALIZED(flag,ierr)
MPI_Wtime
MPI_Wtime()
MPI_WTIME()
MPI_Wtick
MPI_Wtick()
MPI_WTICK()
MPI_Finalize
MPI_Finalize()
MPI_FINALIZE(ierr)
Examples: Figure A.3 and Figure A.4 provide some simple examples of environment management routine calls.
The more commonly used MPI blocking message passing routines are described below.
#include"mpi.h"
#include<stdio.h>
rc = MPI_Init(&argc, &argv);
if (rc != 0){
printf("Error starting MPI program. Terminating.\n");
MPI_Abort(MPI_COMM_WORLD, rc);
}
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf("Number of tasks= %d My rank= %d\n", numtasks, rank);
MPI_Finalize();
return 0;
}
program simple
include 'mpif.h'
integer numtasks, rank, ierr
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_FINALIZE(ierr)
end
Data Type: For reasons of portability, MPI predefines its data types.
Programmers may also create their own data types (derived types).
Note that the MPI types MPI_BYTE and MPI_PACKED do not correspond to standard C or Fortran data types.
Status: For a receive operation, this indicates the source of the message
and the tag of the message. In C, this argument is a pointer to a
predefined structure MPI_Status (ex. stat.MPI_SOURCE
stat.MPI_TAG). In Fortran, it is an integer array of size
MPI_STATUS_SIZE (ex. stat(MPI_SOURCE), stat(MPI_TAG)). Additionally, the actual number of bytes received is obtainable from Status via the MPI_Get_count routine.
MPI_COMPLEX complex
MPI_LOGICAL logical
MPI_Recv
Receive a message and block until the requested data is available in the
application buffer in the receiving task.
MPI_Recv(*buf,count,datatype,source,tag,comm,*status)
MPI_RECV(buf,count,datatype,source,tag,comm,status,ierr)
MPI_Ssend
MPI_Ssend(*buf,count,datatype,dest,tag,comm)
MPI_SSEND(buf,count,datatype,dest,tag,comm,ierr)
MPI_Bsend
MPI_Bsend(*buf,count,datatype,dest,tag,comm)
MPI_BSEND(buf,count,datatype,dest,tag,comm,ierr)
MPI_Buffer_attach, MPI_Buffer_detach
MPI_Buffer_attach(*buffer,size)
MPI_Buffer_detach(*buffer,size)
MPI_BUFFER_ATTACH(buffer,size,ierr)
MPI_BUFFER_DETACH(buffer,size,ierr)
MPI_Rsend
MPI_Rsend(*buf,count,datatype,dest,tag,comm)
MPI_RSEND(buf,count,datatype,dest,tag,comm,ierr)
MPI_Sendrecv
Send a message and post a receive before blocking. This will block until
the sending application buffer is free for reuse and until the receiving
application buffer contains the received message.
MPI_Sendrecv(*sendbuf,sendcount,sendtype,dest,sendtag,
*recv_buf,recvcount,recvtype,source,recvtag,comm,*status)
MPI_SENDRECV(sendbuf,sendcount,sendtype,dest,sendtag,
recvbuf,recvcount,recvtype,source,recvtag, comm,status,ierr)
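As an illustrative sketch (not from the original text), MPI_Sendrecv can be used for a deadlock-free ring shift in which every rank sends its own rank number to the right and receives from the left:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, left, right, recvd;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    right = (rank + 1) % size;
    left  = (rank - 1 + size) % size;
    /* Send to the right neighbor, receive from the left neighbor. */
    MPI_Sendrecv(&rank, 1, MPI_INT, right, 0,
                 &recvd, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, &status);
    printf("rank %d received %d\n", rank, recvd);
    MPI_Finalize();
    return 0;
}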
MPI_Probe
MPI_Probe(source,tag,comm,*status)
MPI_PROBE(source,tag,comm,status,ierr)
The more commonly used MPI non-blocking message passing routines are
described below.
MPI_Isend
MPI_Isend(*buf,count,datatype,dest,tag,comm,*request)
MPI_ISEND(buf,count,datatype,dest,tag,comm,request,ierr)
MPI_Irecv
MPI_Irecv(*buf,count,datatype,source,tag,comm,*request)
MPI_IRECV(buf,count,datatype,source,tag,comm,request,ierr)
#include "mpi.h"
#include <stdio.h>
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COOM_World, &rank);
if (rank == 0){
dest = 1;
source = 1;
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag,
MPI_COMM_WORLD);
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag,
MPI_COMM_WORLD, &Stat);
}
MPI_Finalize();
return 0;
}
Figure A.5: A simple example of Blocking message passing in C. Task 0 pings task
1 and awaits a return ping.
MPI_Issend
MPI_Issend(*buf,count,datatype,dest,tag,comm,*request)
MPI_ISSEND(buf,count,datatype,dest,tag,comm,request,ierr)
program ping
include 'mpif.h'
integer rank, stat(MPI_STATUS_SIZE)
character inmsg, outmsg
tag = 1
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
if (rank .eq. 0) then
dest = 1
source = 1
outmsg = 'x'
call MPI_SEND(outmsg, 1, MPI_CHARACTER, dest, tag, MPI_COMM_WORLD, ierr)
call MPI_RECV(inmsg, 1, MPI_CHARACTER, source, tag, MPI_COMM_WORLD, stat, ierr)
else if (rank .eq. 1) then
dest = 0
source = 0
call MPI_RECV(inmsg, 1, MPI_CHARACTER, source, tag, MPI_COMM_WORLD, stat, ierr)
call MPI_SEND(outmsg, 1, MPI_CHARACTER, dest, tag, MPI_COMM_WORLD, ierr)
endif
call MPI_FINALIZE(ierr)
end
MPI_Ibsend
MPI_Ibsend(*buf,count,datatype,dest,tag,comm,*request)
MPI_IBSEND(buf,count,datatype,dest,tag,comm,request,ierr)
MPI_Irsend
MPI_Irsend(*buf,count,datatype,dest,tag,comm,*request)
MPI_IRSEND(buf,count,datatype,dest,tag,comm,request,ierr)
MPI_Test(*request,*flag,*status)
MPI_Testany(count,*array_of_requests,*index,*flag,*status)
MPI_Testall(count,*array_of_requests,*flag,
*array_of_statuses)
MPI_Testsome(incount,*array_of_requests,*outcount,
*array_of_offsets, *array_of_statuses)
MPI_TEST(request,flag,status,ierr)
MPI_TESTANY(count,array_of_requests,index,flag,status,ierr)
MPI_TESTALL(count,array_of_requests,flag,array_of_statuses,
ierr)
MPI_TESTSOME(incount,array_of_requests,outcount,
array_of_offsets, array_of_statuses,ierr)
MPI_Wait(*request,*status)
MPI_Waitany(count,*array_of_requests,*index,*status)
MPI_Waitall(count,*array_of_requests,*array_of_statuses)
MPI_Waitsome(incount,*array_of_requests,*outcount,
*array_of_offsets,*array_of_statuses)
MPI_WAIT(request,status,ierr)
MPI_WAITANY(count,array_of_requests,index,status,ierr)
MPI_WAITALL(count,array_of_requests,array_of_statuses,ierr)
MPI_WAITSOME(incount,array_of_requests,outcount,
array_of_offsets, array_of_statuses,ierr)
#include "mpi.h"
#include <stdio.h>
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
prev = rank-1;
next = rank+1;
if (rank == 0) prev = numtasks – 1;
if (rank == (numtasks – 1)) next = 0;
MPI_Finalize();
return 0;
}
Figure A.7: A simple example of non-blocking message passing in C, this code
represents a nearest neighbor exchange in a ring topology.
program ringtopo
include 'mpif.h'
integer numtasks, rank, next, prev, buf(2), tag1, tag2, ierr
integer reqs(4), stats(MPI_STATUS_SIZE,4)
tag1 = 1
tag2 = 2
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
prev = rank - 1
next = rank + 1
if (rank .eq. 0) prev = numtasks - 1
if (rank .eq. numtasks-1) next = 0
call MPI_IRECV(buf(1), 1, MPI_INTEGER, prev, tag1, MPI_COMM_WORLD, reqs(1), ierr)
call MPI_IRECV(buf(2), 1, MPI_INTEGER, next, tag2, MPI_COMM_WORLD, reqs(2), ierr)
call MPI_ISEND(rank, 1, MPI_INTEGER, prev, tag2, MPI_COMM_WORLD, reqs(3), ierr)
call MPI_ISEND(rank, 1, MPI_INTEGER, next, tag1, MPI_COMM_WORLD, reqs(4), ierr)
call MPI_WAITALL(4, reqs, stats, ierr)
call MPI_FINALIZE(ierr)
end
MPI_Iprobe
MPI_Iprobe(source,tag,comm,*flag,*status)
MPI_IPROBE(source,tag,comm,flag,status,ierr)
MPI_Barrier
MPI_Barrier(comm)
MPI_BARRIER(comm,ierr)
MPI_Bcast
Broadcasts (sends) a message from the process with rank “root” to all
other processes in the group.
MPI_Bcast(*buffer,count,datatype,root,comm)
MPI_BCAST(buffer,count,datatype,root,comm,ierr)
MPI_Scatter
MPI_Scatter(*sendbuf,sendcnt,sendtype,*recvbuf,recvcnt,
recvtype,root,comm)
MPI_SCATTER(sendbuf,sendcnt,sendtype,recvbuf,recvcnt,
recvtype,root,comm,ierr)
MPI_Gather
MPI_Gather(*sendbuf,sendcnt,sendtype,*recvbuf,recvcount,
recvtype,root,comm)
MPI_GATHER(sendbuf,sendcnt,sendtype,recvbuf,recvcount,
recvtype,root,comm,ierr)
MPI_Allgather
MPI_Allgather(*sendbuf,sendcount,sendtype,*recvbuf,
recvcount,recvtype,comm)
MPI_ALLGATHER(sendbuf,sendcount,sendtype,recvbuf,
recvcount,recvtype,comm,ierr)
MPI_Reduce
Applies a reduction operation on all tasks in the group and places the
result in one task.
MPI_Reduce(*sendbuf,*recvbuf,count,datatype,op,root,comm)
MPI_REDUCE(sendbuf,recvbuf,count,datatype,op,root,comm,ierr)
Table A.2: This table contains the predefined MPI reduction operations. Users can also define their own reduction functions by using the MPI_Op_create routine.
MPI_Allreduce
Applies a reduction operation and places the result in all tasks in the
group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.
MPI_Allreduce(*sendbuf,*recvbuf,count,datatype,op,comm)
MPI_ALLREDUCE(sendbuf,recvbuf,count,datatype,op,comm,ierr)
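For example (an illustrative sketch, not from the original text), a global sum in which every task contributes one value and every task obtains the total:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double local, total;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = (double)rank;                  /* each task's contribution */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d sees total %g\n", rank, total);
    MPI_Finalize();
    return 0;
}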
MPI_Reduce_scatter
MPI_Reduce_scatter(*sendbuf,*recvbuf,recvcount,
datatype,op,comm)
MPI_REDUCE_SCATTER(sendbuf,recvbuf,recvcount,
datatype,op,comm,ierr)
#include "mpi.h"
#include <stdio.h>
#define SIZE 4
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (numtasks == SIZE) {
source = 1;
sendcount = SIZE;
recvcount = SIZE;
MPI_Scatter(sendbuf, sendcount, MPI_FLOAT, recvbuf,
recvcount, MPI_FLOAT, source, MPI_COMM_WORLD);
MPI_Alltoall
MPI_Alltoall(*sendbuf,sendcount,sendtype,*recvbuf,
recvcnt,recvtype,comm)
MPI_ALLTOALL(sendbuf,sendcount,sendtype,recvbuf,
recvcnt,recvtype,comm,ierr)
MPI_Scan
MPI_Scan(*sendbuf,*recvbuf,count,datatype,op,comm)
MPI_SCAN(sendbuf,recvbuf,count,datatype,op,comm,ierr)
Examples: Figure A.10 and Figure A.11 provide some simple examples of
collective communications.
program scatter
include 'mpif.h'
integer SIZE
parameter (SIZE=4)
integer numtasks, rank, sendcount
integer recvcount, source, ierr
real*4 sendbuf(SIZE, SIZE), recvbuf(SIZE)
data sendbuf /1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
&             9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0/
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
if (numtasks .eq. SIZE) then
source = 1
sendcount = SIZE
recvcount = SIZE
call MPI_SCATTER(sendbuf, sendcount, MPI_REAL, recvbuf,
&                recvcount, MPI_REAL, source, MPI_COMM_WORLD, ierr)
print *, 'rank= ', rank, ' results: ', recvbuf
endif
call MPI_FINALIZE(ierr)
end
Figure A.11: A simple example of collective communication (MPI_Scatter) in Fortran.
A.2 Examples of Using MPI
#include "mpi.h"
A.2.2 Integration
#include <stdio.h>
#include <math.h>
#include "mpi.h"

int my_rank;
int p;
int source;
int dest;
int tag = 0;
int i, n, N;                  /* indices */

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &p);
h = (B-A)/N;                  /* mesh size */
n = N/p;                      /* meshes on each processor */
a = A + my_rank*n*h;
b = a + n*h;                  /* calculates the local bounds */

Figure A.13: The first part of integral.c, which performs a simple integral.

Compiling integral.c (Figure A.13):
>mpicc -o integral integral.c
VAMPIR
VAMPIR is currently the most successful MPI tool product (see also
“Supercomputer European Watch,” July 1997), or check references at
1. http://www.cs.utk.edu/∼browne/perftools-review
2. http://www.tc.cornell.edu/UserDoc/Software/PTools/vampir/
ScaLAPACK
PGAPack
ARCH
http://www.tc.cornell.edu/Research/tech.rep.html
OOMPI
http://www.cse.nd.edu/∼lsc/research/oompi
http://www.osc.edu/Lam/lam/xmpi.html
Aztec is intended for users who want to avoid parallel programming details but who have large sparse linear systems that require an efficiently utilized parallel computing system.
http://www.cs.sandia.gov/HPCCIT/aztec.html
MPIMap
http://www.llnl.gov/livcomp/mpimap/
STAR/MPI
ftp://ftp.ccs.neu.edu/pub/people/gene/starmpi/
http://www.cs.utexas.edu/users/rvdg/abstracts/sB_BLAS.html
An “alpha test release” of the BLACS for MPI is available from the
University of Tennessee, Knoxville. For more information contact R. Clint
Whaley ([email protected]).
http://www.nag.co.uk/numeric/FM.html
ftp://ftp.scri.fsu.edu/pub/DQS
http://www.cs.utexas.edu/users/rvdg/intercom/
http://www.mcs.anl.gov/home/gropp/petsc.html
MSG Toolkit
http://www.cerca.umontreal.ca/malevsky/MSG/MSG.html
Para++
The Para++ project provides a C++ interface to the MPI and PVM message
passing libraries. Their approach is to overload input and output
operators to do communication. Communication looks like standard C++
I/O.
http://www.loria.fr/para++/parapp.html
Parallel FFTW
http://www.fftw.org/
http://dino.ph.utexas.edu/∼furnish/c4
MPI Cubix
MPI Cubix is an I/O library for MPI applications. The semantics and
language binding reflect POSIX in its sequential aspects and MPI in its
parallel aspects. The library is built on a few POSIX I/O functions and
each of the POSIX-like Cubix functions translate directly to a POSIX
operation on a file system somewhere in the parallel computer. The
library is also built on MPI and is therefore portable to any computer that
supports both MPI and POSIX.
http://www.osc.edu/Lam/mpi/mpicubix.html
http://www.erc.msstate.edu/mpi/mpix.html
mpC
http://www.ispras.ru/∼mpc/
MPIRUN
http://lovelace.nas.nasa.gov/Parallel/People/fineberg_homepage.html.
A.4 Complete List of MPI Functions
Constants MPI_Keyval_free
MPI_File_iwrite_shared MPI_Attr_put
MPI_Info_set MPI_File_read_shared
MPIO_Request_c2f MPI_NULL_COPY_FN
MPI_File_open MPI_Barrier
MPI_Init MPI_File_seek
MPIO_Request_f2c MPI_NULL_DELETE_FN
MPI_File_preallocate MPI_Bcast
MPI_Init_thread MPI_File_seek_shared
MPIO_Test MPI_Op_create
MPI_File_read MPI_Bsend
MPI_Initialized MPI_File_set_atomicity
MPIO_Wait MPI_Op_free
MPI_File_read_all MPI_Bsend_init
MPI_Int2handle MPI_File_set_errhandler
MPI_Abort MPI_Pack
MPI_File_read_all_begin MPI_Buffer_attach
MPI_Intercomm_create MPI_File_set_info
MPI_Address MPI_Pack_size
MPI_File_read_all_end MPI_Buffer_detach
MPI_Intercomm_merge MPI_File_set_size
MPI_Allgather MPI_Pcontrol
MPI_File_read_at MPI_CHAR
MPI_Iprobe MPI_File_set_view
MPI_Allgatherv MPI_Probe
MPI_File_read_at_all MPI_Cancel
MPI_Irecv MPI_File_sync
MPI_Allreduce MPI_Recv
MPI_File_read_at_all_begin MPI_Cart_coords
MPI_Irsend MPI_File_write
MPI_Alltoall MPI_Recv_init
MPI_File_read_at_all_end MPI_Cart_create
MPI_Isend MPI_File_write_all
MPI_Alltoallv MPI_Reduce
MPI_File_read_ordered MPI_Cart_get
MPI_Issend MPI_File_write_all_begin
MPI_Attr_delete MPI_Reduce_scatter
MPI_File_read_ordered_begin MPI_Cart_map
MPI_Keyval_create MPI_File_write_all_end
MPI_Attr_get MPI_Request_c2f
MPI_File_read_ordered_end MPI_Cart_rank
MPI_File_write_at MPI_Comm_split
MPI_Request_free MPI_Get_version
MPI_Cart_shift MPI_Status_set_cancelled
MPI_File_write_at_all MPI_Comm_test_inter
MPI_Rsend MPI_Graph_create
MPI_Cart_sub MPI_Status_set_elements
MPI_File_write_at_all_begin MPI_DUP_FN
MPI_Rsend_init MPI_Graph_get
MPI_Cartdim_get MPI_Test
MPI_File_write_at_all_end MPI_Dims_create
MPI_Scan MPI_Graph_map
MPI_Comm_compare MPI_Test_cancelled
MPI_File_write_ordered MPI_Errhandler_create
MPI_Scatter MPI_Graph_neighbors
MPI_Comm_create MPI_Testall
MPI_File_write_ordered_begin MPI_Errhandler_free
MPI_Scatterv MPI_Graph_neighbors_count
MPI_Comm_dup MPI_Testany
MPI_File_write_ordered_end MPI_Errhandler_get
MPI_Send MPI_Graphdims_get
MPI_Comm_free MPI_Testsome
MPI_File_write_shared MPI_Errhandler_set
MPI_Send_init MPI_Group_compare
MPI_Comm_get_name MPI_Topo_test
MPI_Finalize MPI_Error_class
MPI_Sendrecv MPI_Group_difference
MPI_Comm_group MPI_Type_commit
MPI_Finalized MPI_Error_string
MPI_Sendrecv_replace MPI_Group_excl
MPI_Comm_rank MPI_Type_contiguous
MPI_Gather MPI_File_c2f
MPI_Ssend MPI_Group_free
MPI_Comm_remote_group MPI_Type_create_darray
MPI_Gatherv MPI_File_close
MPI_Ssend_init MPI_Group_incl
MPI_Comm_remote_size MPI_Type_create_subarray
MPI_Get_count MPI_File_delete
MPI_Start MPI_Group_intersection
MPI_Comm_set_name MPI_Type_extent
MPI_Get_elements MPI_File_f2c
MPI_Startall MPI_Group_range_excl
MPI_Comm_size MPI_Type_free
MPI_Get_processor_name MPI_File_get_amode
MPI_Status_c2f MPI_Group_range_incl
MPI_Type_get_contents MPI_Info_get_valuelen
MPI_File_get_atomicity MPI_File_iwrite_shared
MPI_Group_rank MPI_Info_set
MPI_Type_get_envelope
MPI_File_get_byte_offset
MPI_Group_size
MPI_Type_hvector
MPI_File_get_errhandler
MPI_Group_translate_ranks
MPI_Type_lb
MPI_File_get_group
MPI_Group_union
MPI_Type_size
MPI_File_get_info
MPI_Ibsend
MPI_Type_struct
MPI_File_get_position
MPI_Info_c2f
MPI_Type_ub
MPI_File_get_position_shared
MPI_Info_create
MPI_Type_vector
MPI_File_get_size
MPI_Info_delete
MPI_Unpack
MPI_File_get_type_extent
MPI_Info_dup
MPI_Wait
MPI_File_get_view
MPI_Info_f2c
MPI_Waitall
MPI_File_iread
MPI_Info_free
MPI_Waitany
MPI_File_iread_at
MPI_Info_get
MPI_Waitsome
MPI_File_iread_shared
MPI_Info_get_nkeys
MPI_Wtick
MPI_File_iwrite
MPI_Info_get_nthkey
MPI_Wtime
MPI_File_iwrite_at
Appendix B
OpenMP
The first attempt to standardize the shared-memory API was the draft for
ANSI X3H5 in 1994. Unfortunately, it was never adopted. In October
1997, the OpenMP Architecture Review Board (ARB) published its first API
specifications, OpenMP for Fortran 1.0. In 1998 they released the C/C++ standard. Version 2.0 of the Fortran specifications was published in
2000 and version 2.0 of the C/C++ specifications was released in 2002.
Version 2.5 is a combined C/C++/Fortran specification that was released
in 2005. Version 3.0, released in May, 2008, is the current version of the
API specifications.
Example:
if(scalar-expression)
num_threads(integer-expression)
default(shared|none)
private(list)
firstprivate(list)
shared(list)
copyin(list)
reduction(operator:list)
private(list)
firstprivate(list)
lastprivate(list)
reduction(operator:list)
schedule(kind[, chunck_size])
collapse(n)
ordered
nowait
lastprivate clause does what private does and, in addition, copies the value from the last loop iteration back to the original variable.
shared clause declares the variables in its list to be shared among all threads.
default clause allows the user to specify a default scope (shared or none) for all variables within a parallel region.
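As a small illustration (not from the original text), several of these clauses can be combined on one work-sharing loop; the loop body is an arbitrary example.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i, last = 0;
    double scale = 2.0, sum = 0.0;
    /* scale is shared, i and last follow the clauses, and sum is combined
       across threads by the reduction clause. */
    #pragma omp parallel for private(i) lastprivate(last) reduction(+:sum)
    for (i = 0; i < 1000; i++) {
        sum += scale * i;
        last = i;            /* value from the last iteration is copied out */
    }
    printf("sum = %g, last = %d\n", sum, last);
    return 0;
}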
B.4 Synchronization
Before we discuss synchronization, let us consider a simple example in which two threads try to read and update the same variable.
x = 0;
#pragma omp parallel shared(x)
{
x = x + 1;
}
The result of x may be 1, not 2 as it should be. To avoid situations like this, the update of x must be synchronized between the two threads, as in the sketch below, using one of the following constructs:
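A minimal corrected sketch (not from the original text) using the atomic construct from the list that follows; with it, each thread's increment of x is applied indivisibly.

#include <stdio.h>

int main(void)
{
    int x = 0;
    #pragma omp parallel shared(x)
    {
        #pragma omp atomic
        x = x + 1;            /* each thread's update is now indivisible */
    }
    printf("x = %d\n", x);    /* equals the number of threads */
    return 0;
}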
critical
atomic
barrier
flush

B.5 Runtime Library Routines
omp_set_num_threads
Sets the number of threads that will be used in the next parallel region. The argument must be a positive integer.
omp_get_num_threads
Returns the number of threads that are currently in the team executing
the parallel region from which it is called.
int omp_get_num_threads(void)
omp_get_max_threads
int omp_get_max_threads(void)
omp_get_thread_num
Returns the thread number of the thread, within the team, making this
call. This number will be between 0 and OMP_GET_NUM_THREADS - 1. The
master thread of the team is thread 0.
int omp_get_thread_num(void)
omp_in_parallel
int omp_in_parallel(void)
The OpenMP lock routines access a lock variable in such a way that they
always read and update the most current value of the lock variable. The
lock routines include a flush with no list; the read and update to the lock
variable must be implemented as if they are atomic with the flush.
Therefore, it is not necessary for an OpenMP program to include explicit
flush directives to ensure that the lock variable’s value is consistent
among different tasks.
omp_init_lock
omp_destroy_lock
This subroutine disassociates the given lock variable from any locks.
omp_set_lock
This subroutine forces the executing thread to wait until the specified
lock is available. A thread is granted ownership of a lock when it becomes
available.
omp_unset_lock
omp_test_lock
This subroutine attempts to set a lock, but does not block if the lock is
unavailable.
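A small usage sketch of the lock routines (not from the original text):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int counter = 0;
    omp_lock_t lock;
    omp_init_lock(&lock);              /* create the lock                   */
    #pragma omp parallel
    {
        omp_set_lock(&lock);           /* wait until the lock is available  */
        counter++;                     /* protected update                  */
        omp_unset_lock(&lock);         /* release the lock                  */
    }
    omp_destroy_lock(&lock);
    printf("counter = %d\n", counter);
    return 0;
}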
omp_get_wtime
double omp_get_wtime(void)
omp_get_wtick
double omp_get_wtick(void)
#include "omp.h"
#include <stdio.h>
int main(void)
{
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        int N = omp_get_num_threads();
        printf("Hello from %d of %d.\n", ID, N);
    }
    return 0;
}

B.6.2 Calculating π by Integration

#include "omp.h"
#include <stdio.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2
int i;
step = 1.0/(double)num_steps;
omp_set_num_threads(NUM_THREADS);
x = (i + 0.5) * step;
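Since only the opening of the π listing survives above, here is a complete minimal sketch of the same computation (an assumed reconstruction, not the author's original listing); it uses a reduction clause to accumulate the partial sums of the midpoint rule for π = ∫_0^1 4/(1+x^2) dx.

#include <omp.h>
#include <stdio.h>

#define NUM_THREADS 2
static long num_steps = 100000;

int main(void)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;       /* midpoint of the i-th strip */
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi is approximately %.10f\n", step * sum);
    return 0;
}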
Appendix C
Projects
(1) Design an algorithm to place the 256 × 256 mesh points to the 64
processors with optimal utilization and load balance of the CPUs
and communication links;
(2) Compute the expected load imbalance ratio for CPUs resulting
from your method;
(3) Compute the expected load imbalance ratio among all links
resulting from your method;
(4) Repeat the above 3 steps if each one of the top layer of 16
processors is 4 times faster than each one of the processors on the
second and third layers.
where F_ij is the force on particle i exerted by particle j (at distance r_ij), with charges q_i and q_j.
Project 9
V_{ij} = \frac{1}{r_{ij}^{12}} - \frac{2}{r_{ij}^{6}}
where 𝑉F› is the pair-wise potential between particles 𝑖 and 𝑗 with distance
𝑟F› .
(1) Write a serial program to minimize the total energy of the particle
system for 𝑁 = 100 by using simulated annealing method (or any
optimization method you prefer).
(2) For P = 2, 4, and 8, write a parallel program to minimize the energy. The minimizations should terminate at a final energy similar to the one obtained with P = 1 in (1) above (results within 5 percent relative error are considered similar). You need to use the following two methods to decompose the problem:
a) Particle decomposition;
b) Spatial decomposition.
(3) Report the timing results and speedup curve for both decompositions
and comment on their relative efficiency.
1) Fujitsu K computer
2) NUDT Tianhe-1A
3) Cray XT5-HE
4) Dawning TC3600 Cluster
5) HP Cluster Platform 3000SL
6) Cray XE6
7) SGI Altix ICE
8) Cray XE6
9) Bull Bullx supernode S6010/6030
10) Roadrunner IBM BladeCenter
In each of the categories, create a rank order for the computers you select.
Appendix D
Program Examples
parameter(N1=8) ! columns
parameter(N2=8) ! rows
include '/net/campbell/04/theory/lam_axp/h/mpif.h'
real M(N1,N2),v(N1),prod(N2)
integer size,my_rank,tag,root
integer send_count, recv_count
integer N_rows
tag = 0
root = 0
C here(not always) for MPI_Gather to work root should be 0
call MPI_Init(ierr)
D.1 Matrix-Vector Multiplication
call MPI_Comm_rank(MPI_comm_world,my_rank,ierr)
call MPI_Comm_size(MPI_comm_world,size,ierr)
if(mod(N2,size).ne.0)then
print*,'rows are not divisible by processes'
stop
end if
if(my_rank.eq.root)then
call initialize_M(M,N1,N2)
call initialize_v(v,N1)
end if
call multiply(prod,v,M,N_rows,N1)
send_count = N_rows
recv_count = N_rows
C if(my_rank.eq.root)recv_count = N2
call MPI_Gather(prod,
@ send_count,
@ MPI_REAL,
@ prod,
@ recv_count,
@ MPI_REAL,
@ root,
@ MPI_COMM_WORLD,
@ ierr)
if(my_rank.eq.root)call write_prod(prod,N2)
call MPI_Finalize(ierr)
end
subroutine multiply(prod,v,M,N2,N1)
real M(N2,N1),prod(N2),v(N1)
do i=1,N2
prod(i)=0
do j=1,N1
prod(i)=prod(i) + M(j,i)*v(j)
end do
end do
return
end
subroutine initialize_M(M,N2,N1)
real M(N2,N1)
do i=1,N2
do j=1,N1
M(j,i) = 1.*i/j
end do
end do
return
end
subroutine initialize_v(v,N1)
real v(N1)
do j=1,N1
v(j) = 1.*j
end do
return
end
subroutine write_prod(prod,N2)
real prod(N2)
C . directory for all process except the one the program was started on
C is your home directory
open(unit=1,file='~/LAM/F/prod',status='new')
do j=1,N2
write(1,*)j,prod(j)
end do
return
end
Table D.1: Matrix-Vector Multiplication in Fortran
C code…

D.2 Long Range N-body Force
c One process acts as the host and reads in the number of particles
c
if (myrank .eq. pseudohost) then
open (4,file='nbody.input')
if (mod(nprocs,2) .eq. 0) then
read (4,*) npts
if (npts .gt. nprocs*NN) then
print *,'Warning!! Size out of bounds!!'
npts = -1
else if (mod(npts,nprocs) .ne. 0) then
print *,'Number of processes must divide npts'
npts = -1
end if
else
print *,'Number of processes must be even'
npts = -1
end if
end if
c
c The number of particles is broadcast to all processes
c
call mpi_bcast (npts, 1, MPI_INTEGER, pseudohost,
# MPI_COMM_WORLD, ierr)
c
c Abort if number of processes and/or particles is incorrect
c
if (npts .eq. -1) goto 999
c
c Work out number of particles in each process
c
nlocal = npts/nprocs
c
c The pseudohost initializes the particle data and sends each
c process its particles.
c
if (myrank .eq. pseudohost) then
iran = myrank + 111
do i=0,nlocal-1
p(MM,i) = sngl(ran(iran))
p(PX,i) = sngl(ran(iran))
p(PY,i) = sngl(ran(iran))
p(PZ,i) = sngl(ran(iran))
p(FX,i) = 0.0
p(FY,i) = 0.0
p(FZ,i) = 0.0
end do
do k=0,nprocs-1
if (k .ne. pseudohost) then
do i=0,nlocal-1
q(MM,i) = sngl(ran(iran))
q(PX,i) = sngl(ran(iran))
q(PY,i) = sngl(ran(iran))
q(PZ,i) = sngl(ran(iran))
q(FX,i) = 0.0
q(FY,i) = 0.0
q(FZ,i) = 0.0
end do
call mpi_send (q, 7*nlocal, MPI_REAL,
# k, 100, MPI_COMM_WORLD, ierr)
end if
end do
else
call mpi_recv (p, 7*nlocal, MPI_REAL,
# pseudohost, 100, MPI_COMM_WORLD, status, ierr)
end if
c
c Initialization is now complete. Start the clock and begin work.
c First each process makes a copy of its particles.
c
timebegin = mpi_wtime ()
do i= 0,nlocal-1
q(MM,i) = p(MM,i)
q(PX,i) = p(PX,i)
q(PY,i) = p(PY,i)
q(PZ,i) = p(PZ,i)
q(FX,i) = 0.0
q(FY,i) = 0.0
q(FZ,i) = 0.0
end do
c
c Now the interactions between the particles in a single process are
c computed.
c
do i=0,nlocal-1
do j=i+1,nlocal-1
dx(i) = p(PX,i) - q(PX,j)
dy(i) = p(PY,i) - q(PY,j)
dz(i) = p(PZ,i) - q(PZ,j)
sq(i) = dx(i)**2+dy(i)**2+dz(i)**2
dist(i) = sqrt(sq(i))
fac(i) = p(MM,i) * q(MM,j) / (dist(i) * sq(i))
tx(i) = fac(i) * dx(i)
ty(i) = fac(i) * dy(i)
tz(i) = fac(i) * dz(i)
p(FX,i) = p(FX,i)-tx(i)
q(FX,j) = q(FX,j)+tx(i)
p(FY,i) = p(FY,i)-ty(i)
q(FY,j) = q(FY,j)+ty(i)
p(FZ,i) = p(FZ,i)-tz(i)
q(FZ,j) = q(FZ,j)+tz(i)
end do
end do
c
c The processes are arranged in a ring. Data will be passed in an
C anti-clockwise direction around the ring.
c
dest = mod (nprocs+myrank-1, nprocs)
src = mod (myrank+1, nprocs)
c
c Each process interacts with the particles from its nprocs/2-1
c anti-clockwise neighbors. At the end of this loop p(i) in each
c process has accumulated the force from interactions with particles
c i+1, ...,nlocal-1 in its own process, plus all the particles from its
c nprocs/2-1 anti-clockwise neighbors. The "home" of the q array is
C regarded as the process from which it originated. At the end of
c this loop q(i) has accumulated the force from interactions with
C particles 0,...,i-1 in its home process, plus all the particles from the
C nprocs/2-1 processes it has rotated to.
c
do k=0,nprocs/2-2
call mpi_sendrecv_replace (q, 7*nlocal, MPI_REAL, dest, 200,
# src, 200, MPI_COMM_WORLD, status, ierr)
do i=0,nlocal-1
do j=0,nlocal-1
dx(i) = p(PX,i) - q(PX,j)
dy(i) = p(PY,i) - q(PY,j)
dz(i) = p(PZ,i) - q(PZ,j)
sq(i) = dx(i)**2+dy(i)**2+dz(i)**2
dist(i) = sqrt(sq(i))
fac(i) = p(MM,i) * q(MM,j) / (dist(i) * sq(i))
tx(i) = fac(i) * dx(i)
ty(i) = fac(i) * dy(i)
tz(i) = fac(i) * dz(i)
p(FX,i) = p(FX,i)-tx(i)
q(FX,j) = q(FX,j)+tx(i)
p(FY,i) = p(FY,i)-ty(i)
q(FY,j) = q(FY,j)+ty(i)
p(FZ,i) = p(FZ,i)-tz(i)
q(FZ,j) = q(FZ,j)+tz(i)
end do
end do
end do
c
c Now q is rotated once more so it is diametrically opposite its home
c process. p(i) accumulates forces from the interaction with particles
c 0,..,i-1 from its opposing process. q(i) accumulates force from the
c interaction of its home particles with particles i+1,...,nlocal-1 in
c its current location.
c
if (nprocs .gt. 1) then
call mpi_sendrecv_replace (q, 7*nlocal, MPI_REAL, dest, 300,
# src, 300, MPI_COMM_WORLD, status, ierr)
do i=nlocal-1,0,-1
do j=i-1,0,-1
dx(i) = p(PX,i) - q(PX,j)
dy(i) = p(PY,i) - q(PY,j)
dz(i) = p(PZ,i) - q(PZ,j)
sq(i) = dx(i)**2+dy(i)**2+dz(i)**2
dist(i) = sqrt(sq(i))
fac(i) = p(MM,i) * q(MM,j) / (dist(i) * sq(i))
tx(i) = fac(i) * dx(i)
ty(i) = fac(i) * dy(i)
tz(i) = fac(i) * dz(i)
p(FX,i) = p(FX,i)-tx(i)
q(FX,j) = q(FX,j)+tx(i)
p(FY,i) = p(FY,i)-ty(i)
q(FY,j) = q(FY,j)+ty(i)
p(FZ,i) = p(FZ,i)-tz(i)
q(FZ,j) = q(FZ,j)+tz(i)
end do
end do
c
c In half the processes we include the interaction of each particle with
c the corresponding particle in the opposing process.
c
if (myrank .lt. nprocs/2) then
do i=0,nlocal-1
dx(i) = p(PX,i) - q(PX,i)
dy(i) = p(PY,i) - q(PY,i)
dz(i) = p(PZ,i) - q(PZ,i)
sq(i) = dx(i)**2+dy(i)**2+dz(i)**2
dist(i) = sqrt(sq(i))
fac(i) = p(MM,i) * q(MM,i) / (dist(i) * sq(i))
tx(i) = fac(i) * dx(i)
ty(i) = fac(i) * dy(i)
tz(i) = fac(i) * dz(i)
p(FX,i) = p(FX,i)-tx(i)
q(FX,i) = q(FX,i)+tx(i)
p(FY,i) = p(FY,i)-ty(i)
q(FY,i) = q(FY,i)+ty(i)
p(FZ,i) = p(FZ,i)-tz(i)
q(FZ,i) = q(FZ,i)+tz(i)
end do
endif
c
c Now the q array is returned to its home process.
c
dest = mod (nprocs+myrank-nprocs/2, nprocs)
src = mod (myrank+nprocs/2, nprocs)
call mpi_sendrecv_replace (q, 7*nlocal, MPI_REAL, dest, 400,
# src, 400, MPI_COMM_WORLD, status, ierr)
end if
c
c The p and q arrays are summed to give the total force on each particle.
c
do i=0,nlocal-1
p(FX,i) = p(FX,i) + q(FX,i)
p(FY,i) = p(FY,i) + q(FY,i)
p(FZ,i) = p(FZ,i) + q(FZ,i)
end do
c
c Stop clock and write out timings
c
timeend = mpi_wtime ()
print *,'Node', myrank,' Elapsed time: ',
# timeend-timebegin,' seconds'
c
c Do a barrier to make sure the timings are written out first
c
call mpi_barrier (MPI_COMM_WORLD, ierr)
c
c Each process returns its forces to the pseudohost which prints them out.
c
if (myrank .eq. pseudohost) then
open (7,file='nbody.output')
write (7,100) (p(FX,i),p(FY,i),p(FZ,i),i=0,nlocal-1)
call mpi_type_vector (nlocal, 3, 7, MPI_REAL, newtype, ierr)
call mpi_type_commit (newtype, ierr)
do k=0,nprocs-1
if (k .ne. pseudohost) then
call mpi_recv (q(FX,0), 1, newtype,
# k, 100, MPI_COMM_WORLD, status, ierr)
write (7,100) (q(FX,i),q(FY,i),q(FZ,i),i=0,nlocal-1)
end if
end do
else
call mpi_type_vector (nlocal, 3, 7, MPI_REAL, newtype, ierr)
call mpi_type_commit (newtype, ierr)
call mpi_send (p(FX,0), 1, newtype,
# pseudohost, 100, MPI_COMM_WORLD, ierr)
end if
c
c Close MPI
c
999 call mpi_finalize (ierr)
stop
100 format(3e15.6)
end
Table D.2: Long-Ranged N-Body force calculation in Fortran (Source: Oak Ridge
National Laboratory)
C code …
D.3 Integration
program integrate
include 'mpif.h'
parameter (pi=3.141592654)
integer rank
call mpi_init (ierr)
call mpi_comm_rank (mpi_comm_world, rank, ierr)
call mpi_comm_size (mpi_comm_world, nprocs, ierr)
if (rank .eq. 0) then
open (7, file='input.dat')
read (7,*) npts
end if
call mpi_bcast (npts, 1, mpi_integer, 0, mpi_comm_world, ierr)
nlocal = (npts-1)/nprocs + 1
nbeg = rank*nlocal + 1
nend = min (nbeg+nlocal-1,npts)
deltax = pi/npts
psum = 0.0
do i=nbeg,nend
x = (i-0.5)*deltax
psum = psum + sin(x)
end do
call mpi_reduce (psum, sum, 1, mpi_real, mpi_sum, 0,
# mpi_comm_world, ierr)
if (rank.eq. 0) then
print *,'The integral is ',sum*deltax
end if
call mpi_finalize (ierr)
stop
end
Table D.3: Simple Integration in Fortran (Source: Oak Ridge National Laboratory)
C code …