Parallel Architectures
Outline
• Some chapter references
• Brief review of complexity
– Terminology for comparisons
• Interconnection networks
• Processor arrays
• Multiprocessors
• Multicomputers
• Flynn’s Taxonomy – moved to Chpt 1
2
Some Chapter References
• Selim Akl, The Design and Analysis of Parallel
Algorithms, Prentice Hall, 1989 (earlier textbook).
• G. C. Fox, What Have We Learnt from Using Real
Parallel Machines to Solve Real Problems? Technical
Report C3P-522, Cal Tech, December 1989. (Included in
part in more recent books co-authored by Fox.)
• A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction
to Parallel Computing, Second Edition, 2003 (first edition
1994), Addison Wesley.
• Harry Jordan, Gita Alaghband, Fundamentals of Parallel
Processing: Algorithms, Architectures, Languages,
Prentice Hall, 2003, Ch 1, 3-5.
• F. Thomson Leighton; Introduction to Parallel Algorithms
and Architectures: Arrays, Trees, Hypercubes; 1992;
Morgan Kaufmann Publishers.
3
References - continued
• Gregory Pfister, In Search of Clusters: The Ongoing
Battle in Lowly Parallel Computing, 2nd Edition, Ch 2. (Discusses
details of some serious problems that MIMDs incur.)
• Michael Quinn, Parallel Programming in C with MPI and
OpenMP, McGraw Hill, 2004 (Current Textbook), Chapter
2.
• Michael Quinn, Parallel Computing: Theory and Practice,
McGraw Hill, 1994, Ch. 1,2
• Sayed H. Roosta, Parallel Processing & Parallel
Algorithms: Theory and Computation, Springer Verlag,
2000, Ch 1.
• Wilkinson & Allen, Parallel Programming: Techniques
and Applications, Prentice Hall, 2nd Edition, 2005, Ch 1-
2.
4
Brief Review
Complexity Concepts Needed for Comparisons
• Whenever we define a counting function, we usually
characterize the growth rate of that function in terms of
complexity classes.
• Technical Definition: We say a function f(n) is in O(g(n))
if (and only if) there are positive constants c and n0 such
that
0 ≤ f(n) ≤ c·g(n) for all n ≥ n0
• O(n) is read as big-oh of n.
• This notation can be used to separate counting functions
into complexity classes that characterize the size of the
count.
• We can use it for any kind of counting functions such as
timings, bisection widths, etc.
5
Big-Oh and Asymptotic Growth Rate
• The big-Oh notation gives an upper bound on the
(asymptotic) growth rate of a function
• The statement “f(n) is O(g(n))” means that the growth
rate of f(n) is not greater than the growth rate of g(n)
• We can use the big-Oh notation to rank functions
according to their growth rate
• big-Theta
– f(n) is Θ(g(n)) if there are constants c′ > 0 and c′′ > 0
and an integer constant n0 ≥ 1 such that
0 ≤ c′·g(n) ≤ f(n) ≤ c′′·g(n) for all n ≥ n0
Intuitively, this says that, up to a constant factor, f(n) and g(n)
are asymptotically the same.
Note: These concepts are covered in algorithm courses.
7
Relatives of Big-Oh
• little-oh
– f(n) is o(g(n)) if, for any constant c > 0, there is an
integer constant n0 ≥ 1 such that 0 ≤ f(n) < c·g(n) for all
n ≥ n0
Intuitively, this says f(n) is, up to a constant,
asymptotically strictly less than g(n), so f(n) ≠ Θ(g(n)).
• little-omega
– f(n) is ω(g(n)) if, for any constant c > 0, there is an
integer constant n0 ≥ 1 such that f(n) > c·g(n) ≥ 0 for all
n ≥ n0
Intuitively, this says f(n) is, up to a constant,
asymptotically strictly greater than g(n), so f(n) ≠
Θ(g(n)).
These are not used as much as the earlier definitions,
but they round out the picture. 8
Summary for Intuition for
Asymptotic Notation
big-Oh
– f(n) is O(g(n)) if f(n) is asymptotically less than or equal to
g(n)
big-Omega
– f(n) is Ω(g(n)) if f(n) is asymptotically greater than or
equal to g(n)
big-Theta
– f(n) is Θ(g(n)) if f(n) is asymptotically equal to g(n)
little-oh
– f(n) is o(g(n)) if f(n) is asymptotically strictly less than g(n)
little-omega
– f(n) is ω(g(n)) if f(n) is asymptotically strictly greater than g(n)
9
A CALCULUS DEFINITION OF O, Θ
(often easier to use)
Definition: Let f and g be functions defined on
the positive integers with nonnegative values.
We say g is in O(f) if and only if
lim g(n)/f(n) = c as n → ∞
for some nonnegative real number c, i.e., the
limit exists and is not infinite.
Definition: We say f is in Θ(g) if and only if
f is in O(g) and g is in O(f)
Note: We often use L'Hopital's Rule to calculate the
limits we need.
10
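The limit definition lends itself to a quick numerical check. A minimal sketch (the function names and sample values below are illustrative, not from the slides):

```python
# Numerically illustrate the limit definition of big-O:
# g is in O(f) iff lim (n -> inf) g(n)/f(n) exists and is finite.
# A ratio at one large n only suggests the limit; it does not prove it.

def ratio_at(g, f, n):
    """Return g(n)/f(n) at a large sample value n."""
    return g(n) / f(n)

g = lambda n: 3 * n**2 + 5 * n          # 3n^2 + 5n
f = lambda n: n**2                       # n^2

# The ratio approaches 3, a finite constant, so g is in O(f).
print(round(ratio_at(g, f, 10**9), 6))   # 3.0

# n^3 / n^2 = n grows without bound, so n^3 is not in O(n^2).
print(ratio_at(lambda n: n**3, f, 10**6))  # 1000000.0
```

Since both limits g/f and f/g exist here (3 and 1/3), g is also in Θ(f), matching the second definition.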
Why Asymptotic Behavior is Important
11
Recall: ORDER WINS OUT
(Example from Baase’s Algorithms Text)
The TRS-80
Main language support: BASIC - typically a slow running
interpreted language
For more details on TRS-80 see:
http://mate.kjsl.com/trs80/
The CRAY-YMP
Language used in example: FORTRAN- a fast running language
For more details on CRAY-YMP see:
http://ds.dial.pipex.com/town/park/abm64/CrayWWWStuff/Cfaqp1.html#TOC3
12
CRAY YMP with FORTRAN: complexity is 3n³
TRS-80 with BASIC: complexity is 19,500,000n
microsecond (abbr. µsec): one-millionth of a second
millisecond (abbr. msec): one-thousandth of a second
[Table of running times for several values of n omitted]
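The "order wins out" point can be checked directly. A sketch, assuming both running-time formulas are expressed in the same time unit, as the slide's comparison implies (the function names are ours):

```python
# "Order wins out": the slow TRS-80 running a Theta(n) algorithm
# (19,500,000n) eventually beats the fast CRAY YMP running a
# Theta(n^3) algorithm (3n^3), despite the huge constant factor.

def cray_time(n):
    return 3 * n**3            # time units per the slide

def trs80_time(n):
    return 19_500_000 * n      # time units per the slide

def crossover():
    """Smallest n for which the TRS-80 is strictly faster."""
    n = 1
    while trs80_time(n) >= cray_time(n):
        n += 1
    return n

print(crossover())  # 2550: beyond this problem size the linear algorithm wins
```

Algebraically, 3n³ > 19,500,000n exactly when n² > 6,500,000, i.e., n > 2549.5.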
a. With a shared medium, one message is sent at a time & all processors listen
b. With a switched medium, multiple simultaneous messages are possible.
15
Shared Medium
• Allows only one message at a time
• Messages are broadcast
• Each processor “listens” to every message
• Before sending a message, a processor
“listens” until the medium is unused
• Collisions require resending of messages
• Ethernet is an example
16
Switched Medium
• Supports point-to-point messages
between pairs of processors
• Each processor is connected to one switch
• Advantages over shared media
– Allows multiple messages to be sent
simultaneously
– Allows scaling of the network to
accommodate the increase in processors
17
Switch Network Topologies
• View switched network as a graph
– Vertices = processors or switches
– Edges = communication paths
• Two kinds of topologies
– Direct
– Indirect
18
Direct Topology
• Ratio of switch nodes to processor nodes
is 1:1
• Every switch node is connected to
– 1 processor node
– At least 1 other switch node
Indirect Topology
21
Terminology for Evaluating
Switch Topologies
• Bisection width – The minimum number of edges
between switch nodes that must be removed in
order to divide the network into two halves
(within 1 node, if the number of processors is
odd.)
• High bisection width is desirable.
• In algorithms requiring large amounts of data
movement, the size of the data set divided by
the bisection width puts a lower bound on the
complexity of the algorithm.
• Actually proving what the bisection width of a
network is can be quite difficult.
22
Evaluating 2-D Meshes
(Assumes mesh is a square)
n = number of processors
• Diameter:
– Θ(n^(1/2))
– Places a lower bound on algorithms that require
processing with arbitrary nodes sharing data.
• Bisection width:
– Θ(n^(1/2))
– Places a lower bound on algorithms that require
distribution of data to all nodes.
• Max number of edges per switch:
– 4 (the degree)
• Constant edge length?
– Yes
• Does this scale well?
– Yes
27
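The Θ(n^(1/2)) figures above can be made concrete. A minimal sketch for a square k × k mesh with n = k² switches (the helper name is ours; the bisection cut is exact when k is even):

```python
# For a square 2-D mesh of n = k*k switches:
#   diameter        = 2(k - 1)   (corner to opposite corner)
#   bisection width = k          (cut the mesh down the middle)
#   max degree      = 4          (interior switches)
# Both diameter and bisection width are Theta(n^(1/2)).
import math

def mesh_metrics(n):
    k = math.isqrt(n)
    assert k * k == n, "n must be a perfect square"
    return {"diameter": 2 * (k - 1), "bisection_width": k, "max_degree": 4}

print(mesh_metrics(16))
# {'diameter': 6, 'bisection_width': 4, 'max_degree': 4}
```

For n = 1024 the diameter is 62 and the bisection width 32, i.e., both grow like √n.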
Linear Network
• Switches arranged into a 1-D mesh
• Direct topology
• Corresponds to a row or column of a 2-D mesh
• Ring : A variant that allows a wraparound
connection between switches on the end.
• The linear and ring networks have many
applications
• Essentially supports a pipeline in both directions
• Although these networks are very simple, they
support many optimal algorithms.
28
Evaluating Linear and Ring Networks
• Diameter
– Linear : n-1 or Θ(n)
– Ring: n/2 or Θ(n)
• Bisection width:
– Linear: 1 or Θ(1)
– Ring: 2 or Θ(1)
• Degree for switches:
– 2
• Constant edge length?
– Yes
• Does this scale well?
– Yes
29
Binary Tree Network
• Indirect topology
• n = 2^d processor nodes, 2n - 1 switches,
where d = 0, 1, ... is the number of levels
e.g., 2^3 = 8 processors
on the bottom and
2n - 1 = 2(8) - 1 =
15 switches
30
Evaluating Binary Tree Network
• Diameter:
– 2 log n or O(log n).
– Note- this is small
• Bisection width:
– 1, the lowest possible number
• Degree:
–3
• Constant edge length?
– No
• Does this scale well?
– No
31
Hypertree Network (of degree 4 and
depth 2)
33
Evaluating 4-ary Hypertree with
Depth d
• A 4-ary hypertree has n = 4^d processors
– General formula for a k-ary hypertree is n = k^d
• Diameter is 2d = 2 log n
– shares the low diameter of a binary tree
• Bisection width = 2^(d+1)
– Note: here, 2^(d+1) = 2^3 = 8
– Large value - much better than binary tree
• Constant edge length?
– No
• Degree = 6
34
Butterfly Network
[Figure: a butterfly switching network, with ranks of switch
nodes labeled (rank 1 contains nodes 1,0 through 1,7)]
37
Butterfly Network Routing
Send message from
processor 2 to processor 5.
Algorithm:
0 means ship left;
1 means ship right.
• Bisection width:
– n / 2 *(likely error: 32/2 = 16)
• Degree:
– 4 (even for d > 3)
[Figure: 4-dimensional hypercube with nodes labeled by 4-bit
addresses 0000 through 1111]
42
Evaluating Hypercube Network
with n = 2d nodes
• Diameter:
– d = log n
• Bisection width:
– n / 2
• Edges per node:
– log n
45
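The figures above follow from the hypercube's addressing scheme: nodes are d-bit addresses, and two nodes are adjacent iff their addresses differ in exactly one bit. A minimal sketch (function names are ours):

```python
# Hypercube of n = 2^d nodes: each node has d = log n neighbors
# (one per bit flipped), and the distance between two nodes is the
# Hamming distance of their addresses, so the diameter is d.

def neighbors(node, d):
    """The d neighbors of a node in a d-dimensional hypercube."""
    return [node ^ (1 << bit) for bit in range(d)]

def distance(a, b):
    """Hamming distance = number of 1 bits in a XOR b."""
    return bin(a ^ b).count("1")

d = 4                                  # n = 2^d = 16 nodes
print(sorted(neighbors(0b0000, d)))    # [1, 2, 4, 8]
print(distance(0b0000, 0b1111))        # 4 = d, the diameter
```

Routing simply fixes one differing bit per hop, which is why the diameter equals log n.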
Shuffle-exchange Network
with n = 2d Processors
[Figure: shuffle-exchange network on nodes 0 through 7]
• Direct topology
• Number of nodes is a power of 2
• Nodes have addresses 0, 1, …, 2^d - 1
• Two outgoing links from node i:
– Shuffle link to node LeftCycle(i)
– Exchange link between node i and node i+1
when i is even
46
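The two link types can be sketched directly; LeftCycle is implemented here as a one-bit left rotation of the node's d-bit address (function names are ours):

```python
# Shuffle-exchange network with n = 2^d nodes:
#   shuffle link:  i -> LeftCycle(i), a 1-bit left rotation of i's
#                  d-bit address
#   exchange link: joins i and i+1 when i is even (flip the low bit)

def left_cycle(i, d):
    """Left-rotate the d-bit address of node i by one bit."""
    mask = (1 << d) - 1
    return ((i << 1) | (i >> (d - 1))) & mask

def exchange(i):
    """Partner across the exchange link (flip the low-order bit)."""
    return i ^ 1

d = 3  # 8 nodes, addresses 0..7
print([left_cycle(i, d) for i in range(8)])  # [0, 2, 4, 6, 1, 3, 5, 7]
print(exchange(4))                           # 5
```

Note that d successive shuffles return every address to itself, which is what makes the network useful for algorithms that sweep over all address bits.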
Shuffle-exchange Addressing – 16
processors
[Figure: shuffle-exchange addressing for 16 processors, with nodes
1000 1001 1010 1011 1100 1101 1110 1111 on the bottom row]
• Bisection width:
– Θ(n / log n)
– between n/(2 log n) and 2n/log n*
48
* See Leighton pg 480
Two Problems with
Shuffle-Exchange
• Shuffle-Exchange does not expand well
– A large shuffle-exchange network does not
decompose well into smaller separate shuffle
exchange networks.
– In a large shuffle-exchange network, a small
percentage of nodes will be hot spots
• They will encounter much heavier traffic
• The above results are in the dissertation of one
of Batcher’s students.
49
Comparing Networks
• All have logarithmic diameter
except 2-D mesh
• Hypertree, butterfly, and hypercube have
bisection width n / 2 (? Likely true only for n-cube)
• All have constant edges per node except
hypercube
• Only 2-D mesh, linear, and ring topologies keep
edge lengths constant as network size increases
• Shuffle-exchange is a good compromise- fixed
number of edges per node, low diameter, good
bisection width.
– However, negative results on preceding slide also
need to be considered. 50
Alternate Names for SIMDs
• Recall that all active processors of a SIMD
computer must simultaneously access the same
memory location.
• The value in the i-th processor can be viewed as
the i-th component of a vector.
• SIMD machines are sometimes called vector
computers [Jordan, et al.] or processor arrays
[Quinn 94, 04] based on their ability to execute
vector and matrix operations efficiently.
51
SIMD Computers
• SIMD computers that focus on vector
operations
– Support some vector and possibly matrix
operations in hardware
– Usually limit or provide less support for non-
vector type operations involving data in the
“vector components”.
• General purpose SIMD computers
– Usually also provide some vector and
possibly matrix operations in hardware.
– Support more traditional type operations (e.g.,
other than for vector/matrix data types).
52
Pipelined Architectures
• Pipelined architectures are sometimes
considered to be SIMD architectures
– See pg 37 of Textbook & pg 8-9 of Jordan et al.
– Vector components are entered successively into first
processor in pipeline.
– The i-th processor of the pipeline receives the output
from the (i-1)th processor.
• Normal “operations” in each processor are much
larger (coarser) in pipelined computers than in
true SIMDs
• “Pipelined” is somewhat SIMD in nature in that
explicit synchronization is not required.
53
Why Processor Arrays?
• Historically, high cost of control units
• Scientific applications have data parallelism
54
Data/instruction Storage
• Front end computer
– Also called the control unit
– Holds and runs program
– Data manipulated sequentially
• Processor array
– Data manipulated in parallel
55
Processor Array Performance
• Performance: work done per time unit
• Performance of processor array
– Speed of processing elements
– Utilization of processing elements
56
Performance Example 1
• 1024 processors
• Each adds a pair of integers in 1 µsec (1
microsecond, or one-millionth of a second, or
10^-6 second)
• What is the performance when adding two
1024-element vectors (one per
processor)?
Performance = 1024 operations / 10^-6 sec = 1.024 × 10^9 ops/sec
57
Performance Example 2
• 512 processors
• Each adds two integers in 1 µsec
• What is the performance when adding two
vectors of length 600?
• Since 600 > 512, 88 processors must add
two pairs of integers.
• The other 424 processors add only a
single pair of integers.
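The arithmetic behind both examples can be sketched as follows (the helper name and the ceiling-division model are ours; the slides leave Example 2's final number implicit):

```python
# Processor-array performance: with p processors and a vector of
# length v, some PEs must perform ceil(v/p) additions, so the whole
# operation takes ceil(v/p) add-times. Performance = v ops / elapsed.

def array_performance(num_procs, vector_len, add_time_usec=1):
    """Ops per second when each addition takes add_time_usec."""
    rounds = -(-vector_len // num_procs)       # ceiling division
    elapsed_sec = rounds * add_time_usec * 1e-6
    return vector_len / elapsed_sec

print(array_performance(1024, 1024))  # Example 1: ~1.024e9 ops/sec
print(array_performance(512, 600))    # Example 2: ~3.0e8 ops/sec (2 rounds)
```

Example 2 illustrates how utilization drops when the problem size slightly exceeds the machine size: 600 elements take as long as 1024 would.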
58
Example of a 2-D Processor
Interconnection Network in a Processor
Array
Each VLSI chip has 16 processing elements.
Each PE can simultaneously send a value to a neighbor.
PE =
processor
element
59
SIMD Execution Style
• The traditional (SIMD, vector, processor array) execution
style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]):
– The sequential processor that broadcasts the
commands to the rest of the processors is called the
front end or control unit (or sometimes host).
– The front end is a general purpose CPU that stores
the program and the data that is not manipulated in
parallel.
– The front end normally executes the sequential
portions of the program.
– Each processing element has a local memory that
can not be directly accessed by the control unit or
other processing elements.
60
SIMD Execution Style
– Collectively, the individual memories of the
processing elements (PEs) store the (vector)
data that is processed in parallel.
– When the front end encounters an instruction
whose operand is a vector, it issues a
command to the PEs to perform the
instruction in parallel.
– Although the PEs execute in parallel, some
units can be allowed to skip any particular
instruction.
61
Masking on Processor Arrays
• All the processors work in lockstep except those
that are masked out (by setting mask register).
• The conditional if-then-else is different for
processor arrays than the sequential version
– Every active processor tests to see if its data meets
the negation of the boolean condition.
– If it does, it sets its mask bit so those processors will
not participate in the operation initially.
– Next, the unmasked processors execute the THEN
part.
– Afterwards, mask bits (for the original set of active
processors) are flipped and the newly unmasked
processors perform the ELSE part.
62
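The masked execution steps above can be sketched with a toy PE array (names and the element-per-PE model are illustrative):

```python
# SIMD masked execution of "if (COND) then A else B": each PE holds
# one data element. PEs failing COND are masked off for the THEN
# part; the mask is then flipped so the other PEs run the ELSE part.

def simd_if_then_else(data, cond, then_op, else_op):
    mask = [cond(x) for x in data]                  # COND tested in lockstep
    # THEN part: only the PEs whose condition is true participate
    data = [then_op(x) if m else x for x, m in zip(data, mask)]
    # ELSE part: flip the mask; the remaining PEs participate
    data = [x if m else else_op(x) for x, m in zip(data, mask)]
    return data

# if (x > 2) then x = x * 10 else x = x + 1, over four PEs
print(simd_if_then_else([1, 2, 3, 4], lambda x: x > 2,
                        lambda x: x * 10, lambda x: x + 1))
# [2, 3, 30, 40]
```

Note the cost implication: the array spends time on both branches in sequence, which is why heavily conditional code erodes SIMD efficiency.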
if (COND) then A else B
[Figure: three animation frames showing the mask being set for the
THEN part and flipped for the ELSE part]
65
SIMD Machines
• An early SIMD computer designed for vector and
matrix processing was the Illiac IV computer
– Initial development at the University of Illinois 1965-70
– Moved to NASA Ames, completed in 1972 but not
fully functional until 1976.
– See Jordan et al., pg 7 and Wikipedia
• The MPP, DAP, the Connection Machines CM-1
and CM-2, MasPar MP-1 and MP-2 are
examples of SIMD computers
– See Akl pg 8-12 and [Quinn, 94]
• The CRAY-1 and the Cyber-205 use pipelined
arithmetic units to support vector operations and
are sometimes called a pipelined SIMD
– See [Jordan, et al, p7], [Quinn 94, pg 61-2], and
[Quinn 2004, pg 37].
66
SIMD Machines
• Quinn [1994, pg 63-67] discusses the CM-2
Connection Machine (with 64K PEs) and a
smaller & updated CM-200.
• Professor Batcher was the chief architect for the
STARAN and the MPP (Massively Parallel
Processor) and an advisor for the ASPRO
– ASPRO is a small second-generation STARAN used
by the Navy in its spy planes.
• Professor Batcher is best known architecturally
for the MPP, which belongs to the Smithsonian Institution
& is currently displayed at a D.C. airport.
67
Today’s SIMDs
• Many SIMDs are being embedded in sequential
machines.
• Others are being built as part of hybrid
architectures.
• Others are being built as special-purpose
machines, although some of them could be
classified as general purpose.
• Much of the recent work with SIMD architectures
is proprietary.
– Often the fact that a parallel computer is a SIMD is
not mentioned by the company building it.
68
A Company Building Inexpensive
SIMD
• ClearSpeed is producing a COTS (commodity
off the shelf) SIMD Board
– WorldScape is developing defense and commercial
applications for this computer.
• Not a traditional SIMD as the hardware doesn’t
synchronize every step.
– Hardware design supports efficient synchronization
• This machine is programmed like a SIMD.
• The U.S. Navy has observed that these machines
process radar data an order of magnitude faster than
other machines.
• There is quite a bit of information about this at
www.wscape.com and www.clearspeed.com
69
An Example of a Hybrid SIMD
• Embedded Massively Parallel Accelerators
– Systola 1024: a PC add-on board with 1024
processors (plus interface processors)
72
SIMDs Embedded in SISDs
• Intel's Pentium 4 included what they call MMX
technology to gain a significant performance
boost
• IBM and Motorola incorporated the technology
into their G4 PowerPC chip in what they call
their Velocity Engine.
• Both MMX technology and the Velocity Engine
are the chip manufacturers’ names for their
proprietary SIMD processors and parallel
extensions to their operating code.
• This same approach is used by NVidia and
Evans & Sutherland to dramatically accelerate
graphics rendering.
73
Special Purpose SIMDs in the
Bioinformatics Arena
• Parcel
– Acquired by Celera Genomics in 2000
– Products include the sequence
supercomputer GeneMatcher, which has a
high throughput sequence analysis capability
• Supports over a million processors
– GeneMatcher was used by Celera in their
race with the U.S. government to complete
the sequencing of the human genome
• TimeLogic, Inc
– Has DeCypher, a reconfigurable SIMD
74
Advantages of SIMDs
• Reference: [Roosta, pg 10]
• Less hardware than MIMDs as they have only
one control unit.
– Control units are complex.
• Less memory needed than MIMD
– Only one copy of the instructions needs to be stored
– Allows more data to be stored in memory.
• Less startup time in communicating between
PEs.
75
Advantages of SIMDs (cont)
• Single instruction stream and synchronization of
PEs make SIMD applications easier to program,
understand, & debug.
– Similar to sequential programming
• Control flow operations and scalar operations
can be executed on the control unit while PEs
are executing other instructions.
• MIMD architectures require explicit
synchronization primitives, which create a
substantial amount of additional overhead.
76
Advantages of SIMDs (cont)
• During a communication operation between
PEs,
– PEs send data to a neighboring PE in parallel and in
lock step
– No need to create a header with routing information
as “routing” is determined by program steps.
– the entire communication operation is executed
synchronously
– SIMDs are deterministic & have much more
predictable running time.
• Can normally compute a tight (worst case) upper bound for
the time for communications operations.
• Less complex hardware in SIMD since no
message decoder is needed in the PEs
– MIMDs need a message decoder in each PE.
77
SIMD Shortcomings
(with some rebuttals)
81
SIMD Shortcomings
(with some rebuttals)
Claim 5: Requires customized VLSI for processors,
while the expense of control units in PCs has dropped.
• Reliance on COTS (Commodity, off-the-shelf parts)
has dropped the price of MIMDS
• Expense of PCs (with control units) has dropped
significantly
• However, reliance on COTS has fueled the success
of ‘low level parallelism’ provided by clusters and
restricted new innovative parallel architecture
research for well over a decade.
83
SIMD Shortcomings
(with some rebuttals)
Claim 5 (cont.)
• There is strong evidence that the period of
continual dramatic increases in speed of PCs
and clusters is ending.
• Continued rapid increases in parallel
performance in the future will be necessary in
order to solve important problems that are
beyond our current capabilities
• Additionally, with the appearance of the very
economical COTS SIMDs, this claim no longer
appears to be relevant.
84
Multiprocessors
• Multiprocessor: multiple-CPU computer
with a shared memory
• Same address on two different CPUs
refers to the same memory location
• Avoids three cited criticisms for SIMDs
– Can be built from commodity CPUs
– Naturally support multiple users
– Maintain efficiency in conditional code
85
Centralized Multiprocessor
86
Centralized Multiprocessor
• Straightforward extension of uniprocessor
• Add CPUs to bus
• All processors share same primary memory
• Memory access time same for all CPUs
– Uniform memory access (UMA)
multiprocessor
• Also called a symmetrical multiprocessor (SMP)
87
Private and Shared Data
• Private data: items used only by a single
processor
• Shared data: values used by multiple
processors
• In a centralized multiprocessor (i.e. SMP),
processors communicate via shared data
values
88
Problems Associated with Shared Data
89
Write Invalidate Protocol
Most common solution to the cache coherence problem:
1. Each CPU’s cache controller monitors
(snoops) the bus & identifies which cache
blocks are requested by other CPUs.
2. A PE gains exclusive control of data item
before performing “write”.
3. Before “write” occurs, all other copies of data
item cached by other PEs are invalidated.
4. When any other CPU tries to read a memory
location from an invalidated cache block,
– a cache miss occurs
– It has to retrieve updated data from memory
90
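The four steps above can be sketched with a toy snooping model (the class names are ours; a write-through policy is assumed for simplicity, whereas real protocols are usually write-back):

```python
# Toy write-invalidate protocol: caches "snoop" the bus; a write
# invalidates every other cached copy (step 3); a later read of an
# invalidated block misses and refetches from memory (step 4).

class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def write(self, writer, addr, value):
        for cache in self.caches:          # step 3: invalidate other copies
            if cache is not writer:
                cache.data.pop(addr, None)
        writer.data[addr] = value
        self.memory[addr] = value          # write-through, for simplicity

class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.data = {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.data:          # step 4: miss on invalidated block
            self.data[addr] = self.bus.memory[addr]
        return self.data[addr]

bus = Bus(memory={"X": 7})
a, b = Cache(bus), Cache(bus)
a.read("X"); b.read("X")                   # both caches hold 7
bus.write(a, "X", 2)                       # b's copy is invalidated
print(b.read("X"))                         # 2, refetched after a cache miss
```

The key invariant: after any write completes, at most one cache holds a copy of the written block.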
Cache-coherence Problem
[Animation: memory holds X = 7; CPU A and CPU B each read X,
so both caches hold 7. CPU A then writes X = 2: A's cache holds 2
while B's cache still holds the stale value 7. The write to main
memory is a problem.]
94
Write Invalidate Protocol
[Animation: both caches hold X = 7. A cache controller snoops the
bus to see which cache block is being requested by other processors.
CPU A signals "intent to write X"; before the write can occur, all
copies of data at that address are declared invalid. CPU A then
writes X = 2. When CPU B tries to read from this location in its
cache, it receives a cache miss and has to refresh the value from
main memory.]
98
Synchronization Required for
Shared Data
• Mutual exclusion
– Definition: At most one process can be engaged in an
activity at any time.
– Example: Only one processor can write to the same
address in main memory at the same time.
– We say that a process must mutually exclude all
others while it performs this write.
• Barrier synchronization
– Definition: Guarantees that no process will proceed
beyond a designated point (called the barrier) until
every process reaches that point.
99
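Both primitives can be sketched with Python threads (a toy example; the thread count and loop size are arbitrary):

```python
# Mutual exclusion (Lock) protects a shared counter so only one
# thread updates it at a time; barrier synchronization (Barrier)
# guarantees no thread proceeds until all four threads arrive.
import threading

counter = 0
lock = threading.Lock()
barrier = threading.Barrier(4)
after_barrier = []

def worker():
    global counter
    for _ in range(1000):
        with lock:                 # mutual exclusion: one writer at a time
            counter += 1
    barrier.wait()                 # barrier: wait until all 4 threads arrive
    after_barrier.append(counter)  # every thread now sees the final count

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

print(counter)        # 4000
print(after_barrier)  # [4000, 4000, 4000, 4000]
```

Without the lock, lost updates could leave the counter below 4000; without the barrier, a thread could read the counter before the others finished.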
Distributed Multiprocessor
100
Cache Coherence
• Some NUMA multiprocessors do not
support cache coherence in hardware
– Only instructions and private data are stored
in cache
– Policy creates a large memory access time
variance
• Implementation more difficult
– No shared memory bus to “snoop”
– Directory-based protocol needed
102
Directory-based Protocol
• Distributed directory contains information
about cacheable memory blocks
• One directory entry for each cache block
• Each entry has
– Sharing status
– Which processors have copies
103
Sharing Status
• Uncached -- (denoted by “U”)
– Block not in any processor’s cache
• Shared – (denoted by “S”)
– Cached by one or more processors
– Read only
• Exclusive – (denoted by “E”)
– Cached by exactly one processor
– Processor has written to block
– Copy in memory is obsolete
104
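The three states can be sketched as a toy directory entry (the class name is ours; the presence bits follow the U000/S100-style notation, one bit per processor; memory write-back on demotion is omitted for brevity):

```python
# Directory entry for one cache block: a sharing status (U, S, or E)
# plus one presence bit per processor.

class DirectoryEntry:
    def __init__(self, num_procs):
        self.status = "U"                  # Uncached: in no processor's cache
        self.present = [0] * num_procs

    def read(self, proc):
        self.status = "S"                  # Shared: read-only copies
        self.present[proc] = 1

    def write(self, proc):
        self.status = "E"                  # Exclusive: one (dirty) copy;
        self.present = [1 if p == proc else 0  # all other copies invalidated
                        for p in range(len(self.present))]

entry = DirectoryEntry(3)
entry.read(0)
print(entry.status, entry.present)   # S [1, 0, 0]
entry.read(2)
print(entry.status, entry.present)   # S [1, 0, 1]
entry.write(0)
print(entry.status, entry.present)   # E [1, 0, 0]
```

In the E state the memory copy is obsolete, so a later read by another processor must first fetch the block from the owning cache.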
Directory-based Protocol - Example
[Animation over the interconnection network. A directory entry
shows the sharing status plus one presence bit per processor.
The recoverable sequence of frames:
– X uncached: entry X/U000, memory holds X = 7, no cached copies.
– The first CPU reads X: entry X/S100; its cache loads 7.
– The third CPU has a read miss on X: entry X/S101; both caches
hold 7.
– The first CPU intends to write X: the other cached copy is
invalidated, the entry becomes X/E100, and its cache holds the
new value 6 (the memory copy, still 7, is now obsolete).
– The second CPU reads X: memory is updated to 6, the entry
becomes X/S110, and two caches hold 6.
– The third CPU writes X = 5: entry X/E001; later frames show
ownership moving (X/E010, X/E100), memory updated to 5, and
cache block storage for X created on the owning CPU.
– The owning CPU writes X = 4; when the block is finally flushed,
the entry returns to X/U000 and memory holds X = 4.]
132
Typically, Two Flavors of
Multicomputers
• Commercial multicomputers
– Custom switch network
– Low latency (the time it takes to send a message).
– High bandwidth (data path width) across processors
• Commodity clusters
– Mass produced computers, switches and other
equipment
– Use low cost components
– Message latency is higher
– Communications bandwidth is lower
133
Multicomputer Communication
• Processors are connected by an interconnection
network
• Each processor has a local memory and can
only access its own local memory
• Data is passed between processors using
messages, as required by the program
• Data movement across the network is also
asynchronous
– A common approach is to use MPI to handle
message passing
134
Multicomputer Communications (cont)
135
Message-Passing Advantages
• No problem with simultaneous access
to data.
• Allows different PCs to operate on the
same data independently.
• Allows PCs on a network to be easily
upgraded when faster processors
become available.
136
Disadvantages of
Message-Passing
• Programmers must make explicit
message-passing calls in the code
• This is low-level programming and is error
prone.
• Data is not shared but copied, which
increases the total data size.
• Data Integrity
– Difficulty in maintaining correctness of multiple
copies of data item.
137
Some Interconnection Network
Terminology (1/2)
References: Wilkinson, et al. & Grama, et al.
Also, earlier slides on architecture & networks.
A link is the connection between two nodes.
• A switch that enables packets to be routed
through the node to other nodes without
disturbing the processor is assumed.
• The link between two nodes can be either
bidirectional or use two directional links.
• Can assume either one wire that carries one bit
or parallel wires (one wire for each bit in a word).
• The above choices do not have a major impact
on the concepts presented in this course.
138
Network Terminology (2/2)
• The bandwidth is the number of bits that can be
transmitted in unit time (i.e., bits per second).
• The network latency is the time required to
transfer a message through the network.
– The communication latency is the total time
required to send a message, including
software overhead and interface delay.
– The message latency or startup time is the
time required to send a zero-length message.
• Includes software & hardware overhead, such as
– Choosing a route
– packing and unpacking the message
139
Circuit Switching Message Passing
• Technique establishes a path and allows the
entire message to transfer uninterrupted.
• Similar to telephone connection that is held until
the end of the call.
• Links used are not available to other messages
until the transfer is complete.
• Latency (message transfer time): If the length of
control packet sent to establish path is small wrt
(with respect to) the message length, the latency
is essentially
– the constant L/B, where L is message length
and B is bandwidth. 140
Store-and-forward Packet
Switching
• Message is divided into “packets” of
information
• Each packet includes source and
destination addresses.
• Packets cannot exceed a fixed maximum
size (e.g., 1000 bytes).
• A packet is stored in a node in a buffer
until it can move to the next node.
141
Packet Switching (cont)
• At each node, the destination information is
examined and used to select which node to
forward the packet to.
• Routing algorithms (often probabilistic) are used
to avoid hot spots and to minimize traffic jams.
• Significant latency is created by storing each
packet in each node it reaches.
• Latency increases linearly with the length of the
route.
142
Virtual Cut-Through Packet
Switching
143
Wormhole Routing
• Alternate to store-and-forward packet
routing
• A message is divided into small units
called flits (flow control units).
• Flits are 1-2 bytes in size.
• Can be transferred in parallel on links with
multiple wires.
• Only the head flit is initially transferred
when the next link becomes available.
144
Wormhole Routing (cont)
• As each flit moves forward, the next flit can
move forward.
• The entire path must be reserved for a message
as these packets pull each other along (like cars
of a train).
• Request/acknowledge bit messages are
required to coordinate these pull-along moves.
– See Wilkinson, et al.
• Latency: If the head flit is very small
compared to the length of the message, then the
latency is essentially the constant L/B, with L the
message length and B the link bandwidth.
145
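The latency models from the preceding slides can be compared side by side (a sketch; function names are ours, and the control-packet / header-flit overhead is assumed negligible, as the slides assume):

```python
# Latency comparison from the slides, with message length L (bits),
# link bandwidth B (bits/sec), and a route of `hops` links:
#   store-and-forward: the whole message is stored at every node,
#                      so latency grows linearly with the route length
#   circuit switching / wormhole: once the path is set up, flits
#                      pipeline through, so latency is essentially L/B

def store_and_forward_latency(L, B, hops):
    return hops * (L / B)      # whole message retransmitted at each node

def wormhole_latency(L, B, hops):
    return L / B               # flits pull each other along the path

L, B = 1_000_000, 1_000_000    # 1 Mbit message over 1 Mbit/s links
print(store_and_forward_latency(L, B, hops=5))  # 5.0 seconds
print(wormhole_latency(L, B, hops=5))           # 1.0 second
```

The gap widens with route length, which is why wormhole routing dominates in modern multicomputer interconnects.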
Deadlock
• Routing algorithms needed to find a path
between the nodes.
• Adaptive routing algorithms choose different
paths, depending on traffic conditions.
• Livelock is a deadlock-type situation where a
packet continues to go around the network,
without ever reaching its destination.
• Deadlock: No packet can be forwarded because
they are blocked by other stored packets waiting
to be forwarded.
146
Asymmetric Multicomputers
• Has a front-end that interacts with users
and I/O devices.
– Processors in back end are used for
computation.
– Programming similar to SIMDs (i.e., processor
arrays)
– Common with early multicomputers
• Examples of asymmetrical multicomputers
given in textbook.
147
Asymmetrical MC Advantages
• Back-end processors dedicated to parallel
computations
→ easier to understand, model, and tune
performance
• Only a simple back-end operating system
needed
→ easy for a vendor to create
148
Asymmetrical MC Disadvantages
• Front-end computer is a
single point of failure
• Single front-end computer
limits scalability of system
• Primitive operating
system in back-end
processors makes
debugging difficult
• Every application requires
development of both
front-end and back-end
program
149
Symmetric Multicomputers
• Every computer executes the same operating
system and has identical functionality.
• Users may log into any computer to edit or
compile their programs.
• Any or all computers may be involved in the
execution of their program.
• During execution of programs, every PE
executes the same program.
• When only one PE should execute an operation,
an “if” statement is used to select the PE.
150
Symmetrical MC Advantages
• Alleviate performance
bottleneck caused by
single front-end computer
• Better support for
debugging
• Every processor executes
same program
152
Symmetrical MC Disadvantages
154
Best Model for Commodity Cluster
• Full-Fledged operating system (e.g.,
Linux) desirable
– Feature of symmetric multicomputer
• Desirable to increase cache hits
– Favors having only a single user process on each PE
– Favors most nodes being off-limits for program
development
• Need fast network
– Keep program development users off networks and
have them access front-end by another path.
– Reserve interconnection network to usage by parallel
processes
• Overall, a mixed model may be best for
commodity clusters 155
Ideal Commodity Cluster Features
• Co-located computers
• Computers dedicated to running a single
process at a time to lower cache misses
• Accessible only from the network
– No keyboards or displays
• Users access front-end by another route.
• Identical operating systems
• Identical local disk images
• Administered as an entity
156
ParPar Cluster, A Mixed Model
• Mixed model
• Incorporates both asymmetrical and
symmetrical designs.
157
Network of Workstations
• Dispersed computers
– Typically located on user’s desks
• First priority: response time for person at
the keyboard
• Parallel jobs wait in background and run
when spare CPU cycles are available.
• Different operating systems
• Different local images
• Checkpointing and restarting important
• Typically connected by Ethernet
– Too slow for commodity cluster usage
158
A Commodity Cluster vs
Network of Workstations
• A commodity cluster contains components of
local area networks
– Commodity computers
– Switches
• A network of workstations is a dispersed
collection of computers
– Distributed heterogeneous computers
– Located on primary users’ desks
– Unused cycles available for parallel use
– Example: SETI project
159
Summary
• Commercial parallel computers appeared
in 1980s
• Multiple-CPU computers now dominate
• Small-scale: Centralized multiprocessors
• Large-scale: Distributed memory
architectures
– Multiprocessors
– Multicomputers
160