
Chapter 2

Parallel Architectures
Outline
• Some chapter references
• Brief review of complexity
– Terminology for comparisons
• Interconnection networks
• Processor arrays
• Multiprocessors
• Multicomputers
• Flynn’s Taxonomy – moved to Chpt 1
2
Some Chapter References
• Selim Akl, The Design and Analysis of Parallel
Algorithms, Prentice Hall, 1989 (earlier textbook).
• G. C. Fox, What Have We Learnt from Using Real
Parallel Machines to Solve Real Problems? Technical
Report C3P-522, Cal Tech, December 1989. (Included in
part in more recent books co-authored by Fox.)
• A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction
to Parallel Computing, Second Edition, 2003 (first edition
1994), Addison Wesley.
• Harry Jordan, Gita Alaghband, Fundamentals of Parallel
Processing: Algorithms, Architectures, Languages,
Prentice Hall, 2003, Ch 1, 3-5.
• F. Thomson Leighton, Introduction to Parallel Algorithms
and Architectures: Arrays, Trees, Hypercubes, Morgan
Kaufmann Publishers, 1992.
3
References - continued
• Gregory Pfister, In Search of Clusters: The Ongoing Battle
in Lowly Parallel Computing, 2nd Edition, Ch 2. (Discusses
details of some serious problems that MIMDs incur.)
• Michael Quinn, Parallel Programming in C with MPI and
OpenMP, McGraw Hill, 2004 (current textbook), Chapter 2.
• Michael Quinn, Parallel Computing: Theory and Practice,
McGraw Hill, 1994, Ch. 1-2.
• Sayed H. Roosta, Parallel Processing & Parallel Algorithms:
Theory and Computation, Springer Verlag, 2000, Ch. 1.
• Wilkinson & Allen, Parallel Programming: Techniques and
Applications, Prentice Hall, 2nd Edition, 2005, Ch. 1-2.
4
Brief Review
Complexity Concepts Needed for Comparisons
• Whenever we define a counting function, we usually
characterize the growth rate of that function in terms of
complexity classes.
• Technical Definition: We say a function f(n) is in O(g(n))
if (and only if) there are positive constants c and n0 such
that
0 ≤ f(n) ≤ c·g(n) for all n ≥ n0
• O(n) is read as big-oh of n.
• This notation can be used to separate counting functions
into complexity classes that characterize the size of the
count.
• We can use it for any kind of counting functions such as
timings, bisection widths, etc.

5
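As a quick illustration (not from the slides), the sketch below checks a hypothetical witness pair c, n0 for a simple counting function; the function f(n) = 3n^3 + 2n and the witnesses c = 5, n0 = 1 are assumed for the example.

```python
# Hypothetical example: check that f(n) = 3n^3 + 2n is in O(n^3)
# by testing the witnesses c = 5 and n0 = 1 over a range of n.

def f(n):
    return 3 * n**3 + 2 * n

def g(n):
    return n**3

c, n0 = 5, 1
assert all(0 <= f(n) <= c * g(n) for n in range(n0, 10_000))
print("0 <= f(n) <= 5*g(n) holds for all tested n >= 1")
```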
Big-Oh and Asymptotic Growth Rate
• The big-Oh notation gives an upper bound on the
(asymptotic) growth rate of a function
• The statement “f(n) is O(g(n))” means that the growth
rate of f(n) is not greater than the growth rate of g(n)
• We can use the big-Oh notation to rank functions
according to their growth rate

Assume:                  f(n) is O(g(n))    g(n) is O(f(n))

g(n) grows faster              Yes                No
f(n) grows faster              No                 Yes
Same growth                    Yes                Yes
6
Relatives of Big-Oh
• big-Omega
– f(n) is Ω(g(n)) if there is a constant c > 0
and an integer constant n0 ≥ 1 such that
f(n) ≥ c·g(n) ≥ 0 for all n ≥ n0
Intuitively, this says that, up to a constant factor, f(n)
is asymptotically greater than or equal to g(n).

• big-Theta
– f(n) is Θ(g(n)) if there are constants c′ > 0 and c′′ > 0
and an integer constant n0 ≥ 1 such that
0 ≤ c′·g(n) ≤ f(n) ≤ c′′·g(n) for all n ≥ n0
Intuitively, this says that, up to constant factors, f(n) and g(n)
are asymptotically the same.
Note: These concepts are covered in algorithm courses. 7
Relatives of Big-Oh
• little-oh
– f(n) is o(g(n)) if, for any constant c > 0, there is an
integer constant n0 ≥ 0 such that 0 ≤ f(n) < c·g(n) for all
n ≥ n0
Intuitively, this says f(n) is, up to a constant,
asymptotically strictly less than g(n), so f(n) ≠ Θ(g(n)).
• little-omega
– f(n) is ω(g(n)) if, for any constant c > 0, there is an
integer constant n0 ≥ 0 such that f(n) > c·g(n) ≥ 0 for all
n ≥ n0
Intuitively, this says f(n) is, up to a constant,
asymptotically strictly greater than g(n), so f(n) ≠
Θ(g(n)).
These are not used as much as the earlier definitions,
but they round out the picture. 8
Summary for Intuition for
Asymptotic Notation
big-Oh
– f(n) is O(g(n)) if f(n) is asymptotically less than or equal to
g(n)
big-Omega
– f(n) is Ω(g(n)) if f(n) is asymptotically greater than or
equal to g(n)
big-Theta
– f(n) is Θ(g(n)) if f(n) is asymptotically equal to g(n)
little-oh
– f(n) is o(g(n)) if f(n) is asymptotically strictly less than g(n)
little-omega
– f(n) is ω(g(n)) if f(n) is asymptotically strictly greater than g(n)
9
A CALCULUS DEFINITION OF O, Θ
(often easier to use)
Definition: Let f and g be functions defined on
the positive integers with nonnegative values.
We say g is in O(f) if and only if
lim (n → ∞) g(n)/f(n) = c
for some nonnegative real number c, i.e. the
limit exists and is not infinite.
Definition: We say f is in Θ(g) if and only if
f is in O(g) and g is in O(f)
Note: Often use L'Hopital's Rule to calculate the
limits you need.
10
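A small numerical sketch of the calculus definition (the functions below are assumed purely for illustration): instead of applying L'Hopital's rule symbolically, one can watch the ratio settle toward a finite constant.

```python
# Numerical sketch of the calculus definition (illustrative functions assumed):
# to show f is in O(g), watch f(n)/g(n) approach a finite constant.
import math

def f(n):
    return 5 * n * math.log(n) + 7 * n

def g(n):
    return n * math.log(n)

for n in (10**3, 10**6, 10**9, 10**12):
    print(f"n = {n:>13}: f(n)/g(n) = {f(n) / g(n):.4f}")

# The ratio tends to 5, a finite constant, so f is in O(g); since g(n)/f(n)
# also tends to a finite constant (1/5), f is in Theta(g) as well.
```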
Why Asymptotic Behavior is Important

• 1) Allows us to compare counts on large sets.


• 2) Helps us understand the maximum size of
input that can be handled in a given time,
provided we know the environment in which we
are running.
• 3) Stresses the fact that even dramatic
speedups in hardware can not overcome the
handicap of an asymptotically slow algorithm.

11
Recall: ORDER WINS OUT
(Example from Baase’s Algorithms Text)
The TRS-80
Main language support: BASIC - typically a slow running
interpreted language
For more details on the TRS-80 see:
http://mate.kjsl.com/trs80/

The CRAY-YMP
Language used in example: FORTRAN - a fast running language
For more details on the CRAY-YMP see:
http://ds.dial.pipex.com/town/park/abm64/CrayWWWStuff/Cfaqp1.html#TOC3
12
                   CRAY YMP                  TRS-80
                   with FORTRAN              with BASIC
                   complexity is 3n^3        complexity is 19,500,000n

(microsecond, abbr. µsec: one-millionth of a second;
millisecond, abbr. msec: one-thousandth of a second)

n                  CRAY YMP                  TRS-80
10                 3 microsec                200 millisec
100                3 millisec                2 sec
1000               3 sec                     20 sec
2500               50 sec                    50 sec
10,000             49 min                    3.2 min
1,000,000          95 years                  5.4 hours
13
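A short sketch (assuming only the two cost formulas shown above, taken in a common time unit) reproduces why the rows cross near n = 2500: the machines break even where 3n^3 = 19,500,000n.

```python
# Sketch: locate the break-even point of the two cost functions in the table.
import math

def cray(n):          # CRAY YMP running the Theta(n^3) algorithm
    return 3 * n**3

def trs80(n):         # TRS-80 running the Theta(n) algorithm
    return 19_500_000 * n

# Equal when 3*n**2 = 19_500_000, i.e. n = sqrt(6_500_000) ~ 2550,
# which matches the n = 2500 row where both columns read 50 sec.
print(f"break-even near n = {math.sqrt(19_500_000 / 3):.0f}")

for n in (10, 100, 1000, 2500, 10_000, 1_000_000):
    winner = "CRAY" if cray(n) < trs80(n) else "TRS-80"
    print(f"n = {n:>9}: faster machine is the {winner}")
```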
Interconnection Networks
• Uses of interconnection networks
– Connect processors to shared memory
– Connect processors to each other
• Interconnection media types
– Shared medium
– Switched medium
• Different interconnection networks define
different parallel machines.
• The interconnection network’s properties
influence the type of algorithm used for various
machines, as they affect how data is routed.
14
Shared versus Switched Media

a. With shared medium, one message is sent & all processors listen
b. With switched medium, multiple messages are possible.

15
Shared Medium
• Allows only one message at a time
• Messages are broadcast
• Each processor “listens” to every message
• Before sending a message, a processor
“listens” until the medium is unused
• Collisions require resending of messages
• Ethernet is an example

16
Switched Medium
• Supports point-to-point messages
between pairs of processors
• Each processor is connected to one switch
• Advantages over shared media
– Allows multiple messages to be sent
simultaneously
– Allows scaling of the network to
accommodate the increase in processors

17
Switch Network Topologies
• View switched network as a graph
– Vertices = processors or switches
– Edges = communication paths
• Two kinds of topologies
– Direct
– Indirect

18
Direct Topology
• Ratio of switch nodes to processor nodes
is 1:1
• Every switch node is connected to
– 1 processor node
– At least 1 other switch node

Indirect Topology

• Ratio of switch nodes to processor nodes is
greater than 1:1
• Some switches simply connect to other
switches
19
Terminology for Evaluating
Switch Topologies
• We need to evaluate 4 characteristics of a
network in order to help us understand their
effectiveness in implementing efficient parallel
algorithms on a machine with a given network.
• These are
– The diameter
– The bisection width
– The edges per node
– The constant edge length
• We’ll define these and see how they affect
algorithm choice.
• Then we will investigate several different
topologies and see how these characteristics
are evaluated. 20
Terminology for Evaluating
Switch Topologies
• Diameter – Largest distance between two
switch nodes.
– A low diameter is desirable
– It puts a lower bound on the complexity of
parallel algorithms which require
communication between arbitrary pairs of
nodes.

21
Terminology for Evaluating
Switch Topologies
• Bisection width – The minimum number of edges
between switch nodes that must be removed in
order to divide the network into two halves
(within 1 node, if the number of processors is
odd.)
• High bisection width is desirable.
• In algorithms requiring large amounts of data
movement, the size of the data set divided by
the bisection width puts a lower bound on the
complexity of the algorithm.
• Actually proving what the bisection width of a
network is can be quite difficult.

22
Terminology for Evaluating
Switch Topologies

• Number of edges / node


– It is best if the maximum number of edges/node is a
constant independent of network size, as this allows
the processor organization to scale more easily to a
larger number of nodes.
– Degree is the maximum number of edges per node.
• Constant edge length? (yes/no)
– Again, for scalability, it is best if the nodes and edges
can be laid out in 3D space so that the maximum
edge length is a constant independent of network
size.
23
Evaluating Switch Topologies
• Many have been proposed and analyzed. We
will consider several well known ones:
– 2-D mesh
– linear network
– binary tree
– hypertree
– butterfly
– hypercube
– shuffle-exchange
• Those in yellow have been used in commercial
parallel computers.
24
2-D Meshes

Note: Circles represent switches and squares
represent processors in all these slides.
25
2-D Mesh Network
• Direct topology
• Switches arranged into a 2-D lattice or grid
• Communication allowed only between
neighboring switches
• Torus: Variant that includes wraparound
connections between switches on edge of
mesh

26
Evaluating 2-D Meshes
(Assumes mesh is a square)
n = number of processors
• Diameter:
– Θ(n^(1/2))
– Places a lower bound on algorithms that require
processing with arbitrary nodes sharing data.
• Bisection width:
– Θ(n^(1/2))
– Places a lower bound on algorithms that require
distribution of data to all nodes.
• Max number of edges per switch:
– 4 is the degree
• Constant edge length?
– Yes
• Does this scale well?
– Yes
27
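A minimal sketch of the exact numbers behind those Θ bounds, assuming a square √n × √n mesh with no wraparound (torus) links:

```python
# Sketch: exact values behind the Theta(n^(1/2)) bounds for a square 2-D mesh
# with n = k*k switches and no wraparound links (assumed layout).
import math

def mesh_metrics(n):
    k = math.isqrt(n)
    assert k * k == n, "n must be a perfect square"
    return {
        "diameter": 2 * (k - 1),       # corner to opposite corner
        "bisection_width": k,          # cut the k links crossing the middle
        "max_edges_per_switch": 4,     # constant degree
    }

print(mesh_metrics(1024))
# {'diameter': 62, 'bisection_width': 32, 'max_edges_per_switch': 4}
```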
Linear Network
• Switches arranged into a 1-D mesh
• Direct topology
• Corresponds to a row or column of a 2-D mesh
• Ring : A variant that allows a wraparound
connection between switches on the end.
• The linear and ring networks have many
applications
• Essentially supports a pipeline in both directions
• Although these networks are very simple, they
support many optimal algorithms.

28
Evaluating Linear and Ring Networks
• Diameter
– Linear : n-1 or Θ(n)
– Ring: n/2 or Θ(n)
• Bisection width:
– Linear: 1 or Θ(1)
– Ring: 2 or Θ(1)
• Degree for switches:
– 2
• Constant edge length?
– Yes
• Does this scale well?
– Yes
29
Binary Tree Network
• Indirect topology
• n = 2^d processor nodes and 2n − 1 switches,
where d = 0, 1, ... is the number of levels
e.g. 2^3 = 8 processors
on the bottom and
2(n) − 1 = 2(8) − 1 =
15 switches
30
Evaluating Binary Tree Network
• Diameter:
– 2 log n or O(log n).
– Note- this is small
• Bisection width:
– 1, the lowest possible number
• Degree:
–3
• Constant edge length?
– No
• Does this scale well?
– No
31
Hypertree Network (of degree 4 and
depth 2)

(a) Front view: 4-ary tree of height 2
(b) Side view: upside-down binary tree of height d
(c) Complete network
32
Hypertree Network
• Indirect topology
• Note- the degree k and the depth d must be
specified.
• This gives from the front a k-ary tree of height d.
• From the side, the same network looks like an
upside down binary tree of height d.
• Joining the front and side views yields the
complete network.

33
Evaluating 4-ary Hypertree with
Depth d
• A 4-ary hypertree has n = 4^d processors
– General formula for a k-ary hypertree is n = k^d
• Diameter is 2d = 2 log n
– shares the low diameter of the binary tree
• Bisection width = 2^(d+1)
– Note here, 2^(d+1) = 2^3 = 8
– Large value - much better than the binary tree
• Constant edge length?
– No
• Degree = 6
34
Butterfly Network
• Indirect topology
• n = 2^d processor nodes connected by
n(log n + 1) switching nodes
• Example: a 2^3 = 8 processor butterfly network
has 8 × 4 = 32 switching nodes, arranged in rows
called ranks (Rank 0 through Rank 3).
• As complicated as this switching network appears
to be, it is really quite simple, as it admits a very
nice routing algorithm!
• Wrapped Butterfly: when the top and bottom
ranks are merged into a single rank.
35
Building the 2^3 Butterfly Network
• There are 8 processors.
• There are 4 ranks (i.e. rows) with 8 switches per rank.
• Connections:
– Node (i,j), for i > 0, is connected to two nodes on rank
i−1, namely node (i−1, j) and node (i−1, m), where m is
the integer found by flipping the i-th most significant bit
in the binary d-bit representation of j.
– For example, suppose i = 2 and j = 3. Then node (2,3)
is connected to node (1,3).
– To get the other connection, 3 = 011 in binary. Flipping the
2nd most significant bit gives 001, so node (2,3) is also
connected to node (1,1). --- NOTE: There is an error on pg 32 on this
example.
• Nodes connected by a cross edge from rank i to rank i+1
have node numbers that differ only in their (i+1)-th bit.
36
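A minimal sketch of the wiring rule just described (the function name is mine); it reproduces the node (2,3) example for the 2^3 butterfly.

```python
# Sketch of the butterfly wiring rule: node (i, j), i > 0, connects up to
# node (i-1, j) and to node (i-1, m), where m is j with its i-th most
# significant bit (of d bits) flipped.

def butterfly_neighbors_up(i, j, d):
    """Return the two rank-(i-1) switches that switch (i, j) is wired to."""
    assert 1 <= i <= d
    m = j ^ (1 << (d - i))            # flip the i-th most significant bit of j
    return (i - 1, j), (i - 1, m)

# Reproduces the slide's example: node (2, 3) connects to (1, 3) and (1, 1).
print(butterfly_neighbors_up(2, 3, 3))   # ((1, 3), (1, 1))
```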
Why It Is Called a Butterfly Network
• Walk cycles such as node(i,j), node(i-1,j),
node(i,m), node(i-1,m), node(i,j) where m
is determined by the bit flipping as shown
and you “see” a butterfly:

37
Butterfly Network Routing
Send message from
processor 2 to processor 5.

Algorithm:
0 means ship left;
1 means ship right.

1) 5 = 101. Pluck off leftmost


bit 1 and send “01msg” to
right.
2) Pluck off leftmost bit 0
and send “1msg” to left.
3) Pluck off leftmost bit 1
and send “msg” to right.

Each cross edge followed


changes address by 1 bit.
38
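A sketch of the destination-tag routing just described (the function name is assumed): at the step down from rank i−1 to rank i the message takes the straight edge if the i-th destination bit already matches its column, and the cross edge otherwise, so the column ends up equal to the destination address.

```python
# Sketch of destination-tag routing on a 2^d-processor butterfly: the step
# from rank i-1 to rank i sets the i-th most significant bit of the current
# column to the i-th bit of the destination (straight edge if it already
# matches, cross edge otherwise).

def butterfly_route(src, dst, d):
    """Return the sequence of (rank, column) switches the message visits."""
    col, path = src, [(0, src)]
    for i in range(1, d + 1):
        bit = (dst >> (d - i)) & 1
        col = (col & ~(1 << (d - i))) | (bit << (d - i))
        path.append((i, col))
    return path

# The slide's example: processor 2 to processor 5 in the 2^3 butterfly.
print(butterfly_route(2, 5, 3))   # [(0, 2), (1, 6), (2, 4), (3, 5)]
```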
Evaluating the Butterfly Network
with n Processors
• Diameter:
– log n
• Bisection width:
– n / 2 *(Likely error: 32/2 = 16)
• Degree:
– 4 (even for d > 3)
• Constant edge length?
– No, grows exponentially as the rank size decreases
39
* On pg 442, Leighton gives Θ(n / log n) as the bisection width. Simply remove
the cross edges between two successive ranks to create the bisection cut.
Hypercube
(also called binary n-cube)

(Figure: a hypercube with n = 2^d processors & switches for d = 4,
nodes labeled 0000 through 1111.)
40
Hypercube (or Binary n-cube)
n = 2^d Processors
• Direct topology
• 2 × 2 × … × 2 mesh
• Number of nodes is a
power of 2
• Node addresses are 0, 1,
…, n−1
• Node i is connected to the
d = log n nodes whose
addresses differ from i
in exactly one bit
position.
• Example: node 0111 is
connected to 1111,
0011, 0101, and 0110 41
Growing a Hypercube
Note: For d = 4, it is called
a 4-dimensional cube.

42
Evaluating Hypercube Network
with n = 2^d nodes
• Diameter:
– d = log n
• Bisection width:
– n / 2
• Edges per node:
– log n
• Constant edge length?
– No. The length of the
longest edge increases
as n increases.
43
Routing on the Hypercube Network
• As with the butterfly network, bit
flipping helps you route on this
network.
• Example: Send a message from node 2
= 0010 to node 5 = 0101.
• The nodes differ in 3 bits, so the
shortest path will be of length 3.
• One path is
0010 → 0110 → 0100 → 0101
obtained by flipping one of the differing
bits at each step.
• Similar to the butterfly. 44
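A minimal sketch of that bit-flipping (dimension-order) routing; it reproduces the 0010 → 0101 path from the slide.

```python
# Sketch of bit-flip routing on a hypercube with 2^d nodes: flip one
# differing address bit per hop, giving a shortest path whose length is
# the Hamming distance between source and destination.

def hypercube_route(src, dst, d):
    path, node = [src], src
    for bit in range(d - 1, -1, -1):          # fix bits from most to least significant
        if (node ^ dst) & (1 << bit):
            node ^= 1 << bit
            path.append(node)
    return path

# The slide's example: node 2 = 0010 to node 5 = 0101 in the d = 4 cube.
print([format(x, "04b") for x in hypercube_route(0b0010, 0b0101, 4)])
# ['0010', '0110', '0100', '0101']
```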
A Perfect Shuffle
• A permutation that is produced as follows is
called a perfect shuffle:
• Given a power of 2 cards, numbered 0, 1, 2, ...,
2^d − 1, write the card number with d bits. By left
rotating the bits with a wrap, we calculate the
position of the card after the perfect shuffle.
• Example: For d = 3, card 5 = 101. Left rotating
and wrapping gives us 011. So, card 5 goes to
position 3. Note that card 0 = 000 and card 7 =
111 stay in position.

45
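A one-function sketch of that left-rotate rule (the function name is assumed), reproducing the d = 3 example:

```python
# Sketch of the perfect-shuffle permutation: left-rotate the d-bit card
# number, with wraparound, to get the card's new position.

def perfect_shuffle(i, d):
    msb = (i >> (d - 1)) & 1
    return ((i << 1) & (2**d - 1)) | msb

# The slide's d = 3 example: card 5 = 101 -> 011 = position 3, while
# cards 0 = 000 and 7 = 111 stay where they are.
print([perfect_shuffle(i, 3) for i in range(8)])   # [0, 2, 4, 6, 1, 3, 5, 7]
```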
Shuffle-exchange Network
with n = 2^d Processors
• Direct topology
• Number of nodes is a power of 2
• Nodes have addresses 0, 1, …, 2^d − 1
• Two outgoing links from node i:
– Shuffle link to node LeftCycle(i)
– Exchange link between node i and node i+1
when i is even 46
Shuffle-exchange Addressing – 16
processors

(Figure: the 16 nodes, addressed 0000 through 1111, with their
shuffle and exchange links.)

No arrows on a line segment means it is bidirectional.
Otherwise, you must follow the arrows.
Devising a routing algorithm for this network is
interesting and will be a homework problem. 47
Evaluating the Shuffle-exchange
• Diameter:
– 2 log n – 1
• Edges per node:
– 3
• Constant edge length?
– No
• Bisection width:
– Θ(n / log n)
– Between n/(2 log n) and 2n/log n*
48
* See Leighton pg 480
Two Problems with
Shuffle-Exchange
• Shuffle-Exchange does not expand well
– A large shuffle-exchange network does not
decompose well into smaller separate shuffle
exchange networks.
– In a large shuffle-exchange network, a small
percentage of nodes will be hot spots
• They will encounter much heavier traffic
• The above results are in the dissertation of one of
Batcher’s students.
49
Comparing Networks
• All have logarithmic diameter
except 2-D mesh
• Hypertree, butterfly, and hypercube have
bisection width n / 2 (? Likely true only for n-cube)
• All have constant edges per node except
hypercube
• Only 2-D mesh, linear, and ring topologies keep
edge lengths constant as network size increases
• Shuffle-exchange is a good compromise- fixed
number of edges per node, low diameter, good
bisection width.
– However, negative results on preceding slide also
need to be considered. 50
Alternate Names for SIMDs
• Recall that all active processors of a SIMD
computer must simultaneously access the same
memory location.
• The value in the i-th processor can be viewed as
the i-th component of a vector.
• SIMD machines are sometimes called vector
computers [Jordan,et.al.] or processor arrays
[Quinn 94,04] based on their ability to execute
vector and matrix operations efficiently.

51
SIMD Computers
• SIMD computers that focus on vector
operations
– Support some vector and possibly matrix
operations in hardware
– Usually limit or provide less support for non-
vector type operations involving data in the
“vector components”.
• General purpose SIMD computers
– Usually also provide some vector and
possibly matrix operations in hardware.
– Support more traditional type operations (e.g.,
other than for vector/matrix data types).
52
Pipelined Architectures
• Pipelined architectures are sometimes
considered to be SIMD architectures
– See pg 37 of Textbook & pg 8-9 Jordan et. al.
– Vector components are entered successively into first
processor in pipeline.
– The i-th processor of the pipeline receives the output
from the (i-1)th processor.
• Normal “operations” in each processor are much
larger (coarser) in pipelined computers than in
true SIMDs
• “Pipelined” is somewhat SIMD in nature in that
synchronization is not required.
53
Why Processor Arrays?
• Historically, high cost of control units
• Scientific applications have data parallelism

54
Data/instruction Storage
• Front end computer
– Also called the control unit
– Holds and runs program
– Data manipulated sequentially
• Processor array
– Data manipulated in parallel

55
Processor Array Performance
• Performance: work done per time unit
• Performance of processor array
– Speed of processing elements
– Utilization of processing elements

56
Performance Example 1
• 1024 processors
• Each adds a pair of integers in 1 µsec (1
microsecond, i.e. one millionth of a second, or
10^-6 second).
• What is the performance when adding two
1024-element vectors (one element per
processor)?

Performance = 1024 operations / 10^-6 sec ≈ 1.024 × 10^9 ops/sec
57
Performance Example 2
• 512 processors
• Each adds two integers in 1 µsec
• What is the performance when adding two
vectors of length 600?
• Since 600 > 512, 88 processors must add
two pairs of integers.
• The other 424 processors add only a
single pair of integers.

58
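The slide stops before giving the final number; a sketch of the remaining arithmetic, assuming the additions proceed in two synchronous rounds of 1 µsec each:

```python
# Sketch of the arithmetic for Example 2 (the slide does not state the final
# number; this follows the pattern of Example 1, assuming two 1-usec rounds).
processors = 512
vector_length = 600
add_time = 1e-6                              # seconds per integer addition

rounds = -(-vector_length // processors)     # ceiling division -> 2 rounds
total_time = rounds * add_time               # 2 microseconds
performance = vector_length / total_time
print(performance)                           # 3.0e8 ops/sec, well below 512 * 10^6
```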
Example of a 2-D Processor
Interconnection Network in a Processor
Array
Each VLSI chip has 16 processing elements.
Each PE can simultaneously send a value to a neighbor.

PE =
processor
element

59
SIMD Execution Style
• The traditional (SIMD, vector, processor array) execution
style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]):
– The sequential processor that broadcasts the
commands to the rest of the processors is called the
front end or control unit (or sometimes host).
– The front end is a general purpose CPU that stores
the program and the data that is not manipulated in
parallel.
– The front end normally executes the sequential
portions of the program.
– Each processing element has a local memory that
can not be directly accessed by the control unit or
other processing elements.
60
SIMD Execution Style
– Collectively, the individual memories of the
processing elements (PEs) store the (vector)
data that is processed in parallel.
– When the front end encounters an instruction
whose operand is a vector, it issues a
command to the PEs to perform the
instruction in parallel.
– Although the PEs execute in parallel, some
units can be allowed to skip any particular
instruction.

61
Masking on Processor Arrays
• All the processors work in lockstep except those
that are masked out (by setting mask register).
• The conditional if-then-else is different for
processor arrays than sequential version
– Every active processor tests to see if its data meets
the negation of the boolean condition.
– If it does, it sets its mask bit so those processors will
not participate in the operation initially.
– Next the unmasked processors, execute the THEN
part.
– Afterwards, mask bits (for the original set of active
processors) are flipped and the unmasked processors
perform the ELSE part.
62
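A toy sketch of that masked two-phase execution, using a NumPy array as a stand-in for one value per PE (the data and operations are assumed, purely for illustration):

```python
# Toy sketch of masked if-then-else on a processor array, with a NumPy array
# standing in for one value per PE (illustrative data and operations).
import numpy as np

x = np.array([3, -1, 7, 0, -5, 2])

mask = x > 0                       # PEs where the condition holds stay active
x = np.where(mask, x * 2, x)       # THEN part: double the value (others idle)

mask = ~mask                       # flip the mask for the ELSE part
x = np.where(mask, 0, x)           # ELSE part: clear the value on the rest

print(x)                           # [ 6  0 14  0  0  4]
```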
if (COND) then A else B

63
if (COND) then A else B

64
if (COND) then A else B

65
SIMD Machines
• An early SIMD computer designed for vector and
matrix processing was the Illiac IV computer
– Initial development at the University of Illinois 1965-70
– Moved to NASA Ames, completed in 1972 but not
fully functional until 1976.
– See Jordan et. al., pg 7 and Wikipedia
• The MPP, DAP, the Connection Machines CM-1
and CM-2, MasPar MP-1 and MP-2 are
examples of SIMD computers
– See Akl pg 8-12 and [Quinn, 94]
• The CRAY-1 and the Cyber-205 use pipelined
arithmetic units to support vector operations and
are sometimes called a pipelined SIMD
– See [Jordan, et al, p7], [Quinn 94, pg 61-2], and
[Quinn 2004, pg 37]. 66
SIMD Machines
• Quinn [1994, pg 63-67] discusses the CM-2
Connection Machine (with 64K PEs) and a
smaller & updated CM-200.
• Professor Batcher was the chief architect for the
STARAN and the MPP (Massively Parallel
Processor) and an advisor for the ASPRO
– ASPRO is a small second generation STARAN used
by the Navy in the spy planes.
• Professor Batcher is best known architecturally
for the MPP, which is at the Smithsonian Institute
& currently displayed at a D.C. airport.
67
Today’s SIMDs
• Many SIMDs are being embedded in sequential
machines.
• Others are being built as part of hybrid
architectures.
• Others are being built as special purpose
machines, although some of them could classify
as general purpose.
• Much of the recent work with SIMD architectures
is proprietary.
– Often the fact that a parallel computer is SIMD is not
mentioned by company building them.
68
A Company Building Inexpensive
SIMD
• ClearSpeed is producing a COTS (commodity
off the shelf) SIMD Board
– WorldScape is developing defense and commercial
applications for this computer.
• Not a traditional SIMD as the hardware doesn’t
synchronize every step.
– Hardware design supports efficient synchronization
• This machine is programmed like a SIMD.
• The U.S. Navy has observed that these machines
process radar an order of magnitude faster than others.
• There is quite a bit of information about this at
www.wscape.com and www.clearspeed.com
69
An Example of a Hybrid SIMD
• Embedded Massively Parallel Accelerators
– Systola 1024: PC add-
on board with 1024
processors

– Fuzion 150: 1536 processors


on a single chip

– Other accelerators: Decypher, Biocellerator,


GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2,
BioScan
(This and next three slides are due to Prabhakar R. Gudla (U
of Maryland) at a CMSC 838T Presentation, 4/23/2003.) 70
Hybrid Architecture
Systola Systola Systola Systola Systola Systola Systola Systola
1024 1024 1024 1024 1024 1024 1024 1024

High speed Myrinet switch

Systola Systola Systola Systola Systola Systola Systola Systola


1024 1024 1024 1024 1024 1024 1024 1024

– combines the SIMD and MIMD paradigms within a parallel
architecture → Hybrid Computer
71
Architecture of Systola
1024
• Instruction Systolic
Array:
– 32 × 32 mesh of
processing elements
– wavefront instruction
execution
(Figure: the ISA mesh with RAM NORTH and RAM WEST banks,
program memory, controller, interface processors, and the
host computer bus.) 72
SIMDs Embedded in SISDs
• Intel's Pentium 4 included what they call MMX
technology to gain a significant performance
boost
• IBM and Motorola incorporated the technology
into their G4 PowerPC chip in what they call
their Velocity Engine.
• Both MMX technology and the Velocity Engine
are the chip manufacturer's name for their
proprietary SIMD processors and parallel
extensions to their operating code.
• This same approach is used by NVidia and
Evans & Sutherland to dramatically accelerate
graphics rendering.
73
Special Purpose SIMDs in the
Bioinformatics Arena
• Paracel
– Acquired by Celera Genomics in 2000
– Products include the sequence
supercomputer GeneMatcher, which has a
high throughput sequence analysis capability
• Supports over a million processors
– GeneMatcher was used by Celera in their
race with U.S. government to complete the
description of the human genome sequencing
• TimeLogic, Inc
– Has DeCypher, a reconfigurable SIMD
74
Advantages of SIMDs
• Reference: [Roosta, pg 10]
• Less hardware than MIMDs as they have only
one control unit.
– Control units are complex.
• Less memory needed than MIMD
– Only one copy of the instructions need to be stored
– Allows more data to be stored in memory.
• Less startup time in communicating between
PEs.

75
Advantages of SIMDs (cont)
• Single instruction stream and synchronization of
PEs make SIMD applications easier to program,
understand, & debug.
– Similar to sequential programming
• Control flow operations and scalar operations
can be executed on the control unit while PEs
are executing other instructions.
• MIMD architectures require explicit
synchronization primitives, which create a
substantial amount of additional overhead.
76
Advantages of SIMDs (cont)
• During a communication operation between
PEs,
– PEs send data to a neighboring PE in parallel and in
lock step
– No need to create a header with routing information
as “routing” is determined by program steps.
– the entire communication operation is executed
synchronously
– SIMDs are deterministic & have much more
predictable running time.
• Can normally compute a tight (worst case) upper bound for
the time for communications operations.
• Less complex hardware in SIMD since no
message decoder is needed in the PEs
– MIMDs need a message decoder in each PE.
77
SIMD Shortcomings
(with some rebuttals)

• Claims are from our textbook [i.e., Quinn 2004].


– Similar statements are found in [Grama, et. al].
• Claim 1: Not all problems are data-parallel
– While true, most problems seem to have a
data parallel solution.
– In [Fox, et.al.], the observation was made in
their study of large parallel applications at
national labs, that most were data parallel by
nature, but often had points where significant
branching occurred.
78
SIMD Shortcomings
(with some rebuttals)
• Claim 2: Speed drops for conditionally executed
branches
– MIMDs processors can execute multiple branches
concurrently.
– For an if-then-else statement with execution times for
the “then” and “else” parts being roughly equal, about
½ of the SIMD processors are idle during its execution
• With additional branching, the average number of
inactive processors can become even higher.
• With SIMDs, only one of these branches can be
executed at a time.
• This reason justifies the study of multiple SIMDs (or
MSIMDs).
79
SIMD Shortcomings
(with some rebuttals)

• Claim 2 (cont): Speed drops for


conditionally executed code
– In [Fox, et.al.], the observation was made that
for the real applications surveyed, the
MAXIMUM number of active branches at any
point in time was about 8.
– The cost of the extremely simple processors
used in a SIMD is extremely low
• Programmers used to worry about ‘full utilization of
memory’ but stopped this after memory cost
became insignificant overall.
80
SIMD Shortcomings
(with some rebuttals)

• Claim 3: Don’t adapt to multiple users well.


– This is true to some degree for all parallel computers.
– If usage of a parallel processor is dedicated to an
important problem, it is probably best not to risk
compromising its performance by ‘sharing’
– This reason also justifies the study of multiple SIMDs
(or MSIMD).
– SIMD architecture has not received the attention that
MIMD has received and can greatly benefit from
further research.

81
SIMD Shortcomings
(with some rebuttals)

• Claim 4: Do not scale down well to


“starter” systems that are affordable.
– This point is arguable and its ‘truth’ is likely to
vary rapidly over time
– ClearSpeed currently sells a very economical
SIMD board that plugs into a PC.

82
SIMD Shortcomings
(with some rebuttals)
Claim 5: Requires customized VLSI for processors
and expense of control units in PCs has dropped.
• Reliance on COTS (commodity, off-the-shelf parts)
has dropped the price of MIMDs
• Expense of PCs (with control units) has dropped
significantly
• However, reliance on COTS has fueled the success
of ‘low level parallelism’ provided by clusters and
restricted new innovative parallel architecture
research for well over a decade.
83
SIMD Shortcomings
(with some rebuttals)
Claim 5 (cont.)
• There is strong evidence that the period of
continual dramatic increases in speed of PCs
and clusters is ending.
• Continued rapid increases in parallel
performance in the future will be necessary in
order to solve important problems that are
beyond our current capabilities
• Additionally, with the appearance of the very
economical COTS SIMDs, this claim no longer
appears to be relevant.
84
Multiprocessors
• Multiprocessor: multiple-CPU computer
with a shared memory
• Same address on two different CPUs
refers to the same memory location
• Avoids three cited criticisms for SIMDs
– Can be built from commodity CPUs
– Naturally support multiple users
– Maintain efficiency in conditional code

85
Centralized Multiprocessor

86
Centralized Multiprocessor
• Straightforward extension of uniprocessor
• Add CPUs to bus
• All processors share same primary memory
• Memory access time same for all CPUs
– Uniform memory access (UMA)
multiprocessor
• Also called a symmetrical multiprocessor (SMP)

87
Private and Shared Data
• Private data: items used only by a single
processor
• Shared data: values used by multiple
processors
• In a centralized multiprocessor (i.e. SMP),
processors communicate via shared data
values

88
Problems Associated with Shared Data

The cache coherence problem


• Replicating data across multiple caches reduces
contention among processors for shared data
values.
• But - how can we ensure different processors
have the same value for same address?
• The cache coherence problem arises when an
obsolete value is still stored in a processor’s
cache.

89
Write Invalidate Protocol
Most common solution to cache coherency
1. Each CPU’s cache controller monitors
(snoops) the bus & identifies which cache
blocks are requested by other CPUs.
2. A PE gains exclusive control of data item
before performing “write”.
3. Before “write” occurs, all other copies of data
item cached by other PEs are invalidated.
4. When any other CPU tries to read a memory
location from an invalidated cache block,
– a cache miss occurs
– It has to retrieve updated data from memory
90
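A toy sketch of those four steps (the class and method names are mine, and the model is write-through for simplicity, so it only illustrates snooping invalidation, not a full protocol):

```python
# Toy sketch of snooping write-invalidate: a write invalidates every other
# cached copy, so a later read on another CPU misses and refetches.

class Cache:
    def __init__(self, bus):
        self.bus, self.data = bus, {}
        bus.append(self)

    def read(self, memory, addr):
        if addr not in self.data:             # cache miss: fetch from memory
            self.data[addr] = memory[addr]
        return self.data[addr]

    def write(self, memory, addr, value):
        for other in self.bus:                # snoop: invalidate all other copies
            if other is not self:
                other.data.pop(addr, None)
        self.data[addr] = value
        memory[addr] = value                  # write-through, to keep the sketch short

bus, memory = [], {"X": 7}
a, b = Cache(bus), Cache(bus)
print(a.read(memory, "X"), b.read(memory, "X"))   # 7 7  (both caches hold X)
a.write(memory, "X", 2)                           # B's copy is invalidated
print(b.read(memory, "X"))                        # 2  (B misses and re-reads)
```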
Cache-coherence Problem

Memory

X 7

Cache Cache

CPU A CPU B
91
Cache-coherence Problem

Memory Read from


memory is not a
X 7 problem.

CPU A CPU B
92
Cache-coherence Problem

Memory

X 7

7
7

CPU A CPU B
93
Cache-coherence Problem
Write to main
Memory memory is a
problem.
X 2

2
7

CPU A CPU B
94
Write Invalidate Protocol
A cache control monitor
snoops the bus to see
X 7 which cache block is
being requested by
other processors.

7
7

CPU A CPU B
95
Write Invalidate Protocol

X 7

Intent to write X

7
7 Before a write
can occur, all
copies of data at
CPU A CPU B that address are
declared invalid.
96
Write Invalidate Protocol

X 7

Intent to write X

CPU A CPU B
97
Write Invalidate Protocol
When another
processor tries to
read from this
X 2 location in cache,
it receives a cache
miss error and will
have to refresh
from main
2 memory.

CPU A CPU B
98
Synchronization Required for
Shared Data
• Mutual exclusion
– Definition: At most one process can be engaged in an
activity at any time.
– Example: Only one processor can write to the same
address in main memory at the same time.
– We say that process must mutually exclude all others
while it performs this write.
• Barrier synchronization
– Definition: Guarantees that no process will proceed
beyond a designated point (called the barrier) until
every process reaches that point.
99
Distributed Multiprocessor

• Distributes primary memory among processors


• Increase aggregate memory bandwidth and
lower average memory access time
• Allows greater number of processors
• Also called non-uniform memory access (NUMA)
multiprocessor
– Local memory access time is fast
– Non-local memory access time can vary
– Distributed memories have one logical address space

100
Distributed Multiprocessors

101
Cache Coherence
• Some NUMA multiprocessors do not
support it in hardware
– Only instructions and private data are stored
in cache
– Policy creates a large memory access time
variance
• Implementation more difficult
– No shared memory bus to “snoop”
– Directory-based protocol needed

102
Directory-based Protocol
• Distributed directory contains information
about cacheable memory blocks
• One directory entry for each cache block
• Each entry has
– Sharing status
– Which processors have copies

103
Sharing Status
• Uncached -- (denoted by “U”)
– Block not in any processor’s cache
• Shared – (denoted by “S”)
– Cached by one or more processors
– Read only
• Exclusive – (denoted by “E”)
– Cached by exactly one processor
– Processor has written to block
– Copy in memory is obsolete

104
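A small sketch of a directory entry using the same U/S/E states and presence bit vector notation that the walkthrough on the following slides uses ("X U000", "X S100", ...); the class is an illustration, not the textbook's implementation.

```python
# Sketch of one directory entry: a sharing state (U/S/E) plus a presence
# bit vector with one bit per CPU, printed in the walkthrough's notation.

class DirectoryEntry:
    def __init__(self, num_cpus):
        self.state = "U"                       # U = uncached, S = shared, E = exclusive
        self.sharers = [0] * num_cpus

    def read(self, cpu):
        self.state = "S"                       # (a real protocol would first write back
        self.sharers[cpu] = 1                  #  an exclusive copy held elsewhere)

    def write(self, cpu):
        self.state = "E"                       # the writer becomes the only sharer
        self.sharers = [int(i == cpu) for i in range(len(self.sharers))]

    def __str__(self):
        return self.state + "".join(map(str, self.sharers))

x = DirectoryEntry(3)
print(x)              # U000
x.read(0); print(x)   # S100   (CPU 0 reads X)
x.write(0); print(x)  # E100   (CPU 0 writes X)
```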
Directory-based Protocol - step1
Interconnection Network

Directory Directory Directory

Local Memory Local Memory Local Memory

Cache Cache Cache

CPU 0 CPU 1 CPU 2 105


X has value 7 – step 2
Interconnection Network
Bit Vector

Directories X U000

Memories X 7

Caches

CPU 0 CPU 1 CPU 2 106


CPU 0 Reads X – step 3
Interconnection Network
Read Miss

Directories X U000

Memories X 7

Caches

CPU 0 CPU 1 CPU 2 107


CPU 0 Reads X –step 4
Interconnection Network

Directories X S100

Memories X 7

Caches

CPU 0 CPU 1 CPU 2 108


CPU 0 Reads X –step 5
Interconnection Network

Directories X S100

Memories X 7

Caches X 7

CPU 0 CPU 1 CPU 2 109


CPU 2 Reads X – step 6
Interconnection Network

Directories X S100

Memories X 7
Read Miss

Caches X 7

CPU 0 CPU 1 CPU 2 110


CPU 2 Reads X – step 7
Interconnection Network

Directories X S101

Memories X 7

Caches X 7

CPU 0 CPU 1 CPU 2 111


CPU 2 Reads X – step 8
Interconnection Network

Directories X S101

Memories X 7

Caches X 7 X 7

CPU 0 CPU 1 CPU 2 112


CPU 0 Writes 6 to X – step 9
Interconnection Network
Write Miss
Directories X S101

Memories X 7

Caches X 7 X 7

CPU 0 CPU 1 CPU 2 113


CPU 0 Writes 6 to X – step 10
Interconnection Network

Directories X S101

Memories Invalidate X 7

Caches X 7 X 7

CPU 0 CPU 1 CPU 2 114


CPU 0 Writes 6 to X – step 11
Interconnection Network

Directories X E100

Memories X 7

Caches X 6

CPU 0 CPU 1 CPU 2 115


CPU 1 Reads X – step 12
Interconnection Network
Read Miss

Directories X E100

Memories X 7

Caches X 6

CPU 0 CPU 1 CPU 2 116


CPU 1 Reads X – step 13
Interconnection Network
Switch to Shared

Directories X E100

Memories X 7

Caches X 6

CPU 0 CPU 1 CPU 2 117


CPU 1 Reads X – step 14
Interconnection Network

Directories X E100

Memories X 6

Caches X 6

CPU 0 CPU 1 CPU 2 118


CPU 1 Reads X – step 15
Interconnection Network

Directories X S110

Memories X 6

Caches X 6 X 6

CPU 0 CPU 1 CPU 2 119


CPU 2 Writes 5 to X – step 16
Interconnection Network

Directories X S110

Memories Write Miss X 6

Caches X 6 X 6

CPU 0 CPU 1 CPU 2 120


CPU 2 Writes 5 to X - step 17
Interconnection Network
Invalidate

Directories X S110

Memories X 6

Caches X 6 X 6

CPU 0 CPU 1 CPU 2 121


CPU 2 Writes 5 to X –step 18
Interconnection Network

Directories X E001

Memories X 6

Caches X 5

CPU 0 CPU 1 CPU 2 122


CPU 0 Writes 4 to X – step 19
Interconnection Network
Write Miss

Directories X E001

Memories X 6

Caches X 5

CPU 0 CPU 1 CPU 2 123


CPU 0 Writes 4 to X – step 20
Interconnection Network

Directories X E100

Memories Take Away X 6

Caches X 5

CPU 0 CPU 1 CPU 2 124


CPU 0 Writes 4 to X – step 21
Interconnection Network

Directories X E010

Memories X 5

Caches X 5

CPU 0 CPU 1 CPU 2 125


CPU 0 Writes 4 to X – step 22
Interconnection Network

Directories X E100

Memories X 5

Caches

CPU 0 CPU 1 CPU 2 126


CPU 0 Writes 4 to X – step 23
Interconnection Network

Directories X E100

Creates
Memories
cache block
X 5
storage for X

Caches X 5

CPU 0 CPU 1 CPU 2 127


CPU 0 Writes 4 to X – step 24
Interconnection Network

Directories X E100

Memories X 5

Caches X 4

CPU 0 CPU 1 CPU 2 128


CPU 0 Writes Back X Block – step
25
Interconnection Network
Data Write Back
Directories X E100

Memories X 45

Caches X 4

CPU 0 CPU 1 CPU 2 129


CPU 0 flushes cache block X
step 26
Interconnection Network

Directories X U000

Memories X 4

Caches

CPU 0 CPU 1 CPU 2 130


Characteristics of Multiprocessors
• Interprocessor communication is done in the
memory interface by read and write instructions
• Memory may be physically distributed and the
reads and writes from different processors may
take different time.
• Congestion and hotspots in the interconnection
network may occur.
• Memory latency (i.e., time to complete a read or
write) may be long and variable.
• Most messages through the bus or
interconnection network are the size of single
memory words.
• Randomization of requests may be used to
reduce the probability of collisions.
131
Multicomputers
• Distributed memory multiple-CPU
computer
• Same address on different processors
refers to different physical memory
locations
• Processors interact through message
passing

132
Typically, Two Flavors of
Multicomputers
• Commercial multicomputers
– Custom switch network
– Low latency (the time it takes to send a message).
– High bandwidth (data path width) across processors
• Commodity clusters
– Mass produced computers, switches and other
equipment
– Use low cost components
– Message latency is higher
– Communications bandwidth is lower
133
Multicomputer Communication
• Processors are connected by an interconnection
network
• Each processor has a local memory and can
only access its own local memory
• Data is passed between processors using
messages, as required by the program
• Data movement across the network is also
asynchronous
– A common approach is to use MPI to handle
message passing
134
Multicomputer Communications (cont)

• Multicomputers can be scaled to larger


sizes much easier than multiprocessors.
• The amount of data transmission
between processors has a huge impact
on the performance
– The distribution of the data among the
processors is a very important factor in
the performance efficiency.

135
Message-Passing Advantages
• No problem with simultaneous access
to data.
• Allows different PCs to operate on the
same data independently.
• Allows PCs on a network to be easily
upgraded when faster processors
become available.
136
Disadvantages of
Message-Passing
• Programmers must make explicit
message-passing calls in the code
• This is low-level programming and is error
prone.
• Data is not shared but copied, which
increases the total data size.
• Data Integrity
– Difficulty in maintaining correctness of multiple
copies of data item.
137
Some Interconnection Network
Terminology (1/2)
References: Wilkinson, et. al. & Grama, et. al.
Also, earlier slides on architecture & networks.
A link is the connection between two nodes.
• A switch that enables packets to be routed
through the node to other nodes without
disturbing the processor is assumed.
• The link between two nodes can be either
bidirectional or use two directional links .
• Can assume either one wire that carries one bit
or parallel wires (one wire for each bit in word).
• The above choices do not have a major impact
on the concepts presented in this course.
138
Network Terminology (2/2)
• The bandwidth is the number of bits that can be
transmitted in unit time (i.e., bits per second).
• The network latency is the time required to
transfer a message through the network.
– The communication latency is the total time
required to send a message, including
software overhead and interface delay.
– The message latency or startup time is the
time required to send a zero-length message.
• Includes software & hardware overhead, such as
– Choosing a route
– packing and unpacking the message
139
Circuit Switching Message Passing
• Technique establishes a path and allows the
entire message to transfer uninterrupted.
• Similar to telephone connection that is held until
the end of the call.
• Links used are not available to other messages
until the transfer is complete.
• Latency (message transfer time): If the length of
control packet sent to establish path is small wrt
(with respect to) the message length, the latency
is essentially
– the constant L/B, where L is message length
and B is bandwidth. 140
Store-and-forward Packet
Switching
• Message is divided into “packets” of
information
• Each packet includes source and
destination addresses.
• Packets can not exceed a fixed, maximum
size (e.g., 1000 byte).
• A packet is stored in a node in a buffer
until it can move to the next node.
141
Packet Switching (cont)
• At each node, the destination information is
looked at and used to select which node to
forward the packet to.
• Routing algorithms (often probabilistic) are used
to avoid hot spots and to minimize traffic jams.
• Significant latency is created by storing each
packet in each node it reaches.
• Latency increases linearly with the length of the
route.

142
Virtual Cut-Through Packet
Switching

• Used to reduce the latency.


• Allows packet to pass through a node
without being stored, if the outgoing link is
available.
• If the complete path is available, a message
can immediately move from source to
destination.

143
Wormhole Routing
• Alternate to store-and-forward packet
routing
• A message is divided into small units
called flits (flow control units).
• Flits are 1-2 bytes in size.
• Can be transferred in parallel on links with
multiple wires.
• Only the head flit is initially transferred,
when the next link becomes available.
144
Wormhole Routing (cont)
• As each flit moves forward, the next flit can
move forward.
• The entire path must be reserved for a message
as these packets pull each other along (like cars
of a train).
• Request/acknowledge bit messages are
required to coordinate these pull-along moves.
– See Wilkinson, et. al.
• Latency: If the head flit is very small
compared to the length of the message, then the
latency is essentially the constant L/B, with L the
message length and B the link bandwidth. 145
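A tiny sketch comparing the latency models stated in these slides, with assumed numbers for L, B, and the hop count:

```python
# Sketch comparing the latency models from these slides for an assumed
# message of L bits sent over links of bandwidth B across `hops` links.
L = 1_000_000           # message length in bits (assumed)
B = 1_000_000_000       # link bandwidth in bits per second (assumed)
hops = 10

store_and_forward = hops * (L / B)   # whole message buffered at every node
wormhole_or_circuit = L / B          # header/setup cost ignored (small wrt L)

print(f"store-and-forward: {store_and_forward * 1e3:.1f} ms")    # 10.0 ms
print(f"wormhole/circuit:  {wormhole_or_circuit * 1e3:.1f} ms")  # 1.0 ms
```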
Deadlock
• Routing algorithms needed to find a path
between the nodes.
• Adaptive routing algorithms choose different
paths, depending on traffic conditions.
• Livelock is a deadlock-type situation where a
packet continues to go around the network,
without ever reaching its destination.
• Deadlock: No packet can be forwarded because
they are blocked by other stored packets waiting
to be forwarded.

146
Asymmetric Multicomputers
• Has a front-end that interacts with users
and I/O devices.
– Processors in back end are used for
computation.
– Programming similar to SIMDs (i.e., processor
arrays)
– Common with early multicomputers
• Examples of asymmetrical multicomputers
given in textbook.

147
Asymmetrical MC Advantages
• Back-end processors
dedicated to parallel
computations →
easier to understand,
model, and tune
performance
• Only a simple back-
end operating system
needed → easy for a
vendor to create
148
Asymmetrical MC Disadvantages
• Front-end computer is a
single point of failure
• Single front-end computer
limits scalability of system
• Primitive operating
system in back-end
processors makes
debugging difficult
• Every application requires
development of both
front-end and back-end
program

149
Symmetric Multicomputers
• Every computer executes the same operating
system and has identical functionality.
• Users may log into any computer to edit or
compile their programs.
• Any or all computers may be involved in the
execution of their program.
• During execution of programs, every PE
executes the same program.
• When only one PE should execute an operation,
an “if” statement is used to select the PE.
150
Symmetric Multicomputers

151
Symmetrical MC Advantages
• Alleviate performance
bottleneck caused by
single front-end computer
• Better support for
debugging
• Every processor executes
same program

152
Symmetrical MC Disadvantages

• More difficult to maintain


illusion of single “parallel
computer”
• No simple way to balance
program development
workload among
processors
• More difficult to achieve
high performance when
multiple processes run on
each processor
– Details on next slide
153
Symmetric MC Disadvantages (cont)
• (cont.) More difficult to achieve high
performance when multiple processes run
on each processor
– Processes on same processor compete for
same resources
• CPU Cycles
• Cache space
• Memory bandwidth
– Increased cache misses
• Cache is “PE oriented” instead of “process
oriented”

154
Best Model for Commodity Cluster
• Full-Fledged operating system (e.g.,
Linux) desirable
– Feature of symmetric multicomputer
• Desirable to increase cache hits
– Favors having only a single user process on each PE
– Favors most nodes being off-limits for program
development
• Need fast network
– Keep program development users off networks and
have them access front-end by another path.
– Reserve interconnection network to usage by parallel
processes
• Overall, a mixed model may be best for
commodity clusters 155
Ideal Commodity Cluster Features
• Co-located computers
• Computers dedicated to running a single
process at a time to lower cache misses
• Accessible only from the network
– No keyboards or displays
• Users access front-end by another route.
• Identical operating systems
• Identical local disk images
• Administered as an entity
156
ParPar Cluster, A Mixed Model
• Mixed model
• Incorporates both asymmetrical and
symmetrical designs.

157
Network of Workstations
• Dispersed computers
– Typically located on user’s desks
• First priority: response time for person at
the keyboard
• Parallel jobs wait in the background and run
when spare CPU cycles are available.
• Different operating systems
• Different local images
• Checkpointing and restarting important
• Typically connected by ethernet
– Too slow for commodity network usage 158
A Commodity Cluster vs
Network of Workstations
• A commodity cluster contains components of
local area networks
– Commodity computers
– Switches
• A network of workstations is a dispersed
collection of computers
– Distributed heterogeneous computers
– Located on primary users’ desks
– Unused cycles available for parallel use
– Example: SETI project
159
Summary
• Commercial parallel computers appeared
in 1980s
• Multiple-CPU computers now dominate
• Small-scale: Centralized multiprocessors
• Large-scale: Distributed memory
architectures
– Multiprocessors
– Multicomputers
160
