Parallel Processing Chapter - 2
January 2013
Outline
1 Introduction to subject
2 Parallel Programming Platforms
  Implicit Parallelism: Trends in Microprocessor Architectures
  Limitations of Memory System Performance
  Dichotomy of Parallel Computing Platforms
  Communication Model of Parallel Platforms
  Physical Organization of Parallel Platforms
  Communication Costs in Parallel Machines
  Messaging Cost Models and Routing Mechanisms
  Mapping Techniques
Introduction to subject
Evaluation Scheme:
Reference Books
Text Books:
Reference Books:
Topic Overview
Scope of Parallelism
Implicit Parallelism: Trends in Microprocessor Architectures
Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
Higher levels of device integration have made available a large number of transistors.
The question of how best to utilize these resources is an important one.
Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle.
The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.
Superscalar Execution
Scheduling of instructions is determined by a number of factors:
True Data Dependency: The result of one operation is an input to the next.
Resource Dependency: Two operations require the same resource.
Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
The complexity of this hardware is an important constraint on superscalar processors.
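As an illustration (not taken from the slides), the short C fragment below shows the three kinds of dependencies a superscalar scheduler must respect; the variable names are made up for this example.

    /* Hypothetical fragment illustrating superscalar scheduling constraints. */
    void dependencies(int a, int b, int c, int *out)
    {
        int t1 = a + b;      /* independent of t2: may issue in the same cycle      */
        int t2 = a * c;      /* resource dependency only if both need the same unit */
        int t3 = t1 + t2;    /* true data dependency: must wait for t1 and t2       */

        if (t3 > 0)          /* branch dependency: instructions after the branch    */
            *out = t3;       /* cannot be scheduled deterministically a priori      */
        else
            *out = -t3;
    }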
Limitations of Memory System Performance
Memory system, and not processor speed, is often the bottleneck for many applications.
Memory system performance is largely captured by two parameters, latency and bandwidth.
Latency is the time from the issue of a memory request to the time the data is available at the processor.
Bandwidth is the rate at which data can be pumped to the processor by the memory system.
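To make the latency parameter concrete, here is a hedged sketch of the dot-product kernel used in this chapter's worked examples; the comment uses illustrative numbers in the spirit of those examples (a processor doing one multiply-add per 1 ns cycle and a memory with roughly 100 ns access latency), not measurements.

    /* Dot product of two float vectors. On a latency-bound machine, every
     * element access that must go to DRAM stalls for the full access latency. */
    float dot_product(const float *a, const float *b, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];   /* one multiply-add per pair of fetched words */
        return sum;
    }
    /* Assumed numbers: with ~100 ns per uncached access, each multiply-add
     * waits on two fetches, so the achieved rate is far below the
     * processor's arithmetic peak. */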
Caches are small and fast memory elements between the processor and DRAM.
This memory acts as a low-latency, high-bandwidth storage.
If a piece of data is repeatedly used, the effective latency of this memory system can be reduced by the cache.
The fraction of data references satisfied by the cache is called the cache hit ratio of the computation on the system.
The cache hit ratio achieved by a code on a memory system often determines its performance.
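The effect of the hit ratio on average access time can be written down directly. The helper below is a standard expression, not something stated on the slides; the parameter names are illustrative.

    /* Average memory access time as a function of the cache hit ratio h.
     * t_cache and t_dram are the cache and DRAM latencies (e.g., in ns). */
    double avg_access_time(double h, double t_cache, double t_dram)
    {
        return h * t_cache + (1.0 - h) * t_dram;
    }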
Consider the architecture from the previous example. In this case, we introduce a cache of size 32 KB with a latency of 1 ns or one cycle. We use this setup to multiply two matrices A and B of dimensions 32 × 32. We have carefully chosen these numbers so that the cache is large enough to store matrices A and B, as well as the result matrix C.
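A plain triple-loop version of this multiplication is sketched below. Assuming single-precision elements for illustration, the three 32 × 32 matrices occupy about 12 KB and therefore fit comfortably in the 32 KB cache, so only the initial fetches pay the DRAM latency.

    #define N 32

    /* Naive matrix multiply C = A * B for N x N matrices. After the first
     * pass over A and B, all three matrices reside in cache, so the
     * remaining accesses are served at cache latency. */
    void matmul(const float A[N][N], const float B[N][N], float C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                float sum = 0.0f;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }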
Impact of Caches
Consider the same setup as before, except in this case, the block size is 4 words instead of 1 word. We repeat the dot-product computation in this scenario:
Assuming that the vectors are laid out linearly in memory, eight FLOPs (four multiply-adds) can be performed in 200 cycles.
This is because a single memory access fetches four consecutive words in the vector.
Therefore, two accesses can fetch four elements of each of the vectors. This corresponds to a FLOP every 25 ns, for a peak speed of 40 MFLOPS.
It is important to note that increasing block size does not change the latency of the system.
Physically, the scenario illustrated here can be viewed as a wide data bus (4 words or 128 bits) connected to multiple memory banks.
In practice, such wide buses are expensive to construct.
In a more practical system, consecutive words are sent on the memory bus on subsequent bus cycles after the first word is retrieved.
The series of examples presented in this section illustrate the following concepts:
Exploiting spatial and temporal locality in applications is critical for amortizing memory latency and increasing effective memory bandwidth.
The ratio of the number of operations to the number of memory accesses is a good indicator of anticipated tolerance to memory bandwidth.
Memory layouts and organizing computation appropriately can make a significant impact on the spatial and temporal locality.
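As a hedged illustration of the layout point (not from the slides), the two loops below touch exactly the same data but differ sharply in spatial locality for a row-major C array.

    #define N 1024

    /* Row-major traversal: consecutive iterations touch consecutive
     * addresses, so each fetched cache line is fully used. */
    double sum_row_major(const double a[N][N])
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-major traversal of the same array: consecutive iterations are
     * N doubles apart, so almost every access misses in cache. */
    double sum_col_major(const double a[N][N])
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }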
In the code, the first instance of this function accesses a pair of vector
elements and waits for them.
In the meantime, the second instance of this function can access two
other vector elements in the next cycle, and so on.
After l units of time, where l is the latency of the memory system, the
first function instance gets the requested data from memory and can
perform the required computation.
In the next cycle, the data items for the next function instance arrive, and so on. In this way, in every clock cycle, we can perform a
computation.
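The slides refer to "the code" for this example, which is not reproduced here. The following is therefore only a hedged, pthread-based sketch of the idea: each thread runs one instance of the per-row computation, so while one instance is stalled on memory another can be issuing its own accesses.

    #include <pthread.h>

    #define N        1024
    #define NTHREADS 8

    static float a[N][N], b[N], c[N];

    /* Each thread computes a block of rows of c = A * b. While one thread
     * waits on a memory access, another can proceed, hiding latency. */
    static void *row_block(void *arg)
    {
        long t = (long)arg;
        for (int i = t * (N / NTHREADS); i < (t + 1) * (N / NTHREADS); i++) {
            float sum = 0.0f;
            for (int j = 0; j < N; j++)
                sum += a[i][j] * b[j];
            c[i] = sum;
        }
        return NULL;
    }

    void matvec_threaded(void)
    {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, row_block, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
    }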
Dichotomy of Parallel Computing Platforms
Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.
If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).
If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).
SIMD Processors
Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines.
Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc.
SIMD relies on the regular structure of computations (such as those in image processing).
It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an activity mask, which determines if a processor should participate in a computation or not.
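A minimal sketch of the activity-mask idea, written here as scalar C that mimics what each SIMD lane would do; the names and masking style are illustrative and do not correspond to any specific SIMD instruction set.

    #define N 16

    /* Conditional update executed SIMD-style: every "lane" evaluates the
     * work for both branches, but the activity mask decides whether the
     * result is actually stored for that data item. */
    void masked_update(const int a[N], int b[N])
    {
        for (int i = 0; i < N; i++) {        /* i plays the role of a lane  */
            int mask = (a[i] > 0);           /* activity mask for this lane */
            int then_val = b[i] + a[i];      /* "if" branch result          */
            int else_val = b[i] - a[i];      /* "else" branch result        */
            b[i] = mask ? then_val : else_val;
        }
    }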
MIMD Processors
SIMD-MIMD Comparison
Communication Model of Parallel Platforms
There are two primary forms of data exchange between parallel tasks: accessing a shared data space and exchanging messages.
Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.
Platforms that support messaging are also called message passing platforms or multicomputers.
Shared-Address-Space Platforms
Message-Passing Platforms
These platforms comprise a set of processors, each with its own (exclusive) memory.
Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers.
These platforms are programmed using (variants of) send and receive primitives.
Libraries such as MPI and PVM provide such primitives.
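A minimal MPI send/receive sketch using the standard MPI_Send and MPI_Recv primitives; the message contents and tag are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Process 0 sends one integer to process 1. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Process 1 receives it. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Process 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }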
Physical Organization of Parallel Platforms
We begin this discussion with an ideal parallel machine called the Parallel Random Access Machine, or PRAM.
A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM.
PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to all processors.
Processors share a common clock but may execute different instructions in each cycle.
Interconnection Networks
Network Topologies
Figure: Bus-based interconnects (a) with no local caches; (b) with local memory/caches.
Figure: A complete omega network connecting eight inputs and eight outputs.
In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.
A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west.
A further generalization to d dimensions has nodes with 2d neighbors.
A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.
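Hypercube neighbors are easy to enumerate, because two nodes are adjacent exactly when their labels differ in one bit. The small helper below is an illustrative sketch, not part of the slides.

    #include <stdio.h>

    /* Print the labels of all d neighbors of node `node` in a d-dimensional
     * hypercube: flip one bit of the label at a time. */
    void hypercube_neighbors(unsigned node, unsigned d)
    {
        for (unsigned bit = 0; bit < d; bit++)
            printf("neighbor across dimension %u: %u\n", bit, node ^ (1u << bit));
    }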
Figure: Linear arrays: (a) with no wraparound links; (b) with wraparound link.
Figure: Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.
Figure: Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
Network                  | Diameter         | Bisection Width | Arc Connectivity | Cost (No. of links)
Completely-connected     | 1                | p²/4            | p - 1            | p(p - 1)/2
Star                     | 2                | 1               | 1                | p - 1
Complete binary tree     | 2 log((p + 1)/2) | 1               | 1                | p - 1
Linear array             | p - 1            | 1               | 1                | p - 1
2-D mesh, no wraparound  | 2(√p - 1)        | √p              | 2                | 2(p - √p)
2-D wraparound mesh      | 2⌊√p/2⌋          | 2√p             | 4                | 2p
Hypercube                | log p            | p/2             | log p            | (p log p)/2
Wraparound k-ary d-cube  | d⌊k/2⌋           | 2k^(d-1)        | 2d               | dp
Network        | Diameter | Bisection Width | Arc Connectivity | Cost (No. of links)
Crossbar       | 1        | p               | 1                | p²
Omega Network  | log p    | p/2             | 2                | p/2
Dynamic Tree   | 2 log p  | 1               | 2                | p - 1
If a processor just reads a value once and does not need it again, an update protocol may generate significant overhead.
If two processors make interleaved test and update operations on a variable, an update protocol is better.
Both protocols suffer from false sharing overheads (two words that are not shared but happen to lie on the same cache line).
False sharing refers to the situation in which different processors update different parts of the same cache line. Thus, although the updates are not performed on shared variables, the system does not detect this.
Most current machines use invalidate protocols.
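A hedged sketch of false sharing (illustrative code, not from the slides): two threads update distinct counters that sit on the same cache line, so the coherence protocol bounces the line between them even though no value is actually shared.

    #include <pthread.h>

    /* Two logically independent counters that will typically share a cache
     * line because they are adjacent in memory. */
    static long counters[2];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < 10000000; i++)
            counters[id]++;   /* each increment invalidates the other core's copy */
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, (void *)0);
        pthread_create(&t1, NULL, worker, (void *)1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }
    /* Padding each counter to its own cache line (e.g., with alignas(64))
     * removes the false sharing and typically speeds this loop up noticeably. */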
Once copies of data are tagged dirty, all subsequent operations can be performed locally on the cache without generating external traffic.
If a data item is read by a number of processors, it transitions to the shared state in the cache and all subsequent read operations become local.
If processors read and update data at the same time, they generate coherence requests on the bus, which is ultimately bandwidth limited.
Communication Costs in Parallel Machines
Messaging Cost Models and Routing Mechanisms
The total time to transfer a message over a network comprises the following:
Startup time (ts): Time spent at sending and receiving nodes (executing the routing algorithm, programming routers, etc.).
Per-hop time (th): This time is a function of the number of hops and includes factors such as switch latencies, network delays, etc.
Per-word transfer time (tw): This time includes all overheads that are determined by the length of the message. This includes bandwidth of links, error checking and correction, etc.
Store-and-Forward Routing
A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop.
The total communication cost for a message of size m words to traverse l communication links is
tcomm = ts + (m tw + th) l.    (1)
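Equation (1) translates directly into a small helper; this is a minimal sketch, with the parameters expressed in whatever units ts, th, and tw are measured in on a given platform.

    /* Store-and-forward cost (Equation 1): t_comm = t_s + (m*t_w + t_h) * l,
     * for a message of m words traversing l links. */
    double cost_store_and_forward(double ts, double th, double tw, int m, int l)
    {
        return ts + (m * tw + th) * l;
    }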
Routing Techniques
Packet Routing
Store-and-forward makes poor use of communication resources.
Packet routing breaks messages into packets and pipelines them through the network.
Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information.
The total communication time for packet routing is approximated by
tcomm = ts + th l + tw m,
where
tw = tw1 + tw2 (1 + s/r).
The factor tw accounts for overheads in packet headers.
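Under these definitions, the packet-routing estimate can be coded the same way. This is a sketch under the assumption (standard for this model) that s denotes the per-packet header size and r the packet size.

    /* Packet-routing cost: t_comm = t_s + t_h*l + t_w*m, with the effective
     * per-word time t_w = t_w1 + t_w2*(1 + s/r), where s is the header size
     * and r the packet size (in words). */
    double cost_packet_routing(double ts, double th, double tw1, double tw2,
                               int m, int l, double s, double r)
    {
        double tw = tw1 + tw2 * (1.0 + s / r);
        return ts + th * l + tw * m;
    }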
Cut-Through Routing
Mapping Techniques
G(0, 1) = 0
G(1, 1) = 1
G(i, x + 1) = G(i, x),                        if i < 2^x
G(i, x + 1) = 2^x + G(2^(x+1) - 1 - i, x),    if i >= 2^x
The function G is called the binary reflected Gray code (RGC).
Since adjoining entries (G(i, d) and G(i + 1, d)) differ from each other at only one bit position, corresponding processors are mapped to neighbors in a hypercube. Therefore, the congestion, dilation, and expansion of the mapping are all 1.
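A short recursive sketch of this construction (illustrative, following the recurrence above):

    #include <stdio.h>

    /* Binary reflected Gray code: G(i, x) for 0 <= i < 2^x, following
     * G(i, x+1) = G(i, x) if i < 2^x, else 2^x + G(2^(x+1) - 1 - i, x). */
    unsigned G(unsigned i, unsigned x)
    {
        if (x == 1)
            return i;                     /* G(0,1) = 0, G(1,1) = 1 */
        unsigned half = 1u << (x - 1);
        if (i < half)
            return G(i, x - 1);
        return half + G((half << 1) - 1 - i, x - 1);
    }

    int main(void)
    {
        /* Three-bit code: successive values differ in exactly one bit, so
         * ring node i maps to hypercube node G(i, 3). */
        for (unsigned i = 0; i < 8; i++)
            printf("G(%u, 3) = %u\n", i, G(i, 3));
        return 0;
    }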
Figure: (a) A three-bit reflected Gray code ring; and (b) its embedding into a three-dimensional hypercube.
Figure: (a) A 4 × 4 mesh illustrating the mapping of mesh nodes to the nodes in a four-dimensional hypercube; and (b) a 2 × 4 mesh embedded into a three-dimensional hypercube.
Since a mesh has more edges than a linear array, we will not have an optimal congestion/dilation mapping.
We first examine the mapping of a linear array into a mesh and then invert this mapping.
This gives us an optimal mapping (in terms of congestion).
Figure: (a) Embedding a 16-node linear array into a 2-D mesh; and (b) the inverse of the mapping. Solid lines correspond to links in the linear array and normal lines to links in the mesh.
Cost-Performance Tradeoffs
If the cost of the network is proportional to the number of wires, then a square p-node wraparound mesh with (log p)/4 wires per channel costs as much as a p-node hypercube with one wire per channel.
The average communication latency for a hypercube is given by
ts + th (log p)/2 + tw m
and that for a wraparound mesh of the same cost is
ts + th √p/2 + 4 tw m/(log p).
As the number of messages increases, there is contention on the network. Contention affects the mesh network more adversely than the hypercube network. Therefore, if the network is heavily loaded, the hypercube will outperform the mesh.
If the cost of a network is proportional to its bisection width, then a p-node wraparound mesh with √p/4 wires per channel has a cost equal to a p-node hypercube with one wire per channel.
The communication times for the hypercube and the mesh networks of the same cost are given by
ts + th (log p)/2 + tw m
and
ts + th √p/2 + 4 tw m/√p,
respectively.
For large enough messages, a mesh is always better than a hypercube of the same cost, provided the network is lightly loaded.
Even when the network is heavily loaded, the performance of a mesh is similar to that of a hypercube of the same cost.
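As a hedged numerical sketch of these two comparisons, the program below evaluates the formulas above; the values of ts, th, tw, p, and m are illustrative inputs, not measurements.

    #include <math.h>
    #include <stdio.h>

    /* Compare hypercube and 2-D wraparound mesh latencies when the mesh is
     * given wider channels so that both networks have equal cost.
     * The expressions follow the formulas quoted above. */
    int main(void)
    {
        double ts = 50.0, th = 1.0, tw = 0.5;   /* illustrative parameters      */
        double p = 1024.0, m = 256.0;           /* nodes, message size in words */

        double hypercube = ts + th * log2(p) / 2.0 + tw * m;

        /* Equal wire count: mesh channels are (log p)/4 words wide. */
        double mesh_wires = ts + th * sqrt(p) / 2.0 + 4.0 * tw * m / log2(p);

        /* Equal bisection width: mesh channels are sqrt(p)/4 words wide. */
        double mesh_bisection = ts + th * sqrt(p) / 2.0 + 4.0 * tw * m / sqrt(p);

        printf("hypercube:        %.1f\n", hypercube);
        printf("mesh (wire cost): %.1f\n", mesh_wires);
        printf("mesh (bisection): %.1f\n", mesh_bisection);
        return 0;
    }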