Parallel Architecture

Sathish Vadhiyar
Motivations of Parallel Computing
 Faster execution times
 From days or months to hours or seconds
 E.g., climate modelling, bioinformatics
 Large amounts of data dictate parallelism
 Parallelism more natural for certain kinds of
problems, e.g., climate modelling
 Due to computer architecture trends
 CPU speeds have saturated
 Slow memory bandwidths

2
Classification of Architectures – Flynn’s
classification
In terms of parallelism in
instruction and data stream
 Single Instruction Single Data (SISD): Serial Computers
 Single Instruction Multiple Data (SIMD)
- Vector processors and processor arrays
- Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

3
Classification of Architectures – Flynn’s
classification
 Multiple Instruction Single Data (MISD): Not popular
 Multiple Instruction Multiple Data (MIMD)
- Most popular
- IBM SP and most other supercomputers, clusters, computational Grids etc.

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

4
Classification of Architectures – Based on
Memory
 Shared memory
 2 types – UMA and NUMA
 Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q

[Figure: UMA and NUMA shared memory organizations]

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

5
Classification 2:
Shared Memory vs Message Passing
 Shared memory machine: The n processors share physical address space
 Communication can be done through this shared memory

[Figure: shared memory organization – processors connected through an interconnect to a common main memory; distributed memory organization – processor–memory pairs connected by an interconnect]

 The alternative is sometimes referred to as a message passing machine or a distributed memory machine
6
Shared Memory Machines
The shared memory could itself be
distributed among the processor nodes
 Each processor might have some portion of the
shared physical address space that is physically
close to it and therefore accessible in less time
 Terms: NUMA vs UMA architecture
 Non-Uniform Memory Access
 Uniform Memory Access

7
SHARED MEMORY AND
CACHES

8
Shared Memory Architecture: Caches

[Figure: X is 0 in memory and in both caches; P1 writes X = 1 in its own cache; P2 then reads X and gets a cache hit on its stale copy X = 0 – wrong data!]
9
Cache Coherence Problem
 If each processor in a shared memory multiprocessor machine has a data cache
 Potential data consistency problem: the cache coherence problem
 Shared variable modification, private cache
 Objective: processes shouldn’t read `stale’ data
 Solutions
 Hardware: cache coherence mechanisms

10
Cache Coherence Protocols
 Write update – propagate the updated cache line to the other processors on every write by a processor
 Write invalidate – a write invalidates copies in other caches; a processor fetches the updated cache line whenever it next reads the stale data
 Which is better?

11
Invalidation Based Cache Coherence

[Figure: X is 0 in memory and in both caches; P1 writes X = 1, and the protocol invalidates P2’s cached copy, so P2’s next read of X fetches the updated value 1 instead of the stale 0]
12
Cache Coherence using invalidate protocols

 3 states associated with data items
 Shared – a variable shared (cached, unmodified) by 2 or more caches
 Invalid – another processor (say P0) has updated the data item, so this cached copy is stale
 Dirty – state of the updated data item in P0’s cache (the only valid copy)

13
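A minimal sketch of how these three states can drive an invalidate protocol, assuming a single cache line tracked across all caches (the enum and function names are illustrative, not any particular processor’s implementation):

```cpp
#include <cstddef>
#include <vector>

// State of one cache's copy of a cache line under a simple invalidate protocol.
enum class LineState { Invalid, Shared, Dirty };

// One entry per cache (i.e. per processor) for a single cache line.
using LineStates = std::vector<LineState>;

// Processor p reads the line: a Dirty owner elsewhere supplies the data and is
// downgraded to Shared; p's own copy becomes Shared.
void on_read(LineStates& s, std::size_t p) {
    for (auto& st : s)
        if (st == LineState::Dirty) st = LineState::Shared;
    s[p] = LineState::Shared;
}

// Processor p writes the line: its copy becomes Dirty and every other cached
// copy is invalidated (the "write invalidate" action).
void on_write(LineStates& s, std::size_t p) {
    for (std::size_t i = 0; i < s.size(); ++i)
        s[i] = (i == p) ? LineState::Dirty : LineState::Invalid;
}

int main() {
    LineStates line(2, LineState::Invalid);  // one line, tracked in 2 caches
    on_read(line, 1);    // P1 reads: its copy becomes Shared
    on_write(line, 0);   // P0 writes: P0 is Dirty, P1's copy is Invalid
    on_read(line, 1);    // P1 reads again: P0 downgraded, both Shared
}
```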
Implementations of cache coherence protocols

 Snoopy
 for bus-based architectures
 shared bus interconnect where all cache
controllers monitor all bus activity
 There is only one operation through bus at a
time; cache controllers can be built to take
corrective action and enforce coherence in
caches
 Memory operations are propagated over the bus
and snooped

14
Implementations of cache coherence protocols

 Directory-based
 Instead of broadcasting memory operations to
all processors, propagate coherence operations
to relevant processors
 A central directory maintains the state of each cache block and the set of processors caching it

15
Implementation of Directory Based
Protocols
 Using presence bits for the owner processors
 Two schemes:
 Full bit vector scheme – O(MxP) storage for P processors and M cache lines
 But this much storage is usually not necessary
 Modern day processors use a sparse or tagged directory scheme
 Limited cache lines and limited presence bits
16
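A rough sketch of the full bit vector scheme described above, assuming one directory entry per memory block with a presence bit per processor (the names and the 64-processor size are illustrative):

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t P = 64;   // number of processors (assumed)

// One directory entry per memory block: which caches hold a copy, and whether
// some cache holds the only (modified) copy.
struct DirectoryEntry {
    std::bitset<P> presence;    // one presence bit per processor
    bool dirty = false;         // true if exactly one cache holds a modified copy
};

// Full bit vector directory: M blocks x P presence bits = O(M x P) storage.
using Directory = std::vector<DirectoryEntry>;

// On a write by processor p to block b, invalidations go only to the
// processors whose presence bits are set (no broadcast needed).
void on_write(Directory& dir, std::size_t b, std::size_t p) {
    DirectoryEntry& e = dir[b];
    for (std::size_t i = 0; i < P; ++i)
        if (i != p && e.presence.test(i)) { /* send invalidate to cache i */ }
    e.presence.reset();
    e.presence.set(p);
    e.dirty = true;
}

int main() {
    Directory dir(1024);        // directory for 1024 memory blocks (assumed)
    dir[7].presence.set(3);     // cache 3 currently holds block 7
    on_write(dir, 7, 5);        // processor 5 writes block 7: cache 3 invalidated
}
```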
False Sharing
 Cache coherence occurs at the granularity of cache lines – an entire cache line is invalidated
 Modern day cache lines are 64 bytes in size
 Consider a Fortran program dealing with a matrix (stored column-major)
 Assume each thread or process accesses one row of the matrix
 Elements accessed by different threads then fall in the same cache lines even though no element is actually shared
 Leads to false sharing

17
False sharing: Solutions
 Reorganize the code so that each processor accesses a set of rows
 Can still lead to overlapping of cache lines if the matrix size is not divisible by the number of processors
 In such cases, employ padding
 Padding: dummy elements added to make the matrix size divisible (see the sketch below)

18
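The slides use a Fortran matrix; the same effect is easy to see in a small C++ sketch (illustrative only, with the 64-byte line size assumed): two threads update adjacent counters that share one cache line, and the padded version places each counter on its own line.

```cpp
#include <cstdint>
#include <thread>

constexpr int kIters = 10'000'000;

// Falsely shared: both counters live in the same 64-byte cache line, so every
// increment by one thread invalidates the other thread's copy of the line.
struct SharedCounters {
    std::uint64_t a = 0;
    std::uint64_t b = 0;
};

// Padded: alignas(64) places each counter on its own cache line, so the two
// threads no longer ping-pong the line between their caches.
struct PaddedCounters {
    alignas(64) std::uint64_t a = 0;
    alignas(64) std::uint64_t b = 0;
};

template <class Counters>
void run() {
    Counters c;
    std::thread t1([&] { for (int i = 0; i < kIters; ++i) ++c.a; });
    std::thread t2([&] { for (int i = 0; i < kIters; ++i) ++c.b; });
    t1.join();
    t2.join();
}

int main() {
    run<SharedCounters>();   // typically much slower: false sharing
    run<PaddedCounters>();   // same work, separate cache lines
}
```

Both versions do exactly the same work; on most machines the padded one is noticeably faster because no cache line bounces between the two cores.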
INTERCONNECTION
NETWORKS

19
Interconnects
 Used in both shared memory and
distributed memory architectures
 In shared memory: Used to connect
processors to memory
 In distributed memory: Used to connect
different processors
 Components
 Interface (PCI or PCI-e): for connecting
processor to network link
 Network link connected to a communication
network (network of connections)

20
Communication network
 Consists of switching elements to which processors are connected through ports
 Switch: network of switching elements
 Switching elements connected with each other using a pattern of connections
 Pattern defines the network topology

 In shared memory systems, memory units are also connected to the communication network
21
Parallel Architecture: Interconnections
 Routing techniques: how the route taken by the message
from source to destination is decided
 Network topologies
 Static – point-to-point communication links among processing
nodes
 Dynamic – Communication links are formed dynamically by
switches

22
Network Topologies

 Static
 Bus
 Completely connected
 Star
 Linear array, Ring (1-D torus)
 Mesh
 k-d mesh: d dimensions with k nodes in each dimension (see the sketch after this list)
 Hypercubes – a (log p)-dimensional mesh with 2 nodes in each dimension
 Trees – our campus network
 Dynamic – Communication links are formed dynamically by
switches
 Crossbar
 Multistage
 For more details, and evaluation of topologies, refer to book by
Grama et al.
23
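For concreteness, a small sketch (with an assumed node numbering, not from the slides) of the neighbor computation in a k-ary, d-dimensional mesh with wraparound links, i.e. a torus; d = 1 reduces to a ring, and dropping the wraparound gives a plain mesh:

```cpp
#include <cstdio>
#include <vector>

// Neighbors of a node in a d-dimensional, k-ary torus (k nodes per dimension,
// wraparound links in every dimension).
std::vector<unsigned> torus_neighbors(unsigned id, unsigned k, unsigned d) {
    // decode the node id into d coordinates, each in [0, k)
    std::vector<unsigned> coord(d);
    for (unsigned i = 0, t = id; i < d; ++i, t /= k) coord[i] = t % k;

    std::vector<unsigned> nbrs;
    for (unsigned dim = 0; dim < d; ++dim) {
        for (int step : {-1, +1}) {
            std::vector<unsigned> c = coord;
            c[dim] = (c[dim] + k + step) % k;          // wraparound in this dimension
            unsigned nid = 0;
            for (unsigned i = d; i-- > 0; ) nid = nid * k + c[i];  // re-encode id
            nbrs.push_back(nid);
        }
    }
    return nbrs;
}

int main() {
    // node 0 of a 4 x 4 torus has neighbors 3, 1, 12 and 4
    for (unsigned n : torus_neighbors(0, 4, 2)) std::printf("%u ", n);
    std::printf("\n");
}
```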
Network Topologies
 Bus, ring – used in small-scale shared memory systems

 Crossbar switch – used in some small-scale shared memory machines, small or medium-scale distributed memory machines

24
Crossbar Switch
 Consists of 2D grid of switching elements
 Each switching element consists of 2 input
ports and 2 output ports
 An input port is dynamically connected to an output port through switching logic

25
Multistage network – Omega network
 To reduce switching complexity
 Omega network – consists of log P stages, each consisting of P/2 switching elements

 Contention
 In crossbar – nonblocking
 In Omega – contention can occur even for multiple communications between disjoint pairs of nodes (the network is blocking)
26
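A small sketch of destination-tag routing through such a network, assuming p = 2^n inputs, a perfect-shuffle connection between stages, and 2x2 switches that either go straight or cross (the function is illustrative):

```cpp
#include <cstdio>

// Trace a message through an Omega network with p = 2^n inputs: each of the
// n = log2(p) stages first applies a perfect shuffle (rotate the n-bit address
// left by one), then the 2x2 switch sets the low bit to the next
// most-significant bit of the destination (straight for 0, cross for 1).
unsigned omega_route(unsigned src, unsigned dst, unsigned n) {
    unsigned mask  = (1u << n) - 1;
    unsigned label = src & mask;
    for (unsigned stage = 0; stage < n; ++stage) {
        label = ((label << 1) | (label >> (n - 1))) & mask;   // perfect shuffle
        unsigned dbit = (dst >> (n - 1 - stage)) & 1u;        // destination tag bit
        label = (label & ~1u) | dbit;                         // straight or cross
        std::printf("after stage %u: at position %u\n", stage, label);
    }
    return label;  // equals dst after all log2(p) stages
}

int main() {
    omega_route(3, 6, 3);   // route 3 -> 6 in an 8-input Omega network
}
```

Because the path is completely determined by the destination bits, two messages between disjoint pairs can still need the same switch output at some stage, which is exactly the contention noted above.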
Mesh, Torus, Hypercubes, Fat-tree
 Commonly used network topologies in distributed memory architectures
 Hypercubes are d-dimensional networks with 2 nodes along each dimension (p = 2^d nodes)
27
Mesh, Torus, Hypercubes

[Figure: a 2-D mesh, a 2-D torus (mesh with wraparound links), and hypercubes (binary n-cubes) for n = 2 and n = 3]
28
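To illustrate how regular the hypercube is (node numbering assumed, not from the slides): a node’s neighbors differ from it in exactly one address bit, and the distance between two nodes is the Hamming distance of their labels, so the diameter is log2 p.

```cpp
#include <bitset>
#include <cstdio>
#include <vector>

// Neighbors of node `id` in a d-dimensional hypercube (p = 2^d nodes):
// flip each of the d address bits in turn.
std::vector<unsigned> hypercube_neighbors(unsigned id, unsigned d) {
    std::vector<unsigned> nbrs;
    for (unsigned k = 0; k < d; ++k) nbrs.push_back(id ^ (1u << k));
    return nbrs;
}

// Distance between two hypercube nodes = Hamming distance of their labels,
// so the diameter of a d-dimensional hypercube is d = log2(p).
unsigned hypercube_distance(unsigned a, unsigned b) {
    return static_cast<unsigned>(std::bitset<32>(a ^ b).count());
}

int main() {
    for (unsigned n : hypercube_neighbors(5, 3))   // node 101 of a 3-cube
        std::printf("%u ", n);                      // prints 4 7 1
    std::printf("\ndistance(0, 7) = %u\n", hypercube_distance(0, 7));  // 3
}
```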
Fat Tree Networks
 Binary tree
 Processors arranged at the leaves
 Other nodes correspond to switches
 Fundamental property: the number of links from a node to its children equals the number of links from the node to its parent
 Edges become fatter as we traverse up the tree

29
Fat Tree Networks
 Any pair of processors can communicate without contention: non-blocking network
 Constant Bisection Bandwidth (CBB) networks
 A two-level fat tree has a diameter of four

30
Evaluating Interconnection topologies
 Diameter – maximum distance between any two processing nodes
 Fully connected – 1
 Star – 2
 Ring – p/2
 Hypercube – log p

 Connectivity – multiplicity of paths between 2 nodes; minimum number of arcs to be removed from the network to break it into two disconnected networks
 Linear array – 1
 Ring – 2
 2-D mesh – 2
 2-D mesh with wraparound – 4
 d-dimensional hypercube – d
31
Evaluating Interconnection topologies
 Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
 Ring – 2
 P-node 2-D mesh – √P
 Tree – 1
 Star – 1
 Completely connected – P²/4
 Hypercube – P/2

32
Evaluating Interconnection topologies

 channel width – number of bits that can be simultaneously communicated over a link, i.e. number of physical wires between 2 nodes
 channel rate – performance of a single physical wire
 channel bandwidth – channel rate times channel width
 bisection bandwidth – maximum volume of communication between two halves of the network, i.e. bisection width times channel bandwidth

33
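A tiny worked example of these products, using assumed numbers (16 wires per link, 1 Gbit/s per wire, a 32-node hypercube) purely for illustration:

```cpp
#include <cstdio>

int main() {
    // Assumed link parameters (illustrative values, not from the slides).
    double channel_rate_gbps = 1.0;   // performance of a single physical wire
    int    channel_width     = 16;    // number of physical wires per link

    // channel bandwidth = channel rate x channel width
    double channel_bw_gbps = channel_rate_gbps * channel_width;      // 16 Gbit/s

    // 32-node hypercube: bisection width = P/2 = 16 links
    int bisection_width = 32 / 2;

    // bisection bandwidth = bisection width x channel bandwidth
    double bisection_bw_gbps = bisection_width * channel_bw_gbps;    // 256 Gbit/s

    std::printf("channel bandwidth:   %.0f Gbit/s\n", channel_bw_gbps);
    std::printf("bisection bandwidth: %.0f Gbit/s\n", bisection_bw_gbps);
}
```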
