
Parallel Architecture

Sathish Vadhiyar
Motivations of Parallel Computing
 Faster execution times
 From days or months to hours or seconds
 E.g., climate modelling, bioinformatics
 Large amounts of data dictate parallelism
 Parallelism is more natural for certain kinds of problems, e.g., climate modelling
 Due to computer architecture trends
 CPU speeds have saturated
 Slow memory bandwidths

Classification of Architectures – Flynn’s
classification
In terms of parallelism in instruction and data streams
 Single Instruction Single Data (SISD): Serial computers
 Single Instruction Multiple Data (SIMD)
- Vector processors and processor arrays
- Examples: CM-2, Cray C90, Cray YMP, Hitachi S-3600

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
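As an illustration (not part of the original slides), the loop below is the kind of computation the SIMD model targets: one instruction stream applied element-wise to many data items, which a vectorizing compiler can map onto vector instructions.

```cpp
#include <vector>

// SIMD-style data parallelism: the same operation is applied uniformly
// across all elements. On SIMD hardware (vector processors, processor
// arrays) the loop executes as a few vector instructions rather than
// n scalar ones.
void vector_add(std::vector<float>& a, const std::vector<float>& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] += b[i];   // single instruction, multiple data elements
}
```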

Classification of Architectures – Flynn’s
classification
 Multiple Instruction Single Data (MISD): not popular
 Multiple Instruction Multiple Data (MIMD)
- Most popular
- IBM SP and most other supercomputers, clusters, computational Grids, etc.

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification of Architectures – Based on
Memory
 Shared memory
 2 types – UMA and NUMA
 Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification 2:
Shared Memory vs Message Passing
 Shared memory machine: The n processors
share physical address space
 Communication can be done through this shared memory

[Figure: a shared memory machine – processors P connected through an interconnect to a common main memory; a message passing machine – (P, M) processor–memory pairs, each processor with its own memory, connected by an interconnect]

 The alternative is sometimes referred to as a message passing machine or a distributed memory machine
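As a minimal sketch (not from the slides; assumes an MPI installation), this is how communication looks on a message passing machine: with no shared address space, data moves only through explicit send/receive calls.

```cpp
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int x = 0;
    if (rank == 0) {
        x = 42;   // lives in rank 0's private memory
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // No shared memory: the only way to observe x is to receive a message
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```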
Shared Memory Machines
The shared memory could itself be
distributed among the processor nodes
 Each processor might have some portion of the
shared physical address space that is physically
close to it and therefore accessible in less time
 Terms: NUMA vs UMA architecture
 Non-Uniform Memory Access
 Uniform Memory Access

SHARED MEMORY AND
CACHES

Shared Memory Architecture: Caches
[Figure: P1 and P2 both cache X, and memory holds X = 0. P1 writes X = 1, updating its own cache (and memory), but P2's cached copy still holds X = 0; P2's subsequent read of X is a cache hit on the stale copy – wrong data!]
Cache Coherence Problem
 If each processor in a shared memory multiprocessor machine has a data cache
 Potential data consistency problem: the cache coherence problem
 Shared variable modified in a private cache
 Objective: processes shouldn’t read `stale’ data
 Solutions
 Hardware: cache coherence mechanisms

Cache Coherence Protocols
 Write update – propagate the updated cache line to the other processors on every write by a processor
 Write invalidate – a write invalidates the copies in other caches; each processor fetches the updated cache line when it next reads the data
 Which is better?

Invalidation Based Cache Coherence
[Figure: as before, P1 writes X = 1; this time an invalidate message removes the stale copy X = 0 from P2's cache, so P2's next read of X misses and fetches the updated value X = 1]
Cache Coherence using invalidate protocols

 3 states associated with data items
 Shared – a variable shared by 2 caches
 Invalid – another processor (say P0) has updated the data item, so the local copy is stale
 Dirty – the state of the modified data item in P0 (a state-machine sketch follows)
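A minimal sketch (my illustration, not part of the slides) of the three states as a per-cache-line state machine; the transitions follow the Shared/Invalid/Dirty description above.

```cpp
enum class LineState { Invalid, Shared, Dirty };

// One cache's copy of a line under an invalidate-based protocol.
// "Local" events come from this processor; "remote" events are writes
// by other processors, observed through the coherence mechanism.

LineState on_local_read(LineState s) {
    // Reading an Invalid line misses and fetches a valid copy -> Shared
    return (s == LineState::Invalid) ? LineState::Shared : s;
}

LineState on_local_write(LineState /*s*/) {
    // A write invalidates the other copies; ours becomes the only
    // valid, modified copy -> Dirty
    return LineState::Dirty;
}

LineState on_remote_write(LineState /*s*/) {
    // Another processor (say P0) updated the item: our copy is stale -> Invalid
    return LineState::Invalid;
}
```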

Implementations of cache coherence protocols

 Snoopy
 for bus-based architectures
 shared bus interconnect where all cache controllers monitor all bus activity
 There is only one operation through the bus at a time; cache controllers can be built to take corrective action and enforce coherence in caches
 Memory operations are propagated over the bus and snooped

Implementations of cache coherence protocols

 Directory-based
 Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
 A central directory maintains the states of cache blocks and their associated processors

Implementation of Directory Based
Protocols
 Using presence bits for the owner processors
 Two schemes:
 Full bit vector scheme – O(M×P) storage for P processors and M cache lines (see the sketch below)
 But full information is not necessary
 Modern day processors use a sparse or tagged directory scheme
 Limited cache lines and limited presence bits
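A sketch of a full bit vector directory entry (illustrative, not from the slides; P = 64 is an assumed processor count). One presence bit per processor per cache line is exactly where the O(M×P) storage cost comes from.

```cpp
#include <bitset>

constexpr int P = 64;   // number of processors (assumed)

enum class BlockState { Uncached, Shared, Dirty };

// One directory entry per memory block: P presence bits each,
// so M blocks cost O(M x P) bits in total.
struct DirectoryEntry {
    BlockState     state = BlockState::Uncached;
    std::bitset<P> presence;   // presence[i] set if processor i caches the block

    // On a write by processor w: coherence operations are sent only to
    // the processors whose presence bits are set, not broadcast to all.
    void on_write(int w) {
        presence.reset();            // invalidate all other sharers
        presence.set(w);
        state = BlockState::Dirty;   // w holds the only valid copy
    }
};
```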
False Sharing
 Cache coherence occurs at the granularity of cache lines – an entire cache line is invalidated
 Modern day cache lines are 64 bytes in size
 Consider a Fortran program dealing with a matrix
 Assume each thread or process accesses a row of the matrix
 Since Fortran stores matrices in column-major order, elements from different rows are adjacent in memory and fall in the same cache lines
 Leads to false sharing

False sharing: Solutions
 Reorganize the code so that each processor accesses a set of rows
 Can still lead to overlapping cache lines if the matrix size is not divisible by the number of processors
 In such cases, employ padding
 Padding: dummy elements added to make the matrix size divisible (a padding sketch follows)
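A minimal C++ sketch of the padding idea (mine, not from the slides; requires C++17 for over-aligned allocation): per-thread counters that would otherwise sit in one 64-byte cache line are aligned so each gets a line of its own, eliminating the false sharing.

```cpp
#include <thread>
#include <vector>

constexpr std::size_t CACHE_LINE = 64;   // typical modern cache line size

// Without the alignment, adjacent counters share one cache line, and
// every increment by one thread invalidates that line in the other
// threads' caches (false sharing). alignas pads each counter out to
// its own line.
struct alignas(CACHE_LINE) PaddedCounter {
    long value = 0;
};

void count_up(PaddedCounter& c, long n) {
    for (long i = 0; i < n; ++i)
        ++c.value;   // touches only this counter's own cache line
}

int main() {
    std::vector<PaddedCounter> counters(4);
    std::vector<std::thread> workers;
    for (auto& c : counters)
        workers.emplace_back(count_up, std::ref(c), 1000000L);
    for (auto& t : workers)
        t.join();
    return 0;
}
```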

INTERCONNECTION
NETWORKS

Interconnects
 Used in both shared memory and
distributed memory architectures
 In shared memory: Used to connect
processors to memory
 In distributed memory: Used to connect
different processors
 Components
 Interface (PCI or PCI-e): for connecting
processor to network link
 Network link connected to a communication
network (network of connections)

Communication network
 Consists of switching elements to which processors are connected through ports
 Switch: network of switching elements
 Switching elements are connected with each other using a pattern of connections
 The pattern defines the network topology
 In shared memory systems, memory units are also connected to the communication network
Parallel Architecture: Interconnections
 Routing techniques: how the route taken by the message
from source to destination is decided
 Network topologies
 Static – point-to-point communication links among processing
nodes
 Dynamic – Communication links are formed dynamically by
switches

Network Topologies

 Static
 Bus
 Completely connected
 Star
 Linear array, Ring (1-D torus)
 Mesh
 k-d mesh: d dimensions with k nodes in each dimension
 Hypercubes – a log p-dimensional mesh with 2 nodes in each dimension
 Trees – e.g., our campus network
 Dynamic – Communication links are formed dynamically by
switches
 Crossbar
 Multistage
 For more details, and evaluation of topologies, refer to book by
Grama et al.
Network Topologies
 Bus, ring – used in small-scale shared memory systems
 Crossbar switch – used in some small-scale shared memory machines, and in small or medium-scale distributed memory machines
Crossbar Switch
 Consists of 2D grid of switching elements
 Each switching element consists of 2 input
ports and 2 output ports
 An input port is dynamically connected to an output port through switching logic

Multistage network – Omega network
 To reduce switching complexity
 Omega network – consists of log P stages, each with P/2 switching elements (a routing sketch follows)

 Contention
 In crossbar – nonblocking
 In Omega – blocking; contention can occur even when multiple communications involve disjoint pairs
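A sketch (not from the slides) of destination-tag routing through an omega network on P = 2^k nodes: each of the k stages first applies the perfect-shuffle wiring (rotate the k-bit address left by one), then a 2x2 switching element forwards the message up or down according to the corresponding bit of the destination.

```cpp
#include <cstdio>

// Trace the path of a message from src to dst through the k = log2(P)
// stages of an omega network.
void omega_route(unsigned src, unsigned dst, unsigned k) {
    unsigned mask = (1u << k) - 1;
    unsigned pos = src;
    for (int bit = (int)k - 1; bit >= 0; --bit) {
        pos = ((pos << 1) | (pos >> (k - 1))) & mask;   // perfect shuffle
        pos = (pos & ~1u) | ((dst >> bit) & 1u);        // switch sets lowest bit
        std::printf("stage %d: at node %u\n", (int)k - bit, pos);
    }   // pos now equals dst
}

int main() {
    omega_route(2, 5, 3);   // P = 8 nodes, log P = 3 stages
    return 0;
}
```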
Mesh, Torus, Hypercubes, Fat-tree
 Commonly used network topologies in distributed memory architectures
 Hypercubes are networks with log p dimensions
27
Mesh, Torus, Hypercubes

[Figure: a 2-D mesh; a torus (mesh with wraparound links); hypercubes (binary n-cubes) for n = 2 and n = 3]
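As an illustrative sketch (not from the slides; uses C++20's std::popcount), the defining property of the hypercube: node labels are d-bit strings, neighbors differ in exactly one bit, and the distance between two nodes is the Hamming distance of their labels.

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdio>

// Neighbor of node u along dimension i: flip bit i of the label.
unsigned neighbor(unsigned u, unsigned dim) { return u ^ (1u << dim); }

// Shortest path length between u and v: number of differing label bits.
int distance(unsigned u, unsigned v) { return std::popcount(u ^ v); }

int main() {
    const unsigned d = 3;   // binary 3-cube: 2^3 = 8 nodes
    for (unsigned i = 0; i < d; ++i)
        std::printf("neighbor of node 0 along dim %u: %u\n", i, neighbor(0, i));
    std::printf("distance(0, 7) = %d (the diameter, log p)\n", distance(0, 7));
    return 0;
}
```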

Fat Tree Networks
 Binary tree
 Processors arranged in leaves
 Other nodes correspond to switches
 Fundamental property: the number of links from a node to its children equals the number of links from the node to its parent (see the sketch below)
 Edges become fatter as we traverse up the tree
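A small sketch (mine, with an assumed leaf count of p = 16) of what the fundamental property implies: going up a binary fat tree, the number of nodes per level halves while the links-to-parent per node doubles, so the total link capacity at every level stays constant at p.

```cpp
#include <cstdio>

int main() {
    const unsigned p = 16;   // number of leaf processors (assumed)
    unsigned nodes = p;      // leaves sit at the bottom level
    for (unsigned links_up = 1; nodes > 1; nodes /= 2, links_up *= 2) {
        // links from a node to its parent == links to its children,
        // so upward capacity per level is constant: nodes * links_up == p
        std::printf("%2u nodes x %2u links to parent = %u links\n",
                    nodes, links_up, nodes * links_up);
    }
    return 0;
}
```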

Fat Tree Networks
 Any pair of processors can communicate without contention: a non-blocking network
 Constant Bisection Bandwidth (CBB) networks
 A two-level fat tree has a diameter of four

Evaluating Interconnection topologies
 Diameter – maximum distance between any two processing nodes
 Fully connected – 1
 Star – 2
 Ring – p/2
 Hypercube – log p
 Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs to be removed from the network to break it into two disconnected networks
 Linear array – 1
 Ring – 2
 2-D mesh – 2
 2-D mesh with wraparound – 4
 d-dimensional hypercube – d

Evaluating Interconnection topologies
 bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
 Ring – 2
 P-node 2-D mesh – √P
 Tree – 1
 Star – 1
 Completely connected – P²/4
 Hypercube – P/2
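A sketch collecting the formulas from these two slides for a p-node ring, 2-D mesh, and hypercube (the mesh diameter 2(√p − 1) is standard but not listed above; p = 64 is assumed to be an even perfect square and a power of two).

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const int p    = 64;                           // node count (assumed)
    const int side = (int)std::sqrt((double)p);    // sqrt(p), mesh side length
    const int logp = (int)std::log2((double)p);

    std::printf("ring:      diameter %2d, bisection width %2d\n", p / 2, 2);
    std::printf("2-D mesh:  diameter %2d, bisection width %2d\n",
                2 * (side - 1), side);
    std::printf("hypercube: diameter %2d, bisection width %2d\n", logp, p / 2);
    return 0;
}
```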

Evaluating Interconnection topologies

 channel width – number of bits that can be simultaneously communicated over a link, i.e. the number of physical wires between 2 nodes
 channel rate – performance of a single physical wire
 channel bandwidth – channel rate times channel width
 bisection bandwidth – maximum volume of communication allowed between two halves of the network, i.e. bisection width times channel bandwidth (a worked example follows)
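A worked example with assumed numbers (not from the slides): with 16 wires per link and 1 Gbit/s per wire, the channel bandwidth is 16 Gbit/s; a 64-node hypercube has bisection width p/2 = 32 links, so its bisection bandwidth is 32 × 16 = 512 Gbit/s.

```cpp
#include <cstdio>

int main() {
    const double rate_gbps = 1.0;   // channel rate per physical wire (assumed)
    const int    width     = 16;    // channel width: wires per link (assumed)
    const int    bisection = 32;    // bisection width, e.g. hypercube with p = 64

    double channel_bw   = rate_gbps * width;      // channel bandwidth per link
    double bisection_bw = bisection * channel_bw; // across the network's two halves
    std::printf("channel bandwidth:   %.0f Gbit/s\n", channel_bw);
    std::printf("bisection bandwidth: %.0f Gbit/s\n", bisection_bw);
    return 0;
}
```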
