Parallel Programming
Sathish S. Vadhiyar
Course Web Page:
http://www.serc.iisc.ernet.in/~vss/courses/PPP2007
Outline
Motivation for parallel programming
Challenges in parallel programming
Evaluating a parallel program/algorithm – speedup, efficiency, scalability analysis
Parallel Algorithm – Design, Types and Models
Parallel Architectures
Motivation for Parallel Programming
• Faster execution time by exploiting independence between regions of code
• Presents a level of modularity
• Resource constraints – e.g., large databases that exceed a single machine's resources
• Certain classes of algorithms lend themselves naturally to parallelism
• Aggregate bandwidth to memory/disk increases data throughput
• Clock rate improvement in the past decade – 40%
• Memory access time improvement in the past decade – 10%
• Grand challenge problems (more later)
Challenges / Problems in Parallel Algorithms
Building efficient algorithms means avoiding:
Communication delay
Idling
Synchronization overhead
Challenges
[Figure: execution timeline for processes P0 and P1 showing computation, communication, synchronization and idle time]
How do we evaluate a parallel program?
Execution time, Tp
Speedup, S
S(p, n) = T(1, n) / T(p, n)
Usually, S(p, n) < p
Sometimes S(p, n) > p (superlinear speedup)
Efficiency, E
E(p, n) = S(p, n)/p
Usually, E(p, n) < 1
Sometimes, greater than 1
Scalability – limitations in parallel computing, relation to n and p.
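To make the metrics concrete, here is a minimal sketch (not part of the original slides) that measures T(1, n) and T(p, n) for a simple loop with OpenMP and applies the definitions above; the loop body and problem size are arbitrary placeholders.

/* speedup.c – minimal sketch: measure T(1,n) and T(p,n), then apply
   S = T(1,n)/T(p,n) and E = S/p.
   Compile: gcc -O2 -fopenmp speedup.c -lm */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main(void) {
    const long n = 50000000;
    double *a = malloc(n * sizeof(double));
    double t, t1, tp;

    t = omp_get_wtime();               /* serial run: T(1, n)   */
    for (long i = 0; i < n; i++)
        a[i] = sin((double)i);
    t1 = omp_get_wtime() - t;

    t = omp_get_wtime();               /* parallel run: T(p, n) */
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = sin((double)i);
    tp = omp_get_wtime() - t;

    int p = omp_get_max_threads();
    double S = t1 / tp;                /* speedup    S(p, n) */
    double E = S / p;                  /* efficiency E(p, n) */
    printf("p = %d  T1 = %.3fs  Tp = %.3fs  S = %.2f  E = %.2f\n",
           p, t1, tp, S, E);
    free(a);
    return 0;
}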
Speedups and efficiency
[Figure: speedup S and efficiency E versus p – ideal curves (S = p) against practical curves that fall below the ideal as p grows]
Limitations on speedup – Amdahl's law
Amdahl's law states that the performance
improvement to be gained from using some
faster mode of execution is limited by the
fraction of the time the faster mode can be
used.
Overall speedup expressed in terms of the fractions of computation time with and without the enhancement, and the speedup of the enhanced portion.
Places a limit on the speedup due to parallelism:
Speedup = 1 / (fs + fp/P)
where fs is the serial fraction, fp = 1 - fs the parallel fraction, and P the number of processors.
Amdahl’s law Illustration
S = 1 / (s + (1-s)/p)
[Figure: efficiency (0 to 1) versus number of processors (0 to 15) for a fixed serial fraction s – efficiency decays as p grows]
Courtesy:
http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html
http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
Amdahl’s law analysis

f      P=1   P=4   P=8   P=16   P=32
1.00   1.0   4.00  8.00  16.00  32.00
0.99   1.0   3.88  7.48  13.91  24.43
0.98   1.0   3.77  7.02  12.31  19.75
0.96   1.0   3.57  6.25  10.00  14.29

• For the same parallel fraction f, the achieved speedup falls further behind the processor count as P grows.
• Thus Amdahl’s law is a bit depressing for parallel programming.
• In practice, the number of parallel portions of work has to be large enough to match a given number of processors.
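As a sanity check (a sketch, not from the slides), the table can be reproduced directly from the formula, taking f as the parallel fraction so that S = 1/((1-f) + f/P):

/* amdahl.c – sketch reproducing the table above */
#include <stdio.h>

int main(void) {
    double fracs[] = {1.00, 0.99, 0.98, 0.96};   /* parallel fraction f */
    int procs[] = {1, 4, 8, 16, 32};
    printf("f      P=1    P=4    P=8    P=16   P=32\n");
    for (int i = 0; i < 4; i++) {
        double f = fracs[i];
        printf("%.2f", f);
        for (int j = 0; j < 5; j++) {
            int P = procs[j];
            double S = 1.0 / ((1.0 - f) + f / P);   /* Amdahl speedup */
            printf("  %5.2f", S);
        }
        printf("\n");
    }
    return 0;
}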
Gustafson’s Law
Amdahl’s law – keep the total parallel work fixed
Gustafson’s law – keep the computation time on the parallel processors fixed, and change the fraction of parallel work to match that computation time
Serial component of the code is independent of problem size
Parallel component scales with problem size, which scales with the number of processors
Scaled Speedup: S = (Seq + Par(P)*P) / (Seq + Par(P))
where Seq is the serial time and Par(P) the parallel time per processor for the problem size matched to P processors.
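A small sketch of the scaled-speedup formula; the 5%/95% serial/parallel split is an illustrative assumption, not a figure from the lecture. Unlike Amdahl's fixed-size speedup, the scaled speedup keeps growing almost linearly with P:

/* gustafson.c – sketch of scaled speedup */
#include <stdio.h>

int main(void) {
    double seq = 0.05;   /* serial time, fixed regardless of problem size */
    double par = 0.95;   /* parallel time on each of the P processors     */
    for (int P = 1; P <= 1024; P *= 4) {
        double S = (seq + par * P) / (seq + par);   /* scaled speedup */
        printf("P = %4d  scaled speedup = %7.2f\n", P, S);
    }
    return 0;
}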
Metrics (Contd..)
[Figure omitted]
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Parallel Algorithm Types and Models
Master-Worker / parameter sweep / task farming
[Figure: master P0 farming tasks out to workers P1–P4]
Pipeline / systolic / wavefront
[Figure: data streaming through the chain P0 → P1 → P2 → P3 → P4]
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Parallel Algorithm Types and Models
Data parallel model
Processes perform identical tasks on different data
Task parallel model
Different processes perform different tasks on same
or different data – based on task dependency graph
Work pool model
Any task can be performed by any process; tasks are
added to a work pool dynamically (see the MPI sketch after this list)
Pipeline model
A stream of data passes through a chain of processes
– stream parallelism
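The following is a minimal MPI sketch of the master-worker / work pool idea above (an illustration, not code from the course): rank 0 keeps a pool of task indices and hands the next one to whichever worker returns a result first; the "work" here is just squaring an integer.

/* taskfarm.c – sketch of master-worker / work pool in MPI.
   Run e.g.: mpicc taskfarm.c -o taskfarm && mpirun -np 5 ./taskfarm */
#include <stdio.h>
#include <mpi.h>

#define NTASKS   20
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                         /* master */
        int next = 0, recvd = 0, result;
        MPI_Status st;
        /* prime every worker with one task from the pool */
        for (int w = 1; w < size && next < NTASKS; w++, next++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
        /* collect results; hand the now-idle worker the next task */
        while (recvd < next) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            recvd++;
            printf("result %d from worker %d\n", result, st.MPI_SOURCE);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            }
        }
        int stop = 0;                        /* tell workers to quit */
        for (int w = 1; w < size; w++)
            MPI_Send(&stop, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
    } else {                                 /* worker */
        int task;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            int result = task * task;        /* the "task": square it */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

Because any worker can execute any task, load imbalance is absorbed automatically – the defining property of the work pool model.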
Parallel Architectures
- Classification
- Cache coherence in shared memory platforms
- Interconnection networks
Classification of Architectures – Flynn’s classification
Single Instruction Single Data (SISD): serial computers
Single Instruction Multiple Data (SIMD) – vector processors and processor arrays
Examples: CM-2, Cray C90, Cray YMP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Flynn’s classification
Multiple Instruction Single Data (MISD): not popular
Multiple Instruction Multiple Data (MIMD) – most popular
Examples: IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory
Shared memory – 2 types: UMA and NUMA
NUMA examples: HP Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory
Distributed memory
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
More recently – multi-cores
Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
Programming Paradigms, Algorithm Types, Techniques
Shared memory model – threads, OpenMP
Message passing model – MPI
Data parallel model – HPF
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Cache Coherence in SMPs
• All processors read variable ‘x’ residing in cache line ‘a’
• Each processor updates ‘x’ at different points of time
[Figure: CPU0–CPU3, each with a private cache (cache0–cache3) holding a copy of line ‘a’, all backed by line ‘a’ in main memory]
Challenge: to maintain a consistent view of the data
Protocols:
• Write update
• Write invalidate
Cache Coherence Protocols and Implementations
Write update – propagate the updated cache line to the other processors on every write by a processor
Write invalidate – invalidate the other copies on a write; each processor gets the updated cache line whenever it next reads the stale data
Which is better??
Caches – False sharing
• Different processors update different parts of the same cache line
• Leads to ping-pong of cache lines between processors
• Situation better in update protocols than invalidate protocols. Why?
• Modify the algorithm to change the stride
[Figure: CPU0 updates A0, A2, A4, … while CPU1 updates A1, A3, A5, …; main memory holds the array as lines A0–A8 and A9–A15, so the two caches keep stealing the same lines from each other]
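A small OpenMP sketch of the effect (illustrative, not from the slides): each thread increments its own counter, but when the counters share a cache line the line ping-pongs between caches; padding each counter to a 64-byte line (an assumed line size) restores performance.

/* false_sharing.c – sketch: per-thread counters with and without padding.
   Compile: gcc -O2 -fopenmp false_sharing.c */
#include <stdio.h>
#include <omp.h>

#define ITERS 100000000L
#define MAXT  64

struct padded { volatile long v; char pad[64 - sizeof(long)]; };

int main(void) {
    static volatile long packed[MAXT];      /* adjacent counters: shared lines */
    static struct padded separate[MAXT];    /* one (assumed 64B) line each     */
    double t;

    t = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            packed[id]++;                   /* false sharing: neighbours' writes
                                               keep invalidating this line     */
    }
    printf("packed:  %.2f s\n", omp_get_wtime() - t);

    t = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            separate[id].v++;               /* no false sharing */
    }
    printf("padded:  %.2f s\n", omp_get_wtime() - t);
    return 0;
}

On most machines the packed version runs several times slower, even though no data is logically shared.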
Cache Coherence using Invalidate Protocols
3 states associated with data items
Shared – a variable shared by 2 caches
Invalid – another processor (say P0) has updated the data item
Dirty – state of the data item in P0
Implementations
Snoopy
for bus-based architectures
Memory operations are propagated over the bus and snooped by all caches
Directory-based
Instead of broadcasting memory operations to all processors, coherence operations are propagated only to the relevant processors
A central directory maintains the states of cache blocks and their associated processors
Implemented with presence bits
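The three states above suggest a simple per-line state machine. The sketch below is an illustration in the style of an MSI protocol, with details that vary across real implementations; it shows the transitions on local and snooped (remote) reads and writes.

/* msi.c – sketch of per-cache-line state transitions for an
   invalidate-based (MSI-style) protocol */
#include <stdio.h>

typedef enum { INVALID, SHARED, DIRTY } State;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } Event;

State next_state(State s, Event e) {
    switch (e) {
    case LOCAL_READ:   return (s == INVALID) ? SHARED : s; /* fetch line     */
    case LOCAL_WRITE:  return DIRTY;          /* invalidate the other copies */
    case REMOTE_READ:  return (s == DIRTY) ? SHARED : s;   /* write back     */
    case REMOTE_WRITE: return INVALID;        /* our copy is now stale       */
    }
    return s;
}

int main(void) {
    const char *name[] = {"INVALID", "SHARED", "DIRTY"};
    State s = INVALID;
    Event trace[] = {LOCAL_READ, REMOTE_WRITE, LOCAL_READ,
                     LOCAL_WRITE, REMOTE_READ};
    for (int i = 0; i < 5; i++) {
        s = next_state(s, trace[i]);
        printf("-> %s\n", name[s]);
    }
    return 0;
}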
Interconnection Networks
An interconnection network is defined by switches, links and interfaces
Switches – provide mapping between input and output ports, buffering, routing etc.
Interfaces – connect nodes with the network
Network topologies
Static – point-to-point communication links among processing nodes
Dynamic – communication links are formed dynamically by switches
Interconnection Networks
Static
Bus – SGI Challenge
Completely connected
Star
Linear array, Ring (1-D torus)
Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
k-d mesh: d dimensions with k nodes in each dimension
Hypercubes – a (log p)-dimensional mesh with 2 nodes in each dimension – e.g. many MIMD machines
Trees – our campus network
Dynamic – communication links are formed dynamically by switches
Crossbar – Cray X series – non-blocking network
Multistage – IBM SP2 – blocking network
Evaluating Interconnection Topologies
Diameter – maximum distance between any two processing nodes
Fully connected – 1
Star – 2
Ring – p/2
Hypercube – log p
Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks
Linear array – 1
Ring – 2
2-D mesh – 2
2-D mesh with wraparound – 4
d-dimensional hypercube – d
Evaluating Interconnection Topologies
Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
Ring – 2
p-node 2-D mesh – sqrt(p)
Tree – 1
Star – 1
Completely connected – p^2/4
Hypercube – p/2
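A tiny sketch (not from the slides) collecting the values listed above for a given p; p is assumed to be a power of 2, and a perfect square for the mesh:

/* metrics.c – sketch computing the listed diameter / bisection width
   values for p nodes.  Compile: gcc metrics.c -lm */
#include <stdio.h>
#include <math.h>

int main(void) {
    int p = 64;
    int d = (int)round(log2((double)p));        /* hypercube dimension, log p */
    printf("ring:                 diameter %2d, bisection width 2\n", p / 2);
    printf("hypercube:            diameter %2d, bisection width %d\n", d, p / 2);
    printf("2-D mesh:             bisection width %d\n", (int)round(sqrt(p)));
    printf("completely connected: bisection width %d\n", p * p / 4);
    return 0;
}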
Evaluating Interconnection Topologies
Channel width – number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between 2 nodes
Channel rate – performance of a single physical wire
Channel bandwidth – channel rate times channel width
Bisection bandwidth – maximum volume of communication between two halves of the network, i.e. bisection width times channel bandwidth
Cost – typically measured by the number of links in the network
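Putting the definitions together in a short sketch; the link parameters (32-bit-wide channels clocked at 1 GHz) and the hypercube topology are illustrative assumptions, not figures from the lecture:

/* bandwidth.c – sketch applying the definitions above */
#include <stdio.h>

int main(void) {
    int p = 64;                          /* p-node hypercube (assumed)      */
    double channel_width = 32;           /* bits per link (assumed)         */
    double channel_rate = 1e9;           /* cycles/s per wire (assumed)     */
    double channel_bw = channel_rate * channel_width;    /* bits/s per link */
    int bisection_width = p / 2;         /* hypercube: p/2 links            */
    double bisection_bw = bisection_width * channel_bw;  /* bits/s          */
    printf("channel bandwidth   = %.0f Gbit/s\n", channel_bw / 1e9);
    printf("bisection bandwidth = %.0f Gbit/s\n", bisection_bw / 1e9);
    return 0;
}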